22.1 DATA GATHERING TECHNIQUES AND TOOLS
Questions you might ask to begin the troubleshooting process include:
Are there planned maintenance activities happening at this time?
What is the complete error message?
What versions of the OpenView software products are in use?
What versions of the operating system are running on the server and agent?
What server or agent hardware platform is involved?
Where did the error occur? On the server or the node?
When did the error initially occur?
Can the problem be repeated?
Have there been any recent changes to the system (such as new software)?
What is the status of the OpenView processes?
Did any error messages appear in the log files?
Are there any errors in the itochecker report?
Do the processes start and stop properly?
Do you have the current patches installed for the agent, server, and operating system?
Do you have a current system backup?
Did the failed process produce a core file?
The troubleshooting recommendations and resource information presented in this chapter are adopted from known best practices. Due to the dynamic nature of the environment, it is important to check for the most current OpenView problem resolution resources available online at http://support.openview.hp.com. Determine whether the issue you face today may have already been resolved with a patch or documented resolution process. This chapter presents general guidelines to isolate the issues into the correct categories and collect the necessary information to begin the troubleshooting process.
22.1.1 Check for Errors
Error messages from OpenView are reported to the user through several sources: the log files, the graphical user interface, and the shell. In the graphical user interface, error messages may appear in a pop-up window as the result of an illegal operation, or in the message browser within the message text. A few log files contain important information about the normal operation of the system and, when an operation within the OV environment does not complete successfully, the resulting error messages. For example, after installing an operating system or OV patch, check the installation log files for errors.
22.1.1.1 Review the Log Files
The log files that may contain important system operation and error messages are described here for reference purposes. Refer to the Administrator Guide for the platform-specific location of the log files:
opcerror: Server and agent run-time error log (on DCE agent)
system.txt: Server and agent run-time error log file (on HTTPS-based agent)
Operating system and subsystem error log files (check the files that are appropriate in your operating environment)
22.1.1.2 OVO Errors
If an OVO error message has been produced, check its meaning and possible resolutions with the opcerr command. The error code starts with the string OpC and contains a body and a tail (such as OpC30-204), as shown in the following example.
# tail /var/opt/OV/log/OpC/opcerror|grep ERROR
07/16/04 10:31:36 ERROR opctrapi (Trap Interceptor)(1907)
[opcevti.cpp:1460]:
Receiving SNMP PDU failed: Lost connection with pmd/ovEvent process
(application disconnected). Trying to reinitialize. (OpC30-204)
# /opt/OV/bin/OpC/utils/opcerr 30 204
MESSAGE OpC30-204:
Receiving SNMP PDU failed: .... Trying to reinitialize.
INSTRUCTION:
The VPO event interceptor could not get a SNMP PDU although it was
informed that there is one available. The SNMP API message <snmp-msg>
gives more information.
The event interceptor tries to reconnect to pmd.
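The lookup shown in the example can be scripted when a log contains many errors. The following is a minimal sketch, not part of the OVO product: the function names extract_opc_codes and explain_opc_codes are hypothetical, and the sketch assumes that each error code appears in parentheses, as (OpC30-204) does in the example. The opcerr path and the log file path are the ones used in the example.

```shell
#!/bin/sh
# Print "<body> <tail>" for each OpC error code found on stdin,
# for example "(OpC30-204)" becomes "30 204". Duplicates are collapsed.
# Assumption: codes appear in parentheses, at most one per line.
extract_opc_codes() {
    sed -n 's/.*(OpC\([0-9][0-9]*\)-\([0-9][0-9]*\)).*/\1 \2/p' | sort -u
}

# Look up every code read from stdin with opcerr
# (path taken from the example in this section).
explain_opc_codes() {
    extract_opc_codes | while read -r body tail; do
        /opt/OV/bin/OpC/utils/opcerr "$body" "$tail"
    done
}

# Possible usage on a management server:
#   tail -n 200 /var/opt/OV/log/OpC/opcerror | explain_opc_codes
```

Keeping the extraction separate from the lookup makes it possible to check the parsing on a saved copy of the log before running it on the server.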
The OVO error messages are organized into categories based upon the number in the body (such as OpC20-xxxx). The OVO error categories with examples are shown in Table 22-1.
Error Category | Error Number | Sample Description |
---|---|---|
Internal Messages | OpC10-0001 | Insufficient memory |
Public Routines | OpC20-0001 | Invalid queue descriptor |
Agent Processes | OpC30-0001 | Invalid request to assemble |
Manager Processes | OpC40-0001 | Can't open pipe [x1] |
Database Access | OpC50-0001 | Database inconsistency detected |
Internal Database Messages | OpC51-0022 | Retry |
Messages used by the commands (API) | OpC53-0150 | Usage: opchistupl <file> |
Configuration upload/download | OpC54-0002 | Unknown option |
Database Install/upgrade | OpC55-0016 | Already exists |
User Interface | OpC60-0005 | User name must be entered |
NT Installation | OpC130-0010 | Setup program started (preinit) |
Security | OpC140-0116 | Secret key for <x1> not found |
Refer to the online Help for the complete Error Messages Reference Guide.
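As a quick reference, the categories in Table 22-1 can be expressed as a small lookup. This is only a sketch built from the table rows; the function name opc_category is hypothetical, and any code whose body is not listed in the table is reported as unknown.

```shell
#!/bin/sh
# Map an OVO error code such as OpC30-204 to its category
# according to Table 22-1.
opc_category() {
    body=${1#OpC}      # drop the "OpC" prefix   -> "30-204"
    body=${body%%-*}   # drop the tail after "-" -> "30"
    case "$body" in
        10)  echo "Internal Messages" ;;
        20)  echo "Public Routines" ;;
        30)  echo "Agent Processes" ;;
        40)  echo "Manager Processes" ;;
        50)  echo "Database Access" ;;
        51)  echo "Internal Database Messages" ;;
        53)  echo "Messages used by the commands (API)" ;;
        54)  echo "Configuration upload/download" ;;
        55)  echo "Database Install/upgrade" ;;
        60)  echo "User Interface" ;;
        130) echo "NT Installation" ;;
        140) echo "Security" ;;
        *)   echo "Unknown category" ;;
    esac
}

# Possible usage:
#   opc_category OpC30-204
```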
22.1.1.3 Oracle Errors
Oracle database error messages have two parts, a prefix and a message number, in the form ORA-xxxxx. When you need more information about an error, use the utility program $ORACLE_HOME/bin/oerr, which prints the cause of the error and troubleshooting tips. The error message categories are shown in Table 22-2.
Message Numbers | Categories |
---|---|
00000-00099 | Oracle Server |
00200-00249 | Control files |
00250-00299 | Archiving and recovery |
00300-00379 | Redo log files |
00440-00485 | Background processes |
00700-00709 | Dictionary cache |
00900-00999 | Parsing of SQL statements |
01100-01250 | The database and its support files |
01400-01489 | SQL execution errors |
01500-01699 | DBA set of SQL commands |
02376-02399 | Resources |
04030-04039 | Memory and the shared pool |
04040-04069 | Stored procedures |
12100-12299 | SQL*Net |
12500-12699 | SQL*Net |
12700-12799 | Use of the multilingual options |
If there is a message in the message browser from the Oracle database, the message text will include the error message number. With this information, you can check the message with the oerr command as shown in the next example. Sometimes the error messages are very complex and could signal major trouble. If you are not sure what corrective action is required, report the error to your Database Administrator or support vendor.
# $ORACLE_HOME/bin/oerr ORA 01547
01547, 00000, "warning: RECOVER succeeded but OPEN RESETLOGS would get error below"
// *Cause: Media recovery with one of the incomplete recovery options ended
//         without error. However, if the ALTER DATABASE OPEN RESETLOGS
//         command were attempted now, it would fail with the specified error.
//         The most likely cause of this error is forgetting to restore one
//         or more datafiles from a sufficiently old backup before executing
//         the incomplete recovery.
// *Action: Rerun the incomplete media recovery using different datafile
//          backups, a different control file, or different stop criteria.
Some Oracle error messages are very generic and not fatal, and provide codes that can be interpreted only with the help of a DBA.
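When message text copied from the message browser contains several Oracle errors, the oerr lookups can be batched. The following is a sketch under stated assumptions: the function names extract_ora_codes and explain_ora_codes are hypothetical, the message text is assumed to be saved in a file or piped in, and each line is assumed to carry at most one ORA-xxxxx number.

```shell
#!/bin/sh
# Print the numeric part of each ORA-xxxxx code found on stdin.
# Assumption: at most one code per line. Duplicates are collapsed.
extract_ora_codes() {
    sed -n 's/.*ORA-\([0-9][0-9]*\).*/\1/p' | sort -u
}

# Run oerr for every code read from stdin, using the same
# invocation form as the example in this section.
explain_ora_codes() {
    extract_ora_codes | while read -r code; do
        "$ORACLE_HOME/bin/oerr" ORA "$code"
    done
}

# Possible usage (messages.txt is a hypothetical file of saved message text):
#   explain_ora_codes < messages.txt
```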
22.1.2 Check, Stop, or Start the OpenView Processes
The process check is one of the best places to start checking the run-time environment. Ensure that the correct processes are running, determine why they are not, or restart the processes. If the OVO processes will not stop, check the process table with the ps command. If necessary, remove the remaining processes from the process table.
22.1.2.1 Check Server and Agent Process Status
The processes that run during normal operation of the server appear in the opcsv output that follows. From the command line of the management server, use the following command to verify that the correct processes are running:
opcsv -status: Check the management server processes
The results of the opcsv command are shown here:
# opcsv
OVO Management Server status:
-----------------------------
Control Manager opcctlm (3847) is running
Action Manager opcactm (3856) is running
Message Manager opcmsgm (3857) is running
TT & Notify Mgr opcttnsm (3858) is running
Forward Manager opcforwm (3859) is running
Service Engine opcsvcm (3864) is running
Cert. Srv Adapter opccsad (3862) is running
BBC config adapter opcbbcdist (3863) is running
Display Manager opcdispm (3860) is running
Distrib. Manager opcdistm (3861) is running
Open Agent Management status:
-----------------------------
Request Sender ovoareqsdr (3843) is running
Request Handler ovoareqhdlr (3846) is running
Message Receiver (BBC) opcmsgrb (3848) is running
Message Receiver opcmsgrd (3849) is running
Ctrl-Core and Server Extensions status:
---------------------------------------
Control Daemon ovcd (1460) is running
BBC Communications Broker ovbbccb (1467) is running
Config and Deploy ovconfd (1457) is running
Certificate Server ovcs (1469) is running
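On a busy server it can be easier to list only the processes that are not running. The sketch below filters opcsv output under the assumption that every healthy process line ends with "(<pid>) is running", as in the listing above; the exact wording of a line for a stopped process is not shown in this chapter, so the filter simply prints any other process line. The function name opcsv_not_running is hypothetical.

```shell
#!/bin/sh
# Print the lines of opcsv output that do not report a running process.
# Blank lines, section headings ("... status:"), and dashed rules are skipped.
opcsv_not_running() {
    awk '
        /^[ \t]*$/                      { next }  # blank line
        /status:[ \t]*$/                { next }  # section heading
        /^[ \t]*-+[ \t]*$/              { next }  # dashed rule
        /\([0-9]+\) is running[ \t]*$/  { next }  # healthy process
        { print }                                 # anything else needs attention
    '
}

# Possible usage on the management server:
#   opcsv | opcsv_not_running
```

No output means that every listed process reported itself as running.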
From the command line of the server, verify that the correct processes are running on the managed node; if necessary, restart the processes:
opcragt -status -all: Check the status of all configured agents
opcragt -status <node_name>: Check the status of a specific node
Many of the processes running during normal operation of the agent (depending on the deployed policy) are as follows:
Embedded Performance Component (coda)
The output from the ovc command is shown here:
# ovc
ovcd Control Daemon CORE (4314) Running
ovbbccb HP OpenView BBC Communications Broker CORE (1467) Running
ovconfd HP OpenView Config and Deploy CORE (4315) Running
ovcs HP OpenView Certificate Server SERVER (4320) Running
opcmsga OVO Message Agent AGENT,EA (4321) Running
opcacta OVO Action Agent AGENT,EA (4322) Running
opcmsgi OVO Message Interceptor AGENT,EA (4323) Running
opcle OVO Logfile Encapsulator AGENT,EA (4325) Running
opcmona OVO Monitor Agent AGENT,EA (4327) Running
opctrapi OVO SNMP Trap Interceptor AGENT,EA (4329) Running
opceca OVO Event Correlation AGENT,EA Stopped
opcecaas ECS Annotate Server AGENT,EA Stopped
coda HP OpenView Performance Core (4331) Running
From the command line of the HTTPS-based managed node, verify that the correct processes are running; if necessary, restart the processes.
Note: The ovc command is available on HTTPS-based nodes only. Check the agent status of DCE-based nodes with the opcagt command.
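The ovc listing shown earlier in this section can be reduced to the components that need attention. This is a minimal sketch that assumes the state is always the last field on each line, as in that listing; the function name ovc_not_running is hypothetical.

```shell
#!/bin/sh
# Print "<component>: <state>" for every ovc line whose last field
# is not "Running". Blank lines are ignored.
ovc_not_running() {
    awk 'NF > 0 && $NF != "Running" { print $1 ": " $NF }'
}

# Possible usage on an HTTPS-based managed node:
#   ovc | ovc_not_running
```

Whether a stopped component is a problem depends on the deployed policies; in the listing above, for example, the event correlation components are stopped during normal operation.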
22.1.3 Utilize the Online Help
During troubleshooting, it is helpful to have all the necessary information and resources at your fingertips. The online resources provided within the OpenView platform make access to important information easy. Inside each graphical window there is a HELP button on the menu. As shown in Figure 22-1, you can obtain help about the typical administrator tasks, icons, and errors. There is also a search engine, a glossary of terms, and instructions on how to use the built-in help.
Figure 22-1. OpenView Operation online help.
22.1.4 The itochecker Report
OVO provides utilities out of the box to assist with troubleshooting. One of them, itochecker, performs an overall check of your OVO environment and generates a report on the state of the configuration of the management server environment, which you can use to help isolate a problem. Read the man page for usage details. This section shows how to create a full report and displays the results. The primary areas of interest are any categories that show errors.
22.1.4.1 Run the itochecker Report
Run the report:
# /opt/OV/contrib/OpC/itochecker -a
Extract an HTML report file from the compressed tar file:
# zcat /tmp/ITO_rpt/ITO_rpt.tar.Z | tar xvf - reportl
View the report in the browser with the command
/opt/netscape/netscape reportl
This listing is the main menu of the report; the hyperlinks guide you to additional details about the specific areas in each category. Use this report to get a quick indication of any signs of trouble within the OVO environment.
ITOCHECKER REPORT
Thu Feb 5 08:22:43 PST 2004
Management Server: nuema
System Environment Check
Name Resolution OK
System Info OK
Number Of Processes / System Load OK
DCE Status and Patchlevel OK
System file permissions OK
OS Patches OK
OVO Environment Check
OVO Version/Package & ECS Designer N/A
Server Processes OK
Kernel Parameters WARNING
OVO Patches OK
Installed OVO filesets OK
OVO Binaries: Version and Patches OK
OVO Libraries: Version and Patches OK
Disk space in DB and OVO Directories OK
Pending Data in Distribution Directory WARNING
OpCdecoded Config Files OK
Cluster Information N/A
core File Information OK
opcinfo and opcerror Files OK
Elements in Server Queues OK
Elements in Agent Queues OK
File permissions and ownership OK
Database Check
Database Info OK
Database Queries OK
OVO Database Check
Agent Entries in DB <-> Agent Entries in Filesystem OK
Diskspace in Oracle Directory OK
Nodes Check
Nodes Check OK
Nodes Check Statistics OK
Java GUI / Service Navigator
Java Version / Path OK
Content of dir /opt/OV/www/htdocs/ito_op OK
Content of dir /etc/opt/OV/share/conf/OpC/mgmt_sv/opcsvcm OK
Check number of configured services and loggings OK
Output of /opt/OV/contrib/OpC/stacktrace svc OK
Created by itochecker version A.08.00
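If the main menu of the report is saved as plain text in the layout shown above, the entries that need attention can be pulled out with a short filter. This is a sketch only: the function name report_problems is hypothetical, the status is assumed to be the last word on each line, and ERROR is an assumed status value (the sample listing shows only OK, WARNING, and N/A).

```shell
#!/bin/sh
# Print "<status>: <check name>" for each report line whose last word
# is WARNING or ERROR. ERROR is an assumed status value.
report_problems() {
    awk '$NF == "WARNING" || $NF == "ERROR" {
        status = $NF
        $NF = ""                # drop the status word from the line
        sub(/[ \t]+$/, "")      # trim the trailing blank left behind
        print status ": " $0
    }'
}

# Possible usage (report.txt is a hypothetical plain-text copy of the menu):
#   report_problems < report.txt
```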