22.1 DATA GATHERING TECHNIQUES AND TOOLS
Questions you might ask to begin the troubleshooting process include:
- Are there planned maintenance activities happening at this time?
- What is the complete error message?
- What versions of the OpenView software products are in use?
- What versions of the operating system are running on the server and agent?
- What server or agent hardware platform is involved?
- Where did the error occur? On the server or the node?
- When did the error initially occur?
- Can the problem be repeated?
- Have there been any recent changes to the system (such as new software)?
- What is the status of the OpenView processes?
- Did any error messages appear in the log files?
- Are there any errors in the itochecker report?
- Do the processes start and stop properly?
- Do you have the current patches installed for the agent, server, and operating system?
- Do you have a current system backup?
- Did the failed process produce a core file?
- Are unplanned maintenance activities happening now?
The troubleshooting recommendations and resource information presented in this chapter are adopted from known best practices. Due to the dynamic nature of the environment, it is important to check for the most current OpenView problem resolution resources available online at http://support.openview.hp.com. Determine whether the issue you face today may have already been resolved with a patch or documented resolution process. This chapter presents general guidelines to isolate the issues into the correct categories and collect the necessary information to begin the troubleshooting process.
22.1.1 Check for Errors
Error messages from OpenView are reported to the user through several sources: the log files, the graphical user interface, and the shell. In the graphical user interface, error messages may appear in a pop-up window as the result of an illegal operation, or in the message browser within the message text. Several log files record the normal operation of the system and, when an operation within the OV environment does not complete successfully, the resulting error messages. For example, after installing an operating system or OV patch, check the installation log files for any errors.
22.1.1.1 Review the Log Files
The log files that may contain important system operation and error messages are described here for reference purposes. Refer to the Administrator Guide for the platform-specific location of the log files:
- opcerror: Server and agent run time error log (on DCE agent)
- system.txt: Server and agent run time error log file (on HTTPS-based agent)
- install.log: Management server installation log file
- inst_err.log: Management server install error log file
- inst_sum.log: Summary of managed node installation
- opccfgupld.log: Configuration upload log file
- alert_<SID>.log: Oracle database alert log file
- Operating system and subsystem error log files (check the files that are appropriate in your operating environment)
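As a rough illustration of a first pass over these log files, the sketch below counts the lines containing the string ERROR in each file it is given. The function name count_errors is hypothetical, and the ERROR keyword is an assumption based on the opcerror entries shown in this chapter; other log files may mark errors differently.

```shell
# Hypothetical helper, not part of OVO.
# Usage: count_errors FILE...
# Prints "<file>: <count>" for readable files, where <count> is the
# number of lines containing the (assumed) keyword ERROR.
count_errors() {
    for f in "$@"; do
        if [ -r "$f" ]; then
            printf '%s: %s\n' "$f" "$(grep -c 'ERROR' "$f")"
        else
            printf '%s: not readable\n' "$f"
        fi
    done
}
```

For example, count_errors /var/opt/OV/log/OpC/opcerror prints the path followed by the number of matching lines. The path is the one used in the examples in this chapter and may differ by platform.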
22.1.1.2 OVO Errors
If an OVO error message has been produced, check its meaning and possible resolutions using the opcerr command. The message starts with the string OpC and contains a body and a tail, as shown in the example below.
The OVO error messages are organized into categories based upon the number in the body (such as OpC20-xxxx). The OVO error categories with examples are shown in Table 22-1.
# tail /var/opt/OV/log/OpC/opcerror|grep ERROR
07/16/04 10:31:36 ERROR opctrapi (Trap Interceptor)(1907)
[opcevti.cpp:1460]:
Receiving SNMP PDU failed: Lost connection with pmd/ovEvent process
(application disconnected). Trying to reinitialize. (OpC30-204)
# /opt/OV/bin/OpC/utils/opcerr 30 204
MESSAGE OpC30-204:
Receiving SNMP PDU failed: .... Trying to reinitialize.
INSTRUCTION:
The VPO event interceptor could not get a SNMP PDU although it was
informed that there is one available. The SNMP API message <snmp-msg>
gives more information.
The event interceptor tries to reconnect to pmd.
Error Category | Error Number | Sample Description |
---|---|---|
Internal Messages | OpC10-0001 | Insufficient memory |
Public Routines | OpC20-0001 | Invalid queue descriptor |
Agent Processes | OpC30-0001 | Invalid request to assemble |
Manager Processes | OpC40-0001 | Can't open pipe [x1] |
Database Access | OpC50-0001 | Database inconsistency detected |
Internal Database Messages | OpC51-0022 | Retry |
Messages used by the commands (API) | OpC53-0150 | Usage: opchistupl <file> |
Configuration upload/download | OpC54-0002 | Unknown option |
Database Install/upgrade | OpC55-0016 | Already exists |
User Interface | OpC60-0005 | User name must be entered |
NT Installation | OpC130-0010 | Setup program started (preinit) |
Security | OpC140-0116 | Secret key for <x1> not found |
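The lookup shown in the example can be scripted. The sketch below, with the hypothetical helper names opc_codes and opcerr_commands, extracts each parenthesized OpCnn-nnn code from log text on standard input and prints the corresponding opcerr command line rather than running it. It assumes the code appears in parentheses, as in the opcerror sample, and uses the opcerr path from that sample.

```shell
# Hypothetical helpers, not part of OVO.

# Print "<category> <number>" for each input line containing a code
# such as (OpC30-204). Only the last code on a line is reported.
opc_codes() {
    sed -n 's/.*(OpC\([0-9][0-9]*\)-\([0-9][0-9]*\)).*/\1 \2/p' | sort -u
}

# Turn those pairs into opcerr command lines (printed, not executed).
opcerr_commands() {
    opc_codes | while read -r category number; do
        echo "/opt/OV/bin/OpC/utils/opcerr $category $number"
    done
}
```

Piping the tail of the opcerror file into opcerr_commands gives one command line per distinct error code, which can be reviewed before it is run.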
22.1.1.3 Oracle Errors
Oracle database error messages typically have two parts, a prefix and a number, in the form ORA-xxxx. When you need more information about an error, use the utility program $ORACLE_HOME/bin/oerr, which prints the cause of the error and troubleshooting tips. The error message categories are shown in Table 22-2.
Message Numbers | Categories |
---|---|
00000-00099 | Oracle Server |
00200-00249 | Control files |
00250-00299 | Archiving and recovery |
00300-00379 | Redo log files |
00440-00485 | Background processes |
00700-00709 | Dictionary cache |
00900-00999 | Parsing of SQL statements |
01100-01250 | The database and its support files |
01400-01489 | SQL execution errors |
01500-01699 | DBA set of SQL commands |
02376-02399 | Resources |
04030-04039 | Memory and the shared pool |
04040-04069 | Stored procedures |
12100-12299 | SQL*Net |
12500-12699 | SQL*Net |
12700-12799 | Use of the multilingual options |
Some Oracle error messages are very generic, not fatal, and provide codes that can only be interpreted by contacting a DBA.
# $ORACLE_HOME/bin/oerr ORA 01547
01547, 00000, "warning: RECOVER succeeded but OPEN RESETLOGS would get error below"
// *Cause: Media recovery with one of the incomplete recovery options ended without
//         error. However, if the ALTER DATABASE OPEN RESETLOGS command were attempted now, it
//         would fail with the specified error. The most likely cause of this error is forgetting
//         to restore one or more datafiles from a sufficiently old backup before executing the
//         incomplete recovery.
// *Action: Rerun the incomplete media recovery using different datafile backups, a
//          different control file, or different stop criteria.
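The ranges in Table 22-2 can be encoded as a small lookup. The sketch below uses a hypothetical function, ora_category, that accepts an error number with or without the ORA- prefix and prints the matching category from the table. It covers only the ranges listed in the table, so numbers outside them are reported as not listed.

```shell
# Hypothetical helper, not part of Oracle or OVO.
# Usage: ora_category 01547   or   ora_category ORA-12154
ora_category() {
    awk -v code="${1#ORA-}" '
        BEGIN { n = code + 0 }            # "01547" is read as 1547
        n >= $1 && n <= $2 {              # columns: low high category
            sub(/^[0-9]+ +[0-9]+ +/, "")  # keep only the category text
            print
            found = 1
            exit
        }
        END { if (!found) print "not listed in Table 22-2" }
    ' <<'EOF'
0 99 Oracle Server
200 249 Control files
250 299 Archiving and recovery
300 379 Redo log files
440 485 Background processes
700 709 Dictionary cache
900 999 Parsing of SQL statements
1100 1250 The database and its support files
1400 1489 SQL execution errors
1500 1699 DBA set of SQL commands
2376 2399 Resources
4030 4039 Memory and the shared pool
4040 4069 Stored procedures
12100 12299 SQL*Net
12500 12699 SQL*Net
12700 12799 Use of the multilingual options
EOF
}
```

Keeping the ranges in a here-document makes the table easy to compare against the printed one and to extend if more ranges are needed.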
22.1.2 Check, Stop, or Start the OpenView Processes
The process check is one of the best places to start checking the run-time environment. Ensure that the correct processes are running; if they are not, determine why or restart them. If the OVO processes will not stop, check the process table with the ps command and, if necessary, terminate them with the kill command.
22.1.2.1 Check Server and Agent Process Status
The processes running during normal operations of the server are as follows:
- Control Manager (opcctlm)
- Action Manager (opcactm)
- Message Manager (opcmsgm)
- TT & Notify Manager (opcttnsm)
- Forward Manager (opcforwm)
- Service Engine (opcsvcm)
- Certification Server Adaptor (opccsad)
- BBC configuration adaptor (opcbbcdist)
- Display Manager (opcdispm)
- Distribution Manager (opcdistm)
- Request Sender (ovoareqsdr)
- Request Handler (ovoareqhdlr)
- Message Receiver (BBC) (opcmsgrb)
- Message Receiver (opcmsgrd)
- Control Daemon (ovcd)
- BBC Communications Broker (ovbbccb)
- Configuration and Deploy Component (ovconfd)
- Certificate Server (ovcs)
From the command line of the management server, use the following commands to verify that the correct processes are running:
- opcsv: Check the management server processes
- opcsv -status: Check the management server processes
- ovstatus -c: Check the OpenView platform (NNM)
- opcsv -stop: Stop the server processes
- opcsv -start: Start the server processes
The results of the opcsv command are shown here:
# opcsv
OVO Management Server status:
-----------------------------
Control Manager opcctlm (3847) is running
Action Manager opcactm (3856) is running
Message Manager opcmsgm (3857) is running
TT & Notify Mgr opcttnsm (3858) is running
Forward Manager opcforwm (3859) is running
Service Engine opcsvcm (3864) is running
Cert. Srv Adapter opccsad (3862) is running
BBC config adapter opcbbcdist (3863) is running
Display Manager opcdispm (3860) is running
Distrib. Manager opcdistm (3861) is running
Open Agent Management status:
-----------------------------
Request Sender ovoareqsdr (3843) is running
Request Handler ovoareqhdlr (3846) is running
Message Receiver (BBC) opcmsgrb (3848) is running
Message Receiver opcmsgrd (3849) is running
Ctrl-Core and Server Extensions status:
---------------------------------------
Control Daemon ovcd (1460) is running
BBC Communications Broker ovbbccb (1467) is running
Config and Deploy ovconfd (1457) is running
Certificate Server ovcs (1469) is running
From the command line of the server, verify that the correct processes are running on the managed node; if necessary, restart the processes:
- opcragt -status -all: Check the status of all configured agents
- opcragt -status <node_name>: Check the status of a specific node
- opcragt -stop <node_name>: Stop the agent processes
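The opcsv status listing shown above can also be checked mechanically. The sketch below defines a hypothetical function, missing_processes, that reads the listing on standard input and prints every process name given as an argument that has no line ending in "is running". It deliberately looks only for the "is running" wording seen in the sample, because the wording used for a stopped process is not shown in this chapter.

```shell
# Hypothetical helper, not part of OVO.
# Usage: opcsv | missing_processes opcctlm opcactm opcmsgm ...
# Prints each named process that has no "is running" line on stdin.
missing_processes() {
    status=$(cat)
    for name in "$@"; do
        printf '%s\n' "$status" |
            grep "[[:space:]]$name[[:space:]].*is running" >/dev/null ||
            echo "$name"
    done
}
```

Passing the process names from the management server list as arguments gives an empty result when everything expected is running, which makes the check easy to use from a scheduled script.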
Many of the processes running during normal operation of the agent (depending on the deployed policy) are as follows:
- Control Daemon (ovcd)
- HTTPS Communication Broker (ovbbccb)
- Configuration and Deploy Component (ovconfd)
- Certificate Server (ovcs)
- Message Agent (opcmsga)
- Action Agent (opcacta)
- Message Interceptor (opcmsgi)
- Logfile Encapsulator (opcle)
- Monitor Agent (opcmona)
- SNMP Trap Interceptor (opctrapi)
- Event Correlation Agent (opceca)
- ECS Annotate Server Agent (opcecaas)
- Embedded Performance Component (coda)
The output from the ovc command is shown here:
# ovc
ovcd Control Daemon CORE (4314) Running
ovbbccb HP OpenView BBC Communications Broker CORE (1467) Running
ovconfd HP OpenView Config and Deploy CORE (4315) Running
ovcs HP OpenView Certificate Server SERVER (4320) Running
opcmsga OVO Message Agent AGENT,EA (4321) Running
opcacta OVO Action Agent AGENT,EA (4322) Running
opcmsgi OVO Message Interceptor AGENT,EA (4323) Running
opcle OVO Logfile Encapsulator AGENT,EA (4325) Running
opcmona OVO Monitor Agent AGENT,EA (4327) Running
opctrapi OVO SNMP Trap Interceptor AGENT,EA (4329) Running
opceca OVO Event Correlation AGENT,EA Stopped
opcecaas ECS Annotate Server AGENT,EA Stopped
coda HP OpenView Performance Core (4331) Running
From the command line of the HTTPS-based managed node, verify that the correct processes are running; if necessary, restart the processes:
- ovc: Check the status of the agent
- ovc -status: Check the agent status
- ovc -start: Start all agent processes
- ovc -stop: Stop all agent processes except the control agent
- ovc -kill: Stop all agent processes, including the control agent
Note: The ovc command is available on HTTPS-based nodes only. Check the agent status of DCE-based nodes with the opcagt command.
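To pick out the components that need attention from the ovc listing, a short filter is enough. The sketch below defines a hypothetical function, ovc_not_running, that prints the name and state of every component whose state is anything other than Running. It assumes the layout of the sample listing, with the component name in the first field and the state in the last.

```shell
# Hypothetical helper, not part of the agent software.
# Usage: ovc | ovc_not_running
# Prints "<component>: <state>" for every component not in state Running.
ovc_not_running() {
    awk 'NF >= 2 && $NF != "Running" { print $1 ": " $NF }'
}
```

Applied to the sample listing, this reports opceca and opcecaas, which may be expected if no event correlation policy is deployed on the node.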
22.1.3 Utilize the Online Help
During troubleshooting, it is helpful to have all the necessary information and resources at your fingertips. The online resources provided within the OpenView platform make access to important information easy. Inside each graphical window there is a HELP button on the menu. As shown in Figure 22-1, you can obtain help about the typical administrator tasks, icons, and errors. There is also a search engine, a glossary of terms, and instructions on how to use the built-in help.
Figure 22-1. OpenView Operation online help.
22.1.4 The itochecker Report
OVO provides utilities out of the box to assist with troubleshooting. One of them, itochecker, performs an overall check of your OVO environment and generates a report on the state of the configuration in the management server environment, which you can use to help isolate a problem. Read the man page for usage details. This section shows how to create a full report and displays the results. The primary areas of interest are any categories that show errors.
22.1.4.1 Run the itochecker Report
- Run the report: # /opt/OV/contrib/OpC/itochecker -a
- Extract an HTML report file from the compressed tar file: # zcat /tmp/ITO_rpt/ITO_rpt.tar.Z | tar xvf - reportl
- View the report in the browser with the command: /opt/netscape/netscape reportl
Created by itochecker version A.08.00
ITOCHECKER REPORT
Thu Feb 5 08:22:43 PST 2004
Management Server: nuema
System Environment Check
Name Resolution OK
System Info OK
Number Of Processes / System Load OK
DCE Status and Patchlevel OK
System file permissions OK
OS Patches OK
OVO Environment Check
OVO Version/Package & ECS Designer N/A
Server Processes OK
Kernel Parameters WARNING
OVO Patches OK
Installed OVO filesets OK
OVO Binaries: Version and Patches OK
OVO Libraries: Version and Patches OK
Disk space in DB and OVO Directories OK
Pending Data in Distribution Directory WARNING
OpCdecoded Config Files OK
Cluster Information N/A
core File Information OK
opcinfo and opcerror Files OK
Elements in Server Queues OK
Elements in Agent Queues OK
File permissions and ownership OK
Database Check
Database Info OK
Database Queries OK
OVO Database Check
Agent Entries in DB <-> Agent Entries in Filesystem OK
Diskspace in Oracle Directory OK
Nodes Check
Nodes Check OK
Nodes Check Statistics OK
Java GUI / Service Navigator
Java Version / Path OK
Content of dir /opt/OV/www/htdocs/ito_op OK
Content of dir /etc/opt/OV/share/conf/OpC/mgmt_sv/opcsvcm OK
Check number of configured services and loggings OK
Output of /opt/OV/contrib/OpC/stacktrace svc OK
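Because the categories of interest are those that are not OK, a text rendering of the report can be filtered for them. The sketch below defines a hypothetical function, report_problems, that prints every line whose last field is WARNING or ERROR, with the status moved to the front. Only WARNING appears in the sample report; the ERROR label is an assumption, and the function expects plain text like the listing above rather than the HTML file itself.

```shell
# Hypothetical helper, not part of OVO.
# Usage: report_problems < report.txt   (plain-text copy of the report)
# Prints "<status>: <check name>" for lines ending in WARNING or ERROR.
# ERROR is an assumed label; only WARNING appears in the sample report.
report_problems() {
    awk '$NF == "WARNING" || $NF == "ERROR" {
        status = $NF
        $NF = ""            # drop the status, leaving the check name
        sub(/ +$/, "")      # remove the trailing space left behind
        print status ": " $0
    }'
}
```

For the sample report this lists the Kernel Parameters and Pending Data in Distribution Directory checks, which are the two entries marked WARNING.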