19 Jan 2008

Netdump, figuring out what caused that system crash 


We have all been there before. Your server crashed, and nothing indicates what happened. You check /var/log/messages and all you see is... well... nothing. With no sign of what happened or why, you are left to wait until it happens again.

On Red Hat based systems, you have an answer: netdump (diskdump may work as well; more on that another time). Below we will explore the steps required to set up and test netdump.

I. Installation

Netdump configuration requires two computers. One acts as the netdump server and the other one acts as the netdump client.

By default crash dumps will be saved to /var/crash on the netdump server. Ensure you have adequate disk space to cope with at least 1-2 crashes.

The crash dump size is based on the memory size of the crashed system. If possible, create a separate mount point for /var/crash to avoid filling the /var filesystem.
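
To make the sizing concrete, here is a minimal sketch of the arithmetic. The function name dumps_that_fit is hypothetical, and it assumes one dump is roughly as large as the client's RAM, which matches the test result later in this post but may not hold on every system.

```shell
#!/bin/sh
# Hypothetical helper: how many full dumps fit in the free space?
# Both arguments are in kilobytes. Assumption: one dump is about
# as large as the RAM of the crashed client.
dumps_that_fit() {
    free_kb=$1
    ram_kb=$2
    echo $(( free_kb / ram_kb ))
}

# Where the two numbers can come from:
#   free space (KB) under /var/crash, on the netdump server:
#     df -Pk /var/crash | awk 'NR==2 {print $4}'
#   installed RAM (KB), on the netdump client:
#     awk '/^MemTotal:/ {print $2}' /proc/meminfo
```

If the result is 0 or 1, the server probably needs more room before you rely on it.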

Install the “netdump-server” package on the server and the “netdump” package on the client.

II. Configuration

On the netdump server, as root, type:

# passwd netdump

and supply a password for the netdump user, just as you would for an ordinary user. Then do the following:

# chkconfig netdump-server on
# service netdump-server start

On the netdump client, edit /etc/sysconfig/netdump.

Uncomment and set the NETDUMPADDR variable to the IP address of the netdump server. For example:

NETDUMPADDR=x.x.x.x

Then execute:

# service netdump propagate

and supply the netdump password that was configured on the netdump server.

Finally, execute:

# chkconfig netdump on
# service netdump start

If you experience issues, such as errors communicating with the netdump server, you may need to change some values in /etc/sysconfig/netdump. One machine we worked with had bonded network interfaces, and netdump did not work properly with the bond device. We had to set the DEV entry in the configuration file.

 Update: To be more specific on the bond issue, bond support was not added to RHEL 4 until kernel 2.6.9-42 (Update 4).
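
As a rough sketch of the bond check, the function below tests whether a RHEL 4 kernel release string is at least 2.6.9-42. The function name rhel4_bond_supported and the interface name eth0 are hypothetical, and pointing DEV at one physical interface of the bond is an assumption about the workaround, not something verified on every setup.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if the given RHEL 4 kernel
# release is 2.6.9-42 or later, fails (exit 1) if it is older, and
# returns 2 if the string is not a 2.6.9-N release at all.
rhel4_bond_supported() {
    case $1 in
        2.6.9-*) ;;
        *) return 2 ;;
    esac
    build=${1#2.6.9-}     # "42.0.3.ELsmp" from "2.6.9-42.0.3.ELsmp"
    build=${build%%.*}    # "42"
    case $build in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "$build" -ge 42 ]
}

# Typical use on the netdump client:
#   rhel4_bond_supported "$(uname -r)" || echo "set DEV by hand"
#
# Assumed workaround in /etc/sysconfig/netdump on older kernels
# (eth0 is a placeholder for one physical interface):
#   DEV=eth0
```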

III. Testing

To test if the netdump configuration is correct, perform the following on the netdump client (Warning: it will crash the machine!):

# sysctl -w kernel.sysrq=1
# echo c > /proc/sysrq-trigger

This will crash the system, and a kernel dump will appear on the netdump server in the directory /var/crash//. While the client is dumping data to the server, the file is named “vmcore-incomplete”; it is renamed to “vmcore” once the dump completes.

The size of “vmcore” will vary and may reach several gigabytes. On a system with 512 MB of RAM, the above test created a vmcore of approximately 510 MB.

Please note: stop all services before performing a test, as the machine will crash hard.
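
If you would rather not watch the directory by hand, the rename described above can be checked with a few lines of shell. This is only a sketch: dump_status is a hypothetical name, and you pass in whichever crash directory the server created for the client.

```shell
#!/bin/sh
# Hypothetical helper: report the state of one crash directory
# on the netdump server, based on the file names netdump uses.
dump_status() {
    dir=$1
    if [ -f "$dir/vmcore" ]; then
        echo complete
    elif [ -f "$dir/vmcore-incomplete" ]; then
        echo in-progress
    else
        echo missing
    fi
}

# Example: poll every 10 seconds until the dump has finished.
#   while [ "$(dump_status "$crash_dir")" != "complete" ]; do sleep 10; done
```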
