Netdump, figuring out what caused that system crash
We have all been there before. Your server crashed and nothing indicates what happened. You check /var/log/messages and all you see is... well... nothing. With no sign of what happened or why, you are left to... wait until it happens again.
On Red Hat based systems, you have an answer: netdump (diskdump may work as well; more on that another time). Below we will explore the steps required to set up and test netdump.
Netdump configuration requires two computers. One acts as the netdump server and the other one acts as the netdump client.
By default crash dumps will be saved to /var/crash on the netdump server. Ensure you have adequate disk space to cope with at least 1-2 crashes.
The crash dump size is based on the memory size of the crashed system. If possible, create a separate mount point for /var/crash to avoid filling the /var filesystem.
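As a rough sizing check on the netdump server, the sketch below compares the free space under /var/crash with the RAM size of a client. The function name dumps_that_fit and the 512 MB client figure are illustrative assumptions, and it assumes each dump is roughly as large as the client's RAM.

```shell
#!/bin/sh
# Rough estimate: how many full-memory dumps fit in the available space.
# Both arguments are in kilobytes. Assumes one dump is about the size of RAM.
dumps_that_fit() {
    avail_kb=$1
    mem_kb=$2
    echo $(( avail_kb / mem_kb ))
}

# Free space (KB) on the filesystem holding the dump directory.
# Run this on the netdump server.
crash_dir=/var/crash
avail_kb=$(df -Pk "$crash_dir" 2>/dev/null | awk 'NR==2 {print $4}')

# RAM of the netdump client in KB; 524288 (512 MB) is an example value.
client_mem_kb=524288

if [ -n "$avail_kb" ]; then
    echo "Room for about $(dumps_that_fit "$avail_kb" "$client_mem_kb") dump(s) in $crash_dir"
fi
```

If the result is below 2, consider giving /var/crash more space before relying on netdump.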
Install the “netdump-server” package on the server and the “netdump” package on the client.
On the netdump server, as root, type:
# passwd netdump
and supply a password for netdump, just as you would for an ordinary user. Then enable and start the server:
# chkconfig netdump-server on
# service netdump-server start
On the netdump client, edit /etc/sysconfig/netdump.
Uncomment and set the NETDUMPADDR variable to the IP address of the netdump server (for example NETDUMPADDR=192.0.2.10, where the address is only a placeholder). Then run:
# service netdump propagate
and supply the netdump password that was configured on the netdump server. Then enable and start the client:
# chkconfig netdump on
# service netdump start
If you experience issues, such as errors communicating with the netdump-server, you may need to change some values in the /etc/sysconfig/netdump configuration file. One machine we worked with had bonded network interfaces, and netdump did not work properly with the bond device. We had to specify the DEV entry in the configuration file.
Update: To be more specific on the bond issue, bond support was not added to RHEL 4 until kernel 2.6.9-42 (Update 4).
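For illustration, the relevant part of /etc/sysconfig/netdump on such a client might look like the fragment below. The address and the interface name eth0 are placeholders, not values from our setup; substitute the interface that actually reaches your netdump server.

```shell
# /etc/sysconfig/netdump (fragment; all values are placeholders)
# IP address of the netdump server
NETDUMPADDR=192.0.2.10
# Send the dump through a specific interface instead of the bond device
DEV=eth0
```

Restart the netdump service on the client after changing the file so the new values take effect.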
To test if the netdump configuration is correct, perform the following on the netdump client (Warning: it will crash the machine!):
# sysctl -w kernel.sysrq=1
# echo c > /proc/sysrq-trigger
This will crash the system and you will see a kernel dump on the netdump server in the directory /var/crash//. You will see the file “vmcore-incomplete” while the client is dumping data to the server. The file is renamed to “vmcore” once it is completed.
The size of “vmcore” will vary and may reach several gigabytes. On a system with 512 MB of RAM, the above test created a vmcore of approximately 510 MB.
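To tell whether a dump has finished, you can look for the two file names mentioned above. The sketch below is one way to do that on the netdump server; the function name dump_status is made up for this example, and the loop assumes each dump sits in its own subdirectory of /var/crash.

```shell
#!/bin/sh
# Report the state of one dump directory based on the file names netdump uses.
# "incomplete" means the transfer is still running or was interrupted.
dump_status() {
    dir=$1
    if [ -f "$dir/vmcore" ]; then
        echo "complete"
    elif [ -f "$dir/vmcore-incomplete" ]; then
        echo "incomplete"
    else
        echo "no dump"
    fi
}

# Check every subdirectory of /var/crash (assumed layout).
for d in /var/crash/*/; do
    [ -d "$d" ] || continue
    echo "$d: $(dump_status "$d")"
done
```

A directory that stays "incomplete" long after the client has rebooted usually deserves a closer look at disk space and network connectivity.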
Please note, you should stop all services prior to performing a test, as the machine will crash hard.