/* Partykof: Troubleshooting problems in linux, based on a sample for DD-WRT web GUI not responding - Managing information and Technology */
In this blog, I summarize some of my work so far and the issues I face every day as an IT professional.
You are welcome to follow, comment and share with others. If you want to drop me a private note, send me an e-mail.


Friday, July 30, 2010

Troubleshooting problems in linux, based on a sample for DD-WRT web GUI not responding

In this post I present a sample troubleshooting procedure for a Linux box whose web interface suddenly stops responding after a few weeks of normal operation. I use basic tools that are usually built into any Linux box, along with external monitoring tools based on MRTG.

I use a Linksys WRT54GS wireless router running the DD-WRT v24-sp1 mega firmware. It is a small appliance based on the Broadcom BCM4712 chip, running a scaled-down Linux OS. Since I installed this version, I have noticed that once in a while I am unable to access the router's web interface. The simplest solution was to power cycle the router by unplugging it, but that meant getting to the router, which sits somewhere in the attic. I decided to try to figure out what was causing the problem.

First, I configured SSH access to the router, so I would be able to remotely connect to it, and reboot it in case I needed to. I also configured SNMP monitoring for it, to collect statistics of its performance.
Once the problem recurred, I was able to connect to the router and run a simple top command to see which processes were running and whether that could help me figure out the problem.

Figure 1: Console view of top output

Immediately I noticed that the router load was high, and that the process causing it was the web server daemon, httpd, which was consuming 98.2% of the CPU.
Wondering when the problem had started, I turned to the RRD graph and saw that it had been going on for more than 3 weeks, since the beginning of week 28.


Figure 2: Weekly view of router CPU load

In Figure 2, you can clearly see that the router load jumped above a value of 1, which means that the CPU was working at 100% and processes were queuing for it, which in turn means degraded performance.
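The "load above 1" rule for a single-CPU box can be checked from the shell as well as from a graph. The following is a minimal sketch, not something taken from the router itself: the function name load_exceeds and the sample line are made up for illustration, and it assumes the standard Linux /proc/loadavg layout, where the first field is the 1-minute load average.

```shell
#!/bin/sh
# load_exceeds LOADAVG_LINE THRESHOLD  (hypothetical helper name)
# Succeeds (exit status 0) when the 1-minute load average, which is the
# first field of a /proc/loadavg-style line, is greater than THRESHOLD.
load_exceeds() {
    echo "$1" | awk -v limit="$2" '{ exit !($1 > limit) }'
}

# Made-up sample line; on the router itself you could pass
# "$(cat /proc/loadavg)" instead.
sample="1.98 1.95 1.90 2/35 1234"
if load_exceeds "$sample" 1; then
    echo "load above 1: processes are waiting for the CPU"
else
    echo "load at or below 1"
fi
```

A threshold of 1 only makes sense here because the router has a single CPU; on a multi-core machine the threshold would normally be the number of cores.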
I tried to correlate the problem with a memory or traffic incident at the time it started. Figure 3 shows the memory utilization of the router and Figure 4 shows inbound and outbound traffic on the router WAN bridge.


Figure 3: Weekly view of router memory usage
 

Figure 4: Weekly view of traffic on WAN interface

Looking at the beginning of week 28 in both graphs, I found nothing that coincided with the start of the problem, and no sign that these parameters could have caused it.

Another factor that might play a part is the system's disk capacity, but on such a small router the whole file system is always reported as 100% full, so it gives no indication of a problem.

Having identified only the symptom and not the cause, I googled it, and guess what: it is a known issue. According to others in the DD-WRT community, the problem is caused by intensive use of P2P services, and there is currently no fix for it other than switching to the Mini firmware version.
Since I need the Mega firmware version for VPN and VoIP, I cannot afford to downgrade my router, so the best option is to live with it. To make life easier, I wrote a small script that I can run remotely to restart the web service, without even having to log in to the router interactively.
   #!/bin/sh
   # Restart the DD-WRT web server daemon (httpd)
   stopservice httpd
   startservice httpd
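One way to run these two commands from another machine without an interactive session is to pass them to ssh as a remote command. The sketch below is only an illustration of that idea, not necessarily how the script above is invoked: the function name restart_router_httpd, the optional second argument used for a dry run, and the address root@192.168.1.1 are all assumptions.

```shell
#!/bin/sh
# restart_router_httpd HOST [SSH_CMD]  (hypothetical helper name)
# Sends the two DD-WRT service commands to HOST in one non-interactive
# session. SSH_CMD defaults to ssh; pass "echo" to print what would be
# sent instead of connecting.
restart_router_httpd() {
    "${2:-ssh}" "$1" 'stopservice httpd; startservice httpd'
}

# Dry run against a placeholder address (not taken from the post).
restart_router_httpd root@192.168.1.1 echo
```

Dropping the second argument makes the function call ssh for real; with key-based authentication set up, it should then run without prompting for a password.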



You can view a nice reference for this procedure at this link.

In summary, although this is only a small Linux box, or a router, the basic procedure for identifying a problem or its symptoms is the same: look at the system in normal operation and compare any irregularity to that steady state. Using MRTG tools to collect reference statistics is very important and useful for troubleshooting and capacity planning.

-Partykof
