
Tuesday, May 26, 2015

Analyzing Slow Response – Case 2 (High CPU Usage)

Recently we observed a strange issue with one of our Tomcat servers in Production. The monitors we configured on it consistently showed degraded performance. The HTTP Ping monitor reports a Success status when the server responds in less than 2 seconds, but for the last couple of days the server was taking more than 6 seconds to respond, causing the monitor check to fail and raise alerts.

1) We started analyzing the performance of the server with the basic top command, which shows the process with the highest CPU and memory usage:

localHost $ top -n1 | grep -m1 java
5944  root   20   0 7057m 4.3g 7444 S 312.1 55.9  13871:55 java

This gave us the process that was consuming the high CPU and the amount of memory it was using. In this case the CPU usage is 312.1% (more than 100% because the process is busy on several cores at once) and the resident memory is about 4.3g (55.9% of the machine).
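As a quick sanity check on just that PID, ps can report the same CPU and memory figures without the rest of the top output. This is only a sketch; the column list here is an illustrative choice:

localHost $ ps -p 5944 -o pid,pcpu,pmem,rss,etime,cmd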

2) The next step was to find the thread ID causing the high CPU usage inside this process, which gave us:

localHost $ top -n1 -H | grep -m1 java | perl -pe 's/\e\[?.*?[\@-~] ?//g' | cut -f1 -d' '
5958
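The one-liner above picks the first java thread that top reports (top sorts by CPU by default, so that is the busiest one), strips the terminal escape codes with perl, and cuts out the thread ID. A narrower sketch that limits top to just this process's threads also works; the 20-line cut-off is arbitrary:

localHost $ top -H -b -n1 -p 5944 | head -20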

3) Next, we took a thread dump of PID 5944 (the running Tomcat process ID) and converted the busy thread ID to hexadecimal, since the thread dump records it as the native ID (nid) in hex:

localHost $ printf "%x" 5958
1746
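We took the dump itself with jstack; sending the process a QUIT signal (kill -3 5944) would equally write the dump to catalina.out. The output path below is just an example, and the grep then locates the busy thread by its hex nid:

localHost $ jstack 5944 > /tmp/threaddump-5944.txt
localHost $ grep "nid=0x1746" /tmp/threaddump-5944.txt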

4) From the thread dump we observed that this thread belongs to garbage collection:
"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007f306c025800 nid=0x1746 runnable

And sometimes it pointed us to
"VM Thread" prio=10 tid=0x00007f306c134800 nid=0x1749 runnable

The VM thread is a special thread in the HotSpot VM. It performs special tasks in the VM such as garbage collection. If the VM operation in progress is a garbage collection, then it is possible that you have encountered an issue such as heap corruption.
The high CPU usage might be a garbage collector issue, but it can also be something else that leaves object references in the heap in an incorrect state.

The most important thing at this point is to collect as much information as possible about the environment and try out possible workarounds.

Upon checking the GC status we observed that the old and permanent generations are essentially exhausted and the JVM is performing a very high number of Full GCs:

localHost $ jstat -gcutil 5944
  S0     S1     E      O       P      YGC     YGCT    FGC        FGCT         GCT
  0.00   0.00  70.12  100.00  99.66  57691  4817.991  30178  107006.562  111824.553

In the above output, the O (Old generation) and P (Permanent generation) columns are at or near 100% utilization, and the FGC column tells us the number of Full GCs that have occurred during the lifetime of the JVM, with FGCT the total time spent in them (in seconds).
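To confirm that Full GCs are happening continuously rather than just being a large historical total, jstat can also sample at an interval. The sketch below prints a line every 5 seconds, 12 times; the interval and count are just illustrative values:

localHost $ jstat -gcutil 5944 5000 12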

Since the JVM is busy running back-to-back Full GCs in this process, it keeps pausing the application threads for these stop-the-world GC events, which is what causes the slow response seen by the EMS monitor.
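To dig further into why the heap keeps filling up, the usual next steps are to enable GC logging and take a heap histogram. The lines below are a sketch of the HotSpot 7/8 style options we would typically add to the Tomcat JAVA_OPTS (the log path is just an example), plus a jmap histogram to see which classes dominate the heap; note that -histo:live itself triggers a Full GC, so use it with care on an already struggling instance:

JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/tomcat/gc.log"

localHost $ jmap -histo:live 5944 | head -20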


More to come. Hope this helps :)
