Recently we observed a strange issue with one of our Tomcat servers in Production. The monitors configured on this server consistently showed degraded performance. The HTTP Ping monitor we configured reports a Success status when the server responds in less than 2 seconds, but over the last couple of days we were seeing alerts because the HTTP Ping monitor was taking more than 6 seconds to respond, causing the check to fail.
1) We started analyzing the performance of the server with the basic top command, which gave us the process with the highest CPU and memory usage:
localHost $ top -n1 | grep -m1 java
 5944 root      20   0 7057m 4.3g  7444 S 312.1 55.9  13871:55 java
This gave us the process that was consuming the high CPU, along with the amount of memory in use. In this case the CPU usage is 312.1%, which is more than three full CPU cores' worth of processing time.
2) The next step was to find the thread ID within this process that was causing the high CPU usage, which gave us:
localHost $ top -n1 -H | grep -m1 java | perl -pe 's/\e\[?.*?[\@-~] ?//g' | cut -f1 -d' '
5958
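For a simpler view, assuming your version of top supports thread mode, you can also point top directly at the Tomcat PID found in step 1 and let it list the individual threads, with the busiest thread appearing at the top:

localHost $ top -H -p 5944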
3) Next we took a thread dump of PID 5944 (the running Tomcat process ID) and converted the thread ID to the hexadecimal native ID (nid) used in the thread dump:
localHost $ printf "%x" 5958
1746
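As a sketch of how the dump itself can be captured and searched (the file path /tmp/threaddump_5944.txt is only an example), jstack can write the dump to a file and grep can then locate the offending thread by its native ID:

localHost $ jstack -l 5944 > /tmp/threaddump_5944.txt
localHost $ grep "nid=0x1746" /tmp/threaddump_5944.txt

Alternatively, kill -3 5944 makes the JVM print the thread dump to its standard output, which for Tomcat typically ends up in catalina.out.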
4) From the thread dump we observed that this thread belongs to Garbage Collection:
"GC
task thread#1 (ParallelGC)" prio=10 tid=0x00007f306c025800 nid=0x1746
runnable
And at other times it pointed us to the VM thread:
"VM
Thread" prio=10 tid=0x00007f306c134800 nid=0x1749 runnable
The VM thread is a special thread in the HotSpot VM. It performs special tasks such as garbage collection. If the VM_Operation suggests that the operation is a garbage collection, then it is possible that you have encountered an issue such as heap corruption.
The high CPU usage might be a garbage collector issue, but it can also be something else that leaves object references in the heap in an incorrect state.
The most important thing at this point is to collect as much information as possible about the environment and try out possible workarounds.
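One way to gather that information is to enable GC logging on the Tomcat JVM. The flags below apply to HotSpot on Java 7/8 (Java 9+ replaces them with the unified -Xlog:gc* option), and the log path is only an example; they would typically be added to Tomcat's setenv.sh:

JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/tomcat/gc.log"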
Now, upon checking the GC status, we observed that the memory was totally exhausted and the GC was performing a high number of Full GCs:
localHost $ jstat -gcutil 5944
  S0     S1     E      O      P      YGC     YGCT    FGC      FGCT        GCT
  0.00   0.00  70.12 100.00  99.66  57691  4817.991  30178  107006.562  111824.553
In the above output, the FGC column tells us the number of Full GCs caused during the lifetime of the JVM (here 30178), while the O and P columns show the Old and Permanent generations at 100.00% and 99.66% utilization.
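To see whether the Full GC count keeps climbing, jstat can also be run with a sampling interval and count; the example below prints a new line every 5 seconds, 10 times:

localHost $ jstat -gcutil 5944 5000 10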
Since the JVM is busy handling the Full GCs of this process, it pauses the application threads to perform the GC events, hence causing the slow responses seen by the EMS monitor.
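As a quick cross-check of the heap exhaustion (this works on JDK 7/8; newer JDKs moved the functionality to jhsdb), jmap can print the heap configuration and current usage of the running process:

localHost $ jmap -heap 5944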
More to come. Hope this helps!