Java EE Support Patterns: Solaris

2.29.2012

HashMap.get High CPU – Case Study

This article will describe the complete root cause analysis and solution of a HashMap High CPU problem (infinite looping) affecting a Weblogic 10.0 environment running on the Java HotSpot VM 1.5.

This case study will again demonstrate the importance of mastering Thread Dump analysis skills and CPU correlation techniques such as Solaris prstat.

Environment specifications

-        Java EE server: Oracle Weblogic Portal 10.0
-        Middleware OS: Solaris 10
-        Java VM: Java HotSpot VM 1.5
-        Platform type: Portal application

Monitoring and troubleshooting tools

-        JVM Thread Dump (HotSpot format)
-        Solaris prstat (CPU contributors analysis)

Problem overview

-        Problem type: High CPU observed from our Weblogic production environment

A high CPU problem was observed on our Solaris physical servers hosting a Weblogic Portal 10 environment. Users were also reporting a major slowdown of the portal application.

Gathering and validation of facts

As usual, a Java EE problem investigation requires gathering technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified:

·        What is the client impact? HIGH
·        Recent change of the affected platform? No
·        Any recent traffic increase to the affected platform? Yes
·        How does this high CPU manifest itself? A sudden CPU increase was observed and did not go down, even after load dropped to near-zero levels.
·        Did an Oracle OSB recycle resolve the problem? Yes, but the problem returned after a few hours or a few days (unpredictable pattern)

-        Conclusion #1: The high CPU problem appears to be intermittent rather than purely correlated with load
-        Conclusion #2: Since the high CPU remains after load goes down, this typically indicates either infinite looping or heavy processing Threads

Solaris CPU analysis using prstat

Solaris prstat is a powerful OS command that allows you to obtain CPU utilization per process and, more importantly, per Thread within a process. As you can see below from our case study, the CPU utilization was confirmed to reach as high as 100% (saturation level).

## PRSTAT (CPU per Java Thread analysis)
prstat -L -p 8223 1 1

PID USERNAME  SIZE   RSS STATE  PRI NICE  TIME      CPU  PROCESS/LWPID
8223 bea10      2809M 2592M sleep   59    0  14:52:59        38.6% java/494
8223 bea10      2809M 2592M sleep   57    0   12:28:05       22.3% java/325
8223 bea10      2809M 2592M sleep   59    0   11:52:02       28.3% java/412
8223 bea10      2809M 2592M sleep   59    0   5:50:00         0.3% java/84
8223 bea10      2809M 2592M sleep   58    0   2:27:20         0.2% java/43
8223 bea10      2809M 2592M sleep   59    0   1:39:42         0.2% java/41287
8223 bea10      2809M 2592M sleep   59    0   4:41:44         0.2% java/30503
8223 bea10      2809M 2592M sleep   59    0   5:58:32         0.2% java/36116
……………………………………………………………………………………

As you can see from the above data, 3 Java Threads were found consuming, together, close to 100% of the CPU.

For our root cause analysis, we focused on Thread #494 (decimal format), which corresponds to 0x1ee (hexadecimal format).

Thread Dump analysis and PRSTAT correlation

Once the culprit Threads were identified, the next step was to correlate this data with the Thread Dump data (which was captured exactly at the same time as prstat).

A quick search within the generated Thread Dump file revealed the Thread Stack Trace (Weblogic Stuck Thread #125) for 0x1ee, as per below.

"[STUCK] ExecuteThread: '125' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=1 tid=0x014c5030 nid=0x1ee runnable [0x536fb000..0x536ffc70]
       at java.util.HashMap.get(HashMap.java:346)
       at org.apache.axis.encoding.TypeMappingImpl.getClassForQName(TypeMappingImpl.java:715)
       at org.apache.axis.encoding.TypeMappingDelegate.getClassForQName(TypeMappingDelegate.java:170)
       at org.apache.axis.encoding.TypeMappingDelegate.getClassForQName(TypeMappingDelegate.java:160)
       at org.apache.axis.encoding.TypeMappingImpl.getDeserializer(TypeMappingImpl.java:454)
       at org.apache.axis.encoding.TypeMappingDelegate.getDeserializer(TypeMappingDelegate.java:108)
       at org.apache.axis.encoding.TypeMappingDelegate.getDeserializer(TypeMappingDelegate.java:102)
       at org.apache.axis.encoding.DeserializationContext.getDeserializer(DeserializationContext.java:457)
       at org.apache.axis.encoding.DeserializationContext.getDeserializerForType(DeserializationContext.java:547)
       at org.apache.axis.encoding.ser.BeanDeserializer.getDeserializer(BeanDeserializer.java:514)
       at org.apache.axis.encoding.ser.BeanDeserializer.onStartChild(BeanDeserializer.java:286)
       at org.apache.axis.encoding.DeserializationContext.startElement(DeserializationContext.java:1035)
       at org.apache.axis.message.SAX2EventRecorder.replay(SAX2EventRecorder.java:165)
       at org.apache.axis.message.MessageElement.publishToHandler(MessageElement.java:1141)
       at org.apache.axis.message.RPCElement.deserialize(RPCElement.java:236)
       at org.apache.axis.message.RPCElement.getParams(RPCElement.java:384)
       at org.apache.axis.client.Call.invoke(Call.java:2467)
       at org.apache.axis.client.Call.invoke(Call.java:2366)
       at org.apache.axis.client.Call.invoke(Call.java:1812)
…………………………………………………………………………………………………………………………………………

Thread Dump analysis – HashMap.get() infinite loop condition!

As you can see from the above Thread Stack Trace, the Thread is currently stuck in an infinite loop over a java.util.HashMap that originates from the Apache Axis TypeMappingImpl Java class.

This finding was quite revealing. The 2 other Threads using high CPU also revealed an infinite looping condition within the same Apache Axis HashMap Object.

Root cause: non Thread safe HashMap in Apache Axis 1.4

Additional research revealed this known defect affecting Apache Axis 1.4, which is the version our application was using.

As you may already know, using a non Thread safe / non synchronized HashMap under concurrent Thread conditions is very dangerous and can easily lead to internal HashMap index corruption and/or infinite looping. This is also a golden rule for any middleware software such as Oracle Weblogic, IBM WAS and Red Hat JBoss, which rely heavily on HashMap data structures for various Java EE and caching services.

This best practice also applies to any Open Source third party API such as Apache Axis.
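For illustration only, the minimal sketch below (a hypothetical demo class, not the Apache Axis code) shows the classic hazard: several Threads writing to a plain java.util.HashMap with no synchronization. On older JVMs such as HotSpot 1.5, a race during an internal resize can corrupt the bucket chains and leave a later get() spinning forever; the outcome is timing-dependent, so any given run may or may not hang.

import java.util.HashMap;
import java.util.Map;

// Hypothetical demo (not from Apache Axis): concurrent, unsynchronized writes
// to a plain HashMap may corrupt its internal structure, which can later
// manifest as an infinite loop inside HashMap.get().
public class UnsafeHashMapDemo {

    private static final Map<Integer, Integer> SHARED = new HashMap<Integer, Integer>();

    public static void main(String[] args) throws InterruptedException {
        Runnable writer = new Runnable() {
            public void run() {
                for (int i = 0; i < 100000; i++) {
                    SHARED.put(i, i); // unsynchronized write: race on resize
                }
            }
        };

        Thread t1 = new Thread(writer);
        Thread t2 = new Thread(writer);
        t1.start();
        t2.start();
        t1.join();
        t2.join();

        // If the bucket chains were corrupted, this lookup can loop forever and
        // would show up in prstat as a Thread burning CPU inside HashMap.get().
        System.out.println(SHARED.get(99999));
    }
}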

The most common solution is to use the ConcurrentHashMap data structure, which is designed for that type of concurrent Thread execution context.
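As a hedged illustration (a simplified, hypothetical type-mapping cache, not the actual Apache Axis source), the sketch below shows the two usual remedies: wrapping the existing HashMap with Collections.synchronizedMap(), which is conceptually what the Apache patch does, or replacing it with a ConcurrentHashMap.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical type-mapping cache, for illustration only.
public class TypeMappingCache {

    // Option 1: synchronized wrapper around the existing HashMap
    // (every read and write acquires the same monitor).
    private final Map<String, Class<?>> synchronizedCache =
            Collections.synchronizedMap(new HashMap<String, Class<?>>());

    // Option 2: ConcurrentHashMap, designed for concurrent readers and writers
    // without a single global lock.
    private final Map<String, Class<?>> concurrentCache =
            new ConcurrentHashMap<String, Class<?>>();

    public void register(String qName, Class<?> javaType) {
        concurrentCache.put(qName, javaType);
    }

    public Class<?> lookup(String qName) {
        // Safe under concurrent access; no risk of looping on a corrupted bucket chain.
        return concurrentCache.get(qName);
    }
}

Option 2 is generally preferred for read-heavy caches such as this one, since it avoids serializing every lookup behind one lock.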

Solution

Our team applied the proposed patch from Apache (synchronizing the non Thread safe HashMap), which resolved the problem. We are also currently looking at upgrading our application to a newer version of Apache Axis.

Conclusion

I hope this case study has helped you understand how to pinpoint the root cause of high CPU Threads and the importance of proper Thread safe data structures for highly concurrent applications.

Please don’t hesitate to post any comment or question.


12.12.2011

PRSTAT Solaris – Pinpoint high CPU Java VM Threads fast

The Solaris OS prstat command is widely used but are you really using it to its full potential?

Are you struggling with a high CPU utilization of your middleware Java VM processes but unable to understand what is going on?

How about extracting the proof and facts about the root cause for your clients and / or application development team?

If you are a Java EE production middleware or application support individual, then this article is mainly for you; it will demonstrate why you should add this command to your list of OS command “friends” and learn this analysis approach.

This article will provide a step by step tutorial along with an example to achieve those goals.

For an AIX OS prstat command equivalent and strategy, please visit the other article below:

Step #1 – Generate a PRSTAT snapshot of your affected Java process

Execute the prstat command on the Java process after extracting the Java PID.
You can also run it for several iterations in order to identify a pattern.

prstat -L -p <PID> 1 1

Example: prstat -L -p 9116 1 1

PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/LWPID
  9116 bea      3109M 2592M sleep   59    0  21:52:59 8.6% java/76
  9116 bea      3109M 2592M sleep   57    0   4:28:05 0.3% java/40
  9116 bea      3109M 2592M sleep   59    0   6:52:02 0.3% java/10774
  9116 bea      3109M 2592M sleep   59    0   6:50:00 0.3% java/84
  9116 bea      3109M 2592M sleep   58    0   4:27:20 0.2% java/43
  9116 bea      3109M 2592M sleep   59    0   7:39:42 0.2% java/41287
  9116 bea      3109M 2592M sleep   59    0   3:41:44 0.2% java/30503
  9116 bea      3109M 2592M sleep   59    0   5:48:32 0.2% java/36116
……………………………………………………………………………………

-         PID corresponds to your Solaris Java process ID
-         CPU corresponds to the CPU utilization % of a particular Java Thread
-         PROCESS / LWPID corresponds to the Light Weight Process ID i.e. your native Java Thread Id

In our example, prstat identified Thread #76 as our top CPU contributor with 8.6% utilization.

** Please note that if you use regular thread libraries (i.e., you do not have /usr/lib/lwp in the LD_LIBRARY_PATH) on Solaris, an LWP is not mapped directly to an OS thread. In that situation, you will also have to take a pstack snapshot (in addition to prstat) of the process.

Step #2 – Generate a Thread Dump snapshot of your affected Java process

Great, now you have one or a list of high CPU Thread contributors from your Java process, but what should you do next?

PRSTAT data of your Java VM process is nothing without a Thread Dump data snapshot.

It is very important that you generate a Thread Dump via kill -3 <Java PID> at the same time or immediately after the prstat execution.

The Thread Dump will allow you to correlate the native Thread Id with the Java Thread Stack Trace and understand what type of high CPU consumption this Thread is currently involved in (infinite or heavy looping, GC Threads, high IO / disk access, etc.).

Example: kill -3 9116

The command will generate a JVM Thread Dump to your Java process or middleware standard output log, typically starting with Full thread dump Java HotSpot(TM)… (HotSpot VM format).

Step #3 – Thread Dump and PRSTAT data correlation

It is now time to correlate the generated prstat and Thread Dump data.

First, you have to convert the PRSTAT Thread Id from decimal format to hexadecimal format.

In our example, Thread ID #76 (decimal format) corresponds to 0x4c (hexadecimal format).

The next step is to simply search within your Thread Dump data for a Thread with a native Id corresponding to the PRSTAT Thread Id (hexadecimal format):

nid=<prstat Thread ID HEXA format>
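If you prefer not to do the conversion by hand, a minimal sketch such as the hypothetical helper below (the class name and default LWP id are assumptions for illustration) produces the exact nid= token to search for in the Thread Dump:

// Hypothetical helper, for illustration: converts a prstat LWP id (decimal)
// into the nid= search token used in a HotSpot Thread Dump.
public class NidConverter {
    public static void main(String[] args) {
        int lwpId = args.length > 0 ? Integer.parseInt(args[0]) : 76; // example LWP id from prstat
        System.out.println("nid=0x" + Integer.toHexString(lwpId));    // prints nid=0x4c for 76
    }
}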

In our example, such Thread was found as per below (complete case study is also available from this Blog):

"pool-2-thread-1" prio=10 tid=0x01952650 nid=0x4c in Object.wait() [0x537fe000..0x537ff8f0]
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:474)
        at weblogic.common.CompletionRequest.getResult(CompletionRequest.java:109)
        - locked <0xf06ca710> (a weblogic.common.CompletionRequest)
        at weblogic.diagnostics.archive.wlstore.PersistentStoreDataArchive.readRecord(PersistentStoreDataArchive.java:657)
        at weblogic.diagnostics.archive.wlstore.PersistentStoreDataArchive.readRecord(PersistentStoreDataArchive.java:630)
        at weblogic.diagnostics.archive.wlstore.PersistentRecordIterator.fill(PersistentRecordIterator.java:87)
        at weblogic.diagnostics.archive.RecordIterator.fetchMore(RecordIterator.java:157)
        at weblogic.diagnostics.archive.RecordIterator.hasNext(RecordIterator.java:130)
        at com.bea.wli.monitoring.alert.AlertLoggerImpl.getAlerts(AlertLoggerImpl.java:157)
        at com.bea.wli.monitoring.alert.AlertLoggerImpl.updateAlertEvalTime(AlertLoggerImpl.java:140)
        at com.bea.wli.monitoring.alert.AlertLog.updateAlertEvalTime(AlertLog.java:248)
        at com.bea.wli.monitoring.alert.AlertManager._evaluateSingleRule(AlertManager.java:992)
        at com.bea.wli.monitoring.alert.AlertManager.intervalCompleted(AlertManager.java:809)
        at com.bea.wli.monitoring.alert.AlertEvalTask.execute(AlertEvalTask.java:65)
        at com.bea.wli.monitoring.alert.AlertEvalTask.run(AlertEvalTask.java:37)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
        at java.util.concurrent.FutureTask.run(FutureTask.java:123)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
        at java.lang.Thread.run(Thread.java:595)

Step #4 – Analyze the identified Thread Stack trace for root cause

At this point you have 50% of the work done and should have one or a list of Threads contributing to your high CPU problem.

The final step is to analyze the Thread Stack Trace (from the bottom up) and determine what type of processing the culprit Thread(s) are involved in.

Find below common high CPU Thread scenarios:

-         Heavy or excessive garbage collection (Threads identified are the actual GC Threads of the HotSpot VM)
-         Heavy or infinite looping (application or middleware code problem, corrupted & looping non Thread safe HashMap etc.)
-         Excessive IO / disk activity (Excessive Class loading or JAR file searches)

Depending on how your organization and client are structured, you can also provide this data directly to your middleware vendor (Oracle support, IBM support, Red Hat, etc.) or to your development team for further analysis.

Final words

I hope this article has helped you understand how you can combine the powerful Solaris prstat command with Thread Dump analysis to pinpoint high CPU Java Thread problems.

Please feel free to post any comment or question if you need any help.