Performance problems are one of the biggest challenges to expect when designing and implementing Java EE related technologies. Some of these common problems can be faced when implementing either lightweight or large IT environments, which typically include several distributed systems, from Web portals and ordering applications to enterprise service bus (ESB), data warehouse, and legacy Mainframe storage systems.

It is very important for IT architects and Java EE developers to understand their client environments and ensure that the proposed solutions will not only meet growing business needs but also provide a scalable and reliable production IT environment over the long term, at the lowest possible cost. Performance problems can disrupt your client's business, which can result in short- and long-term loss of revenue.

This article consolidates and shares the top 10 causes of Java EE performance problems I have encountered working with IT & Telecom clients over the last 10 years, along with high-level recommendations.

Please note that this article is in-depth, but I'm confident that this substantial read will be worth your time.

I'm confident that many of you can identify episodes of performance problems following Java EE project deployments. Some of these performance problems could have a very specific and technical explanation but are often symptoms of gaps in the current capacity planning of the production environment.

Capacity planning can be defined as a comprehensive and evolving process of measuring and predicting current and future required IT environment capacity. A properly implemented capacity planning process will not only keep track of current IT production capacity and stability but also ensure that new projects can be deployed with minimal risk in the existing production environment. Such an exercise can also conclude that extra capacity (hardware, middleware, JVM, tuning, etc.) is required prior to project deployment.

In my experience, this is often the most common "process" problem that can lead to short- and long-term performance problems. The following are some examples.
| Problems observed | Possible capacity planning gaps |
|---|---|
| A newly deployed application overloads the current Java Heap or Native Heap space (e.g., java.lang.OutOfMemoryError is observed). | Lack of understanding of the current JVM Java Heap (YoungGen and OldGen spaces) utilization; no static and / or dynamic memory footprint calculation for the newly deployed application; lack of performance and load testing, preventing detection of problems such as a Java Heap memory leak |
| A newly deployed application triggers a significant increase in CPU utilization and performance degradation of the Java EE middleware JVM processes. | Lack of understanding of the current CPU utilization (e.g., established baseline); lack of understanding of the current JVM garbage collection health (a new application / extra load can trigger increased GC and CPU); lack of load and performance testing, failing to predict the impact on existing CPU utilization |
| A new Java EE middleware system is deployed to production but is unable to handle the anticipated volume. | Missing or inadequate performance and load testing; data and test cases used in performance and load testing not reflecting real-world traffic and business processes; not enough bandwidth (or pages are much bigger than capacity planning anticipated) |
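The first gap in the table, not knowing the current Java Heap utilization, can be narrowed with data the JVM already exposes. The following is a minimal sketch, assuming a standard JDK with the java.lang.management API; the class name HeapBaseline is hypothetical, and pool names vary by JVM vendor and GC policy.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Prints a snapshot of Java Heap and per-pool memory usage for the current JVM. */
class HeapBaseline {

    /** Returns used heap as a percentage of the maximum heap, or -1 if no maximum is defined. */
    static double heapUsedPercent() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the JVM reports no defined maximum
        return max <= 0 ? -1.0 : 100.0 * heap.getUsed() / max;
    }

    public static void main(String[] args) {
        System.out.printf("Heap used: %.1f%%%n", heapUsedPercent());
        // Pool names (eden, survivor, old gen, etc.) depend on the JVM vendor and GC policy.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // pool is no longer valid
            }
            System.out.printf("%-30s %-16s used=%d KB%n",
                    pool.getName(), pool.getType(), usage.getUsed() / 1024);
        }
    }
}
```

Sampling these values at regular intervals before and after a deployment gives a rough baseline to compare against. It does not replace a proper footprint calculation or load test.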
One key aspect of capacity planning is load and performance testing that everybody should be familiar with. This involves generating load against a production-like environment or the production environment itself in order to:
- Determine how many concurrent users / how much order volume your application(s) can support
- Expose your platform and Java EE application bottlenecks, allowing you to take corrective actions (middleware tuning, code change, infrastructure and capacity improvement, etc.)
There are several technologies available to achieve these goals. Some load-testing products allow you to generate load from inside your network from a test lab, while other emerging technologies allow you to generate load from the "Cloud".

I'm currently exploring the free version of Load Tester, a new load testing tool I found that allows you to record test cases and generate load from inside your network or from the Cloud.
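To make the idea of "generating load" concrete, here is a minimal sketch of what any load-testing tool does at its core: run the same request from many simulated users concurrently and collect response times. It is an illustration only, not a substitute for a real tool. The class name MiniLoadTest and the placeholder workload are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class MiniLoadTest {

    /** Runs task requestsPerUser times on each of users threads; returns sorted latencies in ms. */
    static List<Long> run(int users, int requestsPerUser, Runnable task) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(users);
        try {
            List<Callable<List<Long>>> jobs = new ArrayList<>();
            for (int u = 0; u < users; u++) {
                jobs.add(() -> {
                    List<Long> latencies = new ArrayList<>();
                    for (int i = 0; i < requestsPerUser; i++) {
                        long start = System.nanoTime();
                        task.run();
                        latencies.add((System.nanoTime() - start) / 1_000_000);
                    }
                    return latencies;
                });
            }
            List<Long> all = new ArrayList<>();
            for (Future<List<Long>> f : pool.invokeAll(jobs)) {
                all.addAll(f.get());
            }
            Collections.sort(all);
            return all;
        } finally {
            pool.shutdown();
        }
    }

    /** Nearest-rank percentile over a sorted, non-empty list. */
    static long percentile(List<Long> sorted, double pct) {
        int rank = (int) Math.ceil(pct / 100.0 * sorted.size());
        return sorted.get(Math.max(0, rank - 1));
    }

    public static void main(String[] args) throws Exception {
        // Placeholder workload: replace with a real request against a test environment.
        Runnable fakeRequest = () -> {
            try {
                Thread.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        List<Long> latencies = run(10, 20, fakeRequest);
        System.out.println("requests=" + latencies.size()
                + " p50=" + percentile(latencies, 50) + "ms"
                + " p95=" + percentile(latencies, 95) + "ms");
    }
}
```

Percentiles are reported rather than averages because a small number of very slow requests is often the first visible sign of a bottleneck.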
Regardless of the load and performance testing tool that you decide to use, this exercise should be done on a regular basis for any dynamic Java EE environment, as part of a comprehensive and adaptive capacity planning process. When done properly, capacity planning will help increase the service availability of your client's IT environment.

The second most common cause of performance problems I have observed for Java EE enterprise systems is an inadequate Java EE middleware environment and / or infrastructure. Not making proper decisions at the beginning of a new platform can result in major stability problems and increased costs for your client in the long term. For that reason, it is important to spend enough time brainstorming on the required Java EE middleware specifications. This exercise should be combined with an initial capacity planning iteration, since the business processes, expected traffic, and application(s) footprint will ultimately dictate the initial IT environment capacity requirements.

Below are typical examples of problems I have observed:
- Deployment of too many Java EE applications in a single 32-bit JVM
- Deployment of too many Java EE applications in a single middleware domain
- Lack of proper vertical scaling and under-utilized hardware (e.g., traffic driven by one or just a few JVM processes)
- Excessive vertical scaling and over-utilized hardware (e.g., too many JVM processes vs. available CPU cores and RAM)
- Lack of environment redundancy and fail-over capabilities
Trying to leverage a single middleware and / or JVM for many large Java EE applications can be quite attractive from a cost perspective. However, this can result in an operational nightmare and severe performance problems such as excessive JVM garbage collection and many domino effect scenarios (e.g., Stuck Threads) causing high business impact (e.g., App A causing App B, App C, and App D to go down because a full JVM restart is often required to resolve problems).

Recommendations
- The project team should spend enough time creating a proper operation model for the Java EE production environment.
- Attempt to find a good "balance" for your Java EE middleware specifications to give the business and operations teams proper flexibility in the event of outage scenarios.
- Avoid deployment of too many Java EE applications in a single 32-bit JVM. The middleware is designed to handle many applications, but your JVM may suffer the most.
- Choose a 64-bit over a 32-bit JVM when it is required but combine with proper capacity planning and performance testing to ensure your hardware will support it.
Now let's jump to pure technical problems, starting with excessive JVM garbage collection. Most of you are familiar with this famous (or infamous) Java error: java.lang.OutOfMemoryError. This is the result of JVM memory space depletion (Java Heap, Native Heap, etc.). I'm sure middleware vendors such as Oracle and IBM could provide you with dozens and dozens of support cases involving JVM OutOfMemoryError problems on a regular basis, so it is no surprise that it made the #3 spot in our list.

Keep in mind that a garbage collection problem will not necessarily manifest itself as an OOM condition. Excessive garbage collection can be defined as an excessive number of minor and / or major collections performed by the JVM GC Threads (collectors) in a short amount of time, leading to high JVM pause time and performance degradation. There are many possible causes:
- Java Heap size chosen is too small vs. JVM concurrent load and application(s) memory footprint.
- Inappropriate JVM GC policy used.
- Your application(s) static and / or dynamic memory footprint is too big to fit in a 32-bit JVM.
- The JVM OldGen space is leaking over time * quite common problem *; excessive GC (major collections) is observed after a few hours / days.
- The JVM PermGen space (HotSpot VM only) or Native Heap is leaking over time * quite common problem *; OOM errors are often observed over time following dynamic application redeployments.
- The ratio of YoungGen / OldGen space is not optimal for your application(s) (e.g., a bigger YoungGen space is required for applications generating a massive amount of short-lived objects, while a bigger OldGen space is required for applications creating a lot of long-lived / cached objects).
- The Java Heap size used for a 32-bit VM is too big, leaving little room for the Native Heap. Problems can manifest as OOM when trying to deploy a new Java EE application, creating new Java Threads, or performing any computing task that requires native memory allocations.
Before pointing a finger at the JVM, keep in mind that the actual "root" cause can be related to our #1 & #2 causes. An overloaded middleware environment will generate many symptoms, including excessive JVM garbage collection.

Proper analysis of your JVM related data (memory spaces, GC frequency, CPU correlation, etc.) will allow you to determine whether you are facing a problem. A deeper level of analysis to understand your application memory footprint will require you to analyze JVM Heap Dumps and / or profile your application using a profiler tool of your choice (such as JProfiler).

Recommendations
- Ensure that you monitor and understand your JVM garbage collection very closely. There are several commercial and free tools available to do so. At a minimum, you should enable verbose GC, which will provide all the data that you need for your health assessment.
- Keep in mind that GC related problems are unlikely to be caught during development or functional testing. Proper garbage collection tuning will require you to perform load and performance testing with a high volume of simultaneous users. This exercise will allow you to fine-tune your Java Heap memory footprint as per your application's behaviour and load level forecast.
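Verbose GC is enabled with a JVM startup flag (for example -verbose:gc, or -Xlog:gc* on JDK 9 and later HotSpot VMs). As a complement, the JVM also exposes cumulative GC counters at runtime. The following is a minimal sketch, assuming a standard JDK; the class name GcOverhead is hypothetical. Note that for concurrent collectors the reported collection time is not all pause time, so treat the percentage as a rough indicator only.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcOverhead {

    /** Total time (ms) reported by all collectors since JVM start. */
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime(); // -1 if the collector does not report it
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }

    /** Reported GC time as a percentage of JVM uptime (a rough health indicator). */
    static double gcOverheadPercent() {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptime <= 0 ? 0.0 : 100.0 * totalGcTimeMillis() / uptime;
    }

    public static void main(String[] args) {
        // Collector names depend on the JVM vendor and the GC policy in use.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%-25s collections=%d time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.printf("GC overhead since startup: %.2f%%%n", gcOverheadPercent());
    }
}
```

A value that keeps climbing under steady load is a reason to look at the verbose GC log in detail. The counters alone will not tell you which memory space is under pressure.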
The next common cause of bad Java EE performance mainly applies to highly distributed systems, which are typical of Telecom IT environments. In such environments, a middleware domain (e.g., Service Bus) will rarely do all the work but will rather "delegate" some of the business processes, such as product qualification, customer profile, and order management, to other Java EE middleware platforms or legacy systems such as Mainframe via various payload types and communication protocols. Such external system calls mean that the client Java EE application will trigger the creation or reuse of Socket Connections to write and read data to / from external systems across a private network. Some of these calls can be configured as synchronous or asynchronous depending on the implementation and the nature of the business process. It is important to note that the response time can change over time depending on the health of the external systems, so it is very important to shield your Java EE application and middleware via proper use of timeouts.
Major problems and performance slowdown can be observed in the following scenarios:
- Too many external system calls are performed in a synchronous and sequential manner. Such an implementation is also fully exposed to instability and slowdown of its external systems.
- Timeouts between Java EE client applications and external systems are missing or values are too high. This will cause client Threads to get Stuck, which can lead to a full domino effect.
- Timeouts are properly implemented but middleware is not fine-tuned to handle the "non-happy" path. Any increase of response time (or outage) of external system will lead to increased Thread utilization and Java Heap utilization (increased # of pending payload data). Middleware environment and JVM must be tuned in a way to predict and handle both "happy" and "non-happy" paths to prevent a full domino effect.
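One way to reduce exposure to the first two scenarios is to issue independent external calls in parallel under a single overall deadline, and to fall back to a default when a call does not answer in time. The following is a minimal sketch using only java.util.concurrent; the class name BoundedFanOut, the simulated calls, and the fallback value are hypothetical. Inside a Java EE container you would normally use a container-managed executor rather than creating a thread pool yourself; the plain pool here is an assumption made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class BoundedFanOut {

    /**
     * Invokes all calls in parallel and waits at most timeoutMillis overall.
     * Calls that fail or do not finish in time yield the fallback value.
     */
    static List<String> callAll(List<Callable<String>> calls, long timeoutMillis, String fallback)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, calls.size()));
        try {
            // invokeAll cancels every task that is still running when the timeout expires.
            List<Future<String>> futures =
                    pool.invokeAll(calls, timeoutMillis, TimeUnit.MILLISECONDS);
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                if (f.isCancelled()) {
                    results.add(fallback); // timed out: "non-happy" path
                } else {
                    try {
                        results.add(f.get());
                    } catch (ExecutionException e) {
                        results.add(fallback); // call failed
                    }
                }
            }
            return results;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<Callable<String>> calls = new ArrayList<>();
        calls.add(() -> "customer-profile-ok");      // simulated fast external system
        calls.add(() -> {                            // simulated slow external system
            Thread.sleep(5_000);
            return "late";
        });
        System.out.println(callAll(calls, 500, "unavailable"));
    }
}
```

The caller's Thread is released after the deadline regardless of how slow the external system is, which is the property that prevents the domino effect described above. Cancellation relies on the underlying call responding to interruption, so socket-level timeouts are still needed.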
Finally, I also recommend that you spend adequate time performing negative testing. This means that problem conditions should be "artificially" introduced to the external systems in order to test how your application and middleware environment handle failures of those external systems. This exercise should also be performed under a high-volume situation, allowing you to fine-tune the different timeout values between your applications and external systems.

The next common performance problem should not be a surprise for anybody: database issues. Most Java EE enterprise systems rely on relational databases for various business processes, from portal content management to order provisioning systems. A solid database environment and foundation will ensure that your IT environment will scale properly to support your client's growing business. In my production support experience, database-related performance problems are very common. Since most database transactions are typically executed via JDBC Datasources (including for relational persistence APIs such as Hibernate), performance problems will initially manifest as Stuck Threads from your Java EE container Thread manager. The following are common database-related problems I have seen over the last 10 years:

* Note that the Oracle database is used as an example since it is a common product used by my IT clients. *
- Isolated, long-running SQLs. This problem will manifest as stuck Threads and is usually a symptom of a lack of SQL tuning, missing indexes, a non-optimal execution plan, a returned dataset that is too large, etc.
- Table or row level data lock. This problem can manifest especially when dealing with a two-phase commit transactional model (e.g., the infamous Oracle In-Doubt Transactions). In this scenario, the Java EE container can leave some pending transactions waiting for final commit or rollback, leaving data locks that can trigger performance problems until such locks are removed. This can happen as a result of a trigger event such as a middleware outage or server crash.
- Sudden change of execution plan. I have seen this problem quite often, and it is usually the result of changes in data patterns, which can (for example) cause Oracle to update the query execution plan on the fly and trigger major performance degradation.
- Lack of proper management of the database facilities. For example, Oracle has several areas to look at such as REDO logs, database data files, etc. Problems such as lack of disk space and log file not rotating can trigger major performance problems and an outage situation.
Recommendations
- Proper capacity planning involving load and performance testing is critical here to fine-tune your database environment and detect any problems at the SQL level.
- If you are using Oracle databases, ensure that your DBA team reviews the AWR Report on a regular basis, especially in the context of an incident and root cause analysis process. The same analysis approach should also be applied for other database vendors.
- Take advantage of JVM Thread Dump and AWR Report to pinpoint the slow running SQLs and / or use a monitoring tool of your choice to do the same.
- Make sure to spend enough time to fortify the "Operation" side of your database environment (disk space, data files, REDO logs, table spaces, etc.) along with proper monitoring and alerting. Failure to do so can expose your client IT environment to major outage scenarios and many hours of downtime.
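A JVM Thread Dump is usually captured with tooling such as jstack, but the same information is available programmatically. The following is a minimal sketch, assuming a standard JDK, that lists Threads whose stack currently contains a frame from a given package. The class name ThreadDumpScan is hypothetical, and the "oracle.jdbc" prefix is an assumption about where the Oracle JDBC driver classes live; adjust it for your driver.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

class ThreadDumpScan {

    /** Returns one line per live thread whose stack contains a frame from classPrefix. */
    static List<String> threadsInside(String classPrefix) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> hits = new ArrayList<>();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null) {
                continue;
            }
            for (StackTraceElement frame : info.getStackTrace()) {
                if (frame.getClassName().startsWith(classPrefix)) {
                    hits.add(info.getThreadName() + " [" + info.getThreadState() + "] at " + frame);
                    break; // one line per thread is enough
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumed package prefix of the JDBC driver; change it to match your environment.
        for (String line : threadsInside("oracle.jdbc")) {
            System.out.println(line);
        }
    }
}
```

Threads that show up in the driver across several snapshots taken a few seconds apart are good candidates for the slow running SQLs mentioned above. The SQL text itself still has to come from the database side (e.g., the AWR Report).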
To recap, so far we have seen the importance of proper capacity planning, load and performance testing, middleware environment specifications, JVM health, external systems integration, and the relational database environment. But what about the Java EE application itself? After all, your IT environment could have the fastest hardware on the market with hundreds of CPU cores, a large amount of RAM, and dozens of 64-bit JVM processes, but performance can still be terrible if the application implementation is deficient. This section focuses on the most severe Java EE application problems I have been exposed to in various Java EE environments.

My primary recommendation is to ensure that code reviews are part of your regular development cycle, along with a release management process. This will allow you to pinpoint major implementation problems such as the ones below before major testing and implementation phases.

Thread safe code problems

Proper care is required when using Java synchronization and non-final static variables / objects. In a Java EE environment, any static variable or object must be Thread safe to ensure data integrity and predictable results. Wrongly declaring a class member variable as static can lead to unpredictable results under load, since static variables / objects are shared between Java EE container Threads (e.g., Thread B can modify the static variable value set by Thread A, causing unexpected and wrong behavior). A class member variable should be defined as non-static so that it stays in the context of the current class instance, and each Thread working on its own instance has its own copy.

Java synchronization is also quite important when dealing with non-Thread safe data structures such as java.util.HashMap. Failure to synchronize access can trigger HashMap corruption and infinite looping. Be careful, though, since excessive synchronization can also lead to stuck Threads and poor performance.
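As an illustration of the points above, the following minimal sketch keeps shared static state Thread safe without explicit synchronized blocks by using java.util.concurrent types. The class name HitCounter and the page name are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class HitCounter {

    // Static state is shared by every container Thread, so it must be Thread safe.
    // A plain java.util.HashMap and a plain long would lose updates or get corrupted here.
    private static final ConcurrentHashMap<String, Integer> HITS_BY_PAGE = new ConcurrentHashMap<>();
    private static final AtomicLong TOTAL_HITS = new AtomicLong();

    static void record(String page) {
        HITS_BY_PAGE.merge(page, 1, Integer::sum); // atomic per-key read-modify-write
        TOTAL_HITS.incrementAndGet();              // atomic increment
    }

    static int hitsFor(String page) {
        return HITS_BY_PAGE.getOrDefault(page, 0);
    }

    static long totalHits() {
        return TOTAL_HITS.get();
    }

    public static void main(String[] args) throws InterruptedException {
        int threads = 8;
        int perThread = 10_000;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < perThread; i++) {
                    record("/home");
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        // Both counters should equal threads * perThread once all tasks have finished.
        System.out.println(hitsFor("/home") + " / " + totalHits());
    }
}
```

Swapping the map for a java.util.HashMap and the counter for a plain long in the same test is a quick way to observe lost updates under load.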
Lack of communication API timeouts

It is very important to implement and test transaction timeouts (Socket read() and write() operations) and connection timeouts (Socket connect() operation) for every communication API. Lack of proper HTTP / HTTPS / TCP/IP timeouts between the Java EE application and external system(s) can lead to severe performance degradation and outages due to stuck Threads. Proper timeout implementation will prevent Threads from waiting too long in the event of a major slowdown of your downstream systems.

Below are some examples for some older and current APIs (Apache & Weblogic):
| Communication API | Vendor | Protocol | Timeout code snippet |
|---|---|---|---|
| commons-httpclient 3.0.1 | Apache | HTTP/HTTPS | HttpConnectionManagerParams.setSoTimeout(txTimeout); // Transaction timeout |
| | | | HttpConnectionManagerParams.setConnectionTimeout(connTimeout); // Connection timeout |
| axis.jar (v1.4 1855) | Apache | WS via HTTP/HTTPS | *** Please note that version 1.x of AXIS is exposed to a known problem with SSL Socket creation which ignores the specified timeout value. Solution is to override the client-config.wsdd and setup the HTTPS transport to <transport name="https" pivot="java:org.apache.axis.transport.http.CommonsHTTPSender"/> *** |
| | | | ((org.apache.axis.client.Stub) port).setTimeout(timeoutMilliseconds); // Transaction & connection timeout |
| WLS103 (old JAX-RPC) | Oracle | WS via HTTP/HTTPS | ((Stub)servicePort)._setProperty("weblogic.webservice.rpc.timeoutsecs", timeoutSecs); // Transaction & connection timeout |
| WLS103 (JAX-RPC 1.1) | Oracle | WS via HTTP/HTTPS | ((Stub)servicePort)._setProperty("weblogic.wsee.transport.read.timeout", timeoutMills); // Transaction timeout |
| | | | ((Stub)servicePort)._setProperty("weblogic.wsee.transport.connection.timeout", timeoutMills); // Connection timeout |
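The same two timeouts exist at the plain JDK level, which is what the libraries above ultimately configure. The following is a minimal sketch using java.net.Socket only; the class name SocketTimeouts and the return-code convention are hypothetical choices made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

class SocketTimeouts {

    /**
     * Connects and reads one byte, bounding both steps.
     * Returns the byte read (0-255), -1 on end of stream, or -2 if the read timed out.
     * A connect timeout surfaces as an IOException (SocketTimeoutException).
     */
    static int readFirstByte(String host, int port, int connectTimeoutMs, int readTimeoutMs)
            throws IOException {
        try (Socket socket = new Socket()) {
            // Connection timeout: bounds the Socket connect() operation.
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            // Transaction timeout: bounds each blocking Socket read() operation.
            socket.setSoTimeout(readTimeoutMs);
            InputStream in = socket.getInputStream();
            try {
                return in.read();
            } catch (SocketTimeoutException e) {
                return -2; // remote side accepted the connection but did not answer in time
            }
        }
    }
}
```

For example, readFirstByte("127.0.0.1", 8080, 1000, 2000) would wait at most one second to connect and two seconds for the first byte; the host and port here are placeholders. Without setSoTimeout, the read would block indefinitely against a system that accepts connections but never responds, which is exactly how Threads get stuck.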
- Very poor performance was observed from the Weblogic portal application.
- Data caching was implemented to improve performance with initial positive impact.
- The more products were added to the product catalogue, the larger the data caching and Java Heap memory requirements became.
- Eventually, the IT team had to upgrade to 64-bit JVM with 8 GB per JVM process along with more CPU cores.
- Eventually, the situation was not sustainable and design had to be reviewed.
- The final solution ended up using a distributed data cache system, outside the Java EE middleware and JVM via separate hardware.
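The case above ended with a distributed cache outside the JVM. For caches that do stay inside the JVM, one common safeguard against unbounded growth is to cap the number of entries. The following minimal sketch is not what the team in this case did; it only illustrates bounding an in-memory cache with java.util.LinkedHashMap. The class name BoundedCache is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** A small LRU cache: once maxEntries is exceeded, the least recently used entry is evicted. */
class BoundedCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries;

    BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder=true keeps entries in least-recently-used order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by LinkedHashMap after each insertion; returning true evicts the eldest entry.
        return size() > maxEntries;
    }
}
```

LinkedHashMap is not Thread safe, so in a Java EE container this would need to be wrapped (e.g., with Collections.synchronizedMap) or replaced by a concurrent cache implementation. The cap keeps the Java Heap footprint predictable at the cost of more cache misses as the catalogue grows.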
Lack of monitoring does not actually "cause" performance problems, but it can prevent you from understanding the capacity and health of your Java EE platform. Eventually, the environment can reach a breaking point, which may expose several gaps and problems (JVM memory leak, etc.). In my experience, it is much harder to stabilize an environment after months or years of operation than to have proper monitoring, tools, and processes implemented from day one.

That being said, it is never too late to improve an existing environment. Monitoring can be implemented fairly easily. My recommendations follow.
- Review your current Java EE environment monitoring capabilities and identify improvement opportunities.
- Your monitoring solution should cover the end-to-end environment as much as possible; including proactive alerts.
- The monitoring solution should be aligned with your capacity planning process discussed in our first section.
Our last source of performance problems is the network. Major network problems can happen from time to time such as router, switch, and DNS server failures. However, the more common problems observed are typically due to regular or intermittent latency when working on a highly distributed IT environment. The diagram below highlights an example of network latency gaps between two geographic regions of a Weblogic cluster communicating with an Oracle database server located in one geographic region only. Intermittent or regular latency problems can definitely trigger some major performance problems and affect your Java EE application in different ways.
- Applications using database queries with large datasets are fully exposed to network latency due to the high number of fetch iterations (back and forth across the network).
- Applications dealing with large data payloads (such as large XML data) from external systems are also exposed to network latency, which can trigger intermittent high response times when sending requests and receiving responses.
- The Java EE container replication process (clustering) can be affected, putting its fail-over capabilities at risk (e.g., multicast or unicast packet losses).
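The first point can be made concrete with simple arithmetic: the latency cost of reading a result set is roughly the number of fetch round trips multiplied by the network round-trip time. The following is a minimal sketch under stated assumptions: the row count and the 20 ms round-trip time are made-up illustration values, the class name FetchLatencyEstimate is hypothetical, and the estimate ignores query execution and data transfer time.

```java
class FetchLatencyEstimate {

    /** Number of network round trips needed to fetch rows at fetchSize rows per trip. */
    static long roundTrips(long rows, int fetchSize) {
        return (rows + fetchSize - 1) / fetchSize; // ceiling division
    }

    /** Time spent purely on network latency, in milliseconds. */
    static long latencyCostMillis(long rows, int fetchSize, long roundTripMillis) {
        return roundTrips(rows, fetchSize) * roundTripMillis;
    }

    public static void main(String[] args) {
        long rows = 50_000;  // assumed result set size
        long rttMillis = 20; // assumed round-trip time between regions
        // 10 is a commonly cited JDBC driver default fetch size; verify it for your driver.
        System.out.println("fetchSize=10  -> " + latencyCostMillis(rows, 10, rttMillis) + " ms");
        System.out.println("fetchSize=500 -> " + latencyCostMillis(rows, 500, rttMillis) + " ms");
    }
}
```

With these assumed numbers, the small fetch size spends 100,000 ms on latency alone versus 2,000 ms for the larger one. In JDBC the fetch size is a hint set through Statement.setFetchSize, and a larger value trades round trips for more memory per result set.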