Performance and Site Relability Engineering -Sreenivasulu kota: Tibco Business Events Entity Cache Performance Trap

Thanks again for another great story from A. Alam, a performance engineer working for Infosys Ltd. who conducts large-scale load tests for a very large enterprise. Mr. Alam and his team shared this story from a JMeter load test they ran in their production Tibco Business Events environment. The load test was meant to verify the performance of an item availability system that manages REST calls and large-scale inventory changes for e-commerce, mobile apps and retail stores.

Using Dynatrace on their Tibco servers helped them identify sporadic response time spikes across the whole system when the tested transaction load exceeded 500 TPS (transactions per second). The following Dynatrace chart shows these spikes. Each line represents a different test script, with response time captured by Dynatrace on the actual application server:

Response time spiked across most of the simulated transactions once load exceeded 500 TPS (Transactions per Second)

Tibco Environment Explained

PurePath explains Active Space Object Retrieval Internals

Root Cause: Exhausted Active Space Server. Solution: Increase Batch Size

I am not an expert in Tibco, but thanks to our friends from Infosys, the insight of Dynatrace PurePath and the help of the Tibco Engineering team, the problem was identified in Active Space. It turns out that queries for large object sets (more than 10k objects) get split into multiple requests to Active Space, causing much more load than anticipated. The exact technical problem, how they found it and the chosen workaround are discussed in the remainder of this blog.

To give you a better understanding of their environment, check out the architectural diagram below. It shows the Tibco BE REST Service that different end consumers use to access inventory data managed by Active Space. It also shows the CDC load coming from Tibco EMS and how it is used to test the Cache Service, which also queries and updates data in Active Space.

During their load test, when load reached 400-500 TPS, performance spikes were seen in CDC message processing and were reported at the same time by the end users of the Tibco BE REST Service:

Architectural Diagram of their Tibco Environment. Active Users and Load Test both rely on Active Spaces

Tibco BE has a default batch setting that retrieves 10k Object IDs from the BE cache. If more than 10k Object IDs are requested, several roundtrips to Active Space are necessary to retrieve the subsequent batches. They identified this behavior by looking at the PurePaths captured in both the REST Service and the Cache Service. The PurePaths showed them that the method nextEntityId keeps requesting further batches of 10k objects from Active Space whenever more than 10k objects are requested. It does this by sending an asynchronous request to Active Space and letting the main thread wait until the next batch is available:
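To make the batching behavior concrete, here is a minimal sketch in Python that models it. This is not Tibco code: `FakeSpace`, `BatchedIdSource`, `next_entity_id` and `roundtrips_for` are hypothetical names invented for illustration, and the real nextEntityId implementation may differ in its details. The sketch only mirrors what the PurePath showed: IDs are handed out from a local batch, and when the batch is exhausted the next one is requested asynchronously while the calling thread waits.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 10_000  # default batch size described in the text


class FakeSpace:
    """Hypothetical stand-in for the Active Space server; counts roundtrips."""

    def __init__(self):
        self.roundtrips = 0
        self._next_start = 0

    def fetch_batch(self, size):
        # One call here models one roundtrip to the server.
        self.roundtrips += 1
        start = self._next_start
        self._next_start += size
        return range(start, start + size)


class BatchedIdSource:
    """Hands out IDs from a local batch and refills it when exhausted."""

    def __init__(self, space, batch_size=BATCH_SIZE):
        self._space = space
        self._batch_size = batch_size
        self._batch = iter(())  # starts empty, so the first call fetches
        self._pool = ThreadPoolExecutor(max_workers=1)

    def next_entity_id(self):
        try:
            return next(self._batch)
        except StopIteration:
            # Batch exhausted: request the next one asynchronously,
            # then block the calling thread until it arrives.
            future = self._pool.submit(self._space.fetch_batch, self._batch_size)
            self._batch = iter(future.result())
            return next(self._batch)

    def close(self):
        self._pool.shutdown()


def roundtrips_for(object_count, batch_size=BATCH_SIZE):
    """Count the server roundtrips needed to hand out object_count IDs."""
    space = FakeSpace()
    source = BatchedIdSource(space, batch_size)
    for _ in range(object_count):
        source.next_entity_id()
    source.close()
    return space.roundtrips
```

In this model, `roundtrips_for(25_000)` needs three roundtrips with the default batch size, while `roundtrips_for(25_000, batch_size=100_000)` needs only one.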

Internals of the Active Space API: When 10k ObjectIDs are exhausted the next batch is requested from Active Space

Because both the Tibco BE REST interface and the Cache Service (driven by the load test) were requesting millions of objects per request, the Active Space service was simply overloaded with too many parallel requests. If you think about it: a request for 1 million objects results in 100 roundtrips at a batch size of 10k. This means that the Active Space server receives on average 100 times as many requests as are sent to the REST and Cache Services.
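The arithmetic behind this amplification can be sketched in a few lines of Python. The function names are made up for illustration, and the 500 TPS figure in the usage note is a worst-case assumption (every transaction asking for 1 million objects), not a measurement from the load test.

```python
def roundtrips(objects_requested, batch_size):
    """Roundtrips needed to fetch the objects in batches (ceiling division)."""
    return -(-objects_requested // batch_size)


def backend_request_rate(frontend_tps, objects_requested, batch_size):
    """Requests per second arriving at the backend for a given front-end load."""
    return frontend_tps * roundtrips(objects_requested, batch_size)
```

For example, `roundtrips(1_000_000, 10_000)` gives 100, so under the assumption above `backend_request_rate(500, 1_000_000, 10_000)` comes to 50,000 backend requests per second. With any batch size at least as large as the request, the roundtrip count drops to 1 and the backend rate equals the front-end rate.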

The solution they chose was one suggested by the Tibco engineering team: increasing the Object Entity batch size from 10k to 10 million. This allowed both the REST interface and the Cache Service to fetch data from Active Space in a single roundtrip. This took a lot of load off Active Space and therefore improved overall performance. It remains to be seen whether there are any side effects, e.g. higher memory usage due to the larger batch size.

It would be interesting to see whether other Tibco and Active Space users have run into similar issues and how they solved them. Please let us know. If you don't know whether your system is facing this issue, simply install the Dynatrace Free Trial on your Tibco servers, which gives you exactly these details. Keep us posted on your findings!
