Tuesday, January 10, 2017

Gotcha #6 - Sometimes Naive Caching is Worse than None

There is a well-known proverb about good intentions and the road to hell, but an alternate version of it is phrased as "Hell is full of good meanings, but Heaven is full of good works". Unfortunately, even good works can lead down the wrong path when they are undertaken with insufficient consideration for their impact.

This post is about one such work: the implementation of a caching strategy on a large platform that is widely used across the industry. The work itself is good, smart, and done with significant forethought. However, under certain conditions, the choices made ended up creating a system worse than if nothing had been done at all. This post explores the choices, circumstances, and effects of implementing an intelligent - but naive - caching system on one particular project, and why the result was worse than if it had never been built.

The names of all organizations, people, and software involved have been withheld.

6 Days to Go-Live

A war room had been convened. The production deployment of a new Public Access System (PAS) had gone smoothly over the weekend, and everything was prepared to flip the switch and make it accessible in less than a week. The public information campaign had already been underway for some time, advertising the date and getting the word out that this new offering was forthcoming. However, validation of the deployment had hit a snag - one member of the QA team had noticed something wrong.


At a critical step while making a new purchase on the PAS, the system would appear to pause. After a long delay (45s) it would return with a technical error. Repeating this process several times in a row would produce the same error across multiple sessions, across multiple machines. And then it would go away and everything would work smoothly again.

Until it happened again about an hour later.

And again an hour after that.

And then it went away. Despite combinations of testing - several people trying it in a variety of ways, using precisely the same data or different data - it couldn't be reproduced.

Until the next morning, when the problem showed up anew and again the following hour, before disappearing once more.

So began an urgent search to understand the problem: testing the system, combing the logs, and reviewing all the code tracing and analytic tools at our disposal to isolate and resolve it - in production.

Underlying this effort was the question - why wasn't this problem seen in any other environment? Across the myriad Dev and QA environments, through multiple test cycles and performance testing, this pattern of problems had never been encountered before. At times a similar issue had been seen in isolated circumstances immediately following the restart of an environment, but once the environment was initialized, everything was good.

So why was it happening in production, and nowhere else?

Architecture

The PAS is a clustered three-tier system using RESTful services to communicate between the web server and the Back-end Application (BEA). The BEA contains all the business code required to verify, validate, and persist purchase information.



The PAS cluster contains 2 nodes, and the BEA cluster contains 8, separated into two groups of 4 - one group dedicated to the PAS and the other to direct Intranet access. As a true RESTful interface, no session information is tracked between requests to the BEA cluster. The load balancer provides a true round-robin.

When the system is active, most of the processing load takes place on the BEA cluster; the PAS operates as little more than a view rendering layer. When a page request comes in to the PAS, it opens a connection to one of the BEA nodes, passes along the request, and waits for a response. To guard against a BEA node taking an abnormally long time, or failing to respond at all, the connection has a 45s timeout. If a response takes longer than 45s, the PAS closes the connection and displays a technical error message encouraging the customer to call their local company representative to complete the purchase.
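
To make the timeout behaviour concrete, the PAS-side call looks conceptually like the sketch below. This is a minimal illustration using a plain Java HTTP client; the endpoint path and class names are hypothetical, not the platform's actual code.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

public class PasPurchaseClient {

    // Hypothetical endpoint - the real service paths aren't shown in this post.
    private static final URI PURCHASE_URI = URI.create("http://bea-node/purchase/submit");

    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();

    public String submitPurchase(String jsonPayload) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(PURCHASE_URI)
                .timeout(Duration.ofSeconds(45))   // the 45s limit described above
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonPayload))
                .build();
        try {
            // If the BEA node is still loading reference data and then running the
            // complex calculation, this call can exceed 45s and time out.
            return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } catch (HttpTimeoutException e) {
            // The PAS closes the connection and renders the technical error page
            // telling the customer to call their local company representative.
            return renderTechnicalErrorPage();
        }
    }

    private String renderTechnicalErrorPage() {
        return "technical-error";
    }
}
```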

The Trail of Evidence
When the problem was first raised, there was a great deal of speculation as to the possible cause. Database issues, connection issues, misconfiguration, load balancer issues - all were possible sources of this technical error, and all would need to be verified and eliminated.

There were a number of observations made about the system that were unusual or worth investigating - including hung database threads and the presence of a query in the AWR (query performance statistics) report that was known to be problematic in other environments. But in hindsight there was only one that would prove to be important to the actual cause: the problem only happened in the morning when full-time employees arrived for work and began using the BEA system directly via the Intranet.

Now the Intranet BEA nodes are physically separate from the ones used by the PAS, but they do share a database. The problem only occurred during the morning hours, when load on the BEA nodes and the database was at its highest. Not high enough to cause a bottleneck or result in performance problems - but not insignificant either. More on this later.

The most important piece of information came from our enterprise performance monitoring tool, which was installed on all production servers. By drilling down and isolating a single failed request, we could examine the performance of the BEA code for that request.

And what we discovered was a function call that took 24s to retrieve a large but critical piece of reference data from the database and store it in the application cache. By itself, not enough to cause a timeout and result in a technical error - but in a confluence of circumstances this piece of reference data was then used for a large and complex set of calculations that took ~20s to complete, and in this particular instance the combination was just barely slow enough to cross the 45s timeout threshold and cause the technical error to be displayed.

Now this isn't unusual - the median time for this particular load-and-store operation was about 15-18s, and once it was complete the data would be cached and any future accesses would take 1/10th of a second or less. But when the data did have to be retrieved while the database was under significant load, the operation slowed just enough to trigger the problem.

But why was it happening at all? Why wasn't it happening elsewhere? And why wasn't it discovered during Performance Testing?

Unexpected Caching

The answer came in a conference call to the vendor. The cache on the BEA system operated as follows:

  1. Objects in the cache were marked as stale at 15-minute intervals if they had not been accessed during that interval.
  2. If an object was already marked as stale when the interval arrived, it would be evicted from the cache.
  3. Every 4th interval (once per hour) all objects in the cache would be evicted regardless of their state.

This cache interval was configurable to a maximum of 30 minutes, extending the global eviction from every hour to every 2 hours - but no further. This means that every hour all reference data would be purged from the BEA system's memory and would need to be reloaded the next time it was accessed... on each node.
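
Paraphrased as code, the policy looks roughly like the sketch below. This is a reconstruction from the vendor's description only; the class and method names are invented for illustration and are not the platform's real API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class IntervalCache<K, V> {

    private static final class Entry<T> {
        final T value;
        volatile boolean stale;      // set by the sweeper; false on (re)insertion
        volatile boolean accessed;   // reset by the sweeper each interval
        Entry(T value) { this.value = value; }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final ScheduledExecutorService sweeper = Executors.newSingleThreadScheduledExecutor();
    private int tick = 0;

    public IntervalCache(long intervalMinutes) {   // 15 by default, configurable up to 30
        sweeper.scheduleAtFixedRate(this::sweep, intervalMinutes, intervalMinutes, TimeUnit.MINUTES);
    }

    public V get(K key) {
        Entry<V> e = entries.get(key);
        if (e == null) {
            return null;             // miss: the caller must reload from the database
        }
        e.accessed = true;           // (whether an access also clears an existing stale
        return e.value;              //  mark is not specified in the vendor's description)
    }

    public void put(K key, V value) {
        entries.put(key, new Entry<>(value));
    }

    private synchronized void sweep() {
        tick++;
        if (tick % 4 == 0) {         // every 4th interval (once per hour at 15 minutes):
            entries.clear();         // evict everything, regardless of state
            return;
        }
        entries.forEach((key, e) -> {
            if (e.stale) {
                entries.remove(key); // already stale when the interval arrived: evict
            } else {
                if (!e.accessed) {
                    e.stale = true;  // not accessed during this interval: mark stale
                }
                e.accessed = false;
            }
        });
    }
}
```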

The Core Problem

In order for the problem to occur, the following had to happen:

  1. The BEA's cache was empty on the nodes that were used by the PAS.
  2. The other Intranet BEA nodes had to be actively used by a large number of concurrent users to put sufficient load on the database to cause a retrieval delay for the large amount of reference data.
  3. The PAS had to be inactive, with only 1 or 2 users online performing a new purchase. Other operations would cause the reference data to be loaded but not perform the complex calculations that would cause the response delay to exceed 45s.

Seems like an unlikely scenario, right?

Except...

  1. The BEA cache would be emptied on the hour, every hour.
  2. The Intranet BEA nodes were being used by employees for their normal job, and would be heavily used every weekday for several hours during the morning.
  3. The PAS was a limited-scope rollout; the only functions it performed were New Purchase and Continue a Saved Purchase.
  4. The PAS was so new that the initial expected load was only a handful of users per day, so the likelihood of having only 1-2 users performing a new purchase was very high.

These circumstances meant that it was very likely that any user of the system would hit the window where all three conditions were met, causing their purchase attempt to fail.

This problem was further compounded by the round-robin load balancer and the fact that the reference data needed to be loaded on each node separately. This meant that not only was a user likely to encounter the technical error message, but if they retried their purchase, their next page request would be directed to a different BEA node whose cache was still empty, resulting in the same error message again.

The expected result was that a majority of potential customers would each need to retry their purchase five times before successfully completing it.
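
The mechanic is easy to see in a toy simulation like the one below. The four PAS-facing BEA nodes come from the architecture described above; the timings and the one-call-per-purchase model are simplifications, so the exact attempt count won't match the real system.

```java
// Toy simulation of the retry behaviour: a round-robin balancer over the four
// PAS-facing BEA nodes, each with its own initially cold cache.
public class RetryIllustration {
    public static void main(String[] args) {
        boolean[] cacheWarm = new boolean[4];   // one flag per BEA node in the PAS group
        int next = 0;                           // round-robin pointer

        for (int attempt = 1; attempt <= 5; attempt++) {
            int node = next;
            next = (next + 1) % cacheWarm.length;

            // Cold cache under morning database load: ~24s reload plus the ~20s
            // calculation, which in the failing case just crossed the 45s limit.
            int seconds = cacheWarm[node] ? 20 : 46;
            cacheWarm[node] = true;             // the reload finishes even though the PAS gave up

            System.out.printf("Attempt %d -> node %d: %ds -> %s%n",
                    attempt, node, seconds, seconds > 45 ? "technical error" : "success");
        }
        // Each retry lands on a different node whose cache is still cold, so the
        // error repeats until every node in the group has been warmed.
    }
}
```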

Why Wasn't This Caught?

The simple answer is that the other environments were not busy enough. Dev, QA, Staging... none of these environments had sufficient database usage to cause enough delay to cross the 45s threshold. In addition, these environments typically were not configured with the dual-cluster architecture of production, so any activity directly on a BEA node would cause the reference data to be loaded into cache. The chances of a PAS new purchase occurring against a busy database were essentially non-existent.

The exception was the Performance Testing environment, which did have a fully configured production-like system and sufficient database traffic during a test to cause the problem. The reason it wasn't discovered there was that the PAS traffic was too high. Even a half-dozen simultaneous users was sufficient to cache the reference data across all the BEA nodes. And even if the error was observed once or twice near the beginning of a test, a few minor functional errors are not unexpected during a performance test, and the absence of any recurrence over the next hour meant it was simply dismissed as an anomaly.

As soon as the circumstances of the problem were discovered, it became a simple matter to reproduce it intentionally in Performance Testing by adding load to the Intranet BEA nodes and manually attempting a new purchase.

Why Did This Happen?

This problem was caused directly by the implementation of a naive caching framework on the BEA system. The purpose behind clearing the cache regularly was to ensure that reference data, if updated, could only be stale for a limited amount of time before being refreshed. The assumption was that the cache could be cleared every hour without causing a problem, which ended up not being the case. The BEA system had no facility with which to preload reference data, no way to refresh it periodically except through eviction and on-demand reload, and no ability to persist reference data in a long-term, no-expiry cache.
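
For contrast, a reference-data cache with those missing capabilities might look something like the sketch below: entries are preloaded at startup and refreshed in the background, so a user request never pays the reload cost. This is an illustration of the general idea only, not the vendor's framework or the design ultimately chosen.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// A reference-data cache that is preloaded at startup and refreshed on a schedule,
// so entries are never evicted and reads are always served from memory.
public class ReferenceDataCache {

    private final Map<String, Object> data = new ConcurrentHashMap<>();
    private final Map<String, Supplier<Object>> loaders = new ConcurrentHashMap<>();
    private final ScheduledExecutorService refresher = Executors.newSingleThreadScheduledExecutor();

    public ReferenceDataCache(long refreshMinutes) {
        // Re-run every registered loader periodically; the old value remains
        // available to readers until the replacement arrives.
        refresher.scheduleAtFixedRate(
                () -> loaders.forEach((key, loader) -> data.put(key, loader.get())),
                refreshMinutes, refreshMinutes, TimeUnit.MINUTES);
    }

    // Called at application startup: pay the expensive database read once,
    // before any user is waiting on it, and remember how to reload it later.
    public void preload(String key, Supplier<Object> loader) {
        loaders.put(key, loader);
        data.put(key, loader.get());
    }

    public Object get(String key) {
        return data.get(key);   // always a fast in-memory read after preload
    }
}
```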

The reason this was more than just a normal problem is that, by implementing a limited cache, the system successfully masked the existence of the problem from developers, QA, and performance testing. The naive caching strategy allowed the system to work well under load most of the time, and to surface the problem only in very specific circumstances.

Had there been no cache on the BEA system, this problem would have become abundantly clear in performance testing, resulting in terrible performance for that operation and an error rate approaching 100%. But because the cache masked the issue, it managed to survive until mere days before go-live, when it was discovered entirely by accident.

The Solution

The permanent solution will be to modify the BEA system to build a more sensible caching strategy, one with the ability to preload and periodically refresh reference data independent of any user request. In the interim, the problem has been resolved with an automated process that makes an appropriate service request to each BEA node every minute; if the reference data has been evicted, the request triggers the reload, with the goal of completing it before a real user encounters the empty cache.
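
Conceptually, the interim workaround is just a scheduled job that touches each node directly, bypassing the load balancer. A sketch of that idea is below; the node addresses, port, and endpoint path are placeholders, since the real deployment details aren't shown here.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Interim cache-warming job: once a minute, call an endpoint on each BEA node that
// touches the reference data, so an eviction is repaired before a customer arrives.
public class CacheWarmer {

    private static final List<URI> BEA_NODES = List.of(
            URI.create("http://bea-pas-1:8080/reference/touch"),
            URI.create("http://bea-pas-2:8080/reference/touch"),
            URI.create("http://bea-pas-3:8080/reference/touch"),
            URI.create("http://bea-pas-4:8080/reference/touch"));

    public static void main(String[] args) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        scheduler.scheduleAtFixedRate(() -> {
            // Each node must be addressed directly, bypassing the round-robin
            // load balancer, because every node maintains its own cache.
            for (URI node : BEA_NODES) {
                try {
                    HttpRequest request = HttpRequest.newBuilder(node)
                            .timeout(Duration.ofSeconds(60))   // allow time for a full reload
                            .GET()
                            .build();
                    client.send(request, HttpResponse.BodyHandlers.discarding());
                } catch (Exception e) {
                    // A failed warm-up is simply retried on the next pass.
                    System.err.println("Warm-up call to " + node + " failed: " + e.getMessage());
                }
            }
        }, 0, 1, TimeUnit.MINUTES);
    }
}
```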