Tuesday, January 10, 2017

Gotcha #6 - Sometimes Naive Caching is Worse than None

There is a well-known proverb about good intentions and the road to hell, but an alternate version is phrased as "Hell is full of good meanings, but Heaven is full of good works". Unfortunately, even good works can lead down the wrong path when they are undertaken with insufficient consideration for their impact.

This post is about one such work - the implementation of a caching strategy on a large platform that is widely used across the industry. The work itself is good, smart, and done with significant forethought. However, under certain conditions, the choices made ended up creating a system worse than if nothing had been done at all. This post explores the choices, circumstances, and effect that implementing an intelligent - but naive - caching system had on one particular project, and why the result was worse than if it had never been built.

The names of all organizations, people, and software involved have been withheld.

6 Days to Go-Live

A war room had been convened. The production deployment of a new Public Access System (PAS) had gone smoothly over the weekend, and everything was prepared to flip the switch and make it accessible in less than a week. The public information campaign had already been underway for some time, advertising the date and getting the word out that this new offering was forthcoming. However, validation of the deployment had hit a snag - one member of the QA team had noticed something wrong.


At a critical step while making a new purchase on the PAS, the system would appear to pause. After a long delay (45s) it would return with a technical error. Repeating this process several times in a row would produce the same error across multiple sessions, across multiple machines. And then it would go away and everything would work smoothly again.

Until it happened again about an hour later.

And again an hour after that.

And then it went away. Despite combinations of testing - several people trying it in a variety of ways, using precisely the same data or different data - it couldn't be reproduced.

Until the next morning, when the problem showed up anew and again the following hour, before disappearing once more.

So began an urgent search to understand the problem: testing the system, combing the logs, and reviewing all the code tracing and analytic tools at our disposal to isolate and resolve the issue - in production.

Underlying this effort was the question - why wasn't this problem seen in any other environment? Across the myriad Dev and QA environments, through multiple test cycles and performance testing, this pattern of problems had never been encountered. At times a similar issue would be seen in isolated circumstances immediately following the restart of an environment, but once it was initialized everything was good.

So why was it happening in production, and nowhere else?

Architecture

The PAS is a clustered three-tier system using RESTful services to communicate between the web server and the Back-end Application (BEA). The BEA contains all the business code required to verify, validate, and persist purchase information.



The PAS cluster contains 2 nodes, and the BEA cluster contains 8, separated into two groups of 4 - one group dedicated to the PAS and the other to direct Intranet access. Because the interface is truly RESTful, no session information is tracked between requests to the BEA cluster, and the load balancer provides true round-robin distribution.

When the system is active, most of the processing load takes place on the BEA cluster; the PAS operates as little more than a view-rendering layer. When a page request comes in to the PAS, it opens a connection to one of the BEA nodes, passes along the request, and waits for a response. To handle the case where a BEA node takes an abnormally long time, or fails and generates no response, the connection has a 45s timeout. If a response takes longer than 45s, the PAS closes the connection and displays a technical error message encouraging the customer to call their local company representative to complete the purchase.
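As an illustration, here is a minimal sketch of what that front-end call might look like, assuming a plain Java HTTP client with a hard 45-second response timeout. The class name, endpoint, and error handling are illustrative - not taken from the actual system:

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.net.http.HttpTimeoutException;
  import java.time.Duration;

  public class BeaClient {
    private final HttpClient client = HttpClient.newHttpClient();

    // Forward a page request to a BEA node and wait for the response.
    public String forward(URI beaNode) throws Exception {
      HttpRequest request = HttpRequest.newBuilder(beaNode)
          .timeout(Duration.ofSeconds(45)) // give up after 45s
          .GET()
          .build();
      try {
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
      } catch (HttpTimeoutException e) {
        // Rendered as the "technical error" page that tells the customer
        // to call their local company representative.
        return "TECHNICAL_ERROR";
      }
    }
  }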

The Trail of Evidence

When the problem was first raised, there was a great deal of speculation as to the possible cause. Database issues, connection issues, misconfiguration, load balancer issues - all were possible sources of this technical error, but all would need to be verified and eliminated.

There were a number of observations about the system that were unusual or worth investigating - including hung database threads and the presence of a query in the AWR (query performance statistics) report that was known to be problematic in other environments. But in hindsight, only one would prove important to the actual cause: the problem only happened in the morning, when full-time employees arrived for work and began using the BEA system directly via the Intranet.

Now, the Intranet BEA nodes are physically separate from the ones used by the PAS, but they do share a database. The problem only occurred during the morning hours, when load on the BEA nodes and the database was at its highest - not high enough to cause a bottleneck or result in performance problems, but not insignificant either. More on this later.

The most important piece of information came from our enterprise performance monitoring tool that was installed on all production servers. By drilling down, identifying, and isolating a single request that failed we could examine the performance of the BEA code for that single request.

And what we discovered was a function call that took 24s, retrieving a large but critical piece of reference data from the database and storing it in the application cache. By itself, not enough to cause a timeout and produce a technical error - but in a confluence of circumstances, this piece of reference data was then used for a large and complex set of calculations that took ~20s to complete. In this particular instance, the combination was just barely slow enough to cross the 45s timeout threshold and cause the technical error to be displayed.

Now, this isn't unusual - the median time for this particular load-and-store operation was about 15-18s, and once it completed, the data would be cached and any future access would take a tenth of a second or less. But when it did have to retrieve the data while the database was under significant load, the operation slowed just enough to trigger the problem.

But why was it happening at all? Why wasn't it happening elsewhere? And why wasn't it discovered during Performance Testing?

Unexpected Caching

The answer came during a conference call with the vendor. The cache on the BEA system operated as follows:

  1. Objects in the cache are marked as stale at 15-minute intervals if they have not been accessed during that timeframe.
  2. If an object is already marked as stale when the interval arrives, it is evicted from the cache.
  3. Every 4th interval (once per hour), all objects in the cache are evicted regardless of their state.

This cache interval was configurable to a maximum of 30 minutes, extending the global eviction from every 1 hour to every 2 hours - but no further. This means that every hour, all reference data would be purged from the BEA system's memory and would need to be reloaded the next time it was accessed... on each node.
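Here is a rough sketch of that eviction policy as I understand it, assuming a scheduled sweep over an in-memory map. All names and implementation details are my own reconstruction, not the vendor's code:

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;

  public class NaiveCache<K, V> {
    private static class Entry<V> {
      final V value;
      volatile boolean accessed = true; // touched since the last sweep?
      volatile boolean stale = false;   // flagged by a previous sweep?
      Entry(V value) { this.value = value; }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private int sweepCount = 0;

    public NaiveCache(long intervalMinutes) {
      Executors.newSingleThreadScheduledExecutor()
          .scheduleAtFixedRate(this::sweep, intervalMinutes, intervalMinutes, TimeUnit.MINUTES);
    }

    public V get(K key) {
      Entry<V> e = entries.get(key);
      if (e == null) return null; // miss: the caller must reload from the database
      e.accessed = true;
      e.stale = false; // assumption: an access un-stales the entry
      return e.value;
    }

    public void put(K key, V value) { entries.put(key, new Entry<>(value)); }

    private synchronized void sweep() {
      if (++sweepCount % 4 == 0) { // rule 3: every 4th interval, evict everything
        entries.clear();
        return;
      }
      entries.values().removeIf(e -> e.stale); // rule 2: evict already-stale objects
      for (Entry<V> e : entries.values()) {    // rule 1: mark unaccessed objects stale
        if (!e.accessed) e.stale = true;
        e.accessed = false;
      }
    }
  }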

The Core Problem

In order for the problem to occur, the following had to happen:

  1. The BEA's cache was empty on the nodes that were used by the PAS.
  2. The other Intranet BEA nodes had to be actively used by a large number of concurrent users to put sufficient load on the database to cause a retrieval delay for the large amount of reference data.
  3. The PAS had to be inactive, with only 1 or 2 users online performing a new purchase. Other operations would load the reference data but would not perform the complex calculations that pushed the response delay past 45s.

Seems like an unlikely scenario, right?

Except...

  1. The BEA cache would be emptied on the hour, every hour.
  2. The Intranet BEA nodes were being used by employees for their normal job, and would be heavily used every weekday for several hours during the morning.
  3. The PAS was a limited-scope rollout; the only functions it performed were New Purchase and Continue a Saved Purchase.
  4. The PAS was so new that the initial expected load was only a handful of users per day, so the likelihood of having only 1-2 users performing a new purchase was very high.

These circumstances meant it was very likely that any user of the system would run into the situation where all three conditions were met, causing their purchase attempt to fail.

This problem was further compounded by the round-robin load balancer and the fact that the reference data needed to be loaded on each node separately. Not only was a user likely to encounter the technical error message, but if they retried their purchase, their next request would be directed to a different BEA node whose cache was still empty - producing the same error again.

The expected result: with four cold PAS-facing BEA nodes in the round-robin rotation, each failed attempt would land on (and eventually warm) a different node, so the majority of potential customers would need as many as five attempts before successfully completing a purchase.

Why Wasn't This Caught?

The simple answer is that the other environments were not busy enough. Dev, QA, Staging... none of these environments had sufficient database usage to cause enough delay to cross the 45s timeout threshold. In addition, these environments typically were not configured in the dual-cluster architecture of production, so any activity directly on a BEA node would cause the reference data to be loaded into cache. The chances of a PAS new purchase occurring against a busy database were essentially non-existent.

The exception was the Performance Testing environment, which did have a fully configured production-like system and sufficient database traffic during a test to cause the problem. The reason it wasn't discovered there was that the PAS traffic was too high. Even a half-dozen simultaneous users were sufficient to cache the reference data across all the BEA nodes - and even if the error was observed once or twice near the beginning of a test, a few minor functional errors are not unexpected during a performance test, and the absence of recurrence over the next hour meant it was simply dismissed as an anomaly.

As soon as the circumstances of the problem were understood, it became a simple matter to intentionally reproduce it in Performance Test by adding load to the Intranet BEA nodes and manually attempting a new purchase.

Why Did This Happen?

This problem was caused directly by the implementation of a naive caching framework on the BEA system. The purpose of clearing the cache regularly was to ensure that reference data, if updated, could only be stale for a limited time before being refreshed. The assumption was that the cache could be cleared every hour without causing a problem - which turned out not to be the case. The BEA system had no facility to preload reference data, no way to refresh it periodically except through eviction and on-demand reload, and no ability to persist reference data in a long-term, no-expiry cache.

The reason this was more than just a normal problem is that by implementing a limited cache, the system successfully masked the existence of the problem from developers, QA, and performance testing. The naive caching strategy allowed it to work well under load most of the time, and to surface as a problem only in very specific circumstances.

Had there been no cache on the BEA system, this problem would have become abundantly clear in performance testing, resulting in terrible performance for that operation and an error rate approaching 100%. But by masking the issue, the cache allowed it to survive until mere days before go-live, when it was discovered entirely by accident.

The Solution

The permanent solution will be to modify the BEA system to implement a more sensible multi-use caching strategy, with the ability to preload and periodically refresh reference data independent of any user request. In the interim, the problem has been resolved with an automated process that makes an appropriate service request to each BEA node every minute, triggering the reference data reload (if it has been evicted) so that the reload completes before a user encounters it.
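The stop-gap is effectively a cache warmer. Here is a minimal sketch, assuming each BEA node exposes some service request whose handling touches the reference data - the node addresses and endpoint are hypothetical:

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.time.Duration;
  import java.util.List;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;

  public class CacheWarmer {
    private final HttpClient client = HttpClient.newHttpClient();
    // Hypothetical addresses for the PAS-facing BEA nodes.
    private final List<URI> beaNodes = List.of(
        URI.create("http://bea-node-1/api/reference-data/ping"),
        URI.create("http://bea-node-2/api/reference-data/ping"));

    public void start() {
      // Hit every node once a minute so a reload triggered by eviction
      // completes before a real customer request arrives.
      Executors.newSingleThreadScheduledExecutor()
          .scheduleAtFixedRate(this::warmAll, 0, 1, TimeUnit.MINUTES);
    }

    private void warmAll() {
      for (URI node : beaNodes) {
        try {
          client.send(
              HttpRequest.newBuilder(node).timeout(Duration.ofSeconds(55)).GET().build(),
              HttpResponse.BodyHandlers.discarding());
        } catch (Exception e) {
          // A slow or failed warm-up just gets retried on the next tick.
          System.err.println("Warm-up failed for " + node + ": " + e);
        }
      }
    }
  }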

Wednesday, August 17, 2016

Tricorn Games - Flipped Off!

I am excited to announce that I have embarked upon a new venture - Tricorn Games. Under this banner we will be developing and publishing tabletop and card games, with potential expansion into the realm of mobile and computer games.

Flipped Off!, our first product, is currently in the final stages of development and playtesting. Flipped Off! is a tactical card game where you play a plotting Mastermind aiming to defeat your rivals and seize control of the city for yourself. You must manage your Minions - the Pirates, Ninjas, and Robots that do your bidding - play actions that can improve your situation or wreak havoc with your enemies, and use your Minions to launch attacks, hoping that your rivals can’t turn the cascade of results to their own advantage.


We are currently recruiting playtesters. If you would like an early look at this game, and would like to help us refine it and have input into the final result - please sign up here and join us! Playtesters will receive a high-quality printable PDF of all the game components and instructions, plus access to the playtest forum, and active participants will receive public credit for their contributions when the final release is complete.

In addition to Flipped Off!, we have several additional projects under development which will be announced when they are ready for release.

Remember - Play more games!

Wednesday, May 25, 2016

The Tragically Hip - Gord Downie

20 years ago, when I was in my early teens, I took a trip across the country by myself. Looking back it doesn't seem like much - I flew from Winnipeg to Edmonton via Saskatoon, a 3-4 hour journey with layovers - but at the time it was a huge step towards independence for me. I was going to Edmonton to spend a month with my cousins during the summer. In truth, I was dropped off at one gate, shepherded by a flight attendant who was there to help underage passengers flying alone, and picked up by family on the other end. But for a kid who had never flown before - one perfectly happy whiling away the hours at a computer screen in the basement - doing it alone felt like a huge, scary undertaking.

It was freedom, and it changed my life, though it took me another five years to realize it.

The flight was uneventful. Scary, thrilling, strange in the way that is common to airplanes and amusement rides. Funny how I became a fan of the former, and you couldn't pay me enough money to tolerate the latter. I wasn't the only unaccompanied minor on the plane; there was another kid seated with me who was a couple of years younger, and it was his first flight too.

I guess I was visibly nervous. I was a shy kid to start with - not much for talking - and I expect I was probably white as a sheet, staring out the window at the tarmac as the rest of the plane boarded. As the plane began making its way to the runway - shuddering, pinging, whirring, all things unfamiliar to a novice traveler - a passenger in the row ahead of me turned to talk to us. I don't recall what he asked us, but I told him my name and that it was my first time flying, and we talked briefly in the way that strangers do while the plane was getting ready for takeoff.

As I said, the flight was uneventful for anyone who knew what to expect. Once we were in the air and on our way to Saskatoon, the nice guy in the row ahead of us asked if we could give him a hand. He and his buddy were doing a crossword puzzle from the morning paper. I can't say that either of them was terribly adept at it, but I was a kid. I don't think I helped worth beans, but it was fun. My fellow underage passenger wasn't interested, but I remember poking my head over the top of the guy's seat and spending the better part of an hour working the puzzle with the two of them.

Before I knew it, we were descending into Saskatoon. At the gate, the two guys said goodbye and left with some other friends of theirs who were also on the flight, while I stayed aboard for the next leg of the journey to Edmonton.

After everyone who was getting off had done so, the flight attendant came back to check on us before more passengers got on. I gather she was a fan - she came over and asked us "Do you know who that was? That was the Tragically Hip!".

I was a kid, I didn't have any bloody idea who the Tragically Hip were. I had heard the name before, but I couldn't have named a single song they played. I was more amazed at my uncle's exclamation when I related the story to him that evening. He knew who The Hip were.

Fast-forward five years, I am 17 years old, just about to turn 18, sitting in my dorm room in Waterloo, Ontario - 2,000km from my home and family. I had just embarked upon the biggest, most exciting, most frightening journey of independence in my life since that flight to Edmonton... and I was feeling it pretty badly. That night, and more than a few nights after I would put on my headphones and turn on my favorite playlist, comforted by the familiar strains of songs I had listened to a thousand times before. Songs that spoke to me in some way, songs that sounded like home. I would fall asleep listening to Great Big Sea and The Tragically Hip, whose album I would never have picked up had it not been for an enthusiastic flight attendant.

This post was prompted by the news that Gord Downie of The Tragically Hip has cancer - treatable, but terminal. My father-in-law was killed by the disease less than a year ago, so I can relate to some of what his family is going through today. That's why I wanted to post this, my own personal experience with them.

So for Ahead by a Century, Membership, Thompson Girl, My Music at Work, and most importantly... Courage, thank you.

Monday, March 21, 2016

Gotcha #5 - Network Performance, CSS, and HTML5

Today's article is a lesson in understanding your customer's environment - and in how emphasizing style over substance can break the bank.

Our story starts with an urgent request for assistance from a system administrator: a minor application update had resulted in a nearly five-fold increase in total network traffic at a remote office. On Monday morning, when the morning shift arrived, network traffic began to spike. It grew and grew, far beyond normal traffic levels for the office, and stayed there, night and day, until the last shift signed off on Friday evening.



Think about this for a moment.

The application's network traffic did not increase five-fold. The entire office's network traffic consumption increased from ~2Mbps (24-hour average) to nearly 9Mbps (24-hour average) with peaks topping out at 10Mbps.

The first stage is always denial.

How could a minor update to a server-based web application - mostly consisting of a couple of minor bug fixes, one new summary screen, and some UI tweaks - be the problem? It must be something else; there isn't anything new or unusual in the application update.

Well...

Step 1: Isolating the Problem
Easy enough to do: using more detailed reports, we could see that the initial spike in unusual network activity began at 10am the previous Friday - the same time as the outage window when the application update was deployed.

Also, there were no other maintenance activities ongoing at that time, no file transfers, and no other system updates scheduled or otherwise.

Hmm...

Step 2: Reproducing the Issue
To confirm, we conducted a test in which every instance of the application on every workstation on the site was closed - and network traffic returned to its normal (low) stable state.

One by one, each workstation re-loaded the application, and by the time the last workstation was reconnected, network traffic was back at its five-fold peak. It was during this process of re-loading that the system administrator noticed something unusual. A new UI feature had been added to some screens: when a specific piece of data was in a warning state, the text box would flash orange. And every time a newly loaded screen happened to display one of these flashing orange boxes, network traffic would spike sharply.

In the end, the system administrator found that 10 out of more than 100 workstations happened to be displaying this flashing orange box. Shutting down just these 10 flashing displays dropped network traffic back to normal; when they were re-loaded, traffic spiked once more.

Step 3: Cause and Solution
So how can a simple CSS and HTML5 flashing text box on 10 screens possibly push four times the total network traffic of an office of more than 100 workstations? The technique of highlighting a text box had been used elsewhere in the application already, and the style was rendered entirely in-browser, with no additional server requests being made.

How can a box no larger than 400px by 300px with a lovely fade-in/fade-out HTML5 animation possibly consume 800Kbps per screen?

The Epiphany

Animated. For the first time in this application, the developer (to satisfy a requirement) used a simple animation cycle to flash the control orange on a 4-second cycle. It was a lovely effect - understated, but unmistakable, and easily visible when displayed on a large screen from across the room.
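For reference, this kind of effect takes only a few lines; here is a minimal sketch, assuming a CSS keyframes animation. The selector, color, and animation name are illustrative - the application's actual stylesheet isn't reproduced here:

  /* Illustrative only: a gentle 4-second orange flash with fade-in/fade-out. */
  @keyframes warning-flash {
    0%, 100% { background-color: transparent; }
    50%      { background-color: #ff9933; }
  }
  .warning-box {
    animation: warning-flash 4s ease-in-out infinite;
  }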

What wasn't clear, however, was that each workstation did not use a local browser to view the application. Instead, each workstation used a Remote Desktop (RDP) session to a centralized (offsite) server, which then connected to the application in a browser window.

Instead of being rendered in a local browser session and consuming no network resources at all, the beautifully animated flashing textbox was being rendered and streamed as uncompressed full-motion video over the network to the site office.

10 workstations streaming uncompressed video 24 hours a day, 5 days a week.
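And the arithmetic checks out: at roughly 800Kbps per animated screen, those 10 workstations account for about 8Mbps - almost exactly the gap between the office's ~2Mbps baseline and its ~10Mbps peaks.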

Step 4: Resolution
In the end, the problem was easy to solve: rip out the animation and use a simple steady-state orange highlight to communicate the information.

But it was an important lesson. You must be aware of how your customer is using the tools that you build, and how the simplest of design choices can cause a major impact under the wrong circumstances.

Tuesday, January 19, 2016

Research Study: Memory Test

Contest!

Anyone who signs in with an email address and completes the test before Feb 28 will be entered into a drawing for one of the top 2015/2016 indie games from Steam, or an Amazon book order of equal value.

Participate Here!


I'm conducting a study on memory and people's ability to memorize short sequences of numbers. To do this, I've built a small web app to conduct memory tests. It tracks both correct answers and time spent, and scores your performance - so speed is better, but not at the expense of answering incorrectly.

The purpose of this study is to test your short-term memory, numeric analysis ability, and speed. The test is divided into 5 levels of 5 questions that increase in difficulty. You may complete them in any order and at any time - do some now and return later for the rest if you like - but you must complete them all before the results can be processed.

The test is randomly generated, so you can take it as many times as you'd like to try to do better, and it tracks your history so you can see your best score if you take it more than once. Each question presents you with a small set of numbers to memorize, then asks you to answer a question about those numbers.

It's very simple, and quick to do. Please give it a try, and share it around - compete against your friends!

For SCIENCE!