Wednesday, December 06, 2017

Breaking the Chains of Analysis Paralysis - The Black Box (Part 1 of 7)

This is the first article in a series of seven that I will be publishing over the coming weeks about Analysis Paralysis (AP). The purpose of these articles is to highlight the primary causes of AP among people who play games, and to discuss strategies that you can apply as a designer to reduce the potential for AP.

Each article in this series will go into depth on one potential cause of AP, give examples of situations where it can be problematic, and present some solutions that can improve the situation.

Although these articles are specific to board games, the concepts of Analysis Paralysis apply widely to everyday life. These specific examples can easily be translated to apply to systems in general, including business systems and enterprise reporting & dashboards.
  1. The Black Box
  2. The Paradox of Choice – Dec 13
  3. The Prisoner’s Dilemma – Dec 20
  4. The Maze to Victory – Dec 27
  5. Relationship Status: It’s Complicated – Jan 3
  6. Sophie’s Choice – Jan 10
  7. Bigger vs Better – Jan 17
Additional articles will be released weekly; I hope you will read, share, and comment on them. Tell me what you think, and how you might apply some of these concepts and solutions to your business.

The first article discusses The Black Box, the problem of information obscurity. Read Analysis Paralysis - The Black Box

Friday, June 09, 2017

Gotcha #7 - AppDynamics and the Mysterious Case of the Vanishing Cookie

There isn't much of a mystery in today's article (the answer is alluded to in the title), but the process of identifying the problem, discovering why it was a problem, and then fixing it was quite a challenge.

Today's Gotcha is about Rational Performance Tester (v8.5.1.2 for anyone interested) and a vexing problem with successful test execution.

The problem originally arose when we discovered that our tests weren't always succeeding. After executing a schedule we would consistently have a large number of pages without 100% success status codes (visible on the Server Health Detail view). A few pages would show a 100% success rate, but nearly every page hovered between 94 and 98%. In some larger tests this would drop further, into the 80s or high 70s, but even a simple low-volume test would have most pages at less than 100%.

To get a better handle on why and when these page errors were happening, I added a new counter to the default Page Throughput / Page Hit Rate graph (Add/Remove Performance Counters --> Page Performance Counter... --> Status Code Successes --> Percent Page Status Code Success [for Interval]).

Suddenly we could see a wavering line appear on the graph: it would start off at 100%, dip sharply at the beginning of the test, then climb back toward 100% and fluctuate in the high 90% range for the rest of the run.

Baffling, because the very same test scripts had run against the same application at 100% in the past. Doubly baffling, because running a script standalone worked perfectly fine.

Following the Clues

The first clue came when we decided to rerun one of our test schedules and monitor the Test Execution of a user thread. Since the failures were reasonably consistent across all pages, chances were good that we could catch one in the act.

After a couple of hours of watching successful page responses, we caught one. Right from the beginning of a new test script - a page failure. And then another, and another, and so on until every single page in that script failed in order. And the next script. And the next... but then the fourth script, for no apparent reason, began returning successful responses again.

Digging into the error, the only message we could find was "Failed substitution ^csrfToken=.*$", followed by a cascade of 500 - Server Error responses. The csrfToken appears in the HTTP response header as a "Set-Cookie" value from the server (you can find an explanation of its purpose here). And for some reason, it wasn't being found.

Except... inspecting the response of that very first page, the csrfToken was right where it should be. Further digging into the application logs showed the same thing: on occasion our test would send an invalid csrfToken - the recorded token - which happens when the substitution can't be made.

But it was right there in the Response Header - right where we expected it to be. Except, not quite...

Smoking Gun

It looked like it was right where it was supposed to be - in among a half-dozen other cookies being set by the server. But a careful comparison with the original recording showed that it was one line down. A new Set-Cookie value had been inserted into the list just ahead of the csrfToken.

Inspecting and comparing successful test executions with failed ones showed that this new mysterious cookie only appeared in the failed tests, and wasn't to be found in the successful ones.

The server was randomly inserting a new cookie in its response... some of the time. And by shifting the order of the Set-Cookie headers, that was enough to make RPT fail when trying to identify the csrfToken.

The culprit was a cookie called ADRUM_BTs, and it looked like this:
Set-Cookie: ADRUM_BTs="R:0|s:f"; Version=1; Max-Age=30; Expires=...date...; Path=/; Secure

There was no consistency in when or where it would appear. It would show up for a script execution or three, then disappear once more.
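As an illustration only - this is a hedged sketch, not RPT's actual substitution internals, and the header values below are made up - the following Java snippet shows why pulling a token out of a fixed position in the Set-Cookie list breaks the moment a new cookie is injected ahead of it, while matching by cookie name keeps working:

import java.util.List;
import java.util.Optional;

public class CookieExtraction {

    // Fragile: assumes csrfToken is always the Nth Set-Cookie header.
    static String extractByPosition(List<String> setCookieHeaders, int index) {
        return setCookieHeaders.get(index);
    }

    // Robust: find the header by cookie name, wherever it appears in the list.
    static Optional<String> extractByName(List<String> setCookieHeaders, String name) {
        return setCookieHeaders.stream()
                .filter(h -> h.startsWith(name + "="))
                .findFirst();
    }

    public static void main(String[] args) {
        // Hypothetical recorded response vs. playback response with ADRUM_BTs injected first.
        List<String> recorded = List.of(
                "csrfToken=abc123; Path=/; Secure",
                "JSESSIONID=xyz; Path=/");
        List<String> playback = List.of(
                "ADRUM_BTs=\"R:0|s:f\"; Version=1; Max-Age=30; Path=/; Secure",
                "csrfToken=def456; Path=/; Secure",
                "JSESSIONID=xyz; Path=/");

        System.out.println(extractByPosition(recorded, 0)); // csrfToken=... as expected
        System.out.println(extractByPosition(playback, 0)); // ADRUM_BTs=... wrong cookie
        System.out.println(extractByName(playback, "csrfToken").orElse("not found")); // still correct
    }
}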

Solution

It turns out that we weren't the only ones to run into the Mysterious Case of the Vanishing Cookie - other people had identified ADRUM_BTs as playing havoc with their scripting efforts.

The cookie in question belongs to the AppDynamics End User Monitoring JavaScript package, a tool that can be used to monitor user behaviour and web application performance from the user's perspective. It was a new feature in the AppDynamics package that had recently been turned on in our Performance Testing environment to improve our measurement capabilities.

In the end, there was no way to remove this new cookie or adjust the system to distribute it consistently. To get our Performance Testing back up and running we had to remove all End User Monitoring components - and suddenly our vanishing cookie was gone for the last time.

Saturday, March 25, 2017

Go Fish Fitness - Getting Kids Moving

My wife and I are parents of 4 (ages 8, 5, 2, and 6 months) and she also runs a home daycare. On any given day we spend time with 10 different kids in our home (though not all at the same time!). We also live in Canada, in a city that has famously been compared unfavourably to the surface of Mars. As a result, we often deal with 10 different kids who have been cooped up inside for long stretches of time due to cold, snow, ice, wind, rain, and general all-around outdoor unpleasantness.

If you’ve been there, you probably know that a group of children stuck inside = energy. A lot of energy. Sometimes destructive energy. They get on each other’s nerves, they fight, they rampage, and they often find it difficult to concentrate. They need to get out and run around, but after five minutes of freezing fingers and faces they want to come right back in.

So we decided to try something. We started making games for the kids, using rules that they were familiar with as a base (Go Fish – for example) and added in an exercise component. We took a normal deck of playing cards and assigned short exercises to each number, and every time a match was made they had to do the associated exercise.
And then, to appeal to their competitive spirit, we modified the rules so that when a match was made everyone ELSE had to do the exercise. Suddenly we had a group of laughing, playing, goofy, jumping-around kids who were reveling in the game and making their friends and siblings do the work - and having just as much fun when someone else made a match.

After 15 minutes of playing, the kids went from destructive to constructive. They still had plenty of energy - a short bout of concentrated effort certainly isn’t enough to tire them - but it was enough to take the edge off the irritable frustration that had set in from being stuck inside. They went from angry to happy, ready to play, craft, or sit and colour. And as a parent I can tell you that the only blessing greater than naptime is a horde of children happily engaged in their own independent activity.

If you have kids, you can do this too very easily. It’s as simple as a spare deck of cards and a sharpie – write an exercise on each card (the same numbers have the same exercises) and play Go Fish. When a match is made, everyone else must do the exercise a number of times equal to the number on the card. (For example, 10 Jumping Jacks, or 4 Sit-ups).

In the interests of making this game accessible and more interesting for kids, we’ve been working with an artist (Fredrik Skarstedt) to create beautiful, colourful cards, and we are currently running a Kickstarter campaign to fund a print run. If this is something of interest to you, come check it out at www.tricorngames.com/kickstarter - the campaign ends April 6th, 2017.

Tuesday, January 10, 2017

Gotcha #6 - Sometimes Naive Caching is Worse than None

There is a well-known proverb about good intentions and the road to hell, but an alternate version is phrased as "Hell is full of good meanings, but Heaven is full of good works". Unfortunately, even good works can lead down the wrong path when they are undertaken with insufficient consideration for their impact.

This post is about one such work: the implementation of a caching strategy on a large platform that is widely used across the industry. The work itself is good, smart, and done with significant forethought. However, under certain conditions, the choices made ended up creating a system worse than if nothing had been done at all. This post explores those choices, the circumstances, and the effect that implementing an intelligent - but naive - caching system had on one particular project, and why the result ended up being worse than if it had never been built.

The names of all organizations, people, and software involved have been withheld.

6 Days to Go-Live

A war room had been convened. The production deployment of a new Public Access System (PAS) had gone smoothly over the weekend and everything was prepared to turn the switch and make it accessible in less than a week. The public information campaign had already been underway for some time, advertising the date, and getting the word out that this new offering was forthcoming. However, validation of the deployment had hit a snag - one member of the QA team had noticed something wrong.


At a critical step while making a new purchase on the PAS, the system would appear to pause. After a long delay (45s) it would return with a technical error. Repeating this process several times in a row would produce the same error across multiple sessions, across multiple machines. And then it would go away and everything would work smoothly again.

Until it happened again about an hour later.

And again an hour after that.

And then it went away. Despite combinations of testing - several people trying it in a variety of ways, using precisely the same data or different data - it couldn't be reproduced.

Until the next morning, when the problem showed up anew and again the following hour, before disappearing once more.

So began an urgent search to understand the problem: testing the system, combing the logs, and reviewing all the code-tracing and analytic tools at our disposal to isolate and resolve it - in production.

Underlying this effort was the question: why wasn't this problem seen in any other environment? Through the myriad Dev and QA environments, through multiple test cycles and performance testing, this pattern of problems had never been encountered before. At times a similar issue would be seen in isolated circumstances immediately following the restart of an environment, but once it was initialized everything was good.

So why was it happening in production, and nowhere else?

Architecture

The PAS is a clustered three-tier system using RESTful services to communicate between the web server and the Back-end Application (BEA). The BEA contains all the business code required to verify, validate, and persist purchase information.



The PAS cluster contains 2 nodes, and the BEA cluster contains 8, separated into two groups of 4 - one group dedicated to the PAS and the other to direct Intranet access. Because the interface is truly RESTful, no session information is tracked between requests to the BEA cluster. The load balancer provides a true round-robin.

When the system is active, most of the processing load takes place on the BEA cluster; the PAS operates as little more than a view-rendering layer. When a page request comes in to the PAS, it opens a connection to one of the BEA nodes, passes along the request, and waits for a response. To guard against a BEA node taking an abnormally long time, or a failure where no response is generated at all, the connection has a 45s timeout. If a response takes longer than 45s, the PAS closes the connection and displays a technical error message encouraging the customer to call their local company representative to complete the purchase.
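As a rough sketch of that call pattern - the endpoint, class names, and error handling here are hypothetical, not the project's actual code - the PAS side behaves something like this:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

public class BeaClient {

    private final HttpClient client = HttpClient.newHttpClient();

    String callBea(String url) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(45))  // the PAS gives up after 45s
                .GET()
                .build();
        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.body();
        } catch (HttpTimeoutException e) {
            // The BEA node took longer than 45s: surface the technical error page,
            // which asks the customer to call their local company representative.
            return "TECHNICAL_ERROR";
        } catch (Exception e) {
            return "TECHNICAL_ERROR";         // no response at all is handled the same way
        }
    }
}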

The Trail of Evidence
When the problem was first raised, there was a great deal of speculation as to the possible cause. Database issues, connection issues, misconfiguration, load balancer issues - all possible sources of this technical error, but all things that would need to be verified and eliminated.

There were a number of observations made about the system that were unusual or worth investigating - including hung database threads and the presence of a query in the AWR (query performance statistics) report that was known to be problematic in other environments. But in hindsight there was only one that would prove to be important to the actual cause: the problem only happened in the morning when full-time employees arrived for work and began using the BEA system directly via the Intranet.

Now the Intranet BEA nodes are physically separate from the ones used by the PAS, but they do share a database. The problem only occurred during the morning hours when load on the BEA nodes and the database were at their highest. Not high enough to cause a bottleneck or result in performance problems - but not insignificant either. More on this later.

The most important piece of information came from our enterprise performance monitoring tool that was installed on all production servers. By drilling down, identifying, and isolating a single request that failed we could examine the performance of the BEA code for that single request.

And what we discovered was a function call that took 24s: it retrieved a large but critical piece of reference data from the database and stored it in the application cache. By itself, not enough to cause a timeout and result in a technical error - but that reference data was then fed into a large and complex set of calculations that took roughly another 20s to complete, and in this particular instance the combination was just barely slow enough to cross the 45s timeout threshold and cause the technical error to be displayed.

Now this isn't unusual - the median time for this particular load-and-store operation was about 15-18s, and once it was complete the data would be cached and any future access would take a tenth of a second or less. But when the data did have to be retrieved from the database while the database was under significant load, the operation slowed just enough to trigger the problem.
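The access pattern at the heart of this is ordinary load-on-miss caching: nearly free when the entry is warm, but paying the full database cost (plus any contention on the database) whenever the entry has just been evicted. A simplified sketch with hypothetical names, not the vendor's code:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class ReferenceDataCache {

    private final Map<String, Object> cache = new ConcurrentHashMap<>();

    // Warm hit: effectively instant. Cold miss: the caller waits for the full
    // database load (roughly 15-24s in this system), and the ~20s calculation
    // that follows is what pushed a single unlucky request past the 45s timeout.
    Object getOrLoad(String key, Supplier<Object> loadFromDatabase) {
        return cache.computeIfAbsent(key, k -> loadFromDatabase.get());
    }
}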

But why was it happening at all? Why wasn't it happening elsewhere? And why wasn't it discovered during Performance Testing?

Unexpected Caching

The answer came in a conference call to the vendor. The cache on the BEA system operated as follows:

  1. Objects in the cache are marked as stale at intervals of 15 minutes if they have not been accessed during that timeframe.
  2. If an object was already marked as stale when the interval arrived, it would be evicted from the cache.
  3. Every 4th interval (once per hour) all objects in the cache would be evicted regardless of their state.

This cache interval was configurable to a maximum of 30 minutes, which would extend the global eviction from every hour to every 2 hours - but no further. This meant that every hour all reference data would be purged from the BEA system's memory and would need to be reloaded the next time it was accessed... on each node.
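For illustration, here is a minimal sketch of that eviction policy as described by the vendor - the class and field names are mine and the real implementation is proprietary, but the interval behaviour follows the three rules above:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class IntervalEvictingCache<K, V> {

    private static class Entry<V> {
        V value;
        volatile boolean accessedThisInterval = true;
        volatile boolean stale = false;
        Entry(V value) { this.value = value; }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private int intervalCount = 0;

    IntervalEvictingCache(long intervalMinutes) {
        scheduler.scheduleAtFixedRate(this::onInterval,
                intervalMinutes, intervalMinutes, TimeUnit.MINUTES);
    }

    V get(K key) {
        Entry<V> e = entries.get(key);
        if (e == null) return null;          // evicted: caller must reload from the database
        e.accessedThisInterval = true;       // an access keeps the entry alive (assumption)
        return e.value;
    }

    void put(K key, V value) {
        entries.put(key, new Entry<>(value));
    }

    private synchronized void onInterval() {
        intervalCount++;
        if (intervalCount % 4 == 0) {        // rule 3: every 4th interval, evict everything
            entries.clear();
            return;
        }
        // rule 2: anything already marked stale is evicted
        entries.values().removeIf(e -> e.stale);
        // rule 1: anything untouched during this interval is marked stale
        for (Entry<V> e : entries.values()) {
            if (!e.accessedThisInterval) e.stale = true;
            e.accessedThisInterval = false;
        }
    }
}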

The Core Problem

In order for the problem to occur, the following had to happen:

  1. The BEA's cache was empty on the nodes that were used by the PAS.
  2. The other Intranet BEA nodes had to be actively used by a large number of concurrent users to put sufficient load on the database to cause a retrieval delay for the large amount of reference data.
  3. The PAS had to be inactive, with only 1 or 2 users online performing a new purchase. Other operations would cause the reference data to be loaded but not perform the complex calculations that would cause the response delay to exceed 45s.

An unlikely scenario it seems, right?

Except...

  1. The BEA cache would be emptied on the hour, every hour.
  2. The Intranet BEA nodes were being used by employees for their normal job, and would be heavily used every weekday for several hours during the morning.
  3. The PAS was a limited-scope rollout; the only functions it performed were New Purchase and Continue a Saved Purchase.
  4. The PAS was so new that the initial expected load was only a handful of users per day, so the likelihood of having only 1-2 users performing a new purchase was very high.

These circumstances meant that it was very likely that any user of the system would run into the situation where all three conditions were met, causing their purchase attempt to fail.

This problem was further compounded by the round-robin load balancer and the fact that the reference data needed to be loaded on each node separately. This meant that not only was a user likely to encounter the technical error message, but if they retried their purchase, their next page request would be directed to a different BEA node whose cache was still empty, resulting in the same error message again.

The expected result was that a majority of potential customers would need as many as five attempts - one failure for each cold PAS-facing BEA node, then a success - before completing their purchase.
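A purely illustrative back-of-the-envelope simulation of that arithmetic (it assumes a four-node PAS-facing group, that each retry's critical request lands on the next node in the rotation, and that a failed request still warms the node it hit):

public class ColdCacheRetrySimulation {

    public static void main(String[] args) {
        int nodes = 4;                        // PAS-facing BEA nodes
        boolean[] warm = new boolean[nodes];  // all caches empty, e.g. just after an hourly flush

        int attempt = 0;
        while (true) {
            attempt++;
            int node = (attempt - 1) % nodes; // round-robin
            if (warm[node]) {
                System.out.println("Attempt " + attempt + ": node " + node
                        + " is warm -> purchase succeeds");
                break;
            }
            warm[node] = true;                // the failed request still loads the reference data
            System.out.println("Attempt " + attempt + ": node " + node
                    + " is cold -> 45s timeout, technical error");
        }
    }
}

Under those assumptions the fifth attempt is the first one to find warm data, which is why even a patient customer would see several failures in a row.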

Why Wasn't This Caught?

The simple answer is that the other environments were not busy enough. Dev, QA, Staging... none of these environments had enough database usage to push the delay past the 45s timeout threshold. In addition, these environments typically were not configured in the dual-cluster architecture of production, so any activity directly on a BEA node would cause the reference data to be loaded into cache. The chances of a PAS new purchase hitting a cold cache against a busy database were essentially non-existent.

The exception was the Performance Testing environment, which did have a fully configured production-like system and sufficient database traffic during a test to cause the problem. The reason it wasn't discovered there was that the PAS traffic was too high. Even a half-dozen simultaneous users was enough to cache the reference data across all the BEA nodes - and even if the error was observed once or twice near the beginning of a test, a few minor functional errors are not unexpected during a performance test, and the absence of any recurrence over the next hour meant they were simply dismissed as an anomaly.

As soon as the circumstances of the problem were understood, it became a simple matter to reproduce it intentionally in the Performance Testing environment by adding load to the Intranet BEA nodes and manually attempting a new purchase.

Why Did This Happen?

This problem was caused directly by the implementation of a naive caching framework on the BEA system. The purpose behind clearing the cache regularly was to ensure that reference data, if updated, could only be stale for a certain amount of time before being refreshed. The assumption was that the cache could be cleared every hour without causing a problem, which turned out not to be the case. The BEA system had no facility to preload reference data, no way to refresh it periodically except through eviction and reload on demand, and no ability to persist reference data in a long-term, no-expiry cache.

The reason this was more than just a normal problem is that by implementing a limited cache, the system successfully masked the existence of the problem from developers, QA, and performance testing. The naive caching strategy allowed it to work well under load most of the time, and to surface as a problem only in very specific circumstances.

Had there been no cache on the BEA system, this problem would have become abundantly clear in performance testing, showing up as terrible performance for that operation and an error rate approaching 100%. But by masking the issue, the cache allowed it to survive until it was discovered, entirely by accident, mere days before go-live.

The Solution

The permanent solution will be to modify the BEA system to use a more sensible, multi-purpose caching strategy with the ability to preload and periodically refresh reference data independently of user requests. In the interim, the problem has been worked around with an automated process that makes an appropriate service request to each BEA node every minute, triggering the reference data reload (if it has been evicted) so that the reload completes before a real user encounters it.
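A sketch of what such a warming process might look like - the node addresses and the endpoint path are hypothetical, and the real implementation simply needs to hit something on each node that forces the reference data to load:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CacheWarmer {

    private static final List<String> BEA_NODES = List.of(
            "http://bea-node1:8080", "http://bea-node2:8080",
            "http://bea-node3:8080", "http://bea-node4:8080");

    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        // Every minute, touch each node directly (bypassing the load balancer)
        // with a request that forces the reference data to load if it has been
        // evicted, so that a real customer never pays the reload cost.
        scheduler.scheduleAtFixedRate(() -> {
            for (String node : BEA_NODES) {
                HttpRequest request = HttpRequest.newBuilder(
                                URI.create(node + "/api/reference-data/ping"))
                        .timeout(Duration.ofSeconds(60))
                        .GET()
                        .build();
                try {
                    client.send(request, HttpResponse.BodyHandlers.discarding());
                } catch (Exception e) {
                    System.err.println("Warm-up request to " + node + " failed: " + e.getMessage());
                }
            }
        }, 0, 1, TimeUnit.MINUTES);
    }
}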