Tuesday, April 23, 2013

Hurdles #3 - Apache Pivot - Finding Request IP Address


Hurdles article number and still working with Apache Pivot, this time we are looking at the RESTful services provided by the Pivot libraries. The Pivot WebService libraries are quite handy, they provide a nice, clean wrapper to work with that abstracts out a lot of the complexity around serialization, HTTPServlet objects, and HTTPRequest objects.

Unfortunately there is one, nasty flaw with the v2.0.2 libraries that I have come across so far. And that is the HTTPRequest object has been so thoroughly abstracted away that you cannot access it, or much of the information included within it directly.

The provided QueryServlet implementation discards information such as: the remoteAddr and remoteHost values. So there is no way to retrieve the requesting IP address if you want to do simple things like validating the source request location.

The reason I discovered this issue in the first place is that I wanted to add a nice little authenticated cache component to my service. Nothing complex, I have already established a verified sessionkey, but in order to negate simple man-in-the-middle attacks that involve sessionkey stealing I wanted to add a secondary validation step that associated a sessionkey to a source IP address. Like I said, very simple (and certainly not foolproof), but my application doesn't involve national security. It's just a quick two-factor authentication routine is all I wanted.

Well, no such luck. In order to resolve this issue I had to resort to extending the org.apache.pivot.web.server.QueryServlet class with my own modifications to the service method and then extend that class with my actual Servlet implementations.

Fortunately GrepCode is kind enough to provide us with the source of the QueryServlet class to use as a starting point. So here is a QueryServletExt class that I created that parses the remoteAddr and remoteHost values into properties that can then be accessed via getRemoteHost and getRemoteAddr methods.

public abstract class QueryServletExt extends QueryServlet {
private transient ThreadLocal<String> remoteAddr = new ThreadLocal<String>();
private transient ThreadLocal<String> remoteHost = new ThreadLocal<String>();

    @Override
   @SuppressWarnings("unchecked")
   protected void service(HttpServletRequest request, HttpServletResponse response)
       throws IOException, ServletException {
       // Get client's IP address
       String ipAddress = request.getRemoteAddr();
       remoteAddr.set(ipAddress);
       // Get client's hostname
       String hostname = request.getRemoteHost();
       remoteHost.set(hostname);
       super.service(request,  response);
   }

public String getRemoteAddr() {
return remoteAddr.get();
}

public String getRemoteHost() {
return remoteHost.get();
}
}

This piece of code doesn't bring all the attributes of the HTTPRequest into your QueryServlet, but at least it will provide a template for how to include whatever pieces you do need in your code.

Tuesday, April 02, 2013

Report Conceptualization Training Strategy – Part 2, Structured Report Planning


For beginning and advanced report developers alike, the biggest challenge I have witnessed is the challenge of Report Conceptualization. This is the process of translating often incomplete business requirements into a structured plan that will produce the result that the business needs. For operational reports based on known data structures and calculations this process can be simple, often operational reports are structured very simply and the job of the report developer is to simply put the right fields in the right place. The process becomes much more difficult for strategic reports which attempt to help the business define what they should be doing. Strategic reports are often poorly understood by the business, their needs are uncertain and difficult to communicate, and vision for the end product is amorphous. Taking vague, conceptual requirements into a concrete view of business data is a challenge for developers and analysts alike.

Recently I have been mentoring two Cognos Report Developers in their process. These developers are experienced at building operations-level reports in Cognos and have a good understanding of the data and the business, but they are challenged at providing insightful analysis to the business when the business users themselves only have vague ideas of what they want to see, or in some cases have too many ideas that all jumble together.

The first step in this conceptualization process was to focus on a particular element, a decision, or a comparison that the business needs to make and determine an appropriate way to visualize the data that is needed. This was covered in Part 1, Report Visualization.
Once a vision has been established, the second step is to plan the underlying structure of the report so that the end result is a simple data structure that can be directly mapped to the visualization. This is the process I use for breaking down a difficult report into simple steps, and have trained numerous developers who use it successfully.

The Process
We will start with the user story:
“As a Product Manager I need to know year-over-year sales trends of my products to determine which products I should assign to my premium display space for each retailer.”

Based on the story we have done further analysis and come up with a detailed business requirement. The business user needs a line graph that shows monthly sales of the product lines and products they select for the current year and the previous year. The graph also needs to show the average sales of the selected products. The graph should only include sales from selected retailers.

A mock-up of the chart is given below.

Identify Your Measures
Measures are all numerical fields that can be aggregated or calculated and that are connected to dimensions. These are generally (in Cognos) located in the Fact table, but for the purposes of this process also includes deduced values such as counts or distinct counts of measures. Ie count(distinct product_id) is a measure that can be connected to other dimensions through an appropriate fact table, this produces the deduced value “Count of products for each value of dimension X where sales exist.”

In our above example our only measure is “Sales $”.

Identify Your Summarizations/Calculations
List all unique summarizations of measures alongside your measure list. These will be treated as separate measures that may or may not be able to be grouped with the list of identified measures so far. Remember that your measures will already be summarized according to their default (usually total) aggregation, you need to identify any alternate summarization such as “average”, “max”, “min”, “count” etc in this list.

Calculations are any formula that combines measures (or applies to a single identified measure). Add these alongside your measure list.

In this example we also have “Average $” as a summarization of all products.

Identify Your Aggregations/Groups
Aggregations and groups are any dimension, attribute, column, or field that the defined measures are grouped by (sums, counts, averages, etc). These groups I am calling descriptors, because the column or attribute describes an aggregation, and any particular value (ie. “Product A”) describes the values associated with it.

In our example the aggregations are: “Product”, “Calendar Month”, “Calendar Year”.

Identify Your Filters/Restrictions
Filters and restrictions are any condition applied to the data that reduces the data set. These are usually another set of descriptors (dimensions, attributes, columns, or fields) that are limited or selected in some fashion, and usually overlap in some fashion with the already defined aggregations/groups. There may be additional filters/restrictions that are not represented in the list of descriptors so far, and there may be value filters applied that are based on aggregations of the defined measures.

In our example the filters are: “Product Line – by selection”, “Product – by selection”, “Calendar Year = Current Year”, “Calendar Year = Previous Year”, "Retailer - by selection".

Build a Mapping Table
Create a table and label each column with your Measures, Summarizations, and Calcuations and label each row with your Descriptors, Filters, and Restrictions. Place a mark in each cell where the Descriptor will be used to group or filter the Measure

Our example should produce a table like this:

Sales $Average $
Product
X

Calendar Month
X
X
Calendar Year
X
X
Filter - Product Line
X
X
Filter - Product
X
X
Filter - Calendar Year = Current Year
X
X
Filter - Calendar Year = Previous Year
X
X
Filter - Retailer
X
X

Note that Average $ is not grouped by Product but is impacted by the Filter - Product, this is because the Average $ calculation will be performed across all products in the filtered list.

Choose Your Pivot Descriptor
The Pivot is a critical component in the analysis, it is the descriptor that is at the core of the visualization. Generally you can find it at the centre of the business requirement, it is the thing that is being reported on, and that decisions are being made about. The purpose it serves is that all other pieces of data in the analysis are connected to it and revolve around it.

This pivot will generally be used as the join key in any inner/outer joins that will be created in this report, it may not be the only item used in a join, but it will be the common thread that is used in (nearly) every join.

On simpler reports it may be unclear which descriptor is actually serving as pivot, and this discussion often centers around the calendar, whether the date is the pivot object, or whether something else (like product or location) is it. The easiest way to decide is simply asking why are you using the report? What are you reporting on? Are you reporting on Sales by Month (but shown by product)? Or are you really concerned about Product Sales (but shown by month/year)?.

Another quick way to sometimes decide which descriptor is acting as pivot (not always) is whether or not it is being shown on the report in two or more different ways (or distinct sets)? If it is, then it’s probably not your pivot. You can also look at filters, which descriptors are being filtered by the business user’s choice? These are good candidates for the Pivot.

In the example above your Pivot is probably “Product”. The other option would be “Month”, but there are a couple of reasons why this is probably not the case. 1) Product is part of a user-driven filter, the business user is choosing which product lines/products to see and is thus reporting on the sales of particular products, not sales of particular months. 2) Calendar is being included in two distinct sets, Previous Year and Current Year, this is not a hard-and-fast rule but it can imply that this is not the central piece of the view but it is reinforces reason #1.
3) Although Average $ is not being grouped by the Product, it is performing a summary of Products and in the defined report the Average is being displayed in the same manner as a Product.

Group Descriptors into Hierarchies
Identify descriptors that are related to one another in a natural or structured hierarchy, these can be easily collected as “summary” and “detail” components of a single query.
In the example the natural hierarchies are “Month, Year, Filter - Current Year”, and “Month, Year, Filter - Previous Year”. Since month is included in two separate hierarchies it is the natural joining point to connect data from the “Current Year” hierarchy and the “Previous Year” hierarchy.

If we wanted to also display the Product Line on our report we would add it as a Descriptor and it would form a third hierarchy "Product Line, Product" but for simplicity we will leave it out.


Identify Independent or Disjoint Sets
Identify filter or restriction combinations that are independent of one another. These are filter combinations that are mutually exclusive or partially overlapping, and that if both were included in the same query would produce no results or fewer rows than intended.

Our example has the mutually exclusive "Filter - Current Year" and "Filter - Previous Year". If both filters were included in the same query, this would result in no data because a row cannot be both in the Current Year AND the Previous Year.

You may note that in this situation both could be combined into the same query using an OR condition, and due to the simplicity of this example that is true. This process is not intended to produce the most optimal or shortest path to a result, but is intended to produce a consistent, repeatable, and understandable path to a result. The counter-argument to using an OR clause is that if we decided to replace "Current Year" and "Previous Year" with arbitrary (potentially overlapping) date ranges, then using an OR clause would no longer be possible and at least 2 new queries would need to be created to perform the enhancement.

Group Measures/Calculations into Queries based on Descriptor Map
First lets review our updated mapping table.

Sales $Average $
Product
X

1Calendar Month
X
X
 -  Calendar Year
X
X
 -  Filter - Calendar Year = Current Year
X
X
2Calendar Month
X
X
 -  Calendar Year
X
X
 -  Filter - Calendar Year = Previous Year
X
X
Filter - Product Line
X
X
Filter - Product
X
X
Filter - Retailer
X
X
1 & 2 - Independent sets

First rule of grouping into queries: If 2 measures have exactly the same mapping (and in Cognos they must come from the same Fact table) then they can be combined into the same queries.

Second rule: Split independent sets into different queries, and add all unrelated descriptors into both queries.

Third rule: Identify each combination of measures (with different mappings) and split descriptors as a separate query.

In our example we will produce 4 queries:

  1. Sales $ [Product, Month/Year/Filter - Current Year, Filter - Product Line, Filter - Product, Filter - Retailer]
  2. Sales $ [Product, Month/Year/Filter - Previous Year, Filter - Product Line, Filter - Product, Filter - Retailer]
  3. Average $ [Month/Year/Filter - Current Year, Filter - Product Line, Filter - Product, Filter - Retailer]
  4. Average $ [Month/Year/Filter - Previous Year, Filter - Product Line, Filter - Product, Filter - Retailer]

Combine Queries by Joining on Pivot
With each query defined, the only step left is to join them together in a fashion that produces our report. There are a couple of factors that go into deciding how to join queries together. A key point to remember is that there isn't a wrong way to join them, the resulting report will still be technically valid, but the data may not be represented (or show) the values that you are intending to show. Because we have each of the queries we need the raw data is available to produce our intended result.

Guidelines:

  1. Start with the most-inclusive and most-detailed query and add to it. Additional filters can be added later to reduce the set of results being displayed. Start with the largest set of data and summarize/reduce the working set as you add more information.
  2. Joins (outer/inner) are used to directly compare measures using a calculation (such as percent change or difference). They are not used to indirectly compare measures (such as grouping and plotting together on a graph). Unions are used to make indirect comparisons.
  3. Outer joins are used to add additional reference information to your pivot (ie. adding cost of goods info to products that happen to have COG values).
  4. Inner joins are used to intersect and add conditional information to your pivot (ie. only showing products that have both Sales and COG values).

Our example is quite simple in terms of joins, we are only going to use Union joins by creating a new column in our Average $ queries to replace the missing Product column and using it to label the rows as "AVG" and "AVG-PY".

If we were going to year-over-year differentials between products and averages we would have to change our joins as follows:

  1. JOIN Sales $ (Previous Year) to Sales $ (Current Year) using an outer join on Product and Month
  2. JOIN Average $ (Previous Year) to Average $ (Current Year) using an outer join on Month
  3. UNION Sales $ (Year over year) to Average $ (Year over year) as normal
This would allow us to calculate the change in Sales $ from Previous year to Current year, and calculate the change in Average $ from Previous year to Current year and display both values on our chart.

Conclusion
This is an outline of my process when building complex reports in Cognos. As described above there are a few gaps in it, and I have only touched on some of the finer points of report development (joins, multiple pivots, and aggregations for example) but I have used it effectively, and through a series of workshops have taught it to both novice and experienced Cognos report developers so that they could apply it on their own and produce solid, maintainable reports that are currently being using in production within their own organizations.

I hope this can provide some small help to Cognos developers out there, and if you have any questions or need me to clarify or expand on any of my points in this article please feel free to contact me.