Tuesday, April 23, 2013
Hurdles #3 - Apache Pivot - Finding Request IP Address
Hurdles article number and still working with Apache Pivot, this time we are looking at the RESTful services provided by the Pivot libraries. The Pivot WebService libraries are quite handy, they provide a nice, clean wrapper to work with that abstracts out a lot of the complexity around serialization, HTTPServlet objects, and HTTPRequest objects.
Unfortunately there is one, nasty flaw with the v2.0.2 libraries that I have come across so far. And that is the HTTPRequest object has been so thoroughly abstracted away that you cannot access it, or much of the information included within it directly.
The provided QueryServlet implementation discards information such as: the remoteAddr and remoteHost values. So there is no way to retrieve the requesting IP address if you want to do simple things like validating the source request location.
The reason I discovered this issue in the first place is that I wanted to add a nice little authenticated cache component to my service. Nothing complex, I have already established a verified sessionkey, but in order to negate simple man-in-the-middle attacks that involve sessionkey stealing I wanted to add a secondary validation step that associated a sessionkey to a source IP address. Like I said, very simple (and certainly not foolproof), but my application doesn't involve national security. It's just a quick two-factor authentication routine is all I wanted.
Well, no such luck. In order to resolve this issue I had to resort to extending the org.apache.pivot.web.server.QueryServlet class with my own modifications to the service method and then extend that class with my actual Servlet implementations.
Fortunately GrepCode is kind enough to provide us with the source of the QueryServlet class to use as a starting point. So here is a QueryServletExt class that I created that parses the remoteAddr and remoteHost values into properties that can then be accessed via getRemoteHost and getRemoteAddr methods.
public abstract class QueryServletExt extends QueryServlet {
private transient ThreadLocal<String> remoteAddr = new ThreadLocal<String>();
private transient ThreadLocal<String> remoteHost = new ThreadLocal<String>();
@Override
@SuppressWarnings("unchecked")
protected void service(HttpServletRequest request, HttpServletResponse response)
throws IOException, ServletException {
// Get client's IP address
String ipAddress = request.getRemoteAddr();
remoteAddr.set(ipAddress);
// Get client's hostname
String hostname = request.getRemoteHost();
remoteHost.set(hostname);
super.service(request, response);
}
public String getRemoteAddr() {
return remoteAddr.get();
}
public String getRemoteHost() {
return remoteHost.get();
}
}
This piece of code doesn't bring all the attributes of the HTTPRequest into your QueryServlet, but at least it will provide a template for how to include whatever pieces you do need in your code.
Tuesday, April 02, 2013
Report Conceptualization Training Strategy – Part 2, Structured Report Planning
For beginning and advanced report developers alike, the
biggest challenge I have witnessed is the challenge of Report
Conceptualization. This is the process of translating often incomplete business
requirements into a structured plan that will produce the result that the
business needs. For operational reports based on known data structures and
calculations this process can be simple, often operational reports are
structured very simply and the job of the report developer is to simply put the
right fields in the right place. The process becomes much more difficult for
strategic reports which attempt to help the business define what they should be doing. Strategic reports are
often poorly understood by the business, their needs are uncertain and
difficult to communicate, and vision for the end product is amorphous. Taking
vague, conceptual requirements into a concrete view of business data is a
challenge for developers and analysts alike.
Recently I have been mentoring two Cognos Report Developers
in their process. These developers are experienced at building operations-level
reports in Cognos and have a good understanding of the data and the business,
but they are challenged at providing insightful analysis to the business when
the business users themselves only have vague ideas of what they want to see,
or in some cases have too many ideas that all jumble together.
The first step in this conceptualization process was to
focus on a particular element, a decision, or a comparison that the business
needs to make and determine an appropriate way to visualize the data that is
needed. This was covered in Part
1, Report Visualization.
Once a vision has been established, the second step is to
plan the underlying structure of the report so that the end result is a simple
data structure that can be directly mapped to the visualization. This is the process I use for breaking down a difficult report into simple steps, and have trained numerous developers who use it successfully.
The Process
We will start with
the user story:
“As a Product Manager I need to
know year-over-year sales trends of my products to determine which products I
should assign to my premium display space for each retailer.”
Based on the story we have done
further analysis and come up with a detailed business requirement. The business
user needs a line graph that shows monthly sales of the product lines and
products they select for the current year and the previous year. The graph also
needs to show the average sales of the selected products. The graph should only
include sales from selected retailers.
A mock-up of the chart is given below.
Identify Your Measures
Measures are all numerical fields that can be aggregated or
calculated and that are connected to dimensions. These are generally (in
Cognos) located in the Fact table, but for the purposes of this process also
includes deduced values such as counts or distinct counts of measures. Ie
count(distinct product_id) is a measure that can be connected to other
dimensions through an appropriate fact table, this produces the deduced value
“Count of products for each value of dimension X where sales exist.”
In our above example our only measure is “Sales $”.
Identify Your
Summarizations/Calculations
List all unique summarizations of measures alongside your
measure list. These will be treated as separate measures that may or may not be
able to be grouped with the list of identified measures so far. Remember that
your measures will already be summarized according to their default (usually
total) aggregation, you need to identify any alternate summarization such as
“average”, “max”, “min”, “count” etc in this list.
Calculations are any formula that combines measures (or
applies to a single identified measure). Add these alongside your measure list.
In this example we also have “Average $” as a
summarization of all products.
Identify Your
Aggregations/Groups
Aggregations and groups are any dimension, attribute, column, or field that the defined measures are grouped by (sums, counts,
averages, etc). These groups I am calling descriptors, because the column or
attribute describes an aggregation, and any particular value (ie. “Product A”) describes the values associated with it.
In our example the aggregations are: “Product”, “Calendar
Month”, “Calendar Year”.
Identify Your
Filters/Restrictions
Filters and restrictions are any condition applied to the
data that reduces the data set. These are usually another set of descriptors
(dimensions, attributes, columns, or fields) that are limited or selected in
some fashion, and usually overlap in some fashion with the already defined
aggregations/groups. There may be additional filters/restrictions that are not
represented in the list of descriptors so far, and there may be value filters
applied that are based on aggregations of the defined measures.
In our example the filters are: “Product Line – by selection”,
“Product – by selection”, “Calendar Year = Current Year”, “Calendar Year = Previous Year”, "Retailer - by selection".
Build a Mapping Table
Create a table and label each column with your Measures, Summarizations, and Calcuations and label each row with your Descriptors, Filters, and Restrictions. Place a mark in each cell where the Descriptor will be used to group or filter the Measure
Our example should produce a table like this:
Note that Average $ is not grouped by Product but is impacted by the Filter - Product, this is because the Average $ calculation will be performed across all products in the filtered list.
Choose Your Pivot Descriptor
Create a table and label each column with your Measures, Summarizations, and Calcuations and label each row with your Descriptors, Filters, and Restrictions. Place a mark in each cell where the Descriptor will be used to group or filter the Measure
Our example should produce a table like this:
Sales $ | Average $ | |
Product |
X
| |
Calendar Month |
X
|
X
|
Calendar Year |
X
|
X
|
Filter - Product Line |
X
|
X
|
Filter - Product |
X
|
X
|
Filter - Calendar Year = Current Year |
X
|
X
|
Filter - Calendar Year = Previous Year |
X
|
X
|
Filter - Retailer |
X
|
X
|
Note that Average $ is not grouped by Product but is impacted by the Filter - Product, this is because the Average $ calculation will be performed across all products in the filtered list.
Choose Your Pivot Descriptor
The Pivot is a critical component in the analysis, it is the
descriptor that is at the core of the visualization. Generally you can find it
at the centre of the business requirement, it is the thing that is being reported on, and that decisions are being made
about. The purpose it serves is that all other pieces of data in the analysis
are connected to it and revolve around it.
This pivot will generally be used as the join key in any inner/outer joins that will be created in this report, it may not be the only item used in a join, but it will be the common thread that is used in (nearly) every join.
This pivot will generally be used as the join key in any inner/outer joins that will be created in this report, it may not be the only item used in a join, but it will be the common thread that is used in (nearly) every join.
On simpler reports it may be unclear which descriptor is
actually serving as pivot, and this discussion often centers around the
calendar, whether the date is the
pivot object, or whether something else (like product or location) is
it. The easiest way to decide is simply asking why are you using the report?
What are you reporting on? Are you reporting on Sales by Month (but shown by
product)? Or are you really concerned about Product Sales (but shown by
month/year)?.
Another quick way to sometimes decide which descriptor is
acting as pivot (not always) is whether or not it is being shown on the report
in two or more different ways (or distinct sets)? If it is, then it’s probably
not your pivot. You can also look at filters, which descriptors are being
filtered by the business user’s choice? These are good candidates for the
Pivot.
In the example above your Pivot is probably “Product”. The
other option would be “Month”, but there are a couple of reasons why this is
probably not the case. 1) Product is part of a user-driven filter, the business
user is choosing which product lines/products to see and is thus reporting on
the sales of particular products, not sales of particular months. 2) Calendar
is being included in two distinct sets, Previous Year and Current Year, this is
not a hard-and-fast rule but it can imply that this is not the central piece of
the view but it is reinforces reason #1.
3) Although Average $ is not being grouped by the Product, it is performing a summary of Products and in the defined report the Average is being displayed in the same manner as a Product.
3) Although Average $ is not being grouped by the Product, it is performing a summary of Products and in the defined report the Average is being displayed in the same manner as a Product.
Group Descriptors into
Hierarchies
Identify descriptors that are related to one another in a
natural or structured hierarchy, these can be easily collected as “summary” and
“detail” components of a single query.
In the example the natural hierarchies are “Month, Year, Filter - Current Year”, and “Month, Year, Filter - Previous Year”. Since month is
included in two separate hierarchies it is the natural joining point to connect
data from the “Current Year” hierarchy and the “Previous Year” hierarchy.
If we wanted to also display the Product Line on our report we would add it as a Descriptor and it would form a third hierarchy "Product Line, Product" but for simplicity we will leave it out.
If we wanted to also display the Product Line on our report we would add it as a Descriptor and it would form a third hierarchy "Product Line, Product" but for simplicity we will leave it out.
Identify Independent or Disjoint Sets
Identify filter or restriction combinations that are independent of one another. These are filter combinations that are mutually exclusive or partially overlapping, and that if both were included in the same query would produce no results or fewer rows than intended.
Our example has the mutually exclusive "Filter - Current Year" and "Filter - Previous Year". If both filters were included in the same query, this would result in no data because a row cannot be both in the Current Year AND the Previous Year.
You may note that in this situation both could be combined into the same query using an OR condition, and due to the simplicity of this example that is true. This process is not intended to produce the most optimal or shortest path to a result, but is intended to produce a consistent, repeatable, and understandable path to a result. The counter-argument to using an OR clause is that if we decided to replace "Current Year" and "Previous Year" with arbitrary (potentially overlapping) date ranges, then using an OR clause would no longer be possible and at least 2 new queries would need to be created to perform the enhancement.
Our example has the mutually exclusive "Filter - Current Year" and "Filter - Previous Year". If both filters were included in the same query, this would result in no data because a row cannot be both in the Current Year AND the Previous Year.
You may note that in this situation both could be combined into the same query using an OR condition, and due to the simplicity of this example that is true. This process is not intended to produce the most optimal or shortest path to a result, but is intended to produce a consistent, repeatable, and understandable path to a result. The counter-argument to using an OR clause is that if we decided to replace "Current Year" and "Previous Year" with arbitrary (potentially overlapping) date ranges, then using an OR clause would no longer be possible and at least 2 new queries would need to be created to perform the enhancement.
Group
Measures/Calculations into Queries based on Descriptor Map
First lets review our updated mapping table.
Sales $ | Average $ | |
Product |
X
| |
1Calendar Month |
X
|
X
|
- Calendar Year |
X
|
X
|
- Filter - Calendar Year = Current Year |
X
|
X
|
2Calendar Month |
X
|
X
|
- Calendar Year |
X
|
X
|
- Filter - Calendar Year = Previous Year |
X
|
X
|
Filter - Product Line |
X
|
X
|
Filter - Product |
X
|
X
|
Filter - Retailer |
X
|
X
|
1 & 2 - Independent sets
First rule of grouping into queries: If 2 measures have exactly the same mapping (and in Cognos they must come from the same Fact table) then they can be combined into the same queries.
Second rule: Split independent sets into different queries, and add all unrelated descriptors into both queries.
Third rule: Identify each combination of measures (with different mappings) and split descriptors as a separate query.
In our example we will produce 4 queries:
Combine Queries by Joining on Pivot
First rule of grouping into queries: If 2 measures have exactly the same mapping (and in Cognos they must come from the same Fact table) then they can be combined into the same queries.
Second rule: Split independent sets into different queries, and add all unrelated descriptors into both queries.
Third rule: Identify each combination of measures (with different mappings) and split descriptors as a separate query.
In our example we will produce 4 queries:
- Sales $ [Product, Month/Year/Filter - Current Year, Filter - Product Line, Filter - Product, Filter - Retailer]
- Sales $ [Product, Month/Year/Filter - Previous Year, Filter - Product Line, Filter - Product, Filter - Retailer]
- Average $ [Month/Year/Filter - Current Year, Filter - Product Line, Filter - Product, Filter - Retailer]
- Average $ [Month/Year/Filter - Previous Year, Filter - Product Line, Filter - Product, Filter - Retailer]
Combine Queries by Joining on Pivot
With each query defined, the only step left is to join them together in a fashion that produces our report. There are a couple of factors that go into deciding how to join queries together. A key point to remember is that there isn't a wrong way to join them, the resulting report will still be technically valid, but the data may not be represented (or show) the values that you are intending to show. Because we have each of the queries we need the raw data is available to produce our intended result.
Guidelines:
Our example is quite simple in terms of joins, we are only going to use Union joins by creating a new column in our Average $ queries to replace the missing Product column and using it to label the rows as "AVG" and "AVG-PY".
If we were going to year-over-year differentials between products and averages we would have to change our joins as follows:
Guidelines:
- Start with the most-inclusive and most-detailed query and add to it. Additional filters can be added later to reduce the set of results being displayed. Start with the largest set of data and summarize/reduce the working set as you add more information.
- Joins (outer/inner) are used to directly compare measures using a calculation (such as percent change or difference). They are not used to indirectly compare measures (such as grouping and plotting together on a graph). Unions are used to make indirect comparisons.
- Outer joins are used to add additional reference information to your pivot (ie. adding cost of goods info to products that happen to have COG values).
- Inner joins are used to intersect and add conditional information to your pivot (ie. only showing products that have both Sales and COG values).
Our example is quite simple in terms of joins, we are only going to use Union joins by creating a new column in our Average $ queries to replace the missing Product column and using it to label the rows as "AVG" and "AVG-PY".
If we were going to year-over-year differentials between products and averages we would have to change our joins as follows:
- JOIN Sales $ (Previous Year) to Sales $ (Current Year) using an outer join on Product and Month
- JOIN Average $ (Previous Year) to Average $ (Current Year) using an outer join on Month
- UNION Sales $ (Year over year) to Average $ (Year over year) as normal
This would allow us to calculate the change in Sales $ from Previous year to Current year, and calculate the change in Average $ from Previous year to Current year and display both values on our chart.
Conclusion
This is an outline of my process when building complex reports in Cognos. As described above there are a few gaps in it, and I have only touched on some of the finer points of report development (joins, multiple pivots, and aggregations for example) but I have used it effectively, and through a series of workshops have taught it to both novice and experienced Cognos report developers so that they could apply it on their own and produce solid, maintainable reports that are currently being using in production within their own organizations.
I hope this can provide some small help to Cognos developers out there, and if you have any questions or need me to clarify or expand on any of my points in this article please feel free to contact me.
Subscribe to:
Posts (Atom)