Simpler reporting: make your data richer

A lot of the time, I hear discussions about which reporting tool is the easiest to use for certain specialized tasks. Most of the time I just ignore these threads because, as a developer of ETL solutions, it’s not really my cup of tea.
However, it has to be said: the solution to complex reporting requirements is often to be found in the ETL.
When you find yourself struggling with complex reports that need to do any of the following:

  • compare different records
  • aggregate beyond simple sums and averages
  • report on non-existing records (report 0 sales, etc.)

Well, in those cases you need ETL.

Let’s take for example the case of reporting on non-existing sales: how can you report that there have been 0 sales for a certain product during a certain week? Well, you can create an aggregate table in your ETL that contains the following (a small sketch in Python follows the list below):

  • Dimensions
    • Time (end of week)
    • Product (lowest level)
  • Facts
    • Sum of Sales for the product during the past week
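
To make this concrete, here is a minimal sketch in Python of what such an aggregation step could compute. The product names, sales amounts and field names are made up for illustration; this is not an actual Kettle transformation, just the logic behind it.

    from datetime import date

    # Hypothetical inputs: a small product dimension and the raw sales rows for one week.
    products = ["shoes", "socks", "hats"]                        # lowest-level product dimension
    sales = [("shoes", 120.0), ("shoes", 80.0), ("hats", 35.0)]  # (product, amount) facts

    week_ending = date(2007, 3, 4)

    # Start every product at 0 so that "no sales" still produces a fact row.
    weekly_totals = {p: 0.0 for p in products}
    for product, amount in sales:
        weekly_totals[product] += amount

    # One aggregate fact row per product per week, including the zero-sales ones.
    fact_rows = [
        {"week_ending": week_ending, "product": p, "sales_week": total}
        for p, total in weekly_totals.items()
    ]

    for row in fact_rows:
        print(row)  # 'socks' shows up with sales_week 0.0 even though it had no sales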

You will agree with me that if you create such a simple fact table, the report based on it will also be simple. That’s not all: you get another benefit from this, because you create the possibility to extend the fact table with other interesting data when the need comes up. For example, in the same transformation that populates this aggregate fact table, you can add the following (for example):

  • Facts
    • Total Sales last month
    • Total Sales last quarter
    • Total Sales last half year
    • Total Sales Year-to-date
    • Total Sales ever
    • Difference in sales with last week

You see, the initial investment of building the ETL job is not thrown away over time: you build upon the earlier effort and grow with the needs.
Now you might think that these metrics cost a lot in terms of CPU and I/O to calculate, but of course that’s not true: you can simply look at the numbers for the previous week and do a simple addition.
[Screenshot: sample aggregation transformation]
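
To illustrate the “simple addition” remark above, here is a rough Python sketch of how this week’s running totals could be derived from last week’s fact row instead of re-scanning all the historical sales. The field names are hypothetical.

    # Roll last week's aggregate row forward using only additions and one subtraction.
    def roll_forward(previous_row, sales_this_week, new_year_started=False):
        """Build this week's aggregate fact row from last week's row."""
        ytd = sales_this_week if new_year_started else previous_row["sales_ytd"] + sales_this_week
        return {
            "sales_week": sales_this_week,
            "sales_ytd": ytd,
            "sales_ever": previous_row["sales_ever"] + sales_this_week,
            "diff_with_last_week": sales_this_week - previous_row["sales_week"],
        }

    last_week = {"sales_week": 200.0, "sales_ytd": 1500.0, "sales_ever": 98000.0}
    print(roll_forward(last_week, sales_this_week=0.0))
    # A week with 0 sales still rolls the totals forward and reports the drop of 200.0.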

The added advantage of doing this at night in the ETL jobs is that the reports that run on this aggregate are bound to be very fast, because no calculations need to take place after the fact.

Finally, let me say that by constructing your data warehouse like this (one fact table derived from another) you’re actually enriching your data, making it high-quality information… which brings me to these famous quotes, as what we’re doing here is exactly the opposite:

“Where is the wisdom? Lost in the knowledge.
Where is the knowledge? Lost in the information.”

T.S. Eliot

“Where is the information? Lost in the data.
Where is the data? Lost in the #@$%?!& database.”

Joe Celko

Until next time,
Matt

7 comments

  • Piterchu

    Hey Matt,

    what a nice and interesting blog entry!

    The only thing I can say is that it comes out right on time to help me solve my problem! (Or is it just because of my last couple of posts on the Pentaho forum?)

    Did you get upset by hearing (or should I say reading :-D) the same questions again and again?

    Anyway, thanks a lot!!!

    Regards,

    Peter

  • Hi Peter,

    No, honestly, I don’t get upset. OK, perhaps I used to, but now I find that I have more and more ways to point people in the right direction. Blogging, writing tech tips, documentation, examples: they all help.

    So yes, your post indeed triggered this blog entry, but only because I thought the answer could be useful for a larger audience :-)

    All the best,
    Matt

  • Paul O'Hagan

    Full disclosure – I work for a company that does “data integration.” I wouldn’t call it ETL, and I don’t believe we compete with Pentaho any more than we compete with Informatica. If that makes you nervous, contact me and I’ll provide you more details.

    Whilst I agree with everything you’ve written, I keep bumping into two key challenges:

    1 – Time allotments for executing batch routines are shrinking
    2 – Business needs for more “fresh” data are increasing

    You mention it possibly costing CPU and I/O to perform the calculations, and I think you’re right. In fact, I think that’s a _really_ important point. What I see happening is that the number and types of calculations are increasing the time required to complete the ETL processes far beyond what the business allows for. I also see production systems being much more heavily utilized: I spoke with a telco whose customer billing system (a 128-CPU HP Superdome) operated at 80% utilization during off-peak hours and 95%–98% during peak hours.

    So what’s my point? Well, don’t just look for efficiencies in building dimensions and fact tables for reporting. Look for efficiencies you can gain by making your extracts lightweight and smarter.

    Not a simple problem to solve, but probably one of the most important problems companies are faced with today. And believe me, I talk to a lot of companies struggling with these problems… :-)

    –P

  • Hi Paul,

    Thank you for your comments.

    While I agree with you in general, I have to disagree on the details.

    In my experience:
    – most of the time, the data warehouse is sitting idle most of the night;
    – users don’t like to wait for reports to finish;
    – it’s OK to put a 100% load on a dedicated data warehouse server, especially during the night. That’s what the thing was made for in the first place.

    Also, as I mentioned in my blog, it is possible to do serious optimisations here. Smart things to do include shifting calculations to the database as much as possible in certain situations, and chopping up loads into smaller pieces to get a “predictable” load on your system in other situations.

    I’m sure that if you’re in the real-time data warehousing business, you hear a lot of complaints like these. However, how many companies are running 24/7 businesses? I’m sure you’ll agree with me that most companies are more like “9-to-5” :-)

    Please do tell me/us about what you’re doing. I don’t get nervous all that quick these days ;-)

    All the best,

    Matt

  • Great post Matt!

    One thing I’d say is that, like any good artisan, the “intelligence” builder (i.e., the data warehouse architect/designer/developer) must apply the appropriate tools to the appropriate tasks.

    You’ve outlined BRILLIANTLY the starting place… When you’re asking the same question (“What are my sales?”) with about 1000 different qualifiers, it is absolutely the right thing to get the data into a well-designed fact table. In fact, you’ll know you have a well-designed fact table when you can generate all your summary reports from it (or from a summary or aggregate of it).

    The one thing I might suggest doing differently is some of the summaries. Building more ETL means more complexity: additional JavaScript code, calculations, etc. that have to be maintained. Some of the time-series stuff can be handled very elegantly by Mondrian. In other words, you can program Mondrian with your formulas, and many of those vs.-prior-period and running-total metrics are calculated without additional tables or ETL work. Mondrian can also pick up that aggregate if you so choose, so there’s nothing that says you can’t also build the table later on if performance is of paramount interest.

    NOW…..

    RE: Data freshness. That’s the brilliant real-time Kettle integration into the platform for reporting. One can use the same tool to do the historical, 100%-on-the-CPU data warehouse population, but then at report time match that with some up-to-date CRM data when building a report. Lots of possibilities here. Paul – check out the Pentaho Data Integration sample on this.

    Nick

  • Paul

    I think we’re in agreement, but I wasn’t clear about what I meant, which may make it seem like we’re not. What I hear people talk about isn’t the load on the data warehouse, it’s the impact of extracting the data from the source systems: impact in terms of CPU, I/O and time. One of the companies I talked to recently is building “regional” data warehouses for sales staff in London (England), Portland (OR) and Tokyo. In that case 9-to-5 is pretty much 24 hours a day. The problem they were facing is twofold:

    1 – What’s the impact of an extract?
    2 – How do we create dimensions that make use of regionalized fact tables?

    Another thing I keep bumping into is the same problem I’m sure y’all and many others have seen: the reports people want to see are often not at all what the warehouse has been architected to produce. In that situation I see people keeping the transactional information and doing roll-ups etc. to support generic and ad-hoc reports. Ad-hoc reports that “hit” the production systems are usually considered problematic because of the impact mentioned above.

    As usual, perspective is all about where you stand at any given time. :-)

    Also, I’m very new to the Pentaho platform, but what I’ve seen deeply impresses me. Despite my lack of knowledge I did recommend to a 2 billion dollar public company (healthcare vertical) that they take a look at it because it seemed like the right answer to some challenges they face.

    Believe me – I plan on learning more… :-)

    So where am I going with this? I don’t know. Your original posting made me think of it though…

    –P

  • Paul,

    Thanks for the referral; I know for a fact that a lot of companies are evaluating Pentaho at the moment, and it gives me great hope for the future.

    On the topic of the load that DWHs place on the operational systems: we have given Kettle the ability to generate date ranges based on the operational metadata. If you only want to grab the data since the last time a transformation ran without error, you can split up the load that the transformation places on the source system. (This only works if you have a time-stamp on the data in the source system, of course.)

    That’s what I meant by placing a predictable load on the system.
    So rather than grabbing the data for the complete day from the source system (say 24M records), you only get 15 minutes’ worth of data, running the transformation continuously throughout the day and processing 250K records at a time.
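
    As a rough illustration of the idea (the table name, column name and helper function below are made up; this is not the actual Kettle mechanism), generating such small date ranges could look something like this:

        from datetime import datetime, timedelta

        WINDOW = timedelta(minutes=15)

        def next_date_range(last_successful_end, now):
            """Return the next (start, end) extraction window, capped at 15 minutes."""
            start = last_successful_end
            end = min(start + WINDOW, now)
            return start, end

        start, end = next_date_range(datetime(2007, 3, 4, 10, 0), datetime(2007, 3, 4, 12, 37))
        query = ("SELECT * FROM sales WHERE last_updated >= '%s' AND last_updated < '%s'"
                 % (start, end))
        print(query)  # run this repeatedly throughout the day, ~250K rows at a time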

    You can find a “Weekly Kettle Tip” on auditing that describes the process of generating date ranges here.

    All the best,

    Matt