May 23rd 2011 04:06 pm

Dynamic de-normalization of attributes stored in key-value pair tables

Dear Kettlers,

A couple of years ago I wrote a post about key/value tables and how they can ruin the day of any honest person that wants to create BI solutions.  The obvious advice I gave back then was to not use those tables in the first place if you’re serious about a BI solution.  And if you have to, do some denormalization.

However, there are occasions where you need to query a source system and get some report going on them.  Let’s take a look at an example :

mysql> select * from person;
+----+-------+----------+
| id | name  | lastname |
+----+-------+----------+
|  1 | Lex   | Luthor   |
|  2 | Clark | Kent     |
|  3 | Lois  | Lane     |
+----+-------+----------+
3 rows in set (0.00 sec)

mysql> select * from person_attribute;
+----+-----------+---------------+------------+
| id | person_id | attr_key      | attr_value |
+----+-----------+---------------+------------+
|  1 |         1 | GENDER        | M          |
|  2 |         2 | GENDER        | M          |
|  3 |         3 | GENDER        | F          |
|  4 |         1 | NATURE        | EVIL       |
|  6 |         2 | NATURE        | KIND       |
|  7 |         3 | NATURE        | KIND       |
|  8 |         1 | PEOPLE_SAVED  | 0          |
|  9 |         2 | PEOPLE_SAVED  | 394239324  |
| 10 |         3 | PEOPLE_SAVED  | 263403     |
| 11 |         1 | PEOPLE_HARMED | 983439     |
| 12 |         2 | PEOPLE_HARMED | 0          |
| 13 |         3 | PEOPLE_HARMED | 29         |
+----+-----------+---------------+------------+
12 rows in set (0.00 sec)mysql> select * from attribute_description;
+----+---------------+-------------------------+-----------+
| id | attr_key      | attr_description        | attr_type |
+----+---------------+-------------------------+-----------+
|  1 | GENDER        | Gender (M/F)            | String    |
|  2 | NATURE        | Nature                  | String    |
|  3 | PEOPLE_SAVED  | Number of people saved  | Integer   |
|  4 | PEOPLE_HARMED | Number of people harmed | Integer   |
+----+---------------+-------------------------+-----------+

UPDATE: This example can be downloaded here.

It was Pentaho partner OpenBI that gave me the use-case but the data is obviously fictive. In the real-world case, the “person_attribute” table contains over one hundred million rows of data.

The challenge is that you want to have the following columns in your query: person ID, name, last name, gender, nature, number of saved people and number of people harmed.  However, it has to be implemented in such a way that if a new attribute is added to table “attribute_description” a new column will appear in the output.  It has to be 100% maintenance free.

Fortunately, we’ve done some interesting things recently with dynamic transformations to allow us to create the following solution:

This transformation reads the metadata from the “attribute_description” table and injects that into the following template:

The Inject step is instructed to specify the list of columns to denormalize and take the output from that specific step:

As you can see in the next screen shot image from the previewing of data, all defined attributes are de-normalized correctly with the appropriate data types:

Now we can add an attribute to see if our transformation takes it into account:

mysql> insert into attribute_description(attr_key, attr_description, attr_type) values(’NUM_FRIENDS’, ‘Number of friends’, ‘Integer’);
Query OK, 1 row affected (0.06 sec)

mysql> insert into person_attribute(person_id, attr_key, attr_value) values(1, ‘NUM_FRIENDS’, ‘5′);
Query OK, 1 row affected (0.00 sec)

mysql> insert into person_attribute(person_id, attr_key, attr_value) values(2, ‘NUM_FRIENDS’, ‘34′);
Query OK, 1 row affected (0.00 sec)

mysql> insert into person_attribute(person_id, attr_key, attr_value) values(3, ‘NUM_FRIENDS’, ‘12′);
Query OK, 1 row affected (0.00 sec)

After updating our tables like this we can now preview the exact same step.  As you can tell from the following preview screen shot the extra column is automatically being picked up:

Now obviously, from a maintenance perspective this helps a great deal when you are building a data warehouse.  However, it also allows us to start thinking about doing reporting and analyses on these kinds of dynamic data sources.  How can we explain to Pentaho Metadata that attributes should be picked up automatically?  How can we refresh Mondrian schema information on-the-fly when the new attribute is introduced?  While the answers to those questions are not yet 100% clear, I’m sure it’s going to be an interesting discussion.

Until next time,

Matt

P.S. Please note you’ll need a very recent build of PDI 4.2.0-M2 to be able to pass along data from a dynamic transformation.

10 Comments »

10 Responses to “Dynamic de-normalization of attributes stored in key-value pair tables”

  1. ObjectiveC on 23 May 2011 at 16:36 #

    Hello,

    Haven´t had the need for it yet, but my fingers are itching to experiment with it.

    Thanks for the update Matt !

  2. Sylvain on 23 May 2011 at 16:39 #

    Hi Matt,

    Thks for this post… and the very usefull “Inject Step” in PDI 4.2 too !!

    In fact, many (old and bad…) RDBMS physical models for business applications store “free attributes” (customized by end-users).
    And always with fixed columns in many business tables (customers, products), like this :
    Attribute_01 | Attribute_02 | Attribute_03 | …. | Attribute_10 |

    Last week, I had this specific request from a client and dynamic transformation with metadata injection is a very nice solution. In my case, it’s the opposite of your example, with a “Normalize step”, but it should work too ;-)

    All the best

    Sylvain

  3. Matt Casters on 23 May 2011 at 18:18 #

    That’s right Sylvan, the normalize step also allows injection of metadata in 4.2.0.

  4. Diethard on 24 May 2011 at 12:45 #

    Excellent article Matt! Thanks for sharing! I am currently working on a project which has similar requirements. Taking it a step further to automatically generate/amend a metadata model and an OLAP schema would be definitely very interesting.

  5. ken on 07 Oct 2011 at 19:15 #

    hi Matt,
    It’s an great article that will help solve my current situation. I used the same idea to create a transformation
    that contain the following steps:
    1. a UDJC that continues receiving messages through socket.
    2. a Split Fields Step to parse each message into row

    — beginning of a sub-transformation in a Single Threader (batch time= 20000ms)
    3. Mapping input Specification
    4. Modified Java Script Value Step to print out debug messages
    5. sort the batch of rows
    6. groupby Step to compute the average.
    7. Mapping output Specification
    — end of sub-transformation

    8. output to text file

    I ran it and notice the followings:
    1. the single Threader only kick off once, but not every 20secs. (step 3 only get printed once)
    2. once the sub-transformation is kicked off, I don’t see a batch of rows, only the first row is being passed.

    Any idea? and how do I debug this?

  6. Matt Casters on 07 Oct 2011 at 23:44 #

    Hi Ken, try to simplify your transformation. Make sure that your UDJC step indeed continues to produce rows.
    If you want to post a sample, hop on the forum and create a discussion thread there.
    Good luck,

    Matt

  7. ken on 08 Oct 2011 at 0:10 #

    hi Matt, I think I kind of figured out the issue, see if that makes sense to you. I noticed the following behavior:

    * The single threader is set to batch every 20 secs

    - events came in at 1, 2, 5, 10th secs
    - there was no more event before the 20 sec time-window expired.

    In this situation, i found out the time threader would still be waiting without producing any output.

    - now, lets say, at the 50th sec, another event arrived, then interestingly, the time threader finally split out the result and aggregated with the previous events ( at 1, 2, 5, 10, 50th secs)

    Thanks for your help.
    –ken

  8. Pablo on 05 Dec 2011 at 20:08 #

    Hi Matt,
    I´ve got a situation a little diferrent, as we don’t have any person_attribute table, this is made dinamically as a file arrive to our application. When I’ve tried to do what you show hear, doesn’t work. I wonder if it just work if the data is stored in our table or can I do it if it came on a strem?
    Is there any modification I should do, or did I just do something wrong?

  9. Matt Casters on 05 Dec 2011 at 23:47 #

    Hi Pablo,

    The data can originate from any type of data source, no restrictions there.
    Try to give a few more details on the Kettle forum.

    Best regards,

    Matt

  10. micropoum on 17 Dec 2012 at 16:14 #

    Hi Matt

    Very interesting, i use successfully this solution.
    However, I have tryed to use dynamic.ktr in a sub-transformation, ie : a step “mapping input” before “attribute_description” and a step “mapping output” instead of dummy “output”

    I have a transformation with a mapping step which maps dynamic.ktr and a dummy step just after. When I run it, I have a “bad encoding” error

    Thanks for your help

Trackback URI | Comments RSS

Leave a Reply

« The Single Threader step | Memory tuning fast paced ETL »

Pentaho world image