Clustering/clouds made easy

Dear Kettle fans,

The last main item on my agenda before we could ship a release candidate of version 3.2 was the inclusion of a number of features to make dynamic clustering easier.

It was already possible to make things happen, but thanks to Sven Boden's named parameters we can now take all of that up a level.  Let me explain what we did with a small, simple example…

The “Slaved” step takes the place of one or more steps that you would like to see run clustered (optionally partitioned) on a number of machines.

So let’s say this transformation is part of a job…

We want to have this run on Amazon EC2.  So I created an AMI just for you to test with:

IMAGE   ami-f63ed99f    kettle32/PDI32_CARTE_CLUSTER_V4.manifest.xml    948932434137    available       public          i386    machine

The input of that AMI is a piece of XML that configures the Carte instance on it that is started automatically upon boot of the image:

<slave_config>
  <slaveserver>
    <name>carte-slave</name>
    <hostname>localhost</hostname>
    <network_interface>eth0</network_interface>  <!-- OPTIONAL -->
    <port>8080</port>
    <username>cluster</username>
    <password>cluster</password>
    <master>N</master>
  </slaveserver>
</slave_config>

This file, let’s call it carte-master.xml, is passed when we run our instance:

ec2-run-instances -f carte-master.xml -k <your-key-pair> ami-f63ed99f

When it’s booted we take the internal Amazon EC2 IP address of this server and pass that into a second document, let’s call it carte-slave.xml:

<slave_config>
  <masters>
    <slaveserver>
      <name>master1</name>
      <hostname>Internal IP address</hostname>
      <port>8080</port>
      <username>cluster</username>
      <password>cluster</password>
      <master>Y</master>
    </slaveserver>
  </masters>

  <report_to_masters>Y</report_to_masters>

  <slaveserver>
    <name>carte-slave</name>
    <hostname>localhost</hostname>
    <network_interface>eth0</network_interface>  <!-- OPTIONAL -->
    <port>8080</port>
    <username>cluster</username>
    <password>cluster</password>
    <master>N</master>
  </slaveserver>
</slave_config>
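If you want to script those two steps, grabbing the internal IP address and splicing it into carte-slave.xml, a sketch could look like this.  The sample INSTANCE line and the MASTER_IP_HERE placeholder are my own illustrations (not output you should expect verbatim), and the template is abbreviated to the masters section for brevity:

```shell
# A sample line of the kind `ec2-describe-instances` returns (illustrative
# only; in practice: LINE=$(ec2-describe-instances | grep ^INSTANCE | head -1)).
LINE="INSTANCE i-10a64379 ami-f63ed99f ec2-72-44-33-4.compute-1.amazonaws.com ip-10-251-50-35.ec2.internal running gsg-keypair 0 m1.small"

# Field 5 holds the internal DNS name; turn it back into a dotted IP address.
INTERNAL_DNS=$(echo "$LINE" | awk '{print $5}')
MASTER_HOST=$(echo "$INTERNAL_DNS" | sed -e 's/^ip-//' -e 's/\.ec2\.internal$//' -e 's/-/./g')

# Splice the address into a carte-slave.xml template; MASTER_IP_HERE is a
# placeholder token of my own invention, not something the AMI knows about.
cat > carte-slave.xml.template <<'EOF'
<slave_config>
  <masters>
    <slaveserver>
      <name>master1</name>
      <hostname>MASTER_IP_HERE</hostname>
      <port>8080</port>
      <username>cluster</username>
      <password>cluster</password>
      <master>Y</master>
    </slaveserver>
  </masters>
  <report_to_masters>Y</report_to_masters>
</slave_config>
EOF
sed "s/MASTER_IP_HERE/$MASTER_HOST/" carte-slave.xml.template > carte-slave.xml
```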

Then we fire up 5 slaves with that configuration…

ec2-run-instances -f carte-slave.xml -k <your-key-pair> ami-f63ed99f -n 5

These 5 slaves will report to the master and explain where they can be reached.  So all we need to do in our PDI job/transformation is create a master slave configuration:
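To double-check that the slaves actually registered, you can hit the master's status page with the cluster/cluster credentials from the slave_config above.  A sketch, assuming the internal IP from earlier (the wget call is shown commented out because the master is only reachable from inside the EC2 network):

```shell
# Assumed values; the cluster/cluster credentials match the slave_config above.
MASTER_HOST="10.251.50.35"
MASTER_PORT="8080"
STATUS_URL="http://$MASTER_HOST:$MASTER_PORT/kettle/status/"

# From a machine inside the EC2 network you could then verify registration with:
#   wget -q -O - --user=cluster --password=cluster "$STATUS_URL"
# (not executed here)
echo "$STATUS_URL"
```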

To top it off, we define MASTER_HOST and MASTER_PORT as parameters in the job and transformation…

So all that’s left to do is specify these parameters when you execute the job…
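From the command line, the same named parameters can be passed to Kitchen with the -param option.  A sketch in which the job file name and parameter values are hypothetical (only the command string is built here, so you can inspect it before running it against your own job):

```shell
# Hypothetical values; in the EC2 scenario MASTER_HOST would be the internal
# IP address of the master instance.
MASTER_HOST="10.251.50.35"
MASTER_PORT="8080"

# Named parameters (new in 3.2) are passed as -param:NAME=value.
CMD="sh kitchen.sh -file=my_clustered_job.kjb -param:MASTER_HOST=$MASTER_HOST -param:MASTER_PORT=$MASTER_PORT"
echo "$CMD"
```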

As you can see from the dialog, we pass the complete job (including sub-jobs and sub-transformations) over to the “Cluster Master” slave server prior to execution, because it is neither possible nor necessary for Spoon to contact the various slave servers directly: they report with their internal IP addresses.  We wouldn’t want it any other way, since internal traffic offers the best performance (and costs less).

These goodies are soon to be had in a 3.2.0-RC1 near you…

Until next time,
Matt

6 comments

  • Wow! Great job, amazing how fast this is progressing.

    I think this is going to be a big part of your MySQL UC presentation next month?

    kind regards,

    Roland

  • Well Roland, I guess you’ll just have to come and see :-)

  • Awesome! This helps greatly.

  • Matt, fantastic work as usual. Between this and the resource exporter, you’re evolving with the users’ needs, as always. With the advent of cloud computing and MPP databases, this kind of utility is sooooooo useful, whether you’re “cloud computing” with Amazon or your own dedicated servers. I’m excited to put this to work and hope to be able to provide useful feedback. Working now on a project using Vertica (MPP) and our own dedicated cloud of servers. The management of the cloud servers will be a “growing” concern, and all this dynamic management helps so much. Awesomeeeeeeeeee!

  • AP

    Hi Matt,

    I sent you an email at your Pentaho address, but is Kettle started as soon as you run the instances? I’m receiving some connection refused messages (from Pentaho running on Windows to the AMIs, which are Linux) – I have opened the correct port (8080) and still receive the messages.

  • Hello Allan & Brainchild!

    This is probably caused by the fact that we start slave servers on the Amazon internal network. Spoon can’t directly access them. We do this for speed and $$ (internal traffic is fast and free of charge). That is why we included the “pass export to remote server” option in the job execution dialog.

    Good luck!
    Matt