Wednesday, January 9, 2013

ERROR: Pentaho Data Integration (Kettle) process runs twice

The Problem

I work a lot with Pentaho Data Integration, a.k.a. the Kettle toolkit. For those who don't know it: Kettle lets you build processes in a GUI and run them either in the IDE or from the command line. A process reads data, converts and transforms it, then spits it out. Kettle can deal with a lot of databases and various file formats, invoke shell scripts, run JavaScript snippets, and perform various conversions and transformations. Very handy when you have to load large, broken CSV files into relational databases, just to mention one example.

I recently re-designed one of our processes (or "jobs" in Kettle) when something really strange showed up during testing.

Part of the process ran twice. It seemed as if I had duplicated the whole process from some point onward.

The IDE basically lets you build a graph: the nodes are the process steps, and the edges tell it what to do when a node has finished.

Apart from error-handling edges, I clearly drew only a single edge out of any node. What I saw could only have happened if I had drawn two edges between the same two nodes. The IDE doesn't even allow such a thing. Or does it?

I tried disabling the edges (called "hops" in Kettle job terminology) to find where the process splits in two.

I found out that disabling one of the hops did NOT cause the job to stop at that point, which is what it should have done. Instead, the job ran correctly, executing its tail only once.


The Solution

Something was clearly off the rails. In the IDE, I had disabled one hop, but the command line tool still thought that there was an enabled hop between the two steps. When I enabled the hop, the command line tool probably saw TWO hops between the same nodes, a thing that is impossible to achieve (intentionally) in the IDE (which is called Spoon, by the way). I suspected file corruption and/or a serious bug in the command line tool.

Kettle keeps the job description in XML files, therefore they can be opened and easily modified in a simple text editor. So this is exactly what I did: I fired up Geany, and opened the offending job file. This is what I saw:


...
    <hop>
      <from>Set up dimension tables in stage</from>
      <to>transform_fixlogs_to_stage</to>
      <from_nr>0</from_nr>
      <to_nr>0</to_nr>
      <enabled>Y</enabled>
      <evaluation>Y</evaluation>
      <unconditional>N</unconditional>
    </hop>
    <hop>
      <from>Set up dimension tables in stage</from>
      <to>transform_fixlogs_to_stage</to>
      <from_nr>0</from_nr>
      <to_nr>0</to_nr>
      <enabled>Y</enabled>
      <evaluation>Y</evaluation>
      <unconditional>N</unconditional>
    </hop>
...

Yep, there are TWO hops between the same nodes.
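
Spotting this by eye works for a small job, but the same check could be scripted. Below is a minimal sketch in Python, not anything that ships with Kettle. It assumes only what the excerpt above shows: hops are stored as <hop> elements whose children are simple text fields. The function names, the wrapping <job>/<hops> elements in the sample and the step name "hypothetical_next_step" are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def hop_key(hop):
    """Turn a <hop> element into a comparable tuple of (tag, text) pairs."""
    return tuple((child.tag, (child.text or "").strip()) for child in hop)


def find_duplicate_hops(xml_text):
    """Return {hop_key: count} for every hop that appears more than once."""
    root = ET.fromstring(xml_text)
    # iter("hop") finds <hop> elements at any depth, so the exact
    # nesting of the job file does not matter for this check.
    counts = Counter(hop_key(hop) for hop in root.iter("hop"))
    return {key: n for key, n in counts.items() if n > 1}


# Illustrative sample: the wrapping elements and the third hop are assumptions.
SAMPLE = """
<job>
  <hops>
    <hop>
      <from>Set up dimension tables in stage</from>
      <to>transform_fixlogs_to_stage</to>
      <enabled>Y</enabled>
    </hop>
    <hop>
      <from>Set up dimension tables in stage</from>
      <to>transform_fixlogs_to_stage</to>
      <enabled>Y</enabled>
    </hop>
    <hop>
      <from>transform_fixlogs_to_stage</from>
      <to>hypothetical_next_step</to>
      <enabled>Y</enabled>
    </hop>
  </hops>
</job>
"""

if __name__ == "__main__":
    for key, count in find_duplicate_hops(SAMPLE).items():
        fields = dict(key)
        print(f"{count}x hop from {fields['from']!r} to {fields['to']!r}")
```

To check a real job file, read its contents and pass the text to find_duplicate_hops. Two hops count as duplicates here only if every field matches, including the enabled flag.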

Removing one of the hops with the text editor solved the problem; the job now runs correctly.
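
If hand-editing feels risky, the cleanup could also be done programmatically. The following is a rough Python sketch under the same assumption as the XML excerpt (hops are <hop> elements with simple text children); the function name and the <job>/<hops> wrapper in the sample are hypothetical. Re-serializing with ElementTree may change whitespace or drop the XML declaration, so it is safer to run something like this on a copy of the job file and compare the result before using it.

```python
import xml.etree.ElementTree as ET


def remove_duplicate_hops(xml_text):
    """Drop every <hop> that exactly repeats an earlier sibling hop.

    Returns the cleaned XML as a string and the number of hops removed.
    """
    root = ET.fromstring(xml_text)
    removed = 0
    # Materialize the element list first so the tree can be modified safely.
    for parent in list(root.iter()):
        seen = set()
        for child in list(parent):
            if child.tag != "hop":
                continue
            key = tuple((c.tag, (c.text or "").strip()) for c in child)
            if key in seen:
                parent.remove(child)  # exact repeat of an earlier hop
                removed += 1
            else:
                seen.add(key)
    return ET.tostring(root, encoding="unicode"), removed


# Illustrative sample: the wrapping <job>/<hops> elements are an assumption.
SAMPLE = """
<job>
  <hops>
    <hop>
      <from>Set up dimension tables in stage</from>
      <to>transform_fixlogs_to_stage</to>
      <enabled>Y</enabled>
    </hop>
    <hop>
      <from>Set up dimension tables in stage</from>
      <to>transform_fixlogs_to_stage</to>
      <enabled>Y</enabled>
    </hop>
  </hops>
</job>
"""

if __name__ == "__main__":
    cleaned, removed = remove_duplicate_hops(SAMPLE)
    print(f"removed {removed} duplicate hop(s)")
    print(cleaned)
```

Only the first occurrence of each hop is kept, which mirrors what I did by hand in the editor.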