
At any organization as large as Mozilla, enormous amounts of data are generated, that are potentially useful when making development decisions: We count downloads, up-to-date checks for the malware blocklist and installed addons, and there are webserver logs for the various Mozilla Web sites, the responses for Test Pilot case studies, and so on.
Such data pieces are generated in large quantities every day. So before the data can be analyzed by the metrics team, it needs to be grouped by common criteria (e.g. by date, language, region or hour of the day). This way, the data sets are summarized and the amount of becomes digestable for statistical tools such as R, Microsoft Excel or OpenOffice Calc.
A data collection that is accessible for analysis in this way is often called a data warehouse, as it persists data and makes it accessible in consumable quantities. Data integration is the task of transforming data from various sources into the data warehouse representation.
Kettle is a powerful node based tool to model and implement data integration tasks. It simple to visually express data transformations without getting lost in a soup of nested function calls and intermingled SQL-statements. Nodes in the graph represent transformation steps, while the edges indicate the flow of records. Each step handles one record at a time, allowing for parallel handling of large data sets.
There are lots of predefined steps available to import and export data to and from text/CSV files or database tables, to perform basic calculation, to group and to sort. There are steps for details-lookup, for merging and for splitting of record sets. When manually programmed, such operations require a lot of hand coding and are easy to get wrong. With Kettle, it is just a matter of connecting the right nodes and setting the appropriate configuration.
Whenever the predefined steps are insufficient, it is possible to write custom steps in JavaScript — or Java if performance is a concern.
Of course, there is always room or improvement, with Pentaho Data Integration especially in the area of usability. On Mac OS X, the visual editor does not use the appropriate shortcuts (it requires you to press ctrl-c instead of cmd-c for copy to clipboard). Also, the node editing is a bit fiddly: The editor for the various transformation steps should be a contextual inspector that automatically shows the editing options for the currently selected steps. Instead, I manually have to open it for each step (that means: all the time).
In the end, Kettle is very useful nonetheless: It allows for headless execution of the created transformations, so that the data integration process can be automated. And it has very good support for regular expressions, allowing for the parsing of fairly complicated source formats.
If you have to deal with the conversion of file and/or database formats on a regular basis, you might want to give Kettle a try. There is a free community edition of PDI available at the Pentaho website.