Tuesday 29 October 2013

How to handle large and potentially complex XML datasets in Hadoop

How should one handle large and potentially complex XML datasets in Hadoop?

Suppose you have loads of XML data being generated by source systems.
You want to do some analytics on this data.
Chances are your data scientists will need to access it, and you might need to run a few batch jobs over the data set.
XML parsing is CPU intensive.
XML carries both the data and its meaning, which makes it robust against upstream definition changes and against corruption.
How should one approach this problem?

Came across an interesting slideshow re handling XML data here: http://www.slideshare.net/Hadoop_Summit/bose-june26-405pmroom230cv3-24148869.
Then found the YouTube video of the presentation.
Good stuff: it lays out some of the issues and challenges, offers some solution approaches, and backs them up with test results.

Another interesting article for Python coders can be found here: http://davidvhill.com/article/processing-xml-with-hadoop-streaming.
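
To make the streaming approach concrete, here is a minimal sketch (not the article's code) of a Python mapper that pulls values out of XML records arriving on stdin. The record tag and the id/amount fields are hypothetical, and each record is assumed to fit in memory with its tags on their own lines:

```python
#!/usr/bin/env python
"""Hadoop Streaming mapper sketch: extract fields from XML records on stdin.

Assumptions (illustrative only): records are delimited by <record>...</record>
tags on their own lines, each record fits in memory, and records contain
<id> and <amount> elements.
"""
import sys
import xml.etree.ElementTree as ET


def records(stream, start_tag="<record", end_tag="</record>"):
    """Buffer lines between the start and end tags and yield whole records."""
    buf = []
    for line in stream:
        stripped = line.strip()
        if stripped.startswith(start_tag):
            buf = [line]
        elif buf:
            buf.append(line)
            if stripped.endswith(end_tag):
                yield "".join(buf)
                buf = []


def main():
    for record in records(sys.stdin):
        try:
            root = ET.fromstring(record)
        except ET.ParseError:
            continue  # skip malformed records rather than killing the task
        # Emit one tab-separated key/value pair per record for the reducer.
        print("%s\t%s" % (root.findtext("id", "unknown"),
                          root.findtext("amount", "0")))


if __name__ == "__main__":
    main()
```

Buffering whole records in the mapper only works if a record never straddles an input split; in practice you would lean on something like Hadoop Streaming's StreamXmlRecordReader, or the input formats discussed in the articles above, to make sure each mapper sees complete records.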

An old post from 2010 shows how to use Hadoop and Mahout's XMLInputFormat class to process XML files. Debate about the effectiveness and richness of this approach can be found in this entry.

Another article uses Hive and XPath to solve a specific XML parsing problem.
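
For reference, Hive ships XPath UDFs (xpath, xpath_string and friends) that pull values straight out of an XML column with an XPath expression. Staying with Python for the examples in this post, the same extraction idea looks roughly like this, with a hypothetical document and element names (ElementTree only supports a limited XPath subset, whereas Hive's UDFs take full XPath 1.0):

```python
"""XPath-style extraction in Python, loosely mirroring Hive's xpath_string().

The sample document and element names are hypothetical.
"""
import xml.etree.ElementTree as ET

sample = """\
<orders>
  <order id="1"><customer>alice</customer><total>12.50</total></order>
  <order id="2"><customer>bob</customer><total>7.00</total></order>
</orders>"""

root = ET.fromstring(sample)
# Roughly what SELECT xpath_string(xml_col, 'orders/order/customer') does per row.
for order in root.findall("./order"):
    print("%s\t%s\t%s" % (order.get("id"),
                          order.findtext("customer"),
                          order.findtext("total")))
```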

An article that others point people to is this one: http://www.undercloud.org/?p=408.

There is also an interesting article from June 2012 on this topic.

These guys were looking at a similar XML challenge in Feb 2012 - a useful articulation of the problem and their thinking at the time.

Andy was playing with XML and Hadoop way back in 2008.