Converting TVAnytime XML to Impala and Parquet

Chinmay Sinha XML

In this example we will use Flexter to convert an XML file to parquet. We then query and analyse the output with Impala (using Cloudera VM).

Flexter can generate a target schema from an XML file or a combination of XML and XML schema (XSD) files. In our example we process ContentCS.xml file from the TVA data ( "TV-Anytime" (TVA) is a full and synchronized set of specifications established by the TV-Anytime Forum. TVA features enable the search, selection, acquisition and rightful use of content on local and/or remote personal storage systems from both broadcast and online services.

The XML file looks like the one below

Parsing XML using Flexter to generate Parquet output

Both the XML files and the XSD are available for the data and we will use the information from both files to generate the target schema.

Now that we have gathered statistics from our XML sample we can generate the logical target schema with the xsd2er command line tool using the -k switch (-k is shortcut for --use-stats)

Let’s go through the steps

Now we use the Logical Schema ID  (origin: 6) to convert the XML data to Parquet

When the command is ready, removing --skip or -s, allows us to process the data. We direct the parquet output to the output directory for the data.xml file. Let’s first create a folder “output_for_impala” as the location to extract the generated output. The location is given by -o parameter when extracting data using xml2er command.

We can find the extracted parquet files in the output folder. It is a directory structure, which you can find in the current directory. We can ‘ls’ to see the contents of the .parquet folder as shown below.

 Loading and Querying the XML output with Impala

Next, we load the parquet files into Impala and use the Impala-shell to query the parquet files. Impala is an open-source add-on to the Cloudera Enterprise Core that returns rapid responses to queries. We start with the installation of Cloudera QuickStart VM. Cloudera QuickStart virtual machines (VMs) includes Impala as well as other essentials like CDH, Cloudera Manager and Cloudera Search.

We can also install without Cloudera Manager. Before installing Impala manually, make sure all applicable nodes have the appropriate hardware configuration, levels of operating system and CDH, and any other software prerequisites here. We will start with downloading the latest version of Cloudera QuickStartVM from the cloudera website. Next, we download the QuickStartVM as shown below

Cloudera provides its VM compatible VMware, KVM and VirtualBox. We can go ahead with any version as per our requirements (For the sake of this example we will go ahead using the VirtualBox version ). There is also a docker version of Cloudera VM which is available on the Cloudera website. After downloading the cloudera-quickstart-vm-5.5.0-0-virtualbox.ovf file, we need to import it using VirtualBox.

After importing Cloudera QuickStartVM image, start the virtual machine. This virtual machine has Hadoop, cloudera Impala, and all the required software installed. The snapshot of the VM is shown below.

Once we have Cloudera VM running on the VirtualBox, we can fire up the impala-shell on the terminal as shown below. We login as a super user

In addition to Impala shell, you can communicate with Impala using the Hue browser as shown below.

Firstly, we need to share the output of Flexter with the Cloudera Virtual Machine. As shown above we can place all the files generated into a shared folder in the Cloudera VM and then copy the folder into ‘hdfs’. Let’s say the .parquet files are placed in the folder ‘/home/cloudera/flexter_output/’ then we need to copy the output files to a new folder in hdfs say ‘hdfs/parquet_imapala_input’

Alternatively, we could have installed Flexter on the Cloudera VM and generated the XML output there directly.

We have the three parquet files in hdfs and the files are ready to be read into impala. Then we launch impala using the impala-shell in a terminal and create a database to work in

After we have created the database we use the created database and load the parquet files generated using impala as below

We can take a look at the various tables in the database which we are using by using the ‘show tables’ command and also describe tables which we have created using Flexter. We can see that the schema is inferred directly from the  parquet files which we have provided.

As an alternative we can use the Flexter metadata database to automatically generate the table scripts and schema.

Let’s take the Table ‘Term’ created using the Term.parquet file and look inside the table by displaying the first 3 row of the table.