Converting XML to Hive

Chinmay Sinha XML

In this example we will use the Flexter XML converter to generate a Hive schema and parse an XML file into a Hive database. We will then use the spark-sql interface to query the generated tables.

TVAnytime XML standard

For the example we will use TVAnytime XML standard. You can download sample XML files and an XSD for this standard from the TVAnytime website.

We will convert the RolesCS.xml file which contains the classification and definition of various roles. For example, the roles that a “Reporter” can play is given in the xml below.

<Term termID="REPORTER">
                    <Name xml:lang="en">Reporter</Name>
                    <Name xml:lang="en">Newsman</Name>
                    <Name xml:lang="en">Newswoman</Name>
                    <Name xml:lang="en">Newsperson</Name>
                    <Definition xml:lang="en">A person who gathers news and other journalistic material and writes or broadcasts it-the basic job in journalism</Definition>

We will parse the above xml file using Flexter and then generate the hive schema from it, and finally we will view the data extracted using the spark-sql interface.

Converting TVAnytime XML to Hive tables

We start by creating a database in the spark-warehouse by firing up the spark-sql terminal.

Next, we will create a target database

Once the target database has been created, we can use Flexter to extract the data from the RolesCS.XML file. We first run the command using the -s option and then again run Flexter without the -s option to generate the logical schema for the XML.

When the command is ready, we use the above generated logical schema to generate the Hive based output. This mode can be activated by parameter -V or --hive-create and the output location is optional.
When the output location is not defined, the tables will be created as managed tables, following the database/schema default location. With a defined output location, the tables will be created as external tables.
The target schema might be provided, otherwise the “default” schema will be used implicitly.

Below are some useful options:

We can use the above parameters to extract the XML on the schema of our choice. By default we generate Parquet files, but we could also generate ORC files.
In order to view the data we can again fire ‘spark-sql’ from the terminal and check the various tables generated
We can use basic spark-sql commands to check on the descriptions of the various tables that were parsed by Flexter. Below we describe the table ‘term’ created as the output by Flexter
Similarly, we can view the other tables generated - ‘name’ and ‘classificationscheme’.
We can use the FK_Term in the ‘name’ table and the PK_Term in the ‘term’ table to join both the tables and classify the Name column in the ‘name’ table to classify in under the ‘termID’ column in the term table. For example, we can see what names come under the ‘REPORTER’ termID in our XML as depicted above.