Converting XML to Hive
In this example we will use the Flexter XML converter to generate a Hive schema and parse an XML file into a Hive database. We will then use the spark-sql interface to query the generated tables.
TVAnytime XML standard
For the example we will use TVAnytime XML standard. You can download sample XML files and an XSD for this standard from the TVAnytime website.
We will convert the RolesCS.xml file which contains the classification and definition of various roles. For example, the roles that a “Reporter” can play is given in the xml below.
<Term termID=“REPORTER“>
<Name xml:lang=“en“>Reporter</Name>
<Name xml:lang=“en“>Newsman</Name>
<Name xml:lang=“en“>Newswoman</Name>
<Name xml:lang=“en“>Newsperson</Name>
<Definition xml:lang=“en“>A person who gathers news and other journalistic material and writes or broadcasts it-the basic job in journalism</Definition>
</Term>
We will parse the above xml file using Flexter and then generate the hive schema from it, and finally we will view the data extracted using the spark-sql interface.
[flexter_banner]
Converting TVAnytime XML to Hive tables
We start by creating a database in the spark-warehouse by firing up the spark-sql terminal.
1 2 3 |
# firing the spark-sql terminal from the home directory $ spark-sql spark-sql> |
Next, we will create a target database
1 2 3 4 5 6 7 8 |
# creating a target database ‘mydb’ spark-sql> DROP DATABASE IF EXISTS mydb CASCADE; CREATE DATABASE mydb LOCATION '/pathTo/mydb.db'; # Checking if the database was created spark-sql> SHOW DATABASES; default mydb Time taken: 0.056 seconds, Fetched 2 row(s) |
Once the target database has been created, we can use Flexter to extract the data from the RolesCS.XML file. We first run the command using the -s option and then again run Flexter without the -s option to generate the logical schema for the XML.
[flexter_button]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
# Simulating the execution with the skip -s option $ xml2er -s -g3 RolesCS.xml # Now running the code for real without the -s option. With the -g switch we specify the type of optimisation we want to apply, e.g. the degree of normalisation $ xml2er -g3 RolesCS.xml # input path: RoleCS.xml # schema origin: 34 logical: 13 job: 56 # statistics startup: 3306 ms parse: 690 ms stats: 5422 ms map: 201 ms unique xpaths: 8 |
When the command is ready, we use the above generated logical schema to generate the Hive based output. This mode can be activated by parameter -V or –hive-create and the output location is optional.
1 |
$ xml2er -V … |
When the output location is not defined, the tables will be created as managed tables, following the database/schema default location. With a defined output location, the tables will be created as external tables.
1 |
$ xml2er -V -o <Output Directory> … |
The target schema might be provided, otherwise the “default” schema will be used implicitly.
Below are some useful options:
1 2 |
-V, --hive-create Enable creating hive tables -E, --hive-schema SCHEMA Creating hive tables into schema |
We can use the above parameters to extract the XML on the schema of our choice. By default we generate Parquet files, but we could also generate ORC files.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
# Extracting data into ‘mydb’ created above $ xml2er -l13 RolesCS.xml -V -E mydb.db … 13:48:49.373 INFO Finished successfully in 20057 milliseconds # schema origin: 30 logical: 11 job: 54 # statistics startup: 8431 ms load: 7489 ms parse: 128 ms write: 2946 ms stats: 1042 ms map: 3 ms unique xpaths: 8 |
In order to view the data we can again fire ‘spark-sql’ from the terminal and check the various tables generated
1 2 3 4 5 6 7 8 |
# using ‘mydb’ created above spark-sql> USE mydb; # Listing all the tables created by the Flexter mydb> SHOW TABLES; default classificationscheme false default name false default term false Time taken: 0.055 seconds, Fetched 3 row(s) |
We can use basic spark-sql commands to check on the descriptions of the various tables that were parsed by Flexter. Below we describe the table ‘term’ created as the output by Flexter
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
# Describing the tables created mydb> DESCRIBE TERM; PK_Term decimal(38,0) NULL FK_ClassificationScheme decimal(38,0) /ClassificationScheme/Term Definition string /ClassificationScheme/Term/Definition Definition_lang string /ClassificationScheme/Term/Definition/@lang termID string /ClassificationScheme/Term/@termID # Checking on te data by displaying first 3 rows mydb> SELECT * FROM TERM LIMIT 3; +---------+-------------------------+-----------------------------------------+ | PK_Term | FK_ClassificationScheme |Definition | Definition_lang | termID | +---------+-------------------------+-------------------------------------------| 54...1 |54...1 | A person who creates the content | en| AUTHOR | 54...2 |54...1 | A television reporter who coordinates a broadcast to which several correspondents contribute| en| ANCHOR | 54...3 |54...1 | A person who gathers news and other journalistic material and writes or broadcasts it-the basicjob in journalism | en| REPORTER +---------+-------------------------+------------------------------------------+ # Checking on the distinct termID (displaying first 10) mydb> SELECT DISTINCT termID FROM TERM LIMIT 5; SOUND-EFFECTS-PERSON WEBCASTER SCRIPTWRITER SET-DESIGNER STAFF ... |
Similarly, we can view the other tables generated – ‘name’ and ‘classificationscheme’.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
# Describing the tables created mydb> DESCRIBE NAME; FK_Term decimal(38,0) /ClassificationScheme/Term/Name Name string /ClassificationScheme/Term/Name lang string /ClassificationScheme/Term/Name/@lang Time taken: 0.098 seconds, Fetched 3 row(s) # Checking on the data by displaying first 3 rows mydb> SELECT * FROM NAME LIMIT 3; +---------+-----------+------+ | FK_Term | Name | lang | +---------+-----------+------+ | 54...1 | Author | en | | 54...2 | Anchor | en | | 54...2 | Anchorman | en | +---------+-----------+------+ |
We can use the FK_Term in the ‘name’ table and the PK_Term in the ‘term’ table to join both the tables and classify the Name column in the ‘name’ table to classify in under the ‘termID’ column in the term table. For example, we can see what names come under the ‘REPORTER’ termID in our XML as depicted above.
1 2 3 4 5 6 7 8 9 10 |
# Joining the Name column with the termID column based on the Term Key mydb> SELECT name.Name FROM name, term WHERE name.FK_Term = term.PK_Term AND term.termID='REPORTER'; Reporter Newsman Newswoman Newsperson Time taken: 0.588 seconds, Fetched 4 row(s) |
[faq_button]