Converting XML to Hive

Published on January 27, 2018
Updated on May 19, 2021

In this example we will use the Flexter XML converter to generate a Hive schema and parse an XML file into a Hive database. We will then use the spark-sql interface to query the generated tables.

TVAnytime XML standard

For the example we will use the TVAnytime XML standard. You can download sample XML files and an XSD for this standard from the TVAnytime website.
We will convert the RolesCS.xml file, which contains the classification and definition of various roles. For example, the names and definition recorded for the “Reporter” role are given in the XML below.
<Term termID="REPORTER">
  <Name xml:lang="en">Reporter</Name>
  <Name xml:lang="en">Newsman</Name>
  <Name xml:lang="en">Newswoman</Name>
  <Name xml:lang="en">Newsperson</Name>
  <Definition xml:lang="en">A person who gathers news and other journalistic material and writes or broadcasts it-the basic job in journalism</Definition>
</Term>
We will parse this XML file with Flexter, generate the Hive schema from it, and then view the extracted data through the spark-sql interface.
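
To make the target structure concrete, the following minimal sketch flattens the Term fragment above into a parent row set and a child row set using only Python's standard library. It illustrates the parent/child split and is not a description of how Flexter works internally. The function name flatten_terms, the simplified ClassificationScheme wrapper element, and the counter-based key values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The Term fragment from the article, wrapped in a simplified root element (assumption).
SAMPLE = """
<ClassificationScheme>
  <Term termID="REPORTER">
    <Name xml:lang="en">Reporter</Name>
    <Name xml:lang="en">Newsman</Name>
    <Name xml:lang="en">Newswoman</Name>
    <Name xml:lang="en">Newsperson</Name>
    <Definition xml:lang="en">A person who gathers news and other journalistic material and writes or broadcasts it-the basic job in journalism</Definition>
  </Term>
</ClassificationScheme>
"""


def flatten_terms(xml_text):
    """Return (term_rows, name_rows): one row per Term and one row per Name."""
    root = ET.fromstring(xml_text)
    term_rows, name_rows = [], []
    # Hypothetical surrogate keys: a simple counter starting at 1.
    for pk, term in enumerate(root.findall("Term"), start=1):
        definition = term.find("Definition")
        term_rows.append({
            "PK_Term": pk,
            "termID": term.get("termID"),
            "Definition": definition.text if definition is not None else None,
            "Definition_lang": definition.get(XML_LANG) if definition is not None else None,
        })
        # Repeating child elements become rows that point back to the parent key.
        for name in term.findall("Name"):
            name_rows.append({
                "FK_Term": pk,
                "Name": name.text,
                "lang": name.get(XML_LANG),
            })
    return term_rows, name_rows


term_rows, name_rows = flatten_terms(SAMPLE)
for row in name_rows:
    print(row["FK_Term"], row["Name"], row["lang"])
```

The repeating Name element is what forces a second row set: a single Term can carry several names, so each name is stored as its own row with a reference back to the parent term.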

Converting TVAnytime XML to Hive tables

We start by firing up the spark-sql terminal so that we can create a database in the spark-warehouse.

# firing the spark-sql terminal from the home directory
$ spark-sql
spark-sql>

Next, we will create a target database:

# creating a target database ‘mydb’
spark-sql> DROP DATABASE IF EXISTS mydb CASCADE;
           CREATE DATABASE mydb LOCATION '/pathTo/mydb.db';
# Checking if the database was created
spark-sql> SHOW DATABASES;
default
mydb
Time taken: 0.056 seconds, Fetched 2 row(s)

Once the target database has been created, we can use Flexter to extract the data from the RolesCS.xml file. We first run the command with the -s (skip) option to simulate the execution, and then run Flexter again without -s to generate the logical schema for the XML.

# Simulating the execution with the -s (skip) option
$ xml2er -s -g3 RolesCS.xml
# Now running the code for real without the -s option. With the -g switch we specify the type of optimisation we want to apply, e.g. the degree of normalisation
$ xml2er -g3 RolesCS.xml
# input
           path:  RolesCS.xml
# schema
         origin:  34
        logical:  13
            job:  56
# statistics
        startup:  3306 ms
          parse:  690 ms
          stats:  5422 ms
            map:  201 ms
  unique xpaths:  8

When the command has finished, we use the logical schema generated above to produce the Hive-based output. This mode is activated with the -V or --hive-create parameter; the output location is optional.

$ xml2er -V …

When the output location is not defined, the tables will be created as managed tables, following the database/schema default location. With a defined output location, the tables will be created as external tables.

$ xml2er -V -o <Output Directory> …

A target schema can be provided; otherwise the “default” schema is used implicitly.
Below are some useful options:

-V, --hive-create               Enable creating hive tables
-E, --hive-schema SCHEMA        Creating hive tables into schema
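
When the conversion is driven from a script, the switches used in this walkthrough can be assembled programmatically. The helper below is a hypothetical sketch and not part of Flexter: the function name build_xml2er_args and its parameter names are assumptions, and it only emits switches that appear in this article (-s, -g, -l, -V, -o, -E). It also assumes that xml2er accepts the input path after the switches, as in the -g3 example. Nothing is executed; the function only builds the argument list.

```python
import shlex


def build_xml2er_args(xml_path, skip=False, optimisation=None, logical_schema=None,
                      hive_create=False, hive_schema=None, output_dir=None):
    """Build an xml2er argument list from the switches shown in this article."""
    if hive_schema is not None and not hive_create:
        # -E only makes sense together with -V (Hive table creation).
        raise ValueError("hive_schema requires hive_create=True")

    args = ["xml2er"]
    if skip:
        args.append("-s")                    # simulate the execution
    if optimisation is not None:
        args.append(f"-g{optimisation}")     # type/degree of optimisation
    if logical_schema is not None:
        args.append(f"-l{logical_schema}")   # reuse a generated logical schema
    if hive_create:
        args.append("-V")                    # create Hive tables
    if output_dir is not None:
        args += ["-o", output_dir]           # explicit output location
    if hive_schema is not None:
        args += ["-E", hive_schema]          # target Hive schema
    args.append(xml_path)
    return args


# Dry run of the simulation step, then the Hive extraction step.
print(shlex.join(build_xml2er_args("RolesCS.xml", skip=True, optimisation=3)))
print(shlex.join(build_xml2er_args("RolesCS.xml", logical_schema=13,
                                   hive_create=True, hive_schema="mydb.db")))
```

Returning a list rather than a single string keeps the arguments safe to pass to subprocess.run without shell quoting issues; shlex.join is used here only to display the command.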

We can use the above parameters to extract the XML into the schema of our choice. By default we generate Parquet files, but we could also generate ORC files.

# Extracting data into ‘mydb’ created above
$ xml2er -l13 RolesCS.xml -V -E mydb.db
…
13:48:49.373 INFO  Finished successfully in 20057 milliseconds
# schema
         origin:  30
        logical:  11
            job:  54
# statistics
        startup:  8431 ms
           load:  7489 ms
          parse:  128 ms
          write:  2946 ms
          stats:  1042 ms
            map:  3 ms
  unique xpaths:  8

To view the data, we can start spark-sql from the terminal again and check the tables that were generated:

# using ‘mydb’ created above
spark-sql> USE mydb;
# Listing all the tables created by Flexter
mydb> SHOW TABLES;
default	classificationscheme	false
default	name	false
default	term	false
Time taken: 0.055 seconds, Fetched 3 row(s)

We can use basic spark-sql commands to inspect the tables that Flexter created. Below we describe the ‘term’ table that Flexter produced as output.

# Describing the tables created
mydb> DESCRIBE TERM;
PK_Term	decimal(38,0)	NULL
FK_ClassificationScheme	decimal(38,0)	/ClassificationScheme/Term
Definition	string	/ClassificationScheme/Term/Definition
Definition_lang	string	/ClassificationScheme/Term/Definition/@lang
termID	 string	/ClassificationScheme/Term/@termID
# Checking on the data by displaying first 3 rows
mydb> SELECT * FROM TERM LIMIT 3;
+---------+-------------------------+-------------------------------------------------------------------------------------------------------------------+-----------------+----------+
| PK_Term | FK_ClassificationScheme | Definition                                                                                                        | Definition_lang | termID   |
+---------+-------------------------+-------------------------------------------------------------------------------------------------------------------+-----------------+----------+
| 54...1  | 54...1                  | A person who creates the content                                                                                  | en              | AUTHOR   |
| 54...2  | 54...1                  | A television reporter who coordinates a broadcast to which several correspondents contribute                      | en              | ANCHOR   |
| 54...3  | 54...1                  | A person who gathers news and other journalistic material and writes or broadcasts it-the basic job in journalism | en              | REPORTER |
+---------+-------------------------+-------------------------------------------------------------------------------------------------------------------+-----------------+----------+
# Checking on the distinct termID (displaying first 5)
mydb> SELECT DISTINCT termID FROM TERM LIMIT 5;
SOUND-EFFECTS-PERSON
WEBCASTER
SCRIPTWRITER
SET-DESIGNER
STAFF
...

Similarly, we can view the other tables generated – ‘name’ and ‘classificationscheme’.

# Describing the tables created
mydb> DESCRIBE NAME;
FK_Term	decimal(38,0)	/ClassificationScheme/Term/Name
Name	string	/ClassificationScheme/Term/Name
lang	string	/ClassificationScheme/Term/Name/@lang
Time taken: 0.098 seconds, Fetched 3 row(s)
# Checking on the data by displaying first 3 rows
mydb> SELECT * FROM NAME LIMIT 3;
+---------+-----------+------+
| FK_Term |    Name   | lang |
+---------+-----------+------+
|  54...1 | Author    | en   |
|  54...2 | Anchor    | en   |
|  54...2 | Anchorman | en   |
+---------+-----------+------+

We can use FK_Term in the ‘name’ table and PK_Term in the ‘term’ table to join the two tables, so that each value in the Name column is matched to its ‘termID’ in the ‘term’ table. For example, we can list the names that fall under the ‘REPORTER’ termID from the XML shown earlier.

# Joining the Name column with the termID column based on the Term Key
mydb> SELECT name.Name
      FROM name, term
      WHERE name.FK_Term = term.PK_Term
      AND term.termID='REPORTER';
Reporter
Newsman
Newswoman
Newsperson
Time taken: 0.588 seconds, Fetched 4 row(s)
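
The same parent/child join can be tried without a Spark installation. The sketch below recreates small versions of the ‘term’ and ‘name’ tables in an in-memory SQLite database from Python and runs the query with explicit JOIN ... ON syntax. SQLite stands in for Hive purely for illustration, the tables hold only a subset of the columns, and the integer key values are placeholders, since the real surrogate keys are abbreviated in the output above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced versions of the two tables: only the columns needed for the join.
conn.executescript("""
CREATE TABLE term (PK_Term INTEGER PRIMARY KEY, termID TEXT);
CREATE TABLE name (FK_Term INTEGER, Name TEXT, lang TEXT);
""")

# Placeholder keys 1..3 (assumption); names and termIDs are taken from the article.
conn.executemany("INSERT INTO term VALUES (?, ?)", [
    (1, "AUTHOR"),
    (2, "ANCHOR"),
    (3, "REPORTER"),
])
conn.executemany("INSERT INTO name VALUES (?, ?, ?)", [
    (1, "Author", "en"),
    (2, "Anchor", "en"),
    (2, "Anchorman", "en"),
    (3, "Reporter", "en"),
    (3, "Newsman", "en"),
    (3, "Newswoman", "en"),
    (3, "Newsperson", "en"),
])

# Explicit join on the foreign key, filtered by the parent's termID.
query = """
SELECT name.Name
FROM name
JOIN term ON name.FK_Term = term.PK_Term
WHERE term.termID = ?
ORDER BY name.rowid
"""
reporter_names = [row[0] for row in conn.execute(query, ("REPORTER",))]
print(reporter_names)
```

The explicit JOIN form returns the same rows as the comma-separated FROM clause with the join condition in WHERE; it simply keeps the join condition separate from the filter on termID.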
