Convert MISMO XML to Hive and Parquet

by Maciek

Maciek is the Co-founder of Sonra. He has a knack for turning messy semi-structured formats like XML, JSON, and XSD into readable data. With a brain wired for product and data architecture, Maciek is the magic ingredient to making sure your systems don’t just work—they shine.

Published on October 17, 2017
Updated on January 1, 2025

In this walkthrough, we convert MISMO (Mortgage Industry Standards Maintenance Organization) XML files to Parquet and query them in Hive. The conversion uses the enterprise version of Flexter, which provides numerous additional features that are not available in the free version.

The following steps cover running the Flexter Enterprise commands that parse the XML into Parquet files, and then loading those files into Hive tables.

About MISMO XML

MISMO is a subsidiary of the Mortgage Bankers Association that develops technology standards for residential and commercial property transactions in the US market. Its core aim is to create standards that improve data consistency and transparency and that are cost-effective to implement. It provides a comprehensive set of standards and data points for all stages of the mortgage loan life cycle, such as underwriting, mortgage insurance application and credit reporting. By defining a standard XML format, MISMO enables business partners to exchange data in a consistent way.

After downloading the MISMO XML, we process the files with Flexter as described below.

How Flexter XML Converter Works

A detailed description of how Flexter works is provided in the Flexter FAQ.

The Flexter platform consists of three pluggable modules:
Schema Analyser (xsd2er)
Mapping Generator (CalcMap)
XML Processor (xml2er)

Step 1: The Schema Analyser is a dedicated module that loads, parses, processes and stores the XML schema information in Flexter’s internal metadata DB. This step only needs to be performed once for each schema. You can supply either an XSD or a representative sample of XML files for this step.
Step 2: Now that the exact layout of the source XML is known, it is possible to generate the relational equivalent. The Mapping Generator module generates the output schema layout and the mapping to it. Various optimisations of the target schema can be applied during this step.
Step 3: The XML Processor module takes the information generated in the two previous steps, processes the XML, and writes the data to the relational target schema.
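
To make the idea of a "relational equivalent" more concrete, here is a toy Python sketch. It is not Flexter's algorithm. It assumes a deliberately simple rule: every element that has child elements becomes a row in a table named after its tag, leaf elements become columns, and each child row carries a foreign key to its parent row. The element names are borrowed from the LOAN_IDENTIFIER table used later in this walkthrough. The XML fragment, its values, the `xml_to_rows` helper and the `id` / `FK_<parent>` naming are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import count


def xml_to_rows(xml_text):
    """Flatten an XML document into {table_name: [row_dict, ...]}.

    Toy rule (an assumption, not Flexter's mapping logic):
    - an element with child elements becomes a row in a table named after its tag
    - leaf children become columns of that row
    - each row gets a surrogate "id" and, if nested, an "FK_<parent tag>" column
    """
    tables = defaultdict(list)
    ids = count(1)

    def visit(elem, parent_tag, parent_id):
        row_id = next(ids)
        row = {"id": row_id}
        if parent_tag is not None:
            row[f"FK_{parent_tag}"] = parent_id
        for child in elem:
            if len(child) == 0:
                # Leaf element: store its text as a column value.
                row[child.tag] = (child.text or "").strip()
            else:
                # Nested element: becomes a row in its own table.
                visit(child, elem.tag, row_id)
        tables[elem.tag].append(row)

    visit(ET.fromstring(xml_text), None, None)
    return dict(tables)


# Simplified, made-up fragment; real MISMO messages are far more deeply nested.
sample = """
<LOAN>
  <LOAN_IDENTIFIER>
    <SellerLoanIdentifier>A-1</SellerLoanIdentifier>
    <MERS_MINIdentifier>M-1</MERS_MINIdentifier>
  </LOAN_IDENTIFIER>
</LOAN>
"""

tables = xml_to_rows(sample)
for name, rows in tables.items():
    print(name, rows)
```

The point of the sketch is only that nesting in the XML turns into parent and child tables linked by keys, which is why the Hive table later in this walkthrough has an FK_Loan column.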

Convert MISMO XML to Parquet and Hive

Generating the metadata and logical schema

In the following steps, we describe how to generate the logical schema for this exercise.
Step 1. After installing the xsd2er package, open a command prompt and enter xsd2er. The xsd2er command processes the schema and stores the information in the internal metadata DB. When the command is executed on its own, the input and metadata processing options are displayed on the screen, and the command can be configured with them as required.

Step 2. Run the command to analyse the XSD file. This processes the XSD, extracts the metadata and stores it in Flexter’s metadata DB.
xsd2er -g3 /home/centos/MISMO/MISMO_3_0.xsd

The logical schema number is shown in the output. Note this number down, as it is needed later for xml2er.

Logical schema number: 13
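
If the run is scripted, the number can be picked out of the captured output instead of being copied by hand. The following Python sketch assumes that the output contains a line in exactly the form quoted above; the real xsd2er output may contain other text around it, and the `extract_logical_schema_id` helper is a hypothetical name, not part of Flexter.

```python
import re
from typing import Optional

# Assumption: the captured output contains a line such as
# "Logical schema number: 13", as quoted in this walkthrough.
SCHEMA_LINE = re.compile(r"Logical schema number:\s*(\d+)")


def extract_logical_schema_id(output: str) -> Optional[int]:
    """Return the first logical schema number found in the text, or None."""
    match = SCHEMA_LINE.search(output)
    return int(match.group(1)) if match else None


# Stand-in for captured tool output; only the last line mirrors the article.
sample_output = "some earlier log lines\nLogical schema number: 13\n"
schema_id = extract_logical_schema_id(sample_output)
print(schema_id)
```

Returning None rather than raising keeps the decision about a missing number with the caller, for example to stop the pipeline before xml2er is run.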

Extracting Parquet output from MISMO using Flexter Enterprise

In the following steps, we describe how the XML data is processed so that it can be loaded into the Hive database.
Step 1. After installing the xml2er package, open a command prompt.

Step 2. Enter xml2er to display the options available in Flexter Enterprise. These switches configure how the XML files are processed and how the output is written, and the command can be adjusted with them as required. For example, to write the output as Parquet files, set the format with -f parquet.

Option	Description
-o	Output location path
-u	Output user
-p	Output password
-f	Format of output (jdbc, parquet, json, csv, tsv)
-z	Compression mode for Parquet, CSV and TSV output (uncompressed, snappy, gzip, lzo, lz4, bzip2)
-S	Save mode when the table, directory or file exists: [e]rror, [a]ppend, [o]verwrite, [i]gnore. Default: append
-Y	Number of partitions for writing data
-e	Mode of parsing: [a]ll, [d]ata, [s]tats. Default: all
-b	Block size (1024kb, 64mb, 128mb…)
-B	Batch size for writing into databases. Default: 1000
-c	Show SQL commands
-s	Skip writing results
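
When the command is assembled from a script, building it as an argument list avoids quoting mistakes. The Python sketch below only uses switches that appear in this walkthrough (-l, -o, -f, -z) and does not run anything. The `build_xml2er_command` helper is a hypothetical name, and placing the optional switches before the input path is an assumption based on the example command in this article.

```python
import shlex
from typing import List, Optional


def build_xml2er_command(
    schema_id: int,
    output_path: str,
    input_path: str,
    output_format: Optional[str] = None,
    compression: Optional[str] = None,
) -> List[str]:
    """Assemble an xml2er argument list from the switches described above."""
    args = ["xml2er", f"-l{schema_id}", "-o", output_path]
    if output_format:
        args += ["-f", output_format]   # e.g. parquet
    if compression:
        args += ["-z", compression]     # e.g. snappy
    args.append(input_path)             # input path goes last
    return args


command = build_xml2er_command(
    13,
    "/home/centos/MISMO",
    "/home/centos/MISMO/Test_MESSAGE2.xml",
    output_format="parquet",
    compression="snappy",
)
# Printable form, suitable for logging or for pasting into a shell.
print(shlex.join(command))
```

The list form can be handed to `subprocess.run` unchanged on a machine where xml2er is installed.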

Step 3. The output is written to the local file system, so the output path has to be specified in the command.
Step 4. Run the xml2er command to process the XML data. At the command prompt, enter
xml2er -l13 -o /home/centos/MISMO /home/centos/MISMO/Test_MESSAGE2.xml
The parts of the command are described below
-l13: logical schema number obtained from xsd2er
-o: output location switch
/home/centos/MISMO: output path
/home/centos/MISMO/Test_MESSAGE2.xml: input path
Step 5. Run the command and wait for the job to complete.

Step 6. Once the job has completed, the data is copied to the local server, ready to be loaded into Hive.
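
Before loading anything into Hive, it can help to check what the job actually wrote. The Python sketch below assumes that the output consists of files with a .parquet extension somewhere under the output path. The exact directory layout that Flexter produces is not described in this walkthrough, so the `list_parquet_files` helper and the file names in the demo are purely illustrative.

```python
import tempfile
from pathlib import Path
from typing import List


def list_parquet_files(output_dir: str) -> List[Path]:
    """Return all *.parquet files below output_dir, sorted by path."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*.parquet") if p.is_file())


# Demo on a throwaway directory with made-up file names, so that the
# sketch runs anywhere; point it at the real output path in practice.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "LOAN").mkdir()
    (Path(tmp) / "LOAN" / "part-0.parquet").write_bytes(b"")
    (Path(tmp) / "notes.txt").write_text("not a parquet file")
    found = [p.relative_to(tmp).as_posix() for p in list_parquet_files(tmp)]
    print(found)
```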

Loading Parquet Files to Hive and Querying Data

The ER diagram generated by Flexter for the MISMO data is shown below

The tables in Hive are created as shown below
The DDL for the LOAN_IDENTIFIER table is:
create table loan_identifier (FK_Loan int, InvestorCommitmentIdentifier string, MERS_MINIdentifier string, SellerLoanIdentifier string) STORED AS PARQUET tblproperties ("parquet.compression"="SNAPPY");
The data can be loaded into the Hive table with the load command
load data inpath '/user/centos/loan/LOAN.parquet' into table loan_identifier;
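
Flexter produces more than one table, so writing each DDL statement by hand gets tedious. As a rough sketch, the statement can be generated from a list of column names and Hive types. The `hive_parquet_ddl` helper below is a hypothetical name, and the column list is copied from the LOAN_IDENTIFIER example. The property key used is `parquet.compression`, the key Hive documents for Parquet tables.

```python
from typing import Sequence, Tuple


def hive_parquet_ddl(
    table: str,
    columns: Sequence[Tuple[str, str]],
    compression: str = "SNAPPY",
) -> str:
    """Build a CREATE TABLE ... STORED AS PARQUET statement for Hive."""
    column_list = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE {table} ({column_list}) STORED AS PARQUET "
        f'TBLPROPERTIES ("parquet.compression"="{compression}");'
    )


# Columns taken from the LOAN_IDENTIFIER table in this walkthrough.
loan_identifier_columns = [
    ("FK_Loan", "int"),
    ("InvestorCommitmentIdentifier", "string"),
    ("MERS_MINIdentifier", "string"),
    ("SellerLoanIdentifier", "string"),
]

ddl = hive_parquet_ddl("loan_identifier", loan_identifier_columns)
print(ddl)
```

The sketch does no escaping or validation of identifiers, so it should only be fed trusted table and column names.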
A sample query is shown below

Any Questions? Visit the Flexter FAQ
