XML,

Convert IRS XML to PostgreSQL using Flexter Enterprise

Published on August 28, 2017
Updated on November 20, 2024

Loading IRS Data into PostgreSQL database using Flexter Enterprise

In this walkthrough, we load the IRS XML files into PostgreSQL database using Flexter Enterprise edition. Flexter Enterprise provisions users with all features. The enterprise version provides users with numerous additional features which aren’t available on the free version of Flexter.
In the following steps, we discuss the process of setting up Flexter Enterprise and running commands for XML parsing into PostgreSQL

About IRS

The Internal Revenue Service (IRS) is the revenue service of the United States federal government. The government agency is a bureau of the Department of the Treasury, and is under the immediate direction of the Commissioner of Internal Revenue, who is appointed for a five-year term by the President of the United States. The IRS is responsible for collecting taxes and administering the Internal Revenue Code, the federal statutory tax law of the U.S. Its duty to maximize tax revenue entails providing tax assistance to taxpayers, as well as pursuing and resolving instances of erroneous or fraudulent tax filings.
The data that has been obtained for this walkthrough contains the FATCA payload data and the metadata specifications. The FATCA XML data is based on business requirements collected by a multilateral working group. The schema uses elements from existing reporting schemas used by the OECD and the European Union (EU) to reduce burden on reporting entities.

How Flexter works

A detailed description of how Flexter works is provided here.
The Flexter platform consists of three pluggable modules:
Schema Analyser (xsd2er)
Mapping generator (CalcMap)
Xml Processor (xml2er)

https://sonra.io/wp-content/uploads/2017/08/faq_1.png
Step 1: The Schema Analyser is a dedicated module that loads, parses out, processes and stores the XML schema information in Flexter’s internal metadata DB. This step is only required to be performed once for each schema to be processed. You can either supply an XSD or a representative sample of XML files for this step.
Step 2: Now that we know the exact layout of the source XML it is possible to generate the relational equivalent. Flexter’s module, Mapping Generator generates the output schema layout and the mapping to it. Various optimisations of the target schema can be applied during this step.
Step 3: The XML Processor module takes the information generated from the two previous steps, processes the XML, and writes the data to the relational target schema.

https://sonra.io/wp-content/uploads/2017/08/faq_2.png

Pre-Requisite:

Java run time environment (Java 8 preferably) should be up and running
PostgreSQL should be installed and configured
If you have any more questions please refer to our FAQ.

Loading IRS data into PostgreSQL database

Generating the metadata and logical schema

In the following steps, we describe the process of generating the logical schema for this exercise.
Step1. After installing the xsd2er package, go to command prompt and enter xsd2er. The xsd2er command processes the schema and stores the information in the internal metadata DB. When this command is executed, the input and metadata processing options are displayed on the screen. Based upon the requirements, the command can be configured.
Screen Clipping
Step2. Specify the command to analyse the XSD file. In this step, we process the XSD file to obtain the metadata information and store it in Flexter’s DB
xsd2er -g3 /home/anveshg/irsnew.zip
Screen Clipping
The logical schema number can be seen from the output. Note down this number which is to be used in xml2er
Screen Clipping
Logical schema number: 658

Loading IRS XML data into PostgreSQL

In the following steps, we describe the loading of XML data into PostgreSQL database
Step 1. After installing the xml2er package, go to command prompt

Step 2. Enter xml2er. The output is shown below

This command displays the various options that are provisioned in Flexter enterprise. The command for processing the XML files can be modified based upon these options. There are various switches provisioned in Flexter which can be used to configure the output data. Based upon the output requirement, the switch can be selected. For example, if the output is supposed to be a TSV file, then select the format as -f tsv

CommandDescription
-o,Output location path
-u,Output user
-p,Output password
-f,Format of output. (jdbc, parquet, json, csv, tsv)
-z,Parquet, csv, tsv compression mode (uncompressed, snappy, gzip, lzo, lz4, bzip2)
-S,Save Mode when table, directory or file exists ex: [e]rror, [a]ppend, [o]verwrite, [i]gnore default: append
-Y,Number of partitions for writing data
-e,Mode of parsing: [a]ll, [d]ata, [s]tats default: all
-b,Block Size. 1024kb, 64mb, 128mb…
-B,Batch size to write into databases default: 1000
-c,Show SQL commands
-s,Skip writing results

Step 3. The target output is the PostgreSQL database. A new database needs to be created in postgres to be specified as the target. We haven’t detailed the steps to create a database in postgres as part of this walk-through.
Step 4. In the next step, we run the xml2er command to process the XML data. On the command prompt, enter
xml2er -l656 -o jdbc:postgresql://host:port /irs username-u irs -p password /home/anveshg/irsnew.zip
The description of the attributes in the command is given below
L656 – logical schema number obtained from xsd2er
o – output path
jdbc:postgresql://host:5432/IRS – jdbc connector to postgres database
u – username
p – password
Step 5. Run the command

Step 6. After job completion

Step 7. The tables are created in PostgreSQL
The IRS XML files have been parsed into PostgreSQL database using Flexter enterprise.

Connecting to PostgreSQL and querying data

We can access postgreSQL from the command line and query.
Screen Clipping
A sample query executed in postgreSQL with the IRS XML data is shown below

Output:
Screen Clipping
We have processed the IRS XML data using Flexter enterprise and the output was loaded into PostgreSQL for querying.
Which data formats apart from XML also give you the heebie jeebies and need to be liberated? Please leave a comment below or reach out to us.

Maciek

Maciek

Maciek is a data wizard with a knack for turning messy semi-structured formats like XML, JSON, and XSD into readable data. With a brain wired for product and data architecture, Maciek makes the complex look effortless. A true architect of order in the chaos of data, Maciek is the magic ingredient to making sure your systems don’t just work—they shine. Maciek is the CTO of Sonra and product lead for Flexter and FlowHigh. He has been shaping the data industry since 2000 with a visionary approach and leadership.

LinkedIn