Convert IRS XML to PostgreSQL using Flexter Enterprise

Anvesh Gali XML

Loading IRS Data into PostgreSQL database using Flexter Enterprise

In this walkthrough, we load the IRS XML files into PostgreSQL database using Flexter Enterprise edition. Flexter Enterprise provisions users with all features. The enterprise version provides users with numerous additional features which aren’t available on the free version of Flexter.

Features
Free Online
Trial
Enterprise
Version
Max daily data limit
50MB
Unlimited
Scalability
Single instance
Clustered
File output formats
TSV
TSV,PARQUET,AVRO,ORC
RDBMS support
Oracle, MS SQL Server, PostgreSQL
Location
Online
On Premise
Data Lineage
Scheduled execution
Support
Optimisations
Elevate, Re-use, Naming
Elevate, Re-use, Naming
Visualisations
Browse Schemas
In-memory processing
Yes

 

In the following steps, we discuss the process of setting up Flexter Enterprise and running commands for XML parsing into PostgreSQL

About IRS

The Internal Revenue Service (IRS) is the revenue service of the United States federal government. The government agency is a bureau of the Department of the Treasury, and is under the immediate direction of the Commissioner of Internal Revenue, who is appointed for a five-year term by the President of the United States. The IRS is responsible for collecting taxes and administering the Internal Revenue Code, the federal statutory tax law of the U.S. Its duty to maximize tax revenue entails providing tax assistance to taxpayers, as well as pursuing and resolving instances of erroneous or fraudulent tax filings.

The data that has been obtained for this walkthrough contains the FATCA payload data and the metadata specifications. The FATCA XML data is based on business requirements collected by a multilateral working group. The schema uses elements from existing reporting schemas used by the OECD and the European Union (EU) to reduce burden on reporting entities.

How Flexter works

A detailed description of how Flexter works is provided here.

The Flexter platform consists of three pluggable modules:

Schema Analyser (xsd2er)
Mapping generator (CalcMap)
Xml Processor (xml2er)

https://sonra.io/wp-content/uploads/2017/08/faq_1.png

Step 1: The Schema Analyser is a dedicated module that loads, parses out, processes and stores the XML schema information in Flexter's internal metadata DB. This step is only required to be performed once for each schema to be processed. You can either supply an XSD or a representative sample of XML files for this step.

Step 2: Now that we know the exact layout of the source XML it is possible to generate the relational equivalent. Flexter's module, Mapping Generator generates the output schema layout and the mapping to it. Various optimisations of the target schema can be applied during this step.

Step 3: The XML Processor module takes the information generated from the two previous steps, processes the XML, and writes the data to the relational target schema.


https://sonra.io/wp-content/uploads/2017/08/faq_2.png

Pre-Requisite:

Java run time environment (Java 8 preferably) should be up and running

PostgreSQL should be installed and configured

If you have any more questions please refer to our FAQ.

Loading IRS data into PostgreSQL database

Generating the metadata and logical schema

In the following steps, we describe the process of generating the logical schema for this exercise.

Step1. After installing the xsd2er package, go to command prompt and enter xsd2er. The xsd2er command processes the schema and stores the information in the internal metadata DB. When this command is executed, the input and metadata processing options are displayed on the screen. Based upon the requirements, the command can be configured.

Screen Clipping

Step2. Specify the command to analyse the XSD file. In this step, we process the XSD file to obtain the metadata information and store it in Flexter’s DB

xsd2er -g3 /home/anveshg/irsnew.zip

Screen Clipping

The logical schema number can be seen from the output. Note down this number which is to be used in xml2er

Screen Clipping

Logical schema number: 658

Loading IRS XML data into PostgreSQL

In the following steps, we describe the loading of XML data into PostgreSQL database

Step 1. After installing the xml2er package, go to command prompt

Step 2. Enter xml2er. The output is shown below

This command displays the various options that are provisioned in Flexter enterprise. The command for processing the XML files can be modified based upon these options. There are various switches provisioned in Flexter which can be used to configure the output data. Based upon the output requirement, the switch can be selected. For example, if the output is supposed to be a TSV file, then select the format as -f tsv

Command Description
-o, Output location path
-u, Output user
-p, Output password
-f, Format of output. (jdbc, parquet, json, csv, tsv)
-z, Parquet, csv, tsv compression mode (uncompressed, snappy, gzip, lzo, lz4, bzip2)
-S, Save Mode when table, directory or file exists ex: [e]rror, [a]ppend, [o]verwrite, [i]gnore default: append
-Y, Number of partitions for writing data
-e, Mode of parsing: [a]ll, [d]ata, [s]tats default: all
-b, Block Size. 1024kb, 64mb, 128mb...
-B, Batch size to write into databases default: 1000
-c, Show SQL commands
-s, Skip writing results

Step 3. The target output is the PostgreSQL database. A new database needs to be created in postgres to be specified as the target. We haven’t detailed the steps to create a database in postgres as part of this walk-through.

Step 4. In the next step, we run the xml2er command to process the XML data. On the command prompt, enter

xml2er -l656 -o jdbc:postgresql://host:port /irs username-u irs -p password /home/anveshg/irsnew.zip

The description of the attributes in the command is given below

L656 – logical schema number obtained from xsd2er

o – output path

jdbc:postgresql://host:5432/IRS – jdbc connector to postgres database

u – username

p – password

Step 5. Run the command

Step 6. After job completion

Step 7. The tables are created in PostgreSQL

The IRS XML files have been parsed into PostgreSQL database using Flexter enterprise.

Connecting to PostgreSQL and querying data

We can access postgreSQL from the command line and query.

Screen Clipping

A sample query executed in postgreSQL with the IRS XML data is shown below

Output:

Screen Clipping

We have processed the IRS XML data using Flexter enterprise and the output was loaded into PostgreSQL for querying.

Which data formats apart from XML also give you the heebie jeebies and need to be liberated? Please leave a comment below or reach out to us.