Loading IRS Data into PostgreSQL database using Flexter Enterprise
In this walkthrough, we load the IRS XML files into PostgreSQL database using Flexter Enterprise edition. Flexter Enterprise provisions users with all features. The enterprise version provides users with numerous additional features which aren’t available on the free version of Flexter.
In the following steps, we discuss the process of setting up Flexter Enterprise and running commands for XML parsing into PostgreSQL
The Internal Revenue Service (IRS) is the revenue service of the United States federal government. The government agency is a bureau of the Department of the Treasury, and is under the immediate direction of the Commissioner of Internal Revenue, who is appointed for a five-year term by the President of the United States. The IRS is responsible for collecting taxes and administering the Internal Revenue Code, the federal statutory tax law of the U.S. Its duty to maximize tax revenue entails providing tax assistance to taxpayers, as well as pursuing and resolving instances of erroneous or fraudulent tax filings.
The data that has been obtained for this walkthrough contains the FATCA payload data and the metadata specifications. The FATCA XML data is based on business requirements collected by a multilateral working group. The schema uses elements from existing reporting schemas used by the OECD and the European Union (EU) to reduce burden on reporting entities.
How Flexter works
A detailed description of how Flexter works is provided here.
The Flexter platform consists of three pluggable modules:
Schema Analyser (xsd2er)
Mapping generator (CalcMap)
Xml Processor (xml2er)
Step 1: The Schema Analyser is a dedicated module that loads, parses out, processes and stores the XML schema information in Flexter’s internal metadata DB. This step is only required to be performed once for each schema to be processed. You can either supply an XSD or a representative sample of XML files for this step.
Step 2: Now that we know the exact layout of the source XML it is possible to generate the relational equivalent. Flexter’s module, Mapping Generator generates the output schema layout and the mapping to it. Various optimisations of the target schema can be applied during this step.
Step 3: The XML Processor module takes the information generated from the two previous steps, processes the XML, and writes the data to the relational target schema.
Java run time environment (Java 8 preferably) should be up and running
PostgreSQL should be installed and configured
If you have any more questions please refer to our FAQ.
Loading IRS data into PostgreSQL database
Generating the metadata and logical schema
In the following steps, we describe the process of generating the logical schema for this exercise.
Step1. After installing the xsd2er package, go to command prompt and enter xsd2er. The xsd2er command processes the schema and stores the information in the internal metadata DB. When this command is executed, the input and metadata processing options are displayed on the screen. Based upon the requirements, the command can be configured.
Step2. Specify the command to analyse the XSD file. In this step, we process the XSD file to obtain the metadata information and store it in Flexter’s DB
xsd2er -g3 /home/anveshg/irsnew.zip
The logical schema number can be seen from the output. Note down this number which is to be used in xml2er
Logical schema number: 658
Loading IRS XML data into PostgreSQL
In the following steps, we describe the loading of XML data into PostgreSQL database
Step 1. After installing the xml2er package, go to command prompt
Step 2. Enter xml2er. The output is shown below
This command displays the various options that are provisioned in Flexter enterprise. The command for processing the XML files can be modified based upon these options. There are various switches provisioned in Flexter which can be used to configure the output data. Based upon the output requirement, the switch can be selected. For example, if the output is supposed to be a TSV file, then select the format as -f tsv
|-o,||Output location path|
|-f,||Format of output. (jdbc, parquet, json, csv, tsv)|
|-z,||Parquet, csv, tsv compression mode (uncompressed, snappy, gzip, lzo, lz4, bzip2)|
|-S,||Save Mode when table, directory or file exists ex: [e]rror, [a]ppend, [o]verwrite, [i]gnore default: append|
|-Y,||Number of partitions for writing data|
|-e,||Mode of parsing: [a]ll, [d]ata, [s]tats default: all|
|-b,||Block Size. 1024kb, 64mb, 128mb…|
|-B,||Batch size to write into databases default: 1000|
|-c,||Show SQL commands|
|-s,||Skip writing results|
Step 3. The target output is the PostgreSQL database. A new database needs to be created in postgres to be specified as the target. We haven’t detailed the steps to create a database in postgres as part of this walk-through.
Step 4. In the next step, we run the xml2er command to process the XML data. On the command prompt, enter
xml2er -l656 -o jdbc:postgresql://host:port /irs username-u irs -p password /home/anveshg/irsnew.zip
The description of the attributes in the command is given below
L656 – logical schema number obtained from xsd2er
o – output path
jdbc:postgresql://host:5432/IRS – jdbc connector to postgres database
u – username
p – password
Step 5. Run the command
Step 6. After job completion
Step 7. The tables are created in PostgreSQL
The IRS XML files have been parsed into PostgreSQL database using Flexter enterprise.
Connecting to PostgreSQL and querying data
We can access postgreSQL from the command line and query.
A sample query executed in postgreSQL with the IRS XML data is shown below
select count(*) CountryNm,CountryNm from financialinstitution2 group by CountryNm;
We have processed the IRS XML data using Flexter enterprise and the output was loaded into PostgreSQL for querying.
Which data formats apart from XML also give you the heebie jeebies and need to be liberated? Please leave a comment below or reach out to us.