Flexter, Informatica, and Redshift work Hand in Hand to convert ESMA XML

Anvesh Gali ETL, Uncategorized, XML

In this walk-through, we combine two powerful software platforms to present a highly efficient and user-friendly method to perform ETL of complex XML files. This implementation uses Flexter, which is a powerful tool for converting complex XML files to a database or text and Informatica for ETL. We will convert ESMA XML files (these files contain the reporting specifications and transactions of ESMA for the European Union’s financial system) to TSV using Flexter, initiate ETL using Informatica and load the TSV files into Amazon Redshift using a JDBC connection

At a glance

An overview of the steps involved in this process is shown below.

About Informatica

Informatica offers an Enterprise Cloud Data Management platform that has made a global stand as a reliable cloud data integration service. It presents a robust platform for data services and can integrate with a variety of tools and other cloud services. Informatica can be used to perform the ETL process to various cloud data system in an easy and efficient manner. For this implementation, we use the Cloud Data Integration tool on Informatica cloud to perform the ETL. This data synchronization tool can be used to establish connections to data sources and other cloud platforms (Redshift etc.) using JDBC and can be configured for streaming of data at regular intervals.

Rationale behind combining Flexter and Informatica

As an ETL tool, Informatica can process XML files as part of the data load process. So, what prompts us to use Flexter? The answer lies in the complexity of the XML files, processing times, scalability, and the ease of use.

We analysed and contrasted the processing of XML files using Flexter and Informatica. We tried to process the ESMA XML data using Flexter and Informatica and here’s what we found!

Flexter scores considerably over Informatica owing to its ability to be distributed across multiple nodes for scalability and performance. In addition, Flexter can support modular schemas - multiple linked XSD files - auto-import of XSDs from http(s), ftp, sftp, hdfs and support for schemas in archives (zip, gz, tgz, 7z).

Comparison between Flexter and Informatica for XML processing

The data was processed through Flexter in a simple manner and we could download the TSV output almost instantaneously. It was free of charge and can be used multiple times.

When it came to Informatica for XML conversion, the first hurdle was the non-availability of a built in XML processing utility. An add-on utility needs to be purchased (you can get it for free for the trial period after which an hourly/annual charge applies). A connection to the XML files can be established using this utility and then processed.

In addition to Flexter’s web service utility – where users can upload XMLs and download the TSV files – an enterprise edition can be used wherein these steps can be performed programmatically by calling the Flexter API. The Flexter API provides users with the option to specify the data lineage – source to target map, XML schema visualisation, data partitioning and the error logs. You can seamlessly integrate Flexter with any ETL tool as outlined in these posts where we integrate Flexter with Oracle Data Integrator and IRI Voracity.

Another major drawback was Informatica’s inability to process multiple XML files in one go. While Flexter does this in a straightforward manner, Informatica requires additional configurations. This means that users need to create complex loops in Informatica’s GUI to process the files one by one. This has a huge performance impact as it does not process the XMLs in bulk. This can be a major hurdle when thousands of XMLs must be loaded into a target. It also requires extra ETL developer time as these data orchestration pipelines have to be engineered.

We have collated a view of the comparison and the features of these two tools below.

Flexter Informatica
Flexter if a dedicated tool for XML processing. It can process any XML industry standard data Informatica is a cloud data management service that can integrate different data sources for processing
Flexter provides a highly structured data output and can process multiple XML files simultaneously Informatica does not have a built in XML processing capability. A plug in must be purchased for this utility.
Flexter does not place any restriction on the data volume and the output can be processed quickly Informatica cannot directly process multiple XML files. This needs an indirect source loading and involves a lengthy setup process
No need for any development/training. XMLs can be processed using the free version of Flexter and data becomes available instantly Informatica charges users on a hourly/annual basis and the XML processing utility will be billed separately
Can be used as an independent software and is highly cost efficient. Flexter can process any industry standard XML.. There is no information regarding the processing of industry standard XML data using Informatica There is a limit on the data usage. Additional charges may apply for large files
Data can be easily downloaded and analysed using any tool available on the market XML processing can be completed and viewed after setting up the target connections and agents

Flexter Vs Informatica

In the following steps, we have described the processing of XML files and the ETL how to in detail.

XML data from ESMA

We have sourced the data from ESMA. ESMA is an independent EU Authority that contributes to safeguarding the stability of the European Union's financial system by enhancing the protection of investors and promoting stable and orderly financial markets. The data includes the reporting specifications and transactions. The specifications include the financial regulations laid down as part of different committees and the related market transactions.

Processing the XML file

We will be using the free version of Flexter to convert the XML data to Amazon Redshift.

Convert ESMA XML to Text Using Flexter

Flexter is a tool that can transform complex XMLs into readable files that can be used for analysis. The step by step instructions for transforming XML files using Flexter are shown below. We will process the ESMA XML using the free version of Flexter. This can also be performed using Flexter API programmatically. A quick comparison of these two methods of XML processing is given below

For this exercise, we have used the free version of Flexter. The steps involved in the XML processing using Flexter are given below.

Screen Clipping

  • Click on the ‘Terms and Conditions’ checkbox and click ‘Try Flexter for Free’

Screen Clipping

  • Upload the XML file that needs to be transformed and click ‘Continue’

Screen Clipping

  • On the next page, we can upload the XSD file (if available). For this implementation, the XSD file is not available and hence we will click on ‘Skip ‘and proceed to the next page.

Screen Clipping

  • Provide Email and click ‘Continue’
  • On the next page, click on ‘Finish’ and the output from Flexter will be delivered to the subscribed email address.

Screen Clipping

  • An email from Flexter with the link to the transformed data is obtained. The TSV files can be downloaded from this link which redirects to the Flexter page.

Screen Clipping

Screen Clipping

  • For the XML files from ESMA, a total of 28 TSV files were generated and downloaded.

Screen Clipping

In the next steps, we have detailed the process of setting up Informatica and performing the ETL.

Getting started with Informatica

Informatica offers users with a free trial. Follow the below steps to set up a free account for ETL processing.

Step 1. Go to Informatica Free Trial

Screen Clipping

Step2. Fill in the details and click ‘Start your free trial’

Screen Clipping

Step3. An email with the log in instructions will be sent to the registered address. Click on ‘Confirm account’

Screen Clipping

Step 4. Set up a password and security question Screen Clipping

Step5. The Informatica home page is displayed.

Screen Clipping

Configuring Redshift and JDBC connectivity

The next steps would be to setup the Redshift account with AWS and configure a JDBC connection.

Step1. Go to Amazon Redshift

Screen Clipping

Step2. Click on ‘Start Free Trial’

Screen Clipping

Step3. Enter AWS user name and password and click ‘Sign In’

Screen Clipping

Redshift account is ready for use. In the next steps, we will create a cluster in Redshift which can then be connected to Informatica.

Creating a Redshift Cluster

The most important component for a data warehouse in Amazon Redshift is the cluster. An Amazon Redshift data warehouse is a collection of computing and storage resources called nodes, which are organized into a group called a cluster. Each cluster runs an Amazon Redshift engine and contains one or more databases. The following steps detail the process of creating a Redshift cluster.

Step 1. Login to the Redshift console on AWS and click on Clusters

Screen Clipping

Step 2. Click on ‘Launch Cluster’

Screen Clipping

Step 3. The cluster details page is displayed as shown below

Screen Clipping

Step 4. Specify the cluster identifier as ‘flexter-esma’, database name as ‘esma’, the username and password and click ‘Continue’

Screen Clipping

Step 5. Select the node type as dc1.large and the cluster type as ‘Single Node’. The node type and the cluster types are selected based on the storage requirement and the compute speed. For this implementation, a single node would suffice.

Screen Clipping

Step 6. In the additional configuration page, the default values which are pre-populated are left unchanged. Ensure that the VPC security group is selected as ‘default’ and click ‘Continue’

Screen Clipping

Step 7. Review the details and click ‘Continue’

Screen Clipping

Step 8. The below message is displayed. Click on ‘Close’

Screen Clipping

Step 9. The cluster dashboard is displayed as shown.

Screen Clipping

We can see that the cluster flexter-esma has been created and is available. The next step would be the database schema setup in the Redshift cluster.

Working with Redshift

In the following steps, we detail the use of SQL Workbench/J to query in the Redshift cluster. SQL Workbench/J is a free, DBMS-independent, cross-platform SQL query tool. Its main focus is on running SQL scripts (either interactively or as a batch) and export/import features.

Step 1. Go to SQL Workbench/J downloads

Screen Clipping

Step 2. Click on ‘Download generic package for all systems’. A ZIP file is downloaded. Extract the contents.

Screen Clipping

Step 3. Launch the SQLWorkbench application

Screen Clipping

Step 4. SQL Workbench/J needs to be connected to Redshift using a JDBC driver. Download the Amazon Redshift JDBC driver. To add this connection in SQL workbench/J, go to File -> Connect Window

Screen Clipping

Step 5. Click on Manage Drivers

Screen Clipping

Step 6. Click ‘Create a new entry’ icon

Screen Clipping

Step 7. Specify the Name as ‘Amazon Redshift’ and library as the location of the downloaded driver (.jar file) and click ‘Ok’

Screen Clipping

Step 8. In the ‘Select connection profile’ window, specify the name as flexter_esma, the driver as ‘Amazon_Redshift’. For the URL, we need to go to the cluster dashboard on Redhsift, click on cluster flexter_esma and select the JDBC URL.

Screen Clipping

Screen Clipping

Step 9. The security group in the Redshift cluster needs to be configured to allow inbound connections. Click on cluster flexter-esma and select the VPC security group

Screen Clipping

Step 10. Select Inbound. Click ‘Edit’ and specify the type as ‘Redshift’, port as ‘5439’and Source as 0.0.0.0/0 and click ‘Save’

Screen Clipping

Step 11. The security group is configured as shown below

Screen Clipping

Step 9. On the SQL Workbench/J console, specify all details including the JDBC URL, username and password (as specified for the database) and click ‘Test’’

Screen Clipping

Step 10. Message displaying the successful connectivity should be displayed

Screen Clipping

Step 11. In the SQL Workbench/J page, the cluster schema is displayed

Screen Clipping

Step 12. For this exercise, we create a table in the Redshift cluster using SQL workbench/J

Screen Clipping

In a similar manner, the database can be created as per the schema. The next stage would be the ETL from Informatica.

ETL into Redshift using Informatica

In the next steps, we will be creating a data synchronization task to complete the ETL process.

Step 1. Login to the Informatica console and go to ‘Overview tab’

Screen Clipping

Step 2. Click on ‘Download the Secure Agent’ and install it. The secure agent should be up and running.

Screen Clipping

Step 3. Go to Task Wizards on the console and select ‘Data Synchronization’ and click ‘New’

Screen Clipping

Step 4. Specify the task name as ‘esma_flexter’ and the operation as ‘Upsert’. Click ‘Next’

Screen Clipping

Step 5. Under Source, click ‘New’ in Connection

Screen Clipping

Step 6. Specify connection name as Flexter_output_esma and type as ‘Flat file’. After selecting the type, new fields for Runtime environment and Directory are displayed. Select the directory, code page as UTF-8 (for TSV files) and Click ‘Test’

Screen Clipping

Step 7. The connection should be successful

Screen Clipping

Step 8. In the source window, select formatting options

Screen Clipping

Step 9. Select delimiter as Tab and click ‘Ok’. Select the source object as necessary (the source object is the TSV file that is to be loaded)

Screen Clipping

Step 10. In the Target page, click on New under Connection

Screen Clipping

Screen Clipping

Step 11. Enter the connection name as flexert_to_redshift. Specify the AWS user name, passwords and the JDBC URL that can be copied from the cluster dashboard (Redshift)

Screen Clipping

Step 12. The connection should be successful

Screen Clipping

Step 13. In the target page select the object as the table (created using SQL Workbench/J)

Screen Clipping

Step 14. Select the data filters. In this exercise, we do not specify any data filters

Screen Clipping

Step 15. On the next page verify if the field mapping is accurate

Screen Clipping

Step 16. On the schedule page specify the S3 bucket value as flexteroutput and select Save and Run

Screen Clipping

Step 17. Activity monitor showing the task in progress is displayed

Screen Clipping

The ETL from Flexter to Redshift through Informatica is complete. The free version of Flexter is an easy and efficient tool for parsing the ESMA XML files in a seamless manner. The output from Flexter can be integrated across various cloud platforms using Informatica, which makes it a highly useful combination for data synchronization and analysis.

Which data formats apart from XML also give you the heebie jeebies and need to be liberated? Please leave a comment below or reach out to us.