In this example we will use Flexter to convert an XML file to the Apache Avro format. We then query and analyse the output in the Spark-Shell.
Flexter can generate a target schema from an XML file or a combination of XML and XML schema (XSD) files. We will use the data from The ACORD RLC Insurance and Reinsurance Service Business Message Specification which is the result of a cooperative industry effort of RLC members including ins/reinsurers, ins/reinsurance intermediaries, clients/cedents and technology service providers.
We will be parsing the RegulatoryReporting and the Placing Message template files using Flexter and will be using the ACORD RLC XML Schema which is complied in conjunction with the industry standards facilitating fast, accurate data exchange. The XML and XSD files can be downloaded from the link here. The XML files used are

  • RegulatoryReporting-template2015-04.xml
  • Placing-template2015-04.xml

And the XSD file used is

  • Jv-Ins-Reinsurance-2015-04.xsd

Both the XML files and the XSD are available and we use the information from both files to generate the target schema.

Now that we have gathered statistics from our XML sample we can generate the logical target schema with the xsd2er command line tool using the -k switch (-k is shortcut for –use-stats)

Let’s go through the steps

Happy days. Now we use the Logical Schema ID  (origin: 60 and 61) to convert the XML data to avro.
Next we need to add the avro package to use it with xml2er. We can submit sprak-submit parameters in the following format.

Here, we can use –packages and –conf, spark-submit parameter to ensure that avro library and its dependencies are added to successfully parse and extract the XML files in the desired format.

When the command is ready, removing –skip or -s, allows us to process the data. We direct the parquet output to the output directory for the data.xml file. We can use the $PWD variable in the shell to generate the output files in the current working directory. Also, we use  -f for the output format(like, jdbc, parquet, orc, json, csv, tsv), the default format is parquet and we want the output in avro format in our case.

We can find the extracted avro files in the output folder. It is a directory structure, which you can find in the current directory. We can ‘ls’ to see the contents of the .avro folder as shown below.

The complete list of avro files parsed using Flexter can be downloaded here for Placing-template and here for RegulatoryReporting-template.

Once we have initiated the spark-shell, we can proceed with reading the avro files generated and import them as dataframes in spark.

We can take a look at the schema of the data frames generated and do some preliminary analysis before proceeding further on the data parsed. For example, let’s look at the “Tax Provisions table” created above.

We can see the headers and data types of the various columns. We can also perform some basic analysis on the dataset in Scala and look at the various variables present

Various basic data processing tasks can be performed on the dataframe generated. The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.

We can also use the ER diagram for the XML files parsed and do analysis on the basis of the requirement as per the diagram in the spark-shell once we have all the data frames ready for all the avro files. The ER mapping for the RegulatoryReporting-template2015-04.xml  and the Placing-template2015-04.xml files can be found here.