AWS Athena does not have native support for XML. Processing XML files on AWS Athena is slow, manual, and error prone. One option is to use AWS Glue and convert your XML to CSV, TSV or Parquet.
You would need to have technical knowledge to use AWS Glue and to do a lot of manual coding. For AWS Glue you would need to have knowledge of Spark, since all tasks are running in Apache Spark. Also you would need to have a knowledge of Python for ETL coding.
Another option is to use Flexter to convert XML to Athena. It automatically converts XML to Athena, without any manual coding. With Flexter you can convert any XML to Athena in seconds. Flexter fully supports the XSD spec. It can convert any XML format to a database, text, or big data formats such as Parquet, ORC, Avro.
For this post we will be converting the Common Reporting Standard (CRS) XML file to Athena using Flexter.
Common Reporting Standard (CRS)
The CRS XML Schema is designed to be used for the automatic exchange of financial account information between Competent Authorities (“CAs”). In addition the CRS could also be used for domestic reporting by Financial Institutions (“FIs”) to domestic tax authorities under the CRS.
The CRS schema is re-using the FATCA schema and elements of Standard Transmission Format “STF”, so there are some elements in the CRS schema that are not required for purposes of reporting and exchange under the CRS (e.g. Pool Report and Nationality). These elements are shown in the User Guide as optional, followed by “non-CRS”
You can download a sample CRS file from here.
Converting CRS XML to Athena with Flexter
Converting CRS XML to Athena is a fully automated process, without manual coding. You can do it in a couple of simple steps.
- Create a Data Flow. When you create Data Flow, Flexter creates a logical target schema and the mappings between the XML elements in the Source XML and the table columns in the relational target schema.
- Process and convert XML File.
The figure below summarises the two steps
We will now show you these steps in action.
Data Flow creation
First we generate a Data Flow using an XSD and/or a sample of representative XML files. A Flexter Data Flow is a logical target schema that maps the elements from the XML documents to table columns. Both are stored in Flexter’s metadata catalog.
Flexter can generate a target schema from an XSD, a sample of XML files or a combination of the two.
We issue the following command to Flexter with the location of our XSD and XML files
xsd2er -g3 -a CRS\ XSD.zip CRS\ XML.zip
Now that Data Flow has been created, we can convert CRS XML documents to a relational format.
We pass in the schema_id from the Data Flow we created in the previous step, the connection details to Athena, and the path to the XML documents we want to convert
xml2er -x <origin> -S -o -V -E athena-schema s3a://xml-bucket/
Query the data
Now that our XML file is converted, we can check the ER Diagram and learn more about the target schema.
In the last step check our data we can query it with AWS Athena. We converted the data to TSV in this instance. You can also use Flexter to convert your data to Parquet.
XML optimisation algorithms
Flexter can automate the conversion of XML to Athena. But wait, there is more. Flexter also ships with various AI optimisation algorithms
The purpose of the optimisation algorithms is to simplify the output of Flexter and make it as easy as possible to work with for downstream consumers. Please refer to our blog post The Secrets to Unlocking XML Data in an Optimised Format for Downstream Users
You can find out more about Flexter by visiting the Flexter product page or the Flexter data sheet.
You can download our XML conversion checklist The 6 Factors You Need to Get Right to Make Your XML Conversion Project a Success
Last but not least we would like to find out more about your use case. Contact us and tell us about your challenges of working with XML.