XML, XML,

Converting SDMX XML to BigQuery

June 22, 2021

If our data is in Avro, JSON, Parquet, etc. then you can load it easily to BigQuery. Got XML for BigQuery? BigQuery does not provide any native support to deal with XML documents. You can manually code and load it with Apache Beam – we dive into that in this post. Or, if you’re looking for a smoother way, why not let XML conversion software do the trick? We get into the nitty-gritty details of doing just that with Flexter, turning SDXML into BigQuery tables and columns. There’s a free online version of Flexter you can take for a spin to parse your XML into a relational format.

Loading XML using Apache Beam pipeline

Step 1. Convert the XML file to Python Dictionary.

In the first step we convert the XML file into a Python dictionary using the ‘xmltodict’ package. This package provides a method to parse the XML structure and convert it to a Python dictionary.

Step 2: Specify the schema of the output table in BigQuery.

In this step, we just need to define the schema of the table where we want to load the data. If the particular element has child elements, then we should use ‘record’ (struct) datatype to store it in BigQuery.

Step 3: Run an Apache Beam job to read XML and load it to the BigQuery table.

The next step is to write the Beam pipeline to take an XML file and use it to populate a BigQuery table. This job can be run using Google Cloud Dataflow.

In this approach, we are basically converting XML to a JSON like structure and then loading it into BigQuery. We are using the ‘record’ datatype to store the nested elements. The maximum nested depth allowed for the ‘record’ type is 15 levels. So, we cannot use this approach to load multilevel nested data which contains more than 15 levels. Also, there is an overhead of writing code for the apache beam pipeline.

In short, if we want to use BigQuery as a platform for a use case where XML processing is required then the only option is converting it from XML to CSV or a JSON like format (Python Dictionary). Once we have converted the file, we can use BigQuery to load or use Dataflow to run Apache Beam jobs.

The overhead of writing the code to change the format will be significant and you will face the following challenges:

  • XML conversion projects tend to run over budget or fail completely
  • Data engineers don’t have the required niche skills to efficiently convert XML documents
  • XML conversion projects tend to run into performance issues

Data warehouse legend Ralph Kimball agrees. In his book The Data Warehouse ETL Toolkit, he mentions that the “… construction of an XML parser is a project in itself – not to be attempted by the data warehouse team”.

We are aligned with this quote based on our own painful experience. We thought there must be a better way. Hence we came up with Flexter. Flexter is a data warehouse automation solution for XML, JSON, and industry data standards. It converts XML and JSON to a data warehouse, a database, a data lake, or big data formats such as ORC, Parquet, Avro.

In this blog post we use Flexter to automatically convert XML from the SDMX industry data standard to BigQuery. Let’s briefly introduce SDMX, which has seen an increase in popularity over the last few years. The European Central Bank (ECB) uses SDMX to exchange statistical data and metadata with its partners in the European System of Central Banks (ESCB) and other organisations world-wide

SDMX XML

SDMX (Statistical Data and Metadata eXchange) is an international initiative that aims at standardising and modernising (“industrialising”) the mechanisms and processes for the exchange of statistical data and metadata among international organisations and their member countries.

SDMX is sponsored by seven international organisations including the Bank for International Settlements (BIS), the European Central Bank (ECB), Eurostat (Statistical Office of the European Union), the International Monetary Fund (IMF), the Organisation for Economic Cooperation and Development (OECD), the United Nations Statistical Division (UNSD), and the World Bank.

You can download a sample of SDMX-ML (XML) files and the corresponding schema files (XSD) from the SDMX website.

Converting SDMX XML to BigQuery with Flexter

Converting XML to BigQuery with Flexter is a fully automated process. It requires two simple steps.

  • Create a Flexter Data Flow. The whole process is fully automated and you don’t need to create manual mappings
  • Convert your XML documents

The figure below summarises the two steps

Data Flow creation

First we generate a Data Flow using an XSD and/or a sample of representative XML files. A Flexter Data Flow is a logical target schema that maps the elements from the XML documents to BigQuery table columns. Both are stored in Flexter’s metadata catalog.

Flexter can generate a target schema from an XSD, a sample of XML files or a combination of the two.

We issue the following command to Flexter with the location of our XSD and XML files

XML processing

The Data Flow has been created. We are now ready to convert our XML documents to a BigQuery relational format (tables and columns).

We pass in the schema_id from the Data Flow we created in the previous step, the connection details to BigQuery, and the path to the XML documents we want to convert

Query the data

Next we can familiarise ourselves with the target schema using the ER diagram that Flexter has generated.

In a final step we can query the data directly on BigQuery or apply some downstream processing to it.

XML optimisation algorithms

Flexter can automate the conversion of XML to BigQuery. But wait, there is more. Flexter also ships with various AI optimisation algorithms

The purpose of the optimisation algorithms is to simplify the output of Flexter and make it as easy as possible to work with for downstream consumers. Please refer to our blog post The Secrets to Unlocking XML Data in an Optimised Format for Downstream Users

What’s next

You can find out more about Flexter by visiting the Flexter product page or the Flexter data sheet.

Why not give the free online version of our XML converter Flexter a try. See for yourself what XML conversion software can do for you.

Read a case study how ecobee uses Flexter to convert XML to BigQuery

Last but not least we would like to find out more about your use case. Contact us and tell us about your challenges of working with XML.