You don’t have many options for processing XML on BigQuery.
If your data is in Avro, JSON, Parquet, etc., you can load it into BigQuery easily. But what about XML? BigQuery does not provide any native support for XML documents. Converting XML to BigQuery is a manual, time-consuming, and error-prone process, even if you have an ETL tool.
One option is to convert XML using Apache Beam.
Step 1: Convert the XML file to a Python dictionary.
In the first step, we convert the XML file into a Python dictionary using the ‘xmltodict’ package, which parses the XML structure and returns it as a nested Python dictionary.
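As a sketch of what this conversion produces, here is a minimal converter written with only the standard library, similar in spirit to what `xmltodict.parse()` returns (the `order` document is a made-up example; a simple version like this ignores XML attributes, which xmltodict would expose with an `@` prefix):

```python
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Recursively convert an ElementTree element into a nested dict,
    roughly mimicking xmltodict.parse(). Attributes are ignored here."""
    children = list(elem)
    if not children:
        return elem.text
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Repeated sibling tags collapse into a list, as xmltodict does
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result

xml_doc = """
<order>
  <customer>Jane</customer>
  <item><sku>A1</sku></item>
  <item><sku>B2</sku></item>
</order>
"""

root = ET.fromstring(xml_doc)
record = {root.tag: element_to_dict(root)}
# record -> {'order': {'customer': 'Jane',
#                      'item': [{'sku': 'A1'}, {'sku': 'B2'}]}}
```

The important point for the next step is the shape of the result: child elements become nested dictionaries, and repeated elements become lists.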
Step 2: Specify the schema of the output table in BigQuery.
In this step, we define the schema of the table into which we want to load the data. If an element has child elements, we use the ‘record’ (struct) data type to store it in BigQuery.
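For illustration, here is what such a schema might look like for a hypothetical order document with repeated nested items, written in the JSON-style form that BigQuery and Beam's BigQuery sink accept (field names and types are invented for this example, not taken from any real SDMX schema):

```python
# Hypothetical schema: nested child elements map to a RECORD field,
# and repeated elements get mode REPEATED (an ARRAY in BigQuery).
order_schema = {
    "fields": [
        {"name": "order_id", "type": "STRING", "mode": "NULLABLE"},
        {"name": "customer", "type": "STRING", "mode": "NULLABLE"},
        {
            "name": "item",
            "type": "RECORD",    # child elements -> STRUCT
            "mode": "REPEATED",  # repeated elements -> ARRAY
            "fields": [
                {"name": "sku", "type": "STRING", "mode": "NULLABLE"},
                {"name": "quantity", "type": "INTEGER", "mode": "NULLABLE"},
            ],
        },
    ]
}
```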
Step 3: Run an Apache Beam job to read XML and load it to the BigQuery table.
The next step is to write a Beam pipeline that reads the XML file and populates the BigQuery table. The job can be run on Google Cloud Dataflow.
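A minimal sketch of the per-document transformation such a pipeline would apply is shown below, using only the standard library (the element names are invented for illustration). In the actual Beam job this function would sit inside a `beam.FlatMap` or a `DoFn`, with `beam.io.WriteToBigQuery` as the sink:

```python
import xml.etree.ElementTree as ET

def xml_to_rows(xml_string):
    """Turn one XML document into BigQuery row dicts.
    In a Beam pipeline, this logic would live inside a DoFn."""
    root = ET.fromstring(xml_string)
    for order in root.iter("order"):
        yield {
            "order_id": order.get("id"),
            "customer": order.findtext("customer"),
            "item": [
                {"sku": item.findtext("sku")}
                for item in order.findall("item")
            ],
        }

# With apache-beam installed, the pipeline would look roughly like:
#   (p | beam.io.ReadFromText(input_path)
#      | beam.FlatMap(xml_to_rows)
#      | beam.io.WriteToBigQuery(table, schema=order_schema_json))

doc = ('<orders><order id="1"><customer>Jane</customer>'
       '<item><sku>A1</sku></item></order></orders>')
rows = list(xml_to_rows(doc))
```

Each yielded dictionary matches the table schema, with the nested `item` list landing in a repeated RECORD column.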
In this approach, we are essentially converting XML to a JSON-like structure and then loading it into BigQuery, using the ‘record’ data type to store the nested elements. BigQuery allows a maximum of 15 levels of nesting for the ‘record’ type, so this approach cannot load multilevel nested data that goes deeper than 15 levels. There is also the overhead of writing code for the Apache Beam pipeline.
In short, if we want to use BigQuery as a platform for a use case where XML processing is required, the only option is to convert the XML to CSV or a JSON-like format (a Python dictionary). Once we have converted the file, we can load it into BigQuery directly or run Apache Beam jobs on Dataflow.
The overhead of writing the code to change the format will be significant and you will face the following challenges:
- XML conversion projects tend to run over budget or fail completely
- Data engineers don’t have the required niche skills to efficiently convert XML documents
- XML conversion projects tend to run into performance issues
Data warehouse legend Ralph Kimball agrees. In his book The Data Warehouse ETL Toolkit, he mentions that the “… construction of an XML parser is a project in itself – not to be attempted by the data warehouse team”.
We are aligned with this quote based on our own painful experience. We thought there must be a better way. Hence we came up with Flexter. Flexter is a data warehouse automation solution for XML, JSON, and industry data standards. It converts XML and JSON to a data warehouse, a database, a data lake, or big data formats such as ORC, Parquet, or Avro.
In this blog post we use Flexter to automatically convert XML from the SDMX industry data standard to BigQuery. Let’s briefly introduce SDMX, which has seen an increase in popularity over the last few years. The European Central Bank (ECB) uses SDMX to exchange statistical data and metadata with its partners in the European System of Central Banks (ESCB) and other organisations world-wide.
SDMX (Statistical Data and Metadata eXchange) is an international initiative that aims at standardising and modernising (“industrialising”) the mechanisms and processes for the exchange of statistical data and metadata among international organisations and their member countries.
SDMX is sponsored by seven international organisations including the Bank for International Settlements (BIS), the European Central Bank (ECB), Eurostat (Statistical Office of the European Union), the International Monetary Fund (IMF), the Organisation for Economic Cooperation and Development (OECD), the United Nations Statistical Division (UNSD), and the World Bank.
You can download a sample of SDMX-ML (XML) files and the corresponding schema files (XSD) from the SDMX website.
Converting XML to BigQuery with Flexter is a fully automated process. It requires two simple steps.
- Create a Flexter Data Flow. The whole process is fully automated and you don’t need to create manual mappings
- Convert your XML documents
The figure below summarises the two steps.
First we generate a Data Flow using an XSD and/or a sample of representative XML files. A Flexter Data Flow is a logical target schema that maps the elements from the XML documents to BigQuery table columns. Both are stored in Flexter’s metadata catalog.
Flexter can generate a target schema from an XSD, a sample of XML files or a combination of the two.
We issue the following command to Flexter with the location of our XSD and XML files:
xsd2er -g3 -a SDMX\ XSD.zip SDMX\ XML.zip
The Data Flow has been created. We are now ready to convert our XML documents to a BigQuery relational format (tables and columns).
We pass in the schema_id from the Data Flow we created in the previous step, the connection details to BigQuery, and the path to the XML documents we want to convert:
xml2er -x <schema_id> -o "bigquery://sdmx_stats?temp=sonra_temp" SDMX\ XML.zip
Next we can familiarise ourselves with the target schema using the ER diagram that Flexter has generated.
In a final step we can query the data directly on BigQuery or apply some downstream processing to it.
Flexter can automate the conversion of XML to BigQuery. But wait, there is more: Flexter also ships with various AI optimisation algorithms.
The purpose of the optimisation algorithms is to simplify the output of Flexter and make it as easy as possible to work with for downstream consumers. Please refer to our blog post The Secrets to Unlocking XML Data in an Optimised Format for Downstream Users.
You can download our XML conversion checklist The 6 Factors You Need to Get Right to Make Your XML Conversion Project a Success.
Read a case study on how ecobee uses Flexter to convert XML to BigQuery.
You can request access to the free version of Flexter.
Last but not least we would like to find out more about your use case. Contact us and tell us about your challenges of working with XML.