About this walk-through
In this walk-through we will load XML files processed by Flexter into BigQuery. Flexter is a powerful XML parsing tool that can convert complex XML files into readable format which can then be loaded into BigQuery which is an enterprise data warehouse available on the Google cloud platform.
We have chosen the XML data from FpML for this exercise. FpML (Financial Products Markup Language) is an XML message standard for the OTC Derivatives industry. The standard includes all categories of privately negotiated derivatives. We download the XML data from the FpML website.
We will then use the free version of Flexter to convert the XML data to BigQuery. BigQuery is a distributed database on the Google Cloud that can be used for Business Intelligence use cases. BigQuery does not support XML directly. We use Flexter to first convert our XML data to text (TSV). We can then easily load it into BigQuery
Convert FpML XML to Text Using Flexter
Flexter is a tool that can transform complex XMLs into readable files that can be used for analysis.
The step by step instructions for transforming the XML file using Flexter are shown below.
- Access Flexter from https://flexter.sonra.io/
- Click on the ‘Terms and Conditions’ checkbox and click ‘Try Flexter for Free’
- Upload the FpML (Financial products Markup Language) XML file that needs to be transformed and click ‘Continue’.
- On the next page we upload the XSD file (optional) for the XML. In this implementation, we do not use an XSD file. We skip this step and click on ‘Continue’
- On the next page, we provide the e-mail address to which the Flexter output must be sent and click ‘Continue’
- On the next page, click on ‘Finish’ and the output from Flexter will be delivered to the subscribed email address.
- You will receive an email from Flexter with a download link for the TSV files.
- For the XML file from FpML, a total of 14 TSV files were generated and downloaded.
Loading FpML data to BigQuery
In a next step, we load the TSV output from Flexter into BigQuery. This is the step prior to the start of any analysis. For this walkthrough, we are using Google BigQuery as the database management system.
About Google BigQuery
Google BigQuery is an enterprise data warehouse that can store large datasets and helps in superfast querying using Google infrastructure. Google BigQuery can be run using an API console which makes it easy to install and access. For simplicity reasons we will be using the BigQuery WebUI in this blog post.
Signing up for BigQuery
You can sign up for a free tier for BiqQuery with a data allowance. Just follow the steps below to get started.
- Click on ‘Try it free’
- Sign in using your Google credentials
- On the next page, select the country, checkboxes and click ‘Agree and Continue’
- Provide the user information and card details on the next page and click ‘Start my free trial’
- The Google Cloud Platform dashboard is displayed. Search for BigQuery in the search box.
- Click on BigQuery from the search results. The page redirects to Google BigQuery.
Loading data into BigQuery
In a next step we load the TSV files into BigQuery. Please follow these steps.
Creating a dataset
Dataset is the BigQuery terminology for a database. Please follow these steps to create a BigQuery dataset.
- Click on the link next to ‘My First Project’ and click on ‘Create New Dataset’
- Enter the DatasetID as ‘Flexter_Output’, select region and click ‘Ok’
Creating tables and loading XML data
The step by step procedure for creating tables and loading data into Google BigQuery using the web UI is given below. We can simultaneously create tables and load data into BigQuery.
Creating tables by detecting the schema automatically
This is the most convenient and quickest method of loading the data in BigQuery. In this technique the data columns are automatically detected by BigQuery thus avoiding the need for manually specifying each header name. This is very beneficial when loading tables that contain a large number of columns. We can always modify any columns that BigQuery detects incorrectly. In addition to this method, there are two other methods of loading data into BigQuery. In the first method, we can load the data from any file by specifying the schema manually. This method is synonymous with the previous technique with the exception being that this method needs a manual schema definition.
The second method involves the creation of an empty table and then loading the data either through a DDL script or from any file.
The steps for data loading are given as follows:
- Click on the ‘+’ button next to the dataset name.
- A create table window is displayed on the right.
- In the create table window, select ‘Create from source’
- Specify the table name as required
- Under the schema, select ‘Automatically Detect’
- Click on Create table.
- The table can be seen under the dataset.
- Click on the table in the left pane and preview to details to ensure that the load job has been completed as expected.
The other tables can be loaded into BigQuery in a similar manner and the final view can be obtained as shown below.
The data has now been loaded into BigQuery and is ready for analysis.
Querying on Google BigQuery
Queries can be executed in BigQuery on the Web UI. Clicking on the Compose Query button opens the query editor as shown.
The below query was executed to return the number of participants in different studies grouped by gender.
This query was executed to join two tables.
t1.FK_trade = t2.PK_trade;
The output was obtained as shown below.
FpML Data Schema
The figure below shows the schema that was auto-generated by Flexter from the FpML XML data.
The queries that require referencing of multiple columns in order to gain views pertaining to specific business questions/analysis can be formulated based upon this schema.
Which data formats apart from XML also give you the heebie jeebies and need to be liberated? Please leave a comment below or reach out to us.