XML,

Home » XML » Convert XML to AWS ATHENA

Convert XML to AWS ATHENA

by Maciek

Maciek is the Co-founder of Sonra. He has a knack for turning messy semi-structured formats like XML, JSON, and XSD into readable data. With a brain wired for product and data architecture, Maciek is the magic ingredient to making sure your systems don’t just work—they shine.

Published on July 22, 2017
Updated on November 20, 2024

Convert XML to Text. Load to S3. Query with Athena

In this walk-through, we will demonstrate the process of loading XML data into Athena – a query engine on AWS. We download the XML files from Clinicaltrials.gov and then convert them into TSV files using Flexter, a powerful XML parser from Sonra. We then load the TSV files into Amazon S3 and use Athena for querying and joining the data

XML data from ClinicalTrials

We have sourced the data from Clinicaltrials.gov , which is a registry and results database of publicly and privately supported clinical studies conducted around the world. The website provides a collection of trends, charts and maps that can help the users in visualizing the outcome or progress of a study based on the location or type of users. The data is available in XML format with single record and multiple record XMLs. The study information can be downloaded from the Search Results list. The ‘search results’ is a feature that help the users in filtering the data to be analyzed based on the number of studies, the number of table rows per study and the file format. In addition to this, a full study record XML document can be downloaded.

Processing the XML file

We will be using Flexter to convert the XML data to Athena. Athena is an interactive query service provider available on the AWS platform. Athena cannot process XML files directly and hence we use Flexter to first convert our XML data to text (TSV). It will then be easy to load the data into Athena via S3 storage.

Convert ClinicalTrials XML to Text Using Flexter

Flexter is a tool that transforms complex XMLs into readable files that can be used for analysis. The step by step instructions for transforming XML files using Flexter are shown below.

Access Flexter from https://sonra.io/xml-to-csv-converter/

Screen Clipping

Click on the ‘Terms and Conditions’ checkbox and click ‘Try Flexter for Free’

Screen Clipping

Upload the XML file that needs to be transformed and click ‘Continue’

Screen Clipping

In the next page, we can upload the XSD file (if available). For this implementation, the XSD file is not available and hence we will click on ‘Skip ‘and proceed to the next page.

Screen Clipping

Provide Email and click ‘Continue’
The conclusion page is displayed and the output will be delivered to the subscribed email address.

Screen Clipping

An email from Flexter with the link to the transformed data is obtained. The TSV files can be downloaded from this link which redirects to the Flexter page.

Screen Clipping

For the XML files from ClinicalTrials, a total of 13 TSV files were generated and downloaded.

Screen Clipping

Loading ClinicalTrials data to AWS

In the next step, we will be loading the TSV output from Flexter into Amazon S3. In the following sections, we have explained the process of setting up S3 and loading data.

Setting up Amazon S3 Data Storage

Amazon S3 is a web service on the AWS platform that enables business users to load and store data. Amazon S3 offers object storage with a simple web service that enables users to store and retrieve any amount of data from anywhere on the web.
The following steps highlight the basic process to load data into Amazon S3.

In the search box on the AWS console type S3 as shown below.

Click on S3 – Scalable storage in the cloud. You will be redirected to the S3 web service as shown below.

We create a data bucket in the next step. The bucket is a logical unit of storage on the AWS platform and provides the user with an option to store objects consisting of data and metadata which can be accessed from other utilities like Athena and QuickSight. The bucket offers users the option to create an unlimited number of objects. The create bucket page is shown below. The bucket name is specified as ‘flexteroutput’

The properties can be setup as necessary in the next page.

The permission/access levels can be set up in the next page as shown below.

Click on create bucket. We have created the S3 bucket which can be viewed on the S3 web service.

Loading XML output into S3 bucket

Amazon S3 provides a highly scalable and flexible data loading service which makes it possible for the user to load any type of data. The Flexter output obtained from the XML files contained a total of 13 TSV files that we load into the bucket. The following steps describe the data loading process into Amazon S3.

Go to the AWS S3 console.

Click on flexteroutput. You will be redirected to the below page.

Create folders for each of the TSV files – one folder per table as shown below.

Click on upload and select the file you want to upload into S3.
The TSV files need to be loaded into each folder in the bucket. For instance, the TSV file age_group should be loaded into the first folder.

Click on ‘Next’. In the next window, provide the user permissions as required.

Click next and specify the storage class and metadata (optional)

Click next to review the files to be uploaded and proceed to ‘Upload’

The files are loaded into Amazon S3 as shown above.
Repeat the same method to load each TSV file into the folders.

Amazon Athena

In the next step, we will be loading the data stored in S3 into Athena and execute SQL queries.
Amazon Athena is an interactive query service provided on AWS that allows users to query and analyse data using standard SQL. Athena is server-less, hence the users have no infrastructure to manage which makes it easy to use. From this point of view it is similar to Google BigQuery. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds.
Under the hood, Athena uses Presto with ANSI SQL support – a distributed SQL engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. The pair of Athena as the query engine and S3 as the storage engine is loosely coupled. This is comparable to the combination of Kudu and Impala from Cloudera.

Getting started with Athena – Data Loading

The step by step process of setting up Athena is listed below.

Go to https://aws.amazon.com/athena/

You will be redirected to the page shown below.

Click on Get Started. The below page is shown.

Click on Add table. The below page is displayed. Before adding any table, a database should be created.

Specify the database name as Flexter_Output and provide the table name as age_group. The location of the input data set in the S3 bucket should be specified as s3://flexteroutput/age_group/

Click next. In the following page, specify the data format. Since we have the file format as TSV, click on the TSV checkbox.

The next page is to define the table schema. Enter the column names and the data types as required.

Specify data partitions (if any) and click on create table. Partitions are primarily used when there is a large amount of data that can be split based on time for example. In such a scenario, we can split/partition the data by the month/year or any specified key for faster querying and cost efficiency. In our particular example, partitions are not necessary as the data to be queried is minimal and within the cost parameters. Click here for more regarding data partitioning.
The table is created as shown below.

CREATE EXTERNAL TABLE IF NOT EXISTS flexteroutput.age_group (

`FK_study` bigint,

`age_group` string

)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'

WITH SERDEPROPERTIES (

'serialization.format' = ' ',

'field.delim' = ' '

) LOCATION 's3://flexteroutput/age_group/'

TBLPROPERTIES ('has_encrypted_data'='false');

The preview of this table is shown below:

From the preview, we can see that the first row in the data columns contains the header value. This is an unexpected outcome and can be considered as a shortfall in Athena. This can be avoided by manually removing the header from the TSV file prior to data loading. This seems to be the only way to handle this shortcoming as of time of writing.
The other TSV files are loaded into Athena in a similar manner. The final view of the database in Athena after data loading is shown below.

Data Schema

The schema of the data tables for this analysis is shown below.

Based on the above schema we can run complex queries that require joining multiple tables.

Sample queries in Athena

Athena follows standard SQL. Queries can be executed from the query editor in the console.

Query 1

1	SELECT * FROM study limit 10;

OUTPUT:

Query 2:

SELECT Gender,

COUNT(Gender)

FROM Study

GROUP BY Gender ;

OUTPUT:

A single value called ‘Gender’ is displayed in the output as AWS considers the header as a data row.

Query 3:

SELECT sponsors_lead_sponsor,

COUNT(Gender) AS count

FROM Study

GROUP BY sponsors_lead_sponsor;

OUTPUT:

Query 4:

SELECT sponsors_lead_sponsor,

AVG(RANK) AS AverageRank

FROM Study

GROUP BY sponsors_lead_sponsor;

OUTPUT:

Query 5:

SELECT MAX(RANK) AS MaxRank,

MIN(RANK) AS MinRank,

Gender

FROM Study

GROUP BY Gender;

OUTPUT:

Query 6:

SELECT t1.age_group AS Age_Group,

t2.*

FROM age_group t1

JOIN Study t2

ON t1.FK_study = t2.PK_study;

OUTPUT:

Which data formats apart from XML also give you the heebie jeebies and need to be liberated? Please leave a comment below or reach out to us.

About the author:

Maciek

Co-founder of Sonra

Follow Maciek:

Back to Blog

Cookie	Duration	Description
__cfruid	session	Cloudflare sets this cookie to identify trusted web traffic.
cookielawinfo-checkbox-marketing	1 month	This cookie is set by the GDPR Cookie Consent plugin to store the user consent for the cookies in the category "Marketing".
cookielawinfo-checkbox-necessary	1 month	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-preferences	1 month	This cookie is set by the GDPR Cookie Consent plugin to check if the user has given consent to use cookies under the "Preferences" category.
cookielawinfo-checkbox-statistics	1 month	This cookie is set by the GDPR Cookie Consent plugin to store the user consent for the cookies in the category "Statistics".
cookielawinfo-checkbox-unclassified	1 month	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Unclassified".
CookieLawInfoConsent	1 month	Records the default button state of the corresponding category & the status of CCPA. It works only in coordination with the primary cookie.
csrftoken	1 year	This cookie is associated with Django web development platform for python. Used to help protect the website against Cross-Site Request Forgery attacks

Cookie	Duration	Description
AnalyticsSyncHistory	1 month	Linkedin set this cookie to store information about the time a sync took place with the lms_analytics cookie.
bcookie	2 years	LinkedIn sets this cookie from LinkedIn share buttons and ad tags to recognize browser ID.
bscookie	2 years	LinkedIn sets this cookie to store performed actions on the website.
lang	session	LinkedIn sets this cookie to remember a user's language setting.
li_gc	2 years	Linkedin set this cookie for storing visitor's consent regarding using cookies for non-essential purposes.
lidc	1 day	LinkedIn sets the lidc cookie to facilitate data center selection.
mgref	1 year	This cookie is set by Eventbrite to deliver content tailored to the end user's interests and improve content creation. It is also used for event-booking purposes.
mgrefby	1 year	This cookie is set by Eventbrite to deliver content tailored to the end user's interests and improve content creation. It is also used for event-booking purposes.
UserMatchHistory	1 month	LinkedIn sets this cookie for LinkedIn Ads ID syncing.

Cookie	Duration	Description
G	1 year	Cookie used to facilitate the translation into the preferred language of the visitor.
SERVERID	session	This cookie is set by Slideshare's HAProxy load balancer to assign the visitor to a specific server.

Cookie	Duration	Description
_ga	2 years	The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_ga_7H38LVR4Z5	2 years	This cookie is installed by Google Analytics.
_gat_gtag_UA_44804396_1	1 minute	Set by Google to distinguish users.
_gat_UA-44804396-1	1 minute	A variation of the _gat cookie set by Google Analytics and Google Tag Manager to allow website owners to track visitor behaviour and measure site performance. The pattern element in the name contains the unique identity number of the account or website it relates to.
_gcl_au	3 months	Provided by Google Tag Manager to experiment advertisement efficiency of websites using their services.
_gid	1 day	Installed by Google Analytics, _gid cookie stores information on how visitors use a website, while also creating an analytics report of the website's performance. Some of the data that are collected include the number of visitors, their source, and the pages they visit anonymously.
CONSENT	2 years	YouTube sets this cookie via embedded youtube-videos and registers anonymous statistical data.
SIDCC	6 Months	The "SIDCC" cookie is used as security measure to protect users data from unauthorised access
test_cookie	15 minutes	The test_cookie is set by doubleclick.net and is used to determine if the user's browser supports cookies.
VISITOR_INFO1_LIVE	5 months 27 days	A cookie set by YouTube to measure bandwidth that determines whether the user gets the new or old player interface.
YSC	session	YSC cookie is set by Youtube and is used to track the views of embedded videos on Youtube pages.
yt-remote-connected-devices	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt-remote-device-id	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt.innertube::nextId	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.
yt.innertube::requests	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.

Cookie	Duration	Description
AN	1 month
AS	session
ebEventToTrack	1 month
eblang	1 year
SNID	2 years	This cookie is set by the Google. This cookie is used by the map which helps visitors to identify and reach the facility.
SP	session
SS	session

SQL Visualisation Guide - Query Diagrams, Lineage & ERD

SQL Visualisation Guide - Query Diagrams, Lineage & ERD

SQL Visualisation Guide - Query Diagrams, Lineage & ERD

SQL Visualisation Guide - Query Diagrams, Lineage & ERD

Convert XML to AWS ATHENA

Convert XML to Text. Load to S3. Query with Athena