Converting and Analysing EU Tender data in XML with Flexter and Dataiku DSS

June 9, 2017

TED: EU tender notices in XML

The European Union publishes public procurement contracts across the EU and EEA on a daily basis. The raw data is made available in XML and can be downloaded from the Tenders Electronic Daily FTP servers.
In this tutorial we will see how the Flexter XML parser works hand in hand with Dataiku’s data science platform to generate insights from EU tender data in a few minutes.

TED XML Format Guide

Besides the tender data itself, the TED FTP server also contains important documentation on how the procurement XMLs are structured. I recommend downloading the guide "TED XML Format. General Description" (login: guest/guest) to get a better understanding. The guide will help us make sense of the output that Flexter produces.
Let me give you a brief summary of the most important points:

  • The XML is made up of multiple sections, e.g. Translation, Links, Technical, Forms.
  • The Forms section contains information on the tender notice itself. It’s the most important section of the XML and it contains the data we are mainly interested in.
  • There are different types of tender notices. Each notice type corresponds to a particular form, e.g. the prior information notice corresponds to Form F01. This notice informs the public that a tender will be published soon. Form F02 is the contract notice detailing the tender itself. Form F03 is the contract award notice detailing who won the tender, and so on.
  • There are a lot of additional details in the guide. For the purpose of this blog post we will focus on Form F02 (the contract notice).
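To make the section layout described above a bit more tangible, here is a minimal Python sketch that lists the top-level sections of a notice and the forms inside the Forms section. The sample XML, its namespace and the element names (TECHNICAL_SECTION, FORM_SECTION and so on) are simplified assumptions for illustration only; check the format guide for the real names.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified notice. Real TED files are far larger,
# and the element names and namespace used here are assumptions.
SAMPLE = """<TED_EXPORT xmlns="http://example.org/ted">
  <TECHNICAL_SECTION/>
  <LINKS_SECTION/>
  <TRANSLATION_SECTION/>
  <FORM_SECTION>
    <F02_2014 LG="EN"/>
  </FORM_SECTION>
</TED_EXPORT>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def list_sections(xml_text):
    """Return the names of the top-level sections of a notice."""
    root = ET.fromstring(xml_text)
    return [local_name(child.tag) for child in root]


def list_forms(xml_text, form_section="FORM_SECTION"):
    """Return the names of the forms found inside the Forms section."""
    root = ET.fromstring(xml_text)
    forms = []
    for section in root:
        if local_name(section.tag) == form_section:
            forms.extend(local_name(form.tag) for form in section)
    return forms


sections = list_sections(SAMPLE)
forms = list_forms(SAMPLE)
```

Running this over a handful of downloaded notices is a quick way to see which form types a given day's data contains before converting it.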

Downloading TED tender XML

Now that we have a high-level understanding of how the TED XML is structured we can convert the XML data into a format that is easily consumed by Dataiku DSS. We first download some procurement data from the TED FTP server (guest/guest login). Let’s take some data from one of the days in April.

We also need to download the corresponding XSD (the XML schema).

Converting TED tender XML to CSV/TSV

Now that we have downloaded the XML and the corresponding schema we can convert the data to a format that is easily loaded into Dataiku DSS. We will use Flexter for this purpose.
We first upload our XML data.

Next we upload our TED XML Schema.

Next we submit our e-mail address. You will get an e-mail with a download link to your converted data once Flexter has finished parsing the XML tender data.

That’s it.

Once we have downloaded our tab-separated output (TSV) files we can extract them. As you can see, the names of the files correspond to the various Forms (tender types/notices).
With Flexter, we have achieved in minutes what normally would take days or weeks of manual coding.

Let’s peek into the F02_2014 file. If you remember from earlier on, Form F02 contains the data for the contract notice itself. As you can see it contains a primary key PK_F02_2014 and a foreign key FK_TED_EXPORT to the TED_EXPORT file.
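If you prefer to peek into the file from code rather than a text editor, a small sketch with Python's csv module could look like this. Only the two key columns come from the Flexter output described above; the CONTRACT_TITLE column and the sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical stand-in for the first rows of the F02_2014 file. Only the
# two key columns come from the post; CONTRACT_TITLE is invented.
SAMPLE_TSV = (
    "PK_F02_2014\tFK_TED_EXPORT\tCONTRACT_TITLE\n"
    "1\t10\tRoad maintenance\n"
    "2\t11\tOffice supplies\n"
)


def peek(tsv_file, n=5):
    """Return the column names and the first n rows of a tab-separated file."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    # zip stops as soon as range(n) is exhausted, so at most n rows are read.
    rows = [row for _, row in zip(range(n), reader)]
    return reader.fieldnames, rows


columns, rows = peek(io.StringIO(SAMPLE_TSV), n=1)
```

For a real file you would pass an open file handle instead of the in-memory sample, e.g. `open(path, newline="", encoding="utf-8")`, where `path` points at the extracted F02_2014 file (the encoding is an assumption).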

Analysing TED tender XML data in Dataiku DSS

Let’s load the output from Flexter into Dataiku DSS. Dataiku provide a free community edition of their powerful data science platform. You can also try out their enterprise edition free for 14 days. For this post we are using the community edition.
Let’s first upload the two files we are mainly interested in: TED_EXPORT and F02_2014.

We can now start wrangling the data. Let’s first join our two data sets before we run some analyses on the contract notice data.
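In DSS the join is configured through the user interface, but it helps to know what it does under the hood. The plain Python sketch below performs the same kind of inner join on the foreign key. FK_TED_EXPORT comes from the Flexter output; the name PK_TED_EXPORT for the parent key and the DOC_ID column are assumptions for illustration.

```python
def join_on_key(parents, children, parent_key, child_key):
    """Inner join: attach the matching parent row to every child row."""
    # Index the parent rows by their primary key for constant-time lookups.
    index = {row[parent_key]: row for row in parents}
    joined = []
    for child in children:
        parent = index.get(child[child_key])
        if parent is not None:  # child rows without a parent are dropped
            joined.append({**parent, **child})
    return joined


# Hypothetical rows; PK_TED_EXPORT and DOC_ID are assumed column names.
ted_export = [
    {"PK_TED_EXPORT": "10", "DOC_ID": "A"},
    {"PK_TED_EXPORT": "11", "DOC_ID": "B"},
]
f02 = [
    {"PK_F02_2014": "1", "FK_TED_EXPORT": "10"},
    {"PK_F02_2014": "2", "FK_TED_EXPORT": "99"},  # no matching parent
]

joined = join_on_key(ted_export, f02, "PK_TED_EXPORT", "FK_TED_EXPORT")
```

The second F02 row has no matching parent, so it does not appear in the result. Use a left join instead if you want to keep such rows and spot orphaned records.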

Next we can perform various analyses on the data.

Analysis 1: Data Quality of e-mail addresses
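As a rough idea of what such a data quality check involves, the sketch below sorts e-mail values into valid, invalid and empty. It is not how DSS classifies values internally; the regular expression is a deliberately loose assumption and the sample values are made up.

```python
import re

# Deliberately loose pattern: something@something.something, no whitespace.
# This is an assumption for illustration, not a full address validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def email_quality(values):
    """Count valid, invalid and empty e-mail values."""
    counts = {"valid": 0, "invalid": 0, "empty": 0}
    for value in values:
        text = (value or "").strip()
        if not text:
            counts["empty"] += 1
        elif EMAIL_RE.match(text):
            counts["valid"] += 1
        else:
            counts["invalid"] += 1
    return counts


# Made-up sample values.
sample = ["info@example.org", "", "not-an-email", None, " tenders@example.com "]
quality = email_quality(sample)
```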

Analysis 2: Data Distribution of Value of tender in Euros
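For a quick numerical view of the same distribution, a few summary statistics go a long way. The sketch below assumes the tender values arrive as strings, as they would when read from a TSV file, and skips anything that does not parse as a number. The sample values are made up.

```python
import statistics


def summarise_values(raw_values):
    """Summarise numeric values, ignoring blanks and unparseable entries."""
    numbers = []
    for raw in raw_values:
        try:
            numbers.append(float(raw))
        except (TypeError, ValueError):
            continue  # None, empty strings, "n/a" and similar are skipped
    if not numbers:
        return None
    return {
        "count": len(numbers),
        "min": min(numbers),
        "median": statistics.median(numbers),
        "max": max(numbers),
        "mean": statistics.mean(numbers),
    }


# Made-up tender values in euros.
sample_values = ["100000", "250000.50", "", None, "n/a", "50000"]
summary = summarise_values(sample_values)
```

Comparing the median with the mean is a cheap way to see how skewed the values are: a few very large contracts pull the mean far above the median.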

Analysis 3: Number of tenders by authority type
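The same count can be sketched in a few lines of Python with collections.Counter. The column name AUTHORITY_TYPE and the sample rows are hypothetical; use whichever column holds the authority type in your Flexter output.

```python
from collections import Counter


def tenders_by_authority_type(rows, column="AUTHORITY_TYPE"):
    """Count rows per authority type, most frequent first."""
    # Missing or empty values are grouped under "UNKNOWN".
    counts = Counter((row.get(column) or "UNKNOWN") for row in rows)
    return counts.most_common()


# Made-up rows; AUTHORITY_TYPE is an assumed column name.
sample_rows = [
    {"AUTHORITY_TYPE": "Ministry"},
    {"AUTHORITY_TYPE": "Regional authority"},
    {"AUTHORITY_TYPE": "Ministry"},
    {"AUTHORITY_TYPE": ""},
]
by_type = tenders_by_authority_type(sample_rows)
```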

Conclusion: XML Conversion to Dataiku DSS

It took me just ten minutes to go from downloading the data from the TED FTP server to converting the XML to tabular data, loading it into DSS and performing some quick analysis. This shows what is achievable when two powerful tools such as Flexter and Dataiku DSS work together. You can achieve in minutes what would normally take weeks.
Which data formats apart from XML also give you the heebie-jeebies and need to be liberated? Please leave a comment below or reach out to us.