A brief history of XML - From hype to useful data format

Vadim Mytarev Flexter, Hadoop, Spark, XML, XSD

Is XML really dead?

When it first became popular about 20 years ago, XML was meant to be the one and only format to serialize, encapsulate, and exchange data: the serialization format to end all serialization formats, so to speak. This was a bold claim. Has it materialised? Over the last couple of years it has become clear that this bid for “world power” was a bridge too far. For exchanging simple pieces of information XML is just too verbose, and many developers dislike it. JSON has now taken the place of XML as the serialization format of choice on the web, and most REST web services have switched to it. This makes perfect sense. Every XML element needs an opening and a closing tag, which inflates payloads and adds parsing overhead. JSON also maps more directly onto the objects, arrays, and scalar values of a programming language.
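To make the verbosity point concrete, here is a minimal sketch that serializes the same record once as JSON and once as XML using only the Python standard library. The `order` record and its field names are invented for illustration, and the size difference will vary with the data.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record, invented for illustration only.
order = {"id": 1001, "customer": "ACME", "currency": "EUR", "total": 249.5}

# JSON: a single call serializes the dictionary.
as_json = json.dumps(order)

# XML: every field is wrapped in an opening and a closing tag.
root = ET.Element("order")
for key, value in order.items():
    ET.SubElement(root, key).text = str(value)
as_xml = ET.tostring(root, encoding="unicode")

print(as_json)
print(as_xml)
print(len(as_json), len(as_xml))  # character counts of both payloads
```

For a flat record like this one the XML string comes out noticeably longer, because each field name appears twice. Note also that the XML version loses the type information (everything becomes text), whereas JSON keeps numbers as numbers.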

What about data analytics? During the hype days, some people even thought that XML would replace relational databases. What retrospectively looks like a bad joke was quite a serious proposition at the time. There were countless books on the subject, and a few attempts were made to create XML databases. It quickly became clear that XML was not fit for purpose in those scenarios. Querying XML with XPath is painful compared to SQL. With plain XML files there are no indexes and no cost-based optimizer to lean on, and for queries to be efficient you typically have to load the whole document into memory. Apart from relational databases, we now also have open source, compressed columnar data formats such as Parquet and ORC that are a much better fit for data analytics than XML.
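As a rough illustration of the difference, the sketch below answers the same question ("what is the total order value for one customer?") once against an XML document and once against a SQL table. The sample data, element names, and table layout are assumptions made up for this example. It uses `xml.etree.ElementTree`, which supports only a subset of XPath, and an in-memory SQLite database.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for illustration only.
XML_DOC = """
<orders>
  <order id="1"><customer>ACME</customer><total>120.0</total></order>
  <order id="2"><customer>Globex</customer><total>80.0</total></order>
  <order id="3"><customer>ACME</customer><total>45.5</total></order>
</orders>
"""

# XML: the whole document is parsed into an in-memory tree first.
root = ET.fromstring(XML_DOC)
# The path expression only selects elements; the aggregation is done in Python.
acme_orders = root.findall("./order[customer='ACME']")
xml_total = sum(float(o.findtext("total")) for o in acme_orders)

# SQL: the same question as one declarative query; the engine decides how to run it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "ACME", 120.0), (2, "Globex", 80.0), (3, "ACME", 45.5)],
)
(sql_total,) = conn.execute(
    "SELECT SUM(total) FROM orders WHERE customer = ?", ("ACME",)
).fetchone()

print(xml_total, sql_total)
```

Both approaches return the same number here. The difference shows up at scale: the SQL engine can use an index on `customer`, while the XML version has to parse and walk the entire tree for every question.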

Has XML failed?

It is one thing to say that XML has not delivered on its promises, and quite another to claim that it has failed or that it is dead. Yes, it is not a good fit for exchanging data on the web, nor is it a good fit for data analytics. However, there are countless examples where XML is used successfully to this day. What the story of XML tells us is that there is not one data serialization format to rule them all. We now have many formats at our disposal: Avro, Thrift, and Protocol Buffers, to name just a few. For a full list and description have a look at this Wikipedia article https://en.wikipedia.org/wiki/Comparison_of_data_serialization_formats. Each of these serves its own use case (although some of the formats listed in the Wikipedia article are obsolete).

XML Success Stories

What are some use cases where XML succeeded?

  • A significant number of enterprises use XML as a data exchange format. XML is the de facto standard for exchanging messages between enterprise applications in a Service-Oriented Architecture. Messages that conform to the canonical model are converted back and forth to XML. If you have ever worked in an enterprise context you know that life there isn’t simple, and neither are the types of data and the relationships you come across. This is an environment where XML shines as a data format with an extensible schema to represent complex business processes in the real world.
  • Business processes between enterprises are more than ever inter-connected at a global scale. B2B data hubs often standardise on XML as their data exchange format.
  • Many industry standards have evolved over the years that are based on XML. Years of work and expertise have gone into these standards. In particular this is the case in finance (ESMA TRACE, MiFID, XBRL), retail, healthcare (HL7), life sciences (CDISC), and the public sector (EU), to name just a few.
  • XML is used as a serialization format for RDF (RDF/XML) in a semantic web context.
  • In the publishing industry, XML is used throughout the document processing workflow. It is also the standard for Office file formats such as Word, Excel, PowerPoint or the Google Docs equivalents.

XML = Pain

We have seen that XML can be quite useful and has found its own niches. The initial hype did not materialise, but while XML is not ubiquitous it is still widely used, in particular in an enterprise context where things can get complex. As we all know, when things get complex they get difficult.

In theory XML is human readable. In practice this only holds for the simplest XML files, such as configuration files. XML schemas (XSDs) can become quite complex. We have seen XSDs that contain hundreds of entities/tables. When we visualise those schemas they look like the schema of a complex ERP system, reminiscent of a spider’s web. This complexity makes XML very hard for data analysts and developers to work with. The man days spent on analyzing and processing XML increase steeply with the complexity of the XSD. A compounding factor is that most XSDs have not been designed with analytics in mind, e.g. transactions come with redundant reference data, real world relationships are not modelled correctly, etc.

So what are your options for processing complex XML files into a relational format or a Big Data format such as Parquet/ORC (both formats are fit for data analytics)?

  • You can hire a bunch of developers and data analysts who try to make sense of the complex schema and manually extract the data from the XML by writing custom code. If you have an ETL tool you will find out sooner or later that it can’t handle the complexity of most industry standards, that it only semi-automates the process, or that performance is shockingly bad.
  • You can use Flexter Data Liberator for XML. Do in one day what your developers/ETL tool would do in six months (if at all). Don’t worry about data volume, SLAs or performance. Flexter scales linearly. End of story.
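To give a feel for the first option, here is a minimal sketch of what hand-written extraction code looks like for a toy document with a single parent/child relationship. The XML sample, the element names, and the `flatten` helper are all hypothetical. Real industry schemas have hundreds of such relationships, each needing its own mapping.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested document, invented for illustration only.
XML_DOC = """
<orders>
  <order id="1">
    <customer>ACME</customer>
    <item sku="A-1" qty="2"/>
    <item sku="B-7" qty="1"/>
  </order>
  <order id="2">
    <customer>Globex</customer>
    <item sku="A-1" qty="5"/>
  </order>
</orders>
"""

def flatten(xml_text):
    """Split nested orders into two flat row sets linked by order_id."""
    root = ET.fromstring(xml_text)
    order_rows, item_rows = [], []
    for order in root.iter("order"):
        order_id = int(order.get("id"))
        # Parent table: one row per <order>.
        order_rows.append(
            {"order_id": order_id, "customer": order.findtext("customer")}
        )
        # Child table: one row per <item>, carrying the parent key.
        for item in order.iter("item"):
            item_rows.append(
                {
                    "order_id": order_id,
                    "sku": item.get("sku"),
                    "qty": int(item.get("qty")),
                }
            )
    return order_rows, item_rows

order_rows, item_rows = flatten(XML_DOC)
print(order_rows)
print(item_rows)
```

Every element name, key, and type conversion is hard-coded, so any change to the schema means changing the code. That is manageable for two tables and quickly becomes a maintenance burden for a large XSD.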

We understand that you are sick of working with XML. Why not try out Flexter to find out how much fun it can be to process XML? Flexter is our platform that takes the pain out of converting XML files into a relational format or Parquet.

Which data formats apart from XML also give you the heebie jeebies and need to be liberated? Please leave a comment below or reach out to us.