Note: We have written an updated version of this post that shows XML conversion on Spark to Parquet with code samples.
Did you ever have to process XML files? Complex and large ones? Lots of them? No matter which processing framework or programming language you use, it is always a pain. It is never easy. You can be sure that it will be very time consuming and error prone. Unless you have a very simple XML file, you are guaranteed to run into problems. Based on a recent survey we did, more than 80% of XML processing projects ran into one or more of the following issues:
- Because of the complexity of the XSD/XML, it took the developers forever to write the code to process the XML files. Development time grew exponentially with complexity.
- For large volumes of XML or large XML files, performance and meeting SLAs were an issue most of the time.
- Administration and maintenance were cumbersome, especially when changes were made to the XSD.
Why does this problem exist?
XML is supposed to be human readable, right? I’m sure you will beg to differ if you’ve ever had to deal with any of the widespread XML-based industry standards such as HL7, SWIFT, OTA, etc. Just by looking at those XSDs/XMLs you will get a headache. At least that is what happens to me.
XML is great as a data exchange format for operational systems. However, for data analytics it just doesn’t cut it. For that purpose we need data to be in a different format, a format that can be understood by BI and visualisation tools. I am talking about tables in a relational format or a big data format such as Parquet.
XML processing on Spark. Your options.
Spark is great for XML processing. It is based on a massively parallel distributed compute paradigm. With Spark, XML performance problems are a thing of the past. How about XSD/XML complexity? While there is a good wrapper library on GitHub that simplifies extraction of data based on XPath, it is still very much a cumbersome, error-prone, risky, and slow process to liberate your data from XML.
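To give a feel for what hand-written, path-based extraction looks like, here is a minimal sketch in plain Python using only the standard library. It is not Spark code and not the wrapper library mentioned above. The sample document, the element names, and the `flatten_orders` helper are all made up for illustration. Even for this tiny, two-level document we have to hard-code every path and carry the parent key down to the child rows ourselves, and that is the part that gets painful as the XSD grows.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are illustrative only.
SAMPLE = """
<orders>
  <order id="1">
    <customer>Alice</customer>
    <item sku="A1" qty="2"/>
    <item sku="B7" qty="1"/>
  </order>
  <order id="2">
    <customer>Bob</customer>
    <item sku="A1" qty="5"/>
  </order>
</orders>
""".strip()


def flatten_orders(xml_text):
    """Flatten nested order/item elements into two relational row sets."""
    root = ET.fromstring(xml_text)
    orders, items = [], []
    # Every path below is hard-coded; a change in the XSD means a code change here.
    for order in root.findall("./order"):
        order_id = int(order.get("id"))
        orders.append({"order_id": order_id,
                       "customer": order.findtext("customer")})
        for item in order.findall("./item"):
            # The parent key is propagated manually so the rows can be joined later.
            items.append({"order_id": order_id,
                          "sku": item.get("sku"),
                          "qty": int(item.get("qty"))})
    return orders, items


orders, items = flatten_orders(SAMPLE)
```

The result is two lists of rows, one per target table, linked by `order_id`. A real industry schema may have hundreds of such parent-child relationships, each needing the same treatment.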
That is why we have developed Flexter for XML. It is written in Scala, runs on Spark, and it’s fast. It takes the pain out of processing XML files on Spark. Everything is automated. You just supply your XSD and your XML files and off you go. Flexter does all the work for you. It analyses the XSD, creates an optimised target schema, processes the XML, and spits out the data at the other end in a format of your choice: relational database tables, CSV/TSV, Parquet. It even has a couple of built-in optimisations that simplify the target schema and make downstream processing and consumption of data much easier. We have even built in an XSD browser that lets data analysts browse the source and target schemas and data lineage between the two. A detailed list of features is in our product section.
Benefits of using Flexter for Spark XML processing
- Reduce Cost: Reduce your development costs by up to 80%
- Meet SLAs: Never again miss a Service Level Agreement
- Scale Infinitely: Big Data ready. Flexter handles any volume of data.
- Reduce Risk: No specialist knowledge or niche skills needed
- Meet Deadlines: Meet your project timelines instantly
Which data formats apart from XML also give you the heebie-jeebies and need to be liberated? Please leave a comment below or reach out to us.