
Converting XML and JSON to a Data Lake

by Maciek

Maciek is the Co-founder of Sonra. He has a knack for turning messy semi-structured formats like XML, JSON, and XSD into readable data. With a brain wired for product and data architecture, Maciek is the magic ingredient to making sure your systems don’t just work—they shine.

Published on September 30, 2021
Updated on February 23, 2026

Data lakes are a popular design pattern in data analytics. A data lake is used to store a copy of data coming from operational source systems such as relational databases.

If you’re also working with JSON on modern data platforms and need step-by-step guidance on reading, parsing, querying, and flattening it before landing it in a lake, see our guide “Databricks JSON Guide: Read, Parse, Query and Flatten Data”.

You can choose from dozens of tools to populate a data lake from relational and structured data sources. However, it gets tricky when you want to store and query XML and JSON documents in the data lake. Often your only choice is to write your own XML / JSON parser. This is not a trivial exercise, as this quote from Ralph Kimball shows:

“Because of such inherent complexity, never plan on writing your own XML processing interface to parse XML documents.

The structure of an XML document is quite involved, and the construction of an XML parser is a project in itself—not to be attempted by the data warehouse team.”

While this quote from the father of dimensional data warehousing was made in the context of the data warehouse, the same applies to the data lake.

The trouble with XML / JSON conversion projects for data lakes

XML conversion projects are well known for their high failure rates. They either run over time and budget or fail completely. To a lesser extent, this also applies to JSON, as JSON documents tend to be less complex than XML documents.

We typically see a combination of the following issues:

Lack of skills. Data analysts and data engineers are good at working with structured data in databases using SQL. They often lack the niche skills, such as XSLT, XSD, and XQuery, that are needed to unlock the data in XML or JSON documents.
The lack of skills delays projects or even leads to complete failure, because the solutions that get built are poorly designed and implemented.
Projects need to go through a lengthy and complex development process that takes weeks or months. We have seen some projects take more than a year, with several failed attempts. The result is that data is not available to decision makers.
Using an ETL tool still requires a significant development effort. These tools can handle simple XML/JSON documents but fail when things get more complex.
The lack of skills leads to solutions that don’t scale well. We have seen ETL processes running for 24+ hours to convert 50,000 XML documents.
Once everything has been deployed after weeks, months, or years, a new version of the XML or JSON schema is released and the whole solution has to be refactored and re-engineered.
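As a small illustration of how a hand-written conversion can go subtly wrong, consider a document with two independent repeating branches. The sketch below is an assumption-laden example (the `customer`, `phone`, and `order` element names are made up, and it uses only Python's standard library): flattening everything into one wide table multiplies the rows and inflates any aggregate computed on them.

```python
import xml.etree.ElementTree as ET
from itertools import product

# Hypothetical document: phones and orders repeat independently of each other.
DOC = """
<customer name="Ann">
  <phone>555-0100</phone>
  <phone>555-0101</phone>
  <order total="10"/>
  <order total="20"/>
  <order total="30"/>
</customer>
""".strip()

root = ET.fromstring(DOC)
name = root.get("name")
phones = [p.text for p in root.findall("phone")]
totals = [int(o.get("total")) for o in root.findall("order")]

# Naive approach: one wide table, so every phone is paired with every order.
naive_rows = [
    {"name": name, "phone": phone, "total": total}
    for phone, total in product(phones, totals)
]
naive_sum = sum(row["total"] for row in naive_rows)

# Safer approach: one table per repeating branch, joined via the parent key.
phone_rows = [{"name": name, "phone": phone} for phone in phones]
order_rows = [{"name": name, "total": total} for total in totals]
correct_sum = sum(row["total"] for row in order_rows)

print(len(naive_rows), naive_sum)    # 6 rows, order totals counted twice
print(len(order_rows), correct_sum)  # 3 rows, each order counted once
```

Real documents usually have many such branches at several nesting levels, which is one reason manual mappings are hard to get right.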

Automated conversion of XML / JSON to data lakes

We have experienced all of these issues ourselves in dozens of data lake and data warehouse implementations. We thought to ourselves: “There must be a better way”. Hence we created Flexter. Flexter is an automation solution for converting XML and JSON documents to data lakes on AWS (S3), Azure (Data Lake Storage Gen2), and GCP (Cloud Storage).

Using Flexter, you can automatically convert the XML / JSON documents in your data lake to Parquet, ORC, Avro, CSV, TSV, or PSV files on cloud object storage and then consume the data downstream, e.g. in a data warehouse such as Snowflake or a query engine such as Athena.
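To make the delimited text formats concrete: CSV, TSV, and PSV differ only in the separator character (comma, tab, and pipe). The following is a minimal sketch using Python's standard `csv` module, not Flexter's own writer; the `to_delimited` helper and the sample rows are hypothetical.

```python
import csv
import io

# Separator per format: comma, tab, and pipe.
DELIMITERS = {"csv": ",", "tsv": "\t", "psv": "|"}

def to_delimited(rows, columns, fmt):
    """Serialise a list of row dicts to CSV, TSV, or PSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        delimiter=DELIMITERS[fmt],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical rows; the second value contains a pipe to show quoting.
rows = [{"sku": "A1", "qty": 2}, {"sku": "B|7", "qty": 1}]
psv = to_delimited(rows, ["sku", "qty"], "psv")
tsv = to_delimited(rows, ["sku", "qty"], "tsv")
print(psv)
```

Values that contain the separator are quoted by the writer, so downstream readers need to apply the same quoting rules. Columnar formats such as Parquet and ORC need a dedicated library and are not shown here.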

The best thing is, everything happens automagically. You don’t have to go through a lengthy conversion project and risk project delays and failures. Install and configure Flexter and instantly start converting XML and JSON to a relational format.

Flexter can handle any data volume or complexity. You can scale up on a single node or scale out to a cluster of nodes to distribute your workload for terabytes or even petabytes of XML/JSON documents.

With Flexter’s metadata catalog you can even semi-automate the upgrade process between different versions of your schema. Identify the changes and then auto-generate scripts to upgrade your target schema.
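The general idea of diffing two schema versions and generating upgrade scripts can be sketched in a few lines. This is an illustrative sketch only, not Flexter's metadata catalog: the `upgrade_script` function, the table and column names, and the generic SQL types are all assumptions, and it handles only additive changes (new tables and new columns).

```python
def upgrade_script(old, new):
    """Return DDL statements that bring `old` up to `new` (additive changes only).

    Both arguments map table name -> {column name: SQL type}.
    """
    statements = []
    for table in sorted(new):
        if table not in old:
            cols = ", ".join(f"{col} {typ}" for col, typ in new[table].items())
            statements.append(f"CREATE TABLE {table} ({cols});")
            continue
        for col, typ in new[table].items():
            if col not in old[table]:
                statements.append(f"ALTER TABLE {table} ADD COLUMN {col} {typ};")
    return statements

# Hypothetical target schemas derived from two versions of a source schema.
schema_v1 = {"orders": {"id": "INT", "customer": "VARCHAR"}}
schema_v2 = {
    "orders": {"id": "INT", "customer": "VARCHAR", "currency": "VARCHAR"},
    "shipment": {"id": "INT", "carrier": "VARCHAR"},
}

script = upgrade_script(schema_v1, schema_v2)
for statement in script:
    print(statement)
```

Removed or renamed columns and type changes need a human decision, which is why the process is described as semi-automated rather than fully automated.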

Convert any XML to a data lake on S3, Data Lake Gen2 or Google Cloud Storage

Flexter works with any volume of data using a simple three-step process:

Step 1: In a rapid, one-time operation, we scan and traverse XML/JSON documents for information and intelligence.
Step 2: We create a logical target schema and the mappings between XML/JSON elements and the database tables and columns.
Step 3: We process and convert the XML/JSON documents each time new data arrives.
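The three steps can be pictured with a toy example. The sketch below is a deliberately simplified illustration in standard-library Python, not how Flexter is implemented: the function names, the sample documents, and the `row_id` / `parent_row_id` key columns are assumptions. It identifies elements by tag name rather than full path, and it assumes leaf elements do not repeat within a parent.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import count

# Hypothetical sample documents.
SAMPLE_DOCS = [
    "<order id='1'><customer>Ann</customer>"
    "<item><sku>A1</sku><qty>2</qty></item>"
    "<item><sku>B7</sku><qty>1</qty></item></order>",
    "<order id='2'><customer>Bob</customer>"
    "<item><sku>C3</sku><qty>5</qty></item></order>",
]

def scan(docs):
    """Step 1: record the attributes and leaf children of each non-leaf element."""
    columns = defaultdict(set)  # element tag -> column names
    for doc in docs:
        for elem in ET.fromstring(doc).iter():
            if len(elem) == 0:
                continue  # leaf elements become columns of their parent
            columns[elem.tag].update(elem.attrib)
            columns[elem.tag].update(c.tag for c in elem if len(c) == 0)
    return columns

def build_schema(columns):
    """Step 2: one table per non-leaf element, plus surrogate and parent keys."""
    return {tag: ["row_id", "parent_row_id"] + sorted(cols)
            for tag, cols in columns.items()}

def convert(docs, schema):
    """Step 3: emit one row per non-leaf element, linked to its parent row."""
    tables = {tag: [] for tag in schema}
    ids = count(1)

    def walk(elem, parent_id):
        if len(elem) == 0:
            return
        row = {col: None for col in schema[elem.tag]}
        row["row_id"], row["parent_row_id"] = next(ids), parent_id
        row.update(elem.attrib)
        for child in elem:
            if len(child) == 0:
                row[child.tag] = child.text
        tables[elem.tag].append(row)
        for child in elem:
            walk(child, row["row_id"])

    for doc in docs:
        walk(ET.fromstring(doc), None)
    return tables

columns = scan(SAMPLE_DOCS)
schema = build_schema(columns)
tables = convert(SAMPLE_DOCS, schema)
for name, rows in tables.items():
    print(name, rows)
```

Steps 1 and 2 run once over a representative sample, while step 3 is the part that would be repeated for each new batch. In this sketch, a new document containing an element that was never scanned would fail in step 3, which mirrors the schema-evolution problem described earlier.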

This produces guaranteed, high-quality results that remove the uncertainty and inaccuracy from laborious manual processes that may never provide the results you’re looking for.

Conclusion

Converting XML to a readable format is easy with Flexter. It is a fully automated and optimised process.

We would like to find out more about your use case. Contact us or book a demo and tell us about your challenges of working with XML and JSON.

You can find out more about Flexter by visiting the Flexter product page or the Flexter data sheet.

You can download our XML conversion checklist, “The 6 Factors You Need to Get Right to Make Your XML Conversion Project a Success”.

Read a case study on how ecobee uses Flexter to convert XML to BigQuery.



