Convert XML with Spark to Parquet

Chinmay Sinha Spark, XML

Using Spark to convert XML to Parquet, and then to query and analyse the output data, can be very easy. As I have outlined in a previous post, XML processing can be painful, especially when you need to convert large volumes of complex XML files. Apache Spark has various features that make it a perfect fit for processing XML ...

Advanced Spark Structured Streaming - Aggregations, Joins, Checkpointing

Dorian Beganovic Spark

In this post we are going to build a system that ingests real-time data from Twitter, packages it as JSON objects, and sends it through a Kafka Producer to a Kafka Cluster. A Spark Streaming application will then parse those tweets in JSON format and perform various transformations on them, including filtering, aggregations, and joins. A table in a ...

Streaming Tweets to Snowflake Data Warehouse with Spark Structured Streaming and Kafka

Dorian Beganovic Kafka, Snowflake, Spark

Streaming architecture: In this post we will build a system that ingests real-time data from Twitter, packages it as JSON objects, and sends it through a Kafka Producer to a Kafka Cluster. A Spark Streaming application will then consume those tweets in JSON format and stream them ...

Apache Spark Quickstart Packages

Uli Bethke Apache, Spark, SparkSQL

We are pleased to announce three Apache Spark Quickstart Packages. The packages are designed for companies that want to explore and evaluate Apache Spark. Example use cases: The quickstart packages can be used for various scenarios. I have listed some use cases below. You would like to evaluate a certain Spark feature and identify its benefits and limitations. You don’t ...

About the author

Uli Bethke LinkedIn Profile

Uli has 18 years’ hands-on experience as a consultant, architect, and manager in the data industry. He frequently speaks at conferences. Uli has architected and delivered data warehouses in Europe, North America, and Southeast Asia. He is a traveler between the worlds of traditional data warehousing and big data technologies.

Uli is a regular contributor to blogs and books, holds an Oracle ACE award, and chairs the Hadoop User Group Ireland. He is also a co-founder and VP of the Irish chapter of DAMA, a not-for-profit global data management organization. He has co-founded the Irish Oracle Big Data User Group.

A brief history of XML - From hype to useful data format

Vadim Mytarev Flexter, Hadoop, Spark, XML, XSD

Is XML really dead? When it first became popular about 20 years ago, XML was meant to be the one and only format to serialize, encapsulate, and exchange data: the serialization format to end all serialization formats, so to speak. This was a bold claim. Has it materialised? Over the last couple of years it has become clear that this ...