Convert XML with Spark to Parquet

Chinmay Sinha Spark, XML

Using Spark to convert XML to Parquet, and then querying and analysing the output data, can be very easy. As I have outlined in a previous post, XML processing can be painful, especially when you need to convert large volumes of complex XML files. Apache Spark has various features that make it a perfect fit for processing XML ...

Advanced Spark Structured Streaming - Aggregations, Joins, Checkpointing

Dorian Beganovic Spark

In this post we are going to build a system that ingests real-time data from Twitter, packages it as JSON objects, and sends it through a Kafka Producer to a Kafka Cluster. A Spark Streaming application will then parse those tweets in JSON format and perform various transformations on them, including filtering, aggregations, and joins. A table in a ...

Streaming Tweets to Snowflake Data Warehouse with Spark Structured Streaming and Kafka

Dorian Beganovic Kafka, Snowflake, Spark

Streaming architecture: In this post we will build a system that ingests real-time data from Twitter, packages it as JSON objects, and sends it through a Kafka Producer to a Kafka Cluster. A Spark Streaming application will then consume those tweets in JSON format and stream them ...

Apache Spark Quickstart Packages

Uli Bethke Apache, Spark, SparkSQL

We are pleased to announce three Apache Spark Quickstart Packages. The packages are designed for companies that want to explore and evaluate Apache Spark. Example use cases: The quickstart packages can be used for various scenarios. I have listed some use cases below. You would like to evaluate a certain Spark feature and identify its benefits and limitations. You don’t ...

A brief history of XML - From hype to useful data format

Vadim Mytarev Flexter, Hadoop, Spark, XML, XSD

Is XML really dead? When it first became popular about 20 years ago, XML was meant to be the one and only format to serialize, encapsulate, and exchange data: the serialization format to end all serialization formats, so to speak. This was a bold claim. Has it materialised? Over the last couple of years it has become clear that this ...