Convert XML with Spark to Parquet

Chinmay Sinha January 25, 2018

It can be very easy to use Spark to convert XML to Parquet and then query and analyse the output data. As I have outlined in a previous post, XML processing can be painful especially when you need to convert large volumes of complex XML files. Apache Spark has various features that make it a ...

Read More

Advanced Spark Structured Streaming – Aggregations, Joins, Checkpointing

Dorian Beganovic November 27, 2017

In this post we are going to build a system that ingests real time data from Twitter, packages it as JSON objects and sends it through a Kafka Producer to a Kafka Cluster. A Spark Streaming application will then parse those tweets in JSON format and perform various transformations on them including filtering, aggregations and ...

Read More

Streaming Tweets to Snowflake Data Warehouse with Spark Structured Streaming and Kafka

Dorian Beganovic November 20, 2017

Streaming Tweets to Snowflake Data Warehouse with Spark Structured Streaming and Kafka Streaming architecture In this post we will build a system that ingests real time data from Twitter, packages it as JSON objects and sends it through a Kafka Producer to a Kafka Cluster. A Spark Streaming application will then consume those tweets in ...

Read More

Apache Spark Quickstart Packages

Uli Bethke May 15, 2017

We are pleased to announce three Apache Spark Quickstart Packages. The packages are designed for companies that want to explore and evaluate Apache Spark. Example Use Cases The quickstart packages can be used for various scenarios. I have listed some use cases below. You would like to evaluate a certain Spark feature and identify its ...

Read More
1 2 3 4