Using Apache Airflow to build reusable ETL on AWS Redshift

Dorian Beganovic January 1, 2018

Building a data pipeline on Apache Airflow to populate AWS Redshift. In this post we will introduce you to the most popular workflow management tool, Apache Airflow. Using Python as our programming language, we will use Airflow to develop reusable and parameterizable ETL processes that ingest data from S3 into Redshift and perform an ...


Advanced Spark Structured Streaming – Aggregations, Joins, Checkpointing

Dorian Beganovic November 27, 2017

In this post we are going to build a system that ingests real-time data from Twitter, packages it as JSON objects, and sends it through a Kafka producer to a Kafka cluster. A Spark Streaming application will then parse those tweets in JSON format and perform various transformations on them, including filtering, aggregations and ...


Streaming Tweets to Snowflake Data Warehouse with Spark Structured Streaming and Kafka

Dorian Beganovic November 20, 2017

Streaming architecture. In this post we will build a system that ingests real-time data from Twitter, packages it as JSON objects, and sends it through a Kafka producer to a Kafka cluster. A Spark Streaming application will then consume those tweets in ...
