Converting ACORD XML to Avro row storage

Uli Bethke April 26, 2018

In this example we will use Flexter to convert an XML file to the Apache Avro format. We then query and analyse the output in the Spark-Shell. Flexter can generate a target schema from an XML file or a combination of XML and XML schema (XSD) files. We will use the data from The ACORD ...


Using Apache Airflow to build reusable ETL on AWS Redshift

Dorian Beganovic January 1, 2018

Building a data pipeline on Apache Airflow to populate AWS Redshift. In this post we will introduce you to the most popular workflow management tool, Apache Airflow. Using Python as our programming language, we will use Airflow to develop reusable and parameterizable ETL processes that ingest data from S3 into Redshift and perform an ...


Apache Spark Quickstart Packages

Uli Bethke May 15, 2017

We are pleased to announce three Apache Spark Quickstart Packages. The packages are designed for companies that want to explore and evaluate Apache Spark. Example Use Cases: The quickstart packages can be used for various scenarios. I have listed some use cases below. You would like to evaluate a certain Spark feature and identify its ...


Big Data News: Streaming in the Extreme.. An evolution in Data Processing and Analytics

Uli Bethke January 29, 2016

Google has submitted a project proposal to open-source Dataflow through the Apache Software Foundation, along with MapR on streaming across data centers. As another week comes to a close, the wheels of our big data community continue to move in cycles of innovation and progress, which, as always, never fail to impress. Google’s ...
