Big Data News: HyperLogLog with Spark and Open Source GZinga Compression

Uli Bethke October 16, 2015

Exploring the performance enhancements of HyperLogLog on Spark and adding splittable and seekable features to Gzip in a new open source project called GZinga. Life is never dull in big data, and as I left a great Spark Dublin meetup last night pondering the distributed performance enhancements of using DataFrames in Spark, I was once ...

Read More

Big Data News – MapR’s Deep Dive into Apache Drill and Gartner’s Big Data Predictions for 2020

Uli Bethke September 4, 2015

30% of all Enterprises will use intermediaries for big data by 2017. Another week has passed and our big data community has been busy around the world. Some interesting movements in our industry have arisen, with Doug Laney on Forbes making three predictions on the advancement of big data to 2020, including the rise of 3rd ...

Read More

Window Functions (aka Analytic Functions) in Spark.

Uli Bethke July 3, 2015

As of Spark 1.4.0 we now have support for window functions (aka analytic functions) in SparkSQL. At Sonra we are heavy users of SparkSQL to handle data transformations for structured data. We also use it in combination with cached RDDs and Tableau for business intelligence and visual analytics. Spark SQL and Window Functions: The ...

Read More

Multiple Spark Worker Instances on a single Node. Why more of less is more than less.

Uli Bethke June 3, 2015

If you are running Spark in standalone mode on memory-rich nodes, it can be beneficial to have multiple worker instances on the same node, as a very large heap size has two disadvantages: – Garbage collector pauses can hurt the throughput of Spark jobs. – A heap larger than 32 GB can’t use CompressedOops. So 35 ...

Read More