Exploring the performance enhancements of HyperLogLog on Spark and adding splittable and seekable features to Gzip in a new open source project called GZinga
Life is never dull in big data. As I left a great Spark Dublin meetup last night, pondering the distributed performance enhancements of using DataFrames in Spark, I was once again struck by the continuous improvement our community pursues with such passion!
Databricks published a great guest blog post called Interactive Audience Analytics With Spark and HyperLogLog by Eugene Zhulenev. Eugene covers his company’s (Collective) use case: taking an “impression log” and using HyperLogLog on Spark to return audience analytics with an error of around 2%, gaining considerable performance savings in the process.
He explores different approaches to building an impression log and transforming the data into usable formats for processing, along with the problems he encountered along the way, including SQL issues and performance overheads. His solution was audience cardinality approximation using HyperLogLog, and he goes on to explain how this clever approach created value in getting data ready for Spark. In short, the approach drops the low-level “cookie level data” and then loads HyperLogLog objects into Spark in its place. He then makes a great case for DataFrames in Spark and some of the cool functions that can return great analytics. Bear in mind that the 90 or so functions in Spark are set to approximately double in number in the next iteration, so a fine solution can only get better.
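To make the idea concrete, here is a minimal sketch of cardinality approximation with HyperLogLog in plain Python. It is not Collective's code and not Spark's implementation; the class name, the `p` parameter, and the cookie IDs are all made up for illustration. It shows why the cookie-level rows can be thrown away: each sketch is a small fixed-size array, and two sketches merge into the sketch of their union.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch (hypothetical class, not a library API)."""

    def __init__(self, p=12):
        self.p = p                    # number of bits used to pick a register
        self.m = 1 << p               # number of registers (4096 when p=12)
        self.registers = [0] * self.m

    def add(self, item):
        # Hash the item to a 64-bit integer.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                   # first p bits choose the register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # The union of two sketches is the register-wise maximum.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


# Hypothetical example: cookies seen by two ad campaigns, with some overlap.
campaign_a = HyperLogLog()
campaign_b = HyperLogLog()
for i in range(0, 60000):
    campaign_a.add(f"cookie-{i}")
for i in range(40000, 100000):
    campaign_b.add(f"cookie-{i}")

union = campaign_a.merge(campaign_b)
print(round(union.count()))   # an estimate of the 100,000 distinct cookies
```

With 4,096 registers the typical error is in the region of 1.6%, which is in line with the ballpark the article quotes. The merge step is the part that matters for Spark: only the small register arrays need to be shipped around and combined, not the raw cookie rows.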
Ever wonder how the great compression ratio of Gzip could be used on files that are split and processed in parallel? Well, you are not alone! eBay has pondered the same question and has now open sourced a project, called GZinga, that adds seekable and splittable features to Gzip. Being able to seek to a position inside a Gzip file and read from there is a value-creating feature in itself. The article also explores the splittable feature, where a Gzip file can be divided into chunks for parallel processing on Hadoop and similar platforms. The lower-level exploration of how it works is a must read for anyone looking to bring the compression power of Gzip into a distributed environment with minimal overheads. It’s a big step forward for distributed data compression and a must read for the big data developer!
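For a feel of what "splittable" means here, the sketch below uses only Python's standard `gzip` module. It is not GZinga's code or file format. It relies on the fact that the Gzip format allows several independently compressed members to be concatenated into one valid file, and it keeps a separate list of member offsets, which is an assumption made purely for illustration. The function names are hypothetical.

```python
import gzip
import io


def write_indexed_gzip(lines, lines_per_member=1000):
    """Compress lines as concatenated gzip members; return (data, offsets).

    offsets[i] is the byte position where member i starts. Keeping the index
    as a separate Python list is an illustrative shortcut, not how GZinga
    stores its metadata.
    """
    buf = io.BytesIO()
    offsets = []
    for start in range(0, len(lines), lines_per_member):
        offsets.append(buf.tell())
        chunk = "".join(line + "\n" for line in lines[start:start + lines_per_member])
        buf.write(gzip.compress(chunk.encode("utf-8")))
    return buf.getvalue(), offsets


def read_split(data, offsets, i):
    """Decompress only member i, without touching the bytes before it."""
    start = offsets[i]
    end = offsets[i + 1] if i + 1 < len(offsets) else len(data)
    return gzip.decompress(data[start:end]).decode("utf-8").splitlines()


# Hypothetical log records, split into three independently readable chunks.
records = [f"record-{i}" for i in range(2500)]
data, offsets = write_indexed_gzip(records)

# A worker handed split 1 can start at its offset and read just that chunk.
middle = read_split(data, offsets, 1)
print(len(offsets), middle[0], middle[-1])
```

The whole byte string is still a normal Gzip stream that ordinary tools can decompress end to end. The trade-off is a slightly worse compression ratio, since each chunk is compressed without knowledge of the ones before it.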
With news like that above, it’s no wonder that we hold our industry in such high esteem! With another week on the horizon, who knows what innovative news it will bring! Have a great weekend all!!
We are a Big Data company based in Ireland. We are experts in data lake implementations, clickstream analytics, real-time analytics, and data warehousing on Hadoop and Spark. We also run the Hadoop User Group (HUG) Ireland. We can help with your Big Data implementation. Get in touch today. We would love to hear from you!