Hadoop User Group Ireland Meetup (24 June 2015): An Introduction to Spark
Thanks again to everyone who attended the third Hadoop User Group Ireland meetup. Thanks also to Bank of Ireland Grand Canal Square for making the venue available. You can send the venue your feedback via their Twitter and Facebook accounts: facebook.com/BOIGrandCanalSquare
Thanks as well to Étienne from Idiro and Antonio from HP for their great presentations. All of the presentations are available for download.
The meetup was another great success with more than 50 participants. I have a couple of pictures from the event below.
Unfortunately, we were not able to record the event on video this time around, but if you have any questions just get in touch with us.
The presentations can be downloaded here.
There were a couple of points I didn't get to mention in my presentation on Spark:
- Spark achieves resilience by recomputing results from the underlying data, and in some scenarios (e.g. Spark Streaming) it also uses checkpointing. You may wonder whether we still need a distributed filesystem such as MapR-FS or S3. The answer is yes: just imagine you lose the node where the original data is located. You would not be able to recompute, as the data is gone. In summary, it is still best practice to run Spark against a distributed file system.
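To make the recomputation idea concrete, here is a toy sketch in plain Python (not actual Spark code; the names and data are made up). A "partition" is rebuilt by replaying the recorded transformations over the source data, which is exactly why losing the source itself is unrecoverable:

```python
# Toy illustration of lineage-based recomputation (not Spark internals).
source = {0: [1, 2, 3], 1: [4, 5, 6]}          # partition id -> records on "disk"
lineage = [lambda x: x * 2, lambda x: x + 1]   # recorded transformations

def compute_partition(pid):
    """Replay the lineage over the source partition."""
    if pid not in source:                      # the node holding the data died
        raise RuntimeError("source partition lost - cannot recompute")
    records = source[pid]
    for step in lineage:
        records = [step(r) for r in records]
    return records

cache = {}                                     # in-memory results, may be evicted

def get_partition(pid):
    if pid not in cache:                       # cache miss -> recompute from lineage
        cache[pid] = compute_partition(pid)
    return cache[pid]

print(get_partition(0))                        # [3, 5, 7]
cache.clear()                                  # simulate losing the in-memory copy
print(get_partition(0))                        # recomputed from source: [3, 5, 7]
```

If `source` itself disappears, `compute_partition` has nothing to replay from, which is the failure mode a replicated distributed filesystem protects against.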
- If you write an RDD out to HDFS or MapR-FS you should set the replication factor to 1 (at the moment you can only override this at the level of a whole Spark application). Resilience for RDDs is achieved through recomputation from the lineage graph.
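One way to apply such a per-application override (a sketch, assuming an HDFS-backed cluster) is Spark's `spark.hadoop.*` passthrough, which copies the property into the Hadoop `Configuration` used for writes:

```
# spark-defaults.conf (or pass via spark-submit --conf)
spark.hadoop.dfs.replication  1
```

This affects every write the application performs, which is why it is an application-level rather than per-RDD setting.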
- While RDDs sit in memory, any shuffle/reduce phase will serialize the data and write it to disk before sending it across the network, so this is not pure in-memory computing. Tachyon could help here as well.
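The shuffle boundary can be sketched in plain Python (a toy illustration of the pattern, not Spark's actual shuffle machinery; all names are invented): each map task serializes its output, grouped by reducer, to local disk, and reducers then read those files back.

```python
import os
import pickle
import tempfile
from collections import defaultdict

def map_side(records, num_reducers, out_dir):
    """Partition (key, value) pairs by reducer and spill each bucket to disk."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[hash(key) % num_reducers].append((key, value))
    files = {}
    for rid, rows in buckets.items():
        path = os.path.join(out_dir, f"shuffle_{rid}.bin")
        with open(path, "wb") as f:     # serialized to disk, not kept in RAM
            pickle.dump(rows, f)
        files[rid] = path
    return files

def reduce_side(path):
    """Read one spilled bucket back and sum values per key."""
    with open(path, "rb") as f:
        rows = pickle.load(f)
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)

with tempfile.TemporaryDirectory() as d:
    files = map_side([("a", 1), ("b", 2), ("a", 3)], 1, d)
    print(reduce_side(files[0]))        # {'a': 4, 'b': 2}
```

Even though the map and reduce logic is "in memory", the hand-off between them goes through serialized files, which is the cost the bullet above refers to.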
- At Sonra we completely denormalize data sets (for querying only). Kimball-style dimensional modelling is dead in the age of Big Data: columnar compression algorithms absorb the redundancy that denormalization introduces.