Talk Abstract: Interactively Analyse 100GB of Data using Spark, Amazon EMR and Zeppelin.
You may have been hearing a lot of buzz around Big Data, Apache Spark, Amazon Elastic Map Reduce (EMR) and Apache Zeppelin. What’s the fuss about, and how can you benefit from these state of the art technologies?
In this highly interactive session, you will learn how to leverage Spark to rapidly mine a large real-world data set. We will conduct the analysis live entirely using an iPython Notebook to show you how easy it can be to get to grips with these technologies. In the first part of the session, we will use a sample of data from the Open Library dataset, and you will learn how to apply common Spark patterns to extract insights and aggregate data. In the second part of the session, you will see how to leverage Spark on Amazon EMR to scale your data processing queries over a cluster of machines and interactively analyse a large data set (100GB) with a Zeppelin Notebook. Along the way you will learn gotchas as well as useful performance and monitoring tips.
Bio: Raoul-Gabriel Urma is CEO and co-founder of Cambridge Spark, a leading learning community for data scientists and developers in UK. In addition, he is also Chairman and co-founder of Cambridge Coding Academy, a growing community of young coders and pre-university students. Raoul is author of the bestselling programming book “Java 8 in Action” which sold over 20,000 copies globally. Raoul completed a PhD in Computer Science at the University of Cambridge. During his PhD he initiated an open source research project at Google which adds enhanced runtime checking to Python. In addition, he holds a MEng in Computer Science from Imperial College London and graduated with first class honours having won several prizes for technical innovation. Raoul has delivered over 100 technical talks at international conferences. He has worked for Google, eBay, Oracle, and Goldman Sachs. He is also a Fellow of the Royal Society of Arts.