Reflections on Spark Summit 2017 - synvert Data Insights

About Spark Summit

Spark, a fast and general purpose engine for big data processing, has an annual European conference (named Spark Summit) for a bidirectional exchange between the creators/developers of Spark and a large user base. This year, the Spark Summit was held in Dublin between October 24th and 26th, and was based around Spark Streaming, Structured Spark, future releases, best practices for deployment, and using Spark in different environments. Yigit and I (Adriana), both Data Scientists at Data Insights, had the opportunity to attend and to glean some new knowledge and insights.

Spark Training

The first day was dedicated to trainings. It was helpful that some of this training was hands-on, and that we could exchange Spark experience with other attendees. The Data Science based training focused on a random forest machine larning model (https://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/) that has been used in a variety of use cases. Spark was presented as a great tool for implementing complex ML models, however, it is necessary to follow quite stringent best practice guidelines.

The main concepts for developing Spark applications include the following:

DataFrames are strongly suggested in comparison to RDD in Spark 2.0 and the following versions. This facilitates readability, optimization and implementation of ML pipelines
Using Pipelines while implementing machine learning models brings huge benefits, and permits one to keep a good structure of the code, even when comparing different models. (Transformer, Estimator)
Catalyst optimization

Talks

We attended a talk from BBVA on Classifying Text in Money Transfers. A nice use case on Sentiment Analysis was presented, and it was presented how TF-IDF, Word2vec and LAD can be implemented and production ready with Spark. Another interesting example of a basic machine learning method using Spark was presented by Leonard Dali (from Zalando) introducing a self implemented linear regression sparse case.

One presentation that we found really interesting was from Holden (IBM) and based on Spark testing. Writing our own spark code we had to face some issues of unit testing and testing at scale. We would definitely take into consideration the Spark Testing based on https://github.com/holdenk/spark-testing-base.

Considerations

Scala-Python-R?

The possibility to write spark code using Scala Python and R permits developers to use the language in which they are most comfortable. While this brings more flexibility, during the Spark summit Scala was strongly being suggested. In our opinion, the number of people using Scala for Spark will increase, but there are still constraints that make data scientists prefer Python.

In particular, even if a new library (Vegas Netfilx) was presented during the summit, the visualization library of Spark looks more professional and usable for the companies.

Udf or external Libraries

The main idea in Spark is to use their own standard functions, in particular when certain machine learning algorithms are implemented. In many use cases, we have had to use some external libraries to maintain the parallelization of Spark. Here is a good example of that

https://databricks-prod-cloudfront.cloud.databricks.com/public/13fe59d17777de29f8a2ffdf85f52925/5638528096339357/1867405/6918044996430578/latest.html