Published by Simon Dite at


Spark Summit Deep Learning – Volume 1

A very interesting talk about Deep Learning Pipelines on Spark was given by Tim Hunter (Ph.D. from UC Berkeley) and Sue Ann Hong (Ph.D. from CMU) at the Spark Summit in Dublin, at the Convention Center.

Deep learning can be summarized as a set of machine learning techniques which use multiple levels of nonlinear information processing to transform numerical inputs. Deep learning is very effective, especially in the presence of big datasets. It has become popular since 2006 and has been successfully applied to unsupervised feature extraction and to supervised classification problems, particularly in image and speech recognition, automatic translation and game playing (AlphaGo). Currently, adoption of deep learning in industry is still limited. Considering the huge amount of data that big companies have, the move towards deep learning in production is bound to accelerate.

Regarding the availability of Deep Learning on Spark:

Databricks released “Deep Learning Pipelines” for Spark. It is an open-source library which focuses on ease of use and integration. Its primary language is Python. (When I heard this, it was a euphoric moment for me, since I see Python as a miracle for data scientists. It might not be the best for production, but I think Python is very convenient for getting some first results and a quick visualization of them.)

A typical Deep Learning workflow resembles the workflow of any other machine learning model. It includes:

  • Loading the data
  • Interactive work
  • Training the model
    • Selecting an appropriate architecture for the neural network
    • Optimizing the weights
  • Evaluating results / retraining
  • Apply:
    • Pass the data through the NN to produce new features or output

Let’s dig a little bit deeper:

Loading Data

Spark provides support for image loading, and I find it quite useful since image recognition is a task which repeatedly comes into play.

from sparkdl import readImages
# sample_img_dir is the path to a directory of images
image_df = readImages(sample_img_dir)  # read the images into a dataframe


Two main topics fall under training: transfer learning and hyperparameter tuning.

Transfer learning

Popular pre-trained models are accessible through MLlib transformers: this makes it possible to do transfer learning by loading some of the available predictors. I think that it is always a good idea to use transfer learning at the beginning of a project to get some insight and intuition about the problem and the dataset. In the example below, you’ll see the Inception-V3 model, which was designed for the ImageNet Large Scale Visual Recognition Challenge:

predictor = DeepImagePredictor(inputCol="image", outputCol="predicted_labels",  # output column name is an assumption
                               modelName="InceptionV3")

Pre-trained models are generally not directly applicable to specific problems, whereas training on huge amounts of data from scratch (end-to-end learning) requires a lot of computational resources and time.

This is where intermediate representations come in: the features learned by a network for one task can be reused for another. Based on my experience, I would strongly suggest using them, especially for classification problems. One example is initializing your own first hidden layer with the weights of the first hidden layer of AlexNet – please remember that this forces your hidden layer to have the same size as the corresponding layer of the pretrained network.
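The size constraint can be made concrete with a small sketch in plain Python. It is not tied to AlexNet or to any library: `pretrained_first_layer` stands in for weights taken from a pretrained network (the values are made up), and `init_first_layer` is a hypothetical helper name.

```python
import random


def init_first_layer(n_inputs, n_hidden, pretrained=None, seed=0):
    """Return an n_hidden x n_inputs weight matrix as a list of rows.

    If `pretrained` is given, copy it, which only works when its shape
    matches the layer being initialized; otherwise use small random values.
    """
    if pretrained is not None:
        shape = (len(pretrained), len(pretrained[0]))
        if shape != (n_hidden, n_inputs):
            raise ValueError(
                f"pretrained layer has shape {shape}, "
                f"expected {(n_hidden, n_inputs)}"
            )
        return [list(row) for row in pretrained]  # copy, do not alias
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
            for _ in range(n_hidden)]


# Stand-in for weights taken from a pretrained network (hypothetical values).
pretrained_first_layer = [[0.2, -0.1, 0.4],
                          [0.0, 0.3, -0.5]]

weights = init_first_layer(n_inputs=3, n_hidden=2,
                           pretrained=pretrained_first_layer)
```

Asking for a layer of a different size, for example `n_hidden=4`, with the same pretrained weights raises a `ValueError`, which is exactly the restriction described above.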

For transfer learning, two core concepts in MLlib should be revisited: transformers and estimators. The input to a transformer is a Spark dataframe. The output is again a Spark dataframe, which is ‘transformed’. Trained models and feature transformers are some examples of transformers.

An estimator is a learning algorithm. Its fit() method takes a dataframe and produces a model, which is itself a transformer, e.g.

model = estimator.fit(train_df, params)
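To make the distinction concrete, here is a minimal sketch in plain Python, with lists of dicts standing in for Spark dataframes. The class names `MeanCenterer` and `MeanCentererModel` are hypothetical and not part of MLlib; only the split between fit() and transform() mirrors the MLlib convention.

```python
class MeanCentererModel:
    """Transformer: maps rows to new rows with an extra column."""

    def __init__(self, input_col, output_col, mean):
        self.input_col = input_col
        self.output_col = output_col
        self.mean = mean

    def transform(self, rows):
        # Return new rows; the input rows are left untouched.
        return [{**row, self.output_col: row[self.input_col] - self.mean}
                for row in rows]


class MeanCenterer:
    """Estimator: learns a parameter from data and returns a transformer."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        values = [row[self.input_col] for row in rows]
        mean = sum(values) / len(values)
        return MeanCentererModel(self.input_col, self.output_col, mean)


train_rows = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
model = MeanCenterer(input_col="x", output_col="x_centered").fit(train_rows)
centered = model.transform(train_rows)
```

The estimator holds only configuration, while the model returned by fit() holds what was learned (here the mean) and can be applied to any other dataset with the same column.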

Hyperparameter tuning

One of the biggest bottlenecks of deep learning models is the number of hyperparameters to optimize. What Spark provides for coping with this problem (which is a serious one) is distributed hyperparameter tuning. It lets you experiment with different settings of the hyperparameters simultaneously. For example, two different models can be trained at the same time with different learning rates and mini-batch sizes. In practice, two concepts of MLlib come into play again: estimators and evaluators.

As mentioned above, estimators are the learning algorithms. Evaluators are used to measure the accuracy of the trained model on the validation data. Model selection can then be thought of as choosing the parameters that give the best accuracy on the validation data. MLlib also provides a cross validator, which is itself an estimator.
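Model selection over a grid can be sketched without Spark. In this sketch, `train_and_score` is a hypothetical stand-in for fitting a model and evaluating it on validation data (it uses a made-up scoring formula so that the example runs), and the thread pool only illustrates that the settings are independent of each other; Spark would distribute this work across a cluster instead.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [32, 64]


def train_and_score(params):
    """Hypothetical stand-in: 'train' with params, return validation accuracy."""
    lr, batch_size = params
    # Made-up formula that peaks at lr=0.01, batch_size=64.
    return 0.9 - abs(lr - 0.01) - abs(batch_size - 64) / 1000


# Every combination of the two hyperparameters.
grid = list(product(learning_rates, batch_sizes))

# The settings do not depend on each other, so they can run concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(train_and_score, grid))

# Keep the setting with the highest validation score.
best_score, best_params = max(zip(scores, grid))
```

With three learning rates and two batch sizes the grid has six settings, which shows how quickly the number of models to train grows with each additional hyperparameter.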

In the example below, we want to choose between ResNet50 and InceptionV3, using cross validation on our pipeline. Spark takes care of loading the predefined architectures and distributing them.

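# Uses Pipeline, LogisticRegression, ParamGridBuilder, CrossValidator and MulticlassClassificationEvaluator from pyspark.ml, DeepImageFeaturizer from sparkdl
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features")
pipeline = Pipeline(stages=[featurizer, LogisticRegression()])
paramGrid = (ParamGridBuilder()
             .addGrid(featurizer.modelName, ["ResNet50", "InceptionV3"])
             .build())
# The choice of evaluator is an assumption made for illustration
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid,
                    evaluator=MulticlassClassificationEvaluator())
bestModel = cv.fit(train_df).bestModel  # train_df: labeled training dataframe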
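What a cross validator does with each candidate can be sketched in plain Python: split the data into k folds, train on k-1 of them, score on the held-out fold, and average the scores. The helper names `k_fold_splits` and `cross_val_score` are hypothetical, and the "model" is just a mean predictor so that the example stays self-contained.

```python
def k_fold_splits(n_rows, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_rows))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def cross_val_score(values, k, fit, score):
    """Average validation score of `fit` over k folds."""
    fold_scores = []
    for train_idx, val_idx in k_fold_splits(len(values), k):
        model = fit([values[i] for i in train_idx])
        fold_scores.append(score(model, [values[i] for i in val_idx]))
    return sum(fold_scores) / len(fold_scores)


# Toy "model": predict the training mean; score is negative mean absolute error.
def fit_mean(train):
    return sum(train) / len(train)


def neg_mae(model, val):
    return -sum(abs(v - model) for v in val) / len(val)


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
avg_score = cross_val_score(data, k=3, fit=fit_mean, score=neg_mae)
```

Running this once per entry of the parameter grid and keeping the entry with the best average score is the essence of model selection by cross validation.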

If you prefer not to use transfer learning, you can use the Keras estimator to train a model with your own architecture. Keras itself runs on top of TensorFlow, CNTK or Theano.

model = Sequential()
model.add(...) # definition & addition of layers

estimator = KerasImageFileEstimator(
    kerasOptimizer="adam", # adaptive moment estimation
model =

The model can also be trained with multiple hyperparameter settings, where Spark again takes care of distributing the computations for you to find the best possible model.

Applying a Model for Production

The good news for people who do not specialize in deep learning is that a trained model or pipeline can be deployed to production as a Spark SQL function. A query then looks as follows:

SELECT image, my_object_recognition_function(image) AS objects  
FROM traffic_imgs

I think this is a great chance and a driving force to use Spark for the deployment of deep learning models. Cheers!
