A simple tutorial on how to implement a Shallow Neural Network (3 fully connected layers) using PySpark.

Photo by Jamie Street on Unsplash

This article is not intended to provide a mathematical explanation of neural networks, but only to show how to apply their equations using Spark (MapReduce) logic in Python. For simplicity, the implementation uses only RDDs (no DataFrames).
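To give a feel for what "MapReduce logic with RDDs" looks like in practice, here is a minimal sketch (the data and variable names are illustrative, not taken from the tutorial's code): a small dataset is distributed as an RDD, a transformation is applied element-wise with map, and the partial results are combined with reduce.

```python
from pyspark import SparkContext

# Reuse an existing context if one is already running (e.g., in a notebook).
sc = SparkContext.getOrCreate()

# Distribute a small dataset as an RDD, square each element with map,
# then combine the partial results with reduce.
data = sc.parallelize([1.0, 2.0, 3.0, 4.0])
sum_of_squares = data.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(sum_of_squares)  # 30.0
```

Every step of the network (forward pass, gradient computation) can be expressed as a combination of such map and reduce operations.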

I also assume that you have Spark installed on your machine and that you can either run a spark-submit job or a PySpark Jupyter notebook.
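As a quick sanity check of your setup, a script along these lines (a hypothetical minimal_check.py, not part of the repository) should work in both cases:

```python
# Hypothetical minimal_check.py, launched with:  spark-submit minimal_check.py
# In a PySpark notebook, `sc` usually already exists, so only the two print
# statements are needed.
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="pyspark-nn-check")
    print("Spark version:", sc.version)       # confirms PySpark is usable
    print(sc.parallelize(range(10)).count())  # should print 10
    sc.stop()
```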

All the code provided in this tutorial is available on this GitHub Repository.

Just in case, here are some resources to set up your machine to be able to run the code:

Also, throughout this article, I will base my explanations on one of my previous Medium articles, which explains the math behind a 3-layer neural network. Most of the mathematical formulas I provide are extracted from it and discussed here:
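As a quick reminder, and in my own notation rather than a reproduction of that article's equations, the forward pass of a 3-layer fully connected network can be sketched as follows, with an activation function σ (e.g., sigmoid) and a softmax output for classification:

```latex
a^{(1)} = \sigma\left(W^{(1)} x + b^{(1)}\right), \qquad
a^{(2)} = \sigma\left(W^{(2)} a^{(1)} + b^{(2)}\right), \qquad
\hat{y} = \operatorname{softmax}\left(W^{(3)} a^{(2)} + b^{(3)}\right)
```

Training then consists of minimizing a loss (e.g., the cross-entropy between the prediction and the true label) by backpropagating its gradient through these three layers.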

Gif created by author on imgflip.com

If you are already familiar with deep learning, you may have already run into GPU/CPU memory limitations. They usually occur when you feed the model more input data (through a large batch size or a large input feature space) than your hardware can handle.

Given that constraint, it seems almost impossible to feed several hundred gigabytes of data to your model. This is even more relevant for machine learning algorithms that rely mostly on CPU computation (linear regression, SVM, logistic regression, Naive Bayes, …).

Spark is a powerful solution for processing very large amounts of data. It lets you distribute computation across a network of computers (often called a cluster) and facilitates the implementation of iterative algorithms that scan a dataset multiple times in a loop, which is exactly what training a neural network requires. Spark is widely used in machine learning projects.
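To make the "iterative algorithms" point concrete, here is a hedged sketch of the pattern: the data is cached as an RDD once, and each pass of a training loop re-scans it with a map/reduce step. This toy one-dimensional linear regression (illustrative names and values only) uses the same general structure that a gradient-descent training loop relies on.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# (x, y) pairs that roughly follow y = 2x; cached so every iteration
# re-reads them from memory instead of recomputing the RDD.
points = sc.parallelize([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]).cache()
n = points.count()

w = 0.0    # single weight to learn
lr = 0.01  # learning rate
for _ in range(100):
    # One full pass over the cached data per iteration (mean-squared-error gradient).
    grad = points.map(lambda p: (w * p[0] - p[1]) * p[0]).reduce(lambda a, b: a + b)
    w -= lr * grad / n

print(w)   # converges close to 2.0
```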

As you might already know, popular libraries like TensorFlow or PyTorch are generally used to build neural networks. One of the benefits of these libraries is GPU support, which speeds up training through parallel computation.

The latest versions of Spark also support GPUs, but in this article we will focus on CPU computation only (like most from-scratch implementations of neural networks) to keep things simple. The implementation proposed here is for learning purposes and does not aim to meet industrial needs.

MNIST dataset sample, Image by author

For this tutorial, we will try to solve the well-known MNIST…
