
Tag: PySpark

A PySpark Example for Dealing with Larger than Memory Datasets

A step-by-step tutorial on how to use Spark to perform exploratory data analysis on larger-than-memory datasets.

Analyzing datasets that are larger than the available RAM using Jupyter notebooks and pandas DataFrames is a challenging problem. It has already been addressed (for instance here or here), but my objective is a little different. I will present a method for performing exploratory analysis on a large dataset with the purpose of identifying and filtering out…

Complete Guide to Spark and PySpark Setup for Data Science

A complete A-Z guide on how to set up Spark for Data Science, including using Spark with Scala and with Python via PySpark, as well as integration with Jupyter notebooks.

Photo by Rakicevic Nenad from Pexels


Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing. It is fast becoming the de facto tool for data scientists investigating big data.

Like most data scientists, I have always reached for Python for everything from data collection with web-scraping tools such as Scrapy and Selenium, to data wrangling with pandas, to machine learning and deep learning with the fantastic libraries available in Python such as PyTorch and TensorFlow.


2 Mistakes I Made at My First Job as a Data Scientist

Duplicate rows

This is a more specific mistake than the previous one and has a simpler solution: Just check for duplicates.

When I was learning data science, I usually practiced with a single dataset, so I did not need join, merge, or concatenation operations very often. In real-life tasks, however, the data is spread across many tables or data frames, and you need to perform many operations to combine the required data.

Our tech stack is quite rich. I often use Python, SQL, R, and PySpark in my daily tasks. The functions and methods for combining data with these tools are similar but each has its own syntax.

I think the biggest risk when combining data is creating duplicate data points (or rows). If you apply the join or merge operation properly, the resulting data frame or table will not contain any duplicates.
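
As a quick illustration (a made-up pandas example, not from the article: the table names and values are hypothetical), a duplicate key on one side of a join silently inflates the row count, and a one-line post-join check catches it:

```python
import pandas as pd

# Two toy tables: orders, and a customer lookup that accidentally
# contains a repeated customer row (customer_id 20 appears twice).
orders = pd.DataFrame({"order_id": [1, 2, 3],
                       "customer_id": [10, 10, 20]})
customers = pd.DataFrame({"customer_id": [10, 20, 20],
                          "city": ["Rome", "Oslo", "Oslo"]})

merged = orders.merge(customers, on="customer_id", how="left")

# The duplicate key on the right side silently inflates the row count:
print(len(orders), len(merged))   # 3 orders became 4 merged rows

# A quick check on a column that should be unique catches it:
dupes = merged.duplicated(subset="order_id").sum()
print(dupes)   # 1 duplicated order row
```

The same idea carries over to PySpark and SQL: compare row counts before and after the join, or count distinct values of the key that should stay unique.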


PySpark Neural Network from Scratch

A simple tutorial to learn how to implement a Shallow Neural Network (3 fully connected layers) using PySpark.

Photo by Jamie Street on Unsplash

This article is not intended to provide mathematical explanations of neural networks, but only to explain how to apply the mathematical equations to run it using Spark (MapReduce) logic in Python. For simplicity, this implementation only uses RDDs (and no DataFrames).

Similarly, I assume that you have Spark installed on your machine and that you can run either a spark-submit job or a PySpark Jupyter notebook.

All the code provided in this tutorial is available on this GitHub Repository.

Just in case, here are some resources to set up your machine to be able to run the code:

Also, throughout this article, I will base my explanation on one of my previous Medium articles, which explains the math behind a 3-layer neural network.


From pandas to PySpark

Now, let’s look at the syntax comparisons between the two libraries. Throughout this section, only PySpark outputs will be shown to keep the post less cluttered.

📍 1.1. Basics

Both libraries’ data objects are called DataFrame: pandas DataFrame vs PySpark DataFrame. Let’s import the data and check its shape:

# 🐼 pandas
df = pd.read_csv('penguins.csv')
df.shape

# 🎇 PySpark
df = spark.read.csv('penguins.csv', header=True, inferSchema=True)
df.count(), len(df.columns)

When importing data with PySpark, the first row is used as a header because we specified header=True, and data types are inferred to more suitable types because we set inferSchema=True. If you are curious, try importing without these options and inspect the resulting DataFrame and its data types (similar to pandas, you can check data types with df.dtypes).


Artificial Neural Network Using PySpark

An implementation of a neural network using PySpark for a binary class prediction use-case


Development & Testing of ETL Pipelines for AWS Locally

By Subhash Sreenivasachar, Software Engineer Technical Lead at Epsilon


AWS plays a pivotal role in helping engineers and data scientists focus on building solutions and solving problems without worrying about setting up infrastructure. With its serverless, pay-as-you-go pricing model, AWS makes it easy to create services on the fly.

AWS Glue is widely used by data engineers to build serverless ETL pipelines, with PySpark being one of the most common tech stacks used for development. However, despite the availability of these services, there are certain challenges that need to be addressed:

  • Debugging code in the AWS environment, whether for an ETL script (PySpark) or any other service, is a challenge.
  • Ongoing monitoring of AWS service usage is key to keeping costs under control.
  • AWS does offer a Dev Endpoint with all the Spark libraries installed, but considering the price, it is not viable for large development teams.
  • Accessibility of AWS services may be limited for certain users.


Solutions for AWS can be developed and tested in a local environment without worrying about accessibility or cost.