
Tag: Dataframes

3 Ways to Render Pandas DataFrames

A practical guide to instant documentation

Photo by Thought Catalog on Unsplash

Systems read files, but humans read documents. The more stories we document, the more collaborative our analysis becomes. Dataframes are the key structures of analysis. They hold the critical data supporting key decisions, at several stages, in the process of solving a problem. Often, decision-making involves multiple stakeholders who need to look at the data held in these dataframes. As a good data scientist and…

Enhanced Tabular Data Visualization (Pandas)

Simple but efficient techniques to improve pandas dataframe representation

From Pixabay

In this article, we’ll discuss some useful options and functions for efficiently visualizing dataframes as tabular data in pandas. Let’s start by creating a dataframe for the experiments that follow:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(20, 40))

# Renaming columns
df.columns = list('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMN')

# Adding some missing values
df.iloc[3, 4] = np.nan
df.iloc[2, 0] = np.nan
df.iloc[4, 5] = np.nan
df.iloc[0, 6] = np.nan

df.head()
Image by Author

Attention: the code in this article was run with pandas version 1.3.2. Some of the functions are quite new and will throw an error in older versions.
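
As a taste of the kind of settings involved (a sketch using pandas’ public display options, not necessarily the article’s exact examples):

# Show all columns instead of truncating wide frames
pd.set_option('display.max_columns', None)

# Round displayed floats to two decimals
pd.set_option('display.precision', 2)

# Styler-based highlighting of missing values
df.head().style.highlight_null()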

Read more...

If You Can Write Functions, You Can Use Dask

By Hugo Shi, Founder of Saturn Cloud

I’ve been chatting with many data scientists who’ve heard of Dask, the Python framework for distributed computing, but don’t know where to start. They know that Dask can probably speed up many of their workflows by having them run in parallel across a cluster of machines, but the task of learning a whole new methodology can seem daunting. I’m here to tell you that you can start getting value from Dask without having to learn the entire framework. If you spend time waiting for notebook cells to execute, there’s a good chance Dask can save you time. Even if you only know how to write Python functions, you can take advantage of this without learning anything else! This blog post is a “how to use Dask without learning the whole thing” tutorial.
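
To make that concrete, here is a minimal sketch (an illustrative example, not from the post) of wrapping plain Python functions with dask.delayed; the decorator and the .compute() call are standard Dask API:

import dask

@dask.delayed
def square(x):
    return x ** 2

@dask.delayed
def total(values):
    return sum(values)

# Calling delayed functions builds a lazy task graph; nothing runs yet
squares = [square(i) for i in range(10)]
result = total(squares)

# compute() executes the graph, in parallel where the graph allows
print(result.compute())  # 285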

Read more...

Shuffling Rows in Pandas DataFrames

First, let’s create an example pandas DataFrame that we’ll reference throughout this article in order to demonstrate how to shuffle the rows in many different ways.

import pandas as pd
df = pd.DataFrame({
    'colA': [10, 20, 30, 40, 50, 60],
    'colB': ['a', 'b', 'c', 'd', 'e', 'f'],
    'colC': [True, False, False, True, False, True],
    'colD': [0.5, 1.2, 2.4, 3.3, 5.5, 8.9],
})
print(df)
   colA colB   colC  colD
0    10    a   True   0.5
1    20    b  False   1.2
2    30    c  False   2.4
3    40    d   True   3.3
4    50    e  False   5.5
5    60    f   True   8.9
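
As a quick taste of one common approach (a sketch built on the standard df.sample API; the article itself walks through several methods):

# Sample all rows in random order, then rebuild a clean index;
# random_state makes the shuffle reproducible
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled)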

Continue reading: https://towardsdatascience.com/shuffling-rows-in-pandas-dataframes-eda052275635

Source: towardsdatascience.com

Imputing Missing Values using the SimpleImputer Class in sklearn

Learn how to use the SimpleImputer class to replace NaNs in your Pandas DataFrames


One of the tasks you need to perform before training a machine learning model is data preprocessing. Data cleansing is a key part of this task, and it usually involves removing rows with empty values or replacing them with imputed values.

To “impute” means to assign a value to something by inference from the values of the products or processes to which it contributes. In statistics, imputation is the process of replacing missing data with substituted values.

In this article, I will show you how to use the SimpleImputer class in sklearn to quickly and easily replace missing values in your Pandas dataframes.
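
As a hedged preview of the pattern (a sketch with a toy dataframe, not necessarily the article’s own example), SimpleImputer can replace NaNs with, say, the column mean:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy dataframe with missing values (illustrative only)
df = pd.DataFrame({'age': [25, np.nan, 40],
                   'salary': [50000, 60000, np.nan]})

# Replace each NaN with the mean of its column
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)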

Read more...

How To Delete Rows From Pandas DataFrames Based on Column Values

First, let’s create an example DataFrame that we’ll reference across this article in order to demonstrate a few concepts that will help us understand how to delete rows from pandas DataFrames.

import pandas as pd

df = pd.DataFrame({
    'colA': [1, 2, 3, 4, None],
    'colB': [True, True, False, False, True],
    'colC': ['a', None, 'c', None, 'e'],
    'colD': [0.1, None, None, None, 0.5],
})
print(df)
   colA   colB  colC  colD
0   1.0   True     a   0.1
1   2.0   True  None   NaN
2   3.0  False     c   NaN
3   4.0  False  None   NaN
4   NaN   True     e   0.5
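
As a hedged preview of where this is heading (boolean indexing is standard pandas; the article itself covers more variants), deleting rows based on a column value amounts to keeping only the rows that pass a condition:

# Keep only rows where colA is present and greater than 1;
# this effectively deletes the rows that fail the condition
filtered = df[df['colA'].notna() & (df['colA'] > 1)]
print(filtered)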

Continue reading: https://towardsdatascience.com/delete-row-from-pandas-dataframes-based-on-column-value-4b18bb1eb602

Source: towardsdatascience.com

Is Hands-On Knowledge More Important than Theory?

Photo by Katherine Volkovski on Unsplash
Read more...

How to Rename Columns in Pandas — A Quick Guide

A short guide on multiple options for renaming columns in a pandas dataframe

Photo by Giulio Gabrieli on Unsplash

Ensuring that dataframe columns are appropriately named is essential for understanding the data they contain, especially when we pass our data on to others. In this short article, we will cover a number of ways to rename columns within a pandas dataframe.

But first, what is Pandas? Pandas is a powerful, fast, and commonly used Python library for carrying out data analytics. The Pandas name itself stands for “Python Data Analysis Library”, although, according to Wikipedia, it originates from the term “panel data”. It allows data to be loaded from a number of file formats (CSV, XLS, XLSX, Pickle, etc.) and stored in table-like structures.
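
As a quick taste (a sketch with a toy dataframe; the article covers more options), the two most common approaches are rename() with a mapping and assigning to df.columns directly:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Option 1: rename specific columns with a mapping
df = df.rename(columns={'a': 'colA', 'b': 'colB'})

# Option 2: replace all column names at once
df.columns = ['first', 'second']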

Read more...

PySpark Neural Network from Scratch

A simple tutorial to learn how to implement a Shallow Neural Network (3 fully connected layers) using PySpark.

Photo by Jamie Street on Unsplash

This article is not intended to provide mathematical explanations of neural networks, but only to explain how to apply the mathematical equations to run it using Spark (MapReduce) logic in Python. For simplicity, this implementation only uses RDDs (and no DataFrames).
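
To give a feel for that style (an illustrative sketch of RDD map/reduce in PySpark, not the article’s actual network code), here is a forward-pass-like weighted sum computed across partitions:

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('rdd-sketch').getOrCreate()
sc = spark.sparkContext

w = np.array([0.5, -0.2])                     # toy weight vector
rows = sc.parallelize([np.array([1.0, 2.0]),  # toy input rows
                       np.array([3.0, 4.0])])

# Map: per-row dot product; Reduce: sum the partial results
total = rows.map(lambda x: float(x @ w)).reduce(lambda a, b: a + b)
print(total)  # (0.5 - 0.4) + (1.5 - 0.8) = 0.8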

I also assume that you have Spark installed on your machine and that you can run either a spark-submit job or a PySpark Jupyter notebook.

All the code provided in this tutorial is available on this GitHub Repository.


Also, throughout this article, I will base my explanation on one of my previous Medium articles, which explains the math behind a 3-layer neural network.

Read more...

From pandas to PySpark

Now, let’s look at the syntax comparisons between the two libraries. Throughout this section, only PySpark outputs will be shown to keep the post less cluttered.

📍 1.1. Basics

Both libraries’ data objects are called DataFrame: pandas DataFrame vs PySpark DataFrame. Let’s import the data and check its shape:

# 🐼 pandas
df = pd.read_csv('penguins.csv')
df.shape

# 🎇 PySpark
df = spark.read.csv('penguins.csv', header=True, inferSchema=True)
df.count(), len(df.columns)

When importing data with PySpark, the first row is used as a header because we specified header=True, and data types are inferred because we set inferSchema=True. If you are curious, try importing without these options and inspect the DataFrame and its data types (similar to pandas, you can check data types using df.dtypes).

Read more...

Working with Multi-Index Pandas DataFrames

Learn how to work with multi-index dataframes with ease

Most learners of Pandas are familiar with what a dataframe looks like, as well as how to extract rows and columns using the loc[] and iloc[] indexer methods. However, things can get really hairy when multi-index dataframes are involved. A multi-index (also known as hierarchical index) dataframe uses more than one column as the index. A multi-index dataframe lets you store your data in a multi-dimensional format, and opens up a lot of exciting ways to represent your data.

In this article, I am going to walk you through how to manipulate a multi-index dataframe, and some of the pitfalls you may encounter. So strap on your seat belt — it is going to be a roller coaster ride!
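
Before diving in, here is a minimal sketch (toy data, illustrative only) of what a multi-index dataframe looks like and how loc[] drills into it:

import pandas as pd

df = pd.DataFrame(
    {'sales': [100, 150, 200, 250]},
    index=pd.MultiIndex.from_tuples(
        [('2021', 'Q1'), ('2021', 'Q2'), ('2022', 'Q1'), ('2022', 'Q2')],
        names=['year', 'quarter'],
    ),
)

# loc with a tuple selects on both index levels at once
print(df.loc[('2021', 'Q2')])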

Read more...

Joining Pandas DataFrames

Learn how to merge Pandas Dataframes easily

Very often, your data comes from different sources. To help with your analytics, you often need to combine these sources to obtain the data you need. In this article, I will talk about how you can merge (join) Pandas dataframes. Most articles on this topic use simplistic dataframes to illustrate the concepts of dataframe joining — inner, outer, left, and right joins. For me, a much better way to understand this topic is to use a more realistic example, so that you can understand and better retain the concepts.

Let’s get started!

The first thing to do is to create the two dataframes. The first holds a list of flight numbers and the airports they are departing from:

import pandas as pd

df_flights = pd.DataFrame(
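
The snippet is cut off here, but as a rough sketch of the idea (hypothetical toy data, not the article’s actual values), an inner join on a shared flight_no key might look like this:

import pandas as pd

# Hypothetical frames standing in for the article's data
df_flights = pd.DataFrame({'flight_no': ['AA1', 'BA2'],
                           'origin': ['JFK', 'LHR']})
df_status = pd.DataFrame({'flight_no': ['AA1', 'BA2'],
                          'status': ['on time', 'delayed']})

# Inner join: keep only flight numbers present in both frames
merged = pd.merge(df_flights, df_status, on='flight_no', how='inner')
print(merged)
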
Read more...

15 Python Snippets to Optimize your Data Science Pipeline

By Lucas Soares, Machine Learning Engineer at K1 Digital

Photo by Carlos Muza on Unsplash

Why Snippets Matter for Data Science

In my daily routine I have to deal with a lot of the same situations, from loading CSV files to visualizing data. To help streamline my process, I developed the habit of storing snippets of code that are helpful in those situations.

In this post I will share 15 snippets of code to help with different aspects of your data analysis pipeline.

1. Loading multiple files with glob and list comprehension


import glob
import pandas as pd

# Collect every CSV path in the folder, then read each into a dataframe
csv_files = glob.glob("path/to/folder/with/csvs/*.csv")
dfs = [pd.read_csv(filename) for filename in csv_files]
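
A natural follow-up (a sketch, not part of the excerpt) is stacking the loaded frames into one, assuming they share the same columns:

# Concatenate row-wise and rebuild a clean index
combined = pd.concat(dfs, ignore_index=True)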

2.

Read more...