By Matthew Przybyla, Senior Data Scientist at Favor Delivery

Photo by Bruce Hong on Unsplash [1].

Table of Contents

 

  1. Introduction
  2. Multiple Conditions
  3. Merging On Multiple, Specific Columns
  4. Summary
  5. References

Introduction

 
 
Whether you are transitioning from a data engineering or data analyst role, or want to become a more efficient data scientist, querying your dataframe is a useful way to return the specific rows you want. Note that pandas has a dedicated function for this, appropriately named query. However, I will instead discuss other ways to mimic querying, filtering, and merging your data. We will present common scenarios or questions that you would ask of your data, and rather than SQL, we will answer them with Python. In the paragraphs below, I will outline some simple ways of querying rows in a pandas dataframe.
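For context, here is a minimal sketch of how the built-in query function compares with the boolean-mask style used in the rest of this article. The sample values are made up for illustration; only the Gender and Birthdate column names come from the examples later on.

```python
import pandas as pd

# Hypothetical sample data (values invented for illustration).
df = pd.DataFrame({
    "Gender": ["M", "F", "F", "M"],
    "Birthdate": pd.to_datetime(
        ["1990-05-01", "2012-03-15", "1985-07-20", "2015-11-30"]
    ),
})

# Built-in approach: pass the condition as a string expression.
males_query = df.query("Gender == 'M'")

# Boolean-mask approach: build a True/False Series and index with it.
males_mask = df[df["Gender"] == "M"]

# Both approaches select the same rows.
print(males_query.equals(males_mask))
```

Which style to use is largely a matter of preference; the mask style keeps everything as ordinary Python expressions, which is why it is the focus here.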

Multiple Conditions

 



Sample data. Screenshot from Author [2].

As data scientists or data analysts, we want to return specific rows of data. One common scenario is applying multiple conditions in the same line of code. To illustrate, I have created some fake sample data with a first and last name for each person, along with their gender and birthdate. This data is displayed above in the screenshot.

The multiple-conditions example answers a specific question, just as you would with SQL: what percent of the rows in our data are male OR belong to a person born between 2010 and 2021?

Here is the code that will solve that question (there are a few ways to answer this question, but here is my specific way of doing it):

# A | B & C already parses as A | (B & C); the extra parentheses make that explicit
print("Percent of data who are Males OR were born between 2010 and 2021:",
      100 * round(df[(df['Gender'] == 'M') | ((df['Birthdate'] >= '2010-01-01') &
      (df['Birthdate'] <= '2021-01-01'))]['Gender'].count() / df.shape[0], 4), "%")

To better visualize this code, I have also included this screenshot of that same code from above, along with the output/result. You can also apply these conditions to return the actual rows instead of getting the fraction or percent of rows out of the total rows.



Conditions code. Screenshot by Author [3].
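To return the actual rows rather than the percent, the same conditions can be stored in a mask and reused. The sketch below assumes a small made-up dataframe with the Gender and Birthdate columns from the example; it also computes the percent with mask.mean(), which is a swapped-in alternative to the count()/shape approach above, not the original method.

```python
import pandas as pd

# Hypothetical sample data (values invented for illustration).
df = pd.DataFrame({
    "Gender": ["M", "F", "F", "M", "F"],
    "Birthdate": pd.to_datetime(
        ["1990-05-01", "2012-03-15", "1985-07-20", "2015-11-30", "1978-01-09"]
    ),
})

# Male OR born between 2010-01-01 and 2021-01-01 (inclusive).
mask = (df["Gender"] == "M") | (
    (df["Birthdate"] >= "2010-01-01") & (df["Birthdate"] <= "2021-01-01")
)

# Indexing with the mask returns the matching rows themselves.
matching_rows = df[mask]

# The mean of a boolean Series is the fraction of True values.
percent = round(100 * mask.mean(), 2)

print(matching_rows)
print(percent, "%")
```

Keeping the mask in its own variable means the condition is written once and can feed both the row selection and the percent calculation.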

Here is the order of commands we performed:

  • Return rows with Male Gender
  • Include the OR operator |
  • Return the rows with Birthdate between 2010 and 2021 (two comparisons joined with &)
  • Combine…
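The steps above can be sketched one at a time by giving each condition its own name. The variable names and sample values here are hypothetical, chosen only for illustration. The sketch also shows why grouping matters: in Python, & binds more tightly than |, so moving the parentheses changes which rows match.

```python
import pandas as pd

# Hypothetical sample data; the last row is a male born after the date range.
df = pd.DataFrame({
    "Gender": ["M", "F", "F", "M", "M"],
    "Birthdate": pd.to_datetime(
        ["1990-05-01", "2012-03-15", "1985-07-20", "2015-11-30", "2021-06-10"]
    ),
})

# Step 1: rows with male gender.
is_male = df["Gender"] == "M"

# Step 2: rows with a birthdate inside the range (both bounds inclusive).
born_after_start = df["Birthdate"] >= "2010-01-01"
born_before_end = df["Birthdate"] <= "2021-01-01"
in_range = born_after_start & born_before_end

# Step 3: combine the two with OR.
combined = is_male | in_range

# Without parentheses, & is still evaluated before |, giving the same mask.
ungrouped = is_male | born_after_start & born_before_end
print(combined.equals(ungrouped))

# Grouping the OR first asks a different question and drops the last row.
or_first = (is_male | born_after_start) & born_before_end
print(combined.sum(), or_first.sum())
```

Naming each mask is a little more verbose than the one-line version, but it makes the intended grouping obvious to the next reader.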

Continue reading: https://www.kdnuggets.com/2021/08/query-pandas-dataframe.html

Source: www.kdnuggets.com