Stop sharing personally identifiable information in your DataFrames

Photo by Markus Spiske on Unsplash

A common scenario encountered by Data Scientists is sharing data with others. But what should you do if that data contains personally identifiable information (PII) such as email addresses, customer IDs or phone numbers?

A simple solution is to remove these fields before sharing the data. However, your analysis may rely on having the PII data. For example, customer IDs in an e-commerce transactional dataset are necessary to know which customer bought which product.

Instead, you can anonymise the PII fields in your data using hashing.

Hashing is a one-way process of transforming a string of plaintext characters into a fixed-length string that is, for practical purposes, unique to that input. The hashing process has two important characteristics:

  1. It is very difficult to convert a hashed string into its original form
  2. The same plaintext string will produce the same hashed output
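Both characteristics are easy to check for yourself. The sketch below uses SHA-256 from Python's hashlib module, which is introduced later in this article. The helper name `sha256_hex` and the customer strings are made up for illustration.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of text as a hexadecimal string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical example values, not taken from any real dataset
first = sha256_hex("customer-001")
second = sha256_hex("customer-001")
other = sha256_hex("customer-002")
longer = sha256_hex("a much longer piece of plaintext than the others")

print(first == second)           # same input gives the same digest
print(first == other)            # different input gives a different digest
print(len(first), len(longer))   # digest length does not depend on input length
```

The digest length stays the same regardless of how long the input is, which is why hashed columns are always a predictable width.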

For these reasons, websites typically store a hash of your password in their database rather than the password itself.

hashlib is a module in Python’s standard library that contains many popular hash algorithms. In this tutorial, we’re going to use SHA-256, which is part of the SHA-2 (Secure Hash Algorithm 2) family of algorithms.
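If you want to see which algorithms your installation offers, a short sketch like the following may help. It only relies on `hashlib.algorithms_guaranteed` and `hashlib.sha256`; the input bytes are an arbitrary example.

```python
import hashlib

# Algorithms that hashlib guarantees on every platform
available = sorted(hashlib.algorithms_guaranteed)
print(available)

# SHA-256 produces a 256-bit digest, i.e. 32 bytes or 64 hex characters
raw_digest = hashlib.sha256(b"example").digest()
hex_digest = hashlib.sha256(b"example").hexdigest()
print(len(raw_digest), len(hex_digest))
```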

Before we can convert our string, in this example an email address, to a hashed value, we must first convert it into bytes using UTF-8 encoding:

import hashlib

# Encode our string using UTF-8 (the default)
stringToHash = ''.encode()

We can now hash it using SHA-256:

# Hash using SHA-256 and print
print('Email (SHA-256): ', hashlib.sha256(stringToHash).hexdigest())


Email (SHA-256): 36e96648c5410d00a7da7206c01237139f950bed21d8c729aae019dbe07964e7

That’s it! Our fake email address has been successfully hashed.
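To see why the encoding step matters, here is a minimal sketch of what happens when a plain string is passed to hashlib without it. The address below is a made-up placeholder, not the one used to produce the output above.

```python
import hashlib

plain = "hypothetical.user@example.com"  # made-up address for illustration

error_message = None
try:
    hashlib.sha256(plain)  # a str is rejected: hashlib needs bytes
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# Encoding first works; UTF-8 is the default for str.encode()
hashed = hashlib.sha256(plain.encode("utf-8")).hexdigest()
same_as_default = hashed == hashlib.sha256(plain.encode()).hexdigest()
print(same_as_default)
```

Spelling out `"utf-8"` is optional, but it makes the choice of encoding visible to anyone reading the code later.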

Now that we can apply hashlib to a single string, it’s fairly straightforward to scale this example to a pandas DataFrame. We’re going to use credit card customer data, available on Kaggle, which was originally made available by Analyttica TreasureHunt LEAPS.

Scenario: you need to share a list of credit card customers. You want to retain the field ‘CLIENTNUM’ as a customer can have multiple credit cards and you want to be able to uniquely identify them.

import pandas as pd

# Read only select columns using pandas
df = pd.read_csv('data/BankChurners.csv', usecols=['CLIENTNUM', 'Customer_Age', 'Gender', 'Attrition_Flag', 'Total_Trans_Amt'])
Image by author

After converting our ‘CLIENTNUM’ column to a string data…
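One way the single-string example might scale to a DataFrame column is sketched below. This is an illustration under stated assumptions rather than the exact code from the original walkthrough: the small frame stands in for the Kaggle file, its values are invented, and `sha256_hex` is a hypothetical helper name.

```python
import hashlib

import pandas as pd

# Small made-up frame standing in for the Kaggle file; values are hypothetical
df = pd.DataFrame({
    "CLIENTNUM": [1001, 1002, 1001],
    "Customer_Age": [45, 38, 45],
})


def sha256_hex(value: str) -> str:
    """Hash a single string value with SHA-256 and return the hex digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Convert the column to strings, then hash every value
df["CLIENTNUM"] = df["CLIENTNUM"].astype(str).apply(sha256_hex)
print(df)
```

Because the same input always produces the same digest, rows that belonged to the same customer still share an identifier after hashing, so grouping and joining on the column keep working.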
