How-to Setup Your Data Analytic Environment for Data Science — Part 1.3: Add Metadata Document to Your Dataset Using Apache Parquet
This tutorial builds on the previous tutorial, “Part 1.2: Label Variables in Your Dataset Using Apache Parquet,” by adding further metadata to describe both the variables and the dataset itself.
This is Part 1.3: Add Metadata Document to Your Dataset Using Apache Parquet of the How-to Setup Your Data Analytic Environment for Data Science series.
As noted before, there are thousands of datasets in your organization’s repository, and most of them have no corresponding data dictionary, or an out-of-date one, if you can even find it.
You add extensive comments to your code; your dataset deserves even more, because it is more important. It is much easier to detect and correct erroneous code than to detect erroneous data in your dataset, and correcting that data is many times more arduous. For every variable in your dataset, you need to understand:
- Data Lineage, preferably all the way back to the source system.
- Calculation logic, if it is a calculated field. For example, some numeric variables are capped and floored to treat outliers; this should be documented here.
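To make the second point concrete, here is a minimal sketch of documenting a capped-and-floored calculated field next to the calculation itself. The variable names, floor/cap values, and the metadata dictionary layout are all illustrative, not from the article:

```python
import pandas as pd

# Hypothetical numeric variable whose outliers are treated by
# capping and flooring; the calculation is recorded in a metadata
# record so it is documented alongside the data.
df = pd.DataFrame({"income": [12_000, 55_000, 250_000, 1_500_000]})

FLOOR, CAP = 20_000, 500_000
df["income_capped"] = df["income"].clip(lower=FLOOR, upper=CAP)

# A simple per-variable metadata record (field names are illustrative):
variable_metadata = {
    "income_capped": {
        "lineage": "derived from 'income' in the source extract",
        "calculation": f"income clipped to [{FLOOR}, {CAP}] to treat outliers",
    }
}

print(df["income_capped"].tolist())  # [20000, 55000, 250000, 500000]
```

Keeping the calculation description in the same structure that travels with the dataset is what prevents the "lost data dictionary" problem described above.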
The following Python packages are needed for this tutorial:
```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import platform

print('Python: ', platform.python_version())
print('pandas: ', pd.__version__)
print('pyarrow: ', pa.__version__)
```

Python: 3.7.11
We will be using freely available data from the U.S. Centers for Disease Control and Prevention, called Social Vulnerability Index 2018 — United States, tract (https://catalog.data.gov/dataset/social-vulnerability-index-2018-united-states-tract). It is a good sample dataset because it is decently sized, with over 100 variables/attributes. It is downloaded in CSV file format and is about 210 MB in size.
Since I am creating this tutorial in Google Colab, I will use Colab's Drive integration to access the CSV file, which I have uploaded to my Google Drive.
```python
from google.colab import drive

drive.mount('/content/gdrive')

social_index = pd.read_csv(
    '/content/gdrive/My Drive/Colab Notebooks/Analytic Environment/data/'
    'Social_Vulnerability_Index_2018_-_United_States__tract.csv',
    index_col=0,
)
```
Exploratory Data Analysis
There are 123 variables in this sample dataset, but for the purpose of this tutorial, we will trim…
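Trimming a wide DataFrame down to a subset of columns might look like the sketch below. The DataFrame and the list of columns kept are illustrative; the article's actual selection is not shown here:

```python
import pandas as pd

# Stand-in for a wide DataFrame; the SVI file has 123 variables.
df = pd.DataFrame({
    "FIPS": ["01001020100"],
    "LOCATION": ["Autauga County, AL"],
    "E_TOTPOP": [1912],
    "EP_POV": [10.5],
})

# Keep only the columns of interest (hypothetical choice).
keep = ["FIPS", "E_TOTPOP"]
trimmed = df[keep]

print(list(trimmed.columns))  # ['FIPS', 'E_TOTPOP']
```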
Continue reading: https://towardsdatascience.com/add-metadata-to-your-dataset-using-apache-parquet-75360d2073bd