Have you ever been given a new dataset to analyze and not known where to start? It can be overwhelming to have millions of rows of unfamiliar data dumped in your lap.
Whether you work in finance, are a data analyst, or build machine learning models, you need a deep understanding of the data you work with. Otherwise (and I’m speaking from experience here) it is very easy to, as I like to say, get lost in the sauce. If you combine and manipulate datasets before truly understanding them, you can create a mess that is more confusing than what you started with.
So, before you dive right in and try to calculate measures or get your new data into a visualization tool, here are 3 things that you should consider:
1. Field Data Type & Meaning
For a well-documented dataset, you can find both of these pieces of information in a data dictionary. However, you might not always get a data dictionary when you are pulling data, especially for smaller amounts of data (like an Excel file). So it’s important to take a look at each field in your table(s) and figure out:
- What is the current data type?
- Should you change the data type?
- What is the meaning of this field?
- Is the field named appropriately or should it be renamed?
Changing data types where necessary at the beginning of your analysis helps you move faster once you get to generating analytics and insights. Additionally, some datasets come with technical column names (such as data downloaded from some ERP systems). Unless you work with the data so often that you have memorized the five-letter titles, you will need to rename those fields using a mapping table.
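To make the type conversion and renaming steps concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the technical column names (`CUSID`, `ORDDT`, `ORDAM`), the readable names, the sample rows, and the `FIELD_MAP` mapping table are all made up. In practice you might do the same thing in a spreadsheet, SQL, or a dataframe library.

```python
from datetime import date
from decimal import Decimal

# Hypothetical raw rows, as they might arrive from a CSV or Excel export:
# every value is a string and the column names are technical codes.
raw_rows = [
    {"CUSID": "00042", "ORDDT": "2023-01-15", "ORDAM": "250.75"},
    {"CUSID": "00107", "ORDDT": "2023-02-03", "ORDAM": "1200.00"},
]

# Hypothetical mapping table: technical name -> (readable name, converter).
# The customer ID stays a string so that its leading zeros are preserved.
FIELD_MAP = {
    "CUSID": ("customer_id", str),
    "ORDDT": ("order_date", date.fromisoformat),
    "ORDAM": ("order_amount", Decimal),
}


def clean_row(row):
    """Rename each field and convert its value to the intended type."""
    cleaned = {}
    for technical_name, value in row.items():
        readable_name, convert = FIELD_MAP[technical_name]
        cleaned[readable_name] = convert(value)
    return cleaned


clean_rows = [clean_row(row) for row in raw_rows]

# Show the new name, value, and type of each field in the first record.
for name, value in clean_rows[0].items():
    print(f"{name}: {value!r} ({type(value).__name__})")
```

Keeping the mapping in one table means the renaming and the type decisions are documented in a single place, which can double as a small data dictionary when none was provided.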
Most importantly, you have to understand what each field in your dataset means. You may be able to do this by asking a subject matter expert at your company, by looking at other analyses online if it is public data, or potentially by doing some exploratory data analysis.
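As one example of what that exploratory analysis could look like, the sketch below profiles a single field: how many values are missing, how many distinct values there are, and which values are most common. The sample rows, the `status` field, and the `profile_field` helper are hypothetical, and treating empty strings as missing is an assumption that will not suit every dataset.

```python
from collections import Counter

# Hypothetical sample records; "status" is the field we want to understand.
rows = [
    {"order_id": 1, "status": "shipped"},
    {"order_id": 2, "status": "open"},
    {"order_id": 3, "status": ""},
    {"order_id": 4, "status": "shipped"},
]


def profile_field(rows, field):
    """Summarize one field: row count, missing values, distinct values, top values."""
    values = [row.get(field) for row in rows]
    # Assumption: None and empty strings both count as missing.
    present = [v for v in values if v is not None and v != ""]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_common": counts.most_common(3),
    }


print(profile_field(rows, "status"))
```

A field with only a handful of distinct values is probably a category or status code, while one with a different value on almost every row is more likely an identifier or a measurement. That is often enough of a hint to know what to ask a subject matter expert.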
2. Primary Keys
This might be obvious to some readers, but if you do not work with data all the time, it is worth stating: you need to know the primary key of your data table(s).
A primary key can be either one field or a combination of fields that uniquely identifies each record. The simplest case is a table with a record ID column. Since that column contains a different number for every record, it is a unique identifier.
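If you are unsure whether a field, or a set of fields, really is a primary key, you can test it by counting how often each key value occurs. The sketch below is one way to do that in plain Python. The sample rows, the column names, and the helpers `find_duplicate_keys` and `is_primary_key` are hypothetical and only meant to illustrate the idea.

```python
from collections import Counter

# Hypothetical order-line records used only for illustration.
rows = [
    {"record_id": 1, "order_id": "A100", "line_no": 1},
    {"record_id": 2, "order_id": "A100", "line_no": 2},
    {"record_id": 3, "order_id": "B200", "line_no": 1},
]


def find_duplicate_keys(rows, key_fields):
    """Return each key value that appears more than once, with its count."""
    key_counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return {key: n for key, n in key_counts.items() if n > 1}


def is_primary_key(rows, key_fields):
    """True if the given field(s) uniquely identify every record."""
    return not find_duplicate_keys(rows, key_fields)


print(is_primary_key(rows, ["record_id"]))            # True
print(is_primary_key(rows, ["order_id"]))             # False
print(is_primary_key(rows, ["order_id", "line_no"]))  # True
print(find_duplicate_keys(rows, ["order_id"]))        # {('A100',): 2}
```

Passing the key as a list of field names lets the same check cover both a single ID column and a key made up of several fields. Keep in mind that passing this check on a sample only shows the candidate key is unique in the data you have, not that it is guaranteed to stay unique.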
Many datasets have a primary key that is a combination…