There is a tendency, even among people who should know better, to view the data that one has access to in an organization as being of perfect quality and utility. In reality, the data that any organization collects over time can range from being highly useful to a waste of computer cycles and processing effort, and an effective part of any data strategy is understanding what is a treasure and what is, to put it simply, an eyesore.

1. Entropy

Entropy is a measure of uncertainty associated with random variables.

Example: The meteorology department wants to predict whether it is going to rain today. It has weather data collected from various devices, with attributes for wind, pressure, humidity and precipitation.

If you pick one value from the series of Humidity values, how certain can you be about whether it is going to rain? That uncertainty is the entropy associated with the Humidity random variable.

If the entropy is too high, the Humidity variable has little potential to tell whether it is going to rain. If the entropy is low, Humidity is a good variable to consider in further analysis.
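One way to make this test concrete is to compare the uncertainty about rain on its own with the uncertainty that remains once Humidity is known. The sketch below is only an illustration under stated assumptions: the readings are made up, Humidity is assumed to be binned into "high" and "low", and the helper names `entropy` and `conditional_entropy` are hypothetical rather than taken from the original article.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature, target):
    """Uncertainty about target that remains once feature is known."""
    n = len(target)
    groups = {}
    for f, t in zip(feature, target):
        groups.setdefault(f, []).append(t)
    # Weight the entropy inside each feature group by the group's share of rows.
    return sum(len(g) / n * entropy(g) for g in groups.values())

# Made-up readings for illustration only (assumption, not real weather data).
humidity = ["high", "high", "high", "high", "low", "low", "low", "low"]
rain = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]

h_rain = entropy(rain)
h_rain_given_humidity = conditional_entropy(humidity, rain)
print(f"H(rain) = {h_rain:.3f} bits")
print(f"H(rain | humidity) = {h_rain_given_humidity:.3f} bits")
```

The closer the second number is to zero, the more a Humidity reading tells you about rain. If it stays close to the first number, Humidity carries little information for this problem.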

2. Outliers

Outliers are a measure of unusualness associated with a random variable.

Though Humidity has good potential to solve the problem, not all of its values may be useful in the calculation. Create a boxplot and determine the number of outliers.

If a large percentage of values lie outside the whiskers of the boxplot, the final outcome will be less accurate. In that case, we need to discard the Humidity variable, take another variable, and start again with the entropy test.
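The boxplot check can also be done numerically. The sketch below assumes the common convention that whiskers extend 1.5 times the interquartile range beyond the quartiles. The readings are made up and `outlier_fraction` is a hypothetical helper name. Different plotting tools compute quartiles slightly differently, so the exact cutoffs may vary.

```python
import statistics

def outlier_fraction(values, k=1.5):
    """Share of values beyond k * IQR from the quartiles (boxplot whisker rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < low or v > high]
    return len(outliers) / len(values)

# Made-up Humidity readings for illustration only; 99 is the unusual value.
humidity = [62, 64, 65, 66, 67, 68, 70, 71, 72, 99]

frac = outlier_fraction(humidity)
print(f"{frac:.0%} of Humidity values are outliers")
```

What share counts as too large is a judgment call that depends on the problem. The article does not give a threshold.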

3. Covariance

Covariance is a measure of the relationship between two variables: how variable X changes when variable Y changes. X and Y may have different units of measurement.

For example, if Humidity decreases as Wind increases, then there is a relationship between Humidity and Wind. This relationship adds value in solving the problem.

The count we need to measure is the number of variables that have covariance with at least one other variable. The higher this count, the more evidence we can derive towards the final outcome.
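The count described above can be sketched as follows. Because raw covariance depends on the units of each variable, this sketch swaps in the Pearson correlation (covariance divided by the two standard deviations) to decide what counts as "strong". That substitution, the 0.7 threshold, the made-up readings, and the helper names are all assumptions for illustration, not part of the original method.

```python
import math
from itertools import combinations

def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Pearson correlation: covariance rescaled to the range [-1, 1]."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

def count_related(data, threshold=0.7):
    """Number of variables strongly related to at least one other variable."""
    related = set()
    for a, b in combinations(data, 2):
        if abs(correlation(data[a], data[b])) >= threshold:
            related.update((a, b))
    return len(related)

# Made-up readings for illustration only (assumption, not real weather data).
data = {
    "humidity": [80, 75, 70, 65, 60],
    "wind": [5, 10, 15, 20, 25],
    "pressure": [1012, 1009, 1013, 1008, 1011],
}

print(covariance(data["humidity"], data["wind"]))  # negative: Humidity falls as Wind rises
print(count_related(data), "of", len(data), "variables are strongly related to another")
```

In this toy data, Humidity and Wind move together strongly while Pressure does not, so two of the three variables are counted.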

Good Dataset:

A large number of variables that have strong covariance with a few or more other variables.

Bad Dataset:

  • A small number of variables that have strong covariance with a few other variables.
  • A large number of variables that have weak covariance with many other variables.

A possible outcome of this assessment could look like…

Continue reading: http://www.datasciencecentral.com/xn/detail/6448529:BlogPost:1066337

Source: www.datasciencecentral.com