Some Machine Learning techniques for data quality


“Garbage in, garbage out”: in the data world we often hear this phrase, which means that if your data is “bad”, you can never make “good” decisions (bet you didn’t see this one coming :P).

The journey from “bad” to “good” is what Data Quality is about. Bad data can mean a lot of things, such as:

  • Data is not up to date, Timeliness
  • Data is not accurate, Accuracy
  • Data has different values for different users or there is no single source of truth, Consistency
  • Data is not accessible, Usability
  • Data is not available, Availability

This paper nicely defines the various dimensions of data quality; please read it to find out more.

Data quality is important in every kind of job, but for a data engineer it is a primary responsibility: when we deliver data, it must be “good” data.

My experience:

To ensure data quality, I have implemented rule-based solutions to take care of:

  • Bad schema
  • Duplicate data
  • Late data
  • Anomalous data

These solutions mainly revolved around having a clear understanding of what kind of data I was going to feed the system and, in turn, generalizing that understanding across the whole data pipeline framework.
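To make the four rules above concrete, here is a minimal sketch of what such a rule-based check might look like. It is not the framework I actually built: the schema, the field names (`id`, `amount`, `event_time`), the one-hour lateness limit and the allowed amount range are all hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical expected schema: field name -> required Python type.
EXPECTED_SCHEMA = {"id": int, "amount": float, "event_time": datetime}


def check_records(records, now, max_delay=timedelta(hours=1),
                  amount_range=(0.0, 10_000.0)):
    """Return (index, issue) pairs found by four hand-written rules."""
    issues = []
    seen_ids = set()
    for i, rec in enumerate(records):
        # Rule 1: bad schema (a field is missing or has the wrong type).
        bad_fields = [
            name for name, typ in EXPECTED_SCHEMA.items()
            if not isinstance(rec.get(name), typ)
        ]
        if bad_fields:
            issues.append((i, "bad_schema"))
            continue  # the remaining rules assume a valid schema

        # Rule 2: duplicate data (same id seen earlier in the batch).
        if rec["id"] in seen_ids:
            issues.append((i, "duplicate"))
        seen_ids.add(rec["id"])

        # Rule 3: late data (event is older than the allowed delay).
        if now - rec["event_time"] > max_delay:
            issues.append((i, "late"))

        # Rule 4: anomalous data (value outside a fixed, hand-picked range).
        low, high = amount_range
        if not low <= rec["amount"] <= high:
            issues.append((i, "anomalous"))
    return issues
```

Notice that every threshold here is something a human had to decide up front, which is exactly what makes this approach hard to scale.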

Though such an automated system helps us move from a reactive approach to a proactive one, a rule-based system has two problems:

  • It can need too many rules for high-cardinality, multi-dimensional data.
  • Every new error or anomaly needs some custom implementation in the Data Quality Framework, i.e. human intervention is inevitable in such a solution.

To overcome the need for human intervention in the rule-based scenario, we need to look for a fully automated system. With its many recent developments, ML is one of the domains that might help achieve that.

Let’s see how machines can help us ensure automated data quality and look beyond the obvious.

Before discussing how, let’s discuss why.

Why machine learning for Data Quality?

  • ML models can learn from tremendous amounts of data and can find hidden patterns in it.
  • Can take care of repetitive tasks
  • No need to maintain rules
  • Can evolve as the data evolves

I would also like to point out that, though the above list looks like an election banner for ML as a candidate, whether to use it depends on the use case. ML also generally doesn’t work well with small datasets or with datasets that don’t exhibit any pattern.
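As a rough illustration of the "no rules to maintain" and "evolves with the data" points, here is a small sketch in which the acceptable range is learned from historical values instead of being hard-coded. To be clear, this is a plain z-score check, about the simplest statistical stand-in for a learned model, not a real ML application; the class name `LearnedRangeCheck` and the defaults (`z_threshold=3.0`, `min_samples=30`) are my own assumptions for this example.

```python
from statistics import mean, stdev


class LearnedRangeCheck:
    """Flags values that are far from what was seen in historical data."""

    def __init__(self, z_threshold=3.0, min_samples=30):
        self.z_threshold = z_threshold
        self.min_samples = min_samples
        self.mu = None
        self.sigma = None

    def fit(self, history):
        # Too little data gives unreliable statistics, so refuse to learn.
        if len(history) < self.min_samples:
            raise ValueError("not enough history to learn from")
        self.mu = mean(history)
        self.sigma = stdev(history)
        return self

    def is_anomalous(self, value):
        if self.mu is None:
            raise RuntimeError("call fit() before is_anomalous()")
        if self.sigma == 0:
            # All historical values were identical.
            return value != self.mu
        z_score = abs(value - self.mu) / self.sigma
        return z_score > self.z_threshold
```

Calling `fit()` again on more recent history moves the threshold with the data, with no rule to edit by hand. The `min_samples` guard reflects the caveat above: with a small dataset there is simply not enough to learn from.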

Having said that, let’s look at some of the ML applications…
