Deduplication of streaming data in Kafka ksqlDB

Or how to find distinct events in a streaming pipeline

As stated by its creators, “ksqlDB is a database that is purpose-built for stream processing applications”. When it first came into play, it changed the world of streaming data processing. Built on Kafka Streams, it lets you create streaming applications with plain old SQL, from simple data analytics up to complete ETL flows.

That’s it for the introduction. If you’re here, you’re probably already familiar with ksqlDB, so let’s discuss a more advanced technique: deduplication, or distinct selection if you prefer. Imagine the source system you’re consuming data from produces duplicates and your goal is to get rid of them. In ksqlDB you operate continually over unbounded streams of events, so removing repeated data is not as straightforward as it would be with the DISTINCT keyword in a relational database. In this article I will describe one concept for solving this issue.

Algorithm and implementation

Take a minute to look at the diagram below. It schematically shows how data flows from the source stream to the target stream while being transformed to remove duplicate messages. The flow consists of several logical steps, which we will discuss later in this article.

So let’s dive into the concept. We start from the stream TEST_RAW, which has duplicate messages. We create it with a simple structure, as below.
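The original column list is not included in this excerpt, so here is a minimal sketch of what TEST_RAW could look like; the column names and the topic name test_raw are illustrative assumptions rather than the article’s actual definition.

-- Hypothetical source stream containing duplicate events.
-- Column names and topic name are illustrative assumptions.
CREATE STREAM TEST_RAW (
    ID VARCHAR KEY,    -- primary key we will deduplicate on
    NAME VARCHAR       -- example payload column
) WITH (
    KAFKA_TOPIC = 'test_raw',
    VALUE_FORMAT = 'JSON',
    PARTITIONS = 1
);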

Step 1

In the first step, create a ksqlDB table TEST_WINDOW_DISTINCT_TABLE, grouping by the primary key within a tumbling window, with a 3-minute duration as an example. This query produces only distinct events within a specific time frame, which means that only rows with a unique ID will appear in each window.
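The exact query is not shown in the excerpt, so the following is only a sketch that matches the assumed TEST_RAW structure above; the aggregate columns and the target topic name test_window_distinct are assumptions.

-- One aggregated row per ID per 3-minute tumbling window (sketch).
CREATE TABLE TEST_WINDOW_DISTINCT_TABLE
WITH (KAFKA_TOPIC = 'test_window_distinct') AS
SELECT
    ID,
    EARLIEST_BY_OFFSET(NAME) AS NAME,  -- keep the payload of the first event seen for the key
    COUNT(*) AS MSG_COUNT              -- how many times the key arrived within the window
FROM TEST_RAW
WINDOW TUMBLING (SIZE 3 MINUTES)
GROUP BY ID
EMIT CHANGES;

In this sketch, COUNT(*) is included so that downstream queries can tell the first arrival of a key apart from its later duplicates within the same window.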

You may ask, “Why do we need this window at all? Why not do a global GROUP BY, which would basically give us the global distinct values we are after?”

Well, that would work, but it is not the optimal solution. Done that way, the table would have to keep every row it has ever seen. In a real use case with multiple tables, many columns and millions of rows, performance would degrade, not to mention the consumed disk space. Instead, we keep only the chunk of data that arrived within a small time period.

Next, make a stream TEST_WINDOW_DISTINCT based on the same topic to be able to process your data…
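The excerpt cuts off here, so what follows is only a hedged sketch of how such a stream could be declared over the table’s changelog topic; the WINDOW_TYPE and WINDOW_SIZE properties and the column list mirror the assumptions made in the sketches above.

-- Stream over the windowed changelog topic written by TEST_WINDOW_DISTINCT_TABLE (sketch).
CREATE STREAM TEST_WINDOW_DISTINCT (
    ID VARCHAR KEY,
    NAME VARCHAR,
    MSG_COUNT BIGINT
) WITH (
    KAFKA_TOPIC = 'test_window_distinct',
    VALUE_FORMAT = 'JSON',
    WINDOW_TYPE = 'TUMBLING',   -- the key is windowed because the topic is a windowed table changelog
    WINDOW_SIZE = '3 MINUTES'
);

From here, one common way to finish such a pipeline (not shown in this excerpt) is to filter this stream on the first occurrence per window, for example WHERE MSG_COUNT = 1, and write the result into the target deduplicated stream.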

Continue reading: https://towardsdatascience.com/deduplication-of-streaming-data-in-kafka-ksqldb-90e5cadd83e9
