A Dockerized tutorial with everything you need to turn a .csv file of timestamped data into a Kafka stream
With more and more data science work moving toward real-time pipelines, data scientists need to learn to write streaming analytics. And while some great, user-friendly streaming data pipeline tools exist (my obvious favorite being Apache Kafka), it’s hard to develop code for a streaming analytic without a friendly dev environment that actually produces a data stream you can test your analytics on.
This post will walk through deploying a simple Python-based Kafka producer that reads from a .csv file of timestamped data, turns the data into a real-time (or, really, “back-in-time”) Kafka stream, and allows you to write your own consumer for applying functions/transformations/machine learning models/whatever you want to the data stream.
All materials are available in my GitHub time-series-kafka-demo repo. To follow along, clone the repo to your local environment. You can run the example with only Docker and Docker Compose on your system.
The repo has a few different components:
Clone the repo and cd into the directory.

git clone https://github.com/mtpatter/time-series-kafka-demo.git
cd time-series-kafka-demo
Start the Kafka broker and Zookeeper
The Compose file pulls Docker images for Kafka and Zookeeper version 6.2.0 from Confluent’s Docker Hub repository. (Gotta pin your versions!)
docker compose up
This starts both Kafka and Zookeeper on the same Docker network for talking to each other. The Kafka broker will be accessible on port 9092 locally, since the Compose file binds the local port to the internal image port.
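For reference, a minimal Compose file along these lines would wire the two services together; the service names and environment variables below are illustrative assumptions on my part, not necessarily the repo’s exact file:

```yaml
version: '3.7'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:6.2.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:6.2.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"   # bind the broker's port to localhost
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
```

Because both services sit on the default Compose network, the broker can reach Zookeeper by its service name, while clients on your machine connect to localhost:9092.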
Build a Docker image (optional, for the producer and consumer)
If you’d rather not install the Python modules in the requirements.txt file, you can use a Docker image for the producer and consumer scripts.
From the main root directory:
docker build -t "kafkacsv" .
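The Dockerfile behind this build can be quite small. A sketch like the following would do the job; the base image and layout here are my assumptions, not necessarily what the repo ships:

```dockerfile
FROM python:3.9-slim

WORKDIR /usr/src/app

# Install dependencies first so this layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the producer/consumer scripts
COPY bin/ ./bin/
```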
This command should now work:
docker run -it --rm kafkacsv python bin/sendStream.py -h
Start a consumer
We’ll start a consumer first, to print all messages in mock “real time” from the stream “my-stream”. We start the consumer before the producer because the producer reproduces all the “pauses” in time between each of the timestamped data points. If you start the consumer after the producer, the consumer will instead process all the messages already in the queue immediately, with no pauses.
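To make the pause-reproduction concrete, here is a rough sketch of the logic a producer like sendStream.py might use: compute the gap between consecutive timestamps and sleep for that long before sending each message. The column name, the shape of `send`, and the stdlib-only timestamp parsing are my assumptions for illustration, not necessarily the repo’s actual implementation.

```python
import csv
import json
import time
from datetime import datetime

def wait_times(timestamps):
    """Seconds to pause before each message: 0 for the first,
    then the gap between consecutive timestamps."""
    times = [datetime.fromisoformat(t) for t in timestamps]
    return [0.0] + [(b - a).total_seconds() for a, b in zip(times, times[1:])]

def stream_csv(path, send):
    """Replay timestamped CSV rows with their original pauses.
    `send` stands in for, e.g., a Kafka producer publishing to 'my-stream'."""
    with open(path) as f:
        rows = list(csv.DictReader(f))
    pauses = wait_times([r["timestamp"] for r in rows])
    for pause, row in zip(pauses, rows):
        time.sleep(pause)      # reproduce the real-time gap between data points
        send(json.dumps(row))  # publish the row as a JSON message
```

With this design, a one-hour CSV replays over one hour, which is exactly why the consumer should already be listening when the producer starts.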