First of all, I’m talking about a solid and scalable Airflow infrastructure.
If you need a small Airflow deployment with up to 50 DAGs and up to 2 developers building DAGs, you don’t need to care about what I will say: just deploy your Airflow with docker-compose, EC2, or locally using Celery and you will be happy.
These are my golden rules for deploying Airflow in production:
- Deploy Airflow using Helm (see the sketch after this list).
- Use an external database. Don’t use the bundled PostgreSQL container.
- Deploy on a managed Kubernetes service in the cloud.
- Use the Kubernetes Executor.
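For the first rule, here is a minimal deployment sketch, assuming the official Apache Airflow Helm chart; the release name, namespace, and database host are illustrative, not prescriptive:

```bash
# A minimal sketch, assuming the official Apache Airflow Helm chart;
# release name, namespace, and DB host are illustrative.
helm repo add apache-airflow https://airflow.apache.org
helm repo update
helm install airflow apache-airflow/airflow \
  --namespace airflow --create-namespace \
  --set executor=KubernetesExecutor \
  --set postgresql.enabled=false \
  --set data.metadataConnection.host=my-external-db.example.com
```

Note that the last two flags cover the second rule too: the chart’s bundled PostgreSQL is disabled and the metadata database points at an external host.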
In this article, I will focus on this last one.
These are the problems I want solutions for:
1 — Control over resources (memory, CPU) on the Kubernetes cluster.
2 — Many developers working on different projects, with different knowledge levels, so they use different operators and different ways to build data pipelines.
3 — Idempotent and consistent DAGs.
Let’s solve these problems:
What do we generally pay for in the cloud? RESOURCES (MEMORY, CPU).
So, when someone creates a DAG with many tasks that use, for example, PythonOperator to build a data frame, do you know how much memory and CPU this DAG is using? With Celery this is hard to know, because many tasks run in the same worker. Each worker is a pod in Kubernetes, so you need to set the worker resources up front, and that is a problem, because multiple tasks have different resource needs.
So it is better to work with the Kubernetes Executor and REQUIRE that whoever writes a DAG sets the resources. This way, each task runs as its own pod on Kubernetes with its own requested resources.
Look at this simple sample (you can find more among Airflow’s default example DAGs):
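A minimal sketch of what such a DAG can look like, assuming Airflow 2.x with the Kubernetes Executor; the DAG id, callable, and resource values are illustrative:

```python
# A minimal sketch, assuming Airflow 2.x with the Kubernetes Executor;
# the DAG id, callable, and resource values are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from kubernetes.client import models as k8s

# With the Kubernetes Executor, executor_config lets each task declare
# the requests/limits of the pod it will run in.
build_df_resources = {
    "pod_override": k8s.V1Pod(
        spec=k8s.V1PodSpec(
            containers=[
                k8s.V1Container(
                    name="base",  # must be "base" so Airflow merges it into the task container
                    resources=k8s.V1ResourceRequirements(
                        requests={"cpu": "500m", "memory": "512Mi"},
                        limits={"cpu": "1", "memory": "1Gi"},
                    ),
                )
            ]
        )
    )
}

def build_dataframe():
    print("building the data frame...")  # placeholder transformation

with DAG(
    dag_id="resource_aware_sample",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="build_dataframe",
        python_callable=build_dataframe,
        executor_config=build_df_resources,
    )
```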
Cool, so you ask me: how can I force people to set the resources? Using your power of persuasion, of course. No, that can work, but you can also add a unit test to your CI pipeline to do this validation. It is easier than the first option =).
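A sketch of that CI validation, assuming pytest and that your DAGs live in a dags/ folder; the path and test name are illustrative:

```python
# A minimal sketch of a CI validation, assuming pytest and that DAGs
# live in "dags/"; the folder path and test name are illustrative.
from airflow.models import DagBag

def test_every_task_declares_resources():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert not dag_bag.import_errors, f"broken DAGs: {dag_bag.import_errors}"

    for dag_id, dag in dag_bag.dags.items():
        for task in dag.tasks:
            config = task.executor_config or {}
            # Fail the pipeline if a task does not declare its pod resources.
            assert "pod_override" in config, (
                f"{dag_id}.{task.task_id} must set executor_config "
                "with a pod_override that declares resources"
            )
```

Run this with pytest in the pipeline, and any DAG merged without resource requests fails the build before it ever reaches production.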
Now, someone will say to me: in my company we use Airflow just for orchestration. We use Celery, we only use KubernetesPodOperator, and we don’t do transformations in DAGs. Then, when we need to use other operators, first we measure their resource consumption, so we know they are safe to use because the tasks use few resources, and we can control the number of tasks on the same worker with a parallelism limit.
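Even in that setup you can pin resources per task. A minimal sketch using KubernetesPodOperator, assuming a recent cncf.kubernetes provider (where container_resources replaced the older resources argument); the image, namespace, and ids are illustrative:

```python
# A minimal sketch, assuming a recent cncf.kubernetes provider;
# image, namespace, and ids are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)
from kubernetes.client import models as k8s

with DAG(
    dag_id="celery_with_pod_operator",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    transform = KubernetesPodOperator(
        task_id="transform",
        name="transform",
        namespace="airflow",
        image="my-registry/transform-job:latest",  # hypothetical image
        # Requests/limits for the launched pod, independent of the
        # Celery worker that merely orchestrates it.
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "500m", "memory": "512Mi"},
            limits={"cpu": "1", "memory": "1Gi"},
        ),
    )
```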
PLUS: the Airflow Kubernetes Executor is more efficiently scalable than Celery even…
Continue reading: https://towardsdatascience.com/why-we-must-choose-kubernetes-executor-for-airflow-28176062a91b?source=rss----7f60cf5620c9---4