Prefect is both a minimal and complete workflow management tool. It’s unbelievably simple to set up. Yet it can do everything tools such as Airflow can and more.
You can use PyPI, Conda, or Pipenv to install it, and it’s ready to rock. More on this in comparison with the Airflow section.
pip install prefect
# conda install -c conda-forge prefect
# pipenv install --pre prefect
Before we dive into use Prefect, let’s first see an unmanaged workflow. It makes understanding the role of Prefect in workflow management easy.
The below script queries an API (Extract — E), picks the relevant fields from it (Transform — T), and appends them to a file (Load — L). It contains three functions that perform each of the tasks mentioned. It’s a straightforward yet everyday use case of workflow management tools — ETL.
This script downloads weather data from the OpenWeatherMap API and stores the windspeed value in a file. ETL applications in real life could be complex. But this example application covers the fundamental aspects very well.
Note: Please replace the API key with a real one. You can get one from https://openweathermap.org/api.
You can run this script with the command
python app.pywhere app.py is the name of your script file. This will create a new file called windspeed.txt in the current directory with one value. It’s the windspeed at Boston, MA, at the time you reach the API. If you rerun the script, it’ll append another value to the same file.
Your first Prefect ETL workflow.
The above script works well. Yet, it lacks some critical features of a complete ETL, such as retrying and scheduling. Also, as mentioned earlier, a real-life ETL may have hundreds of tasks in a single workflow. Some of them can be run in parallel, whereas some depend on one or more other tasks.
Imagine if there is a temporary network issue that prevents you from calling the API. The script would fail immediately with no further attempt. In live applications, such downtimes aren’t a miracle. They happen for several reasons — server downtime, network downtime, server query limit exceeds.
Also, you have to manually execute the above script every time to update your windspeed.txt file. Yet, scheduling the workflow to run at a specific time in a predefined interval is common in ETL workflows.
This is where tools such as Prefect and Airflow come to the rescue. Here’s how you could tweak the above code to make it a Prefect workflow.
Continue reading: https://towardsdatascience.com/the-prefect-way-to-automate-orchestrate-data-pipelines-d4465638bac2?source=rss—-7f60cf5620c9—4