I did not plan for it; it just rhymes. In this entry we will be looking at:
- Data storage and versioning with Delta tables.
- A structured, standard way of running and registering experiments, and saving models after each training iteration, with MLflow Tracking (see the short sketch after this list).
- Registering models and managing lifecycle stages with MLflow Model Registry.
- Deploying models to production with MLflow Model Serving.
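Before diving in, here is a minimal, hedged sketch of what the MLflow Tracking part of that workflow looks like in Python. The experiment name, parameter, and metric are made up for illustration:

```python
import mlflow

# Hypothetical experiment name; on Databricks this is a workspace path.
mlflow.set_experiment("/Shared/demo-experiment")

with mlflow.start_run():
    # Log whatever hyperparameters and metrics your training loop produces.
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("rmse", 0.42)
```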
Just know that this article will not cover every feature these tools have to offer. I am only giving an introduction so you can explore the documentation of each tool covered here a little further. I will leave a link at the end of each section and in each item in the list above.
This step is optional. We are going to create a Databricks community edition workspace since it includes, by default, all the tools that will be covered.
First, head to this link. You will need to fill in the necessary information to create a Databricks account and use the community edition workspace. After that, you will log in to Databricks, which will lead you to this screen:
Next, you will create your compute instance. For that, head to the left sidebar and select Compute. Click on Create Cluster, then select the specifications as seen in the following picture.
Click on Create Cluster again, and wait for the cluster to start. That’s it, we are all set up for the interesting part.
As a side note, you can also install these tools in your own environment, but it requires more effort. Read about installing MLflow and Delta Lake in their respective documentation.
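If you do go the local route, a minimal sketch of a Delta-enabled Spark session might look like the following, assuming you have installed the pyspark, delta-spark, and mlflow packages with pip:

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Configure a local Spark session with the Delta Lake extensions enabled.
builder = (
    SparkSession.builder.appName("local-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```

On Databricks itself none of this is needed; the cluster you created already ships with Delta and MLflow preinstalled.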
I can’t speak in general, since I am still a junior, but from what I have seen, people are very comfortable working with CSV files, myself included. They are portable, easy to use, and easy to create. However, as data scientists working in the cloud, where we are charged for storage and compute, this is not the file format we want to use.
CSVs are computationally expensive to work with because they are row-oriented: you need to load the whole file even if you don’t want to use all the columns. They are not highly compressed like Parquet, and are therefore expensive to store. They are also easily corrupted, among other drawbacks.
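To make the row-oriented point concrete, here is a minimal sketch of how a columnar format lets Spark read only the columns you ask for. The path and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical dataset path, just for illustration.
df = spark.read.parquet("/tmp/events.parquet")

# Column pruning: with Parquet, only these two columns are read from disk.
# With a CSV, Spark would have to scan every row in full for the same result.
df.select("user_id", "amount").show()
```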
That being said, it is better to implement a data lake, with Delta in our case. Delta tables are built on top of the Parquet file format. They are fully specified, highly compressed, and optimized for large volumes of data. In…
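As a hedged preview of what Delta adds on top of Parquet, here is a minimal sketch of writing a DataFrame as a Delta table and reading back an earlier version with time travel. The path is hypothetical; on Databricks the Delta format is available out of the box:

```python
# Write the DataFrame as a Delta table (hypothetical path).
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Every write creates a new table version; read the first one back ("time travel").
old = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")
```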