Before starting on any project, you will want Ubuntu to be able to reach your files. When you first open Ubuntu, type ls /mnt/ to see which drives are mounted; assuming your data is on the C drive, the command is cd /mnt/c/<folder>. From there, you can navigate to the folder where your data is stored. For more information, click on this YouTube video.
Now that Ubuntu can see your drive or folder, you can proceed with setting up a virtual environment. If virtualenv is not installed yet, you can pip install it.
mkdir test-venv ## this creates a directory called test-venv
virtualenv test-venv ## install a new virtual environment inside the directory
source ./test-venv/bin/activate ## activate your new environment
If you are seeing something like this, feel free to leave the window open; we will come back to it in step 7.
Think about how you would like your data to flow through your projects and ultimately form insights. A simple, typical machine learning pipeline will look like the image below.
In this example, I will stick to the standard process of pipelining a machine learning model.
Most databases are relational and stored in the .db file format, so you can extract the data using SQL. I have included a link showing how to extract different types of data sets. The extraction function outputs a dataframe, which is then fed into the data preparation python file.
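As a sketch of that extraction step (the file name, table, and query below are hypothetical), a .db file can be read straight into a dataframe with sqlite3 and pandas:

```python
import sqlite3

import pandas as pd


def extract(db_path: str, query: str) -> pd.DataFrame:
    """Run a SQL query against a .db file and return the result as a dataframe."""
    with sqlite3.connect(db_path) as conn:
        return pd.read_sql_query(query, conn)


# Hypothetical usage:
# df = extract("housing.db", "SELECT * FROM sales")
```

pd.read_sql_query handles the cursor and column names for you, so the function stays a one-liner regardless of the query.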
This python file contains the feature engineering, encoding, and train/test splitting steps.
The function defined in this file outputs four dataframes: x_train, x_test, y_train, y_test. These four dataframes are fed into model training.
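A minimal sketch of that preparation step might look like the following (the "price" target column, test size, and random seed are assumptions, not the article's actual values):

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def prepare(df: pd.DataFrame, target: str = "price"):
    """One-hot encode the features and return x_train, x_test, y_train, y_test."""
    y = df[target]
    # get_dummies converts categorical columns into numerical indicator columns
    x = pd.get_dummies(df.drop(columns=[target]))
    return train_test_split(x, y, test_size=0.2, random_state=42)
```

Fixing random_state keeps the split reproducible, so every model is evaluated on the same hold-out rows.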
In this example, I used target encoders for the ordinal features and converted the remaining categorical features to numerical features with get_dummies. However, I suggest you try different encoders and see which works best.
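Target encoding replaces each category with the mean of the target for that category. If you do not want to pull in a dedicated encoder library, a bare-bones version (fit on the training split only, to avoid leakage) is just a groupby; the column names here are hypothetical:

```python
import pandas as pd


def target_encode(train: pd.DataFrame, col: str, target: str) -> pd.Series:
    """Map each category in `col` to the mean of `target` over the training data."""
    means = train.groupby(col)[target].mean()
    # Categories missing from the mapping fall back to the global target mean
    return train[col].map(means).fillna(train[target].mean())
```

Libraries such as category_encoders add smoothing on top of this idea, which matters for rare categories.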
This is where the modelling happens. I have added hyperparameter tuning in this section, although you can choose to split the tuning out into a separate section.
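The article's models (XGBoost, LGBM) need their own packages; purely to illustrate the tuning pattern, here is the same idea with scikit-learn's GridSearchCV and a gradient-boosting regressor as a stand-in (the parameter grid is an assumption):

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV


def tune(x_train, y_train):
    """Grid-search a small hyperparameter grid and return the best fitted model."""
    grid = {
        "n_estimators": [50, 100],
        "max_depth": [2, 3],
    }
    search = GridSearchCV(GradientBoostingRegressor(random_state=0), grid,
                          cv=3, scoring="neg_mean_absolute_error")
    search.fit(x_train, y_train)
    return search.best_estimator_
```

Scoring on negative MAE keeps the tuning objective aligned with the comparison metric used later in the pipeline.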
The function outputs the mean absolute error (MAE) of XGBoost, a vanilla linear regression, and LGBM.
The results from the modelling are passed back to the main run.py script, which concatenates them and writes them to a CSV file for easy comparison. Thereafter, it's up to you to decide which model to use.
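That comparison step in run.py can be sketched as follows (the model names and error values in the usage line are placeholders, not results from the article):

```python
import pandas as pd


def write_comparison(results: dict, path: str = "results.csv") -> pd.DataFrame:
    """Collect per-model MAE scores into one dataframe, sort it, and save a CSV."""
    table = pd.DataFrame(
        {"model": list(results), "mae": list(results.values())}
    ).sort_values("mae")
    table.to_csv(path, index=False)
    return table


# Hypothetical usage:
# write_comparison({"xgboost": 3.2, "linear": 4.1, "lgbm": 2.9})
```

Sorting by MAE puts the best-performing model in the first row of the CSV, which makes the final comparison a one-glance check.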
While LGBM produces the best…
Continue reading: https://towardsdatascience.com/how-to-machine-leaning-pipeline-beginner-2a736595cbfd?source=rss—-7f60cf5620c9—4