I can easily launch JupyterLab and use the CLI tool glances to monitor resource usage in real time. This tells me whether there are any bottlenecks, such as high CPU utilization during preprocessing, or low GPU utilization during training, which suggests increasing the training batch size.

Setting up Kaggle CLI & Downloading the dataset

To get the Plant Pathology 2021 - FGVC8 dataset into our session, we use the Kaggle CLI, which is provided directly by Kaggle and available on PyPI.

pip install kaggle

For any use of the Kaggle CLI, you need to set up your credentials: log in to the Kaggle website and create your personal API key, which is automatically downloaded as kaggle.json.

Step-by-step how to download the credentials file from a personal Kaggle account: (1) select account, (2) scroll down and click “Create New API Token”.

Once we have our kaggle.json key, we can upload it to our session using JupyterLab (simple drag-and-drop) and move it to the correct destination for the Kaggle CLI as follows:

Step-by-step how to set up the Kaggle CLI: (1) install the kaggle package, (2) drag-and-drop the file with credentials and (3) move it to the expected system folder.

mkdir -p /home/jovyan/.kaggle
mv kaggle.json /home/jovyan/.kaggle/kaggle.json
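
If you prefer to do this step from a notebook cell, the move can be sketched in Python. This is only a minimal sketch: the helper name install_kaggle_token is hypothetical, the default folder ~/.kaggle is assumed to match the path used above, and restricting the file permissions is an extra precaution that is not part of the original steps.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_token(token_path, config_dir=None):
    """Move an uploaded kaggle.json into the Kaggle CLI config folder."""
    # Assumed default location: ~/.kaggle (e.g. /home/jovyan/.kaggle)
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    # Create the folder if it does not exist yet
    config_dir.mkdir(parents=True, exist_ok=True)
    target = config_dir / "kaggle.json"
    shutil.move(str(token_path), str(target))
    # The file holds a secret key, so make it readable by the owner only
    os.chmod(target, 0o600)
    return target
```

Calling install_kaggle_token("kaggle.json") from the folder where the file was dropped should have the same effect as the mv command.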

Now we are set to download the dataset to our session. The competition name to pass to the CLI is the same as the name in the competition URL. The exact download command is also provided in the Data section of each competition. Most datasets are distributed as a compressed archive, which we need to unzip to a destination folder (I usually name the folder the same as the competition). So for our competition, we call the following commands:

# download dataset via Kaggle CLI
kaggle competitions download plant-pathology-2021-fgvc8
# unzip the dataset to a folder
unzip plant-pathology-2021-fgvc8.zip -d plant-pathology
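
The same two steps can also be scripted, for example to repeat them for several competitions. The following is a sketch under assumptions: the function names download_competition and extract_archive are hypothetical, the kaggle command is called exactly as above through subprocess, and the archive is assumed to land in the current folder as <competition>.zip.

```python
import subprocess
import zipfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unzip an archive into dest_dir and return the number of entries."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest)
        return len(archive.namelist())


def download_competition(name, dest_dir):
    """Download a competition archive with the Kaggle CLI, then extract it."""
    # Requires the kaggle package and the credentials file to be in place
    subprocess.run(["kaggle", "competitions", "download", name], check=True)
    return extract_archive(f"{name}.zip", dest_dir)
```

For our competition, this would be called as download_competition("plant-pathology-2021-fgvc8", "plant-pathology").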

Dataset pre-processing

Once extracted, we can pre-process the dataset. The original dataset is distributed in 4K resolution, which is far too large for most applications, so we downscale the images to about 640px. The reasoning behind this observation is described in how to place on the leaderboards. For the resizing, we use a simple CLI tool, ImageMagick.

apt install imagemagick
# apply to all JPEG images in given folder
mogrify -resize 640 plant-pathology/train_images/*.jpg

Alternatively, you can write your own small script that uses Python multiprocessing to speed up this conversion…

Simple script for multiprocessing image scaling.
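
The original script is not reproduced here, so what follows is only a minimal sketch of how such a script could look, not the author's version. It assumes Pillow is installed (pip install pillow) and, like the mogrify call above, resizes every JPEG in place to a width of 640px while keeping the aspect ratio. The function names scaled_size, resize_image and resize_folder are hypothetical.

```python
from functools import partial
from multiprocessing import Pool
from pathlib import Path


def scaled_size(width, height, target_width=640):
    """Return (width, height) scaled to target_width, keeping aspect ratio."""
    ratio = target_width / width
    return target_width, max(1, round(height * ratio))


def resize_image(path, target_width=640):
    """Resize a single image in place, overwriting the original file."""
    # Pillow (assumed installed) is imported here, so each worker process
    # loads it on first use
    from PIL import Image

    with Image.open(path) as img:
        resized = img.resize(scaled_size(img.width, img.height, target_width))
    resized.save(path)
    return path


def resize_folder(folder, target_width=640, workers=None):
    """Resize all JPEG images in a folder using a pool of processes."""
    paths = sorted(Path(folder).glob("*.jpg"))
    worker = partial(resize_image, target_width=target_width)
    # workers=None lets Pool use one process per available CPU core
    with Pool(processes=workers) as pool:
        # imap_unordered hands back results as soon as any worker finishes
        return sum(1 for _ in pool.imap_unordered(worker, paths))
```

To run it, call resize_folder("plant-pathology/train_images") from under an if __name__ == "__main__": guard, which multiprocessing needs on platforms that spawn new processes. Because the images are overwritten, keep the original archive if you may need the full resolution later.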

Continue reading: https://towardsdatascience.com/how-to-prepare-your-development-environment-to-rank-on-kaggle-1a0fa1032b84?source=rss----7f60cf5620c9---4

Source: towardsdatascience.com