By Zachary Warnes, Data Scientist
This post is meant for new or aspiring data scientists trying to decide what model to use for a problem.
This post will not cover data wrangling, which, as you hopefully know, makes up the majority of a data scientist’s work. I’m assuming you have some data ready and want to see how to make some predictions.
There are many models to choose from with seemingly endless variants.
There are usually only slight alterations needed to change a regression model into a classification model and vice versa. Luckily, the standard Python supervised learning packages have already done this work for you, so you only need to select the option you want.
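As a minimal sketch of that point, assuming scikit-learn is the package in use: the decision tree ships in both a regression and a classification flavor, and the two share the same `fit`/`predict` interface. The toy data and the variable names below are made up purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Made-up toy data: one feature, six samples.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_continuous = np.array([1.5, 2.1, 2.9, 4.2, 5.1, 5.8])  # regression target
y_labels = np.array([0, 0, 0, 1, 1, 1])                  # classification target

# Same algorithm family, same interface; only the class name changes.
tree_reg = DecisionTreeRegressor(random_state=0).fit(X_toy, y_continuous)
tree_clf = DecisionTreeClassifier(random_state=0).fit(X_toy, y_labels)

print(tree_reg.predict([[2.5]]))  # a continuous value
print(tree_clf.predict([[2.5]]))  # a class label (0 or 1)
```

Swapping between the two tasks here is mostly a matter of picking the other class and supplying the right kind of target.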
Common choices include:
- Decision trees
- Support vector machines (SVM)
- Naive Bayes
- K-Nearest Neighbors
- Neural Networks
- Gradient Boosting
- Random Forests
The list goes on and on, but consider starting with one of these two:
Linear regression & Logistic regression
Yes, fancy models like XGBoost, BERT, and GPT-3 exist, but start with these two.
Note: logistic regression has an unfortunate name. The model is used for classification, but the name persists due to historical reasons.
I would suggest changing the name to something straightforward, like linear classification, to remove this confusion. But I don’t have that kind of leverage in the industry yet.
    from sklearn.linear_model import LinearRegression
    import numpy as np

    X = np.array([[2, 3], [5, 6], [8, 9], [10, 11]])
    y = np.dot(X, np.array([1, 2])) + 1
    reg = LinearRegression().fit(X, y)
    reg.score(X, y)

    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True)
    clf = LogisticRegression(solver="liblinear", random_state=10).fit(X, y)
    clf.score(X, y)
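Both snippets above end with a call to `score`, which means something different for each model. In scikit-learn, `LinearRegression.score` returns the coefficient of determination (R²), while `LogisticRegression.score` returns mean accuracy. The sketch below checks this against the corresponding functions in `sklearn.metrics`; the variable names are chosen here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score

# Regression: the same toy data as above, where y is an exact linear function of X.
X_reg = np.array([[2, 3], [5, 6], [8, 9], [10, 11]])
y_reg = np.dot(X_reg, np.array([1, 2])) + 1
reg = LinearRegression().fit(X_reg, y_reg)
r2_from_score = reg.score(X_reg, y_reg)
r2_from_metric = r2_score(y_reg, reg.predict(X_reg))

# Classification: the breast cancer dataset bundled with scikit-learn.
X_clf, y_clf = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(solver="liblinear", random_state=10).fit(X_clf, y_clf)
acc_from_score = clf.score(X_clf, y_clf)
acc_from_metric = accuracy_score(y_clf, clf.predict(X_clf))

print("R^2:", r2_from_score, r2_from_metric)
print("Accuracy:", acc_from_score, acc_from_metric)
```

Note that both scores are computed on the same data the models were trained on, so they are likely optimistic compared with performance on unseen data.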
Why These Models?
Why should you start with these simple models? Because your problem likely doesn’t need anything fancy.
Busting out some deep learning model and spending hundreds on AWS fees to get only a slight accuracy bump is not worth it.
These two models have been studied for decades and are some of the most well-understood models in machine learning.
They are easily interpretable. Both models are linear, so their inputs translate to their output in a way that…
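As a rough illustration of the interpretability point, a fitted scikit-learn linear model exposes one coefficient per input feature through `coef_`, plus an `intercept_`. The sketch below uses made-up data generated from a known rule (y = 3·x1 − 2·x2 + 5, an assumption chosen for this example) and checks that the fitted model recovers it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: two features that are not multiples or shifts of each other.
X_demo = np.array([[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]], dtype=float)
true_weights = np.array([3.0, -2.0])  # assumed "ground truth" for this demo
true_bias = 5.0
y_demo = X_demo @ true_weights + true_bias

model = LinearRegression().fit(X_demo, y_demo)

# Each coefficient is the change in the prediction per unit change
# in that feature, holding the other feature fixed.
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```

Because the data here is noise-free, the fitted coefficients should match the generating weights up to floating-point error; on real data they are estimates and should be read with more care.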
Continue reading: https://www.kdnuggets.com/2021/08/select-initial-model-data-science-problem.html