
Tag: LinearRegression

Why Data Scientists are Needed Everywhere?

Careers — I have long been asking myself why data science is one of the hottest jobs of our century. I found the answer to this question while discussing linear regression with a Ph.D. researcher in Chemistry who is conducting research to develop a bio-plastic. The answer lies in the scalability of the tools and techniques used by statisticians. In our case, I will take linear regression as an example and show its power. I would reformulate the title of the article as “What does Chemistry have in common with Finance?”. The answer is data! Each of these sectors has data to study, and the data…

Linear Regression

In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.

Practical definition in ML

Given a dataset, we want to predict a range of numeric (continuous) values. One or several variables of the dataset predict (are correlated with) a numerical outcome (the future), which is usually another column in the data.… Read more...

House price prediction using linear regression

Let's review a classical model from Pierian Data. Your neighbor is a real estate agent and wants some help predicting housing prices for regions in the USA. It would be great if you could somehow create a model with Python and scikit-learn for her that allows her to put in a few features of a house and returns an estimate of what the house would sell for.

She has asked you if you could help her out with your new data science skills. You say yes, and decide that Linear Regression might be a good path to solve this problem!

Your neighbor then gives you some information about a bunch of houses in regions of the United States; it is all in the data set: USA_Housing.csv.

The data contains the following columns:

‘Avg. Area Income’: Avg.…
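The fit behind a model like this can be sketched in a few lines. Below is a minimal pure-Python version of simple (one-feature) ordinary least squares — the article itself uses scikit-learn's LinearRegression on USA_Housing.csv; the (income, price) pairs here are made up for illustration.

```python
# Minimal sketch: fitting a one-feature linear model by ordinary least squares.

def fit_simple_ols(xs, ys):
    """Return (intercept, slope) minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: average area income vs. house price.
incomes = [50_000, 60_000, 70_000, 80_000]
prices = [150_000, 180_000, 210_000, 240_000]

b0, b1 = fit_simple_ols(incomes, prices)
predicted = b0 + b1 * 65_000  # estimate for a new region
```

Once the two parameters are fitted, a price estimate for any new region is just the intercept plus slope times income.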

Why Gradient Descent Works?

Everybody knows what Gradient Descent is and how it works. Ever wondered why it works? Here's a mathematical explanation. Photo by Yuriy Chemerys on Unsplash. What is Gradient Descent? Gradient descent is an iterative optimization algorithm that is used to optimize the weights of a machine learning model (linear regression, neural networks, etc.) by minimizing the cost function of that model. The intuition behind gradient descent is this: picture the cost function (denoted by f(Θ̅ ) where…
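The iterative idea described above can be sketched in a few lines: repeatedly step against the gradient of the cost. Here the cost is a toy quadratic f(theta) = (theta - 3)², not a model from the article, so the minimizer theta = 3 is known in advance.

```python
# Minimal sketch of gradient descent minimizing f(theta) = (theta - 3)^2,
# whose gradient is f'(theta) = 2 * (theta - 3).

def gradient_descent(grad, theta0, lr=0.1, steps=100):
    theta = theta0
    for _ in range(steps):
        theta -= lr * grad(theta)  # step opposite the gradient direction
    return theta

theta_star = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)
```

Each step shrinks the distance to the minimizer by a constant factor (here 0.8), which is why the iterates converge.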

Download book for data science beginners – Learn Data Science with R

The book Learn Data Science with R covers minimal theory, practical examples, and projects. It starts with an explanation of the underlying concepts of data science, followed by implementing them in R language. Learn linear regression, logistic regression, random forests, and other machine learning algorithms. The hands-on projects provide a detailed step-by-step guide for analyzing and predicting data.
The book covers the following topics –

  • R Language
  • Statistics and Mathematics


Heterogeneity is defined as a dissimilarity between elements that comprise a whole. When heterogeneity is present, there is diversity in the characteristic under study. The parts of the whole are different, not the same. It is an essential concept in science and statistics. Heterogeneous is the opposite of homogeneous.

Heterogeneous jelly beans!

In chemistry, a heterogeneous mixture has a composition that varies. For example, oil and vinegar, sand and water, and salt and pepper are all heterogeneous mixtures. Multiple samples of these mixtures will contain different proportions of each component.

In statistics, heterogeneity is a vital concept that appears in various contexts, and its definition varies accordingly.… Read more...

Scikit-Learn’s Generalized Linear Models

Or how to make sure the airplane’s altitude is not negative.

Five Annoyingly Misused Words in Data Science

Watch your language if you want to have an impact. Photo by Julien L on Unsplash

1. Predictive

A Practical Introduction to 9 Regression Algorithms

Linear Regression is usually the first algorithm that people learn for Machine Learning and Data Science. Linear Regression is a linear model that assumes a linear relationship between the input variables (X) and the single output variable (y). In general, there are two cases:

  • Single Variable Linear Regression: it models the relationship between a single input variable (single feature variable) and a single output variable.
  • Multi-Variable Linear Regression (also known as Multiple Linear Regression): it models the relationship between multiple input variables (multiple feature variables) and a single output variable.

This algorithm is common enough that Scikit-learn has this functionality built-in with LinearRegression().
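As a quick illustration of that built-in, here is a minimal sketch fitting scikit-learn's LinearRegression on a toy single-variable dataset (y = 2x + 1 with no noise — not an example from the article):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])  # single feature column
y = 2 * X.ravel() + 1                        # exact linear target

model = LinearRegression().fit(X, y)
slope, intercept = model.coef_[0], model.intercept_
```

Because the data is exactly linear, the fitted slope and intercept recover 2 and 1.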


Linear Regression Test Data Error With A Simple Mathematical Formula


How to Find Weaknesses in your Machine Learning Models

By Michael Berk, Data Scientist at Tubi

Any time you simplify data using a summary statistic, you lose information. Model accuracy is no different. When simplifying your model’s fit to a summary statistic, you lose the ability to determine where your performance is lowest/highest and why.

Figure 1: example of areas of the data where model performance is low. Image by author.

To combat this problem, researchers at IBM recently developed a method called FreaAI that identifies interpretable data slices where a given model has poor accuracy. From these slices, the engineer can then take the necessary steps to ensure the model will perform as intended.

FreaAI is unfortunately not open source, but many of the concepts can be easily implemented in your favorite tech stack.


Statistical Machine Learning: Kernelized Generalized Linear Models (GLMs) & Kernelized Linear…

This is often referred to as the “Kernel Trick”. The above procedure allows us to fit linear decision boundaries in high-dimensional feature spaces without explicitly calculating all of the features in said high-dimensional space into an explicit Feature Matrix X. This is the case even when our high-dimensional feature space of interest is infinite-dimensional! There is a considerable volume of literature on Mercer's Theorem and Reproducing Kernel Hilbert Spaces (RKHS) that mathematically supports the above statement, but it's beyond the scope of this article. Rather, I'm going to provide an intuitive explanation supporting this claim based on simple linear algebra and dot products:

Say we have a Feature Matrix X with n observations and p features (i.e.

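A tiny concrete version of the trick, assuming 2-D inputs and the degree-2 polynomial kernel k(x, z) = (x·z + 1)²: this kernel equals the dot product of explicit 6-D feature maps φ(x) = (1, √2·x₁, √2·x₂, x₁², x₂², √2·x₁x₂), so the kernel value can be computed in the original space without ever building the expanded feature matrix.

```python
import math

def poly_kernel(x, z):
    # Kernel evaluated directly in the original 2-D space.
    return (x[0] * z[0] + x[1] * z[1] + 1) ** 2

def phi(x):
    # Explicit degree-2 feature map into 6-D space.
    r2 = math.sqrt(2)
    return [1.0, r2 * x[0], r2 * x[1], x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1]]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, 0.5)
implicit = poly_kernel(x, z)      # computed in the original 2-D space
explicit = dot(phi(x), phi(z))    # computed in the 6-D feature space
```

The two numbers agree, which is the whole point: the high-dimensional dot product comes for free.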

The Art of Hyperparameter Tuning in Python

Before we learn about the hyperparameter tuning methods, we should know the difference between a hyperparameter and a parameter.

The key difference between hyperparameter and parameter is where they are located relative to the model.

A model parameter is a configuration variable that is internal to the model and whose value can be estimated from data.

A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated from data.

Another important term that needs to be understood is the hyperparameter space.
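The parameter/hyperparameter split above can be made concrete with a small sketch: below, the ridge penalty lam is a hyperparameter chosen before fitting (by searching a tiny hyperparameter space on validation data), while the slope w is a parameter estimated from the training data. The 1-D no-intercept ridge formula and all data values are made up for illustration.

```python
# For 1-D ridge regression without intercept: w = sum(x*y) / (sum(x^2) + lam).

def fit_ridge_1d(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def val_error(xs, ys, w):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

train_x, train_y = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]   # roughly y = 2x
val_x, val_y = [4.0, 5.0], [8.0, 10.1]

# The hyperparameter space here is just a small grid of candidate penalties.
best_lam = min(
    [0.0, 0.1, 1.0, 10.0],
    key=lambda lam: val_error(val_x, val_y, fit_ridge_1d(train_x, train_y, lam)),
)
```

Note the asymmetry: w falls out of the data for any fixed lam, but nothing in the fit tells you which lam to use — that choice has to come from outside the model.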


Mathematics Hidden Behind Linear Regression

Exploring statistics using Calculus


Huber and Ridge Regressions in Python: Dealing with Outliers

How to handle outliers in a dataset

Huber and Ridge Regressions in Python: Dealing with Outliers

Traditional linear regression can prove to have some shortcomings when it comes to handling outliers in a set of data.

Specifically, a data point that lies very far from the other points in the set can significantly influence the least squares regression line; i.e., the line that approximates the overall direction of the set of data points will be skewed by the presence of outliers.

In an attempt to guard against this shortcoming, it is possible to use modified regression models that are robust to outliers. In this particular instance, we will take a look at the Huber and Ridge regression models.
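The reason Huber regression resists outliers is the shape of its loss: quadratic for small residuals but only linear for large ones. This is a pure-Python sketch of the Huber loss itself (with its threshold delta), not of either library model:

```python
def huber_loss(residual, delta=1.0):
    a = abs(residual)
    if a <= delta:
        return 0.5 * a ** 2              # small residuals: same as least squares
    return delta * (a - 0.5 * delta)     # large residuals: grows only linearly

outlier = 10.0
squared = 0.5 * outlier ** 2   # 50.0 under plain least squares
huber = huber_loss(outlier)    # 9.5 under Huber loss with delta = 1
```

A residual of 10 costs 50 under squared error but only 9.5 under Huber, so a single far-away point pulls the fitted line far less.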

The dataset that is used in this instance is the Pima Indians Diabetes dataset, originally from the National Institute of Diabetes and Digestive and Kidney Diseases and made available under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication license.


Paradoxes in Data Science

Photo by Shadan Arab on Unsplash


Paradoxes are a class of phenomena that arise when, although starting from premises known to be true, we derive a logically unreasonable result. As Machine Learning models create knowledge from data, this makes them susceptible to possible cognitive paradoxes between training and testing.

In this article, I will walk you through some of the main paradoxes associated with Data Science and how they can be identified:

  • Simpson’s Paradox
  • Accuracy Paradox
  • Learnability-Gödel Paradox
  • The Law of Unintended Consequences

Simpson’s Paradox

One of the most common forms of paradox in Data Science is Simpson’s Paradox.

As an example, let us consider a thought experiment: we carried out a research study to find out whether doing daily physical exercise helps reduce cholesterol levels (in mg/dL), and we are now starting to examine the results obtained.


How DataRobot Can Help Actuaries Build And Interpret Pricing Models

Accurate pricing is essential to protecting an insurance company’s bottom line. Pricing directly impacts the near-term profitability and long-term health of an insurer’s book of business. The ability to charge more accurate premiums helps the company mitigate risk and maintain a competitive advantage, which, in turn, also benefits consumers.

The methods actuaries use to arrive at accurate pricing have evolved. In earlier days, they were limited to univariate approaches. The minimum bias approach proposed by Bailey and Simon in the 1960s was gradually adopted over the next 30 years. The later introduction of Generalized Linear Models (GLM) significantly expanded the pricing actuary’s toolbox.

Over time, the limitations of GLM have driven pricing actuaries to research new, more advanced tools.… Read more...

A Simple Interpretation of Logistic Regression Coefficients

Odds ratios simply explained.

Image by Ian Dooley (source: Unsplash) — thanks Ian!

I’ve always been fascinated by Logistic Regression. It’s a fairly simple yet powerful Machine Learning model that can be applied to various use cases. It’s been widely explained and applied, and yet, I haven’t seen many correct and simple interpretations of the model itself. Let’s crack that now.

I won’t dive into the details of what Logistic Regression is, where it can be applied, how to measure the model error, etc. There’s already been lots of good writing about it. This post will specifically tackle the interpretation of its coefficients, in a simple, intuitive manner, without introducing unnecessary terminology.

Let’s first start from a Linear Regression model, to ensure we fully understand its coefficients.
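The punchline of the odds-ratio interpretation can be checked numerically: a logistic regression coefficient β means a one-unit increase in the feature multiplies the odds p/(1−p) by e^β. The coefficient value below is hypothetical, not from a fitted model.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def odds(p):
    return p / (1 - p)

beta = math.log(2)            # hypothetical coefficient
odds_ratio = math.exp(beta)   # = 2: each unit of x doubles the odds

# Verify on the model p = sigmoid(beta * x): odds at x = 2 vs. x = 1.
ratio = odds(sigmoid(beta * 2)) / odds(sigmoid(beta * 1))
```

Since odds(sigmoid(z)) = e^z, the ratio of odds between x+1 and x is exactly e^β, matching the textbook interpretation.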


KDnuggets Top Blogs Rewards for August 2021

The KDnuggets Top Blog Rewards Program now has the winners for the month of August – congratulations to all winners below! Here are the top 6 blogs published in August 2021 whose authors will share the $2,000 (USD) reward amount:

  1. Automate Microsoft Excel and Word Using Python, by Mohammad Khorasani
  2. The Difference Between Data Scientists and ML Engineers, by Kurtis Pykes
  3. Most Common Data Science Interview Questions and Answers, by Nate Rosidi
  4. 3 Reasons Why You Should Use Linear Regression Models Instead of Neural Networks, by Terence Shin
  5. Django’s 9 Most Common Applications, by Aakash Bijwe
  6. Learning Data Science and Machine Learning: First Steps After The Roadmap, by Harshit Tyagi

We started the rewards program to encourage more high-quality and especially original (unpublished) contributions to KDnuggets.


Linear Regression — The Behind the Scenes Data Science! (Part-2)

Section-3 of image-2 (above) gives us the data related to parameter estimates, or coefficients, for our regression model. Let's understand this in detail.

Please see table-2.3 below, representing this part, for quick reference.

table-2.3 | Output statistics of Simple Linear Regression — Parameter Estimates (dummy data) (image by author)

Our regression model equation is given by: y-pred = B0 + B1*X1 + B2*X2 + …

Specifically for this model, y-pred = 0.209 + 0.001 * X


Parameter estimates, or regression coefficients, are the values of B1, B2, etc. They can be thought of as the weight or importance of the independent variables (i.e. X1, X2, etc.).
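Plugging the estimates from table-2.3 into the equation turns them into predictions. Using the fitted model above, y-pred = 0.209 + 0.001 * X, applied to a hypothetical X value:

```python
# Intercept and slope taken from the parameter estimates above (dummy data).
B0, B1 = 0.209, 0.001

def predict(x):
    return B0 + B1 * x

y_at_100 = predict(100)   # 0.209 + 0.001 * 100
```

So a one-unit increase in X raises the prediction by exactly B1 = 0.001, which is what "weight of the independent variable" means operationally.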

Quick Recall:

It's worth recalling that what we get are parameter estimates (of population parameters) based on the sample analyzed.


Top August Stories: Automate Microsoft Excel and Word Using Python; The Difference Between Data Scientists and ML Engineers

Here are the most popular August 2021 stories on KDnuggets. The first 6 blogs also won KDnuggets Top Blogs Rewards for August.

Platinum Blog: Most Viewed – Platinum Badge (>32,000 UPV)

  1. Automate Microsoft Excel and Word Using Python, by Mohammad Khorasani (*)

Gold Blog: Most Viewed – Gold Badge (>16,000 UPV)

  1. The Difference Between Data Scientists and ML Engineers, by Kurtis Pykes
  2. Most Common Data Science Interview Questions and Answers, by Nate Rosidi (*)
  3. 3 Reasons Why You Should Use Linear Regression Models Instead of Neural Networks, by Terence Shin

Silver Blog: Most Viewed – Silver Badge (>8,000 UPV)

  1. Django’s 9 Most Common Applications, by Aakash Bijwe
  2. Learning Data Science and Machine Learning: First Steps After The Roadmap, by Harshit Tyagi (*)
  3. How Visualization is Transforming Exploratory Data Analysis, by Todd Mostak (*)

Platinum Blog: Most Shared – Platinum Badge (>1400 shares)

  1. The Difference Between Data Scientists and ML Engineers, by Kurtis Pykes

Gold Blog: Most Shared – Gold Badge (>700 shares)

  1. How to Query Your Pandas Dataframe, by Matthew Przybyla (*)
  2. Bootstrap a Modern Data Stack in 5 minutes with Terraform, by Tuan Nguyen
  3. GPU-Powered Data Science (NOT Deep Learning) with RAPIDS, by Tirthajyoti Sarkar (*)
  4. 3 Reasons Why You Should Use Linear Regression Models Instead of Neural Networks, by Terence Shin
  5. Prefect: How to Write and Schedule Your First ETL Pipeline with Python, by Dario Radecic

Silver Blog: Most Shared – Silver Badge (>400 shares)

  1. How Visualization is Transforming Exploratory Data Analysis, by Todd Mostak
  2. Practising SQL without your own database, by Hui Xiang Chua (*)
  3. Django’s 9 Most Common Applications, by Aakash Bijwe
  4. Automate Microsoft Excel and Word Using Python, by Mohammad Khorasani
  5. How To Become A Freelance Data Scientist – 4 Practical Tips, by Pau Labarta Bajo (*)
  6. Learning Data Science and Machine Learning: First Steps After The Roadmap, by Harshit Tyagi
  7. Most Common Data Science Interview Questions and Answers, by Nate Rosidi

(*) indicates that badge added or upgraded based on these monthly results.


Bayesian Linear Regression: Analysis of Car Sales with arm in R

Using Bayesian Linear Regression to account for uncertainty

Bayesian Linear Regression: Analysis of Car Sales with arm in R

Linear regression is among the most frequently used — and most useful — modelling tools.

While no form of regression analysis can ever approximate reality, it can do quite a good job at both making predictions for the dependent variable and determining the extent to which each independent variable impacts the dependent variable, i.e. the size and significance of each coefficient.

However, traditional linear regression can have shortcomings in that this method cannot really account for uncertainty in the estimates.

Bayesian linear regression can serve as a solution to this problem by providing many different estimates of the coefficient values through repeated simulations.
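The article does this with the arm package in R; as a rough stand-in for the same idea, the sketch below bootstraps a simple slope estimate many times in Python, producing a spread of coefficient values instead of a single point estimate. The data, resample count, and seed are all made up for illustration.

```python
import random

random.seed(0)
data = list(zip(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    [2.1, 3.9, 6.2, 8.0, 9.9, 12.1, 14.2, 15.8, 18.1, 19.9],  # roughly y = 2x
))

def slope(pairs):
    # Least-squares slope of y on x for a list of (x, y) pairs.
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    num = sum((x - mx) * (y - my) for x, y in pairs)
    den = sum((x - mx) ** 2 for x, _ in pairs)
    return num / den

# 200 bootstrap resamples -> 200 slope estimates instead of one.
draws = [slope([random.choice(data) for _ in range(len(data))])
         for _ in range(200)]
spread = max(draws) - min(draws)   # the uncertainty a single fit hides
```

The distribution of draws is centered near the true slope, and its width is exactly the kind of uncertainty information a single least-squares fit throws away.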


Regression for Classification | Hands on Experience

Logistic Regression and Softmax Regression with Core Concepts

Regression for Classification | Hands on Experience

We all have developed numerous regression models in our lives, but only a few of us are familiar with using regression models for classification. So my intention is to reveal the beauty of this hidden world.

As we all know, when we want to predict a continuous dependent variable from a number of independent variables, we use linear/polynomial regression. But when it comes to classification, we can't use that anymore.

Fundamentally, classification is about predicting a label and regression is about predicting a quantity.

Why can't linear regression be used for classification? The main reason is that its predicted values are continuous, not probabilistic.
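The standard fix, which logistic regression applies, is to squash the unbounded linear output through the sigmoid so that it lands in (0, 1) and can be read as a probability. A minimal sketch with hypothetical weights:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

w, b = 1.5, -0.5               # hypothetical weights, not from a fitted model
linear_out = w * 10.0 + b      # 14.5: an arbitrary continuous value
prob = sigmoid(linear_out)     # squashed into (0, 1): usable as a probability
```

The raw linear output (here 14.5) is not a valid probability, but its sigmoid always is — which is exactly the gap between regression and classification that logistic regression closes.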