# Tag: variance

Variance is a measure of variability in statistics. It assesses the average squared difference between data values and the mean. Unlike some other statistical measures of variability, it incorporates all data points in its calculation by comparing each value with the mean. When there is no variability in a sample, all values are the same, […]
The post Variance appeared first on Statistics By Jim.
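The definition above can be checked in a few lines of Python; the data values here are made up purely for illustration:

```python
# Population variance: the average squared deviation from the mean.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)

print(mean, variance)  # 5.0 4.0
```

For a sample rather than a full population, divide by `len(data) - 1` instead (Bessel's correction).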

In the process of building a Machine Learning model, there is a trade-off between bias and variance.

A step-by-step tutorial explaining how PCA works and implementing it from scratch in Python. Principal Component Analysis (PCA) is a commonly used dimensionality reduction method. It works by computing the principal components and performing a change of basis, retaining the data in the directions of maximum variance. The reduced features are uncorrelated with each other and can be used for unsupervised clustering and classification. To reduce…
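A minimal sketch of the from-scratch procedure described above (the synthetic data, seed, and choice of two components are illustrative assumptions, not taken from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data with very different variance along each axis.
X = rng.normal(size=(100, 3)) * np.array([2.0, 1.0, 0.1])

# 1. Centre the data.
Xc = X - X.mean(axis=0)
# 2. Eigendecomposition of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]          # largest variance first
components = eigvecs[:, order[:2]]         # keep the top-2 principal axes
# 3. Change of basis: project onto the principal components.
X_reduced = Xc @ components

print(X_reduced.shape)  # (100, 2)
```

The projected features are uncorrelated because the eigenvectors of the covariance matrix are orthogonal.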

Ensemble modelling helps you avoid overfitting by reducing variance in the predictions and minimizing the bias of the modelling method.
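A toy simulation of the variance-reduction effect, under the idealised assumption of B independent, equally noisy models (real ensemble members are correlated, so the gain in practice is smaller):

```python
import numpy as np

rng = np.random.default_rng(1)
true_value, sigma, B = 10.0, 2.0, 25

# 10,000 predictions from a single noisy model vs. a B-model average.
single_preds = true_value + rng.normal(0.0, sigma, size=10_000)
ensemble_preds = (true_value + rng.normal(0.0, sigma, size=(10_000, B))).mean(axis=1)

print(round(single_preds.var(), 2))    # ≈ sigma**2 = 4
print(round(ensemble_preds.var(), 2))  # ≈ sigma**2 / B = 0.16
```

Averaging B independent predictions divides the prediction variance by B, which is the mechanism behind bagging-style ensembles.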

What do we want to achieve?

We want to train an AI agent or model that can do something like this,

Image source: Prepared by the author using this Pixabay image (Free to use)

Variances, anomalies, shifts

A little more specifically, we want to train an AI agent (or model) to identify/classify time-series data for:

low/medium/high variance
anomaly frequencies (a low or high fraction of anomalies)
anomaly scales (are the anomalies far from the normal range or close to it)
a positive or negative shift in the time-series data (in the presence of some anomalies)
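One way to set up labelled training data for these four properties; the generator below and all its parameter names are hypothetical, not taken from the original post:

```python
import numpy as np

def make_series(n=200, variance="low", anomaly_frac=0.02,
                anomaly_scale=5.0, shift=0.0, seed=0):
    """Hypothetical generator: baseline noise + injected anomalies
    + an optional level shift in the second half of the series."""
    rng = np.random.default_rng(seed)
    sigma = {"low": 0.5, "medium": 1.0, "high": 2.0}[variance]
    y = rng.normal(0.0, sigma, size=n)
    # Inject anomalies at random positions, anomaly_scale sigmas away.
    idx = rng.choice(n, size=max(1, int(anomaly_frac * n)), replace=False)
    y[idx] += anomaly_scale * sigma * rng.choice([-1, 1], size=idx.size)
    # Apply a positive or negative shift to the second half.
    y[n // 2:] += shift
    return y

series = make_series(variance="high", anomaly_frac=0.05, shift=2.0)
print(series.shape)  # (200,)
```

Sweeping these parameters and recording the settings as labels yields a supervised training set without any hand-crafted features.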

But, we don’t want to complicate things

However, we don’t want to do a ton of feature engineering or learn complicated time-series algorithms (e.g. ARIMA) and properties (e.g.… Read more...

## A statistical tool for analyzing feature relationships

Once these types of data have been cleaned, they do more than show organized data sets. They reveal unlimited possibilities, and AI analytics can reveal these possibilities faster and more efficiently than ever before.

Data scientists have always been expected to curate data into ‘aha’ moments and tell stories that can reach a wider business audience. But what is the cost of this curation?

The real signal is in the noise

Tidy data doesn’t help that much.

Every aggregation and pivot performed on datasets reduces the total amount of information available to analyze. That clever NLP topic mining on free text fields was no doubt very useful, but the raw text is more interesting. Perhaps those ‘meaningless’ raw sensor logs are just that, or not.

## Why regularized least squares is equivalent to the maximum posterior solution and is the optimal algorithm for normally distributed data

This article compares Tikhonov L2 regularisation to the maximum posterior probability function. It is quite algebra-dense, so I recommend picking up a pen and paper and trying it yourself!

One of the problems with many machine learning models is their tendency to overfit the data. Overfitting is a phenomenon in machine learning that happens when the model learns the training data too well, worsening its ability to generalize.

There is a delicate balance between the bias and the variance of a learning machine.
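The equivalence can be checked numerically: the Tikhonov (ridge) closed form below is exactly the stationary point of the Gaussian log-posterior, with λ equal to the noise-to-prior variance ratio σ²/τ². The synthetic data is for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(0.0, 0.5, size=50)

lam = 1.0  # lambda = sigma**2 / tau**2: noise variance over prior variance

# Ridge closed form: w = (X'X + lam*I)^-1 X'y.
# Setting the gradient of the negative log-posterior
#   ||y - Xw||^2 / (2 sigma^2) + ||w||^2 / (2 tau^2)
# to zero gives exactly these normal equations.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(np.round(w_ridge, 2))
```

With a flat (infinite-variance) prior, λ → 0 and the solution reduces to ordinary least squares, which is the maximum likelihood estimate.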

## Regularisation is a powerful tool that can help tune models

At the start of this year, I had a very limited understanding of regularisation. To me, it was simply a tool to use against overfitting: if the model had a higher validation loss than training loss, adding L1 or L2 regularisation might solve the issue. The truth is, there are many different ways to look at regularisation, and it is a powerful tool that can help tune models that are misbehaving.

Before understanding the solution (regularisation) we must first understand the problem which is overfitting.

Let us start from the overfitting diagnosis that all of us are familiar with: when the validation/test loss is higher than the training loss.

## Stratification, CUPED, Variance-Weighted Estimators, and ML-based methods CUPAC and MLRATE

When we do online experiments or A/B testing, we need to ensure our test has high statistical power so that we have a high probability of finding the experimental effect if it does exist. What are the factors that might affect power? Sample size, sampling variance of the experiment metric, significance level alpha, and effect size.

The canonical way to improve power is to increase the sample size. However, the dynamic range is limited, since the minimum detectable effect (MDE) is proportional to 1/sqrt(sample_size). Also, in reality, getting more samples or running an experiment for a longer time to increase the sample size might not always be easy or feasible.
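The 1/sqrt(sample_size) relationship can be made concrete with the standard two-sample approximation (α = 0.05 two-sided, 80% power; the helper name is ours):

```python
import math

def mde(sigma, n_per_group, z_alpha=1.96, z_beta=0.84):
    """Approximate minimum detectable effect for a two-sample test."""
    return (z_alpha + z_beta) * sigma * math.sqrt(2.0 / n_per_group)

sigma = 10.0
print(round(mde(sigma, 1_000), 2))  # baseline
print(round(mde(sigma, 4_000), 2))  # 4x the samples only halves the MDE
```

This is why variance-reduction techniques such as CUPED are attractive: shrinking the metric's variance lowers the MDE without collecting more samples.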

## A hands-on introductory course on machine learning techniques for physicians and healthcare professionals.

In Part II of this course, we went through the basic steps of data exploration. We started by surveying the distributions of a dataset by generating histograms using `DRESS.histograms`. Then we studied the central tendencies and dispersions of the various features in the dataset using `DRESS.means`, `DRESS.medians`, and `DRESS.frequencies`. We demonstrated the use of `DRESS.heatmap` and `DRESS.correlations` to visualize the degree of correlations among the various features in the dataset. Along the way, we also introduced the concept of missing value imputation.

It is important to stress that a proper data exploration process involves a great deal of dataset-specific analysis and is highly dependent on relevant domain knowledge.

## Determining sales differences across groups

The primary purpose of using an ANOVA (Analysis of Variance) model is to determine whether differences in means exist across groups.

While a t-test is capable of establishing if differences exist across two means — a more extensive test is necessary if several groups exist.

In this example, we will take a look at how to implement an ANOVA model to analyse car sales data.

The analysis is conducted on a car sales dataset available at Kaggle.

The purpose of this analysis is to determine whether factors such as engine size, horsepower, and fuel efficiency differ across groups of cars based on both vehicle type and country of origin.

A one-way ANOVA is used to determine effects when using just one categorical variable.
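The one-way F-statistic can be computed by hand to show what ANOVA measures; the engine-size numbers below are made up for illustration and are not from the Kaggle dataset:

```python
import numpy as np

# Hypothetical engine sizes (litres) for three vehicle-type groups.
groups = [np.array([1.6, 1.8, 2.0, 1.7]),
          np.array([2.4, 2.6, 2.5, 2.7]),
          np.array([3.0, 3.2, 3.1, 2.9])]

k = len(groups)
n = sum(g.size for g in groups)
grand_mean = np.concatenate(groups).mean()

# Partition the variability into between-group and within-group parts.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(round(f_stat, 1))  # a large F suggests the group means differ
```

Comparing `f_stat` against the F distribution with (k − 1, n − k) degrees of freedom gives the p-value (e.g. via `scipy.stats.f.sf`).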

### Are my (bio)pharmaceutical assay performances reliable? Only probability of success counts!

#### Alternative to traditional Gage R&R metrics for the pharmaceutical industry

By Thomas de Marchin (Senior Manager Statistics and Data Sciences at Pharmalex), Laurent Natalis (Associate Director Statistics and Data Sciences at Pharmalex), Tatsiana Khamiakova (Associate Director Manufacturing and Applied Statistics at Janssen), Eric Rozet (Director Statistics and Data Sciences at Pharmalex) and Hans Coppenolle (Director Manufacturing and Applied Statistics at Janssen). This article was originally presented at the conference NCB 2021.

## An understanding of key skill areas to identify talented machine learning engineers. Such an understanding will help in recruiting, allocating, and promoting the engineers.

Machine Learning Engineer (MLE) is one of the hottest roles these days. While many would associate the role with Python, R, random forests, convolutional neural networks, PyTorch, scikit-learn, the bias-variance tradeoff, etc., a lot more comes in the path of these engineers. The things an MLE needs to handle derive not only from the field of Machine Learning (ML) but also from other technical and soft disciplines. As depicted in Figure 1, in addition to possessing ML skills, an MLE needs to know programming, (big) data management, cloud solutions, and system engineering.

## On the challenge of aggregating results in a convincing way

Continuing from a previous question, I want to discuss two other sports situations where people always argue about the right way to aggregate results.

All “Big 3” Tennis players currently have 20 grand slams. So who is actually the best? (Leaving aside how that changes in the future…)

An intuitive idea is that the data shows us not only the players’ totals, but also an indication of how “versatile” vs “specialized” they are. The majority of Nadal’s grand slams were achieved in the French Open. We can penalize for that (or reward it) by choosing a different aggregation scheme. A common method, parameterized by p, is the family of Lp norms:

For p=1, this is just the total sum of the grand slams (For example, Federer’s grand slams vector is (6,1,8,5), and the L1 norm is 20).
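A quick sketch of the norms on the (6, 1, 8, 5) vector from the text:

```python
import numpy as np

federer = np.array([6.0, 1.0, 8.0, 5.0])  # grand slams per tournament

l1 = np.linalg.norm(federer, 1)          # 20.0: the plain total
l2 = np.linalg.norm(federer, 2)          # ≈ 11.22
linf = np.linalg.norm(federer, np.inf)   # 8.0: only the best event counts

print(l1, round(l2, 2), linf)
```

For a fixed total, larger p increasingly rewards concentration in a single tournament, so the choice of p encodes whether specialisation is rewarded or penalised.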

## How descriptive statistics alone can mislead you

If you are new to data science and have taken a course on preliminary data analysis, chances are one of the first steps taught for exploratory data analysis (EDA) is to view the summary / descriptive statistics. But what do we really intend to accomplish with this step?

Summary statistics are important because they tell you two things about your data that matter for modeling: location and scale parameters. Location parameters, in statistics, refer to measures such as the mean. Comparing the mean with other location measures like the median hints at whether your data is roughly normal or potentially skewed, which helps in modeling decisions.

For example, if the dataset is normal, modeling techniques like Ordinary-Least-Squares may be sufficient and powerful enough for predicting purposes.
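A sketch of the pitfall: two synthetic samples constructed to share the same mean and standard deviation while having very different shapes:

```python
import numpy as np

rng = np.random.default_rng(3)
symmetric = rng.normal(10.0, 2.0, size=10_000)
# Shifted exponential: also mean 10 and std 2, but heavily right-skewed.
skewed = rng.exponential(2.0, size=10_000) + 8.0

for name, x in [("symmetric", symmetric), ("skewed", skewed)]:
    skew = ((x - x.mean()) ** 3).mean() / x.std() ** 3
    print(f"{name}: mean={x.mean():.1f} std={x.std():.1f} skew={skew:.1f}")
```

Identical location and scale, very different distributions, which is why a histogram (or a skewness check) is worth the extra step before trusting OLS-style assumptions.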

Generally, a model for time-series forecasting can be written as

yₜ₊ₕ = g(Xₜ, θ) + εₜ₊ₕ

where yₜ₊ₕ is the variable to be forecasted (the dependent, or response, variable), t is the time at which the forecast is made, h is the forecast horizon, Xₜ are the variables used at time t to make the forecast (the independent variables), θ is a vector of parameters of the function g, and εₜ₊ₕ denotes the error. It is worth noting that the observed data is uniquely ordered according to the time of observation, but the model does not have to depend on time, i.e. time (the index of the observations) does not have to be one of the independent variables.

## Some possible properties of time series

Stationarity: a stationary process is a stochastic process whose mean, variance, and autocorrelation structure do not change over time.
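A quick numerical illustration of the definition, contrasting stationary white noise with a non-stationary random walk (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(4)
noise = rng.normal(size=2_000)   # stationary: constant mean and variance
walk = noise.cumsum()            # random walk: variance grows over time

noise_v1, noise_v2 = noise[:1_000].var(), noise[1_000:].var()
walk_v1, walk_v2 = walk[:1_000].var(), walk[1_000:].var()

print("white noise halves:", round(noise_v1, 2), round(noise_v2, 2))
print("random walk halves:", round(walk_v1, 1), round(walk_v2, 1))
```

The two halves of the noise series have nearly identical variance, while the random walk's do not; this is the kind of behaviour formal tests such as the augmented Dickey-Fuller test look for.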

## Pre-requisite: This story is for both technical and business folks who have experience in running experimentation.

TL;DR — The goal of experimentation is to make a decision and not to chase a specific significance level.

It is day 14 of running the most critical experiment in your business unit. Your leadership pings you on work chat and asks, “Are we stat sig yet?”

You quickly run the t-test and report back, “hmm we are almost stats sig at 90% significance level.”

“Okay, but when are we going to reach 95% stats sig?”, the lead replies.

You point back to the variance plot to explain how some experiments just don’t get to 95% stats sig even if it was well designed. Not quite following the math, the lead responds, “but what do we need to do to get to stats sig?”

## The fastest multiple imputation method using XGBoost

In this blog, we shall discuss how to use H2O to build a few supervised machine learning models. H2O is Java-based software for data modeling and general computing, its primary purpose being a distributed, parallel, in-memory processing engine. It needs to be installed first (instructions), and by default an H2O instance runs on `localhost:54321`. Additionally, one needs to install the R/Python clients to communicate with the H2O instance. Every new R/Python session first needs to initialize a connection between the client and the H2O cluster.

The problems to be described in this blog appeared in the exercises / projects in the Coursera course “Practical Machine Learning on H2O,” by H2O.

## Using UMAP for Dimensionality Reduction

Dimensionality reduction is one of the most important steps when dealing with large datasets, because it transforms the data into a lower dimension so that we can identify important features and their properties. It is generally used to avoid the curse of dimensionality that arises while analyzing large datasets.

Dealing with high-dimensional data can be difficult when we are working on numerical analysis or creating a Machine Learning model. Using a high-dimensional dataset can result in high variance, and the model will not generalize well. If we lower the dimensionality, we can make Machine Learning models more generalizable and avoid over-fitting.

Contributed by Yannick Kimmel. He is currently in the NYC Data Science Academy 12 week full time Data Science Bootcamp program taking place between April 11th to July 1st, 2016. This post is based on his second class project – R Shiny (due on the 4th week of the program).

## Introduction

The culture of food and health (like the other aspects of culture) is constantly changing and is diverse (meaning high variance) in the USA. Obesity affects roughly 1 in 3 Americans, while diabetes affects roughly 1 in 10 Americans. I wanted to understand the relationship of food and health demographics. I thought this data would be important for policy makers and civic leaders who would be interested in changing their demographics for the better.

By Theodore Tsitsimis, Machine Learning Scientist

Bias-Variance trade-off is a fundamental concept of Machine Learning. Here we’ll explore some different perspectives of what this trade-off really means with the help of visualizations.

## Bias-Variance in real life

A lot of our decisions are influenced by others, when observing their actions and comparing ourselves with them (through some social similarity metric). At the same time, we maintain our own set of rules that we learn through experience and reasoning. In this context:

• Bias is when we have very simplistic rules that don’t really explain real-life situations. For example, thinking you can become a doctor by watching YouTube videos.
• Variance is when we always change our minds by listening to different groups of people and mimicking their actions.