In this example, we are going to simulate some data. We will define what our signal is, introduce varying levels of noise, and see how R² reacts.
Our signal: the true coefficient of X, which is equal to 1.2.
The noise: random numbers with a mean of 0 and an increasing standard deviation, added to Y.
To accomplish this we are going to use the following code to generate data.
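# 500 evenly spaced x values from 1 to 10
x1 <- seq(1, 10, length.out = 500)
# Signal (intercept 2, slope 1.2) plus normal noise with mean 0 and sd 1
y1 <- 2 + 1.2 * x1 + rnorm(500, mean = 0, sd = 1)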
For x1, the computer is going to generate 500 evenly spaced numbers between 1 and 10. For y1, we take each of those numbers, multiply it by our true coefficient (1.2), add 2 to it, and then add a random number drawn from a normal distribution with a mean of 0 and a standard deviation of 1. The 500 in rnorm is simply how many random numbers to draw, one for each value of x1.
We will create 4 graphs and linear regressions to see how our results change with increasing standard deviation. It is important to realize that in each of these, our signal (the true coefficient being 1.2) will hold, but we will have varying levels of noise (that has an average of 0 in the long run).
If R² measures our signal or the strength of it, it should stay roughly equal. If it measures the noise in our data, then R² should plummet as we increase the noise in our model (holding the signal constant). So what happens?
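A minimal sketch of this experiment is below. The seed and the four standard deviations (1, 2, 5, 10) are assumptions chosen for illustration, and `fit_one` is a hypothetical helper name. The idea is only to refit the same model under growing noise and record the estimated slope next to R².

```r
set.seed(42)  # assumption: fixed seed so the run is reproducible

x1 <- seq(1, 10, length.out = 500)
noise_sds <- c(1, 2, 5, 10)  # assumption: illustrative noise levels

# Hypothetical helper: simulate y at one noise level, fit a line,
# and return the estimated slope together with R-squared.
fit_one <- function(noise_sd) {
  y <- 2 + 1.2 * x1 + rnorm(length(x1), mean = 0, sd = noise_sd)
  fit <- lm(y ~ x1)
  data.frame(
    noise_sd  = noise_sd,
    slope     = unname(coef(fit)["x1"]),
    r_squared = summary(fit)$r.squared
  )
}

# One row per noise level
results <- do.call(rbind, lapply(noise_sds, fit_one))
print(results)
```

To draw the graphs as well, one option is to call `plot(x1, y)` followed by `abline(fit)` inside the helper, after setting `par(mfrow = c(2, 2))`.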
Would you look at that. The true coefficient in each regression was 1.2, and every regression estimated a value right around 1.2. But notice how our R² suffers every time we increase the noise.
This is why R² can be such a destructive model evaluator. Since it doesn’t even try to measure the signal, dismissing a model because of its low R² can cause us to overlook cases where we are accurately identifying the signal in our data.
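One way to see why R² behaves like this in the simulation: for a straight-line model with independent noise, the R² we should expect is roughly the variance coming from the signal divided by the total variance (signal plus noise). The sketch below computes that ratio directly. The noise levels are again assumptions for illustration, and the result is an approximation of the expected value, not the R² of any particular fit.

```r
x1 <- seq(1, 10, length.out = 500)
beta <- 1.2                   # true slope used in the simulation
noise_sds <- c(1, 2, 5, 10)   # assumption: illustrative noise levels

# Variance in y that comes from the signal alone
signal_var <- beta^2 * var(x1)

# Approximate expected R-squared: signal variance over total variance
expected_r2 <- signal_var / (signal_var + noise_sds^2)

print(data.frame(noise_sd = noise_sds, expected_r2 = round(expected_r2, 3)))
```

The slope never changes in this calculation. Only the noise term in the denominator grows, and that alone is enough to pull the ratio toward zero.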
Earlier I told you that R² is the proportion of the variance in the dependent variable that is explained by our independent variable, but now, dear reader, I must confess that this too is more complicated than it seems.
When we repeat a word often enough, we can delude ourselves into thinking we understand it, so it is worth our attention to examine what “explains” really means here. My fear is that many take this ambiguous word and decide it means “causes”. As a final note, we should show that R² cannot possibly tell us that X causes Y, or that there is any kind of causal link between the two.
This simple experiment is actually rather easy to do. Let us take some X that does in fact have some causal effect on Y (the same code as before). Then, if R² does measure some kind of causal…