Regression Assumptions/Diagnostics in R
May 16, 2024
In this video, we explain regression assumptions and diagnostics in R: normality of residuals, heteroscedasticity tests, and multicollinearity tests. We also discuss robust standard errors and a remedy for multicollinearity. 0:00 Load Packages 1:40 Normality Tests (skewness, kurtosis) 4:55 Shapiro-Wilk Test 5:56 QQ Plot 6:36 Heteroscedasticity (Breusch-Pagan Test) 7:35 Breusch-Godfrey Test 7:55 Remedy for Heteroscedasticity 9:22 Multicollinearity
0:00
Welcome to the Data Hall YouTube channel. In this video we are going to talk about the
0:04
regression assumptions in R. We are going to discuss how to test these regression assumptions
0:11
and what remedies we can use to solve these issues. So first we are going to
0:18
install the packages. I have already installed them, so I'm not going to install them in this video,
0:23
but if you wish to install them you can use this command. We install the moments, lmtest, sandwich and car libraries. These are the
0:36
libraries that we are going to use for different purposes. For example, we are going to use
0:40
moments for skewness and kurtosis, then we are going to use lmtest for heteroscedasticity,
0:47
sandwich for robust standard errors to resolve the issue
0:54
of heteroscedasticity, and then we have the car library, which we are going to use for the variance
0:59
inflation factor. If you want to download the R script or the data that I'm going to use
1:05
with this video, you can download it from the link given in the description. So let's load the data.
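The package setup described above can be sketched as follows. This is a minimal sketch, not the video's script: the package names are the CRAN names matching the libraries mentioned in the video, the install line is left commented out so the snippet runs offline, and the object names `pkgs`, `available` and `missing_pkgs` are illustrative.

```r
# One-time installation (uncomment if the packages are not installed yet):
# install.packages(c("moments", "lmtest", "sandwich", "car"))

# Packages named in the video and what each one is used for.
pkgs <- c(
  moments  = "skewness and kurtosis",
  lmtest   = "heteroscedasticity tests and coeftest",
  sandwich = "robust standard errors",
  car      = "variance inflation factor"
)

# Check which packages are available without attaching them.
available <- vapply(names(pkgs), requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- names(pkgs)[!available]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  # Attach everything, as done in the video with library().
  invisible(lapply(names(pkgs), library, character.only = TRUE))
}
```

Loading the libraries before calling their functions avoids the "could not find function" error that shows up later in the video.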
1:12
We are going to use the auto data. If I show you the auto data from the environment
1:19
(I have minimized the environment), we have 74 observations and 12 variables. Let me show you
1:25
the data. We have different cars: these are the make of the car, their model, their prices,
1:32
their mileage per gallon, the number of repairs, and then there is the weight of the car and
1:40
the length of the car. So what we are going to do is
1:47
use a regression model: we are going to
1:53
take price as the dependent variable and regress it on mileage, weight and length.
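A regression of this shape can be sketched in R as below. The video's auto data is not bundled with R, so this sketch uses the built-in `mtcars` data as a stand-in (an assumption), with `mpg` as the dependent variable and `wt`, `hp` and `disp` as predictors. The commented call shows what the video's model would look like if the data frame were called `auto` with columns `price`, `mpg`, `weight` and `length`; those names are assumed, not confirmed by the transcript.

```r
# Stand-in data (assumption): built-in mtcars replaces the video's auto data.
# With the video's data the call would be along the lines of
#   model <- lm(price ~ mpg + weight + length, data = auto)
# where the object and column names are assumed.
model <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Assigning the fit prints nothing; summary() shows the coefficients,
# standard errors, t statistics, and R-squared.
model_summary <- summary(model)
print(model_summary)
```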
1:59
So first we are going to test the normality of the residuals. For that we are going to run
2:03
the regression using the lm function: specify the dependent variable, the tilde sign, and then the
2:10
independent variables, and then the data that I'm going to use. I'm going to use the auto data that
2:15
I have just loaded. Let me press Ctrl+Enter. The model has been executed, although I cannot
2:23
see the model in my console, because I have to execute the second command, which is summary(model),
2:30
and we can see that we have the coefficients, their standard errors, their t-tests, and the
2:36
R-squared and other model fit values. But what I am interested in is testing the normality of
2:44
the residuals, because one of the assumptions is that the residuals should be normally distributed.
2:52
So we are not going to test the normality of each of the variables; we are going to test the
2:56
normality of the residuals themselves. So what we are going to do is extract
3:02
the residuals from this model. We use the resid function and then specify the model
3:09
that we have created. This extracts the residuals from the model,
3:17
and we save them in a new object called residuals. If I execute this, it creates
3:24
a vector that holds these residuals. Now what we are going to do is
3:29
test the normality of these residuals. We first look at skewness and kurtosis;
3:38
these functions come from the moments library. So first we are going to execute the skewness function.
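The residual extraction and the skewness check, together with the kurtosis check the video turns to shortly, can be sketched as follows. This is a minimal sketch on the built-in `mtcars` data (a stand-in for the video's auto data, which is an assumption). The moment-based formulas are written out in base R so the snippet runs without extra packages; the `moments` calls are shown only when that package is installed.

```r
# Stand-in model (assumption): mtcars replaces the video's auto data.
model <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Extract the residuals into a plain numeric vector.
res <- resid(model)

# Moment-based skewness and kurtosis, computed by hand in base R.
n  <- length(res)
m2 <- sum((res - mean(res))^2) / n   # second central moment
m3 <- sum((res - mean(res))^3) / n   # third central moment
m4 <- sum((res - mean(res))^4) / n   # fourth central moment
skew_manual <- m3 / m2^1.5           # near 0 for normal data
kurt_manual <- m4 / m2^2             # near 3 for normal data (not excess kurtosis)
print(c(skewness = skew_manual, kurtosis = kurt_manual))

# The video uses the moments package for the same two quantities.
if (requireNamespace("moments", quietly = TRUE)) {
  print(c(skewness = moments::skewness(res),
          kurtosis = moments::kurtosis(res)))
}
```

A positive skewness points to a longer right tail, and a kurtosis well above 3 points to heavier tails than a normal distribution.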
3:46
Okay, so we didn't load the libraries, so I'm just going to quickly load
3:51
all the libraries. Okay, let's get back and press Ctrl+Enter: we get the skewness of the residuals. Now,
4:01
ideally, for normally distributed data the skewness should be close to zero.
4:06
A positive number indicates that it is a positively
4:14
skewed distribution, and a negative number indicates that it is a negatively skewed
4:20
distribution. Currently our distribution is positively skewed. Let's check kurtosis. Kurtosis should be
4:29
three. This function gives us the kurtosis rather than the excess kurtosis. If any test gives you excess
4:36
kurtosis, then its value should be zero, but in this case its value should
4:40
be close to three. Now we can see that it is somewhat close to three. It isn't exactly
4:47
three, but it is somewhat close to three, so from that perspective let's move forward.
4:55
Now, this gives us the skewness and kurtosis, but it isn't a test per se. So what we are going
5:01
to do is use a formal test, in this case the Shapiro-Wilk test. What we do is
5:07
use the shapiro.test function and apply it to the residuals that we just created.
5:15
What we are interested in is the p-value. If the p-value is less than 0.05, then we say that the
5:23
residuals are not normally distributed, because the null hypothesis of the Shapiro-Wilk test is a
5:31
normal distribution and the alternate hypothesis is that the series is not normally distributed.
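The Shapiro-Wilk decision rule just described can be sketched as below, again on the built-in `mtcars` data as a stand-in for the video's auto data (an assumption). `shapiro.test()` is part of base R's stats package; the names `sw`, `alpha` and `reject_normality` are illustrative.

```r
# Stand-in model and residuals (assumption: mtcars replaces the auto data).
model <- lm(mpg ~ wt + hp + disp, data = mtcars)
res <- resid(model)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal.
sw <- shapiro.test(res)
print(sw)

# Decision rule at the 5% level used in the video.
alpha <- 0.05
reject_normality <- sw$p.value < alpha
if (reject_normality) {
  message("p < 0.05: reject normality of the residuals")
} else {
  message("p >= 0.05: no evidence against normality of the residuals")
}
```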
5:36
So in this case, because it is less than 0.05, we reject the null hypothesis in favor of the alternate
5:42
hypothesis, which is that the data is not normally distributed. So from the Shapiro-Wilk test we can see
5:50
that the residuals are not normally distributed, so this assumption is violated. Let's also look at
5:58
a visualization using a Q-Q plot. We call qqnorm on the residuals; this creates a Q-Q
6:09
plot, but we need a diagonal reference line, which comes from qqline. Now, the closer the values are to
6:19
this diagonal line, the closer the series is to a normally distributed series, but in this case
6:28
we can see that it deviates from that line, and so it is not normally distributed.
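The Q-Q plot step can be sketched with base R graphics as follows. The data is the built-in `mtcars` stand-in (an assumption), and the plot title is illustrative. When run non-interactively, R writes the plot to a file instead of opening a window.

```r
# Stand-in model and residuals (assumption: mtcars replaces the auto data).
model <- lm(mpg ~ wt + hp + disp, data = mtcars)
res <- resid(model)

# qqnorm() draws sample quantiles against theoretical normal quantiles
# and returns the plotted coordinates invisibly.
qq <- qqnorm(res, main = "Normal Q-Q plot of residuals")

# qqline() adds the reference line through the first and third quartiles.
qqline(res)
```

Points that bend away from the line in the tails are the visual counterpart of the skewness and kurtosis values computed earlier.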
6:36
Let's move towards heteroscedasticity. Let's test for heteroscedasticity. First we are going to do the Breusch-Pagan test. The null hypothesis of the Breusch-Pagan test is that
6:46
the error variance is homoscedastic, whereas the alternate hypothesis
6:54
is that it is heteroscedastic. So we are going to call bptest and specify the model rather than the
7:02
residuals. In the previous assumption we were testing the normality of the residuals, so we
7:08
used the residuals, but in this case we are going to use the model that we have created.
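To show what the Breusch-Pagan test does with the model, the sketch below computes the studentized (Koenker) version of the statistic by hand: the squared residuals are regressed on the model's predictors and the statistic is n times the R-squared of that auxiliary regression. The data is the built-in `mtcars` stand-in (an assumption), and the names `aux`, `bp_stat` and `bp_p` are illustrative. The `lmtest::bptest()` call, which uses the studentized version by default, is run only if lmtest is installed.

```r
# Stand-in model (assumption): mtcars replaces the video's auto data.
model <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors.
aux_data <- transform(mtcars, e2 = resid(model)^2)
aux <- lm(e2 ~ wt + hp + disp, data = aux_data)

# Studentized Breusch-Pagan statistic: n * R^2, chi-squared with k degrees
# of freedom, where k is the number of predictors (intercept excluded).
n <- nobs(model)
k <- length(coef(model)) - 1
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = k, lower.tail = FALSE)
print(c(statistic = bp_stat, df = k, p.value = bp_p))

# The video's approach: pass the fitted model to bptest() from lmtest.
if (requireNamespace("lmtest", quietly = TRUE)) {
  print(lmtest::bptest(model))
}
```

A p-value below 0.05 rejects the null hypothesis of constant error variance.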
7:14
Okay, so let's press Ctrl+Enter, and it gives us certain values. We are interested in the
7:20
p-value, and in this case it is less than 0.05. That means that the data is heteroscedastic:
7:28
we reject the null hypothesis of homoscedasticity in favor of the alternate
7:33
hypothesis of heteroscedasticity. Similarly, we have the Breusch-Godfrey test, although strictly its null hypothesis is no serial correlation in the residuals rather than homoscedasticity,
7:46
so it is not the same as the Breusch-Pagan test. Again, using the Breusch-Godfrey test we reject the null hypothesis.
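Because the Breusch-Pagan test points to heteroscedasticity, the remedy the video turns to next is robust standard errors. The sketch below computes a heteroscedasticity-consistent (HC1) covariance matrix by hand so the "different formula" is visible, then shows the equivalent `coeftest()` and `vcovHC()` call when lmtest and sandwich are installed. The data is the built-in `mtcars` stand-in, and the choice of `type = "HC1"` is an assumption, since the transcript does not say which variant the video uses.

```r
# Stand-in model (assumption): mtcars replaces the video's auto data.
model <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Sandwich estimator by hand: bread %*% meat %*% bread.
X <- model.matrix(model)           # design matrix, including the intercept
e <- resid(model)
n <- nrow(X)
k <- ncol(X)
bread <- solve(crossprod(X))       # (X'X)^{-1}
meat  <- crossprod(X * e)          # sum over i of e_i^2 * x_i x_i'
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread  # HC1 small-sample scaling

# Coefficients stay the same; only the standard errors change.
classic_se <- sqrt(diag(vcov(model)))
robust_se  <- sqrt(diag(vcov_hc1))
print(cbind(estimate = coef(model), classic_se, robust_se))

# Package route (HC1 is an assumed choice of type).
if (requireNamespace("lmtest", quietly = TRUE) &&
    requireNamespace("sandwich", quietly = TRUE)) {
  print(lmtest::coeftest(model, vcov. = sandwich::vcovHC(model, type = "HC1")))
}
```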
7:54
So now we know that our data is heteroscedastic, so what remedy do we have? Remember, when
8:01
the data is heteroscedastic the coefficients are not biased, but our standard errors are either
8:09
overestimated or underestimated. Let me take a screenshot of
8:14
this test. I have taken the screenshot so that we can compare it with the test that
8:20
we are just going to perform. The remedy is that we use
8:26
robust standard errors instead of the normal standard errors that we get. For that we are
8:32
going to use this line of code. The two functions in this line of code come from the sandwich
8:38
library and the lmtest library: coeftest comes from lmtest, and
8:46
then this one comes from the sandwich library. If I execute this, it gives me
8:52
exactly the same coefficients. As you can compare them side by side, they are exactly the same
8:58
coefficients, but the standard errors are now different. So what it does is this:
9:08
the robust standard error, just as the name suggests, applies a different formula
9:16
and adjusts the standard errors to cater for that issue. So let's move to multicollinearity. Multicollinearity
9:25
is when our independent variables are correlated. For that, first we check the correlation between our
9:35
independent variables. I have included the dependent variable, but mainly we are interested in our
9:40
independent variables. We can see that weight and length have quite a high correlation. Anything
9:47
higher than 0.9 would be an indication of multicollinearity. It isn't a conclusion that
9:55
there is multicollinearity, but it is just an indication. For a more thorough understanding we
10:01
should look into the variance inflation factor. So what I do is use vif and specify the model,
10:09
and remember, again I have to specify the model, and we get these values of the variance inflation factor.
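The correlation check and the variance inflation factor can be sketched as follows. The data is the built-in `mtcars` stand-in (an assumption), so the numbers will not match the video. The VIF is computed by hand as 1 / (1 - R²), where R² comes from regressing each predictor on the other predictors; `car::vif()` is called only if car is installed. The names `predictors` and `vif_manual` are illustrative.

```r
# Stand-in model (assumption): mtcars replaces the video's auto data.
model <- lm(mpg ~ wt + hp + disp, data = mtcars)
predictors <- c("wt", "hp", "disp")

# Correlation matrix, including the dependent variable as in the video.
cor_matrix <- cor(mtcars[, c("mpg", predictors)])
print(round(cor_matrix, 2))

# VIF by hand: regress each predictor on the remaining predictors.
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = mtcars))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))

# The video's approach: pass the fitted model to vif() from car.
if (requireNamespace("car", quietly = TRUE)) {
  print(car::vif(model))
}
```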
10:17
So anything greater than 10 would indicate that there is multicollinearity. Some research
10:24
papers suggest that anything greater than 5 would be considered multicollinearity,
10:30
but anyway, in our case we can see that there is multicollinearity because the variance inflation
10:35
factor is greater than 10. There are multiple remedies, but one of
10:40
the remedies is to drop one of these multicollinear variables, because they carry the
10:47
same information. So what we do is drop one of them. In this case let's drop length and keep
10:53
weight, and let's execute the regression once again and see if the variance inflation factor issue is
11:02
resolved. Now you can see that I have executed the model again, and using the new model
11:09
I have executed the variance inflation factor, and I can see that
11:15
the variance inflation factor has now reduced. So I hope this video was useful. Do
11:22
subscribe to this channel to stay updated, and do hit the bell icon. Thanks for watching this video.
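The last remedy in the video, dropping one of two highly correlated predictors and rechecking the variance inflation factor, can be sketched as follows. The data is the built-in `mtcars` stand-in (an assumption): here `disp` is dropped and `wt` is kept, playing the roles of length and weight in the video. The helper `vif_for()` is hypothetical and written for this sketch only.

```r
# Hypothetical helper: VIF of each predictor, computed as 1 / (1 - R^2)
# from a regression of that predictor on the remaining predictors.
vif_for <- function(predictors, data) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Full model versus the model with one collinear predictor dropped.
full_predictors    <- c("wt", "hp", "disp")
reduced_predictors <- c("wt", "hp")   # disp dropped (stand-in for length)

vif_full    <- vif_for(full_predictors, mtcars)
vif_reduced <- vif_for(reduced_predictors, mtcars)
print(round(vif_full, 2))
print(round(vif_reduced, 2))

# Refit the regression with the reduced set of predictors.
model_reduced <- lm(mpg ~ wt + hp, data = mtcars)
print(summary(model_reduced)$coefficients)
```

Dropping a predictor cannot raise the VIF of the predictors that remain, so the reduced values are a quick check that the remedy worked.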