Regression Assumptions/Diagnostics in R

0:00
Welcome to the Data Hall YouTube channel. In this video we are going to talk about the
0:04
regression assumptions in R. We are going to discuss how to test these regression assumptions
0:11
and what remedies we can use when they are violated. First we are going to
0:18
install the packages. I have already installed them, so I'm not going to install them in this video,
0:23
but if you wish to install them you can use this command. We install the moments, lmtest, sandwich and car libraries. These are the
0:36
libraries that we are going to use for different purposes. For example, we are going to use
0:40
moments for skewness and kurtosis, then we are going to use lmtest for heteroscedasticity,
0:47
sandwich for robust standard errors to resolve the issue
0:54
of heteroscedasticity, and then we have the car library, which we are going to use for the variance
0:59
inflation factor. If you want to download the R script or the data that I'm going to use
1:05
with this video, you can download it from the link given in the description. So let's load the data.
1:12
We are going to use the auto data, so let me show you the auto data from the environment.
1:19
I have minimized the environment. We have 74 observations and 12 variables. Let me show you
1:25
the data. We have different cars: these are the make of the car, their model, their prices,
1:32
their mileage per gallon, the number of repairs, and then there is the weight of the car and
1:40
the length of the car. So what we are going to do is
1:47
use a regression model: we are going to
1:53
take price as the dependent variable and regress it on mileage, weight and length.
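The model just described can be sketched as follows. The video's auto data file is not reproduced here, so this sketch simulates a hypothetical stand-in called auto_sim with 74 rows and only the four columns the model needs; the column names (price, mpg, weight, length) and all simulated values are assumptions, not the video's data.

```r
# Hypothetical stand-in for the video's auto data (74 rows); every value is simulated.
set.seed(42)
n      <- 74
weight <- runif(n, 1800, 4800)                      # car weight
len    <- 120 + 0.025 * weight + rnorm(n, sd = 6)   # length, built to track weight closely
mpg    <- 40 - 0.006 * weight + rnorm(n, sd = 2.5)  # mileage per gallon
price  <- 3000 + 2 * weight - 50 * mpg + rexp(n, rate = 1 / 1500)  # right-skewed noise
auto_sim <- data.frame(price = price, mpg = mpg, weight = weight, length = len)

# Regress price on mileage, weight and length.
model <- lm(price ~ mpg + weight + length, data = auto_sim)

# Coefficients, standard errors, t-tests, R-squared and other fit statistics.
summary(model)
```

With the real data, only the data = argument would change.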
1:59
So first we are going to test the normality of the residuals. For that we are going to run
2:03
the regression using the lm function: specify the dependent variable, the tilde sign, and then the
2:10
independent variables, and then the data that I'm going to use. I'm going to use the auto data that
2:15
I have just loaded. Let me press Ctrl+Enter. The model has been executed, although I cannot
2:23
see the model in my console because I have to execute the second command, which is summary(model),
2:30
and we can see that we have the coefficients, their standard errors, their t-tests and the
2:36
R-squared and other model fit values. But what I am interested in is testing the normality of
2:44
the residuals, because one of the assumptions is that the residuals should be normally distributed.
2:52
So we are not going to test the normality of each of the variables; we are going to test the
2:56
normality of the residuals themselves. So what we are going to do is extract
3:02
the residuals from this model. We use the resid function and specify the model
3:09
that we have created. This would extract the residuals from the model,
3:17
and we would save them in this new object called residuals. If I execute this, it would create
3:24
a vector that would save these residuals. Now what we are going to do is
3:29
test the normality of these residuals. We first check the skewness and kurtosis;
3:38
these come from the moments library. So first we are going to execute the skewness function.
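The residual extraction step described above might look like the sketch below. It uses a simulated stand-in for the auto data (auto_sim, with hypothetical values), since the video's data file is not included here.

```r
# Hypothetical stand-in for the video's auto data (74 rows); every value is simulated.
set.seed(42)
n      <- 74
weight <- runif(n, 1800, 4800)
len    <- 120 + 0.025 * weight + rnorm(n, sd = 6)
mpg    <- 40 - 0.006 * weight + rnorm(n, sd = 2.5)
price  <- 3000 + 2 * weight - 50 * mpg + rexp(n, rate = 1 / 1500)
auto_sim <- data.frame(price = price, mpg = mpg, weight = weight, length = len)

model <- lm(price ~ mpg + weight + length, data = auto_sim)

# Extract the residuals into a plain numeric vector, one value per observation.
# The object is named "residuals" as in the video; calls to the base function
# residuals() still work, because R skips non-function objects in function lookup.
residuals <- resid(model)
head(residuals)
```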
3:46
Okay, so what happened is that we didn't load the library, so I'm just going to quickly load
3:51
all the libraries. Okay, let's get back and press Ctrl+Enter; we get the skewness of the residuals. Now,
4:01
ideally, for normally distributed data the skewness should be close to zero.
4:06
The further the skewness is from zero, the more skewed the distribution: a positive number indicates that it is a positively
4:14
skewed distribution and a negative number would indicate that it is a negatively skewed
4:20
distribution. So currently our shape is positively skewed. Let's check kurtosis. Kurtosis should be
4:29
three; this gives us the kurtosis rather than the excess kurtosis. If any test gives you excess
4:36
kurtosis, then its value should be zero, but in this case its value should
4:40
be close to three. Now we can see that it is somewhat close to three, right? It isn't exactly
4:47
three but it is somewhat close to three, so from that perspective let's move forward.
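The skewness and kurtosis checks described above can be sketched with the moments package. The data here is a simulated, hypothetical stand-in for the auto data, so the numbers it prints will not match the video.

```r
# install.packages("moments")  # run once if the package is not installed
library(moments)

# Hypothetical stand-in for the video's auto data (74 rows); every value is simulated.
set.seed(42)
n      <- 74
weight <- runif(n, 1800, 4800)
len    <- 120 + 0.025 * weight + rnorm(n, sd = 6)
mpg    <- 40 - 0.006 * weight + rnorm(n, sd = 2.5)
price  <- 3000 + 2 * weight - 50 * mpg + rexp(n, rate = 1 / 1500)
auto_sim <- data.frame(price = price, mpg = mpg, weight = weight, length = len)

model     <- lm(price ~ mpg + weight + length, data = auto_sim)
residuals <- resid(model)

skew <- skewness(residuals)  # close to 0 for a normal distribution
kurt <- kurtosis(residuals)  # plain kurtosis: close to 3 for a normal distribution
excess_kurt <- kurt - 3      # excess kurtosis: close to 0 for a normal distribution

c(skewness = skew, kurtosis = kurt, excess_kurtosis = excess_kurt)
```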
4:55
Now this gives us the skewness and kurtosis, but it isn't a test per se, right? So what we are going
5:01
to do is use a formal test, in this case the Shapiro-Wilk test. What we do is
5:07
use the shapiro.test function and apply it to the residuals that we just created.
5:15
What we are interested in is the p-value. If the p-value is less than 0.05 then we say that the
5:23
residuals are not normally distributed, because the null hypothesis of the Shapiro-Wilk test is
5:31
a normal distribution and the alternate hypothesis is that the series is not normally distributed.
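A minimal sketch of the Shapiro-Wilk step, again on a simulated, hypothetical stand-in for the auto data. The 0.05 cut-off is the one used in the video; the p-value printed here comes from the simulated data, not the video's.

```r
# Hypothetical stand-in for the video's auto data (74 rows); every value is simulated.
set.seed(42)
n      <- 74
weight <- runif(n, 1800, 4800)
len    <- 120 + 0.025 * weight + rnorm(n, sd = 6)
mpg    <- 40 - 0.006 * weight + rnorm(n, sd = 2.5)
price  <- 3000 + 2 * weight - 50 * mpg + rexp(n, rate = 1 / 1500)
auto_sim <- data.frame(price = price, mpg = mpg, weight = weight, length = len)

model     <- lm(price ~ mpg + weight + length, data = auto_sim)
residuals <- resid(model)

# H0: the residuals come from a normal distribution.
sw <- shapiro.test(residuals)
sw

# Decision rule from the video: reject normality when the p-value is below 0.05.
reject_normality <- sw$p.value < 0.05
reject_normality
```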
5:36
So in this case, because it is less than 0.05, we reject the null hypothesis in favor of the alternate
5:42
hypothesis, which is that the data is not normally distributed. So from the Shapiro-Wilk test we can see
5:50
that the residuals are not normally distributed, so this assumption is violated. Let's also look at
5:58
a visualization using a Q-Q plot. We call qqnorm and pass the residuals; this would create a Q-Q
6:09
plot, but we need a diagonal line, which comes from qqline. Now, the closer the values are to
6:19
this diagonal line, the closer the series is to a normally distributed series, but in this case
6:28
we can see that it deviates from that line, and so it is not normally distributed.
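The Q-Q plot described above can be drawn with base R alone. This sketch uses a simulated, hypothetical stand-in for the auto data, so the shape of the plot will differ from the one in the video.

```r
# Hypothetical stand-in for the video's auto data (74 rows); every value is simulated.
set.seed(42)
n      <- 74
weight <- runif(n, 1800, 4800)
len    <- 120 + 0.025 * weight + rnorm(n, sd = 6)
mpg    <- 40 - 0.006 * weight + rnorm(n, sd = 2.5)
price  <- 3000 + 2 * weight - 50 * mpg + rexp(n, rate = 1 / 1500)
auto_sim <- data.frame(price = price, mpg = mpg, weight = weight, length = len)

model     <- lm(price ~ mpg + weight + length, data = auto_sim)
residuals <- resid(model)

# Plot sample quantiles of the residuals against theoretical normal quantiles.
# qqnorm() also returns the plotted coordinates invisibly.
qq <- qqnorm(residuals, main = "Normal Q-Q plot of residuals")

# Add the reference line; points far from it suggest non-normal residuals.
qqline(residuals)
```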
6:36
Let's move on to heteroscedasticity and test for it. First we are going to do the Breusch-Pagan test. The null hypothesis of the Breusch-Pagan test is that
6:46
the variances are homoscedastic, whereas the alternate hypothesis
6:54
is that they are heteroscedastic. So we are going to call bptest and specify the model rather than the
7:02
residuals, right? In the previous assumption we were testing the normality of the residuals, so we
7:08
used the residuals, but in this case we are going to use the model that we have created.
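A sketch of the Breusch-Pagan call, using bptest from lmtest on a simulated, hypothetical stand-in for the auto data. Note that bptest is given the fitted model, not the residual vector. By default it computes the studentized (Koenker) version of the test; whether the video changed that default is not stated.

```r
# install.packages("lmtest")  # run once if the package is not installed
library(lmtest)

# Hypothetical stand-in for the video's auto data (74 rows); every value is simulated.
set.seed(42)
n      <- 74
weight <- runif(n, 1800, 4800)
len    <- 120 + 0.025 * weight + rnorm(n, sd = 6)
mpg    <- 40 - 0.006 * weight + rnorm(n, sd = 2.5)
price  <- 3000 + 2 * weight - 50 * mpg + rexp(n, rate = 1 / 1500)
auto_sim <- data.frame(price = price, mpg = mpg, weight = weight, length = len)

model <- lm(price ~ mpg + weight + length, data = auto_sim)

# H0: the error variance is constant (homoscedasticity).
bp <- bptest(model)
bp

# Decision rule from the video: a p-value below 0.05 points to heteroscedasticity.
bp$p.value < 0.05
```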
7:14
Okay, so let's press Ctrl+Enter and it would give us certain values. We are interested in the
7:20
p-value, and in this case it is less than 0.05, which means that the data is heteroscedastic:
7:28
we reject the null hypothesis of homoscedasticity in favor of the alternate
7:33
hypothesis of heteroscedasticity. Similarly we have the Breusch-Godfrey test, and the null and alternate hypotheses of the Breusch-Godfrey test are exactly the same
7:46
as those of the Breusch-Pagan test. Again, using the Godfrey test we reject the null hypothesis of homoscedasticity.
7:54
So now we know that our data is heteroscedastic, so what remedy do we have? Now remember, when
8:01
the data is heteroscedastic the coefficients are not biased, but our standard errors are either
8:09
overestimated or underestimated. So let me take a screenshot of
8:14
this test. I have taken the screenshot so that we can compare it with the test that
8:20
we are just going to perform. The remedy is that we use
8:26
robust standard errors instead of the normal standard errors that we get. For that we are
8:32
going to use this line of code. The two parts of this line of code come from the sandwich
8:38
library and the lmtest library together: this coefficient test is coming from lmtest and
8:46
this one is coming from the sandwich library. If I execute this, it would give me
8:52
exactly the same coefficients. As you can compare them side by side, they are exactly the same
8:58
coefficients, but the standard errors are now different, right? So what it would do is,
9:08
the robust error, just as the name suggests, would apply a different formula
9:16
and adjust the errors to cater for that issue. So let's move to multicollinearity. Multicollinearity
9:25
is when our independent variables are correlated. For that, first we check the correlation between our
9:35
independent variables. I have included the dependent variable, but mainly we are interested in our
9:40
independent variables. We can see that weight and length have quite a high correlation. Anything
9:47
higher than 0.9 would be an indication of multicollinearity. It isn't a conclusion that
9:55
there is multicollinearity, but it is just an indication. For a more thorough understanding we
10:01
should look into the variance inflation factor. So what I do is use vif and specify the model,
10:09
and remember, again I have to specify the model, and we get these values of the variance inflation factor.
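The correlation check and the variance inflation factor call described above can be sketched as follows. The data is a simulated, hypothetical stand-in for the auto data in which length is deliberately built to track weight, so the exact correlations and VIF values are not the video's.

```r
# install.packages("car")  # run once if the package is not installed
library(car)

# Hypothetical stand-in for the video's auto data (74 rows); every value is simulated.
set.seed(42)
n      <- 74
weight <- runif(n, 1800, 4800)
len    <- 120 + 0.025 * weight + rnorm(n, sd = 6)   # built to correlate strongly with weight
mpg    <- 40 - 0.006 * weight + rnorm(n, sd = 2.5)
price  <- 3000 + 2 * weight - 50 * mpg + rexp(n, rate = 1 / 1500)
auto_sim <- data.frame(price = price, mpg = mpg, weight = weight, length = len)

model <- lm(price ~ mpg + weight + length, data = auto_sim)

# Pairwise correlations; the dependent variable is included, as in the video,
# but the pairs of independent variables are the ones of interest.
cor_matrix <- cor(auto_sim[, c("price", "mpg", "weight", "length")])
round(cor_matrix, 2)

# Variance inflation factor for each predictor; vif() takes the fitted model.
vif_values <- vif(model)
vif_values
```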
10:17
So anything greater than 10 would indicate that there is multicollinearity. Some research
10:24
papers suggest that anything greater than 5 would be considered multicollinearity,
10:30
but anyway, in our case we can see that there is multicollinearity because the variance inflation
10:35
factor is greater than 10. There are multiple remedies, but one of
10:40
the remedies is to drop one of these multicollinear variables, because they carry the
10:47
same information. So we drop one of them; in this case let's drop length and keep
10:53
weight, and let's execute the regression once again and see if the variance inflation factor issue is
11:02
resolved. Now you can see that I have executed the model again, and using the new model
11:09
I have executed the variance inflation factor, and I can see that
11:15
the variance inflation factor has now reduced. So I hope this video was useful. Do
11:22
subscribe to this channel to stay updated, and do hit the bell icon. Thanks for watching this video.
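The two remedies discussed in this video, robust standard errors for heteroscedasticity and dropping a collinear predictor for multicollinearity, can be sketched together as below. The data is a simulated, hypothetical stand-in for the auto data. The video does not say which robust estimator it used, so the choice of vcovHC with type = "HC1" is an assumption.

```r
# install.packages(c("lmtest", "sandwich", "car"))  # run once if not installed
library(lmtest)
library(sandwich)
library(car)

# Hypothetical stand-in for the video's auto data (74 rows); every value is simulated.
set.seed(42)
n      <- 74
weight <- runif(n, 1800, 4800)
len    <- 120 + 0.025 * weight + rnorm(n, sd = 6)
mpg    <- 40 - 0.006 * weight + rnorm(n, sd = 2.5)
price  <- 3000 + 2 * weight - 50 * mpg + rexp(n, rate = 1 / 1500)
auto_sim <- data.frame(price = price, mpg = mpg, weight = weight, length = len)

model <- lm(price ~ mpg + weight + length, data = auto_sim)

# Remedy 1: keep the same coefficients but report heteroscedasticity-robust
# standard errors. coeftest() is from lmtest, vcovHC() is from sandwich.
# type = "HC1" is an assumed choice, not confirmed by the video.
robust_fit <- coeftest(model, vcov. = vcovHC(model, type = "HC1"))
robust_fit

# Remedy 2: drop length, keep weight, refit, and recompute the VIFs.
model_reduced <- lm(price ~ mpg + weight, data = auto_sim)
vif_full      <- vif(model)
vif_reduced   <- vif(model_reduced)
vif_full
vif_reduced
```

Comparing robust_fit with summary(model) side by side shows identical estimates and different standard errors, which is the comparison made in the video with a screenshot.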
