Correlation Analysis in R
May 16, 2024
In this video we discuss different types of correlation in R, e.g. Pearson correlation, Spearman correlation, and Kendall's Tau correlation. Then we discuss the correlations that we can apply to categorical variables, e.g. tetrachoric correlation and polychoric correlation. Lastly, we discuss listwise, casewise and pairwise correlation. 00:00 Intro to video 0:45 Pearson correlation 3:15 Spearman correlation 4:43 Kendall's Tau correlation 5:18 Tetrachoric correlation 6:20 Polychoric correlation 7:37 Listwise deletion 8:55 Pairwise correlation. Website: thedatahall.com
0:00
Welcome to the Data Hall YouTube channel
0:01
In this video, we are going to talk about how to test or apply correlation in R
0:08
There are different types of correlation. We are going to discuss Pearson correlation, Spearman correlation, and Kendall's Tau correlation
0:16
These are the correlations that we apply to numerical data,
0:22
continuous data. Then we have correlations for categorical variables, where we will
0:29
look at tetrachoric correlation and then polychoric correlation. We are also going
0:38
to discuss how to do listwise, casewise and pairwise deletion or correlation
0:44
So let's start with Pearson correlation. This is the most commonly applied correlation
0:50
We apply it to continuous data. First, let's load the data
0:57
We are going to use the iris data, which contains the length and width of different flowers and their species
1:08
So we are going to work with the length and width columns
1:17
When we do Pearson correlation, we use the cor function, so let's apply this correlation to all the data
1:26
We use the iris data frame. We do not restrict the rows
1:33
So all the rows are taken, along with columns one to four. So let's apply this to columns one to four
1:42
And we are going to apply the Pearson correlation. So if I apply this, we can see that we have got the correlation
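The call described here can be sketched roughly as follows. This is a minimal example that assumes the built-in iris data frame; the variable names numeric_cols and cor_matrix are only illustrative and are not taken from the video.

```r
# iris ships with base R, so no extra package is needed.
# Leave the row index empty to keep all rows, and take columns 1 to 4.
numeric_cols <- iris[, 1:4]

# Pearson correlation matrix for the four numeric columns
cor_matrix <- cor(numeric_cols, method = "pearson")
print(round(cor_matrix, 2))
```

The diagonal of the result is 1 because each variable is perfectly correlated with itself.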
1:49
This is the correlation between sepal length and sepal length. It is one because it is the correlation of the variable with itself
2:01
This is the correlation between sepal width and sepal length. And this is the correlation between petal length and sepal length
2:10
Any value greater than 0.7 would be considered a high correlation
2:16
Any value below 0.7 but above a lower cutoff would be considered a moderate correlation, and anything lower than that is considered a low correlation. Also, when we talk about correlation, there is no dependent or independent variable. This is just an association between two variables
2:34
We cannot say that petal length would have an impact on sepal length, or that sepal length
2:40
would have an impact on petal length. It is just an association between these two variables
2:46
So there is no dependent or independent variable. There is no causality. We are just looking at the correlations, the associations
2:53
Let's say we just wanted to do the correlation between two variables
2:58
So we would select the first variable and the second variable, and the rest of the command would be exactly the same
3:04
Apply this and we get that the correlation between sepal length and petal length is 0.87
3:10
This is what we got over here. That is a positive correlation
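A correlation between just two variables might look like the sketch below. It assumes the two columns are picked by name from iris; the video may have selected them by position instead, and the name pearson_r is illustrative.

```r
# Pearson correlation between two columns of the built-in iris data
pearson_r <- cor(iris$Sepal.Length, iris$Petal.Length, method = "pearson")
print(round(pearson_r, 2))  # about 0.87, matching the matrix entry
```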
3:15
Next, let's move to Spearman correlation. Spearman correlation is applied when we have
3:22
ordinal data or ranks in our data. Again, the only change that we make is that instead of the word pearson
3:31
we use the word spearman, and the rest of the command is exactly the same
3:38
So we get that the spearman correlation is 0.88 between these two variables
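As a rough sketch, the Spearman version only changes the method argument. The column choice and the name spearman_rho are assumptions made for illustration.

```r
# Spearman rank correlation: same call, different method
spearman_rho <- cor(iris$Sepal.Length, iris$Petal.Length, method = "spearman")
print(round(spearman_rho, 2))  # about 0.88
```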
3:43
But we can also get the significance test, the significance values, which tell us whether this correlation is statistically significant or not
3:50
and that is what we get using this line of code. The only difference is that we set exact equal to FALSE
3:57
The rest of the command is the same except for the function, because now instead of the cor function
4:03
we are using cor.test. And this is the p-value that we are interested in
4:07
If the p-value is less than 0.05, then we reject the null hypothesis
4:14
which states that there is no correlation, that is, the correlation between these two variables is equal to zero, and the alternative is that the
4:22
correlation is not equal to zero. So we reject the null hypothesis in favor of the alternative hypothesis
4:28
and that means that, statistically speaking, there is a significant positive
4:35
correlation between these two variables. So a p-value less than 0.05 suggests a significant correlation
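The significance test described above could be sketched like this, assuming the same two iris columns. The name spearman_test is illustrative. Setting exact = FALSE asks for an approximate p-value, which avoids the warning R gives when the data contain ties.

```r
# Spearman correlation test: returns the estimate and a p-value
spearman_test <- cor.test(iris$Sepal.Length, iris$Petal.Length,
                          method = "spearman", exact = FALSE)

print(spearman_test$estimate)  # the correlation (rho)
print(spearman_test$p.value)   # compare against 0.05

# Reject the null hypothesis of zero correlation if p < 0.05
is_significant <- spearman_test$p.value < 0.05
print(is_significant)
```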
4:42
Let's move to Kendall's Tau. Kendall's Tau is also used when we have ordinal data
4:49
The rest of the command is exactly the same. We just change the word to kendall and we get the correlation and the p-value also. So this correlation
5:02
test would give you the p-value as well as the correlation value
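A minimal sketch of the Kendall version, again assuming the sepal length and petal length columns of iris. Keeping exact = FALSE from the Spearman call is an assumption here, and the name kendall_test is illustrative.

```r
# Kendall's Tau: only the method string changes
kendall_test <- cor.test(iris$Sepal.Length, iris$Petal.Length,
                         method = "kendall", exact = FALSE)

print(kendall_test$estimate)  # the correlation (tau)
print(kendall_test$p.value)   # the p-value of the test
```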
5:07
So if you are interested in the test statistics, then just apply the cor.test
5:14
function. We can also see that it is statistically significant. So let's move to correlation
5:19
of categorical variables, where we are going to use the tetrachoric correlation. For that we are
5:26
going to load the psych library. I have already installed this library, but if you haven't
5:32
installed it, then you would install it using install.packages and specify the name of the library
5:38
that you are going to load. Press Control+Enter and it would be installed. But I have already
5:43
installed it, so I am just going to comment this out. So let's load this library. The library is loaded
5:49
We are just going to create a table, a matrix, that contains some data
5:58
So this is what the data looks like. We are going to apply the correlation to this data
6:04
Let's assume these are certain categories. We just use the tetrachoric function and then specify the data that we want to apply it on
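The matrix used in the video is not shown in the transcript, so the sketch below uses a hypothetical 2 x 2 table of counts for two binary variables. It assumes the psych package is installed; the names counts and tetra are illustrative.

```r
# install.packages("psych")  # uncomment if psych is not installed yet
library(psych)

# Hypothetical counts (not the data from the video):
# rows are the two levels of one binary variable,
# columns are the two levels of the other.
counts <- matrix(c(40, 10,
                   15, 35),
                 nrow = 2, byrow = TRUE)

# tetrachoric() accepts a 2 x 2 table of cell counts
tetra <- tetrachoric(counts)
print(tetra$rho)
```

The rho element of the result holds the estimated tetrachoric correlation.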
6:17
And this is the correlation that we get. Then we have polychoric correlation
6:25
This is applied when we have two or more ordinal categorical variables
6:31
So there should be some order in the categorical values, rather than them just being different categories
6:42
For example, gender is just categories. There is no order in that
6:46
So there we use the tetrachoric correlation. But if we look at customer satisfaction or service quality, then obviously a five means either more or less customer satisfaction
6:59
So there is an order in the data. So let's just create this data
7:05
I have created two vectors, customer satisfaction and service quality
7:09
And I want to look at the correlation between these two variables
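A polychoric correlation between two such ordinal vectors might be computed as sketched below. The ratings here are simulated and hypothetical, not the values from the video, and the sketch assumes the polychoric function from the psych package; the helper to_rating is made up for this example.

```r
# install.packages("psych")  # uncomment if psych is not installed yet
library(psych)

set.seed(42)
n <- 200

# Hypothetical helper: cut a continuous score into a 1-5 rating
to_rating <- function(z) {
  as.integer(cut(z, breaks = c(-Inf, -1, -0.3, 0.3, 1, Inf)))
}

# Two simulated ordinal variables that share a common latent score
latent <- rnorm(n)
customer_satisfaction <- to_rating(latent + rnorm(n, sd = 0.6))
service_quality <- to_rating(latent + rnorm(n, sd = 0.6))

# polychoric() takes a data frame (or matrix) of ordinal columns
ratings <- data.frame(customer_satisfaction, service_quality)
poly <- polychoric(ratings)
print(poly$rho)
```

Passing both vectors as columns of one data frame keeps the call the same whether the data started as vectors or as a data frame.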
7:13
So I use the polychoric function and then specify both the vectors, or columns if your data was in a data frame. Press Control+Enter and, okay, I haven't loaded the library. For that we are going to use the polychoric library
7:33
and press Control+Enter, and we get the correlation. Last, we are going to
7:39
discuss listwise and pairwise correlation. Listwise and casewise are one and
7:44
the same thing. Listwise means that we just apply
7:49
the correlation to the data where all the observations are available. So for example, if we
7:59
have four columns and data is missing in any of those columns, then that specific
8:06
row would be excluded if we use listwise or casewise. So let's generate some data. And let's
8:14
input. So we have customer satisfaction, product quality and sales performance. These
8:18
are randomly generated data. Let me show you this data. So these are certain numbers
8:26
It is numerical data, continuous data. And let me introduce some missing values, right
8:31
So we have introduced some missing values in our data set. So what I'm going to do is I'm going to apply the list-wise correlation
8:40
And for that, what I'm going to do is within this correlation function, I'm going to use the
8:46
use parameter and specify complete.obs. That means the correlation is applied only to rows where all the data is available
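Listwise deletion can be sketched as follows. The data are randomly generated and hypothetical: the means, standard deviations, and the positions of the missing values are assumptions, not the ones used in the video.

```r
set.seed(123)
n <- 100

# Hypothetical continuous data for the three variables
survey <- data.frame(
  customer_satisfaction = rnorm(n, mean = 70, sd = 10),
  product_quality       = rnorm(n, mean = 60, sd = 15),
  sales_performance     = rnorm(n, mean = 50, sd = 20)
)

# Introduce some missing values at arbitrary positions
survey$customer_satisfaction[c(3, 17, 42)] <- NA
survey$product_quality[c(8, 17, 60, 75)]   <- NA
survey$sales_performance[c(25, 42)]        <- NA

# Listwise (casewise) deletion: any row with a missing value is dropped
listwise_cor <- cor(survey, use = "complete.obs")
print(round(listwise_cor, 2))

# Number of rows that actually enter the calculation
n_complete <- sum(complete.cases(survey))
print(n_complete)
```

Every entry of the matrix is based on the same set of complete rows.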
8:55
If I were to do pairwise correlation, that means, for example,
9:01
when it is going to apply the correlation between customer satisfaction and product quality
9:05
then we just need to include the rows that contain the data for both product quality and
9:14
customer satisfaction. This is what we call pairwise. So for each pair we should have the data
9:19
So the number of observations that we would have for this specific correlation,
9:23
the correlation between product quality and customer satisfaction, would be different from the number of observations that we have for,
9:30
let's say, sales performance and customer satisfaction. So this is what we call pairwise deletion or pairwise correlation
9:40
We just use the word pairwise and the rest of the command is exactly the same
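Pairwise deletion might look like the sketch below. It uses hypothetical, randomly generated data with missing values placed at arbitrary positions, and it also counts how many rows each pair of variables uses; the names survey, observed, and pair_n are illustrative.

```r
set.seed(123)
n <- 100

# Hypothetical data with missing values (not the data from the video)
survey <- data.frame(
  customer_satisfaction = rnorm(n, mean = 70, sd = 10),
  product_quality       = rnorm(n, mean = 60, sd = 15),
  sales_performance     = rnorm(n, mean = 50, sd = 20)
)
survey$customer_satisfaction[c(3, 17, 42)] <- NA
survey$product_quality[c(8, 17, 60, 75)]   <- NA
survey$sales_performance[c(25, 42)]        <- NA

# Pairwise deletion: each pair uses every row where both values are present
pairwise_cor <- cor(survey, use = "pairwise.complete.obs")
print(round(pairwise_cor, 2))

# Number of rows available for each pair of variables
observed <- (!is.na(as.matrix(survey))) + 0  # 1 if present, 0 if missing
pair_n <- crossprod(observed)
print(pair_n)
```

The counts in pair_n differ from pair to pair, which is the difference from listwise deletion that the video describes.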
9:44
Let me press Control+Enter and we get the pairwise correlation. So I hope this was useful
9:51
Do subscribe to this channel and do hit the back