Generate Sample Data in R
7K views
May 16, 2024
We need sample data to work on some tasks in R, and we can generate it using different methods. 0:00 Manually Generate Values 0:41 Normal Distribution Data 1:56 Categorical sample data 2:25 Categorical and Continuous Sample data Website: thedatahall.com As an Amazon Associate, I earn from qualifying purchases.
0:00
Welcome to the DataHall YouTube channel.
0:02
In this video we are going to talk about how to generate sample data in R.
0:06
When we work on certain projects, or when we are learning something, we
0:11
need some data, and usually what we do is generate some random data.
0:16
One way of doing that is to generate the data manually. That means we use the data.frame function and specify all the values and
0:26
the variables ourselves. For example, we have different countries, years, and GDP growth data; we have
0:32
specified them manually, and that creates a data frame that we can work with.
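A minimal sketch of the manual approach just described. The values shown on screen are not in the transcript, so the country labels, years, growth figures, and the name `gdp_data` are all made-up placeholders:

```r
# Build a small data frame by typing every value by hand.
# All values below are illustrative placeholders, not real statistics.
gdp_data <- data.frame(
  country    = c("A", "A", "B", "B", "C", "C"),
  year       = c(2022, 2023, 2022, 2023, 2022, 2023),
  gdp_growth = c(2.1, 1.8, 3.4, 2.9, 0.7, 1.2)
)

# Inspect the result
print(gdp_data)
str(gdp_data)
```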
0:38
That is one way of doing it. The second is that, if we want to generate data from a normal distribution,
0:44
we can use the rnorm function and specify the number of values we want,
0:52
the mean of that series, and the standard deviation of that series.
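As a rough illustration of the three arguments mentioned here, the call might look like the following. The specific numbers (100 values, mean 50, standard deviation 10) are assumptions, since the transcript does not state which ones were used:

```r
# Draw 100 values from a normal distribution.
# n, mean and sd are assumed values chosen for illustration.
x <- rnorm(n = 100, mean = 50, sd = 10)

head(x)    # first few draws; these differ on every run
length(x)  # number of values generated
```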
0:56
But obviously, each time we run this,
1:01
it gives us a different set of data. Let me show you: this is how the data looks. If I generate it again
1:10
and show you the data again, it is totally different from what we previously generated.
1:17
Now, if you want your code to be reproducible in the future,
1:22
then what you need to do is set a seed. If I set the seed, execute this command, and show you the data, this is how it looks.
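A small sketch of setting a seed, using R's `set.seed` function. The seed value 123 and the variable names are arbitrary assumptions; the seed used in the video is not given in the transcript:

```r
# Fixing the seed makes the random draws repeatable.
# The seed value 123 is an arbitrary assumption.
set.seed(123)
first_draw <- rnorm(n = 5, mean = 0, sd = 1)

# Resetting the same seed restarts the random number stream.
set.seed(123)
second_draw <- rnorm(n = 5, mean = 0, sd = 1)

identical(first_draw, second_draw)  # TRUE: same seed, same values
```

Results should match across machines that run the same R version with the same random number generator settings; note that the default sampling method for `sample()` changed in R 3.6.0.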
1:30
If I set the seed again and execute the command, each time it gives the same data that we previously
1:37
had, because I have set a seed. If you run it on your own system using this exact seed, you would get the exact same data
1:45
at your end. We can have other kinds of distributions, not just the normal distribution: we can have the uniform
1:52
distribution, the binomial distribution, etc. We can work with them, but let's use the rep function. What the rep function
2:00
does is repeat certain values. For example, I have men, women, and children in my data set. I want to create a data set that
2:09
has these three categories, and for each category I want 30 observations.
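One way the three-category data set described here could be built with `rep`. Whether the video uses `each` or `times` is not stated in the transcript, so this sketch assumes `each = 30`; the object and column names are illustrative:

```r
# Repeat each category label 30 times: 90 observations in total.
# each = 30 groups the labels (30 Men, then 30 Women, then 30 Children).
group <- rep(c("Men", "Women", "Children"), each = 30)

group_data <- data.frame(group = group)
table(group_data$group)  # 30 rows per category
```

Using `times = 30` instead would cycle through the three labels 30 times, giving the same counts in a different row order.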
2:13
So I'm going to use the rep function. If I show you the data, this is the data we have:
2:21
30 observations each for men, women, and children. Lastly, we can use the sample function. The sample function is quite versatile; it can do
2:31
multiple things. I'm again going to set the seed first. Now, let's say I want to
2:36
generate data on certain individuals, say students. You would have their ID; in this case I have
2:42
just used a sequence to generate the ID. I want to use the sample function, but I also want their age.
2:47
I want to generate random data for the age, but what I want
2:54
is that the age should not have decimal digits and should fall within a specific range.
3:01
So what I do is call the sample function and specify the range, the number of observations I
3:07
want, and whether I want sampling with replacement or not. What that means is: if it generated a value,
3:14
say an individual with an age of 20, do you want another individual with the same age or
3:20
not? Obviously we can have multiple individuals with the same age, so that is why we have set replace
3:25
equal to TRUE. What other cases can there be? Let's say I want to generate a student ID.
3:31
Each student ID would be different, so let's just say it would be in the range of
3:37
1 to 500, and we want 500 observations, and in this case I want replace to be set to FALSE.
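The two uses of `replace` described above can be sketched side by side. The age range of 18 to 40 and the seed are assumptions; the 1 to 500 range and 500 observations for the student ID come from the transcript:

```r
set.seed(123)  # arbitrary seed, assumed for reproducibility

n <- 500

# Ages: whole numbers in an assumed range of 18 to 40.
# replace = TRUE lets the same age be drawn for several individuals.
age <- sample(18:40, size = n, replace = TRUE)

# Student IDs: each value from 1 to 500 is used exactly once.
# replace = FALSE means a value, once drawn, cannot be drawn again.
student_id <- sample(1:500, size = n, replace = FALSE)

anyDuplicated(age) > 0          # TRUE: 500 draws from 23 possible ages must repeat
anyDuplicated(student_id) == 0  # TRUE: every ID is unique
```

With `replace = FALSE`, `size` cannot exceed the number of values in the range, so asking for more than 500 IDs from `1:500` would raise an error.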
3:49
What that means is that if it has assigned a specific value to a single individual,
3:55
I do not want this value to be assigned to any other individual, so that's why I have set
4:01
replace equal to FALSE. Similarly, we can work with categorical data, as we did with the rep function.
4:08
For example, I want to generate a sample of male and female, and we want 500 observations. Again, we can
4:14
have multiple males and multiple females in our data. We can have more than two categories; for
4:20
example, we want to generate education data with four categories: an individual can have
4:27
a primary level of education, a secondary level of education, a bachelor's, or a master's.
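The categorical draws described here might look like this. The exact label spellings, variable names, and seed are assumptions:

```r
set.seed(123)  # arbitrary seed, assumed for reproducibility

# Two categories, 500 draws, repeats allowed.
gender <- sample(c("Male", "Female"), size = 500, replace = TRUE)

# Four categories work the same way.
education <- sample(
  c("Primary", "Secondary", "Bachelor's", "Master's"),
  size = 500, replace = TRUE
)

table(gender)     # counts per category
table(education)
```

By default every category is equally likely. The `prob` argument of `sample` accepts a vector of weights if unequal shares are wanted, although the video does not cover that.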
4:33
Again, multiple individuals can have the same kind of education. So if I execute this...
4:40
I made a mistake over here. If I execute this, we have the data, and this is how the data
4:48
looks. You can see that we have multiple individuals with age 33, but we would have only
4:54
one individual with each student ID. That's what the replace equal to FALSE argument does.
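Putting the pieces together, a data frame like the one shown at this point could be assembled roughly as follows. The column names, age range, and seed are assumptions, so the values will not match the video:

```r
set.seed(123)  # arbitrary seed, assumed for reproducibility
n <- 500

students <- data.frame(
  id         = seq_len(n),                        # sequence-based ID
  age        = sample(18:40, n, replace = TRUE),  # assumed age range
  student_id = sample(1:500, n, replace = FALSE), # unique IDs
  gender     = sample(c("Male", "Female"), n, replace = TRUE),
  education  = sample(c("Primary", "Secondary", "Bachelor's", "Master's"),
                      n, replace = TRUE)
)

head(students)
sum(students$age == 33)             # number of individuals aged 33
anyDuplicated(students$student_id)  # 0 means no student_id repeats
```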
5:02
So I hope this video was useful. Do subscribe to this channel, do hit the bell icon, and thanks for
5:07
watching this video.