In statistical modeling and data analysis, dummy variables are often used to represent categorical data. A dummy variable is a binary variable that takes the value 0 or 1. This article deals with creating dummy variables in R.
Let’s start by creating a data set of 100 students that holds their gender, area of residence, and scores in three subjects, using the following commands:
set.seed(123)
students <- data.frame(
  student_id = 1:100,
  gender = sample(c("Male", "Female"), 100, replace = TRUE),
  residence = sample(c("Urban", "Rural"), 100, replace = TRUE),
  math_score = rnorm(100, mean = 70, sd = 10),
  english_score = rnorm(100, mean = 75, sd = 8),
  science_score = rnorm(100, mean = 72, sd = 9)
)
These commands create a data frame with 100 observations of each variable, and set.seed(123) makes the random draws reproducible. The replace parameter is set to TRUE, so sampling is done with replacement: the same value can be selected more than once. This is required here, because 100 values are drawn from only two categories.
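To see what sampling with replacement changes, the following minimal sketch draws 100 values from the two gender labels with and without replacement. The object names with_repl and without_repl are made up for this illustration.

```r
set.seed(123)

# With replacement: drawing 100 values from a pool of 2 works,
# because each value can be picked again
with_repl <- sample(c("Male", "Female"), 100, replace = TRUE)
length(with_repl)   # number of values drawn
table(with_repl)    # counts per category

# Without replacement: only 2 values are available, so asking for 100 fails.
# tryCatch() captures the error message instead of stopping the script.
without_repl <- tryCatch(
  sample(c("Male", "Female"), 100, replace = FALSE),
  error = function(e) conditionMessage(e)
)
without_repl        # an error message rather than a sample
```

The second call fails because a pool of two values cannot supply 100 distinct draws.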
Creating dummy variables in R
One way to create dummy variables for categorical variables is to use the ifelse() function. This function is part of base R, so no package needs to be installed. The categorical variables in the above data set are gender and residence, and to create dummies for them, we use the following commands.
students$gender_dummy <- ifelse(students$gender == "Male", 1, 0)
students$residence_dummy <- ifelse(students$residence == "Urban", 1, 0)
These commands create two new variables named gender_dummy and residence_dummy. They assign the value 1 to Male in the gender variable and to Urban in the residence variable, and 0 otherwise.
In this example, gender_dummy will be 1 for Male and 0 for Female, and residence_dummy will be 1 for Urban and 0 for Rural. The ifelse() function checks the condition specified in the first argument of the command and assigns the corresponding values specified in the second and third arguments based on whether the condition is true or false. You can adjust the values assigned to the second and third arguments according to your preference or specific requirements for encoding the dummy variables.
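As a quick check of this encoding, the sketch below applies the same ifelse() pattern to a small toy data frame. The data frame toy, its values, and the column gender_dummy2 are hypothetical and exist only to make the result easy to verify by eye.

```r
# Toy data (hypothetical values, small enough to check by hand)
toy <- data.frame(
  gender    = c("Male", "Female", "Female", "Male"),
  residence = c("Urban", "Urban", "Rural", "Rural")
)

# Same pattern as above: 1 when the condition is TRUE, 0 otherwise
toy$gender_dummy    <- ifelse(toy$gender == "Male", 1, 0)
toy$residence_dummy <- ifelse(toy$residence == "Urban", 1, 0)

# Alternative: convert the logical comparison directly (TRUE -> 1, FALSE -> 0)
toy$gender_dummy2 <- as.integer(toy$gender == "Male")

toy$gender_dummy      # 1 0 0 1
toy$residence_dummy   # 1 1 0 0

# Cross-tabulate to confirm each category maps to exactly one code
table(toy$gender, toy$gender_dummy)
```

The cross-tabulation should show non-zero counts only where Female meets 0 and Male meets 1.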
The ifelse() function can also create a dummy variable from multiple conditions, combined with logical operators such as & (and) or | (or). For instance, in the given data set, we can flag the students who pass a threshold in two subjects at once:
students$result_binary <- ifelse(students$math_score > 70 & students$english_score > 75, 1, 0)
In this command, the result_binary column is assigned the value 1 if both conditions are met, i.e. if the math score is greater than 70 and the English score is greater than 75. Otherwise it is assigned the value 0.
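The effect of combining conditions can be sketched on a few hand-picked rows. The data frame scores, its values, and the column result_either are hypothetical and serve only as an illustration.

```r
# Hypothetical scores for four students
scores <- data.frame(
  math_score    = c(82, 65, 90, 71),
  english_score = c(80, 60, 70, 76)
)

# 1 only when BOTH conditions hold; & compares element by element
scores$result_binary <- ifelse(
  scores$math_score > 70 & scores$english_score > 75, 1, 0
)

# For comparison: 1 when AT LEAST ONE condition holds, using |
scores$result_either <- ifelse(
  scores$math_score > 70 | scores$english_score > 75, 1, 0
)

scores$result_binary   # 1 0 0 1
scores$result_either   # 1 0 1 1
```

The third student meets only the math condition, so the two columns differ in that row.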
Another way to create dummies without installing a package is the model.matrix() function, which belongs to the stats package that ships with R. The model.matrix() function is commonly used for creating design matrices in the context of linear modeling and regression analysis. To create dummies with this function, use the following command
dummy_vars <- model.matrix(~ gender + residence + math_score + english_score + science_score, data = students)
Let’s break down the above command to understand it better. The model.matrix() function creates a design (model) matrix from a formula, which is its first argument. The formula specifies the variables to include in the design matrix. The ~ symbol separates the dependent variable on its left from the independent variables on its right. Here the left side is empty, because model.matrix() only needs the independent variables: gender, residence, math_score, english_score, and science_score. By default, the matrix also contains an intercept column of 1s, named (Intercept).
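Thus, the purpose of using model.matrix() in this context is to create a matrix where the categorical variables (gender and residence in this case) are converted into dummy variables. With R's default settings, each two-level variable produces a single dummy column for its second level in alphabetical order, so the matrix contains the columns genderMale and residenceUrban, alongside (Intercept) and the three unchanged score columns.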
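A minimal sketch of this behavior, run on a hypothetical four-row data frame (toy, mm, and mm_no_int are names invented for this illustration), shows the column names that model.matrix() produces under R's default contrast settings.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  gender     = c("Male", "Female", "Female", "Male"),
  residence  = c("Urban", "Urban", "Rural", "Rural"),
  math_score = c(82, 65, 90, 71)
)

# One-sided formula: only independent variables are listed
mm <- model.matrix(~ gender + residence + math_score, data = toy)

colnames(mm)
# "(Intercept)" "genderMale" "residenceUrban" "math_score"

mm[, "genderMale"]   # 1 for Male, 0 for Female

# Drop the intercept column if only the dummies and numeric columns are needed
mm_no_int <- mm[, -1]
colnames(mm_no_int)
```

Female and Rural get no column of their own. They act as the reference categories, represented by 0 in genderMale and residenceUrban.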
Using fastDummies Package
Another way to create dummy variables is the dummy_cols() function. This function is part of the fastDummies package, so we first need to install and load the package by using the following commands
install.packages("fastDummies")
library(fastDummies)
Once the package has been installed and loaded, we can use the following command to create dummy variables for gender and residence.
dummies <- dummy_cols(students, select_columns = c("gender", "residence"), remove_selected_columns = TRUE)
In the above command, the dummy_cols() function comes from the fastDummies package, and it is used to create dummy variables (binary 0/1 variables) for categorical variables in a dataset. The select_columns argument specifies which columns in the students data frame should be transformed into dummy variables. In this case, these are the “gender” and “residence” columns.
The remove_selected_columns argument, when set to TRUE, indicates that the original columns specified in select_columns should be removed after the dummy variables have been created (use FALSE if you want to preserve the original columns). The dummy variables then replace the original categorical columns. Note that the result is stored in a new data frame named dummies, so the students data frame itself is left unchanged. By default, dummy_cols() creates one column per category, named after the original column and the category, so dummies contains gender_Female, gender_Male, residence_Rural, and residence_Urban in place of gender and residence.
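If you want to preserve the original columns, you can omit the remove_selected_columns argument, which defaults to FALSE. Note that the following command assigns the result back to students, so the dummy columns are added to that data frame.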
students <- dummy_cols(students, select_columns = c("gender", "residence"))
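The difference between the two calls can be sketched on a small hypothetical data frame. The names toy, kept, and replaced are invented for this illustration, and the requireNamespace() guard is an assumption added so that the code is skipped, rather than failing, when fastDummies is not installed.

```r
# Run only if fastDummies is installed (install.packages("fastDummies") otherwise)
if (requireNamespace("fastDummies", quietly = TRUE)) {
  # Toy data (hypothetical values)
  toy <- data.frame(
    gender    = c("Male", "Female", "Female", "Male"),
    residence = c("Urban", "Urban", "Rural", "Rural")
  )

  # Default: original columns are kept, one 0/1 column is added per category
  kept <- fastDummies::dummy_cols(
    toy,
    select_columns = c("gender", "residence")
  )
  print(names(kept))

  # remove_selected_columns = TRUE: only the dummy columns remain
  replaced <- fastDummies::dummy_cols(
    toy,
    select_columns = c("gender", "residence"),
    remove_selected_columns = TRUE
  )
  print(names(replaced))
}
```

With two categories per variable, kept should have six columns (the two originals plus four dummies) and replaced should have four.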
There is more than one way to create dummies for categorical variables in R, and the choice of method is up to the user. Dummy variables are then typically used in regression analysis or for other data analysis purposes.
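As an illustration of the regression use case, the sketch below fits a linear model on a hand-made dummy and compares it with a model where R builds the dummy itself from the categorical column. The data frame toy, its simulated scores, and the model names fit_dummy and fit_factor are assumptions made for this example.

```r
set.seed(123)

# Hypothetical data: 25 students per gender with simulated math scores
toy <- data.frame(
  gender     = rep(c("Male", "Female"), each = 25),
  math_score = rnorm(50, mean = 70, sd = 10)
)

# Manual dummy, as created earlier with ifelse()
toy$gender_dummy <- ifelse(toy$gender == "Male", 1, 0)

# Model 1: regression on the manual dummy
fit_dummy <- lm(math_score ~ gender_dummy, data = toy)

# Model 2: regression on the categorical column; lm() creates the dummy internally
fit_factor <- lm(math_score ~ gender, data = toy)

coef(fit_dummy)    # intercept and coefficient of gender_dummy
coef(fit_factor)   # intercept and coefficient of genderMale
```

Both models should report the same estimates, since the manual dummy and the internally created genderMale column encode the same information.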