In R, you can change the data type of variables in a data frame to prepare them for data and statistical analysis. Before looking at how to change data types, it is important to know the difference between a vector and a data frame. A vector holds values of a single data type, whereas a data frame arranges data in rows and columns to form a table, and each column can have its own type.
As this article focuses on changing data types in a data frame, we import a data set by using the following command.
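The difference between a vector and a data frame can be sketched with a few made-up values. The names ages, mixed and people below are hypothetical and are not part of the example data set.

```r
# A vector holds values of a single type
ages <- c(25L, 32L, 47L)
class(ages)              # "integer"

# Mixing types in one vector coerces every element to a common type
mixed <- c(25L, "thirty")
class(mixed)             # "character"

# A data frame (hypothetical values) keeps a separate type for each column
people <- data.frame(
  id   = 1:3,
  name = c("Ann", "Bob", "Cy"),
  wage = c(21.5, 30.0, 44.2),
  stringsAsFactors = FALSE   # keep text as character on older R versions too
)
sapply(people, class)    # one type per column
```

Here id is integer, name is character and wage is numeric, all within the same data frame.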
data <- read.csv("example_data.csv")
To follow along with the same example data, first download the example data set file and then load it in R; you can also use your own file.
Once the data set has been loaded with the above command, its type is data frame, because it holds observations in rows and columns.
Although it is visible that the data frame contains different types of observations, we can check its structure with the str() function. This function gives a concise summary of the data frame, including its columns, their data types and the first few values in each column. The command for checking the structure of the data frame is the following
str(data)
The output lists every variable with its data type next to its name. In the example data the variables have different types, including integer, character and numeric.
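As a minimal sketch of what str() reports, the hypothetical data frame demo below reuses some variable names from this article with made-up values; it is not the example data set.

```r
# Hypothetical data frame with integer, numeric and character columns
demo <- data.frame(
  ID    = 1:4,
  wage  = c(25L, 40L, NA, 31L),
  hours = c(38.5, 40, 20, 45),
  race  = c("white", "black", "other", "white"),
  stringsAsFactors = FALSE
)

# str() prints one line per column: name, abbreviated type, first values.
# The abbreviations are int (integer), num (numeric) and chr (character).
str(demo)
```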
In R, one can change the data type of a variable or column by using the functions whose names start with “as.”. So if we want to change the data type of one of the above variables, say wage, from integer to numeric, the following command should be used
data$wage <- as.numeric(data$wage)
The above command changes the data type of the wage variable from integer to numeric. Remember to write the name of the data frame followed by the dollar sign and the variable name on both sides of the assignment (data$wage), so that the converted values are stored back in the same column.
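The conversion can be tried on a small hypothetical data frame (demo, with made-up wages). The sketch also shows two situations worth knowing about when using as.numeric(): text that is not a number becomes NA, and a factor is converted to its internal codes unless it is first turned into character.

```r
# Hypothetical data frame with an integer wage column
demo <- data.frame(wage = c(25L, 40L, 31L))
class(demo$wage)                       # "integer"

demo$wage <- as.numeric(demo$wage)     # overwrite the column with the converted values
class(demo$wage)                       # "numeric"

# Numbers stored as text convert cleanly
from_text <- as.numeric(c("12.5", "7"))

# Text that is not a number becomes NA (R also issues a warning)
not_a_number <- suppressWarnings(as.numeric("n/a"))

# A factor converts to its internal integer codes, not to its labels
f <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)                # 1 2 1
values <- as.numeric(as.character(f))  # 10 20 10
```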
Change Data Type in R using transform() function
There is another way to change the data type of columns: the transform() function. transform() still relies on the “as.” functions for the conversion itself, but it lets us refer to columns by name without repeating the data frame name and the dollar sign, which many find less confusing. Suppose we want to change the data type of a variable, say race, from character to factor. Factors are commonly used in R to represent categorical variables. In race, the categories black, white and other are present. The following command changes the data type of race
data <- transform(data,race = as.factor(race))
This command stores each category of the race variable as an integer code. If we check the structure of the data again, the race variable appears as a factor, with the values 2, 3, 4 and 5 assigned to the categories of race. Assigning values to categories makes data analysis much easier.
Similarly, whether a certain variable has a given data type can be checked by using the following command
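How a factor pairs categories with integer codes can be seen on a short hypothetical vector (made-up values, not the example data):

```r
# Hypothetical character vector of categories
race <- c("white", "black", "other", "black")
race_f <- as.factor(race)

levels(race_f)       # "black" "other" "white"  (levels are sorted alphabetically)
as.integer(race_f)   # 3 1 2 1  (position of each value in the levels)
table(race_f)        # counts per category
```

The codes depend on the order of the levels, so the same label can receive a different code in another data set.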
is.factor(data$race)
Here, we checked whether the race variable is a factor or not. As race is now a factor variable, the output of this command is TRUE. The same command, if used for the hours or ID variable, gives FALSE, because those variables are integer or numeric.
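A few related checks, sketched on a hypothetical data frame (demo, with made-up values). Besides is.factor(), base R provides is.numeric(), is.character() and class(), and sapply() can report the type of every column at once.

```r
# Hypothetical data frame with one factor and two numeric-like columns
demo <- data.frame(
  ID    = 1:3,
  hours = c(38.5, 40, 20),
  race  = factor(c("white", "black", "other"))
)

is.factor(demo$race)     # TRUE
is.factor(demo$hours)    # FALSE
is.numeric(demo$hours)   # TRUE
is.numeric(demo$ID)      # TRUE: integer columns also count as numeric
class(demo$ID)           # "integer"

# Type of every column in one call
sapply(demo, class)
```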
Change Data type of variables in R using mutate()
Let’s explore another way to change the data type of variables in a data frame: the mutate() function. It comes from the dplyr package, which is installed and loaded as part of the tidyverse. The first step, thus, is to install and load the tidyverse by using the following commands
install.packages("tidyverse")
library(tidyverse)
Now, if we wish to change the data type of the married variable from character to factor, with single and married as its two categories, we can use the as.factor() function inside mutate().
data <- data %>% mutate(married = as.factor(married))
The variable has been converted from character to factor, and the factor has 3 categories. Although married has only two real categories, single and married, the third category (“”) comes from blank entries, that is, the missing values present in the variable. Checking the structure of the data again with str(data) confirms the change.
We are done with changing the data type of variables, but we can go a little further. For instance, suppose our analysis requires a logical variable that shows whether each wage exceeds a certain threshold, say 30. Before creating such a variable, we can first look at the result of the comparison itself. To check whether the wage is greater than 30, we use the following command
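If the blank category is not wanted, one option is to turn empty strings into NA before converting. The sketch below uses base R on a hypothetical vector (made-up values) rather than mutate(), so it runs without extra packages.

```r
# Hypothetical character vector with one blank entry
married <- c("single", "married", "", "married")

# Converting directly keeps the blank as a level of its own
with_blank <- as.factor(married)
nlevels(with_blank)            # 3

# Replace blanks with NA first, then convert
married[married == ""] <- NA
married_f <- as.factor(married)
levels(married_f)              # "married" "single"
sum(is.na(married_f))          # 1 missing value
```

When reading a file, read.csv() also accepts na.strings = c("", "NA"), which treats blank fields as missing from the start.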
data$wage > 30
The above command returns TRUE when the wage is greater than 30, FALSE when it is 30 or less, and NA where the wage is missing.
To save these results in a separate column, we use the following command, which creates a new column for the logical output.
data <- data %>% mutate(high_wage = wage>30)
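The same logical column can be sketched in base R on a hypothetical data frame (demo, with made-up wages), which also shows how TRUE, FALSE and NA behave when the results are summarised.

```r
# Hypothetical wages, including a missing value and a value exactly at the threshold
demo <- data.frame(wage = c(25, 40, NA, 30))

# Base R equivalent of mutate(high_wage = wage > 30)
demo$high_wage <- demo$wage > 30

class(demo$high_wage)                # "logical"
demo$high_wage                       # FALSE TRUE NA FALSE

# TRUE counts as 1 and FALSE as 0, so sum() counts the high wages
sum(demo$high_wage, na.rm = TRUE)    # 1
mean(demo$high_wage, na.rm = TRUE)   # share of high wages among non-missing rows
```

Note that a wage of exactly 30 gives FALSE, because the comparison is strictly greater than.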
Using the methods described above, we can change the data type of variables in R.