Data comes in various formats, and managing multiple data files can be a difficult task. Often, the data format we are working with is a mix of CSV, Excel, and text files, each containing valuable information. Combining these files into a single, structured dataset is important for effective data analysis. This article is based on explaining how we can merge multiple CSV, Excel and tab delimited files in R.
Combine Multiple CSV files
We can combine multiple CSV files in R using the tidyverse package, which is a collection of packages for data manipulation and visualization. Before you start, make sure you have the tidyverse package installed. If the package isn’t installed, install and load the package by using following commands
The next step is to set your working directory, for the set of files that you are going to use, which can be done from the command given below
setwd("E:/The Data hall Articles/example 1")
In the above command, we set the working directory in a sub-folder named example 1. This folder contains multiple CSV files. To combine all these CSV files into one, use the following command.
full_data <- list.files( pattern = "*.csv", full.names=T) %>% lapply(read_csv) %>% bind_rows()
In the above command, the list.files function lists all the files in the working directory and the pattern = “*.csv” argument in the list.files function specifies a pattern to match when listing files in a directory. In this case, it’s looking for files whose names contain “csv”. The asterisk (*) is a wildcard character that matches any sequence of characters, and “csv” is the specific code it is looking for.
As shown in the image below, other than CSV files, there is an R file imported too which contains csv in its name.
Thus, all files containing csv will be imported using the above command. The function
lapply is used to read each CSV file using the read_csv function. Finally, bind_rows combines the data frames into a single dataset using rows, which is then stored in the full_data variable.
In short, this command lists all CSV files in the specified directory, reads each CSV file and then combines the resulting data frames into a single dataset, which is stored in the full_data variable. This is a concise and efficient way to combine multiple CSV files into one dataset in R.
Once the command is run, the following data set is loaded in R, containing 26 observations of 2 variables.
The above command is about combining and importing files containing the word csv. However, it doesn’t specify the extension of the file that needs to be imported. Thus, to specify the extension of the file, which in this case is CSV, we use the following command
full_data <- list.files( pattern = "\\.csv$", full.names=T) %>% lapply(read_csv) %>% bind_rows()
This command is different from the above one, just in the pattern argument. In the pattern parameter, regular expression
\\.csv$ is designed to match file names with the specific extension “.csv”. In this expression, the double backslash
\\ is an escape character instead of wild card. Following the dot, the characters “csv” are specified to match the file extension. Finally, the dollar sign
$ asserts that the pattern must occur at the end of the file name.
Using this command, only files as CSV extensions are imported, as shown in image below.
Combine Files with Different Number of Columns
Now, if we want to combine CSV files with different number of columns, i.e. one file has two columns while the second one has three or more columns, we can use the similar command as above with slight changes. Remember to set the working directory for the files you want to be combined. In this case, our working directory will be set using the following command
setwd("E:/The Data hall Articles/example 2")
The command for combining CSV files with different number of columns will be
full_data2 <- list.files( pattern = "*.csv", full.names=T) %>% lapply(read_csv) %>% bind_rows()
The list.files( pattern = “*.csv”, part lists all files with the “csv” in the working directory. It uses the pattern argument to specify a regular expression that matches files having “csv” and combines CSV files with different columns into a single CSV file.
Thus, the result, stored in the full_data2 variable, is a single dataset that combines all the CSV files, even if they have different numbers of columns.
Combine Files with Different Column Type
If you are dealing with CSV files that have varying column data types, i.e. character data, numeric data etc. and you want to maintain consistency by reading all columns as character types, R can handle these inconsistencies gracefully. To import such kind of CSV files, use the following command
full_data2 <- list.files( pattern = "*.csv", full.names=T) %>% lapply(read_csv, col_types= cols(.default = "c")) %>%x bind_rows()
This command is different from other commands in such a way that by using this command, multiple CSV files from a directory with potentially different column types will be imported while ensuring that all columns are imported as character data types. The col_types argument sets the default column type to character while importing a csv file.
Combine Tab delimited files in R
One can also combine multiple tab-delimited files in R. Tab-delimited files are a type of data file where data values are separated or delimited by a tab To combine multiple tab-delimited files in R, you can use the
read.delim() function to read each file and then combine them into a single data frame. If we have multiple tab delimited files saved in example 4 folder, first we set the working directory by using the following command
setwd("E:/The Data hall Articles/example 4")
Once the directory is set, we use the following command to combine and import tab delimited files in R.
tab_delimit <- list.files( pattern = "*.txt", full.names=T) %>% lapply(read_delim) %>% bind_rows()
In the above command, list.files is used to list all files in working directory, and read.delim() is a function from the readr package used to read tab-delimited files (like comma-delimited, etc.) Similarly, bind_rows() is used to combine data frames row-wise (stack them on top of each other). The data will be stored in tab_delimit, as shown below
Combine Multiple Excel files in R
Just like CSV and txt files can be combined in R, we can also combine multiple Excel files to work with them in one go in R. To combine these files, first load the readxl library by using the following command
Now, again set the working directory to the folder in which your Excel files are saved. In this case, we would set our working directory using the following command
setwd("E:/The Data hall Articles/example 5")
Once the working directory directory is set, now we combine the Excel files present in the example 5 folder. To combine these files, we use the following command
excel_files <- list.files( pattern = "*.xlsx", full.names=T) %>% lapply(read_excel) %>% bind_rows()
This command, like the commands for CSV and tab delimited commands, combine the Excel files in R, using the read_excel function which reads the excel files and combines the data in row-wise using the bind_rows() function. The rest of the functions work in the same way as they did for CSV and tab delimited files.