Cleaning, Extraction and Modifying String Variables

In this article, we’ll explain how to clean, extract, and modify string variables in a variety of ways. In both programming and data analysis, string variables play a crucial role since they often contain textual information that must be manipulated and processed. Data consistency and accuracy can be improved by “cleaning” string variables by deleting unnecessary characters like spaces, punctuation, and other symbols. To extract anything from a string means to separate it into its component parts using some kind of criterion or pattern. The dataset we will be using in this article of one variable that is string variable “names” as shown below figure:

Picture1
Download Example File

We will generate a new variable that will be the duplicate of the variable “name”. All the commands will be run on one variable while keeping the other variable the same. This will allow you to keep the original data while monitoring the changes. The below command will be used for that purpose:

gen name2 = name
Picture2

In this article, we will change the variable “name” while keeping the variable “name2” as original observations.

Trim Function:

It’s common for problems to arise when downloading data from specific websites, such as the downloaded data may have trailing or leading space, and even multiple spaces that are often unnecessary and should be removed to guarantee clear and useable data. When a string ends with a blank space, it is said to have trailing blank spaces. Instead, the data begins with a blank space known as a leading blank space. To remove the space, we use the trim function.

replace name=rtrim(name)

The names with space at the end will be removed after running the below command.

Picture4
Picture3

Two observations in our dataset have been modified by eliminating trailing spaces. To remove the leading spaces the following command can be use:

replace name=ltrim(name)
Picture5

No observation in our dataset has leading space, so no change has been made. We have an alternative command that also remove any leading or trailing space in our dataset. 

replace name=trim(name)
Picture6

Spaces at the beginning and end of a value are eliminated so that only the content is left by removing all unnecessary spaces. Another command that can be used is also given below:

replace name=strtrim(name)
Picture7

This command is meant to tidy up the “name” variable by removing trailing or leading spaces. The command that uses to remove the consecutive internal space within a specific observation is given below:

replace name=itrim(name)
Picture8

You will notice that internal space within the last four observations has been removed. It is possible to encounter unicode characters that are neither numerals nor strings when copying information from PDFs or websites. These symbols can be represented graphically as emojis, special symbols, or other forms. So, to overcome that, below given commands can be used:

This command will remove any spaces or other whitespace characters from the start of the string values.

replace name=ustrltrim(name)

This command will remove any spaces or other characters from the end of the string values.

replace name=ustrrtrim(name)

This command will remove any spaces or other whitespace characters from the end or start of the string values.

replace name=ustrtrim(name)

In this command, “u” represents the Unicode.

Picture9

Extracting Specific Sub-Text

To extract the specific characters from the variable “name”, we can use the substr function. There are three parameters that we need to provide to the substr i.e. the name of the variable, the position of character from where to start extraction process and the number of characters to extract.

generate first=substr(name,3,5)

The generate is a command that will generate a new variable named “first”. This substr() function will specify the substring extraction operation. This function will pull out a substring beginning at the third character position and continuing until the fifth character position of each string value.

Picture10

To extract the word from the name, word function can be use as specified below:

generate get_word=word(name,1)

The generate is a command that will generate a new variable named “get_word”. The word() function will specify the variable in which we are interested, for example in our case we are interested in “name” and the position of the word we want to extract.

Picture11

The command will extract the first word of the name. Note: The Stata assumes that anything separate space is a word.

String Length

If we want to count the number of characters of a variable then, strlen function can be use as specify below:

generate length=strlen(name)
Picture12

generate command will generate the variable(length). strlen function will count the number of characters in the name.  For example, it says that Sirajuddin Cassim has 17 characters.

To count how many words are there in a name, the below command will be used:

generate word_count = wordcount(name)
Picture13

Its say that Sirajuddin cassim has 2 words.

Remove or Change part of String

We can also remove or change some parts of the string variable (name) using the function subinstr. Let’s explain it with the below command:

generate name11=subinstr(name,"ahmad","A.",.)

The parentheses of the subinstr include four arguments. The first argument is the name of the variable we are looking at. Second, the string ahmad that we want to replace. Third, the string (A.) with which we want to replace with Ahmed. Lastly, a “.” to change the Ahmad wherever it is occurring.

Picture14

The ahmad has been replaced with an A. Similarly, if want to make change in first occurrence then instead of dot we can use 1. The “a” in all the first occurrence will be replaced with the “aaaa”.

generate name12=subinstr(name,"a","aaaa",1)      
Picture15

Note: only small letter “a” will be change. Also if we want to change all the “a”  “a” into “aaaa” then replace 1 with dot.

generate name13=subinstr(name,"a","aaaa",.)
Picture16

Position of Specific Character

We can also find the position of the specific character or word in the individual’s name using the below command.

generate position=strpos(name,"Mr.")
Picture17

A value of 1 is assigned to the variable “position” if the name begins with “Mr,” and a value of 0 otherwise.

Change the Text Case

All the string values (names) can also be converted into the upper case using the command below in which strupper function has been used.

gen upper=strupper(name)
Picture18

To change the string values (names) into lowercase, use the below command in which strlower function has been used.

gen lower=strlower(name)
Picture19

To change the string values (names) into the proper case, use the below command in which the strproper function has been used. A proper case is one in which the word’s first letter will be capitalized.

gen proper=strproper(name)
Picture20

We can also replace the space within the names with the _ by using the below command. It basically change the string into the name.

gen string_to_name = strtoname(name)
Picture21

Match Words

The string value can also be matched using strmatch function. It checks whether there is a specific name or not. It gives 1 if a match is found and 0 otherwise.

gen match1 = strmatch(name, "Mr.")
Picture22

As there was no Mr. alone. There are names after Mr. therefore, we will use the below command that includes *.

gen match2 = strmatch(name, "Mr*")
Picture24

Value 1 was assigned to all string values (names) that began with “Mr.”

To split the name into different variables, use the below command.

Splitting

split name,generate(N)

split is the command. “name” is the variables and will generate variables (N1, N2, and N3).

Picture25

To separate the name after a specific character such as “.” Then it that case we use parse and write argument within the parentheses as shown in below command:

split name,generate(N) parse(.)
Picture26

We can also limit the split by assigning a limit like 2 in our case, as shown in the below command:


Picture27

Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x
Tweet
Share
Share
Pin