Using the Encode Command in Stata to Convert String Variables

We often have to work with categorical variables when doing statistical analysis. Categorical variables are variables that have two or more categories or groups. For example, a variable representing sectors in a stock market could have categories like tech, automobiles, pharmaceuticals, cement, banks etc. These categories are often stored as strings, but when they are needed as part of a regression, each category has to be represented by a number.

We can encode categorical string variables into numeric using the encode command in Stata. This means that each unique category of the variable will be assigned a numerical code. Value labels will also be assigned by Stata automatically to indicate what categories these numerical codes represent. This would then help us understand which numeric value refers to which category of the variable, while also allowing the categorical variable to be used in regressions.

To illustrate how it is done, we will use a dataset that has individual data on the variables shown in the figure below.

**Encode a Categorical Variable Using Stata’s Menu**

To encode a variable using Stata’s menu, the following steps can be followed:

Data > Create or change data > Other variable-transformation commands > Encode value labels from string variable

In the dialogue box that opens, select the string variable that you want to convert. Then enter the name you want to give to the new, encoded variable. You can also specify a value label name if needed.

We will explore the encoding of categorical string variables into numeric ones using the encode command.

**Convert Categorical String Variables to Numeric: The encode Command in Stata**

The general syntax of the encode command is:

encode string_variable, generate(new_numeric_variable)

Let’s encode the ‘age_grp’ variable and name the new numeric variable as ‘encode_age’.

encode age_grp, generate(encode_age)

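The new variable is generated. It stores numeric codes, which encode assigns in alphabetical order of the string values, and it displays the original categories as value labels.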
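To see what encode actually stored, it can help to look at the codes and the labels side by side. The following is a minimal sketch on a small made-up dataset (the age_grp values here are assumptions for illustration, not the article's data). Note that clear discards whatever data is in memory, so run it in a fresh session.

```stata
* Toy data; these age_grp values are made up for illustration
clear
input str10 age_grp
"18-29"
"30-44"
"45-64"
"30-44"
end

encode age_grp, generate(encode_age)

* The same variable shown with its labels, then with the underlying codes
tabulate encode_age
tabulate encode_age, nolabel

* encode stores the mapping in a value label named after the new variable
label list encode_age
```

Because no label() option was given, the value label takes the name of the new variable, which is why label list is called with encode_age.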

**Assign Custom Labels When Using encode command in Stata**

By default, the encode command assigns numeric values starting from 1. But when we want to run regressions with binary/dummy variables, it is important to have the variable coded with 0 and 1. Let’s see how Stata encodes ‘sex’, a binary variable.

encode sex, gen(gender)

codebook gender

The category for female is assigned the value of 1; and male, the value of 2. The empty values were coded as missing. We want the female category to be coded as 0, and the male category as 1.

In order to do that, let’s create a label called ‘sexlabel’ which will store these custom codes and labels.

label define sexlabel 0 "female" 1 "male"

This command won’t produce any output. It just defines the label and the value labels it contains in Stata’s memory. You can confirm that the label has been created through the Variables Manager, using the menu option Data > Variables Manager > Manage (next to the ‘Value label’ field), or by listing all the value labels in your dataset with the command:

lab list

Now, the encode command can be used with the option label() which is used to indicate the way each category is to be labelled when the variable is encoded.

drop gender

encode sex, generate(gender) label(sexlabel)

The variable is now coded correctly in the manner we wanted and as specified in the label called ‘sexlabel’.
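The whole 0/1 workflow can be checked end to end on a toy dataset. This is a sketch under assumptions: the five observations below are made up, and the assert lines are an addition of ours that simply stops the script if the coding is not what we expect. As before, clear wipes the data in memory.

```stata
* Toy data; one observation has an empty string for sex
clear
input str6 sex
"female"
"male"
"male"
""
"female"
end

* Define the custom codes first, then tell encode to use them
label define sexlabel 0 "female" 1 "male"
encode sex, generate(gender) label(sexlabel)

* Show the underlying codes, including the missing value
tabulate gender, nolabel missing

* Stop with an error if any observation is not coded as intended
assert gender == 0 if sex == "female"
assert gender == 1 if sex == "male"
assert missing(gender) if sex == ""
```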

Note also that value labels are case-sensitive. The string values in the ‘sex’ variable are all lower case, so if you were to label the values 0 and 1 as “Female” and “Male” instead of “female” and “male”, Stata would not match the label to the data (“female” != “Female”). Instead of assigning 0 to “female”, encode would by default add “female” and “male” to the label as new categories with their own codes.
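One way to guard against case mismatches is to normalise the strings before encoding. The sketch below assumes messy, made-up values for sex and uses the string functions lower() and trim() to clean them; the noextend option is added so that any value still not covered by the label causes an error rather than a silently added code.

```stata
* Toy data with inconsistent case and a leading blank
clear
input str8 sex
"Female"
"male"
"MALE"
" Male"
end

* Convert to lower case and strip leading/trailing blanks
replace sex = lower(trim(sex))

label define sexlabel 0 "female" 1 "male"

* noextend makes encode fail if a cleaned value is still not in the label
encode sex, generate(gender) label(sexlabel) noextend

tabulate gender, nolabel
```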

**Not Encoding if Value Labels Are Unspecified in Stata**

If we define a label for a variable but only add value labels for some categories while omitting others, we can tell Stata not to encode that variable. This is done using the option noextend. This option tells the encode command that if some values present in the variable are absent from the label defined for it, then the variable should not be encoded.

For example, for the ‘race’ variable, our data has three categories: White, Black and Hispanic.

tab race

Let’s define a label called ‘racelabel’ that has values for only two categories, and then use this label to encode the ‘race’ variable (with the incomplete list of value labels) while using the noextend option.

lab def racelabel 1 "Black" 2 "White"

encode race, generate(encode_race) label(racelabel) noextend

Because a value label for the Hispanic category was not defined in the label that we are trying to assign to ‘encode_race’, Stata does not encode the ‘race’ variable at all and reports an error instead. This is because we used the noextend option; without it, Stata would have added Hispanic to the label with a new code.
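In a do-file, the failed encode would normally halt execution. The sketch below, on made-up data, uses capture to trap the error and inspect the return code in _rc, then completes the label with label define's add option so that the same encode command succeeds. The code 3 for Hispanic is our own choice for illustration.

```stata
* Toy data with three categories
clear
input str8 race
"White"
"Black"
"Hispanic"
"White"
end

* Incomplete label: Hispanic is deliberately left out
label define racelabel 1 "Black" 2 "White"

* capture traps the error so the script keeps running; _rc holds the return code
capture encode race, generate(encode_race) label(racelabel) noextend
display "encode return code: " _rc

* A nonzero return code here means encode_race was never created
capture confirm variable encode_race
display "confirm return code: " _rc

* Complete the label, after which the same encode command goes through
label define racelabel 3 "Hispanic", add
encode race, generate(encode_race) label(racelabel) noextend
tabulate encode_race, nolabel
```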

**Reducing The Size of the Data Using encode command in Stata**

Data with a lot of string variables tends to be large in size as strings take up more storage than numeric data. If a dataset has repeating strings like a person’s name, company names, sector names etc., the data size can end up being very large. The data size can be reduced by encoding these variables and dropping the original strings.

As can be seen from the Data panel on the bottom right, the current size of the dataset is 1.22K. Let’s encode the ‘race’ variable correctly, and drop all the string variables.

encode race, gen(encode_race)

drop age race sex

The data size is now only 585 bytes, a reduction of just over 50%.
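The size comparison can also be made from the command line rather than read off the interface. The following sketch uses made-up data and relies on the results that describe stores in r(N) and r(width) to compute the bytes taken up by the observations. The compress step is our addition: encode creates a variable of type long, and compress stores it in the smallest type that fits. The exact byte counts will differ from the article's dataset.

```stata
* Toy data with a single string variable
clear
input str8 race
"White"
"Black"
"Hispanic"
"White"
end

* Bytes used by the data = observations x width of one observation
quietly describe
local before = r(N) * r(width)

encode race, generate(encode_race)
drop race

* Shrink the new long variable to the smallest storage type that fits
compress

quietly describe
local after = r(N) * r(width)

display "bytes of data before: `before'; after: `after'"
```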

An important thing to note about the encode command is that it should not be used to encode a variable that contains numbers which are stored as strings. When numbers are stored as strings, the command destring is used to turn the string variable into a numeric type.
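The difference is easy to see on a toy example. In this sketch (the income values are made up), encode numbers the strings in their sort order, so the codes have nothing to do with the amounts, while destring recovers the numbers themselves.

```stata
* Toy data: numbers stored as strings
clear
input str4 income
"250"
"90"
"1000"
end

* encode assigns codes by the sort order of the strings, not by numeric value
encode income, generate(income_code)

* destring converts the text into the numbers it represents
destring income, generate(income_num)

* Compare the underlying codes with the real values
list income income_num
tabulate income_code, nolabel
```

Here “1000” sorts before “250”, which sorts before “90”, so an income of 250 ends up with the code 2, which would be meaningless in any calculation.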

**Converting Numeric Data to String in Stata – The decode Command**

To go in the opposite direction, the decode command is used in the same manner as encode. It creates a new string variable that holds the value labels of a labelled numeric variable.

decode encode_age, generate(string_age)

We can also limit the length of the new string variable using the maxlength() option. The number of characters to keep from each value label is specified in the brackets.

Let’s drop the ‘string_age’ variable and decode ‘encode_age’ again with a maximum length of 2 characters to be retained.

drop string_age

decode encode_age, generate(string_age) maxlength(2)

As can be seen, only the first two characters of each value label of ‘encode_age’ make it to ‘string_age’ when we specify maxlength(2). This new variable may look like it is of numeric type because it only contains numbers, but it is a string variable: we generated it using decode, and Stata displays its values in red, the colour it uses for strings in the data browser.
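A round trip on a toy dataset makes the behaviour of decode and maxlength() concrete. This is a sketch on made-up age_grp values; the variable names string_full and string_short are hypothetical and chosen only to keep both results side by side.

```stata
* Toy data; these age_grp values are made up for illustration
clear
input str10 age_grp
"18-29"
"30-44"
"45-64"
end

encode age_grp, generate(encode_age)

* Full value labels back into a string variable
decode encode_age, generate(string_full)

* Only the first two characters of each value label
decode encode_age, generate(string_short) maxlength(2)

list age_grp string_full string_short

* describe shows that both new variables have a str storage type
describe string_full string_short
```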

