How to Generate Dummy Variables in Stata
Dummy variables are categorical variables that take on binary values of 0 or 1. For example, a dummy for gender might take a value of 1 for ‘Male’ observations and 0 for ‘Female’ observations. Coding string values (‘Male’, ‘Female’) in this manner allows us to use these variables in regression analysis with meaningful interpretations. In this post we will learn how to generate dummy variables in Stata. The idea behind generating dummy variables is well explained in Chapter 10 of A Gentle Introduction to Stata by Alan C. Acock.
In this article we use the 1978 Automobile dataset built into Stata. This data can be accessed through the command:
sysuse auto
Before beginning to work with a dataset, it is a good idea to examine the variable list, storage types and labels. The descriptive information for this dataset is displayed through:
describe
To begin our discussion of dummy variables, let’s look at the variable for repair records in 1978, rep78.
tabulate rep78, missing
The tabulate command above shows that the variable has five categories, coded numerically from 1 to 5. The missing option lets us see the number of missing values (.) in the variable as well. We will explore five methods of generating dummy variables in this article:
- Two-step method
- One-step method
- Dummy based on an inequality condition
- Dummies for multiple categories
- Dummies based on multiple conditions
Two-Step Method to Generate Dummy Variable in Stata:
generate rep2 = 1 if rep78==2
This command generates a new variable named ‘rep2’ which takes on the value of 1 only for observations where rep78 is equal to 2. Where rep78 equals 1, 3, 4 or 5, or is itself missing, rep2 will be populated with missing values (.).
replace rep2 = 0 if missing(rep2)
This command deals with the missing values generated in rep2: it sets rep2 to 0 in every observation where rep2 is missing.
However, this also means that rep2 takes on a value of 0 where rep78 itself is missing (.). This is an inaccuracy that needs to be addressed.
The (incomplete) command above illustrates the importance of being mindful of missing data in the relevant variables; otherwise data cleaning, variable creation and other data operations will be plagued with data entry and misspecification errors. We therefore use a modified command that accounts for missing values in rep78 as well. Run it instead of the previous replace command, right after generating rep2; once the missing values in rep2 have been overwritten with 0, it has nothing left to change.
replace rep2 = 0 if missing(rep2) & !missing(rep78)
This additional condition directs Stata to set rep2 to 0 only in observations where rep2 is missing and rep78 is not. !missing() means ‘not missing’, where ‘!’ is the operator for ‘not’.
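The two steps above can be put together and checked as follows. This is a minimal sketch, assuming a fresh Stata session with the built-in auto data available; the assert lines are an added sanity check and stop with an error if any observation violates the stated condition.

```stata
* Load the built-in 1978 Automobile data, replacing anything in memory
sysuse auto, clear

* Step 1: flag the category of interest; every other observation is missing
generate rep2 = 1 if rep78 == 2

* Step 2: fill in 0 only where rep78 itself is observed
replace rep2 = 0 if missing(rep2) & !missing(rep78)

* Cross-tabulate to inspect how rep2 lines up with rep78, missing values included
tabulate rep78 rep2, missing

* rep2 should be missing exactly where rep78 is missing
assert missing(rep2) == missing(rep78)

* Where rep78 is observed, rep2 should be 1 for category 2 and 0 otherwise
assert rep2 == (rep78 == 2) if !missing(rep78)
```

If both assert commands run silently, the dummy is coded as intended.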
Another Way of Generating Dummies:
There is another similar but slightly different approach to generating a dummy variable. Let’s generate a dummy, rep3, that takes a value of 1 when rep78 is equal to 3. This too involves two commands:
generate rep3 = 0 if !missing(rep78)
replace rep3 = 1 if rep78 == 3
In this case, we first generate rep3 which equals zero whenever rep78 does not have missing values. In the case of missing values in rep78, rep3 will also have a missing value. We then replace rep3 with 1 whenever rep78 takes on a value of 3.
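A quick way to confirm this behaviour is to build rep3 and then count the observations that fall into each case. The sketch below assumes the auto data and adds only inspection commands to the two commands already shown.

```stata
* Load the built-in 1978 Automobile data, replacing anything in memory
sysuse auto, clear

* Start from 0 wherever rep78 is observed; rep3 is missing elsewhere
generate rep3 = 0 if !missing(rep78)

* Switch the dummy on for the category of interest
replace rep3 = 1 if rep78 == 3

* The number of missing values in rep3 should match that in rep78
count if missing(rep78)
count if missing(rep3)

* Inspect the full mapping from rep78 to rep3
tabulate rep78 rep3, missing
```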
One-Step Method to Create Dummy Variable in Stata:
generate rep4 = rep78 == 4 if !missing(rep78)
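This command achieves the same kind of result that we obtained for rep2 and rep3, but within one command. The expression rep78 == 4 evaluates to 1 when it is true and 0 when it is false, and the if qualifier leaves rep4 missing wherever rep78 is missing.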
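To see that the one-step and two-step routes agree, both can be built for the same category and compared. In this sketch the name rep4_two is a hypothetical helper used only for the comparison.

```stata
* Load the built-in 1978 Automobile data, replacing anything in memory
sysuse auto, clear

* Two-step version for category 4 (rep4_two is an illustrative name)
generate rep4_two = 0 if !missing(rep78)
replace rep4_two = 1 if rep78 == 4

* One-step version: the logical expression itself returns 1 or 0
generate rep4 = rep78 == 4 if !missing(rep78)

* The two variables should match in every observation, including missing ones
assert rep4 == rep4_two
```

Since a dummy only ever holds 0, 1 or missing, writing generate byte rep4 = ... instead stores it in the smallest numeric type, which can be worthwhile in large datasets.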
Dummy Created Based on an Inequality Condition
In the above examples, we generated dummies based on equality conditions. Now, we wish to generate a dummy, repg, which takes on a value of 1 if rep78 is greater than or equal to 3.
generate repg = rep78>=3 if !missing(rep78)
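repg takes on a value of 1 when rep78 is greater than or equal to 3, 0 when rep78 is below 3, and a missing value when rep78 is missing. The if qualifier matters here because Stata treats missing values as larger than any number, so without it observations with a missing rep78 would be coded 1.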
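The effect of the if qualifier is easiest to see by building the dummy both ways. In this sketch, repg_naive is a hypothetical name for the version without the qualifier.

```stata
* Load the built-in 1978 Automobile data, replacing anything in memory
sysuse auto, clear

* Without the if qualifier: a missing rep78 satisfies rep78 >= 3, so it is coded 1
generate repg_naive = rep78 >= 3

* With the if qualifier: a missing rep78 stays missing in the dummy
generate repg = rep78 >= 3 if !missing(rep78)

* Show the observations where the two versions disagree
list make rep78 repg_naive repg if missing(rep78)

* Compare the two versions across the whole dataset
tabulate repg_naive repg, missing
```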
Dummies For Multiple Categories
We saw how rep78 had five categories with numeric values of 1 to 5. Instead of generating a dummy variable for each category individually, we can use the tabulate command with the option of gen to create five dummies at once.
tabulate rep78, generate(dummy)
This single command creates five dummy variables from the five categories of rep78. The dummies take on the names of ‘dummy1’, ‘dummy2’, ‘dummy3’, ‘dummy4’ and ‘dummy5’. ‘dummy1’ will equal 1 whenever rep78 equals 1, ‘dummy2’ will equal 1 whenever rep78 equals 2, and so on. The command also handles the issue of missing values we discussed before: all five dummies are missing wherever rep78 is missing.
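One way to check the generated set is to confirm that exactly one dummy is switched on for every observation with a recorded rep78. This is a sketch under the assumption that the five dummies sit next to each other in the dataset, which holds when they have just been created; ndummies is a hypothetical helper name.

```stata
* Load the built-in 1978 Automobile data, replacing anything in memory
sysuse auto, clear

* Create dummy1 to dummy5, one per category of rep78
tabulate rep78, generate(dummy)

* Inspect the names, storage types and labels of the new variables
describe dummy1-dummy5

* Row sum of the five dummies; rowtotal() treats missing values as 0
egen ndummies = rowtotal(dummy1-dummy5)

* Exactly one dummy should equal 1 wherever rep78 is observed
assert ndummies == 1 if !missing(rep78)

* The dummies should be missing wherever rep78 is missing
assert missing(dummy1) if missing(rep78)
```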
Dummy Created Based on Multiple Conditions
What if we wish to create a dummy variable that takes on the value of 1 whenever more than one condition is satisfied? To illustrate this, let’s bring the ‘price’ variable into our example.
We want to create a dummy (called ‘dummy’) which equals 1 if the price variable is less than or equal to 6000, and if rep78 is greater than or equal to 3. Both these conditions need to be met simultaneously. If, for example, price is less than or equal to 6000 but rep78 is not greater than or equal to 3, ‘dummy’ will take on a value of 0.
generate dummy = price<=6000 & rep78>=3 if !missing(price, rep78)
The command above makes use of the & (‘and’) operator and the conditional if qualifier to achieve that. It also addresses missing values in both the price and rep78 variables.
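The result can be verified by counting observations that are coded 1 even though they fail one of the two conditions; that count should be zero. The sketch below assumes the auto data and adds only checks to the command shown above.

```stata
* Load the built-in 1978 Automobile data, replacing anything in memory
sysuse auto, clear

* 1 only when both conditions hold; missing if either variable is missing
generate dummy = price <= 6000 & rep78 >= 3 if !missing(price, rep78)

* Inspect the distribution of the new dummy, missing values included
tabulate dummy, missing

* No observation coded 1 should violate either condition
count if dummy == 1 & (price > 6000 | rep78 < 3)
assert r(N) == 0

* The dummy should be missing wherever either input is missing
assert missing(dummy) if missing(price) | missing(rep78)
```

Replacing & with | (‘or’) in the generate command would instead code 1 when at least one of the two conditions holds.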