In this article, we will learn how to convert categorical variables into numerical variables (Also look at destring and encode).
Suppose we have a dataset on patients and the drugs they have been administered. If drugA is administered to a patient, the variable shows ‘yes’, if not, it shows a ‘no’. The following image depicts what the data looks like:
Download Example FileWe would like to code the ‘yes’ and ’no’ entries as 1 and 0 respectively. Additionally, we would also like to add a new variable which equals 1 if any of the three drugs were administered.
Coding the string data
In order to convert these strings to binary codes, we can individually replace each string for each drug variable as follows:
replace drugA="1" if drugA=="yes" replace drugA="0" if drugA=="no"
Note, that 0 and 1 are enclosed in inverted commas since the type of all three drug variables is a string for now. The command above will change the drugA variable as follows:
For a more efficient and quicker way of making the same replacement in the other two variables, we can use a foreach
loop. This is especially helpful and recommended if you have a larger number of variables to make changes to.
foreach var in drugB drugC { replace `var'="1" if `var'=="yes" replace `var'="0" if `var'=="no" }
The local 'var'
used in this example can be named however you would like. The local’s job is to simply store the list of variables specified at the start of the loop. The local will then be used inside the loop to refer to each variable in the list one by one. In this case, the loop will run first for ‘drugB’, and then all over again for ‘drugC’.
Running the code above will make changes to these two variable values as well:
Related Book: Data Management Using Stata Michael N. Mitchell
From String to Numeric
However, note that these values are stored in red color, indicating that they are string values rather than numeric ones. To rectify this, we use the destring
command:
destring, replace
This command destrings every string value and replaces them with the numeric counterpart. This will also be apparent by the black color of the stored data:
Generating a new variable
Now we would like to generate a binary variable called ‘anydrug’ that would take the value of 1 if any of the three drugs were administered i.e. if any of the three drug variables has a value of 1. The egen
command will be utilized to make this task quick:
egen anydrug=rowtotal(drugA-drugC)
The rowtotal
function used with the egen
command produces the sum of all values present in the rows specified in the brackets, which in this case were all the variables starting from ‘drugA’ to ‘drugC’. If one drug was administered, this variable would equal 1; if two were administered, it would equal 2, and so on. Obviously if no drug was given to the patient, the variable would equal 0. We would like the latter observations to have a value of 0 for the ‘anydrug’ variable, and 1 for any value greater than 0.
replace anydrug=1 if anydrug>0
This command replaces any value greater than zero with 1.
The rowtotal
functions ignore any missing values when adding variables. Note that a value for ‘drugC’ is missing for patient 8, but the command still returned a row total of 1. If we want Stata to take missing values into consideration and reflect that in the added output, we simply use the gen command.
gen anydrug2=drugA+drugB+drugC
This command will return a missing value for the generated variable for patient 8.
Another way to go about it…
There is another, perhaps a more easy and straightforward way to go about our task above. Assume that we no longer care about destringing or encoding the drug variables. We can use the gen command along with the OR operator indicated by the vertical bar |
to directly create the ‘anydrug’ variable.
gen anydrug3=1 if drugA=="yes" | drugB=="yes" | drugC=="yes"
This will simply generate ‘anydrug3’ to be equal to 1 if any of the variables from ‘drugA’ to ‘drugC’ are equal to ‘yes’. This approach eliminates the need to convert them to numeric type, add them, and use an if condition to replace them with 1. For any patient who did not recieve any of the drugs, this variable will have a missing value.
The following command will take care of these missing values and replace them with a 0.
replace anydrug3=0 if (drugA=="no" | drugB=="no" | drugC=="no")
This command replaces the variable with 0 if any of the drug variables equal “no”. Alternatively, this command could also be modified as:
replace anydrug3=0 if missing(anydrug3)
Any value that was missing after our initial gen
anydrug3 command will be replaced with a 0.