When regressing an outcome variable on covariates that include categorical variables, there are several issues one needs to be mindful of. Some common problems associated with regressions involving categorical variables are:
Dummy Variable Trap
Usually, when using categorical variables in a regression model, we tend to use dummy variables to represent the different categories. For example, for a categorical variable called ‘female’, 0 could represent males, while 1 could represent females in a dataset.
There can also be more than two categories for a variable. For example, to denote a person’s education level, there can be four variables called ‘primary’, ‘secondary’, ‘undergraduate’, and ‘postgraduate’. One of these would equal 1 to denote a person’s highest level of education while the other three would be equal to zero.
The dummy variable trap arises when dummy variables for all categories are added to a regression that includes a constant. For example, one could add all four education dummies to a regression model hoping to see the effect of education on an outcome variable. But this leads to perfect multicollinearity, whereby the dummy variables are exactly linearly dependent together with the constant term. In the case of the four education variables, the sum of all four will always equal 1 (since exactly one of them is 1 while the others are 0), which is identical to the constant.
With exact dependence, OLS cannot produce unique coefficient estimates at all; anything close to it leads to unstable estimates and inflated standard errors.
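A quick sketch in pure Python with made-up data shows the dependence: for every observation, the four education dummies sum to exactly 1, which is the intercept column itself.

```python
# Illustrative sketch with hypothetical data: one-hot encoding all four
# education categories makes the dummy columns sum to the intercept column.
levels = ["primary", "secondary", "undergraduate", "postgraduate"]
people = ["secondary", "primary", "postgraduate", "undergraduate", "secondary"]

# One dummy column per category: 1 if the person is in that category, else 0.
dummies = {lvl: [1 if p == lvl else 0 for p in people] for lvl in levels}

# The intercept column of the design matrix is all ones.
intercept = [1] * len(people)

# Each row's dummies sum to 1, so the sum of the four dummy columns
# equals the intercept column exactly: perfect multicollinearity.
row_sums = [sum(dummies[lvl][i] for lvl in levels) for i in range(len(people))]
print(row_sums == intercept)  # True: exact linear dependence
```

Because the five columns (intercept plus four dummies) are linearly dependent, the design matrix is rank-deficient and OLS has no unique solution.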
To avoid the dummy variable trap, simply drop one category from your regression. The interpretation of coefficients will then be done in comparison to the dropped category (also known as the base category).
When all the categories are recorded in a single variable and entered using factor-variable notation (e.g. i.education), Stata automatically drops a base category (by default, the category with the lowest value) in every regression with a binary/categorical variable as well.
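With a single categorical regressor, dropping a base category has a clean interpretation that can be sketched with fabricated wage data: the constant equals the base category's mean outcome, and each dummy coefficient equals its group's mean minus the base mean.

```python
# Sketch with hypothetical wage data (thousands per year). For a regression
# of wages on education dummies with 'primary' dropped as the base category,
# OLS reproduces group means: constant = base mean, coefficients = differences.
wages = {
    "primary":       [20, 22],
    "secondary":     [25, 27],
    "undergraduate": [35, 37],
    "postgraduate":  [45, 47],
}

def mean(xs):
    return sum(xs) / len(xs)

base = "primary"  # the dropped (base) category

constant = mean(wages[base])  # predicted wage for the base category
coeffs = {lvl: mean(xs) - constant for lvl, xs in wages.items() if lvl != base}

print(constant)                # 21.0: base-category prediction
print(coeffs["postgraduate"])  # 25.0: postgraduate premium over primary
```

Each coefficient is read as a comparison with the dropped category, not as a level of the outcome itself.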
Even if you do avoid the dummy variable trap, the problem of multicollinearity can still plague your analysis if you’re adding multiple categorical variables or interacting them with other variables. This usually happens when independent variables are highly correlated, making it difficult to interpret individual variable effects and conduct accurate hypothesis testing because of inflated standard errors.
This is because OLS cannot return estimates for the effect of ‘x1’ on ‘y’ while holding ‘x2’ constant when ‘x2’ also changes whenever ‘x1’ changes.
For example, it makes little sense to include both a person’s BMI and weight (BMI = weight/height²) in the same regression. If one’s weight increases while height stays fixed, so does their BMI. One of these redundant variables would have to be dropped.
If our independent variables have an exact linear relationship between them, the problem becomes that of perfect multicollinearity. For example, consider a scenario where we would like to see a person’s time allocation for various activities throughout the day and their effect on the person’s grades. Our covariates include ‘study’, ‘sleep’, ‘leisure’, ‘work’ and ‘other’. All of these variables denote the number of hours that a person spends on each of these activities in a day. Thus, the sum of these linearly related variables will always equal 24 i.e. adding all of them together in the regression will cause perfect multicollinearity. In this case, one variable will have to be dropped.
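The time-use example can be sketched in a few lines of Python with made-up data: since the five activity columns always sum to 24, any one of them (here ‘other’) is exactly determined by the rest, which is what perfect multicollinearity means.

```python
# Sketch with fabricated time-use data (hours per day). The five activity
# variables always sum to 24, so 'other' equals 24 minus the remaining four:
# an exact linear relationship, i.e. perfect multicollinearity.
rows = [
    {"study": 6, "sleep": 8, "leisure": 4, "work": 5, "other": 1},
    {"study": 2, "sleep": 7, "leisure": 6, "work": 8, "other": 1},
    {"study": 0, "sleep": 9, "leisure": 5, "work": 8, "other": 2},
]

for r in rows:
    assert sum(r.values()) == 24  # hours in a day
    # 'other' is fully determined by the other covariates:
    reconstructed = 24 - (r["study"] + r["sleep"] + r["leisure"] + r["work"])
    assert reconstructed == r["other"]

print("'other' is redundant; drop it (or any one activity) from the model")
```

Dropping any one of the five activities breaks the exact dependence, and the dropped activity becomes the implicit reference against which the others are interpreted.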
When interacting variables, we must be careful to make sure that a redundant category does not exist. For example, one may be inclined to have two variables for gender (‘male’, ‘female’) and two variables for marital status (‘married’, ‘unmarried’).
When interacting the two in a regression, it is enough to interact only one gender variable and one marital status variable. So, it will be valid to perform a regression of, say, ‘income’ on ‘male’, ‘married’ and ‘male’*‘married’. This interaction term will equal 1 for married males.
The base category will then be the combination of variables where both the interacted variables are equal to 0 i.e. unmarried females (the ‘income’ for this will be denoted by the constant). Income predictions for other combinations (married females and unmarried males) can be calculated using the constant, ‘male’, ‘married’ coefficients as needed.
Alternatively, after a regression using factor variables, one can simply use the margins male#married command to get income predictions for all categories.
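The arithmetic behind those predictions can be sketched with hypothetical coefficient values: the predicted income for each gender-by-marital-status cell is just the right sum of the constant, the main-effect coefficients, and the interaction.

```python
# Sketch with made-up OLS estimates from a regression of income on
# 'male', 'married' and their interaction. Each cell's prediction is the
# appropriate sum of coefficients (what `margins male#married` tabulates).
b0      = 30_000  # constant: unmarried female (base category)
b_male  = 5_000   # hypothetical coefficient on 'male'
b_marr  = 3_000   # hypothetical coefficient on 'married'
b_inter = 2_000   # hypothetical coefficient on 'male'*'married'

def predict(male, married):
    """Predicted income for a given gender/marital-status combination."""
    return b0 + b_male * male + b_marr * married + b_inter * male * married

print(predict(0, 0))  # 30000: unmarried female (constant only)
print(predict(1, 0))  # 35000: unmarried male
print(predict(0, 1))  # 33000: married female
print(predict(1, 1))  # 40000: married male (all four terms)
```

Note that the married-male prediction is not simply the constant plus the interaction: both main effects enter as well.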
The bottom line is: make sure your covariates are not too strongly correlated. This article goes into depth on how you can check for multicollinearity in Stata.
Choice of Base Category
Sometimes, your choice of a base category when working with categorical variables becomes important too. This is mainly for two reasons: interpretation and meaningfulness.
One should be careful when interpreting the coefficients for different categories against a base category. In a simple OLS regression, the constant is the predicted outcome when all other covariates equal zero, i.e. for the base category. Each coefficient then represents the difference in the dependent variable between its category and the base category, not the level of the outcome itself. To calculate a predicted value for a given category, the correct combination of coefficients must be added to the constant.
If you want to study the effect of one’s education level (a categorical variable) on income, it makes little sense to use the ‘high school’ category as the base category. This is because comparing the effect of an undergraduate degree or a master’s degree (the other two categories) with the effect of high school makes little sense. It might be a better idea to use the undergraduate category as the base so that it can be compared directly with the effect of a master’s education. (While it is possible to calculate and compare the effect sizes for all categories in Stata, it is quicker and more efficient to report the relevant ones in your regression.)
Thus, the choice for a base category must be meaningful so that correct comparisons can be made with respect to the coefficients of other categories.
Violation of the Linearity Assumption
Some other issues to be mindful of include the possibility of a non-linear relationship between the categorical variable and the outcome variable. For example, when studying the effect of age on earnings, a category for very young adults might not show a steep increase in earnings, but the earnings curve typically becomes steeper as a person gains experience with age. The curve then flattens again once one’s earning potential plateaus or they stop working altogether. A specification that forces this relationship to be linear will be misspecified, fitting poorly in some age ranges even while appearing to fit well in others.
Some categorical variables also do not have equal spacing between categories. A variable for education level might correspond to four years for high school, three or four years for an undergraduate degree, one to three years for a master’s degree, and four to eight years for a PhD. Entering such an ordinal variable as if it were a continuous, equally spaced scale (or, conversely, ignoring its ordering altogether) can sometimes result in incorrect conclusions.
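A small sketch makes the spacing point concrete. The durations below are one assumed value picked from each range mentioned above; the integer codes 1–4 that a regression would treat as equally spaced clearly are not.

```python
# Sketch: coding education levels as 1,2,3,4 imposes equal one-unit gaps,
# but the assumed years of schooling behind each level are unequal.
codes = {"high school": 1, "undergrad": 2, "masters": 3, "phd": 4}
years = {"high school": 4, "undergrad": 4, "masters": 2, "phd": 6}  # assumed durations

# Gaps between consecutive levels on each scale.
gaps_code  = [codes["undergrad"] - codes["high school"],
              codes["masters"] - codes["undergrad"],
              codes["phd"] - codes["masters"]]
gaps_years = [years["undergrad"], years["masters"], years["phd"]]

print(gaps_code)   # [1, 1, 1]: equal by construction
print(gaps_years)  # [4, 2, 6]: plainly unequal
```

A model that enters the 1–4 code as a continuous regressor implicitly assumes the effect of moving from high school to undergraduate equals that of moving from a master’s to a PhD, which the unequal durations give no reason to believe.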
Missing Data for Categories
If data for some categories are missing, this can badly bias our analysis and results. For example, individuals belonging to a higher social class or tax bracket might not report their salaries, causing missing values in the data that are not random. Regression results might be overestimated or underestimated in such cases.
Not only does this make a sample unrepresentative and difficult to replicate, but the resulting reduction in sample size can also reduce statistical power and lead to incorrect inferences and hypothesis tests.
Missing values can sometimes lead to regressions showing spurious correlations between variables, purely because of the pattern of missingness, even though such relationships don’t exist. (Remember not to make matters worse by replacing a missing value with a 0. It makes no sense for experienced individuals with a high level of education to be earning $0 a year.)
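The direction of this bias can be sketched with fabricated salary data: if it is the high earners who fail to report, the mean computed from complete cases understates the truth, and zero-filling the gaps makes it worse still.

```python
# Sketch with fabricated salaries (thousands per year). The top earners
# are the ones with missing reports, so missingness is not random.
salaries = [30, 40, 50, 60, 200, 250]                   # true values
reported = [s if s < 100 else None for s in salaries]   # high earners missing

true_mean = sum(salaries) / len(salaries)

# Complete-case analysis: drop the missing observations.
observed = [s for s in reported if s is not None]
cc_mean = sum(observed) / len(observed)

# Zero-filling the missing values (the mistake warned against above).
zero_mean = sum(s if s is not None else 0 for s in reported) / len(reported)

print(true_mean)  # 105.0
print(cc_mean)    # 45.0: understates the true mean
print(zero_mean)  # 30.0: zero-filling biases it even further
```

The same logic carries over to regression coefficients: estimates computed from the non-missing subsample describe a different, unrepresentative population.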