In the earlier parts of this series (part 1, part 2, and part 3), we saw how certain variables can change the outcomes of our regression model. We explored this from the basics (intercept dummies) up to two-way interactions. Likewise, three different variables (continuous or categorical) can simultaneously affect our dependent (outcome) variable. This last part of the series explains the case where two categorical or continuous variables act as moderating variables. In the literature, this is often called a three-way interaction.
We start the last part by executing the following commands, which install and load the required libraries:
install.packages(c("ggplot2", "viridis", "cowplot"))
library(ggplot2)
library(viridis)
library(cowplot)
To load the data, execute the following commands again:
set.seed(123)
data <- data.frame(
  id = 10001:10500,
  age = sample(20:45, 500, replace = TRUE),
  marks = sample(20:100, 500, replace = TRUE),
  salary = sample(1000:5000, 500, replace = TRUE),
  gender = sample(c('Male', 'Female'), 500, replace = TRUE),
  education = sample(8:18, 500, replace = TRUE)
)
An important thing to note in the above command is that we have generated a sample of 500 data points for the [education] variable, drawn from a range of 8 to 18 years of education. In previous articles, [education] was used as a categorical variable; this time, however, we have created it as a continuous variable.
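Since the rest of this part relies on [education] being numeric, it can help to confirm that before fitting anything. The following is a minimal sketch that recreates the same simulated data and inspects it; the checks themselves are an illustration and not part of the original workflow.

```r
# Recreate the simulated data from this article so the snippet runs on its own
set.seed(123)
data <- data.frame(
  id = 10001:10500,
  age = sample(20:45, 500, replace = TRUE),
  marks = sample(20:100, 500, replace = TRUE),
  salary = sample(1000:5000, 500, replace = TRUE),
  gender = sample(c("Male", "Female"), 500, replace = TRUE),
  education = sample(8:18, 500, replace = TRUE)
)

# education should be numeric here, so lm() estimates a single slope for it
is.numeric(data$education)
range(data$education)

# gender is categorical (character or factor, depending on the R version);
# lm() turns it into a dummy variable automatically
table(data$gender)
```

If `is.numeric()` returned FALSE, R would treat [education] as categorical and estimate one dummy per level instead of one slope.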
To see and familiarise yourself with the data, execute the following command:
View(data)
Categorical and Continuous Variables
We start by examining the three-way interaction between a categorical variable ([gender]) and two continuous variables ([education] and [age]). To fit the interaction model, execute the following commands:
three_way_model <- lm(salary ~ education * gender * age, data = data)
summary(three_way_model)
The last command above displays the results of the model. The command before it regresses [salary] on [education], [gender], and [age], together with all of their two-way and three-way interactions, so that each variable can moderate the effects of the others.
In the above figure, the first three coefficients (after the intercept) show the main effects, while the last four coefficients show the moderation [:] role. The key point of this article is the difference between two-way and three-way interactions: [education:genderMale:age] is the three-way interaction, while the remaining interaction terms are two-way interactions.
Note: we have used the shortcut, the [*] sign. Another way of doing this is to use the moderation sign [:] in the command. See part 3 for details.
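To make the note above concrete, here is a small sketch showing that the [*] shortcut expands to all main effects plus every [:] interaction. The object names `short_form` and `long_form` are only for illustration, and the data are recreated so the snippet runs on its own.

```r
# Recreate the simulated data from this article
set.seed(123)
data <- data.frame(
  id = 10001:10500,
  age = sample(20:45, 500, replace = TRUE),
  marks = sample(20:100, 500, replace = TRUE),
  salary = sample(1000:5000, 500, replace = TRUE),
  gender = sample(c("Male", "Female"), 500, replace = TRUE),
  education = sample(8:18, 500, replace = TRUE)
)

# Shortcut: the * operator
short_form <- lm(salary ~ education * gender * age, data = data)

# Written out: three main effects, three two-way terms, one three-way term
long_form <- lm(
  salary ~ education + gender + age +
    education:gender + education:age + gender:age +
    education:gender:age,
  data = data
)

# Both models contain the same terms and the same estimates
names(coef(short_form))
all.equal(coef(short_form), coef(long_form))
```

Writing the terms out by hand is mostly useful when you want to drop a specific interaction; otherwise the shortcut is less error-prone.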
Interpreting a dummy regression is often tricky because a reference category is dropped for each categorical variable (to avoid the dummy variable trap, i.e. perfect multicollinearity). Here, in the above figure, the interaction effect suggests that, compared to being [female], being [male] could increase the [salary], keeping other things constant.
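Because [gender] appears in four coefficients, the male versus female difference is not a single number: it depends on [age] and [education]. The sketch below adds up the relevant coefficients for one example person and checks the result against predict(). The helper `gender_gap` and the example values (12 years of education, age 30) are assumptions chosen for illustration only.

```r
# Recreate the simulated data and the model from this article
set.seed(123)
data <- data.frame(
  id = 10001:10500,
  age = sample(20:45, 500, replace = TRUE),
  marks = sample(20:100, 500, replace = TRUE),
  salary = sample(1000:5000, 500, replace = TRUE),
  gender = sample(c("Male", "Female"), 500, replace = TRUE),
  education = sample(8:18, 500, replace = TRUE)
)
three_way_model <- lm(salary ~ education * gender * age, data = data)
b <- coef(three_way_model)

# Hypothetical helper: predicted male minus female salary at given values.
# Every coefficient that involves genderMale contributes to the gap.
gender_gap <- function(education, age) {
  unname(
    b["genderMale"] +
      b["education:genderMale"] * education +
      b["genderMale:age"] * age +
      b["education:genderMale:age"] * education * age
  )
}

gap_manual <- gender_gap(education = 12, age = 30)

# Same quantity obtained from predict() as a cross-check
new_male <- data.frame(education = 12, gender = "Male", age = 30)
new_female <- data.frame(education = 12, gender = "Female", age = 30)
gap_predict <- unname(
  predict(three_way_model, newdata = new_male) -
    predict(three_way_model, newdata = new_female)
)

all.equal(gap_manual, gap_predict)
```

Changing the values passed to `gender_gap()` shows how the gap moves with age and education, which is exactly what the three-way term captures.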
Another, easier way to read these results is to plot the predictions of the regression model, which is what we focus on next.
We first define the [age] and [education] values at which salaries will be predicted:
age_intervals <- seq(min(data$age), max(data$age), length.out = 5)
education_intervals <- seq(min(data$education), max(data$education), length.out = 5)
When we have a large number of observations, plotting all of them at once makes the graph messy (as shown in part 3). So, using the above commands, we created five evenly spaced values for each of the [age] and [education] variables, running from their minimum to their maximum values.
To predict salaries for both genders at each age and education value, execute the following commands:
predicted_salaries <- NULL
for (age in age_intervals) {
  for (education in education_intervals) {
    predicted_salaries <- rbind(
      predicted_salaries,
      data.frame(age = age, education = education, gender = "Male",
                 predicted_salary = predict(three_way_model, newdata = data.frame(gender = "Male", age = age, education = education))),
      data.frame(age = age, education = education, gender = "Female",
                 predicted_salary = predict(three_way_model, newdata = data.frame(gender = "Female", age = age, education = education)))
    )
  }
}
The above commands predict salary based on [gender], [age], and [education] and their interactions. They build a dataset, [predicted_salaries], by predicting salaries for different combinations of [age], [education], and [gender]. For each combination of [age] and [education], [salary] is predicted separately for [males] and [females].
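The nested loop works, but the same table can be built in one step with expand.grid(), which this article also uses later for the continuous model. The following is a sketch of that alternative; the name `grid` is an assumption, and the rows may come out in a different order than the loop produces.

```r
# Recreate the simulated data and the model from this article
set.seed(123)
data <- data.frame(
  id = 10001:10500,
  age = sample(20:45, 500, replace = TRUE),
  marks = sample(20:100, 500, replace = TRUE),
  salary = sample(1000:5000, 500, replace = TRUE),
  gender = sample(c("Male", "Female"), 500, replace = TRUE),
  education = sample(8:18, 500, replace = TRUE)
)
three_way_model <- lm(salary ~ education * gender * age, data = data)

age_intervals <- seq(min(data$age), max(data$age), length.out = 5)
education_intervals <- seq(min(data$education), max(data$education), length.out = 5)

# One row per combination of age, education, and gender
grid <- expand.grid(
  age = age_intervals,
  education = education_intervals,
  gender = c("Male", "Female"),
  stringsAsFactors = FALSE
)

# predict() is vectorised, so all rows are predicted in a single call
grid$predicted_salary <- predict(three_way_model, newdata = grid)

nrow(grid)  # 5 ages x 5 education values x 2 genders
head(grid)
```

This avoids growing a data frame with rbind() inside a loop, which becomes slow when the number of combinations is large.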
To plot our results, execute the following command:
ggplot(predicted_salaries, aes(x = age, y = predicted_salary, color = factor(education))) +
  geom_line() +
  labs(title = "THREE-WAY-INTERACTION", x = "Age", y = "Predicted Salary", color = "Education") +
  facet_wrap(~gender, scales = "free")
This command plots our results using the [ggplot2] library. Here, the multiple lines represent different levels of [education] (as shown in the multicolour legend), and each panel shows one [gender]. Any label can be given to an axis, and any variable can be placed on either axis. A title for the plot can also be set in the same command. For instance, [x] represents the horizontal axis, which we labelled [Age].
The above figure shows that initially (at age 20), a person with [8 years of education] earns less on average than a person with [18 years of education], holding other factors constant. However, as [age] increases, the [salary] of a person with [8 years of education] rises faster than that of a person with [18 years of education].
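The "rises faster" reading of the plot can be checked numerically, because each line in the figure is straight and its slope is the predicted change in salary per extra year of age. The sketch below computes that slope for the lowest and highest education levels in each gender panel. The names `ends`, `slopes`, and `age_slope` are assumptions for illustration.

```r
# Recreate the simulated data and the model from this article
set.seed(123)
data <- data.frame(
  id = 10001:10500,
  age = sample(20:45, 500, replace = TRUE),
  marks = sample(20:100, 500, replace = TRUE),
  salary = sample(1000:5000, 500, replace = TRUE),
  gender = sample(c("Male", "Female"), 500, replace = TRUE),
  education = sample(8:18, 500, replace = TRUE)
)
three_way_model <- lm(salary ~ education * gender * age, data = data)

# Predict at the youngest and oldest age for each education/gender group
ends <- expand.grid(
  age = range(data$age),
  education = c(8, 18),
  gender = c("Male", "Female"),
  stringsAsFactors = FALSE
)
ends$pred <- predict(three_way_model, newdata = ends)

# Slope of each line = change in predicted salary / change in age
age_span <- diff(range(data$age))
slopes <- aggregate(
  pred ~ education + gender,
  data = ends,
  FUN = function(p) diff(p) / age_span
)
names(slopes)[3] <- "age_slope"

slopes
```

A larger `age_slope` means a steeper line in the plot, so comparing the rows for 8 and 18 years of education within one gender reproduces the comparison made above.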
So far, we have analysed the three-way interaction effect among continuous and categorical variables. However, multiple continuous variables can also impact our outcome variable, both individually and simultaneously. To explore this phenomenon, execute the following command:
cont_three_way_model <- lm(salary ~ marks * age * education, data = data)
To define the age, education, and marks intervals and predict salaries over them, execute the following commands:
age_intervals <- seq(min(data$age), max(data$age), length.out = 2)
education_intervals <- seq(min(data$education), max(data$education), length.out = 2)
marks_intervals <- seq(min(data$marks), max(data$marks), length.out = 2)
predicted_salaries <- expand.grid(age = age_intervals, education = education_intervals, marks = marks_intervals)
predicted_salaries$predicted_salary <- predict(cont_three_way_model, newdata = predicted_salaries)
predicted_salaries$combined <- paste(predicted_salaries$education, predicted_salaries$marks, sep = "_")
The above commands build every combination of the [age], [education], and [marks] values (here only the minimum and maximum of each variable) and store the predicted salaries in [predicted_salaries]. [predicted_salaries$combined] is then created as a unique identifier for each combination of education and marks.
To plot the results, execute the following command:
ggplot(predicted_salaries, aes(x = age, y = predicted_salary, color = combined)) +
  geom_line() +
  labs(title = "THREE-WAY-INTERACTION (Continuous Variables)", x = "Age", y = "Predicted Salary", color = "Education & Marks") +
  scale_color_manual(values = viridis_pal(option = "viridis")(length(unique(predicted_salaries$combined))))
Here are the results of the final command:
The above figure can be read as follows: as [Age] increases, the [salary] of a person with [18 years of education] and [100 marks] decreases, while the [salary] of a person with [8 years of education] and [100 marks] increases.
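The direction of each line in the figure follows from the coefficients: with three continuous variables, the effect of one extra year of [age] depends on both [education] and [marks]. Below is a minimal sketch that computes this age effect at the four education and marks combinations used in the plot. The helper `age_slope` and the table `corners` are assumptions for illustration.

```r
# Recreate the simulated data and the continuous model from this article
set.seed(123)
data <- data.frame(
  id = 10001:10500,
  age = sample(20:45, 500, replace = TRUE),
  marks = sample(20:100, 500, replace = TRUE),
  salary = sample(1000:5000, 500, replace = TRUE),
  gender = sample(c("Male", "Female"), 500, replace = TRUE),
  education = sample(8:18, 500, replace = TRUE)
)
cont_three_way_model <- lm(salary ~ marks * age * education, data = data)
b <- coef(cont_three_way_model)

# Hypothetical helper: change in predicted salary per extra year of age.
# Every coefficient that involves age contributes to this slope.
age_slope <- function(marks, education) {
  unname(
    b["age"] +
      b["marks:age"] * marks +
      b["age:education"] * education +
      b["marks:age:education"] * marks * education
  )
}

# The four lines in the plot: min/max education crossed with min/max marks
corners <- expand.grid(
  education = range(data$education),
  marks = range(data$marks)
)
corners$age_slope <- age_slope(corners$marks, corners$education)

corners
```

A negative `age_slope` corresponds to a downward-sloping line in the figure and a positive one to an upward-sloping line.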
In this comprehensive series, we assessed the impact of categorical and continuous variables on predicting salary levels. Furthermore, we explored how the relationship between variables can be influenced by different interactions, presenting both graphical and tabular representations of our models. In this part, we have seen how three variables collectively play a moderating role in our model.
Thanks for always visiting thedatahall.com. Stay tuned for more insightful tutorials.