A Comprehensive Guide to Panel Data Regression in R

Panel data regression, also known as longitudinal or cross-sectional time-series analysis, is a powerful statistical method for examining how independent variables affect a dependent variable across both time and cross-sectional units (e.g., individuals or firms). For more details about panel data, see Theory of Panel Data.

In this tutorial, we will walk through the process of performing panel regression in RStudio, going beyond the basics: from loading and preparing data for analysis to running different types of panel regression models using both manual and built-in methods.

Prerequisites

Before we get started, make sure all the required libraries are installed and loaded in RStudio.

We can do this by using the following commands:

install.packages(c("plm", "lmtest", "tidyverse", "stargazer"))

If we need to install a single package, we can modify the above command like this:

install.packages("plm")

Library Usage:

Here’s what each library is used for:

  • plm: For estimating panel regression models.
  • lmtest: For conducting statistical tests.
  • stargazer: For creating formatted regression tables.
  • tidyverse: a collection of packages for data manipulation, formatting, and more.

Loading and Preparing Your Data

After installing the libraries, the next step is to load them using the following commands:

library(plm)
library(lmtest)
library(stargazer)
library(tidyverse)

Next, load the data. For this illustration, we will use a hypothetical dataset named ‘nlswork.csv’, which contains information on wages, hours worked, and individual characteristics over several years. For importing the data, we can use the following command (more details about importing and exporting a data set in R can be found here)

data <- read.csv("nlswork.csv")

This command loads our data from ‘nlswork.csv’ and names it [data]. We can create new variables and store results in them or rename variables using the assignment operator [<-].
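As a quick, self-contained illustration of the assignment operator (using a toy data frame with hypothetical values, not the nlswork data), the following sketch creates an object, derives a new variable, and renames a column:

```r
# Toy data frame (hypothetical values), only to demonstrate [<-]
df <- data.frame(wage = c(10, 20), hours = c(35, 40))

# Create a new variable and store it in the data frame
df$ln_wage <- log(df$wage)

# Rename an existing column using the assignment operator
names(df)[names(df) == "hours"] <- "hours_worked"
```

The same patterns apply to the [data] object we just loaded.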

In RStudio, we can view/explore our data in the [Environment] tab, located in the upper-right pane (marked by a red arrow in the figure). To view our data and get familiar with it, use this command:

View(data)

A snapshot of our data is provided here


Now that our data is loaded, let’s move on to panel data regression.

Running Panel Regression Models

Panel data regression can be performed in several ways. We will start with the simplest one, the pooled OLS model.

Pooled OLS Model

The Pooled OLS model assumes a common constant for all observations. It’s the simplest panel regression model and assumes that all individuals share the same relationship between the dependent variable and the independent variables.

We can run it using the [lm] function. The command for this model is:

model <- lm(ln_wage ~ hours + ttl_exp, data = data)

The lm function is used for linear regression. Here, [model] is the name of the object that stores the results of our model, [ln_wage] is the dependent variable, and [hours] and [ttl_exp] are the independent variables. As stated, [data] is the name of the data we loaded. This command simply regresses log wages on [hours] and [ttl_exp].

To see the model and its results, use the following command:

summary(model)

This command gives the summary of the model, as depicted here:


However, the simple/pooled OLS method is rarely used in practice because there is often variation among firms, individuals, groups, and countries. To account for this variation, we use Fixed Effects and Random Effects panel data models.

Fixed Effects Regression (Manual)

Fixed Effects (FE) models are useful for panel data with individual-specific effects. The model allows a different constant for each panel unit (e.g., each firm). In this example, we will first estimate the model manually using the lm (linear model) function to develop our understanding, and then discuss the built-in approach using the [plm] library. The command for the manual FE model is:

model <- lm(ln_wage ~ hours + ttl_exp + factor(idcode), data = data)

The rest of the model is the same as our previous linear regression. The [factor(idcode)] term generates a dummy variable for each panel unit, and R automatically drops one dummy (the reference category) to avoid the dummy variable trap; in the output below, [factor(idcode)1] does not appear because it serves as the reference category.

The above command analyses the relationship between wages and the specified independent variables but it also introduces the dummies to capture individual fixed effects in the data. To see the summary of the model, (again) use this command:

summary(model)

Fixed Effects Regression Model (Built-in)

Alternatively, we can use the [plm] library to run the fixed effects model with its built-in FE function. First, we declare the panel structure of the data.

We declare the panel identifier (a numerical variable, [idcode], in our case) with this command:

pdata <- pdata.frame(data, index = c("idcode"))

The above command is used to create a “panel data” or set a “panel data frame” using the `plm` library.

Here, [data] is the main data that we want to convert into panel data. The `index` argument specifies the variable(s) that uniquely identify the entities or individuals in the panel data to capture heterogeneity ([idcode] in our case).

This command creates a new panel data frame [pdata] by converting our existing data into a format suitable for panel data analysis. Once the panel data frame is created, we can run any FE or RE model with it.

Now, using our newly created panel data frame, we will run the FE model with the built-in command. As stated above, the manual and built-in methods should produce the same results.

The command is hereunder:


model_fe <- plm(ln_wage ~ hours + ttl_exp, data = pdata, model = "within")

The difference between the two methods is that, in the manual approach, we include the dummies ourselves via the factor(idcode) term, whereas the built-in command absorbs the fixed effects automatically through the "within" (demeaning) transformation specified by the model = "within" argument. These fixed effects serve the same purpose as individual-specific dummy variables in accounting for unobserved heterogeneity across entities, so the slope coefficients from both methods have to be the same. The rest of the intuition is the same. More about the theory can be found here.
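To see why the two approaches agree, here is a minimal base-R sketch (simulated data, not the nlswork set) comparing the dummy-variable (LSDV) estimator with a manual "within" (demeaning) transformation, which is the kind of transformation plm applies internally:

```r
set.seed(1)
id <- rep(1:5, each = 10)                           # 5 panel units, 10 periods each
x  <- rnorm(50)                                     # time-varying regressor
y  <- 2 * x + rep(rnorm(5), each = 10) + rnorm(50)  # unit-specific intercepts

# LSDV: explicit dummies via factor(id)
b_lsdv <- coef(lm(y ~ x + factor(id)))["x"]

# Within transformation: subtract unit means, then regress without intercept
y_dm <- y - ave(y, id)
x_dm <- x - ave(x, id)
b_within <- coef(lm(y_dm ~ x_dm - 1))["x_dm"]

all.equal(unname(b_lsdv), unname(b_within))         # identical slope estimates
```

The point estimates coincide exactly; only the degrees of freedom used for the standard errors differ between the two routes.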

For the results, use:

summary(model_fe)

The output above shows the results of the fixed effects model.

Note: if we include a time-invariant variable in a fixed effects model, that variable is simply omitted because it is perfectly collinear with the fixed effects (we suggest giving it a try).
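The omission is easy to verify in base R alone. In this hypothetical sketch (simulated data), z is constant within each unit, so it is perfectly collinear with the unit dummies and lm() returns NA for its coefficient:

```r
set.seed(42)
id <- rep(1:4, each = 5)                # 4 panel units, 5 periods each
x  <- rnorm(20)                         # time-varying regressor
z  <- rep(c(0, 1, 0, 1), each = 5)      # time-invariant within each unit
y  <- 1 + 0.5 * x + rnorm(20)

# z is a linear combination of the unit dummies, so it cannot be estimated
fit <- lm(y ~ x + factor(id) + z)
coef(fit)["z"]                          # NA: absorbed by the fixed effects
```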

Random Effect Model

The Random Effects (RE) model accounts for both within-unit and between-unit variation. We can run it using the `plm()` function with the "random" model (just replace "within" with "random" in the previous command).

The rest of the intuition is the same.

model_re <- plm(ln_wage ~ hours + ttl_exp, data = pdata, model = "random")

For the results, use this command:

summary(model_re)

The above figure shows our results of the RE model.

Year Fixed Effects

The inclusion of year fixed effects is a crucial strategy in regression analysis when working with time-series or panel data. By incorporating these fixed effects, using time dummies, we can control for year-specific variations and trends, allowing us to isolate the true relationships between our independent variable(s) and the dependent variable. This approach enhances model accuracy, supports causal inference, and facilitates meaningful comparisons across different years.

We can include the year dummies (our time variable is [year]) by adding [factor(year)] to our regression command:

model_fe_year <- lm(ln_wage ~ hours + ttl_exp + factor(year) + factor(idcode), data = data)

The intuition behind the methodology is the same.

Execute the following command for the results:

summary(model_fe_year)

Here are the results:


Model Selection: Breusch-Pagan Test and Hausman Test

Now the question arises: which of the three models above should we select? To decide, we conduct two tests: the Breusch-Pagan test and the Hausman test.

Breusch-Pagan Test

The Breusch-Pagan test helps us decide whether to use a pooled OLS model or move to a fixed effects or random effects model. The intuition behind the test is to assess whether there is heteroscedasticity across firms/individuals in the model. The null hypothesis of the Breusch-Pagan test is homoscedasticity; the alternative hypothesis is heteroscedasticity. The test calculates a statistic based on the residuals of the regression model, and this statistic follows a chi-squared distribution.

If the p-value associated with the test statistic is below a chosen significance level (say, 0.05), we reject the null hypothesis. This indicates evidence of heteroscedasticity in the model; in that case, we consider moving to an FE/RE model, using the Hausman test to choose between them. On the other hand, if the p-value is above the significance level, we fail to reject the null hypothesis, suggesting there is no strong evidence of heteroscedasticity.

To perform the Breusch-Pagan test in R, follow these steps:

Step 1: Estimate the RE panel data regression using the RE model command discussed above:

model_re <- plm(ln_wage ~ hours + ttl_exp, data = pdata, model = "random")

Step 2: Perform the Breusch-Pagan test using the [bptest()] function from the lmtest package:

bp_test <- bptest(model_re)

Step 2 involves performing the Breusch-Pagan test using the [bptest()] function on the results obtained from the random effects (RE) panel data regression conducted in Step 1.

Finally, to get/see the test statistics, just execute the following command:

bp_test


In our case, the BP test's p-value is less than 0.05 (as shown in the figure above), so we proceed to the Hausman test.

Hausman Test

Our p-value tells us that pooled OLS is not appropriate for our model, since heteroscedasticity is present (no common constant). Within the family of panel data models, there are two models that can tackle this problem: the FE model and the RE model. The Hausman test helps us choose between them.

The Hausman test compares the coefficients from the fixed effects and random effects models to determine whether they differ significantly. The null hypothesis is that both estimators are consistent (in which case the RE estimator is preferred for its efficiency); the alternative is that the RE estimator is inconsistent. If the p-value from the Hausman test is small (typically less than 0.05), we reject the null hypothesis, which suggests the coefficients from the two models differ significantly. In that case, we prefer the FE model.

The reason for choosing the FE model is that the RE model assumes the individual-specific effects are uncorrelated with the regressors; when the Hausman test rejects this assumption, we move to FE (see details here: Theory of Panel Data).

To perform the Hausman test in R, execute the following commands:

Estimate the FE model

model_fe <- plm(ln_wage ~ hours + ttl_exp, data = pdata, model = "within")

Estimate the RE model

model_re <- plm(ln_wage ~ hours + ttl_exp, data = pdata, model = "random")

Perform the Hausman test

hausman_test <- phtest(model_fe, model_re)

Print/see the results of the Hausman test

print(hausman_test)

Finally, the results of the test from our last command are presented here:


Here the p-value is less than 0.05, so we reject the null hypothesis, indicating that the fixed effects model is more appropriate because the unobserved individual-specific effects appear to be correlated with the regressors.

Generating a Regression Table

In the figures above, the results of our models are hard to read. To present the regression results neatly, follow the steps below to create a formatted regression table using the [stargazer] library.

First, put the models into a list:

models_list <- list(model_fe)

Then, list the independent variables:

independent_variables <- c("hours", "ttl_exp")

Finally, label the model and generate the table:

model_labels <- c(paste("Fixed Effects (", paste(independent_variables, collapse = ", "), ")", sep = ""))
stargazer(models_list, title = "Fixed Effects Panel Data Regression",
          align = TRUE, style = "default", type = "text",
          column.labels = model_labels)

The following figure presents the model in a precise, readable format:


Conclusion

In this guide, we’ve covered the essentials of panel regression in R Studio and learned how to load and prepare panel data, run different types of panel regression models, and perform model selection tests. With this knowledge, we can analyse panel data effectively and present our results professionally.
