In real-world, it’s quite common to have missing values in panel data or time series data, which can result in unbalanced panels for panel data and problems in data analysis for time series data. These missing observations can occur due to a variety of reasons. For example, macroeconomics data could not be accessible for all countries from the same starting date, or the frequency of data collection for particular variables may fluctuate with time. Data can also be either present yearly, but after a certain time if data collection has changed from yearly to quarterly, some data could be lost in these quarters.
This issue of having missing values in panel or time series data can be handled by using tsfill
command in Stata. With the help of this command, missing values are filled in the data. For instance, the tsfill command may be used to turn imbalanced panel data into balanced panel data when dealing with it. Also, while working with time series data, the tsfill command has the ability to generate missing years and then fill in the relevant gaps.
To get a clearer picture of the tsfill
command, we use the following time series data, which contains missing values.
Once the data has been loaded, we move to data editor window to check which values are missing from the data. As shown in the picture below, the data contains two variables, years and GDP growth in these years. Since the data is in the form of time series, it shows the GDP growth for consecutive years. Upon closer examination, we notice that the value of GDP growth for the year 2015 is missing from the data. Similarly, values of years 2020 and 2021 are also missing. This makes the data unreliable for data analysis and modeling.
Related Article: How to Create and Use Business Calender in Stata
To analyze data with missing values, we use tsfill
command in Stata. This command creates a row for the missing year, i.e. year 2015 in data, and help us model data with ease and without any biases in the modeling.
The missing values in time series data can also be handled manually, instead of the tsfill command. To do so, use the following path in Stata
Statistics > Time Series > Set up and Utilities > Fill in Gaps in Time Variables
However, this article follows the tsfill command and its usage, so our focus will be on the command. Before we use tsfill command for filling gaps in the given data, we need to tell Stata about the type of data being used in Stata. This could be done by using tsset command
. Tsset’s function is to handle time-series setting of a dataset. We can specify that the data in memory is time series by using the tsset command. This command is used with the variable name which indicates the the values related to time, i.e. in this data, the year variable represents time, thus it would be used with tsset command. To specify Stata about the type of data we are using, use the following command
tsset year
Using this command provides detail about the data, including the year from which data starts and the year in which it ends. It also informs us whether gaps are present in the data or not. As it can be seen from the following image, there exists gaps in our data
Once it’s informed to Stata that data type is time series, we can proceed to use tsfill command in the Stata, to fill gaps in the data. To fill these gaps, simply use the following command, without instructing Stata any further, because we have already informed Stata about the type of data.
tsfill
Using the above command will fill the gaps by generating the previously missing years i.e. 2015, 2019 and 2020. The newly generated rows for the given years don’t have any GDP growth values, because values of GDP growth are missing from the data source.
There are many ways to fill these missing values of GDP growth. One can either take the average of all the existing values and fill the gaps, or generate the missing values based on the previous values present in the data. To do so, Stata Command ipolate
can be used to predict and fill the missing values in the subsequent years. This ipolate
command generates the missing values in Stata . To fill these missing values, use the following command
ipolate gdp_growth year, gen(ipolate_gdp)
Another column will be generated by ipolate command for the values of GDP growth.
The next step is to sort values in the data, although our data being used here is already sorted, if data is arranged randomly, it will be arranged in sequence using the sort command.
sort year
Similarly, we can also fill the missing value based on the previous values in Stata. To fill the missing values based on previous values, use the following command
replace gdp_growth=gdp_growth[_n-1] if missing(gdp_growth)
Filling Gaps in Panel Data in Stata
Till now, we dealt with time series data only. In case, the similar issue of missing values arise in Panel data, it can also be solved in the Stata. To visualize this, we will import the panel data by downloading the below file.
Once the data has been imported, we will go to data editor window, and check whether data for each year is present or not. As it can be seen from the image below, the data for many years including from 2011 to 2013 is missing and similarly data for certain years of unit 2 is also missing. As we know, Panel data has both entities and time periods in it, so to have complete data, the entities and time period should match, which doesn’t match in our data.
For instance, data for entity 2 starts from 2005, while for entity 1, it starts from 2010. Similarly, for entity 3, values start from 2016 and the data of previous years is missing. Along with lack of starting data for common year, years n between the data are also missing, as 2012 and 2013 is missing for entity 1, and for 2012 and 2014 for entity 2 and So on.
It’s important to keep in mind, that panel data may also be divided into balanced and unbalanced panel data. Balanced panel data refers to datasets where all entities have observations for all time periods. In contrast, unbalanced panel data has missing observations for one or more entities over time. It’s important to account for these missing observations, as they can impact the accuracy of analyses and conclusions drawn from the data. Therefore, it’s essential to identify and address missing values when working with panel data. Using the xtset
command, Stata can notify us about the missing values in the data.
To identify the missing values, we use the following command,
xtset firm year
As panel data includes entities and time, mentioning the variables that reflect entities and time is crucial when working with panel data. These variables, in this data, are year and firm. Once identified, they should be specified with the xtset command for Stata to keep it in its memory and make the analysis according to panel data.
The following image clearly shows that missing values are present in the panel data.
To fill the gaps in years in panel data, the previously used tsfill command will be used.
tsfill
Related Article: Rolling Beta and rolling mean, median in Stata
However, tsfill command here only generated rows for the missing years, and didn’t take into account the common starting and ending year missing for the three entities.
As we already discussed, to have a balanced panel data, it’s necessary to have common starting and ending. To make the data a balanced panel data, we generate the common year for all three entities by using the following command
tsfill, full
It will generate the common starting year; 2005 and ending year ; 2019 for all the entities, along with generating missing years in these entities
The missing values, generated using the tsfill command, now represent a complete picture of the data. Having discussed the importance of balanced panel data earlier, we check the status of data using xtset command again
xtset firm year
The Stata now shows the data as Balanced panel data with no missing years.
Although the missing years are generated in the data, these years still lack the values of stock prices. The missing values can also cause problem in analyzing panel data in Stata. However, these values are missing from data source and can only be estimated either by taking average of all the existing values or replacing the missing values with previous values.
In this case, when a number of values are missing, the previous values can be replaced for the missing values. To do so, we use the following command
replace stock_price=stock_price[_n-1] if missing(stock_price)
Related Article: Use of System Variables, difference between _n and _N in Stata
Note that if the previous value is also missing, the current values will also remain missing. The previous missing value is usually the first value/starting of a certain entity in the panel data.
The missing values are created for the newly generated years, with respect to the previous values, as shown in the image below.
However, the Stata generated missing values for entity 3 based on entity 2, instead of leaving the starting values of entity 3 empty. This has happened because we didn’t specify Stata about the entities and just instructed it to generate missing values based on previous values.
To replace the missing values based on previous values with respect to each entity, we need to specify Stata about it. Again, Import the panel data, use the xtset command to instruct Stata about the type of data; panel data and generate missing years in the data by using tsfill full
command.
The next step is to inform data about the presence of different entities and how missing gaps should be filled according to each entity. To do so, we use the by
command in Stata with the entity variable; firm, along with the previously used replace command.
This will specify Stata that different entities are present in data. Thus, it would generate the missing values for each entity with respect respective to that entity only.
bys firm: replace stock_price=stock_price[_n-1] if missing(stock_price)
Stata, thus, follows the instruction provided above, and fill missing gaps with correspondence to each entity. As shown in the following image, the starting values of entity 3 are missing from data source too. These missing values of entity 3 are now, not, filled by replacing previous values of entity 2.
Although we have filled missing values with respect to their entities, the issue of remaining missing values, who haven’t been replaced by their previous values, as in entity 3, still exists. This problem can be solved by following a few steps. First step is to sort the variables firm and year in ascending and descending order, respectively. To sort these variables, use the command given below
gsort firm -year
The reason we used negative sign “-” with year is because we are sorting it in descending order.
Once the variables are sorted in their respective manners, data will look like this
Now again, using the bys command can fill the missing values of entity 3 by using its previous values. Using the bys command
bys firm: replace stock_price=stock_price[_n-1] if missing(stock_price)
Carryforward Command in Stata
So far we have used the process of filling gaps with previous values either by using replace command in the data. However, there is one simple alternative to all this, which can replace the missing values with previous existing values. This alternative is known as carryforward
command in Stata. This command will carryforward non-missing values from one observation to the next, substituting the previous value for missing values. To use the carryforward command, it has to be installed in Stata using the following option, because carryforward command is not built-in Stata
ssc install carryforward
Once the command has been installed, it will be used to carryforward the previous values in the missing values. If as a precaution, we want to retain our original values, and use the carryforward command to fill missing values in a new column, the following command should be used
bys firm: carryforward stock_price,gen(cc)
However, if your requirement is not to generate a new column for filling the missing values, you can substitute the gen(cc) option from the above used command with the replace command, as shown in the command below
bys firm: carryforward stock_price, replace
All changing will be made in the already existing variable stock price
We still have starting missing value for entity 1 and ending missing values for entity 3. To fill these missing values, we again use the following command, as used earlier
gsort firm -year
Once the values are sorted in ascending or descending order, the carryforward command can be used to carry forward the previous values, and fill the gaps.
bys firm: carryforward stock_price, replace