Number of Unique Observations

In data analysis, a unique observation is a distinct value of a variable (or a distinct combination of values of several variables), counted once no matter how many rows of the dataset share it. For example, a panel dataset may contain many rows per person but only one unique ID per person. In this article, we will count and list these unique observations in Stata.

We will use the National Longitudinal Survey of Young Women dataset. The dataset can be accessed using the command below:

webuse nlswork, clear

Let's limit the dataset to the variables we are interested in so that we have cleaner data to work with.

keep idcode year age race hours

The variable idcode is the code assigned to each individual in the dataset. The variable year is the year in which the data were collected. The variable age is the individual's age. The variable race records whether the individual is white, black, or other. The variable hours is the number of hours the individual works.

Codebook

We already know that a unique id is assigned to each individual in our dataset. Now, we want to see how many individuals we have in our dataset. To do so, we will use the codebook command as shown below:

codebook varname
codebook idcode

The codebook command not only tells us about the unique values but also provides other details about the variables, as shown below in the figure:

The details include the type of variable, range, unique values, mean, standard deviation, and percentiles. We found 4,711 unique IDs/observations, which means the sample includes 4,711 individuals. We can do the same for the hours.

codebook hours

The number of unique hours in the sample is 85. We can also run this command for multiple variables at once, as shown below in the syntax:

codebook varname1 varname2
codebook idcode hours

We can also restrict these details with a condition. For example, to determine the number of unique IDs for individuals over the age of 40, we can use the command below:

codebook idcode if age > 40

Here, if is used to specify the condition.

We found 916 unique IDs, which means there are 916 individuals over 40.
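The same counts can also be obtained with built-in commands other than codebook. The sketch below is one possible approach, not part of the original walkthrough: it tags one observation per idcode with egen and counts the tags. The variable names id_tag and over40_tag are hypothetical names chosen for this example. Note that Stata treats missing values as larger than any number, so the second part excludes missing ages explicitly; its count may therefore differ from one made with age > 40 alone.

```stata
* Sketch: count unique idcode values by tagging one row per individual
webuse nlswork, clear

* id_tag is 1 for exactly one observation per idcode and 0 otherwise
egen id_tag = tag(idcode)
count if id_tag == 1

* Same idea for individuals observed at an age over 40;
* age < . leaves out observations where age is missing
egen over40_tag = tag(idcode) if age > 40 & age < .
count if over40_tag == 1
```

After each count, the result is also available in r(N) if it needs to be reused later in a do-file.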

Contract

The contract command can also be used to determine the unique observations.

contract var_name
contract idcode

The good news is that contract will give us the number of observations for each individual, but the bad news is that it will drop the rest of the variables and only keep the variable that we have specified in the command.

The idcode is the unique ID assigned to the individual (after dropping duplicates), and _freq represents the number of observations for that specific ID. The first individual has 12 observations, the second also has 12, the third has 15, and so on.

Note: The contract command replaces the dataset in memory, so each time you run it, you must reopen the dataset before running any new command.
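One way to avoid reloading the dataset after every contract is to wrap the command in preserve and restore. The following is a minimal sketch of that pattern, assuming the nlswork data used throughout this article:

```stata
* Sketch: run contract on a temporary copy of the data
webuse nlswork, clear

* save a snapshot of the data in memory
preserve

* the data now hold one row per idcode plus the _freq variable
contract idcode
count
list in 1/5

* bring the full dataset back into memory
restore
describe, short
```

When this runs inside a do-file, Stata also restores the data automatically at the end of the do-file if restore was never reached.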

We can also find out which races appear within each age category, and how often. To do so, we pass a combination of two variables to the same contract command, as shown below:

contract var_name1 var_name2
contract age race

On running the above command, we will get the following results:

The results show that for age 14, we have 2 blacks and 1 other. For age 15, we have 1 black. For 16, we have 16 white, 11 black, and so on. We can also get percentages, cumulative frequencies, and cumulative percentages by using the below command:

contract age race, freq(frequency) percent(percentage) cfreq(cum_freq) cpercent(cum_percent)

In this command, freq, percent, cfreq, and cpercent are the options, and within the parentheses, we will write the names of the variables we want to assign. The below figure represents the results:

Distinct

A user-written command distinct is also used to count the unique ids. Before using it, we need to download the distinct command by using the below command:

ssc install distinct

After installation, use the below command to see how many unique observations there are in the idcode:

distinct idcode

In the output, total is the total number of observations (28,534), and distinct (4,711) is the number of unique observations. Similarly, the command can be used for multiple variables:

distinct race idcode age

We can also determine the number of unique IDs for individuals who are over the age of 40 by using the command below:

distinct idcode if age > 40

Unique

The fourth method of counting the number of unique observations is the unique command, which is also a user-written command. We need to install this command before proceeding:

ssc install unique
unique idcode

We can also apply a condition with this command, as shown below:

unique idcode if age > 40

You will notice that the codebook, distinct, and unique commands all give the same results.
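If the count is needed later in a do-file rather than only on screen, the stored results of distinct can be used. The sketch below assumes that the installed version of distinct returns r(N) and r(ndistinct), as described in its help file. Check help distinct if the names differ in your version. The macro name n_ids is a hypothetical name chosen for this example.

```stata
* Sketch: reuse the number of unique IDs reported by distinct
* (assumes distinct is installed: ssc install distinct)
webuse nlswork, clear

quietly distinct idcode
display "observations: " r(N) "   unique ids: " r(ndistinct)

* keep the count in a local macro for later use
local n_ids = r(ndistinct)
display "The sample contains `n_ids' individuals"
```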

Get the list of unique values using tabulate

We can also look at the list of unique observations with the help of the tabulate command.

tabulate age

We will get the unique age observations along with their frequency, percentage, and cumulative percentage.
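Although tabulate is mainly a display command, a one-way tabulate does leave the number of rows of the table in r(r), which gives a quick count of the unique values. A minimal sketch follows. Keep in mind that tabulate has a row limit that depends on the Stata edition, so this suits variables such as age better than variables with thousands of values.

```stata
* Sketch: count the unique values of age from tabulate's stored results
webuse nlswork, clear

quietly tabulate age
display "unique ages: " r(r)
display "nonmissing observations: " r(N)
```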

List of unique values using levelsof

The limitation of the tabulate command is that it does not store these values; it only displays them in the output window. If you want to store these values in a macro and then use them in loops, you can use the levelsof command. It provides a concise list of a variable's unique values, as shown below:

levelsof idcode

To display these values, we will use the below command:

display "`r(levels)'"

However, to save these values in a local macro, we will use the command below:

levelsof idcode, local(values)

We use the local option to store the unique IDs. Within the parentheses, we write the name of the macro. We can then loop through each value using the command below:

foreach i of local values {
    summarize age if idcode==`i'
}

We start with the foreach command, followed by the name of the loop macro (i). The loop then summarizes age for idcode = 1, 2, 3, and so on. The results obtained will look somewhat like this:
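The same pattern works for any variable with a manageable number of values. The sketch below applies it to race, which is stored as a numeric variable in this dataset, and also counts the stored values with the word count macro function. The macro names race_levels, n_levels, and r are hypothetical names chosen for this example.

```stata
* Sketch: store the unique values of race, count them, and loop over them
webuse nlswork, clear

levelsof race, local(race_levels)

* number of items in the macro equals the number of unique values
local n_levels : word count `race_levels'
display "race has `n_levels' unique values: `race_levels'"

* summarize hours separately for each value of race
foreach r of local race_levels {
    display "race == `r'"
    summarize hours if race == `r'
}
```

Because local macros disappear when a do-file finishes, the levelsof line and the loop need to be run together in the same do-file or selection.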

Storing Distinct Values in a Variable

The levelsof command saves the values in a local macro but doesn't provide them as a variable (a column in the Stata dataset).

This issue can be resolved by using the below command:

bys idcode: keep if _n==_N

This keeps only a single observation for each idcode, so we will have 4,711 observations, one for each of the 4,711 individuals in our dataset. If we want, we can keep only the idcode variable:

keep idcode

Similarly, we can use two variables and get the unique combinations of these two variables.

bys age race: keep if _n==_N
keep age race
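An alternative that reaches the same kind of result is the built-in duplicates drop command. The sketch below is one possible variant, not the method used above: it keeps only the two variables, drops rows that are exact copies of an earlier row, and uses preserve and restore so the full dataset is available again afterwards.

```stata
* Sketch: unique age-race combinations with duplicates drop
webuse nlswork, clear

preserve
keep age race

* with no variable list, rows identical on all remaining variables are dropped
duplicates drop
count
list in 1/5

* return to the full dataset
restore
```

Observations with a missing age form their own combinations here, so they appear in the result unless they are dropped first.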