geodist is a user-written Stata command that computes the distance between two locations from their latitudes and longitudes. Combined with a few data-management commands, it can also be used to find the location nearest to a target city and the cities that lie within a certain radius of it. The command can be installed from SSC using:
ssc install geodist
We will use two datasets: distance.dta, which stores the latitudes and longitudes of different cities in the USA, and industrialcities.dta, which stores the latitudes and longitudes of four industrial cities in the USA.
To load the dataset, we type:
use "distance.dta", clear
Make sure Stata's working directory is the folder that contains the data file. The working directory can be changed with:
cd "directory path"
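A quick way to confirm the setup is to print the working directory and check that the data file is visible from it. This is a minimal sketch; it assumes distance.dta has already been copied into the working directory.

```stata
* Print the current working directory
pwd

* Stop with an error if distance.dta is not found in that directory
confirm file "distance.dta"
```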
Syntax of geodist
The general syntax of the command is as follows:
geodist lat1 lon1 lat2 lon2 [if] [in] , generate(new_dist_var) [options]
The first two arguments are the latitude and longitude of the first location, followed by the latitude and longitude of the second location. The latitude must always be written before the longitude. The if and in qualifiers are optional.
The generate() option is required; it names the new variable that will store the distance between the two locations. Further options can be added after it.
By default, this command generates distances in kilometers.
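The syntax can be tried on a small hand-typed dataset. The sketch below is illustrative only: the three rows and their approximate coordinates are made up for the example and are not taken from distance.dta. The if qualifier restricts the calculation to rows where both coordinates are present.

```stata
* Toy data (approximate coordinates, for illustration only)
clear
input str12 city double(latitude longitude)
"Boston"   42.3601 -71.0589
"Chicago"  41.8781 -87.6298
"Unknown"  .       .
end

* Distance from New York, computed only where both coordinates are non-missing
geodist 40.697132 -73.931351 latitude longitude if !missing(latitude, longitude), generate(km)
list
```

Rows excluded by the if qualifier are left with a missing value in km.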
Whenever you work with distances or the geodist command, it is a good idea to store values in variables of type double rather than float. A double holds about 16 significant digits, whereas a float holds only about 7, which is too few for a coordinate such as -73.931351.
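The loss of precision can be seen directly with Stata's float() function, which rounds a number to float precision. This short sketch uses the New York latitude from this tutorial; it needs no dataset.

```stata
* The value as Stata holds it in double precision
display %12.8f 40.697132

* The same value after rounding to float precision
display %12.8f float(40.697132)

* Rounding error introduced by float storage, in degrees
display float(40.697132) - 40.697132
```

The error is tiny in degrees, but it is enough to turn a distance that should be exactly zero into a small positive number.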
Calculating the Distance of New York From Other Cities
Let’s say we wish to find the distance from New York to the other cities in the US. We type the following command, in which the first two arguments are the latitude (40.697132) and longitude (-73.931351) of New York. The latitudes and longitudes of the other cities are stored in the variables ‘latitude’ and ‘longitude’, respectively. The distance from New York to each of these cities will be stored in a new variable called ‘km’.
geodist 40.697132 -73.931351 latitude longitude, generate(km)
The distances stored in the new ‘km’ variable are air distances as opposed to road distances. This is because the command calculates the geodetic distance, which is the length of the shortest curve between two points on a mathematical model of the earth’s surface (an ellipsoid by default, or a sphere if requested).
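For a single pair of points, no dataset is needed. The sketch below rests on an assumption about geodist that is not covered in this tutorial: that when all four coordinates are typed as numbers, the command displays the distance and leaves it in r(distance) without requiring generate(). The second point is an approximate location for Boston, chosen only for illustration.

```stata
* Distance between two points given as numbers (second point is illustrative)
geodist 40.697132 -73.931351 42.3601 -71.0589

* Assumed behavior: the result is left in r(distance), in kilometers
display r(distance)
```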
Distance in Miles
To compute the above distances in miles, we just add the mile option:
geodist 40.697132 -73.931351 latitude longitude, gen(mile) mile
This generates a variable ‘mile’ with the distance of each city from New York stored in miles.
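One way to sanity-check the mile option is to compare it with the kilometer result. The sketch below uses made-up toy rows with approximate coordinates and assumes that geodist converts using the international mile (1.609344 km), so the ratio of the two variables should be very close to that value.

```stata
* Toy data (approximate coordinates, for illustration only)
clear
input str12 city double(latitude longitude)
"Boston"   42.3601 -71.0589
"Chicago"  41.8781 -87.6298
end

geodist 40.697132 -73.931351 latitude longitude, generate(km)
geodist 40.697132 -73.931351 latitude longitude, generate(mile) mile

* Kilometers per mile implied by the two results
generate double km_per_mile = km / mile
list city km mile km_per_mile
```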
Haversine and Vincenty (1975) Formulae for Distance
By default, geodist uses the Vincenty (1975) formula, which models the earth as an ellipsoid, to calculate the distance.
If we require the Haversine formula, which models the earth as a sphere, we add the sphere option:
geodist 40.697132 -73.931351 latitude longitude, gen(km2) sphere
geodist 40.697132 -73.931351 latitude longitude, gen(mile2) mile sphere
The Vincenty (1975) and Haversine formulas produce fairly similar measures of distance.
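To make the Haversine calculation concrete, it can be written out by hand and compared with the sphere option. This is a sketch under stated assumptions: the earth radius of 6371 km is assumed to match the default used by geodist with sphere, and the second point (approximate Boston coordinates) is illustrative. The variable and scalar names are made up for the example.

```stata
clear
set obs 1
generate double lat1 = 40.697132
generate double lon1 = -73.931351
generate double lat2 = 42.3601
generate double lon2 = -71.0589

* Degrees-to-radians conversion factor
scalar d2r = _pi / 180

* The two terms inside the square root of the Haversine formula
generate double a1 = sin((lat2 - lat1) * d2r / 2)^2
generate double a2 = cos(lat1 * d2r) * cos(lat2 * d2r) * sin((lon2 - lon1) * d2r / 2)^2

* Great-circle distance on a sphere of radius 6371 km (assumed radius)
generate double hav_km = 2 * 6371 * asin(sqrt(a1 + a2))

* Same pair of points through geodist with the sphere option
geodist lat1 lon1 lat2 lon2, generate(sphere_km) sphere
list hav_km sphere_km
```

If the assumed radius matches, the two columns should agree to within rounding.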
Using geodist with Longitudes and Latitudes Stored In a Variable
If the latitude and longitude values are stored in variables, we can use the variable names instead of typing the numbers into each command. Let’s first generate variables that hold the latitude and longitude of New York:
generate baselat = 40.697132
generate baselong = -73.931351
This generates two variables, ‘baselat’ and ‘baselong’, that are stored as float, which is Stata’s default storage type. To illustrate the lack of precision that results from float storage, we run the geodist command again using these newly generated variables:
geodist baselat baselong latitude longitude, generate(base)
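Note that the variable ‘base’ shows a nonzero distance between New York and itself, even though a city’s distance from itself should be zero. This happens because the float variables hold only rounded versions of the coordinates. It is therefore important to store coordinates as double, for example by creating them with generate double.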
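The effect of the storage type can be reproduced side by side on toy data. In this sketch the two rows are made up for illustration, and the names baselat_d, baselong_d and base_d are hypothetical; the first row deliberately repeats the New York coordinates so that its true distance is zero.

```stata
* Toy data: first row repeats the base coordinates exactly
clear
input str12 city double(latitude longitude)
"New York" 40.697132 -73.931351
"Boston"   42.3601   -71.0589
end

* Base coordinates stored as float (the default)
generate baselat = 40.697132
generate baselong = -73.931351

* Base coordinates stored as double
generate double baselat_d = 40.697132
generate double baselong_d = -73.931351

geodist baselat baselong latitude longitude, generate(base)
geodist baselat_d baselong_d latitude longitude, generate(base_d)
list city base base_d
```

With double storage the New York row should show a distance of zero, while the float version shows a small positive value.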
Pairwise Combinations of Latitudes and Longitudes
We now move on to finding which cities are located within a certain distance of others. For this example, we load another dataset, ‘industrialcities.dta’, which holds four industrial cities with their latitudes and longitudes stored in the respective variables.
use "industrialcities.dta", clear
This file has longitude and latitude data on Iowa City, Boston, Houston and Chicago. We now want to know which cities in our distance.dta file are within 30 km of these four industrial cities.
To achieve this, we make pairwise combinations using the cross command. The city, latitude and longitude variables must have different names in the two files, so that both sets of values are kept side by side after the files are combined.
cross using "distance.dta"
This command will create all possible pairs of the cities in the two files. For example, Iowa City will be paired up with every city in the distance.dta file; as will Boston, Houston and Chicago.
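The behavior of cross is easy to verify on two tiny datasets. In the sketch below, all city labels and coordinates are hypothetical placeholders; the base file uses names ending in 1 so that they do not clash with the names in the other file.

```stata
* Toy "other cities" file, saved to a temporary file
clear
input str10 city double(latitude longitude)
"A" 41.0 -87.0
"B" 42.0 -71.0
"C" 30.0 -95.0
end
tempfile others
save `others'

* Toy base file with distinct variable names
clear
input str10 city1 double(latitude1 longitude1)
"X" 41.5 -90.0
"Y" 42.3 -71.1
end

* Every base city is paired with every other city: 2 x 3 = 6 rows
cross using `others'
count
list
```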
To find the distance between each of the paired up cities, we use the geodist command again:
geodist latitude longitude latitude1 longitude1, gen(distance)
This calculates the respective distances in the same manner that we saw above, and stores the values in a new variable called ‘distance’.
Our aim is to keep cities that are within 30 km of the four industrial cities. To drop all other pairs that do not meet this criterion, we run:
keep if distance < 30
We now have only 26 pairs of cities remaining that are less than 30 km away from each other. However, this will also include observations where the cities are paired up with themselves. For example, the distance between Boston and Boston is zero, and thus is present in our observations. We can drop these by:
drop if distance==0
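Once only the nearby pairs remain, the number of cities within the radius of each base city is simply the number of rows per base city. The sketch below repeats the steps on toy data; the cities and their approximate coordinates are illustrative, and n_within is a hypothetical variable name.

```stata
* Toy "other cities" file (approximate coordinates, for illustration only)
clear
input str12 city double(latitude longitude)
"Cambridge" 42.3736 -71.1097
"Quincy"    42.2529 -71.0023
"Chicago"   41.8781 -87.6298
end
tempfile others
save `others'

* Toy base file with a single base city
clear
input str12 city1 double(latitude1 longitude1)
"Boston" 42.3601 -71.0589
end

cross using `others'
geodist latitude1 longitude1 latitude longitude, generate(distance)
keep if distance < 30

* Number of cities within 30 km of each base city
bysort city1: generate n_within = _N
list city1 city distance n_within
```

A base city with no neighbor inside the radius drops out of the data entirely, so it will not appear with a count of zero.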
Finding the Nearest Cities to the Base Cities
We now look at how Stata can help us find the city located closest to each base city (in our example, the base cities are the four industrial cities).
We load the industrialcities.dta dataset again, make the pairwise combinations, and calculate the distances exactly as before.
use "industrialcities.dta", clear
cross using "distance.dta"
geodist latitude longitude latitude1 longitude1, gen(distance)
Once we have the distance values, we sort our data by ‘city1’ (the base cities) and distance.
sort city1 distance
This will sort the data in alphabetical order of the four cities in ‘city1’ with the first observations being the pairs with Boston followed by the pairs with Chicago, Houston and Iowa City. Further to this, within each of the four categories of ‘city1’ (i.e. Boston, Chicago, Houston and Iowa City) data will be sorted in ascending order of distance.
As observed earlier, the distance of a city from itself is zero, so the pair of a base city with itself will be the first observation within that base city’s group. We drop these observations through:
drop if city1==city
This drops the four observations where the base city (‘city1’) is the same as the city we are computing its distance with (‘city’).
Now the first row for each base city will indicate which city is nearest to them, because distance is sorted in ascending order for each one of them.
by city1: keep if _n == 1
This command groups the data by the four categories found in ‘city1’ and then keeps only the first observation from these groups. This leaves observations for only the cities that have the shortest distance between themselves and the base cities. The closest city to Boston, for example, is Brooklyn.
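The sort and the by-group selection can also be combined into a single bysort line, which sorts by the variable in parentheses within each group without making it part of the group definition. The sketch below uses made-up pairs and distances purely to show the idiom.

```stata
* Toy pairs with precomputed distances (hypothetical values)
clear
input str10 city1 str10 city double distance
"Boston"  "P" 12.5
"Boston"  "Q" 4.2
"Chicago" "R" 20.1
"Chicago" "S" 33.0
end

* Within each base city, sort by distance and keep the closest pair
bysort city1 (distance): keep if _n == 1
list
```

If two cities are exactly the same distance from a base city, which of them ends up first is arbitrary, so ties may need separate handling.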