Using R to analyze World Bank’s Gender database

One of my undergraduate student’s wants to use the World Bank’s Gender database for a class project.  Turns out that the data is in “long” format.  Each variable and country is a row, and the years are columns.  Most statistical analysis of cross-country panel data starts with the data in “long” format, where each observation is a country-year and the columns are variables.  Getting R to reshape the data isn’t as easy as one would think.  (In Stata is was not so easy either.)  The problem is never reshaping with a single variable, but rather reshaping with many variables and also preserving all the labels.

Stack overflow has some useful posts on reshaping WDI data that are relevant.  See here, and here.  And the folks at UCLA have a nice introduction to the subject of reshaping.  I finally settled on this:

# First set your working directory to be the directory with the data

# Load a couple packages that are needed

library(plyr)
library(stringr)

# Load the gender stats database into R
WGgender <-  read.csv(“GenderStats_Data.csv”)

# Subset data with countries for analysis
WGgenderLA = subset(WGgender, ( Country.Name==”Peru” |Country.Name==”Chile” ) )
# make a list of variables that need
clist <- c(“SP.DYN.LE00.IN”, “SP.DYN.TFRT.IN”)

# loop through the list of variable names
for (i in clist) {
# subset data to only include that variable
atemp = subset(WGgenderLA, ( Indicator.Code==i ) )
# drop the indicator name and code
axtemp <- subset(atemp, select = -c(Indicator.Code, Indicator.Name))
# reshape the data into long
ax2temp <- reshape(axtemp, direction=”long”, varying=list(names(axtemp)[3:56]), v.names=i,
idvar=c(“Country.Name”,”Country.Code”), timevar=”Year”, times=1960:2013)
# a little trick to get the variable name to be assigned to the data.frame name
nameToUse <- str_sub(string=i, start=1, end=14)
assign(x=nameToUse, value=ax2temp)
# cleanup the temporary variables
rm(atemp, i, nameToUse, ax2temp, axtemp)
}

# merge the two data.frames
WBgender <- join(x = SP.DYN.LE00.IN, y = SP.DYN.TFRT.IN, by = c(“Country.Name”,”Country.Code”, “Year”))

# more cleanup
rm(clist, WGgender, WGgenderLA, SP.DYN.LE00.IN, SP.DYN.TFRT.IN)

# Now you are ready for panel cross-country analysis
# See http://www.princeton.edu/~otorres/Panel101R.pdf
# See http://cran.r-project.org/web/packages/plm/vignettes/plm.pdf
plot(SP.DYN.TFRT.IN~Year, data=WBgender)
plot(SP.DYN.LE00.IN~Year, data=WBgender)

It is probably very clunky.  You have to decide in advance which variables to use.  And then a lot of cutting and pasting to get all the variables in the code.  There must be an easier way but I did not find it.  (Of course, the WDI package in R does most of this easily, but here the student wanted to use the Gender database which you download as a csv file.)

A great resource from Oscar Torres-Reyna at Princeton of  panel analysis in R.

About mkevane

Economist at Santa Clara University and Director of Friends of African Village Libraries.
This entry was posted in Politics. Bookmark the permalink.