Wednesday, March 19, 2014

Introduction to [P] postfile in Stata



The goal of this entry is to introduce you to the postfile command in Stata and to illustrate its purpose and utility.

What are post, postfile, and postclose?

[P] post, postfile, and postclose are a set of commands used to:

(1) create a subset of the original dataset;

(2) save the results of statistical analyses in a new dataset that can then be used for graphing or other manipulations;

(3) perform Monte Carlo-type experiments.

Let’s start with the basic syntax:

        postfile postname newvarlist using filename [, replace]
        post postname (exp) (exp) ... (exp)
        postclose postname
        postutil dir
        postutil clear

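Postfile: creates the dataset file where results will be saved. Postname is a handle that identifies the open postfile; it is usually a temporary name created with tempname and referred to through a local macro. Newvarlist declares the names (and, optionally, the storage types) of the variables in the new dataset, and filename is the file the dataset is written to.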

Post: adds observations to the new dataset.

Postclose: declares the end of the observation postings.

Postutil dir: lists all open postfiles.

Postutil clear: closes all open postfiles.
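To see how these commands fit together, here is a minimal sketch of use (3) from the list above, a small Monte Carlo-type experiment. Everything in it is an illustrative assumption rather than part of the original post: the handle name sim, the variables rep and xbar, the file mcresults.dta, the seed, and the choice of 100 replications of 50 standard normal draws. Run it from a do-file so that the // comments are accepted.

```stata
clear
set seed 12345                          // arbitrary seed, for reproducibility
tempname sim                            // hypothetical handle name
postfile `sim' rep xbar using mcresults, replace
forvalues r = 1/100 {
    quietly {
        drop _all                       // start each replication with empty data
        set obs 50
        generate x = rnormal()          // 50 draws from a standard normal
        summarize x
    }
    post `sim' (`r') (r(mean))          // one row per replication
}
postclose `sim'

use mcresults, clear
summarize xbar                          // distribution of the 100 sample means
```

Each pass through the loop replaces the data in memory, but the open postfile is unaffected, which is what makes post convenient for simulations.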

Let's take a look at the examples below. I use the nlsw88 dataset that comes with the Stata software package. It is an extract from the US National Longitudinal Survey for employed women in 1988.


Example 1

In this example, we will create a subset of the dataset that contains only selected observations on two variables: age and race. It will serve as an illustration of the basic postfile syntax. To decrease the likelihood of error and to avoid typing a separate post line for each of the 10 observations, we will use a simple foreach loop [an accessible introduction to loops can be found here.]
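The original listing is not available here, so the following is a reconstruction rather than the author's exact code: a minimal sketch laid out so that its line numbers match the line-by-line walkthrough. The handle name memhold is an assumption; age1, race1, results1, and the observation numbers come from the text. Run it from a do-file so that the // comments are accepted.

```stata
sysuse nlsw88, clear                                    // load the data
tempname memhold                                        // handle name (assumed)
postfile `memhold' age1 race1 using results1, replace   // declare the new dataset
foreach i of numlist 1 5 7 8 9 10 45 66 98 100 {
    // post age and race of observation i, in the order declared above
    post `memhold' (age[`i']) (race[`i'])
}
postclose `memhold'                                     // finish posting

use results1, clear                                     // inspect the result
list, clean
```

The last two lines are an addition for checking the output and are not part of the walkthrough.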


Let's dissect each line:

Line 1: loads the dataset

Line 2: creates a temporary name, stored in a local macro, that identifies the new dataset while it is being built. (Note: you can name this macro anything you want, but make sure to refer to the same name in the postfile, post, and postclose lines that follow.)

Line 3: creates the dataset where results will be saved and specifies the names of the variables it will contain. (Note: if you are creating string variables, you must declare that. For example, if my race variable were a string variable, I would have to put a string storage type such as str10 in front of 'race1'.)

Lines 4-7: loops over selected observations (1, 5, 7, 8, 9, 10, 45, 66, 98, and 100) of selected variables (age and race) to be included in the new dataset. Note: the order of age and race in the post line must correspond to the order of age1 and race1 declared for results1.dta in line 3.

Line 8: declares that we are done posting the contents of the new dataset. This line is important. If you forget it, Stata will not warn you, but the new dataset will be incomplete: it will contain variables with no observations.

Our new dataset is complete and looks like this:


Subset of nlsw88: results1.dta

Example 2

Next, using the same nlsw88 data, let's create a dataset based on a simple statistical analysis. Our goal is to produce a dataset that contains average wages by respondent's age. Again, to automate the work, we will use a loop that computes the mean wage for each age category (the age range in the dataset is 34-46).
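As the original listing is not available here, the following is a reconstruction and not the author's exact code: a minimal sketch laid out so that its line numbers match the walkthrough, with the loop on lines 6-12 and postclose on line 13. The handle name memhold, the order of the two variables, and the use of count with r(N) for the 'if' condition are assumptions; avg_wages, age, and results2 come from the text. Run it from a do-file so that the // comments are accepted.

```stata
sysuse nlsw88, clear
tempname memhold
postfile `memhold' age avg_wages using results2, replace

// for each age from 34 to 46, post the mean wage unless no respondent has that age
forvalues a = 34/46 {
    quietly count if age == `a'
    if r(N) > 0 {
        quietly summarize wage if age == `a'
        post `memhold' (`a') (r(mean))
    }
}
postclose `memhold'

use results2, clear
list, clean
```

The final two lines are an addition for inspecting the output. From here the posted dataset can be graphed or merged like any other Stata dataset.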



Let's analyze each line:

Line 1: loads the dataset

Line 2: creates a temporary name, stored in a local macro, that identifies the new dataset while it is being built.

Line 3: creates the dataset where results will be saved and specifies the names of the variables it will contain. In this case, the new dataset will list average wages (avg_wages) by age (age).

Lines 6-12: in a loop, steps through the range of age values, computing the average wage for each one and posting it to the new dataset. The 'if' condition tells Stata not to compute the mean wage for age categories that have no observations. It is not necessary for this particular example but would be vital if, for example, age ranged from 34 to 46 but some categories in the range weren't present (e.g., there were no 35-year-old respondents).

Line 13: declares that we are done posting the content of the new dataset. 

The new dataset looks like this: 


Average wages by age: results2.dta

Debugging Post-related Errors

Some of the most common errors when using post are:

1. Mistyping a local macro name;


For example,
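A minimal sketch of this mistake, assuming the handle was created as memhold and then misspelled as memhlod in the post line (both names are illustrative). An undefined local macro expands to nothing, so post is left without a valid postname and fails. The capture noisily prefix is used here only so that the do-file keeps running and the return code can be inspected.

```stata
sysuse nlsw88, clear
tempname memhold
postfile `memhold' age1 race1 using results1, replace
capture noisily post `memhlod' (age[1]) (race[1])   // typo in the macro name
local rc = _rc                                      // nonzero: the post failed
display "return code from post: `rc'"
postclose `memhold'
```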












This is the error you will get:







2. Declaring more or fewer variables in postfile than the number of expressions you supply in post;


For example,
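A minimal sketch of a mismatch, assuming a handle named memhold (illustrative): postfile declares two variables, but post supplies only one expression. As above, capture noisily is there only so the return code can be examined without stopping the do-file.

```stata
sysuse nlsw88, clear
tempname memhold
postfile `memhold' age1 race1 using results1, replace
capture noisily post `memhold' (age[1])   // one expression, two declared variables
local rc = _rc                            // nonzero: the post was rejected
display "return code from post: `rc'"
postclose `memhold'
```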










This is the error you will get:





3. Listing the variables in one order in the postfile line and the expressions in a different order in the post line;


For example,
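A minimal sketch of swapped expressions, again assuming a handle named memhold (illustrative). Both variables are numeric in nlsw88, so nothing fails; the values simply end up in the wrong columns.

```stata
sysuse nlsw88, clear
tempname memhold
postfile `memhold' age1 race1 using results1, replace
post `memhold' (race[1]) (age[1])   // swapped: race lands in age1, age in race1
postclose `memhold'

use results1, clear
list, clean                         // age1 now holds the race code
```

Listing the posted dataset right after postclose and comparing a row or two against the source data is a cheap way to catch this.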










This is the type of error that Stata may not "notice", because nothing is syntactically wrong. But your new dataset will be erroneous: the values will be stored under the wrong variable names.

4. Forgetting to use the closing statement postclose `tempname'. In this case, Stata will not warn you, but your new dataset will contain no observations.
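If you suspect a postfile was left open, the two postutil commands from the syntax section can help. The sketch below is a minimal illustration, assuming a handle named memhold (illustrative) whose postclose line was omitted.

```stata
sysuse nlsw88, clear
tempname memhold
postfile `memhold' age1 race1 using results1, replace
post `memhold' (age[1]) (race[1])
// postclose was forgotten here
postutil dir      // lists postfiles that are still open
postutil clear    // closes every open postfile
```

Running postutil clear at the top of a do-file is also a handy way to start from a clean state after an earlier run stopped with an error.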