STATA Capacity Building March 8

The document provides instructions for getting started with STATA and analyzing UK Labor Force Survey (LFS) data. It discusses starting and exiting STATA, the STATA windows interface, setting the memory and working directory, opening and saving datasets and log files, creating and running do-files to store commands, and descriptive statistics commands like describe and list.

Uploaded by

Mohd Aqmin

STATA: Getting Started

(Application on UK LFS Data)

0. Some Useful Suggestions Before Starting


Help : use the STATA Viewer or type help followed by the command name
Syntax : STATA is sensitive to upper/lower case and to spaces
Abbreviations : you can abbreviate commands (e.g. gen for generate or sum for
summarize)
Keep track : always record the "history" of your work in a log file (see
sub-section 1.4)
Use shortcuts : use a do-file (see section 2) to rerun sequences of commands
Raw dataset : always keep a copy of the raw dataset, i.e. save your work under a
different name

1. Starting and exiting STATA


Start STATA via the Start menu, Start/All Programs/Statistics/Stata 13.0.

1.1 STATA Windows


When you start STATA, it opens four windows for you. These are titled ‘Review’,
‘Variables’, ‘Stata Command’, and ‘Stata Results’.

• Commands to STATA are issued from the Stata Command window (lower right hand
side), which is a single-line window at the bottom of the screen with the cursor blinking
in it. A command is issued by pressing 'enter'.
• Issued commands are accumulated in the Review window (upper left hand side). You can
reuse a previously issued command without retyping it, by clicking on it in the Review
window with your mouse or by pressing the 'page up' key. You can then edit the
command as needed before issuing it.
• The Variables window (upper right hand side) lists the currently defined and usable
variables stored in memory. When typing commands you can simply click the small
curly arrow on the left hand side of a variable name instead of typing the name.
• The commands you issue will produce STATA results (or STATA error messages).
These results will appear in the Stata Results window (upper centre). Most commands
can be shortened up to the point where no two commands have the same abbreviation.

1.2 Memory
If you use STATA 13.0, there is no need to adjust the memory; STATA adjusts it
automatically to the size of the data. Automatic memory management was introduced in
STATA 12, so you need to set the memory yourself only in STATA 11 or earlier.
In those versions STATA allocates a fixed amount of memory by default (1Mb, i.e.
1,024Kb, in the setup described here). This is sufficient only for small datasets. To
increase the memory size to, e.g., 20Mb, type:
set memory 20m

The allocation of memory is extremely important. You should assign a quantity of memory
that is sufficient to work with the data but does not absorb too many resources from your
PC. For big datasets, as a rule of thumb, allocate 1 to 1.3 times the size of the file. If the
memory allocated is not sufficient to process the commands (STATA will display a
message), close the file, assign a larger quantity of memory, and then open the file again.
We may also need to set the maximum number of variables that can be included in any of
Stata's estimation commands (i.e. the number of right-hand-side variables). This is
important for regression specifications with many independent variables. For Stata/MP and
Stata/SE, the default value is 400 and the upper limit is 11,000. For Stata/IC, the initial
value is also 400 and the upper limit is 800. The value may be changed upward or
downward with the set matsize command. For example, in Stata/MP or Stata/SE (the value
1000 exceeds the Stata/IC limit), we type:
set matsize 1000

1.3 Working Directory


You can change your current working directory to a specified drive and directory. This
is useful if you work with multiple datasets or do-files. You can change it by
typing:

cd "drive/work_folder/directory_name"

or you can simply click File > Change Working Directory and then choose your desired
working directory.
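
As a small illustration of the commands above, the current working directory can be displayed and reused. This is only a sketch: the macro name here is our own choice, and the cd command below changes to the directory you are already in, so it has no side effects.

```stata
* show the current working directory
pwd
* store it in a local macro (the name "here" is an arbitrary choice)
local here "`c(pwd)'"
* change to the stored directory; quotes protect paths that contain spaces
cd "`here'"
```

Run the three lines together (for example from a do-file), since a local macro exists only within the run that defines it.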

1.4 Log File


In order to keep the results of a STATA session, we can open a “log-file” before carrying
out any procedure. A log file is an output file that automatically stores all commands and
results. Use the “.log” extension rather than the default “.smcl”; a plain-text log is easier
to edit in Word.
We create and open a new log file by:
log using "drive/work_folder/log_name.log"

We close this file after having carried out the procedures by the command:
log close

You can also use the fourth icon in the tool bar, or the menus: File > Log > Begin starts a
new log file and File > Log > Close ends the currently active log file.

Once a log file is open, we may interrupt the storing of information to the log file with
the command -log off- and later resume the log session with -log on-. If we
want to append the results of a new session to an existing log file, we type:
log using "drive/work_folder/log_name", append
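
The log commands above can be combined into one short session. The sketch below is illustrative only: the file name session_example.log is made up, and Stata's bundled auto example dataset stands in for the workshop data.

```stata
* close any log that happens to be open (capture suppresses the error if none is)
capture log close
* open a plain-text log; replace overwrites an existing file of that name
log using "session_example.log", text replace
sysuse auto, clear
describe
* pause logging, run something that should not be recorded, then resume
log off
list make in 1/3
log on
summarize price
log close
```

The resulting file is written to the current working directory and can be opened in any text editor.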

1.5 Dataset
Data files in STATA format have the extension “.dta”. To open such a file, type:
use "drive/work_folder/file_name.dta"

at the STATA prompt. You can omit the extension, and if the file is in the current working
directory the file name alone is enough. The current directory is displayed in the bottom
left corner. If you need only specific variables (e.g. var1, var2, and var3), type:
use var1 var2 var3 using "drive/work_folder/file_name.dta"

Alternatively, you can open files using the menu by clicking File > Open.
If you have another type of dataset, e.g. from Microsoft Excel, you can copy the data from
that source into the STATA Data Editor by clicking Data > Data Editor > Data Editor
(Edit) or typing:
edit

This opens the Data Editor window, where you can paste the data into the cells
provided.

Prepare the data in Excel for conversion:
• Make sure that missing data values are coded as empty cells or as numeric values
(e.g., 999 or -1). Do not use character values (e.g. -, N/A) to represent missing
data.
• Make sure that there are no commas in the numbers. You can change this under the
Format menu, then select Cells.
• Make sure that variable names appear only in the first row of your
spreadsheet. Variable names should be 32 characters or less, start with a letter and
contain no special characters (e.g. ‘$’ or ‘&’) except the underscore ‘_’. You should
eliminate embedded blanks (spaces).
• Under the File menu, select Save As. Then choose Save as type Text (tab delimited).
The file will be saved with a .txt extension.
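
A tab-delimited text file prepared this way can also be read directly instead of pasted. The sketch below uses import delimited, which is available from Stata 13 onward (earlier versions use insheet). The file name example_tab.txt is made up, and the file is created from the bundled auto dataset purely so that the example is self-contained.

```stata
* build a small tab-delimited file to stand in for the Excel export
sysuse auto, clear
keep make price mpg
export delimited using "example_tab.txt", delimiter(tab) replace

* read it back: the first row holds the variable names
import delimited using "example_tab.txt", delimiters("\t") varnames(1) clear
describe
```

With a real export, only the import delimited line is needed, pointing at your own .txt file.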

Now we use the UKLFS. Go to the working directory and double-click data1.dta.
Alternatively, you can type the following command in the command bar:
use "drive/…/Workshop Data/data1.dta"

Note that it is important to keep the raw data unchanged as a backup and to save the
working file under a different name, e.g. data1_v1.dta. You can do this by clicking
File > Save As and saving the file in the working directory as data1_v1.dta, or by typing:
save "drive/…/Workshop Data/data1_v1.dta", replace

Note that the ‘replace’ option is necessary only if the file already exists. You can omit
‘, replace’ when saving a file for the first time.

To finish a STATA session, you can exit STATA and discard all the data stored in
memory by typing:
exit, clear

2. do-files
STATA is a program that works interactively. However, instead of typing and executing
every command separately, you can store all the commands in a text file (default
extension “.do”). A do-file also keeps track of all the commands used to create variables,
run regressions, and so on. You type the commands once and save them, and you can then
rerun them as many times as you like.

You can create your own do-file by clicking the do-file editor icon on the toolbar or by
clicking File > New Do-File. A do-file editor window will pop up. In this window, you
can type all the commands that you need to analyse your data. Once you finish composing
your do-file, you can save it by clicking File > Save As and saving it to your working
folder, or by clicking the save icon at the top of the do-file editor.
If you have already been given a do-file, the commands in that do-file can be
executed by typing:
do "drive/work_folder/do-file_name"

or, in the do-file editor window, run all the commands by clicking the run icon. You can
also run selected commands by highlighting the commands you want to run and clicking
the corresponding icon.
STATA will ignore all characters between /* and */. It will also ignore all lines
that start with an asterisk, “*”. This is useful for storing comments and notes in your
do-files so that you can recall what the objective of a specific command was.
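
A minimal do-file using both comment styles might look like the sketch below. The use of the bundled auto dataset (loaded with sysuse auto) is an assumption made so that the file runs without the workshop data.

```stata
/* A minimal do-file sketch.
   Everything between the comment markers is ignored by STATA. */
sysuse auto, clear

* summary statistics for two variables
summarize price mpg

/* the next command is disabled because it sits inside a comment
list make price in 1/5
*/
```

Saving these lines as a file with the .do extension and running it with the do command executes only the sysuse and summarize commands.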

3. Descriptive Statistics
3.1 Describe
The describe or des command displays the content of the dataset. If used without any
specification it provides both general information (file name and location, number of
observations and variables, size and date of creation) and the full description of the
variables (name, field type, format and labels).
des

From the UKLFS data, suppose that you want to describe employment and age, which are
represented by the variables empl and age respectively. You type:
des age empl

3.2 List
In order to see the values of each variable for single observations, we can use the list
command. Suppose that you want to see the occupation and age of each respondent, where
age is represented by the variable age and occupation by the variable
occup. The command to be typed is:
list occup age

To stop a running command, click on the Break icon (the red circle with a white cross).

3.3 Codebook
The codebook command is a great tool for getting a quick overview of the variables in the
data file. It produces a kind of electronic codebook from the data file. Suppose that you
want to observe the region in the UKLFS data, you can type the following command in the
command window:
codebook uresmc
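
The three commands introduced so far can be tried together on any dataset. The sketch below assumes Stata's bundled auto dataset rather than the UKLFS file, so the variable names (make, price, foreign) differ from those in the workshop data.

```stata
sysuse auto, clear
* storage type, display format and labels of two variables
describe price foreign
* values of two variables for the first five observations only
list make price in 1/5
* range, missing values and value labels of a categorical variable
codebook foreign
```

Restricting list with in 1/5 keeps the output short; without it every observation is printed.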

3.4 Rename
The rename command changes the name of an existing variable from old_varname to
new_varname; the contents of the variable are unchanged. For example, to rename male
to gender:
rename male gender

Stata has two types of variable format: string and non-string (byte, int, long, float, or
double). String variables are recorded as characters, while non-string variables are
recorded as numbers. As a consequence, we cannot calculate summary statistics for
variables recorded in string format, while we can do so for non-string
ones.
In the UKLFS data, the variables used here are recorded in non-string format. Suppose
that we want to convert a variable into string format. We use the tostring command by
typing:
tostring educ, replace

or, to convert several variables at once, list them all:


tostring empl yearsukold ref, replace

Conversely, to convert string variables to non-string format, the command is destring.
For example, to put the above variables back into non-string format, we type:
destring educ empl yearsukold ref, replace
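
The round trip between the two formats can be checked on a small example. This sketch assumes the bundled auto dataset and uses the generate() option instead of replace, so that the original variable stays untouched; the names mpg_str and mpg_num are made up.

```stata
sysuse auto, clear
* numeric -> string copy
tostring mpg, generate(mpg_str)
* summarize reports 0 observations for a string variable
summarize mpg_str
* string -> numeric again
destring mpg_str, generate(mpg_num)
* the round trip should reproduce the original values
assert mpg_num == mpg
```

Using generate() rather than replace is a cautious choice while learning, because the original variable remains available for comparison.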

3.5 Summarise
The summarize or sum command calculates and displays a variety of summary statistics.
If no variable list is specified, summary statistics are calculated for all variables in
the dataset. To restrict the output to a list of variables, e.g. age (age), region (uresmc)
and average wage (ave_wage_l), you should type:
sum age uresmc ave_wage_l

This provides each variable's number of observations, mean, standard deviation, minimum
and maximum. If you want to know the median, other percentiles, skewness, etc. of a
variable, add the detail or d option at the end of the command:
sum age uresmc ave_wage_l, d
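
After summarize runs, its results are also stored and can be reused in later commands. The sketch below assumes the bundled auto dataset; r(mean) and r(p50) are stored results of summarize, the latter available only with the detail option.

```stata
sysuse auto, clear
summarize price, detail
* stored results can be displayed or used in calculations
display "mean = " r(mean) "   median = " r(p50)
* show everything summarize has stored
return list
```

The stored results are overwritten by the next command that stores results of its own, so copy any value you need into a variable or macro first.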

3.6 Tabulate
The tabulate or tab command is very useful for categorical variables, which are best
described in frequency tables. For example, if you want to cross-tabulate ethnicity (ethf)
and employment (empl), you should type:
tab ethf empl

You can also show percentages in each cell, by row, by column, or both, by adding the
row and/or column options at the end of the command:
tab ethf empl, row column

What are the differences in average wage between regions? We have (at least) two methods
of obtaining a first idea:
sort uresmc
by uresmc: sum ave_wage_l

or, in one step, combine sort with by (bysort or bys):


bysort uresmc: sum ave_wage_l

There are numerous ways to produce such tabulations; another example is:


tab uresmc, sum(ave_wage_l)
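
The three approaches above can be compared side by side. The sketch below assumes the bundled auto dataset, with foreign playing the role of the grouping variable and price the role of the wage variable.

```stata
sysuse auto, clear
* two-step version: sort first, then summarize by group
sort foreign
by foreign: summarize price
* one-step version
bysort foreign: summarize price
* compact table of group means, standard deviations and frequencies
tabulate foreign, summarize(price)
```

The by prefix requires sorted data, which is why the two-step version needs the explicit sort and bysort does not.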

3.7 Conditions
Sometimes you need to restrict a STATA operation to a certain group of observations.
For example, you want to see the tabulation between ethnicity (ethf) and employment
(empl), but only for those who are 20 to 40 years old. An if qualifier is then
useful and is typed as follows:
tab ethf empl if age>=20 & age<=40

Or suppose you want to see the tabulation between ethnicity (ethf) and employment (empl)
only for those who are exactly 35 years old. The command would be:
tab ethf empl if age==35

Note that the if qualifier can be used with most other commands. These are some
frequently used operators in if conditions:
o & : and
o | : or
o == : equal to
o != : not equal to
o > : greater than
o < : less than
o >= : greater than or equal to
o <= : less than or equal to
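
The operators can be tried out with count, which simply reports how many observations satisfy a condition. The sketch below assumes the bundled auto dataset. It also illustrates one point worth knowing: Stata treats a missing numeric value as larger than any number, so conditions using > or >= should usually exclude missing values explicitly.

```stata
sysuse auto, clear
* cars with mileage between 20 and 30 mpg (inclusive)
count if mpg >= 20 & mpg <= 30
* foreign cars or cars priced above 10,000
count if foreign == 1 | price > 10000
* rep78 has missing values: exclude them, or they would count as "> 3"
count if rep78 > 3 & !missing(rep78)
```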

3.8 Weight

In most data sets, you need to use weights to correct for the sampling design or to scale
the sample up to the population. The weights used here are fweight and iweight. Note
that fweight accepts only integer weights, so we first round the weight variable
(pw) by generating a new variable named pw_n.
gen pw_n=round(pw)
tab ethf [fweight=pw_n]
tab ethf [iweight=pw]
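
The same steps can be rehearsed without the survey file. In the sketch below the weight variable pw is entirely made up (it is derived from mpg in the bundled auto dataset) and has no meaning as a sampling weight; it only shows the mechanics of rounding and of the two weight types.

```stata
sysuse auto, clear
* made-up weight for illustration only: not a real sampling weight
generate double pw = mpg / 10
* fweight needs integers, so round first
generate long pw_n = round(pw)
tabulate foreign [fweight = pw_n]
tabulate foreign [iweight = pw]
```

Because of the rounding, the two tables will generally show slightly different weighted totals.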

4. Generating variables and replacing variables


In order to generate new variables you can use the generate or gen command. To generate
the sum, product or ratio of two existing variables (var1 and var2), we type one of:
gen newvar = var1 + var2
gen newvar = var1 * var2
gen newvar = var1 / var2

You can also assign the outcome of these formulas to an existing variable (oldvar),
overwriting its contents, with the replace command:
replace oldvar = var1 + var2
replace oldvar = var1 * var2
replace oldvar = var1/var2

Suppose that you want to create an age-squared variable (we will name it age_sq). Then
you should type:
gen age_sq = age * age

or we can also type:


gen age_sq = age^2

It is best to assign a label to each variable, so that you know what it stands for. This label
is displayed by the describe command and is shown next to the variable name
in the Variables window. You can do it by typing (note the straight double quotes):
label var age_sq "Age Squared"
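
As a worked illustration of generate, replace and label together, the sketch below assumes the bundled auto dataset; the variable name mpg_sq and its label are our own choices.

```stata
sysuse auto, clear
* new variable: square of an existing one
generate mpg_sq = mpg^2
label variable mpg_sq "Mileage squared"
* replace overwrites an existing variable; here price is rescaled to thousands
replace price = price / 1000
describe mpg_sq price
```

Because replace changes the data in memory irreversibly, it is safest to try it on a copy of the data or after saving the working file under a new name.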

Note that we have several categorical variables where each category represents one
group of observations. For example, in the employment variable (empl), value 0 represents
unemployed and value 1 represents employed.
It is easy to get confused about which value represents which group. Hence it is
useful to label the values of dummy variables, also using the label command:
label define empl 0 "Unemployed" 1 "Employed"
label values empl empl
Now tabulate to view:
tab empl
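
The two-step pattern (define a set of labels, then attach it to a variable) can be sketched on another dataset. The example assumes the bundled auto dataset; the dummy expensive, the threshold of 6,000 and the label name expensive_lbl are all made up for illustration.

```stata
sysuse auto, clear
* dummy equal to 1 for cars priced above 6,000 (price has no missing values)
generate byte expensive = price > 6000
* step 1: define the value label; step 2: attach it to the variable
label define expensive_lbl 0 "6,000 or less" 1 "Above 6,000"
label values expensive expensive_lbl
tabulate expensive
```

The label name does not have to match the variable name, so one label definition can be attached to several variables.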

STATA also offers the possibility to create variables that contain statistical information.
The egen command may be used to create variables that store descriptive statistics like the
mean, sum, maximum and minimum of other variables in your data. The command egen
extends the functionality of generate. For example, the following creates a new variable
containing the mean of the average wage (ave_wage_l) by region (uresmc); the value is
constant within each region.
bys uresmc: egen avg_wage = mean(ave_wage_l)

To view the distribution of the new variable, use summarize with the detail option:


sum avg_wage, d

Other than the mean, we can also use egen to calculate the median (median), maximum
value (max), minimum value (min), percentiles (pctile), totals (total, or sum in older
versions), etc. To learn more about egen, type:
help egen
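
A group statistic created with egen is typically used to compare each observation with its group. The sketch below assumes the bundled auto dataset, with foreign as the grouping variable; the new variable names are made up.

```stata
sysuse auto, clear
* group mean and group median of price, repeated on every row of the group
bysort foreign: egen avg_price = mean(price)
bysort foreign: egen med_price = median(price)
* flag cars priced above their own group's mean
generate byte above_avg = price > avg_price
tabulate foreign above_avg
```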

4.1 Dummies
Suppose we want to create a race dummy where white equals 1 and non-white equals 0.
First let’s view the ethnicity (ethf) variable
tab ethf

To generate a white dummy


gen white=1 if ethf==1
replace white=0 if ethf !=1

Alternatively, to create a white dummy


gen white_new=(ethf==1)

To check that both dummies are the same:
tab white
tab white_new

A shortcut is to use the wildcard *. All variables whose names begin with white will be used.
sum white*
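
Both ways of creating a dummy shown above assign 0 to observations where the source variable is missing. If such observations should stay missing instead, an if qualifier can be added. The sketch below assumes the bundled auto dataset, where rep78 has a few missing values; the variable names are made up.

```stata
sysuse auto, clear
* as in the text: observations with missing rep78 are coded 0
generate byte top_naive = (rep78 == 5)
* alternative: leave the dummy missing where rep78 is missing
generate byte top = (rep78 == 5) if !missing(rep78)
* the missing option shows where the two versions differ
tabulate top_naive top, missing
```

Which version is appropriate depends on the analysis; the point is only to make the treatment of missing values a deliberate choice.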

Suppose you want to generate dummies for a variable with more than two categories, for
example regions (uresmc). First tabulate to see the variable:
tab uresmc

To create the dummies, use the following command, where d_region is the stub (prefix)
of the new variable names:
tab uresmc, gen(d_region)

Nineteen dummies are created: d_region1 to d_region19.

We can now observe the proportion of whites and non-whites who live in London
(d_region8) by typing:
tab white d_region8
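
The generate() option of tabulate can be tried on a smaller example. The sketch below assumes the bundled auto dataset, where rep78 has five categories, so five dummies are created; the stub d_rep is made up.

```stata
sysuse auto, clear
tabulate rep78, generate(d_rep)
* d_rep1 ... d_rep5 now exist; each is 1 for one category and 0 otherwise
describe d_rep*
* the mean of each dummy is the share of observations in that category
summarize d_rep*
```

Observations with a missing value of the source variable are missing in all of the dummies.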

Suppose that we want to classify people according to whether their wage (ave_wage_l) is
below or above the average (avg_wage), but only for whites, i.e.
white==1. For everyone else the new variable is left missing.
gen stat_wage=.

At this stage, all observations have missing values for the stat_wage variable. Next we
replace that variable according to the condition above.
replace stat_wage=0 if ave_wage_l<=avg_wage & white==1
replace stat_wage=1 if ave_wage_l >avg_wage & white==1
tab stat_wage

At this stage, those whose wage is below or equal to the average get 0 in
stat_wage, those whose wage is above the average get 1, and those who are
not white are left missing. The next step is to assign a label to the variable and labels to
its values.

label variable stat_wage "Whether wage above or below average"
label define stat_wage 0 "Whites and below Average" 1 "Whites and above Average"
label values stat_wage stat_wage

tab stat_wage ethf,m

4.2 Higher Degree Table


For an analysis of more categories, you can use the table command. Unlike tabulate, which
works for at most two variables, table can present a tabulation of three or more
variables. For example, if you want to see the frequencies by white (white), living
in London (d_region8), and employment status (empl), you should type:
table white d_region8 empl, content(freq)

5. Keeping and dropping variables and observations


5.1 Keep and Drop
We can select variables and observations of a dataset by using the keep or drop commands.
Suppose we have a dataset with 6 variables: var1, var2, …, var6.
We would like to produce a file containing only three of them, say var1, var2, and var3.
You either type:
keep var1 var2 var3
or
drop var4 var5 var6

and then save the results to a new file to preserve the raw data.
If the variables are stored in the sequence var1 to var6, we achieve the same result by typing:
keep var1-var3
or
drop var4-var6

To select only certain observations, the keep and drop can be combined with logical
statements. Let us assume we would like to retain all observations for which var1 equals
one. We can either type:
keep if var1 ==1
or

drop if var1 !=1

The if expression can also combine several conditions, for example, selecting
all those who are older than 20 but younger than 60 years old:
keep if age>20 & age<60

Observations can also be dropped by position, using in:


drop in 1/20

would delete the first 20 observations in your data.
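
The effect of keep and drop is easy to verify with count. The sketch below assumes the bundled auto dataset, which contains 74 cars, 22 of them foreign.

```stata
sysuse auto, clear
* keep four variables, then only the foreign cars
keep make price mpg foreign
keep if foreign == 1
* drop the first five of the remaining observations
drop in 1/5
* 22 foreign cars minus 5 leaves 17
count
```

Since none of these steps can be undone, the result should be saved under a new file name rather than over the raw data.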

5.2 Preserve & Restore


The preserve command saves a temporary copy of the data, guaranteeing that the data
can be restored later. For example, to drop those older than 50 years old temporarily:
preserve
drop if age>50
To restore the data afterwards:
restore
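
A quick way to see preserve and restore at work is to count the observations before and after. The sketch below assumes the bundled auto dataset; the price threshold is arbitrary.

```stata
sysuse auto, clear
preserve
* temporary change: drop the more expensive cars
drop if price > 5000
count
* bring back the full dataset
restore
count
```

The second count shows the original number of observations again. When preserve is used inside a do-file, the data are also restored automatically when the do-file ends.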

Exercises
1. Open data
2. Open do file
3. Start a log file
4. Describe your data
5. Destring all variables in string format
6. Learn about each variable using summarise, codebook, tabulate
7. Using ave_wage_l,
a. What is the average wage of people who live in London?
b. What is the average wage of white people?
c. What is the average wage per occupation?
d. What is the median wage per occupation?
8. Create age dummies into four categories: (i) Below 30; (ii) 30 – 39; (iii) 40 – 49;
and (iv) 50 and above.
9. Describe employed vs unemployed people in terms of: (i) region and (ii) average
age.
