Introduction to Basic STATA
By:
Dr. Zerihun Temsas
May 2023
Outline
1. Meaning and Importance of STATA
2. Importing and Saving Data in STATA
3. Data Management
4. Graphs Using STATA
5. Descriptive Data Analysis using STATA
6. Two groups mean comparison test
7. Regression Analysis Using STATA
6/8/2023 2
Scientific Research Process
Theoretical & Empirical literature
Statement of the problem
Research Questions
Research Objectives
Research Design (Quantitative or Qualitative)
Sampling Design
Research Proposal
Data Collection Data Analysis (Software)
Report writing
1. Meaning and Importance of STATA
Stata is a general-purpose statistical software package
created in 1985 by StataCorp.
Most of its users work in research, at universities or
other research institutions.
Stata is used for data management, statistical
analysis, graphics, and regression analysis.
Meaning and Importance of STATA
Stata is a multi-purpose statistical package to help
you explore, summarize and analyze datasets.
A dataset is a collection of variables with their
respective values (usually arranged in columns).
A variable is a characteristic that can assume one or
several values.
Stata is widely used statistical software in
universities.
Cont…
Results Window
All output appears in this window.
Only graphics appear in a separate window.
Review Window
Previously used commands are listed here and can
be transferred to the Command window by
clicking on them.
Command Window
This is the command line where commands are
entered for execution.
Variables Window
The variables of the dataset currently in memory are listed here.
Cont…
There are four basic Stata windows:
a. Results Window
b. Review Window
c. Command Window
d. Variables Window
Cont…
Stata is mostly a command-driven package.
Although the newest versions also have pull-down menus
from which different commands can be chosen, the best
way to learn Stata is still by typing in the commands.
But, sometimes the exact syntax of a command is hard to
get. (Example: Graphs)
In these cases, we often use the menu-commands to do it
once and then copy the syntax that appears.
STATA Buttons
The most important button functions are the following:
Open: Opens an existing data file.
Save: Saves the current data file.
Print results: Prints the contents of the Results window.
Log begin: Opens a log file, which is used to save results.
New Viewer: Opens a Viewer window, which provides help on Stata commands and
rules.
Do-file Editor: Opens a new instance of the Do-file Editor
(same as doedit).
Data Editor: Opens the Data Editor window (same as edit).
Data Browser: Opens the Data Browser (same as browse).
Button functions cont…
Variables Manager: This button is used to label variables
and values, that is,
to give additional information about a variable name and to label
the values of categorical variables.
[Figure: the Stata toolbar, with the Open, Log begin, New Viewer, Variables Manager, Do-file Editor, Data Editor, and Data Browser buttons labelled]
Stata Menu Bars
Stata displays 8 drop-down menus across the top of the outer window, from left
to right:
1. File
• Open: open a Stata data file (use)
• Save/Save as: save the Stata data in memory to disk
• Do: execute a do-file
• Print: print log or graph
• Exit: quit Stata
Menu Bar Cont…
2. Edit
• Paste: to paste commands into the Command window
• Copy table: to copy a table from the Results window to another file
• Preferences: to change the colors of the Results window
3. Data
• Describe
• Open Variables Manager
• Create new variable
• Sort data
• Open Data Editor
Menu Bar Cont…
4. Graphics: contains the Stata commands for graphs
5. Statistics: contains most of the Stata commands for
descriptive and inferential data analysis
6. User: used for user-supplied
Stata commands
7. Window: used to bring the Variables window,
Command window, and Review window to the
front
8. Help: used to find Stata commands, tutorials, and guidelines
2. Data Importing and Saving
2.1 Data Entry
There are different ways of reading or entering data into Stata:
A. By using the “use” command
B. By using the Stat/Transfer software
C. By copying and pasting from Excel
D. By typing directly into the Data Editor
A. Importing Data via the “use” command
Datasets that are in Stata format can be opened with the “use”
command as follows:
– Syntax: use filename.dta
– Example: use migration
Use command Cont…
You can open selected variables of a file using a variable list, and you
can open selected records of a file using if or in.
Examples:
• use migration: opens the file migration.dta for analysis
• use migration if female==1: opens data only for female
migrants
• use migration in 1/150: opens records 1 through 150 of the
migration file
• use obsn age female using migration: opens three variables
from the migration file
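The variants of use listed above can be tried out with the following sketch, meant to be run from a do-file. It assumes that the auto example dataset shipped with Stata stands in for migration.dta, and auto_copy.dta is a hypothetical file name chosen only for this illustration.

```stata
* Assumption: the shipped "auto" dataset replaces migration.dta,
* and auto_copy.dta is a made-up file name for this sketch.
sysuse auto, clear
save auto_copy.dta, replace

use auto_copy, clear                        // the whole file
use auto_copy if foreign == 1, clear        // only the foreign cars
use auto_copy in 1/20, clear                // records 1 through 20
use make price mpg using auto_copy, clear   // three variables only
describe
```

The clear option is added each time because a dataset is already in memory when the next use command runs.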
Cont…
The clear option clears the dataset currently in memory before opening the other one.
Example: use filename.dta, clear
B. Stat/Transfer program
Stat/Transfer is a separate software package that converts data from Excel
format directly into Stata format, so that the data can then be viewed in the Data Editor.
C. Copy-and-paste
If you can open the data in Excel, you can usually copy and paste the data into the Stata
Data Editor.
All you need to do is select the columns in Excel, copy them, open the Stata Data Editor,
and paste.
Cont…
D. Manual typing
Manually typing in the data is the tedious last resort: if the data
are not available in electronic format, you may have to type them in
manually.
Start Stata and use the edit command. This brings
up a spreadsheet-like window where you can enter new data or edit
existing data.
Cont…
2.2 Saving Commands and Outputs
A. Saving File/Data
Finally, the data is saved with the save command:
– Syntax: save filename.dta [, options]
– Example: save migration.dta, replace
To see your working directory, type
– pwd
– C:\Users\hp-6570b\Desktop\Document
Saving Data
You can also change the location of working directory
using the following Stata command.
cd "C:\Users\hp-6570b\Desktop\Software_Training"
B. Saving Outputs/Results
Saving the Output
– Stata Results window does not keep all the output you
generate.
– It only stores about 300-600 lines, and when it is full, it
begins to delete the old results as you add new results.
– Thus, we need to use log to save the output
Log File Cont…
A log file is a sort of built-in tape recorder for Stata, from which you can:
1) retrieve the output of your work and
2) keep a record of your work.
– Example: log using stata_result.log
saves the output in a file named stata_result.log.
This creates the file ‘stata_result.log’ in your working directory.
To close a log file, type:
log close
Log File Cont…
To add more output to an existing log file, add the option append:
– Example: log using stata_result.log, append
adds the output to an existing file named stata_result.log.
To replace a log file, add the option replace:
– Example: log using stata_result.log, replace
overwrites the existing file.
Note that the option replace will delete the contents of the previous
version of the log.
Log File Cont…
log off
This command temporarily turns off the logging of output.
log on
This command restarts the logging.
log close
This command turns off the logging and saves the file.
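A complete logging session that combines the log commands above might look like the following sketch. It assumes the auto example dataset shipped with Stata and writes stata_result.log to the current working directory.

```stata
* Assumption: the shipped "auto" dataset is used for illustration.
capture log close                      // close any log that is still open
log using stata_result.log, replace    // start (or overwrite) the log

sysuse auto, clear
summarize price                        // written to the log

log off
describe                               // not written to the log
log on

tabulate foreign                       // written to the log again
log close                              // stop logging and save the file
```

The capture prefix only suppresses the error that log close would raise when no log is open.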
C. Saving Commands
The Do-file Editor allows you to store a set of commands,
which makes it easier to check and fix errors.
It allows you to run the commands later and lets you show
others how you got your results (for example, your advisor may
want to know how you got a result).
Do File Cont…
In general, any time you are running more than 10 commands to
get a result, it is easier and safer to use a Do-file to store the
commands.
To open the Do-file Editor, you can click on Windows/Do-file
Editor or click on the envelope on the Tool Bar.
Do File Cont…
To run the commands in a Do-file, you can click on the Do
button (the second-to-last one) or click on Tools/Do.
If you want to run one or just a few commands rather than
the whole file, mark the commands and click on the Do
button
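As an illustration, the contents of a small do-file could look like the sketch below. The file name example.do is hypothetical, and the auto example dataset shipped with Stata is assumed in place of your own data.

```stata
* example.do (hypothetical file name): a short, repeatable analysis
clear all
set more off                     // do not pause when the screen fills up

sysuse auto, clear               // assumption: shipped example dataset
generate price_sq = price^2      // create a new variable
summarize price price_sq         // descriptive statistics
```

If the lines are saved as example.do in the working directory, typing do example.do in the Command window runs all of them in order.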
3. Data manipulation
3.1 Generate New Variables
New variables are created with the generate command.
The arithmetic operators are:
– + Addition
– - Subtraction
– * Multiplication
– / Division
– ^ Power
Data Manipulation Cont…
We will see how to explore data using existing variables in the next
section.
Now we will discuss how to create new variables.
When new variables are created, they are in memory and they will
appear in the Data Browser, but they will not be saved on the hard-
disk unless you use the save command.
Examples: generate age_sq = age^2
gen log_y = log(y)
Data Manipulation Cont…
Thus, the generate command is used to create a new variable. It is
similar to “compute” in SPSS.
The syntax is:
generate newvar = exp [if exp]
where “exp” is an expression, as in
generate rem2 = rem/y
generate S2 = S/Y
gen total_rem = expr*rem
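The generate syntax above, including the optional if condition, can be sketched as follows. The auto example dataset shipped with Stata is assumed, and the new variable names are made up for the illustration.

```stata
* Assumption: the shipped "auto" dataset; new variable names are illustrative.
sysuse auto, clear
generate weight_sq = weight^2               // power
gen log_price = log(price)                  // natural logarithm
generate price_per_lb = price/weight        // division
generate mpg_heavy = mpg if weight > 3000   // missing when the condition is false
summarize weight_sq log_price price_per_lb mpg_heavy
```

With an if condition, observations that do not satisfy the condition receive a missing value in the new variable.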
3.2 Replace
The values of existing variables can be changed with
the replace command.
It works like the generate command and also expects an
expression.
– Syntax: replace oldvar = exp [if exp]
– Example: replace S = S/Y
Data Manipulation Cont…
Drop
Variables or observations can be deleted using the drop
command.
– Syntax: drop varlist (deletes variables)
– Syntax: drop if exp (deletes observations)
– Example: drop age female city
Keep
This command works in the opposite way to drop: it keeps the listed variables or
observations and deletes all others.
– Syntax: keep varlist
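The replace, drop, and keep commands described above can be combined as in this sketch, which assumes the auto example dataset shipped with Stata. The variable price_k is a hypothetical name.

```stata
* Assumption: the shipped "auto" dataset; price_k is an illustrative name.
sysuse auto, clear
generate price_k = price
replace price_k = price_k/1000     // change existing values: price in thousands

drop trunk turn                    // delete two variables
drop if missing(rep78)             // delete observations with missing rep78
keep make price price_k mpg rep78 foreign   // keep only these variables
describe
```

None of these changes reach the file on disk until the data are saved with save.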
Data Manipulation Cont…
Recode
This command changes the values of a categorical variable
according to the rules specified.
• The syntax is:
recode varname old=new [if exp] [in range]
It can also be used to convert a quantitative variable into a categorical variable.
Data Manipulation Cont…
Here are some examples:
• recode female 1=2: changes all values female = 1 to female = 2
• recode female 1=2 0=1: changes 1 to 2 and 0 to 1
• recode female 0=1 1=0: exchanges the values 0 and 1 in female
• recode female 1=2 *=1: changes 1 in female to 2 and all other values to 1
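A sketch of both uses of recode, swapping codes and turning a quantitative variable into categories, is given below. It assumes the auto example dataset shipped with Stata, and the generate() option is used so that the original variables stay unchanged. The cut-off values are arbitrary.

```stata
* Assumption: the shipped "auto" dataset; cut-offs are arbitrary.
sysuse auto, clear

* Exchange the codes 0 and 1, storing the result in a new variable
recode foreign (0=1) (1=0), generate(domestic)

* Turn the quantitative variable mpg into three categories
recode mpg (min/19=1) (20/29=2) (30/max=3), generate(mpg_cat)
tabulate mpg_cat
```

Writing each rule in parentheses is the form recommended in the Stata documentation, and generate() avoids overwriting the original variable.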
4. Data formatting
Rename
A variable can be renamed with the rename command:
– Syntax: rename old_varname new_varname
– Example: rename age migrant_age
Label
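This command attaches a descriptive label to a variable:
– Syntax: label variable varname ["label"]
– Example: label variable Y "migrant income"
Label define
This command defines a set of value labels, which label values then attaches to a variable:
– Syntax: label values varname lblname
label define female 1 "female" 0 "male"
label values female female
label define MRST 1 "protestant" 2 "orthodox" 3 "muslim" 4 "others"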
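The renaming and labelling steps can be sketched as follows, assuming the auto example dataset shipped with Stata. The names miles_per_gallon, high_price, and yesno are hypothetical.

```stata
* Assumption: the shipped "auto" dataset; all new names are illustrative.
sysuse auto, clear
rename mpg miles_per_gallon
label variable miles_per_gallon "Mileage (miles per gallon)"

generate byte high_price = price > 6000   // 1 if price above 6000, else 0
label define yesno 0 "no" 1 "yes"         // define the value labels
label values high_price yesno             // attach them to the variable
tabulate high_price
```

Stata commands are case-sensitive and expect plain double quotes, so label must be typed in lower case and the label text enclosed in "..." rather than typographic quotes.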
5. Data Exploration
Describe
General information about the dataset can be retrieved with describe.
The command displays the number of observations, the number of
variables, and the size of the dataset, and lists all variables together with basic
information (such as storage type).
– Syntax: describe [varlist]
– Example: des age female Y
Data Exploration Cont…
Codebook
The codebook command delivers information about one or more
variables, such as storage type, range, number of unique values, and
number of missing values.
The command offers further interesting features which can be seen with
help codebook.
– Syntax: codebook [varlist]
– Example: codebook age
Sort
Data is sorted in ascending order with the sort command:
– Syntax: sort varlist
– Example: sort age
Data Exploration Cont…
Descending order can be obtained with gsort, where a minus sign in front
of a varname invokes descending order:
– Syntax: gsort [+|-] varname [[+|-] varname ...]
– Example: gsort -age
Order
The order of the variables as seen in the variable window can be changed
with the order command:
– Syntax: order varlist
– Example: order age obsn
Browse
The data browser can be opened with the browse command:
– Syntax: browse [varlist]
– Example: browse age income
Data Exploration Cont…
List
Similar to the data browser, values of variables can be listed in
the results window with the list command.
Summarize
The most important descriptive statistics for numerical variables
are delivered with the summarize command:
– Syntax: summarize [varlist]
– Example: summarize age y ls dskm
It displays the number of observations, mean, standard
deviation, minimum, and maximum.
Data Exploration Cont…
Tables of summary statistics can be drawn with table.
Tabulate
One-way frequency tables for categorical variables can be drawn with the
tabulate command:
– Syntax: tabulate varname
– Example: tabulate mrts
Two-way cross-tables for two categorical variables can be drawn with
another version of tabulate:
– Syntax: tabulate varname1 varname2
– Example: tabulate mrts female
– tab2 religion city, cell nofreq (cell percentages only, no frequencies)
– tab2 religion city, row (adds row percentages)
– tab2 religion city, column (adds column percentages)
Data Exploration Cont…
Inspect
The inspect command provides a quick summary of a numeric variable
that differs from that provided by summarize or tabulate:
– Syntax: inspect [varlist]
– Example: inspect age
It reports the number of negative, zero, and positive values;
– the number of integers and non-integers;
– the number of unique values; and
– the number of missing values; and it produces a small histogram.
Its purpose is not analytical, but it allows you to quickly gain familiarity with
unknown data.
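The exploration commands introduced so far can be tried in one pass with the sketch below. It assumes the auto example dataset shipped with Stata in place of the migration data.

```stata
* Assumption: the shipped "auto" dataset is used for illustration.
sysuse auto, clear
describe price mpg foreign        // storage type, labels
codebook rep78                    // range, unique values, missing values
sort price                        // ascending order
gsort -price                      // descending order
order foreign make                // move these variables to the front
list make price in 1/5            // the five most expensive cars
summarize price mpg weight        // N, mean, sd, min, max
tabulate rep78                    // one-way frequency table
tabulate rep78 foreign, row       // two-way table with row percentages
inspect mpg                       // quick overview with a small histogram
```

Because gsort -price is the last sorting command, the list command shows the observations with the highest prices.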
Data Exploration Cont…
If
The if qualifier is used to select certain records when carrying out data
analysis.
command if exp
Examples:
sum age dskm y if city==1
summarize if female==1
sum if city==1
sum if age<20
• Note that tests of equality in “if” expressions always use ==, not a single =.
Data Exploration Cont…
In
You can also use in to select records based on the observation number.
The syntax is:
command in range
For example:
• list in 10: lists observation number 10
• summarize in 150/300: summarizes observations 150-300
• list in -10/-1: lists the last 10 observations
• sum in -5/-1: summarizes the last five observations
The if and in qualifiers serve the same purpose as Select Cases in
SPSS.
Data Exploration Cont…
Count
The count command shows the number of
observations that satisfy an if condition. If no condition is
specified, count displays the number of observations in the data.
• count
665
• count if female==1
213
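The if and in qualifiers and the count command can be practised with the following sketch, which assumes the auto example dataset shipped with Stata.

```stata
* Assumption: the shipped "auto" dataset is used for illustration.
sysuse auto, clear
summarize price mpg if foreign == 1   // foreign cars only
summarize price if mpg < 20           // cars with low mileage
list make price in 10                 // observation number 10
list make price in -5/-1              // the last five observations
count                                 // all observations
count if foreign == 1                 // observations that satisfy the condition
```

Stata is case-sensitive, so the commands must be typed in lower case (sum, not Sum).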
Data Exploration Cont…
By
This prefix goes before a command and asks Stata to repeat the
command for each value of a variable. The general syntax is:
by varlist: command
Note: by requires the data to be sorted by varlist; bysort, which sorts the data and
runs the command in one step, is therefore most commonly used.
Example of the by prefix:
bysort female: sum
This is similar to Split File in SPSS.
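The difference between by and bysort can be seen in this short sketch, which assumes the auto example dataset shipped with Stata.

```stata
* Assumption: the shipped "auto" dataset is used for illustration.
sysuse auto, clear

* Two steps: sort first, then repeat the command for each group
sort foreign
by foreign: summarize price mpg

* One step: bysort sorts and repeats the command
bysort rep78: summarize price
```

Without the preceding sort, the by foreign: line would stop with a "not sorted" error, which is why bysort is usually more convenient.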
Data Exploration Cont…
help
The help command gives you information about any Stata command or
topic
help [command]
For example,
• help tabulate: gives a description of the tabulate
command
• help summarize: gives a description of the summarize
command
5. Graphics Using STATA
• One of the advantages of Stata is its vast graphics
capabilities.
• Some graph commands are typed without the leading graph.
• For example, a basic histogram of the variable age would be:
– Example: histogram age, frequency normal
Graphs are not saved in log files.
Graphics Cont…
graph bar (mean) age, over(female) blabel(name)
– bars labeled with the name of the plotted variable
graph bar (mean) dskm, over(female) blabel(bar)
– bars labeled with the height of the bar
graph bar (mean) age (mean) educ, over(female) blabel(bar)
Graphics Cont…
The commands that draw graphs are
– graph twoway: scatterplots, line plots
– graph matrix: scatterplot matrices
– graph bar: bar charts
– graph box: box-and-whisker plots
– graph pie: pie charts
Graphics Cont…
Examples
– graph twoway scatter educ age
We can show the regression line predicting educ from age using the lfit
plot type.
– twoway lfit educ age
The two graphs can be overlaid like this:
– twoway (scatter fs ls) (lfit fs ls)
A scatterplot matrix is drawn with:
– graph matrix age educ
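Since graphs are not saved in log files, they have to be exported separately. The sketch below draws a few of the graphs discussed above and saves the last one with graph export. It assumes the auto example dataset shipped with Stata and a Stata installation that can write PNG files; price_weight.png is a hypothetical file name.

```stata
* Assumption: the shipped "auto" dataset; the file name is illustrative.
sysuse auto, clear
histogram mpg, frequency normal                      // histogram with normal curve
graph bar (mean) price, over(foreign) blabel(bar)    // bar chart of group means
twoway (scatter price weight) (lfit price weight)    // scatter plus fitted line
graph export price_weight.png, replace               // save the current graph
```

Only the graph currently displayed, here the overlaid scatter and fitted line, is written to the file.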
Pie chart
Pie chart: used to present data for categorical
variables.
graph pie, over(religion) plabel(_all name)
graph pie, over(religion) plabel(_all sum)
Normality and outliers
Skewness and kurtosis
sum age
sum age, detail
The detail option adds skewness, kurtosis, and percentiles to the output.
Check the normality of a variable visually by looking at some basic
graphs:
histogram age
histogram age, normal
Multivariate normality test
mvtest normality age
Normality Cont…
graph box draws vertical box plots:
graph box age
The lower and upper bounds of the box are defined by the 25th and 75th
percentiles.
The line within the box is the median, and the whiskers extend to the
lower and upper adjacent values, that is, the most extreme observations within 1.5 times the interquartile range of the box. Points beyond the whiskers are plotted as outliers.
If age is normal, the median would be in the center of the box and
the ends of the whiskers would be equidistant from the box.
Normality Cont…
The kdensity command with the normal option:
kdensity age, normal
– draws a density graph of the variable with a normal distribution
superimposed on it
– is useful in checking whether the variable is normally distributed
The pnorm command produces a P-P plot:
pnorm age
– The plot should be approximately linear if the variable follows a
normal distribution
Normality Cont…
The qnorm command plots the quantiles of a variable against the
quantiles of a normal distribution:
qnorm age
The closer the Q-Q plot is to the 45-degree line, the
closer the variable is to a normal distribution.
In this example, both the P-P and Q-Q plots suggest that age is approximately normally distributed.
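The graphical checks above can be complemented by formal tests. The following sketch assumes the auto example dataset shipped with Stata; sktest and swilk are standard Stata commands that are not covered in these slides and are added here only as an illustration.

```stata
* Assumption: the shipped "auto" dataset; sktest and swilk are additions.
sysuse auto, clear
summarize mpg, detail
display "skewness = " r(skewness)   // 0 for a symmetric distribution
display "kurtosis = " r(kurtosis)   // 3 for a normal distribution

histogram mpg, normal               // histogram with normal curve
kdensity mpg, normal                // kernel density with normal curve
pnorm mpg                           // P-P plot
qnorm mpg                           // Q-Q plot

sktest mpg                          // skewness and kurtosis test of normality
swilk mpg                           // Shapiro-Wilk test of normality
```

In both tests the null hypothesis is that the variable is normally distributed, so a small p-value is evidence against normality.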
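6. Two groups mean comparison tests
The ttest command with the by() option tests whether the mean of a variable differs between two groups: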
ttest dskm, by(female)
ttest y, by(female)
ttest s, by(female)
ttest age, by(female)
ttest educ, by(female)
ttest rem, by(female)
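A two-group comparison with ttest and its stored results can be sketched as follows, assuming the auto example dataset shipped with Stata, with foreign as the grouping variable.

```stata
* Assumption: the shipped "auto" dataset; foreign defines the two groups.
sysuse auto, clear
ttest mpg, by(foreign)                       // assumes equal variances
display "two-sided p-value: " r(p)

ttest mpg, by(foreign) unequal               // allows unequal variances
display "group sizes: " r(N_1) " and " r(N_2)
```

The unequal option is a reasonable choice when the two groups may have different variances.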
7. Regression Analysis
This section describes the use of Stata to do
regression analysis. Regression analysis involves
estimating an equation that best describes the data.
One variable is considered the dependent variable,
while the others are considered independent (or
explanatory) variables.
Regression Cont…
Stata is capable of many types of regression analysis
and associated statistical tests.
In this section, we touch on only a few of the more
common commands and procedures: regress, predict, and the
post-estimation diagnostic tests.
Multiple linear Regression Analysis
GPA = f(hrs, income, pc, sex)
Regression Cont…
• regress
The regress command fits an ordinary least squares (OLS) linear regression.
– Syntax: regress depvar [indepvars]
Regression Cont…
• Some post estimation commands:
predict fv
list cgpa fv in 1/10
scatter cgpa hrs
twoway (scatter cgpa hrs) (lfit cgpa hrs)
predict e, resid
We can use the regression result to predict what the cumulative
GPA of a student with an income of 300 would be, with the other explanatory variables set to zero:
display _b[_cons]+_b[income]*300
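The regression and post-estimation steps above can be reproduced on data that every Stata installation has. This sketch assumes the auto example dataset shipped with Stata, with price as the dependent variable instead of cgpa. The values 20 and 3000 are arbitrary.

```stata
* Assumption: the shipped "auto" dataset replaces the GPA data.
sysuse auto, clear
regress price mpg weight                 // OLS regression

predict fv                               // fitted values
predict e, residuals                     // residuals
list price fv e in 1/5

* Predicted price for a car with mpg = 20 and weight = 3000 (arbitrary values)
display _b[_cons] + _b[mpg]*20 + _b[weight]*3000
```

Every coefficient, including the constant, is referenced as _b[name], so the intercept is written _b[_cons].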
Regression Cont…
Post-estimation tests of OLS regression (diagnostic tests)
– Multicollinearity
– Heteroscedasticity
– Autocorrelation
– Normality
– Model misspecification
Diagnostic Test/Post Estimation Tests
1. Tests for Normality of Residuals
– kdensity -- produces a kernel density plot with a
normal distribution overlaid.
– pnorm -- graphs a standardized normal probability
(P-P) plot.
– qnorm -- plots the quantiles of varname against
the quantiles of a normal distribution.
– mvtest normality e -- tests the normality of the residuals e saved with predict e, resid.
Diagnostic Cont…
2. Tests for Heteroscedasticity
– estat hettest -- performs the Breusch-Pagan / Cook-Weisberg test for
heteroscedasticity (hettest in older versions of Stata).
– estat imtest, white -- computes White's general test for
heteroscedasticity (imtest in older versions).
Diagnostic Cont…
3. Tests for Multicollinearity
estat vif (vif in older versions): calculates the variance inflation factor for
the independent variables in the linear model.
The test is based on an auxiliary regression of each explanatory
variable on the other explanatory variables, with VIF = 1/(1 - R2). If the
auxiliary R2 is greater than 0.9 (VIF greater than 10), there is a problem of
multicollinearity between the explanatory variables.
Diagnostic Cont…
4. Tests for Model Specification
– linktest -- performs a link test for model specification.
– estat ovtest -- performs the regression specification error test
(RESET) for omitted variables (ovtest in older versions).
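The diagnostic tests listed in this section can be run one after another following a regression. The sketch below assumes the auto example dataset shipped with Stata and uses the estat forms of the commands found in current versions of Stata. The last lines compute the variance inflation factor of mpg by hand from the auxiliary regression.

```stata
* Assumption: the shipped "auto" dataset is used for illustration.
sysuse auto, clear
regress price mpg weight

estat hettest          // Breusch-Pagan / Cook-Weisberg heteroscedasticity test
estat imtest, white    // White's general test
estat vif              // variance inflation factors
estat ovtest           // Ramsey RESET test
linktest               // link test for model specification

* VIF of mpg by hand: regress it on the other explanatory variable
quietly regress mpg weight
display "VIF of mpg = " 1/(1 - e(r2))
```

linktest is placed last in this sketch because it fits a regression of its own, and the estat commands are meant to follow the original model.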
Thank You!