Basics of STATA: Data Management & Graphs

Learning Stata Software Lesson

Introduction to Basic STATA

By:
Dr ZerihunTemsas

May 2023
Outline

1. Meaning and Importance of STATA


2. Importing and Saving Data in STATA
3. Data Management
4. Graphs Using STATA
5. Descriptive Data Analysis using STATA
6. Two groups mean comparison test
7. Regression Analysis Using STATA

6/8/2023 2
Scientific Research Process
Theoretical & Empirical literature

Statement of the problem

Research Questions

Research Objectives

Research Design (Quantitative or Qualitative)

Sampling Design

Research Proposal

Data Collection

Data Analysis (Software)

Report writing

1. Meaning and Importance of STATA

Stata is a general-purpose statistical software package, first released in 1985 by StataCorp.

Most of its users work in research, at universities and research institutions.

Stata is used for data management, statistical analysis, graphics, and regression analysis.

Meaning and Importance of STATA

Stata is a multi-purpose statistical package that helps you explore, summarize, and analyze datasets.

A dataset is a collection of variables with their respective values, usually arranged with one variable per column.

A variable is a characteristic that can assume one or several values.

Stata is widely used in universities.

Cont…

Results Window
All output appears in this window. Only graphics appear in a separate window.

Review Window
Previously used commands are listed here and can be transferred to the Command window by clicking on them.

Command Window
This is the command line where commands are entered for execution.

Variables Window
The variables of the dataset currently in memory are listed here.

Cont…

Stata has four basic windows:

a. Results Window
b. Review Window
c. Command Window
d. Variables Window

Cont…

Stata is mostly a command-driven package.

Although recent versions also have pull-down menus from which commands can be chosen, the best way to learn Stata is still by typing in the commands.

Sometimes, however, the exact syntax of a command is hard to remember (for example, for graphs). In these cases, we often run the command once from the menus and then copy the syntax that Stata displays.

STATA Buttons
The most important button functions are the following:
Open: Opens an existing data file.
Save: Saves the current data file.
Print results: Prints the contents of the Results window.
Log begin: Opens a log file, which is used to save results.
New Viewer: Opens a Viewer window, which provides help on Stata commands and rules.
Do-file Editor: Opens a new instance of the Do-file Editor (same as doedit).
Data Editor: Opens the Data Editor window (same as edit).
Data Browser: Opens the Data Browser (same as browse).

Button functions cont…

Variable Manager: this button is used to label variables and values, that is, to give additional information about a variable name and to label the values of categorical variables.

Tool Bar (figure showing the buttons): Open, Log begin, New Viewer, Do-file Editor, Data Editor, Data Browser, Variable Manager

Stata Menu Bars

Stata displays 8 drop-down menus across the top of the outer window, from left to right:
1. File
• Open: open a Stata data file (use)
• Save/Save as: save the Stata data in memory to disk
• Do: execute a do-file
• Print: print log or graph
• Exit: quit Stata

Menu Bar Cont…
2. Edit
• Paste: to paste commands into the Command window
• Copy table: to copy a table from the Results window to another file
• Preferences: to change the colors of the Stata Results window

3. Data
• Describe
• Open Variable Manager
• Create new variable
• Sort data
• Open Data Editor

Menu Bar Cont…

4. Graphics: contains the Stata commands for graphs
5. Statistics: contains most of the Stata commands for descriptive and inferential data analysis
6. User: used for user-supplied Stata commands
7. Window: used to bring the Variables window, Command window, and Review window to the front
8. Help: used to find Stata commands, tutorials, and guidelines

2. Data Importing and Saving
2.1 Data Entry
There are different ways of reading or entering data into Stata:

A. By using the use command
B. By converting the data with the Stat/Transfer software
C. By copying and pasting from Excel
D. By typing directly into the Data Editor

A. Importing Data via the use command

A dataset that is already in Stata format (.dta) can be opened with the use command as follows:
– Syntax: use filename.dta
– Example: use migration
Use command Cont…
You can open selected variables of a file by giving a variable list, and selected records by using if or in. In these cases the file name follows the keyword using.

Example:
• use migration
  opens the file migration.dta for analysis
• use if female==1 using migration
  opens data only for female migrants
• use in 1/150 using migration
  opens records 1 through 150 of the migration file
• use obsn age female using migration
  opens 3 variables from the migration file

Cont…

The clear option will clear the dataset currently in memory before opening the other one.

Example: use filename.dta, clear
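
The forms of use described above can be tried without the migration file. The following is only a sketch: it assumes Stata's bundled auto dataset as a stand-in for migration.dta, so the variable names (make, price, mpg, foreign) belong to that dataset and not to the course data.

```stata
* Assumption: auto.dta (shipped with Stata) stands in for migration.dta
sysuse auto, clear
tempfile cars                 // temporary file name, removed automatically
save `cars'

use `cars', clear                          // open the whole file
use if foreign == 1 using `cars', clear    // open selected records
use in 1/20 using `cars', clear            // open records 1 through 20
use make price mpg using `cars', clear     // open selected variables
describe
```

Every call carries the clear option because a dataset is already in memory when the next one is opened.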

B. Stat/Transfer program
Stat/Transfer is a separate program that converts data from other formats, such as Excel, directly into Stata format. The converted data can then be opened in Stata and viewed in the Data Editor.

C. Copy and paste

If you can open the data in Excel, you can usually copy and paste the data into the Stata Data Editor.

All you need to do is select the columns in Excel, copy them, open the Stata Data Editor, and paste.

Cont…

D. Manual typing

Typing in the data manually is the tedious last resort: if the data are not available in electronic format, you may have to enter them by hand.

Start Stata and use the edit command. This brings up a spreadsheet-like window where you can enter new data or edit existing data.

Cont…

2.2 Saving Commands and Outputs


A. Saving File/Data

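Finally, the data are saved with the save command:

– Syntax: save filename.dta [, options]
– Example: save migration.dta, replace

The replace option overwrites an existing file of the same name.

To see your working directory, type pwd. Stata then displays the current path, for example:
– pwd
– C:\Users\hp-6570b\Desktop\Document
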
Saving Data

You can also change the working directory using the cd command:

cd "C:\Users\hp-6570b\Desktop\Software_Training"

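As a rough illustration of how pwd and save fit together, the sketch below saves a copy of a dataset into the current working directory and reopens it. The bundled auto dataset and the file name my_copy.dta are assumptions made for the example only.

```stata
* Assumption: auto.dta stands in for the course data;
* my_copy.dta is a hypothetical file name.
sysuse auto, clear
pwd                          // show the current working directory
save my_copy.dta, replace    // written to the working directory
use my_copy.dta, clear       // reopen the saved copy
describe, short
```
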
B. Saving Outputs/Results

Saving the Output

– The Stata Results window does not keep all the output you generate.
– It only stores about 300-600 lines, and when it is full, it begins to delete the oldest results as you add new ones.
– Thus, we need to use a log file to save the output.

Log File Cont…

A log file is a sort of built-in tape recorder for Stata, from which you can:

1) retrieve the output of your work and
2) keep a record of your work.

– Example: log using stata_result.log
  saves the output in a file named stata_result.log

This will create the file stata_result.log in your working directory.

To close a log file, type:

log close

Log File Cont…

To add more output to an existing log file, use the option append:
– Example: log using stata_result.log, append
  adds new output to the existing file stata_result.log
To overwrite an existing log file, use the option replace:
– Example: log using stata_result.log, replace
  replaces the contents of the existing file
Note that the option replace will delete the contents of the previous version of the log.

Log File Cont…

log off
This command temporarily turns off the logging of output.

log on
This command restarts the logging.

log close
This command stops the logging and saves the file.

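The log commands above can be combined into one short session. This is a sketch under assumptions: the log name session_demo.log is hypothetical, and the bundled auto dataset is used in place of the course data.

```stata
* Assumption: session_demo.log is a hypothetical log file name
capture log close                        // close any log left open earlier
log using session_demo.log, replace text
sysuse auto, clear
summarize price                          // recorded in the log
log off
describe                                 // not recorded: logging is paused
log on
tabulate foreign                         // recorded again
log close
```

The leading capture log close is a common precaution, since log using stops with an error if a log is already open.
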
C. Saving Commands
The Do-file Editor allows you to store a set of commands, which makes it easier to check and fix errors.

It also allows you to run the commands again later and lets you show others how you got your results (for example, your advisor may want to know how you obtained a result).

Do File Cont…

In general, any time you are running more than 10 commands to get a result, it is easier and safer to use a do-file to store the commands.

To open the Do-file Editor, you can click on Window/Do-file Editor or click on the envelope icon on the Tool Bar.

Do File Cont…

To run the commands in a do-file, you can click on the Do button (the second-to-last one) or click on Tools/Do.

If you want to run one or just a few commands rather than the whole file, select those commands and click on the Do button.

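A do-file is simply a text file of commands. The sketch below shows what a small one might contain; the file name analysis_demo.do, the variable price_k, and the use of the bundled auto dataset are assumptions for illustration.

```stata
* Contents of a hypothetical do-file, e.g. analysis_demo.do
* It can also be run from the Command window with:  do analysis_demo.do
clear all
sysuse auto, clear
generate price_k = price/1000
label variable price_k "Price (thousands of dollars)"
summarize price_k mpg
```

Lines beginning with * are comments, which help others (and you, later) follow how a result was obtained.
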
3. Data manipulation
Generate New Variables

New variables are generated with the generate command.

The arithmetic operators are:
– +  Addition
– -  Subtraction
– *  Multiplication
– /  Division
– ^  Power

Data Manipulation Cont…

We will see how to explore data using existing variables in a later section.

Now we will discuss how to create new variables.

When new variables are created, they are held in memory and appear in the Data Browser, but they are not saved to the hard disk unless you use the save command.
– Example: generate age_sq = age^2
– Example: gen log_y = log(y)

Data Manipulation Cont…

Thus, the generate command is used to create a new variable. It is similar to Compute in SPSS.

The syntax is:

generate newvar = exp [if exp]

where exp is an expression, as in:

generate rem2 = rem/y
generate S2 = S/Y
gen total_rem = expr*rem

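To see generate at work on real values, the sketch below builds three new variables from the bundled auto dataset. That dataset and the new variable names are assumptions chosen for the example; they are not part of the course data.

```stata
* Assumption: auto.dta stands in for the course data
sysuse auto, clear
generate weight_sq = weight^2                        // power
generate log_price = log(price)                      // natural logarithm
generate price_per_lb = price/weight if weight > 0   // division, with a condition
summarize weight_sq log_price price_per_lb
```

Observations that do not satisfy the if condition are set to missing in the new variable.
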
3.2 Replace

The values of existing variables can be changed with the replace command.

It works like the generate command and likewise expects an expression.
– Syntax: replace oldvar = exp [if exp]
– Example: replace S = S/Y

Data Manipulation Cont…

Drop
Variables or observations can be deleted using the drop command.
– Syntax: drop varlist (variables) or drop if exp (observations)
– Example: drop age female city
Keep
This command works opposite to drop: it keeps the listed variables or observations and deletes all others.
– Syntax: keep varlist (variables) or keep if exp (observations)

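The following sketch applies drop and keep to both variables and observations. It assumes the bundled auto dataset; the price threshold is arbitrary and chosen only for illustration.

```stata
* Assumption: auto.dta stands in for the course data
sysuse auto, clear
drop rep78 headroom             // delete two variables
drop if price > 10000           // delete observations that meet a condition
keep make price mpg foreign     // keep only these variables
keep in 1/20                    // keep only the first 20 observations
describe
```

Because both commands change the data in memory permanently, it is safer to work on a copy or to reload the file afterwards.
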
Data Manipulation Cont…

Recode

This command changes the values of a categorical variable according to the rules specified.

• The syntax is:

recode varname old=new [old=new ...] [if exp] [in range]

Recode can also be used to convert a quantitative variable into a categorical variable by mapping ranges of values to category codes.

Data Manipulation Cont…
Here are some examples:
• recode female 1=2
  changes all values of female = 1 to female = 2
• recode female 1=2 0=1
  changes 1 to 2 and 0 to 1
• recode female 0=1 1=0
  exchanges the values 0 and 1 in female
• recode female 1=2 *=1
  changes 1 in female to 2 and all other values to 1

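Turning a quantitative variable into a categorical one with recode might look like the sketch below. The bundled auto dataset, the cut points, and the name mpg_cat are assumptions for illustration.

```stata
* Assumption: auto.dta stands in for the course data; cut points are arbitrary
sysuse auto, clear
recode mpg (min/19 = 1 "low") (20/29 = 2 "medium") (30/max = 3 "high"), generate(mpg_cat)
tabulate mpg_cat
```

The generate() option stores the result in a new variable, so the original mpg is left unchanged.
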
4. Data formatting
Rename
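A variable can be renamed with the rename command:
– Syntax: rename old_varname new_varname
– Example: rename age migrant_age

Label
The label variable command attaches a descriptive label to a variable.
– Syntax: label variable varname ["label"]
– Example: label variable Y "migrant income"
Label define
The label define command creates a set of value labels, which is then attached to a categorical variable with label values.
– Syntax: label define lblname # "label" [# "label" ...]
– Example: label define female 1 "female" 0 "male"
– Example: label define MRST 1 "protestant" 2 "orthodox" 3 "muslim" 4 "others"
– Example: label values female female
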
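The renaming and labeling steps can be tried together as follows. This is a sketch that assumes the bundled auto dataset; the names miles_per_gallon, expensive, and yesno are hypothetical.

```stata
* Assumption: auto.dta stands in for the course data
sysuse auto, clear
rename mpg miles_per_gallon
label variable miles_per_gallon "Mileage (miles per gallon)"
generate byte expensive = price > 6000     // 1 if price exceeds 6000, else 0
label define yesno 0 "no" 1 "yes"          // create the value labels
label values expensive yesno               // attach them to the variable
tabulate expensive
```

The table then shows no and yes instead of 0 and 1, because label values links the variable to the labels created by label define.
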
5. Data Exploration

Describe

General information about the dataset can be retrieved with describe.

The command displays the number of observations, the number of variables, and the size of the dataset, and lists all variables together with basic information (such as storage type).
– Syntax: describe [varlist]
– Example: des age female Y

Data Exploration Cont…
Codebook
The codebook command delivers information about one or more
variables, such as storage type, range, number of unique values, and
number of missing values.

The command offers further interesting features which can be seen with
help codebook.
– Syntax: codebook [varlist]
– Example: codebook age
Sort
Data is sorted in ascending order with the sort command:
– Syntax: sort varlist
– Example: sort age

Data Exploration Cont…
Sorting in descending order can be done with gsort, where a minus sign in front of a varname requests descending order:
– Syntax: gsort [+|-] varname [[+|-] varname ...]
– Example: gsort -age
Order
The order of the variables as seen in the Variables window can be changed with the order command:
– Syntax: order varlist
– Example: order age obsn
Browse
The Data Browser can be opened with the browse command:
– Syntax: browse [varlist]
– Example: browse age income

Data Exploration Cont…
List
Similar to the Data Browser, values of variables can be listed in the Results window with the list command.
– Syntax: list [varlist]
Summarize
The most important descriptive statistics for numerical variables are delivered with the summarize command:
– Syntax: summarize [varlist]
– Example: summarize age y ls dskm
It displays the number of observations, mean, standard deviation, minimum, and maximum.

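A first look at an unfamiliar dataset often chains several of these commands. The sketch below assumes the bundled auto dataset in place of the course data.

```stata
* Assumption: auto.dta stands in for the course data
sysuse auto, clear
describe price mpg foreign      // storage type and labels
codebook rep78                  // range, unique values, missing values
gsort -price                    // most expensive cars first
list make price in 1/5          // the five most expensive cars
summarize price mpg weight      // obs, mean, sd, min, max
```
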
Data Exploration Cont…
Tables of summary statistics can be produced with table.
Tabulate
One-way frequency tables for categorical variables can be produced with the tabulate command:
– Syntax: tabulate varname
– Example: tabulate mrts
Two-way cross-tables for two categorical variables can be produced with another version of tabulate:
– Syntax: tabulate varname1 varname2
– Example: tabulate mrts female
The tab2 command produces two-way tables for all pairs of the listed variables. Options control which percentages are shown:
– tab2 religion city, cell nofreq    cell percentages without frequencies
– tab2 religion city, row            row percentages
– tab2 religion city, column         column percentages

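The one-way and two-way forms of tabulate, with the percentage options, can be sketched as follows. The bundled auto dataset is an assumption; rep78 and foreign are its categorical variables.

```stata
* Assumption: auto.dta stands in for the course data
sysuse auto, clear
tabulate rep78                            // one-way frequency table
tabulate rep78 foreign, row               // two-way table with row percentages
tabulate rep78 foreign, column nofreq     // column percentages only
```
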
Data Exploration Cont…
Inspect
The inspect command provides a quick summary of a numeric variable that differs from that provided by summarize or tabulate:
– Syntax: inspect [varlist]
– Example: inspect age
It reports:
– the number of negative, zero, and positive values;
– the number of integers and non-integers;
– the number of unique values; and
– the number of missing values.
It also produces a small histogram.
Its purpose is not analytical; rather, it lets you quickly become familiar with unknown data.

Data Exploration Cont…
If
The if qualifier is used to select certain records when carrying out data analysis.

command if exp

Examples:
• sum age dskm y if city==1
• summarize if female==1
• sum if city==1
• sum if age<20

• Note that a test for equality in an if expression always uses ==, not a single =. Command names must also be typed in lowercase.


Data Exploration Cont…
In
You can also use in to select records based on the observation number.
The syntax is:
command in range
For example:
• list in 10              lists observation number 10
• summarize in 150/300    summarizes observations 150-300
• list in -10/-1          lists the last 10 observations
• sum in -5/-1            summarizes the last five observations
The if and in qualifiers serve the same purpose as Select Cases in SPSS.

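The if and in qualifiers can be attached to almost any command. The following sketch assumes the bundled auto dataset; the conditions are arbitrary examples.

```stata
* Assumption: auto.dta stands in for the course data
sysuse auto, clear
summarize price mpg if foreign == 1     // only foreign cars
summarize price if mpg < 20             // only cars below 20 mpg
list make price in 10                   // observation number 10
list make price in -5/-1                // the last five observations
```
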
Data Exploration Cont…
Count
The count command shows the number of observations that satisfy an if condition. If no condition is specified, count displays the number of observations in the data. For example (each command is followed by its output):
• count
  665
• count if female==1
  213

Data Exploration Cont…

By
This prefix goes before a command and asks Stata to repeat the command for each value of a variable. The general syntax is:
by varlist: command
Note: by requires the data to be sorted by varlist first. The bysort prefix sorts and then runs the command in one step, so it is the form most commonly used.
An example of the by prefix is:
bysort female: sum
This is similar to Split File in SPSS.

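The difference between by and bysort, together with count, can be seen in a short sketch. It assumes the bundled auto dataset, with foreign and rep78 as the grouping variables.

```stata
* Assumption: auto.dta stands in for the course data
sysuse auto, clear
count                                  // all observations
count if foreign == 1                  // observations meeting a condition
bysort foreign: summarize price mpg    // sorts, then repeats per group
sort rep78                             // by alone needs sorted data
by rep78: count
```
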
Data Exploration Cont…
help
The help command gives you information about any Stata command or topic.
help [command]
For example,
• help tabulate     gives a description of the tabulate command
• help summarize    gives a description of the summarize command

5. Graphics Using STATA
• One of the advantages of Stata is its extensive graphics capabilities.

• Most graphs are produced with the graph command, but some graph commands are typed without the leading graph.

• For example, a basic histogram of the variable age would be:

– Example: histogram age, frequency normal

• Graphs are not saved in log files; they must be saved or exported separately.

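Since graphs do not go into the log, one way to keep them is graph save or graph export. The sketch below assumes the bundled auto dataset and the hypothetical file name mpg_hist; PNG export is assumed to be available on your installation.

```stata
* Assumption: auto.dta stands in for the course data;
* mpg_hist is a hypothetical file name.
sysuse auto, clear
histogram mpg, frequency normal        // histogram with normal curve
graph save mpg_hist, replace           // Stata graph file (mpg_hist.gph)
graph export mpg_hist.png, replace     // image file for reports
```
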
Graphics Cont…
– graph bar (mean) age, over(female) blabel(name)
  bars labeled by the name of the category
– graph bar (mean) dskm, over(female) blabel(bar)
  bars labeled by the height of the bar
– graph bar (mean) age (mean) educ, over(female) blabel(bar)
  means of two variables side by side, labeled by bar height

Graphics Cont…

The commands that draw graphs are:
– graph twoway    scatterplots, line plots
– graph matrix    scatterplot matrices
– graph bar       bar charts
– graph box       box-and-whisker plots
– graph pie       pie charts

Graphics Cont…

Examples

– graph twoway scatter educ age

We can show the regression line predicting educ from age using the lfit plot type.

– twoway lfit educ age

The two plots can be overlaid in one graph like this:

– twoway (scatter fs ls) (lfit fs ls)

A scatterplot matrix of several variables is drawn with graph matrix:

– graph matrix age educ

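The overlay of a scatterplot and a fitted line can be reproduced with any two numeric variables. This sketch assumes the bundled auto dataset and uses price and weight in place of the course variables.

```stata
* Assumption: auto.dta stands in for the course data
sysuse auto, clear
twoway (scatter price weight) (lfit price weight)    // scatter plus fitted line
graph matrix price weight mpg                        // scatterplot matrix
graph bar (mean) price, over(foreign) blabel(bar)    // mean price by group
```

Both plots in the overlay use the same pair of variables, so the line is the linear fit to the points shown.
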
Pie chart

Pie chart: used to present data for categorical variables.

– graph pie, over(religion) plabel(_all name)
  slices labeled by the name of the category
– graph pie, over(religion) plabel(_all sum)
  slices labeled by the number of observations in the category

Normality and outliers

Skewness and kurtosis
sum age
sum age, detail
Only the detail option reports skewness and kurtosis, along with percentiles.

Check the normality of a variable visually by looking at some basic graphs:
histogram age
histogram age, normal

Multivariate normality test
mvtest normality age

Normality Cont…
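graph box draws vertical box plots:

graph box age

The lower and upper bounds of the box are the 25th and 75th percentiles.

The line within the box is the median. The whiskers extend to the lower and upper adjacent values, that is, the most extreme observations lying within 1.5 times the interquartile range of the box; observations beyond the whiskers are plotted as individual points.

If age is normally distributed, the median will be in the center of the box and the ends of the whiskers will be roughly equidistant from the box.
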
Normality Cont…

The kdensity command with the normal option:
kdensity age, normal
– draws a density graph of the variable with a normal distribution superimposed
– is useful for checking whether the variable is normally distributed
The pnorm command produces a P-P plot:
pnorm age
– the plot should be approximately linear if the variable follows a normal distribution

Normality Cont…

The qnorm command plots the quantiles of a variable against the quantiles of a normal distribution:
qnorm age
The closer the points of the Q-Q plot lie to the 45-degree line, the closer the variable is to being normally distributed.

When both the P-P and the Q-Q plot lie close to the reference line, they suggest (but do not prove) that age is approximately normally distributed.

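The normality checks from the last few slides can be run in sequence on one variable. The sketch assumes the bundled auto dataset, with mpg playing the role of age.

```stata
* Assumption: auto.dta stands in for the course data
sysuse auto, clear
summarize mpg, detail        // includes skewness and kurtosis
histogram mpg, normal        // histogram with normal curve
graph box mpg                // box plot
kdensity mpg, normal         // kernel density with normal curve
pnorm mpg                    // P-P plot
qnorm mpg                    // Q-Q plot
mvtest normality mpg weight  // multivariate normality test
```

Each graph command replaces the previous graph in the Graph window, so it helps to run them one at a time.
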
Two groups mean comparison tests

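The ttest command with the by() option tests whether the mean of a variable differs between two groups, here between female and male respondents:

ttest dskm, by(female)
ttest y, by(female)
ttest s, by(female)
ttest age, by(female)
ttest educ, by(female)
ttest rem, by(female)
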
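A two-group comparison can be tried on any dataset with a two-category grouping variable. The sketch assumes the bundled auto dataset, where foreign takes the values 0 and 1.

```stata
* Assumption: auto.dta stands in for the course data
sysuse auto, clear
ttest price, by(foreign)            // assumes equal variances in both groups
ttest mpg, by(foreign) unequal      // allows unequal variances
```

The unequal option is worth considering when the two groups have clearly different standard deviations.
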
6. Regression Analysis

This section describes the use of Stata to do regression analysis. Regression analysis involves estimating an equation that best describes the data.

One variable is considered the dependent variable, while the others are considered independent (or explanatory) variables.

Regression Cont…

Stata is capable of many types of regression analysis and associated statistical tests.

In this section, we touch on only a few of the more common commands and procedures: the regress command, the predict command, and the post-estimation diagnostic tests.

Multiple linear Regression Analysis

GPA = f(hrs, income, pc, sex)

Regression Cont…
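• regress

Ordinary least squares (OLS) linear regression is estimated with the regress command. The dependent variable is listed first, followed by the explanatory variables:
– Syntax: regress depvar [indepvars]
– Example: regress cgpa hrs income pc sex
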
Regression Cont…
• Some post-estimation commands:

– predict fv                                   stores the fitted values in fv
– list cgpa fv in 1/10                         compares actual and fitted values
– scatter cgpa hrs
– twoway (scatter cgpa hrs) (lfit cgpa hrs)
– predict e, resid                             stores the residuals in e
We can use the regression result to predict what the cumulative GPA of a student with an income of 300 would be, with all other explanatory variables set to zero:
– display _b[_cons] + _b[income]*300

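The estimation and prediction steps can be rehearsed on any dataset. The sketch below assumes the bundled auto dataset, with price as the dependent variable in place of cgpa; the value 3000 is an arbitrary weight chosen for illustration.

```stata
* Assumption: auto.dta stands in for the course data
sysuse auto, clear
regress price weight mpg foreign
predict fv                              // fitted values
predict e, residuals                    // residuals
list price fv e in 1/10
* Prediction for weight = 3000 with mpg and foreign set to zero
display _b[_cons] + _b[weight]*3000
```

Coefficients are accessed as _b[varname], and the intercept as _b[_cons].
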
Regression Cont…

Post-estimation (diagnostic) tests for OLS regression:

- Multicollinearity
- Heteroscedasticity
- Autocorrelation
- Normality
- Model misspecification

Diagnostic Test/Post Estimation Tests

1. Tests for Normality of Residuals
– kdensity -- produces a kernel density plot with a normal distribution overlaid.
– pnorm -- graphs a standardized normal probability (P-P) plot.
– qnorm -- plots the quantiles of varname against the quantiles of a normal distribution.
– mvtest normality r -- tests the normality of the residual variable r.

Diagnostic Cont…
2. Tests for Heteroscedasticity
– estat hettest (hettest in older versions) -- performs the Breusch-Pagan / Cook-Weisberg test for heteroscedasticity.
– estat imtest, white (imtest in older versions) -- computes White's general test for heteroscedasticity.

Diagnostic Cont…

3. Tests for Multicollinearity

estat vif (vif in older versions): calculates the variance inflation factor (VIF) for the independent variables in the linear model.

The VIF of an explanatory variable is based on an auxiliary regression of that variable on the other explanatory variables: VIF = 1/(1 - R2), where R2 comes from the auxiliary regression. If the auxiliary R2 is greater than 0.9 (a VIF above 10), there is a problem of multicollinearity between the explanatory variables.

Diagnostic Cont…

4. Tests for Model Specification

– linktest -- performs a link test for model specification.
– estat ovtest (ovtest in older versions) -- performs the regression specification error test (RESET) for omitted variables.

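The diagnostic tests can be run one after another once a model has been estimated. The sketch assumes the bundled auto dataset and a price regression in place of the GPA model; r is the residual variable, as on the normality slide.

```stata
* Assumption: auto.dta stands in for the course data
sysuse auto, clear
regress price weight mpg foreign
predict r, residuals        // residuals for the normality checks
estat vif                   // multicollinearity
estat hettest               // Breusch-Pagan / Cook-Weisberg test
estat imtest, white         // White's test
estat ovtest                // Ramsey RESET test
linktest                    // link test for specification
kdensity r, normal          // normality of residuals
pnorm r
qnorm r
```

The residuals are stored before linktest is run because linktest estimates an auxiliary regression of its own.
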
Thank You!
