
Empirical midterm

1. Origins of data
What is data?
• Observations are also known as cases, or rows – they’re identified by identifiers or ID
variables
• Variables are sometimes called features or covariates

Data structure and quality


• Cross-sectional (xsec) data have information on many units observed at the same time
• Time series (tseries) data have information on a single unit observed many times
• Multi-dimensional (panel) data have multiple dimensions
▪ Many cross-sectional units observed many times
▪ Or units observed across different locations (a spatial dimension)
▪ longitudinal data, cross-section time series data, xt data
▪ In xt data tables observations are identified by two ID variables: one for the cross-
sectional units, one for time
▪ xt data is balanced if all cross-sectional units are observed at the very same time
periods
▪ It is called unbalanced if some cross-sectional units are observed in more time periods
than others
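A minimal sketch of checking balancedness in xt data (pandas assumed; the country/year data are made up for illustration):

import pandas as pd

# xt (panel) data in long format: one row = one cross-sectional unit in one
# period, identified by two ID variables (here "country" and "year")
xt = pd.DataFrame({
    "country": ["AT", "AT", "AT", "HU", "HU"],
    "year":    [2019, 2020, 2021, 2019, 2020],
    "gdp":     [4.0, 3.8, 4.1, 1.5, 1.4],
})

# balanced if every unit is observed in the very same periods;
# here HU is missing 2021, so the panel is unbalanced
periods = xt.groupby("country")["year"].apply(frozenset)
print("balanced" if periods.nunique() == 1 else "unbalanced")  # -> unbalanced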
• Data quality is key – garbage in-garbage out
• First you have to specify your (research) question
• Content - what is the substance a variable captures
▪ Just because a variable is called something, it doesn't necessarily measure that
(e.g., "product quality", "socio-economic status")
• Validity
▪ how close the actual content of the variable is to the intended content
• Reliability
▪ if we were to measure the same variable multiple times for the same observation, it
should give the same result
• Comparability of measurement
▪ how similarly the same variable is measured across different observations
• Coverage
▪ what proportion of the observations in focus are in the data
▪ Complete coverage (rare)
▪ Incomplete coverage (almost always)
• Unbiased selection
▪ if coverage incomplete the observations that are included in the data should be
similar to all observations that were intended to be covered
▪ Selection bias is the bias introduced by the selection of individuals, groups, or
data for analysis in such a way that proper randomization is not achieved, thereby
failing to ensure that the sample obtained is representative of the population
intended to be analyzed

Collecting data from existing sources


• visit the website and download
• Application Programming Interface, or API – directly load data into a statistical
software
• scraping
▪ code is needed – once we have it, the scraping can be repeated (see the sketch
after this list)
• administrative
▪ Business transactions
▪ Government records, taxes, social security
▪ Biggest problem is very limited access
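A minimal scraping sketch (requests and BeautifulSoup assumed; the URL and the CSS selector are placeholders, not a real source):

import requests
from bs4 import BeautifulSoup

# download the page and parse the HTML (URL is a placeholder)
html = requests.get("https://example.com/hotel-prices").text
soup = BeautifulSoup(html, "html.parser")

# extract the fields of interest; once the code exists, collection can be repeated
prices = [td.get_text(strip=True) for td in soup.select("td.price")]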

Survey
• Surveys collect data by asking people (respondents) and recording their answers
• self-administered survey
▪ cheap and efficient, can use visual aids
• interviews

Sampling
• All the observations we want to learn about are called the population
• Sampling is when we purposefully collect data on a subset of the population
• A sample is good if it represents the population
▪ all important variables have very similar distributions in the sample and the
population
▪ all patterns in the sample are very similar to the patterns in the population
• How can we tell if a sample is representative?
▪ Never for sure
▪ Benchmarking
• it looks at variables for which we know something in the population
• Those should be similar in the sample
• Random sampling is best
▪ Random sampling is a selection rule that is independent of any important variable
▪ Random sampling is the process that most likely leads to representative samples
▪ Provided the sample is large enough
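A minimal sketch of random sampling and benchmarking (pandas and numpy assumed; the population data are made up):

import pandas as pd
import numpy as np

# made-up population with one important variable
population = pd.DataFrame(
    {"income": np.random.default_rng(0).lognormal(size=10_000)})

# random sampling: the selection rule is independent of any important variable
sample = population.sample(n=500, random_state=42)

# benchmarking: a statistic known in the population should be similar in the sample
print(population["income"].mean(), sample["income"].mean())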
2. Preparing data for analysis
Variable types
• Quantitative/Continuous variables
▪ are born as numbers
▪ special case is time
▪ Binary variables
• can take on two values
• yes/no answer to whether the observation belongs to some group
• dummy/indicator variables
 binary variables with 0-1 values
• Flag
 binary showing existence of some issue
• Qualitative/Categorical/Factor variables
▪ each value having a specific interpretation – brands, countries etc.
▪ string variable
• text in data
• Nominal qualitative variables
▪ take on values that cannot be unambiguously ordered
▪ Color, brands
• Ordinal/ordered variables
▪ take on values that are unambiguously ordered
▪ quantitative variables can be ordered
▪ some qualitative variables can be ordered, too
▪ Grades
• "Interval" variables
▪ ordered variables, with a difference between values that can be compared
▪ Degree Celsius, Price in dollar
• "Ratio" (or "scale") variables
▪ variables with the additional property: their ratios mean the same regardless of the
magnitudes
▪ implies a meaningful zero in the scale
▪ Distance in miles, Price in dollar
• Flow variables
▪ are the results of a process over time
▪ government deficit last year
• Stock variables
▪ refer to quantities at a given point in time
▪ the amount of government debt at the end of last year

Data cleaning
• Data wrangling is the process of transforming raw data to a set of data tables that can be
used for a variety of downstream purposes such as analytics
• the tidy data approach:
▪ Each observation forms a row
▪ Each variable forms a column
▪ Each type of observational unit forms a table
▪ Each observation has a unique identifier (ID)
• long format for xt data
▪ store xt data in data tables with each row referring to one cross-sectional unit
observed in one time period
▪ The next row then may be the same cross-sectional unit observed in the next time
period
• wide format for xt data
▪ one row would refer to one cross-sectional unit, and different time periods are
represented in different columns
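A minimal sketch of switching between the two formats (pandas assumed; IDs and values made up):

import pandas as pd

long = pd.DataFrame({
    "id":    [1, 1, 2, 2],
    "year":  [2020, 2021, 2020, 2021],
    "sales": [10, 12, 7, 9],
})

# long -> wide: one row per cross-sectional unit, one column per time period
wide = long.pivot(index="id", columns="year", values="sales")

# wide -> long again
back = wide.reset_index().melt(id_vars="id", value_name="sales")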

Relational data
• a concept of organizing information
• Each row is a record (observation) identified with a unique identifier ID (key)
• rows in a table can be linked to rows in other tables with a column for the unique ID of
the linked row (foreign ID)

Linking data
• Matching (joining) depends on data structure
▪ one-to-one (1:1) matching
• Football teams and stadium
▪ many-to-one (m:1) or one-to-many (1:m) matching
• Football teams and their players
▪ many-to-many (m:m) matching
• Football teams and every manager they have ever had
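A minimal sketch of m:1 matching via a foreign ID (pandas assumed; the teams and players are made up):

import pandas as pd

teams = pd.DataFrame({"team_id": [1, 2], "team": ["Ajax", "Porto"]})
players = pd.DataFrame({"player": ["A", "B", "C"],
                        "team_id": [1, 1, 2]})  # foreign ID pointing to teams

# many players match one team (m:1)
merged = players.merge(teams, on="team_id", how="left", validate="m:1")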

Data wrangling
• filter out duplicates
▪ some observations appearing more than once in the data
• entity identification and resolution
▪ would need to have unique IDs
▪ could be that two observations belong to two entities although the ID is the same –
ambiguous identification
▪ could be that two observations have different IDs but belong to the same entity
• getting rid of non-entity observations
▪ Rows that do not belong to an entity we want in the data table
▪ Such as: a summary row in a table that adds up, or averages, variables across all,
or some, entities
• missing values
▪ some cases, missing just means "zero" or "no"
• we should simply recode (replace) the missing values as "zero" or as "no"
▪ missing systematically
• some survey respondents may not know the answer to a question or refuse to
answer it
• selection bias
 benchmarking: comparing the distribution of variables that are available for all
observations
▪ sometimes, informative if missing
• create a new variable (called flag) to capture missing value and use this
variable instead of the original
▪ imputation – filling in some information
• ordinal variables
 you may add missing as a new value, or recode missing to a neutral value:
high, average, low, with missing recoded as average
• quantitative variables
 recode with mean or median
• if you impute
 create a flag and use it in the analysis
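A minimal sketch of flagging and imputing missing values (pandas and numpy assumed; data made up):

import pandas as pd
import numpy as np

df = pd.DataFrame({"income": [30.0, np.nan, 50.0, np.nan, 45.0]})

# flag the missing values first, so the flag can be used in the analysis
df["income_missing"] = df["income"].isna().astype(int)

# impute: recode missing quantitative values with the median
df["income"] = df["income"].fillna(df["income"].median())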

Exploratory data analysis


Frequency of variables
• absolute frequency/count:
▪ the absolute frequency of a value of a variable is simply the number of observations
with that particular value
• relative frequency:
▪ the proportion of observations with that particular value among all observations
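A minimal sketch of both kinds of frequency (pandas assumed; the brand data are made up):

import pandas as pd

brand = pd.Series(["A", "B", "A", "C", "A"])
print(brand.value_counts())                # absolute frequencies (counts)
print(brand.value_counts(normalize=True))  # relative frequencies (proportions)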

Probability
• Probability is a measure of the likelihood of an event
• Probability as a generalization of relative frequencies in datasets
• Subjective probabilities
▪ ex: how likely I am to like a book enough to keep it for my further studies

Histograms
• Histogram reveals important properties of a distribution
• Number and location of modes
▪ these are the peaks in the distribution that stand out from their immediate
neighborhood
• Approximate regions for center and tails
• Symmetric or not
▪ asymmetric (skewed) distributions have a long left tail or a long right tail
• Extreme values
▪ values that are very different from the rest
▪ extreme values are at the far end of the tails of histograms
▪ dealing with them needs a conscious decision
• Density plots
▪ instead of bars, they show a continuous curve
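A minimal sketch of a histogram with a density curve (numpy, matplotlib and scipy assumed; data made up):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

x = np.random.default_rng(0).lognormal(size=1_000)  # skewed, long right tail

plt.hist(x, bins=40, density=True)     # histogram: relative frequencies as bars
grid = np.linspace(x.min(), x.max(), 200)
plt.plot(grid, gaussian_kde(x)(grid))  # density plot: a continuous curve instead
plt.show()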

Summary statistics
• Sample mean
▪ $\bar{x}$
• Expected value
▪ $E[x]$
▪ For a quantitative variable, the expected value is the mean
• Quantiles
▪ a quantile is the value that divides the observations in the dataset into two parts in
specific proportions
• Median
▪ the middle value of the distribution
• Percentiles
▪ divide the data into two parts along a certain percentage
• Quartiles
▪ divide the data into two parts along fourths
▪ 1st quartile has one quarter of the observations below and three quarters above - it
is the 25th percentile
• Mode
▪ The mode is the value with the highest frequency in the data
• Central Tendency
▪ The mean, median and mode are different statistics for the central value of the
distribution
• Statistics that measure the spread of distributions are the range, inter-quantile ranges, the
standard deviation and the variance
• Range
▪ the difference between the highest value (the maximum) and the lowest value (the
minimum) of a variable
• Inter-quantile ranges
▪ the difference between two quantiles – e.g., the inter-quartile range: the third
quartile (the 75th percentile) minus the first quartile (the 25th percentile)
• Standard deviation
▪ Its square is the variance
▪ $\mathrm{Std}[x] = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n}}$
• Variance
▪ the average squared difference of each observed value from the mean
• Standardized value
▪ standardized value of a variable shows the difference from the mean in units of
standard deviation
▪ $x^{\mathrm{standardized}} = \frac{x - \bar{x}}{\mathrm{Std}[x]}$
• Skewness
▪ When the distribution is symmetric its mean and median are the same
▪ When it is skewed with a long right tail the mean is larger than the median
▪ When a distribution is skewed with a long left tail the mean is smaller than the
median
▪ $\mathrm{Skewness} = \frac{\bar{x} - \mathrm{median}[x]}{\mathrm{Std}[x]}$
• Visualizing summary statistics
▪ Measures of central value: mean, median, quantiles, percentiles
▪ Measures of spread: range, inter-quantile range, variance, standard deviation
▪ Measure of skewness: mean–median difference
▪ Box plot: visual representation of many quantiles and extreme values
▪ Violin plot: mixes elements of a box plot and a density plot
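A minimal numpy sketch computing these statistics (data made up; note np.std divides by n, matching the formula above):

import numpy as np

x = np.random.default_rng(1).lognormal(size=1_000)  # skewed, long right tail

mean, median, std = x.mean(), np.median(x), x.std()
q1, q3 = np.percentile(x, [25, 75])   # quartiles; their difference is the IQR
z = (x - mean) / std                  # standardized values
skew = (mean - median) / std          # mean-median skewness measure
print(mean > median)                  # True: long right tail pulls the mean up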

Distributions
• Normal distribution
▪ bell-shaped
▪ the variable can take any value
▪ captured by two parameters: µ, the mean, and σ, the standard deviation
▪ symmetric: median, mean (and mode) are the same
▪ ex: height, IQ, etc.
• Lognormal distribution
▪ asymmetrically distributed with long right tails
▪ start from a normally distributed RV $x$, transform it as $e^x$, and the resulting
variable is distributed log-normal (see the sketch after this list)
▪ always non-negative
▪ ex: firm size, income
• Power law/Pareto distribution
▪ distributions with very large extreme values are well approximated by a power law
▪ ex: city population, wealth
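A minimal sketch of the normal-to-lognormal transformation (numpy assumed):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)    # normally distributed RV
y = np.exp(x)                   # transformed variable is log-normal

print((y > 0).all())            # True: always non-negative
print(y.mean() > np.median(y))  # True: long right tail pulls mean above median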

4. Comparison and correlation


The y and the x
• Patterns of association: whether and how observations with particular values of one
variable (x) tend to have particular values of the other variable (y)
▪ compare observations that are different in their x values
▪ Goal 1: predicting the value of a y variable with the help of other variables
• we know the values of those other variables but not the y variable
▪ Goal 2: learn about the effect of a causal variable x on an outcome variable y
• what the value of y would be if we could change x
• causal/conditioning variable: x
• outcome variable: y
• We compare y, by values of x –> we condition y on x
▪ Compare salaries of workers (y) with low and high level of education (x)
• further/conditional comparison: doing more conditioning
• Joint probability: the probability that two events both occur
• Independent events: the probability of one event is the same regardless of whether or not
the other event occurs
• Conditional distribution: all y variables have a conditional distribution if conditioned on
an x variable
• Conditional mean/expectation: shows the mean of y for each value of x
▪ E[y|x]
• Bin scatter: visualization of conditional means of y for bins of x
• Scatterplot: the visualization of the joint distribution of the two variables

Statistical dependence
• Dependence of two variables/Statistical dependence: the conditional distributions of one
variable (y) are not the same when conditioned on different values of the other variable (x)
• Independence of variables: the conditional distribution of y is the same, regardless of the
value of x
• Mean dependence
▪ conditional expectation E[y|x] varies with the value of x
▪ the extent to which conditional expectations (means) differ
▪ Two variables are positively mean-dependent: if the average of one variable tends
to be larger when the value of the other variable is larger, too
• Covariance
▪ mean dependence measure
▪ The more often a positive $x_i - \bar{x}$ goes together with a positive $y_i - \bar{y}$,
the larger and more positive the covariance
▪ $\mathrm{Cov}[x, y] = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n}$
• Correlation coefficient
▪ The correlation coefficient is the standardized version of the covariance
▪ $\mathrm{Corr}[x, y] = \frac{\mathrm{Cov}[x, y]}{\mathrm{Std}[x]\,\mathrm{Std}[y]}$
▪ $-1 \le \mathrm{Corr}[x, y] \le 1$
• E[y|x] = E[y]
▪ holds for any x if the two variables are independent
▪ the reverse is untrue:
• can have zero correlation but mean dependence
 ex: a symmetric U-shaped conditional expectation (see the sketch after this list)
• can have zero correlation and zero mean dependence without complete
independence
 ex: the spread of y may be different for different values of x
• The covariance and the correlation coefficient allow for all kinds of variables, including
binary variables and ordered qualitative variables as well as quantitative variables
• Uncorrelated: the correlation coefficient is zero
• Positively correlated: the two variables move in the same direction
• Negatively correlated: they move in opposite directions
▪ ex: distance and price
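A minimal sketch of zero correlation with mean dependence, using the U-shaped case above (numpy assumed; data made up):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100_000)
y = x**2                        # symmetric U-shaped conditional expectation E[y|x]

print(np.corrcoef(x, y)[0, 1])  # ~0: uncorrelated
# yet y is mean-dependent on x: E[y|x] is high at both ends, low near x = 0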

Latent variables
• Latent variable: such variables are not part of an actual dataset – they can't be observed
because they're too abstract
▪ ex: quality of firm management
• Proxy variables: we answer questions about latent variables by substituting observed
variables for them – these are the proxy variables
• Using a single variable
• Using a sum
▪ we would need to bring the variables to a common scale
▪ this standardized measure is called a "z-score" or "score"
• Principal component analysis – PCA
▪ The weights are constructed in such a way that observed variables that are better
measures receive higher weights
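A minimal PCA sketch (numpy assumed; the latent "quality" and its three noisy measures are made up):

import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=500)  # the unobserved variable we care about
# three observed proxies with increasing measurement noise
X = np.column_stack(
    [latent + rng.normal(scale=s, size=500) for s in (0.2, 0.5, 1.0)])

Z = (X - X.mean(axis=0)) / X.std(axis=0)        # z-scores: common scale
eigvals, eigvecs = np.linalg.eigh(np.cov(Z.T))  # eigenvalues in ascending order
pc1 = Z @ eigvecs[:, -1]                        # first principal component
print(eigvecs[:, -1])  # better-measured variables get larger (absolute) weights
print(abs(np.corrcoef(pc1, latent)[0, 1]))      # high: PC1 proxies the latent variable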

Sources of variation in x
• If no variation in the conditioning variable
▪ all observations have the same values
▪ impossible to make comparisons
• Generalization: the more variation there is in the conditioning variable, the better the
chances for comparison
• Experimental data
▪ controlled variation
▪ interpretation is easy if the conditioning variable is experimentally controlled
▪ the experiment makes sure that differences in the outcome variable are due to that
variable only
• Observational data
▪ most data used in business, economics and policy analysis are observational
▪ harder to interpret
▪ hard - many other things may be different when the value of the conditioning
variable differs
5. Generalizing from data

Generalization
• inference: the act of generalization
• Statistical inference
▪ uses statistical methods to make inferences
▪ our data may represent a population if it is a representative sample
▪ to be able to generalize to it, the general pattern needs to exist and it needs to be
stable over time and space
• General patterns:
▪ Representative sample and population
• well-defined population
• random sampling is the best way to achieve a representative sample
▪ No population but general pattern
• the specific pattern is the same
 ex: likelihood, conditional probability, conditional mean
• External validity
▪ assessing whether our data represents the same general pattern that would be
relevant for the situation we truly care about
▪ externally valid case: the situation we care about and the data we have represent
the same general pattern
▪ no external validity: whatever we learn from our data may turn out not to be
relevant at all
• The process of inference
▪ 1. Use statistical inference to learn about the population, or general pattern, that
our data represents
▪ 2. Assess external validity: define the population, or general pattern we are
interested in and assess how it compares to the population, or general pattern, that
our data represents.
• We want to infer the true value of the statistic after having computed its estimated value
from actual data

Repeated Samples
• the conceptual background to statistical inference
• the basic idea is that the data we observe is one example of many datasets that could have
been observed
• the goal of statistical inference is learning the value of a statistic in the population (or
general pattern) represented by our data
• estimate of statistic: within each sample the calculated value of the statistic
• sampling distribution: the distribution of the statistic across the repeated samples
• Standard Error (SE): the standard deviation of the sampling distribution
▪ any particular estimate is likely to be an erroneous estimate of the true value
▪ the magnitude of that typical error is one SE
• Properties of the Sampling distribution
▪ Unbiasedness
• the average computed from a representative sample is an unbiased estimate of
the average in the entire population
▪ Asymptotic/Approximate normality
• looking at larger and larger samples, the closer and closer the sampling
distribution is to normal
▪ Root-n convergence
• the SE is inversely proportional to the square root of the sample size
• larger samples → smaller SE (shrinking in proportion to the square root of the
sample size)
• Calculating SE
▪ In reality, we don't get to observe the sampling distribution – our dataset is one of
the many potential samples that could have been drawn from the population, or
general pattern
▪ Formula
• $SE(\bar{x}) = \frac{1}{\sqrt{n}}\,\mathrm{Std}[x]$
• the observations of x should be independent of each other in the data


▪ Bootstrapping
• a method to create synthetic samples that are similar but different
• bootstrap method takes the original dataset and draws many repeated samples
of the size of that dataset
• the trick is that the samples are drawn with replacement
• Bootstrap sample: the result is a sample of the same size as the original data
• Bootstrap estimate of SE: the standard deviation of the estimated values of the
statistic across the bootstrap samples
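A minimal bootstrap sketch for the SE of the mean (numpy assumed; data made up), compared with the formula above:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5, scale=2, size=200)  # the original dataset

# many repeated samples of the same size, drawn with replacement
boot_means = [rng.choice(x, size=len(x), replace=True).mean()
              for _ in range(10_000)]

se_boot = np.std(boot_means)            # bootstrap estimate of the SE
se_formula = x.std() / np.sqrt(len(x))  # SE from the formula
print(se_boot, se_formula)              # very close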

The confidence interval


• defines a range where we can expect the true value in the population, or the general
pattern
• gives a range for the true value with a probability
• probability tells how likely it is that the true value is in that range
• 95% → 1.96 SE
• 90% → 1.65 SE
• 99% → 2.58 SE
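A minimal sketch of a 95% confidence interval for the mean (numpy assumed; data made up):

import numpy as np

x = np.random.default_rng(0).normal(loc=5, scale=2, size=200)

mean = x.mean()
se = x.std() / np.sqrt(len(x))
ci_95 = (mean - 1.96 * se, mean + 1.96 * se)  # range expected to contain the true value
print(ci_95)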

External validity
• High external validity: if our data is close to the population or the general pattern we care
about
• The most important challenges to external validity may be collected in three groups:
▪ Time
• we have data on the past, but we care about the future
• we need to assume that the general pattern that was relevant in the past will
remain relevant in the future
▪ Space
• our data is on one country
• would these patterns occur in other countries, too?
▪ Sub-groups
• our data is on 25-30 year old people
• would the pattern hold for younger/older people?

6. Testing hypotheses
The logic of testing hypotheses
• Hypothesis testing: analyze our data to make a decision on the hypothesis
• Reject the hypothesis if there is enough evidence against it
• s: the statistic we want to test
▪ ex: a mean, or a difference such as $s = r_{\mathrm{new}} - r_{\mathrm{old}}$
• $s_{\mathrm{true}}$: we are interested in the true value of s
• $\hat{s}$: the value of the statistic in our data is its estimated value
• Null hypothesis
▪ $H_0: s_{\mathrm{true}} = 0$
▪ the null is protected: it has to be hard to reject it, otherwise the conclusions of
hypothesis testing would not be strong
• Alternative hypothesis
▪ $H_A: s_{\mathrm{true}} \neq 0$
• Together, the null and the alternative cover all the possibilities we are interested in
• Two-sided alternative
▪ the case when we test $H_A: s_{\mathrm{true}} \neq 0$
▪ $s_{\mathrm{true}}$ can be greater or smaller than 0
• One-sided alternative
▪ we are interested in whether a statistic is positive
▪ $H_0: s_{\mathrm{true}} \le 0$
▪ $H_A: s_{\mathrm{true}} > 0$
▪ ex: testing the likelihood of a loss on a stock portfolio

t-Test
• We compare the estimated value of the statistic $\hat{s}$ (our best guess of
$s_{\mathrm{true}}$) to zero
• Evidence to reject the null = the difference between $\hat{s}$ and zero
▪ Reject if large: a large difference means the true value is unlikely to be zero
• The test statistic is a statistic that measures the distance of the estimated value from what
the true value would be if $H_0$ were true
• it combines the estimated value and the SE
▪ $t = \frac{\hat{s}}{SE(\hat{s})}$

• when s is the average of a variable x – testing whether the mean of the variable is equal to 0
▪ $t = \frac{\bar{x}}{SE(\bar{x})}$
• when s is the average of a variable x minus a number
▪ $t = \frac{\bar{x} - \mathrm{number}}{SE(\bar{x})}$
• when s is the difference between two averages – testing whether the mean of a variable is
the same in group A and group B
▪ $t = \frac{\bar{x}_A - \bar{x}_B}{SE(\bar{x}_A - \bar{x}_B)}$
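A minimal sketch of the two-group t-statistic (numpy assumed; the two groups are made up):

import numpy as np

rng = np.random.default_rng(0)
x_a = rng.normal(loc=10.0, scale=3, size=150)  # group A
x_b = rng.normal(loc=10.8, scale=3, size=130)  # group B

diff = x_a.mean() - x_b.mean()
se_diff = np.sqrt(x_a.var() / len(x_a) + x_b.var() / len(x_b))
t = diff / se_diff
print(t, abs(t) > 1.96)  # reject H0 of equal means at ~5% if True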
Making a decision
• The decision rule: comparing the test statistic to a pre-defined critical value
• Null hypothesis is true
▪ TN
• true negative
• correct
• don’t reject the null
▪ FP
• false positive
• Type- I error
• reject the null
• Null hypothesis is false
▪ FN
• false negative
• Type- II error
• don’t reject the null
▪ TP
• true positive
• correct
• reject the null
• A commonly applied critical value for a t-statistic is ±2 (or 1.96)
▪ reject the null if the t-statistic is smaller than −2 or larger than +2
• With a ±2 critical value, the probability of a false positive is 5%: there is a 5%
probability that we would reject the null if it were true
• Fixing the chance of false positives affects the chance of false negatives at the same time
• A false negative arises when the t-statistic is within the critical values and we don’t reject
the null even though the null is not true
• Making a false negative call is more likely when it is harder to make a decision
▪ Sample is small
▪ The difference between true value and null is small
• Size of the test: the probability of a false positive
• Level of significance: the maximum probability of false positives we tolerate
• Power of the test: the probability of avoiding a false negative

The p-Value
• the smallest significance level at which we can reject $H_0$, given the value of the test
statistic in the sample
• the probability that the test statistic will be as large as, or larger than, what we calculate
from the data, if the null hypothesis is true
• p-value: the probability of rejecting the null while it is true (the probability of a false
positive)
• Power: the probability of rejecting the null while it is false (the probability of avoiding a
false negative)
• reject the null if the p-value is less than the level of significance we set for ourselves
▪ otherwise don't reject it
• the p-value is never exactly zero
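A minimal sketch of turning a t-statistic into a two-sided p-value (scipy assumed; the t value is made up):

from scipy import stats

t = -2.3                       # a computed t-statistic (made up)
p = 2 * stats.norm.sf(abs(t))  # two-sided p-value under approximate normality
print(p, p < 0.05)             # reject at the 5% significance level if True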
