Midterm 1
1. Origins of data
What is data?
• Observations are also known as cases, or rows – they’re identified by identifiers or ID
variables
• Variables are sometimes called features or covariates
Survey
• Surveys collect data by asking people (respondents) and recording their answers
• self-administered survey
▪ cheap and efficient, can use visual aids
• interviews
Sampling
• All of those observations together are called the population
• Sampling is when we purposefully collect data on a subset of the population
• A sample is good if it represents the population
▪ all important variables have very similar distributions in the sample and the
population
▪ all patterns in the sample are very similar to the patterns in the population
• How can we tell if a sample is representative?
▪ Never for sure
▪ Benchmarking
• it looks at variables for which we know something in the population
• Those should be similar in the sample
• Random sampling is best
▪ Random sampling is a selection rule that is independent of any important variable
▪ Random sampling is the process that most likely leads to representative samples
▪ Provided the sample is large enough
2. Preparing data for analysis
Variable types
• Quantitative/Continuous variables
▪ are born as numbers
▪ special case is time
▪ Binary variables
• can take on two values
• yes/no answer to whether the observation belongs to some group
• dummy/indicator variables
binary variables with 0-1 values
• Flag
binary showing existence of some issue
• Qualitative/Categorical/Factor variables
▪ each value having a specific interpretation – brands, countries etc.
▪ string variable
• text in data
• Nominal qualitative variables
▪ take on values that cannot be unambiguously ordered
▪ Color, brands
• Ordinal/ordered variables
▪ take on values that are unambiguously ordered
▪ quantitative variables can be ordered
▪ some qualitative variables can be ordered, too
▪ Grades
• "Interval" variables
▪ ordered variables, with a difference between values that can be compared
▪ Degree Celsius, Price in dollar
• "Ratio" (or "scale") variables
▪ variables with the additional property: their ratios mean the same regardless of the
magnitudes
▪ implies a meaningful zero in the scale
▪ Distance in miles, Price in dollar
• Flow variables
▪ are the results of a process over a period of time
▪ government deficit last year
• Stock variables
▪ refer to quantities at a given point in time
▪ the amount of government debt at the end of last year
Data cleaning
• Data wrangling is the process of transforming raw data to a set of data tables that can be
used for a variety of downstream purposes such as analytics
• the tidy data approach:
▪ Each observation forms a row
▪ Each variable forms a column
▪ Each type of observational unit forms a table
▪ Each observation has a unique identifier (ID)
• long format for xt data
▪ store xt data in data tables with each row referring to one cross-sectional unit
observed in one time period
▪ The next row then may be the same cross-sectional unit observed in the next time
period
• wide format for xt data
▪ one row would refer to one cross-sectional unit, and different time periods are
represented in different columns
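As an illustration, a small pandas sketch of the same xt (panel) data stored in long format and converted to wide format and back; the table and column names are made up:

```python
import pandas as pd

# Hypothetical xt (panel) data in long format: one row per country-year
long_df = pd.DataFrame({
    "country": ["AUT", "AUT", "BEL", "BEL"],
    "year": [2019, 2020, 2019, 2020],
    "gdp": [445, 433, 535, 506],
})

# Long -> wide: one row per country, one gdp column per year
wide_df = long_df.pivot(index="country", columns="year", values="gdp")

# Wide -> long: back to one row per country-year
back_to_long = wide_df.reset_index().melt(
    id_vars="country", var_name="year", value_name="gdp"
)
```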
Relational data
• a concept of organizing information
• Each row is a record (observation) identified with a unique identifier ID (key)
• rows in a table can be linked to rows in other tables with a column for the unique ID of the
linked row (foreign ID)
Linking data
• Matching (joining) depends on data structure
▪ one-to-one (1:1) matching
• Football teams and their stadiums
▪ many-to-one (m:1) or one-to-many (1:m) matching
• Football teams and their players
▪ many-to-many (m:m) matching
• Football teams and all of the managers they have ever had
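A sketch of the 1:1 and m:1 cases as pandas merges; the table and column names are hypothetical, and the validate argument simply checks the assumed match structure:

```python
import pandas as pd

teams = pd.DataFrame({"team_id": [1, 2], "team": ["Ajax", "Porto"]})
stadiums = pd.DataFrame({"team_id": [1, 2], "stadium": ["Johan Cruijff Arena", "Estadio do Dragao"]})
players = pd.DataFrame({"player": ["A", "B", "C"], "team_id": [1, 1, 2]})

# 1:1 matching - each team has exactly one stadium
teams_stadiums = teams.merge(stadiums, on="team_id", validate="one_to_one")

# m:1 matching - many players link to one team via the foreign ID (team_id)
players_teams = players.merge(teams, on="team_id", validate="many_to_one")
```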
Data wrangling
• filter out duplicates
▪ some observations appearing more than once in the data
• entity identification and resolution
▪ would need to have unique IDs
▪ could be that two observations belong to two different entities although their ID is the same –
ambiguous identification
▪ could be that two observations have different IDs but belong to the same entity
• getting rid of non-entity observations
▪ Rows that do not belong to an entity we want in the data table
▪ Such as: a summary row in a table that adds up, or averages, variables across all,
or some, entities
• missing values
▪ some cases, missing just means "zero" or "no"
• we should simply recode (replace) the missing values as "zero" or as "no"
▪ missing systematically
• some survey respondents may not know the answer to a question or refuse to
answer it
• selection bias
benchmarking: comparing the distribution of variables that are available for all
observations
▪ sometimes, informative if missing
• create a new variable (called a flag) to capture the missing value and use this
variable instead of the original
▪ imputation – filling in some information
• ordinal variables
you may add missing as a new value, or recode missing to a neutral value:
e.g. high, average, low, with missing recoded as average
• quantitative variables
recode with the mean or the median
• if you impute
create a flag and use it in the analysis
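A minimal pandas sketch of the flag-plus-imputation idea for a quantitative variable (the variable names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [120.0, np.nan, 95.0, np.nan, 210.0]})

# flag the missing values before imputing, so the information is not lost
df["price_missing_flag"] = df["price"].isna().astype(int)

# impute the median for the missing quantitative values
df["price_imputed"] = df["price"].fillna(df["price"].median())
```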
Probability
• Probability is a measure of the likelihood of an event
• Probability as a generalization of relative frequencies in datasets
• Subjective probabilities
▪ ex: the probability that I like a book enough to keep it for my further studies
Histograms
• Histogram reveals important properties of a distribution
• Number and location of modes
▪ these are the peaks in the distribution that stand out from their immediate
neighborhood
• Approximate regions for center and tails
• Symmetric or not
▪ asymmetric (skewed) distributions have a long left tail or a long right tail
• Extreme values
▪ values that are very different from the rest
▪ extreme values are at the far end of the tails of histograms
▪ need conscious decision
• Density plots
▪ instead of bars, it shows a continuous curve
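A quick sketch of drawing both plots with pandas/matplotlib; the data here is simulated, not from any real dataset:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# simulated, right-skewed "price" data
prices = pd.Series(np.random.default_rng(1).lognormal(mean=4, sigma=0.5, size=1000))

fig, (ax1, ax2) = plt.subplots(1, 2)
prices.plot.hist(bins=30, ax=ax1, title="Histogram")   # bars: frequencies per bin
prices.plot.density(ax=ax2, title="Density plot")      # smooth continuous curve
plt.show()
```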
Summary statistics
• Sample mean
▪ $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Expected value
▪ $E[x]$
▪ For a quantitative variable, the expected value is the mean
• Quantiles
▪ a quantile is the value that divides the observations in the dataset into two parts in
specific proportions
• Median
▪ the middle value of the distribution
• Percentiles
▪ divide the data into two parts along a certain percentage
• Quartiles
▪ divide the data into two parts along fourths
▪ 1st quartile has one quarter of the observations below and three quarters above - it
is the 25th percentile
• Mode
▪ The mode is the value with the highest frequency in the data
• Central Tendency
▪ The mean, median and mode are different statistics for the central value of the
distribution
• Statistics that measure the spread of distributions are the range, inter-quantile ranges, the
standard deviation and the variance
• Range
▪ the difference between the highest value (the maximum) and the lowest value (the
minimum) of a variable
• Inter-quantile ranges
▪ the difference between two quantiles – e.g., the interquartile range is the difference between
the third quartile (the 75th percentile) and the first quartile (the 25th percentile)
• Standard deviation
▪ Its square is the variance
▪ $Std[x] = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n}}$
• Variance
▪ the average squared difference of each observed value from the mean
• Standardized value
▪ standardized value of a variable shows the difference from the mean in units of
standard deviation
▪ $x^{standardized} = \frac{x - \bar{x}}{Std[x]}$
• Skewness
▪ When the distribution is symmetric its mean and median are the same
▪ When it is skewed with a long right tail the mean is larger than the median
▪ When a distribution is skewed with a long left tail the mean is smaller than the
median
▪ $Skewness = \frac{\bar{x} - median[x]}{Std[x]}$
• Visualizing summary statistics
▪ Measures of central value: mean, median, quantiles, percentiles
▪ Measures of spread: range, inter-quantile range, variance, standard deviation
▪ Measure of skewness: mean–median difference
▪ Box plot: visual representation of many quantiles and extreme values
▪ Violin plot: mixes elements of a box plot and a density plot
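These summary statistics can be computed directly; a minimal pandas sketch with made-up numbers, using the population (divide-by-n) formulas from the notes:

```python
import pandas as pd

x = pd.Series([1, 2, 2, 3, 5, 8, 13, 40])

mean, median = x.mean(), x.median()
q1, q3 = x.quantile(0.25), x.quantile(0.75)

rng = x.max() - x.min()                    # range
iqr = q3 - q1                              # inter-quartile range
var = x.var(ddof=0)                        # variance (divide by n)
std = x.std(ddof=0)                        # standard deviation
z_scores = (x - mean) / std                # standardized values
skew_measure = (mean - median) / std       # mean-median skewness measure from the notes
```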
Distributions
• Normal distribution
▪ bell-shaped
▪ the variable can take any value
▪ captured by two parameters: µ (the mean) and σ (the standard deviation)
▪ symmetric: the median, mean (and mode) are the same
▪ ex: height, IQ, etc.
• Lognormal distribution
▪ asymmetrically distributed with long right tails
▪ start from a normally distributed RV x and transform it: $e^{x}$ is distributed lognormal
▪ always non-negative
▪ ex: firm size, income
• Power law/Pareto distribution
▪ distributions with very large extreme values (a long right tail) are well approximated by a power law
▪ ex: city population, wealth
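A sketch of simulating the three distribution families with numpy; all parameters are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(42)

normal = rng.normal(loc=170, scale=10, size=10_000)            # e.g. height in cm
lognormal = np.exp(rng.normal(loc=10, scale=1, size=10_000))   # e^x of a normal RV: non-negative, long right tail
pareto = (rng.pareto(a=1.5, size=10_000) + 1) * 1_000          # power law: occasional very large extreme values
```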
Statistical dependence
• Dependence of two variables/Statistical dependence: the conditional distributions of one
variable (y) are not the same when conditioning on different values of the other variable (x)
• Independence of variables: the conditional distribution of y given x is the same, regardless of the value
of x
• Mean dependence
▪ conditional expectation E[y|x] varies with the value of x
▪ the extent to which conditional expectations (means) differ
▪ Two variables are positively mean-dependent: if the average of one variable tends
to be larger when the value of the other variable is larger, too
• Covariance
▪ mean dependence measure
▪ the more often a positive $x_i - \bar{x}$ goes together with a positive $y_i - \bar{y}$, the larger and
more positive the covariance
▪ $Cov[x, y] = \frac{1}{n}\sum_i (x_i - \bar{x})(y_i - \bar{y})$
• Correlation coefficient
▪ The correlation coefficient is the standardized version of the covariance
▪ $Corr[x, y] = \frac{Cov[x, y]}{Std[x]\, Std[y]}$
▪ $-1 \le Corr[x, y] \le 1$
• E[y|x] = E[y]
▪ holds for any x if the two variables are independent
▪ the reverse is untrue
• variables can have zero correlation but still be mean-dependent
e.g. a symmetric U-shaped conditional expectation
• variables can have zero correlation and zero mean dependence without being completely independent
e.g. the spread of y may be different for different values of x
• The covariance and the correlation coefficient can be computed for all kinds of variables, including binary
variables and ordered qualitative variables as well as quantitative variables
• Uncorrelated: the correlation coefficient is zero
• Positively correlated: the two variables tend to move in the same direction
• Negatively correlated: they tend to move in opposite directions
▪ ex: distance and price
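A sketch of computing the covariance, the correlation coefficient, and a binned conditional mean with pandas; the distance/price data is simulated to be negatively related:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
distance = rng.uniform(0, 10, size=500)
price = 200 - 10 * distance + rng.normal(0, 20, size=500)   # negatively related by construction
df = pd.DataFrame({"distance": distance, "price": price})

cov = df["distance"].cov(df["price"])     # covariance
corr = df["distance"].corr(df["price"])   # correlation coefficient, between -1 and 1

# mean dependence: conditional expectation E[price | distance bin]
cond_mean = df.groupby(pd.cut(df["distance"], bins=5), observed=True)["price"].mean()
```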
Latent variables
• Latent variable: such variables are not part of an actual dataset – they can’t be observed,
because they’re too abstract
▪ ex: quality of firm management
• Proxy variables: we answer questions about latent variables by substituting observed
variables for them – these are the proxy variables
• Using a single variable
• Using a sum
▪ we would need to bring them to a common scale
▪ this standardized measure is called a "z-score" or "score"
• Principal component analysis – PCA
▪ The weights are constructed in such a way that observed variables that are better
measures receive higher weights
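A sketch of both proxy constructions (an average of z-scores and a PCA score) using scikit-learn; the observed variables standing in for the latent "management quality" are invented for illustration:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# hypothetical observed proxies for the latent variable
df = pd.DataFrame({
    "monitoring": [3, 4, 2, 5, 1],
    "targets":    [2, 5, 1, 4, 2],
    "incentives": [4, 4, 2, 5, 1],
})

z = StandardScaler().fit_transform(df)          # bring the variables to a common scale
df["z_score"] = z.mean(axis=1)                  # simple average of the standardized variables

pca = PCA(n_components=1)
df["pca_score"] = pca.fit_transform(z).ravel()  # first principal component as the proxy
```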
Sources of variation in x
• If no variation in the conditioning variable
▪ all observations have the same values
▪ impossible to make comparisons
• Generalization: the more variation there is in the conditioning variable, the better the
chances for comparison
• Experimental data
▪ controlled variation
▪ interpretation is easy if the conditioning variable is experimentally controlled
▪ the experiment makes sure that differences in the outcome variable are due to that variable only
• Observational data
▪ most data used in business, economics and policy analysis are observational
▪ harder to interpret
▪ hard - many other things may be different when the value of the conditioning
variable differs
5. Generalizing from data
Generalization
• inference: the act of generalization
• Statistical inference
▪ uses statistical methods to make inferences
▪ our data may represent a population if it is a representative sample
▪ to be able to generalize to it, the general pattern needs to exist and it needs to be
stable over time and space
• General patterns:
▪ Representative sample and population
• well-defined population
• random sampling is the best way to achieve a representative sample
▪ No population but general pattern
• the specific pattern is the same
ex: likelihood, conditional probability, conditional mean
• External validity
▪ assessing whether our data represents the same general pattern that would be
relevant for the situation we truly care about
▪ externally valid case: the situation we care about and the data we have represent
the same general pattern
▪ no external validity: whatever we learn from our data may turn out not to be
relevant at all
• The process of inference
▪ 1. Use statistical inference to learn about the population, or general pattern, that
our data represents
▪ 2. Assess external validity: define the population, or general pattern we are
interested in and assess how it compares to the population, or general pattern, that
our data represents.
• We want to infer the true value of the statistic after having computed its estimated value
from actual data
Repeated Samples
• the conceptual background to statistical inference
• the basic idea is that the data we observe is one example of many datasets that could have
been observed
• the goal of statistical inference is learning the value of a statistic in the population (or
general pattern) represented by our data
• estimate of statistic: within each sample the calculated value of the statistic
• sampling distribution:the distribution of the statistic
• Standard Error (SE): the standard deviation of the sampling distribution
▪ any particular estimate is likely to be an erroneous estimate of the true value
▪ the magnitude of that typical error is one SE
• Properties of the Sampling distribution
▪ Unbiasedness
• the average computed from a representative sample is an unbiased estimate of
the average in the entire population
▪ Asymptotic/Approximate normality
• the larger the sample, the closer the sampling distribution is to normal
▪ Root-n convergence
• the SE is inversely proportional to the square root of the sample size
• larger samples → smaller SE (with a proportionality factor of the square root
of the sample size)
• Calculating SE
▪ In reality, we don’t get to observe the sampling distribution - That dataset is one of
the many potential samples that could have been drawn from the population, or
general pattern
▪ Formula
• $SE(\bar{x}) = \frac{Std[x]}{\sqrt{n}}$
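A small simulation sketch of the repeated-samples idea: draw many samples from a made-up population, compute the mean in each, and compare the standard deviation of those sample means with the Std[x]/√n formula:

```python
import numpy as np

rng = np.random.default_rng(7)
population = rng.lognormal(mean=10, sigma=1, size=200_000)   # made-up population

n, n_samples = 200, 2_000
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()     # one estimate per repeated sample
    for _ in range(n_samples)
])

se_simulated = sample_means.std()             # SD of the sampling distribution
se_formula = population.std() / np.sqrt(n)    # Std[x] / sqrt(n), close to the simulated SE
```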
External validity
• High external validity: if our data is close to the population or the general pattern we care
about
• The most important challenges to external validity may be collected in three groups:
▪ Time
• we have data on the past, but we care about the future
• we need to assume that the general pattern that was relevant in the past will
remain relevant in the future
▪ Space
• our data is on one country
• would these patterns occur in other countries, too?
▪ Sub-groups
• our data is on 25-30 year old people
• would the pattern hold for younger / older people?
6. Testing hypotheses
The logic of testing hypotheses
• Hypothesis testing: analyze our data to make a decision on the hypothesis
• Reject the hypothesis if there is enough evidence against it
• s: the statistic we want to test
▪ ex: the mean
• $s_{true}$: we are interested in the true value of s
t-Test
• We compare the estimated value of the statistic $\hat{s}$ (our best guess of $s_{true}$) to zero
• Evidence to reject the null = the difference between $\hat{s}$ and zero
▪ reject if large: a large difference means the true value is unlikely to be zero
• The test statistic is a statistic that measures the distance of the estimated value from what
the true value would be if $H_0$ were true
• it combines the estimated value and the SE
▪ $t = \frac{\hat{s}}{SE(\hat{s})}$
• when s is the average of a variable x – testing whether the mean of the variable is equal to 0
▪ $t = \frac{\bar{x}}{SE(\bar{x})}$
• when s is the average of a variable x minus a number
▪ $t = \frac{\bar{x} - \text{number}}{SE(\bar{x})}$
• when s is the difference between two averages – whether the mean of a variable is the
same in group A and group B
▪ $t = \frac{\bar{x}_A - \bar{x}_B}{SE(\bar{x}_A - \bar{x}_B)}$
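A sketch of computing these t-statistics by hand with numpy on simulated data; the SE of the difference of two means is computed with the unequal-variance formula, which is one common choice:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=0.4, scale=2.0, size=200)     # variable whose mean we test
x_a = rng.normal(loc=1.0, scale=2.0, size=150)   # group A
x_b = rng.normal(loc=0.5, scale=2.0, size=180)   # group B

def se_mean(v):
    # SE of the mean: Std[v] / sqrt(n)
    return v.std(ddof=1) / np.sqrt(len(v))

t_zero = x.mean() / se_mean(x)                   # H0: the mean of x is 0
t_number = (x.mean() - 0.5) / se_mean(x)         # H0: the mean of x equals 0.5

se_diff = np.sqrt(se_mean(x_a) ** 2 + se_mean(x_b) ** 2)
t_diff = (x_a.mean() - x_b.mean()) / se_diff     # H0: the means are equal in groups A and B
```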
Making a decision
• The decision rule : comparing the test statistic to a pre-defined critical value
• Null hypothesis is true
▪ TN
• true negative
• correct
• don’t reject the null
▪ FP
• false positive
• Type- I error
• reject the null
• Null hypothesis is false
▪ FN
• false negative
• Type- II error
• don’t reject the null
▪ TP
• true positive
• correct
• reject the null
• A commonly applied critical value for a t-statistic is ±2 (or 1.96)
▪ reject the null if the t-statistic is smaller than −2 or larger than +2
• With a ±2 critical value, the probability of a false positive is 5%: there is a 5% probability
that we would reject the null even though it is true (False Positive)
• Fixing the chance of false positives affects the chance of false negatives at the same time
• A false negative arises when the t-statistic is within the critical values and we don’t reject
the null even though the null is not true
• Making a false negative call is more likely when it is harder to make a decision
▪ Sample is small
▪ The difference between true value and null is small
• Size of the test: the probability of a false positive
• Level of significance: the maximum probability of false positives we tolerate
• Power of the test: the probability of avoiding a false negative
The p-Value
• the smallest significance level at which we can reject $H_0$, given the value of the test
statistic in the sample
• the probability that the test statistic will be as large as, or larger than, what we calculate
from the data, if the null hypothesis is true
• p-value: the probability of rejecting the null while it is true (the probability of a False
Positive)
• Power: probability rejecting the null while it is false (probability of avoiding False
Negative)
• reject the null if the p-value is less than the level of significance we set for ourselves
▪ otherwise don’t reject it
• the p-value is never exactly zero
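A sketch of the full decision with scipy: compute the t-statistic and two-sided p-value for the null that the mean is zero, then compare the p-value to a 5% level of significance (the data is simulated):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
x = rng.normal(loc=0.3, scale=1.0, size=100)

# built-in test of H0: the mean of x equals 0
t_stat, p_value = stats.ttest_1samp(x, popmean=0)

alpha = 0.05                     # level of significance we set for ourselves
reject_null = p_value < alpha    # reject the null if the p-value is below alpha
```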