Data Management

How to manage data

Uploaded by

22101740
Module 4.1.1 Data Management
- Practice of organizing and maintaining data processes to meet ongoing
  information lifecycle needs

Measures of central tendency

1. Mean
- The sum of all values in a dataset divided by the total number of values
Pros: Can be used with both discrete and continuous data
Cons: It is susceptible to the influence of outliers

1.1 Arithmetic Mean (population mean)
- Average of a complete set of data (population)
1.2 Sample Mean
- The average of a subset of data taken from a larger population (sample)
1.3 Weighted Mean
- An average computed by giving different weights to some of the individual values
- Formula: x̄ = Σ(w·x) / Σw
Legend:
x = repeating value
w = weight (number of occurrences) of that value
x̄ = mean

2. Median
- Middle score for a set of data that has been arranged in order of magnitude
- Formula (location of median): median position = (n+1)/2
Legend:
n = total number of data values in the sample
Note:
* if n is odd, the median is the middle value
* if n is even, the median is the average of the 2 middle values

3. Mode
- The most frequent score in the data (the value that appears most often)

Type of variable (based on level of measurement) | Best measure
nominal                                          | mode
ordinal                                          | median
interval/ratio (not skewed)                      | mean
interval/ratio (skewed)                          | median

Legend:
Nominal = data can be categorized
Ordinal = can be categorized and ranked
Interval = can be categorized, ranked, and evenly spaced
Ratio = can be categorized, ranked, evenly spaced, and has a natural zero
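The measures of central tendency in these notes (mean, weighted mean, median, mode) can be sketched in Python as follows. This is only an illustration: the data values and the helper names `weighted_mean` and `median_by_position` are assumptions made for the example, and the standard-library `statistics` functions are used as a cross-check.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Weighted mean: sum of (weight * value) divided by the sum of weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


def median_by_position(values):
    """Median found via the (n + 1) / 2 location rule on sorted data."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2  # 1-based location of the median
    if n % 2 == 1:
        # Odd n: the position is a whole number, so take that value.
        return ordered[int(position) - 1]
    # Even n: the position falls between two values, so average them.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2


# Hypothetical scores, used only to exercise the functions.
scores = [70, 85, 85, 90, 100]

print("mean:", mean(scores))
print("median:", median_by_position(scores), "(statistics.median:", median(scores), ")")
print("mode(s):", multimode(scores))
# A value of 80 with weight 3 and a value of 90 with weight 1.
print("weighted mean:", weighted_mean([80, 90], [3, 1]))
```

`multimode` returns a list because a dataset can have more than one most frequent value.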
Examples of levels of measurement w/ explanation

nominal: Ethnicity
- Can't be ranked
ordinal: Top 5 Olympic medalists
- Does not tell you how close or far they are in terms of number of wins
interval: Temperature in Celsius
- There are equal intervals of one degree, but the zero point is not a true
  zero, since the measurement can go below zero degrees Celsius
ratio: Height
- It is non-negative

Measures of variation
- Gives information on the spread or variability of the data values
- Although 2 datasets could have the same center, their variation could be
  very different

1. Range
- Used to compare obvious data sets
- Formula: Range = highest value - lowest value
- Considered the weakest measure of spread
Note: heavily influenced by extreme values (outliers) and only compares 2 values

2. Variance (σ²)
- A measure of how far a set of data is dispersed from the mean
- It is non-negative since each term in the variance is squared
- All units are squared, e.g., a set of weights in kg will give a variance in
  kg squared

3. Standard deviation (σ)
- Measures the deviation of data from its mean

Formulas (variance and standard deviation):
Population variance: σ² = Σ(xi - μ)² / N
Sample variance: Σ(xi - x̄)² / (n - 1)
Standard deviation: square root of the variance
Legend:
xi = each individual value
μ = population mean
x̄ = sample mean
N = number of values in the population
n = number of values in the sample

Steps on how to calculate:
1. Calculate the mean
2. Subtract the mean from each value
3. Square each deviation
4. Add up all the squared deviations
5. Divide by the number of values
5.1 for a population, divide by N
5.2 for a sample, divide by n - 1
6. For standard deviation:
6.1 take the square root of the variance

4. Coefficient of variation
- This kind of measure allows two or more distributions measured in the same
  or different units to be compared
- How big the standard deviation is compared to the mean
- Formula: CV = σ / x̄ x 100%
Legend:
σ = standard deviation
x̄ = mean
Note: the lower the CV, the lesser the dispersion of data values

5. Interquartile Range (IQR)
- Defines the difference between the third and the first quartile
- Formula: IQR = upper quartile - lower quartile = Q3 - Q1
Note:
Q1 = 25% of the data falls below this value
Q2 = 50% of the data falls below this value
Q3 = 75% of the data falls below this value
How to calculate the quartiles:
1. Arrange the data (smallest to largest)
2. Find Q2 (median)
3. Find Q1 (median of the lower half of data)
4. Find Q3 (median of the upper half of data)
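The measures of variation above (range, variance, standard deviation, coefficient of variation, IQR) can be sketched in Python as below. The function names and the dataset are assumptions made for illustration. One detail the notes leave open is how to split the data when n is odd; this sketch assumes the median itself is left out of both halves.

```python
from math import sqrt
from statistics import median


def data_range(values):
    """Range = highest value - lowest value."""
    return max(values) - min(values)


def variance(values, sample=False):
    """Follow the steps: mean, deviations, squares, sum, then divide."""
    n = len(values)
    mean_value = sum(values) / n
    squared_deviations = [(x - mean_value) ** 2 for x in values]
    divisor = n - 1 if sample else n  # n - 1 for a sample, N for a population
    return sum(squared_deviations) / divisor


def std_dev(values, sample=False):
    """Standard deviation = square root of the variance."""
    return sqrt(variance(values, sample))


def coefficient_of_variation(values, sample=False):
    """CV = standard deviation / mean x 100 (a percentage)."""
    mean_value = sum(values) / len(values)
    return std_dev(values, sample) / mean_value * 100


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # Assumption: with odd n, the middle value is excluded from both halves.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower_half), median(ordered), median(upper_half)


# Hypothetical dataset whose mean is 5 and population variance is 4.
data = [2, 4, 4, 4, 5, 5, 7, 9]
q1, q2, q3 = quartiles(data)

print("range:", data_range(data))
print("population variance:", variance(data))
print("sample variance:", variance(data, sample=True))
print("population standard deviation:", std_dev(data))
print("CV (%):", coefficient_of_variation(data))
print("Q1, Q2, Q3:", q1, q2, q3, "IQR:", q3 - q1)
```

Other quartile conventions exist (for example, interpolation-based ones), so results from statistical software may differ slightly from this method.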

Correlation (Pearson)
- Measure describing the way two variables vary together
- Statistic that measures the strength and direction of a linear relationship
  between two quantitative variables
- r represents the correlation coefficient
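The notes do not give a formula for r, so the sketch below uses the standard definition as an assumption: the sum of the products of the deviations of x and y from their means, divided by the square root of the product of their summed squared deviations. The function name `pearson_r` and the data are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired observations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together around their means.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable varies on its own.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return co_deviation / sqrt(ss_x * ss_y)


# Hypothetical paired data: hours studied and quiz score.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

r = pearson_r(hours, scores)
print(f"r = {r:.2f}")
```

For this data r is about 0.77. The sign of r gives the direction of the relationship and its absolute value gives the strength.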

Size of Correlation       | Interpretation
.90 to 1 (-.90 to -1)     | Very high positive (negative) correlation
.70 to .90 (-.70 to -.90) | High positive (negative) correlation
.50 to .70 (-.50 to -.70) | Moderate positive (negative) correlation
.30 to .50 (-.30 to -.50) | Low positive (negative) correlation
.00 to .30 (.00 to -.30)  | Negligible correlation

Module 4.4 Simple Regression
- A statistical tool that is used in the quantification of the relationship
  between a single independent variable and a single dependent variable,
  based on observations that have been carried out in the past

Formula for regression: y = a + bx
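As a minimal sketch of fitting y = a + bx, the code below uses the usual least-squares formulas for the slope b and the intercept a. Those formulas are an assumption here, since the formula images in the source did not survive extraction. The names `fit_line` and `predict` and the data are hypothetical, and n is taken to be the number of (x, y) pairs.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("slope is undefined when all x values are equal")
    # Slope: change in y for every 1-unit increase in x.
    b = (n * sum_xy - sum_x * sum_y) / denominator
    # Intercept: mean of y minus slope times mean of x.
    a = sum_y / n - b * (sum_x / n)
    return a, b


def predict(a, b, x):
    """Predicted y for a given x on the fitted line."""
    return a + b * x


# Hypothetical paired data: x is the independent variable, y the dependent one.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
print(f"y = {a:.2f} + {b:.2f}x")
print(f"predicted y at x = 6: {predict(a, b, 6):.2f}")
```

For this data the slope is 0.6 and the intercept is about 2.2, so each extra unit of x raises the predicted y by 0.6.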


Legend:
y = dependent variable
x = independent variable
a = intercept (value of y when x = 0)
b = slope (change in y for every 1 unit increase in x)
------------------------------------------------------
Formula for slope: b = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²)
n = number of data pairs (x, y)
------------------------------------------------------
Formula for intercept a: a = ȳ - b·x̄
x̄ and ȳ: represent the means of x and y
*Means must be calculated first before you can get the intercept
------------------------------------------------------
Interpretation of Regression
"For every 1 extra unit of the independent variable (x), the dependent
variable (y) will change by (b). If the independent variable (x) is 0, the
predicted value will be (a)."

Hypothesis Testing
- A method of making statistical decisions using experimental data
- Used to test assumptions (claims) about a population parameter based on
  sample data

Null Hypothesis (H₀)
- Represents the default or status quo assumption
- Assumes no effect, no difference, or no relationship between variables
- Always contains equality ( =, ≥, ≤ )
Purpose: to test whether there's enough evidence against it
Examples:
● H₀: μ = 50 (the population mean is 50)
● H₀: p ≥ 0.7 (the population proportion is at least 70%)
Note: If evidence is strong, we reject H₀

Alternative Hypothesis (H₁ or Ha)
- Represents what you aim to support or prove
- Indicates the presence of an effect, difference, or relationship
- Always contains inequality ( ≠, >, < )
Purpose: proposed if H₀ is rejected
Examples:
● H₁: μ ≠ 50 (the mean is not 50)
● H₁: p < 0.7 (the proportion is less than 70%)
Note: Direction depends on research question
● One-tailed: H₁ uses < or >
● Two-tailed: H₁ uses ≠

Steps in Hypothesis Testing
1. State H₀ and H₁
2. Choose a significance level (α, usually 0.05)
3. Collect and analyze sample data
4. Compute test statistic (e.g., z, t)
5. Compare with critical value or use p-value
6. Make a decision:
○ If p-value ≤ α → Reject H₀
○ If p-value > α → Fail to reject H₀

Type I and Type II Errors
- Types of incorrect decisions that may occur in hypothesis testing
- Related to the truth or falsity of the null hypothesis (H₀) and what
  decision is made based on the data

Type I Error (α)
- Occurs when we reject the null hypothesis (H₀) even though it is actually
  true
- Also called a false positive
- We detect an effect that isn't really there
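The decision rule in hypothesis testing (reject H₀ when the p-value is at most α) can be sketched with a one-sample z-test. This is an illustration under stated assumptions: the population standard deviation is treated as known, the function name `z_test`, its parameters, and the numbers are hypothetical, and the standard normal distribution comes from Python's `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n, alpha=0.05, tail="two-sided"):
    """One-sample z-test of H0: mu = mu0, assuming sigma is known.

    tail is "two-sided" (H1: mu != mu0), "less" (H1: mu < mu0),
    or "greater" (H1: mu > mu0).
    """
    # Test statistic: how many standard errors the sample mean is from mu0.
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    standard_normal = NormalDist()
    if tail == "two-sided":
        p_value = 2 * (1 - standard_normal.cdf(abs(z)))
    elif tail == "less":
        p_value = standard_normal.cdf(z)
    elif tail == "greater":
        p_value = 1 - standard_normal.cdf(z)
    else:
        raise ValueError("tail must be 'two-sided', 'less', or 'greater'")
    # Decision rule: reject H0 only when the p-value is at most alpha.
    decision = "reject H0" if p_value <= alpha else "fail to reject H0"
    return z, p_value, decision


# Hypothetical study: H0: mu = 50, with sigma = 6 and a sample of 36 values.
z, p_value, decision = z_test(sample_mean=52, mu0=50, sigma=6, n=36)
print(f"z = {z:.2f}, p-value = {p_value:.4f} -> {decision}")
```

Here z is 2 and the two-sided p-value is about 0.0455, so H₀ is rejected at α = 0.05. If H₀ were in fact true, that rejection would be a Type I error.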
Example (Type I error):
- A person is diagnosed with a disease (reject H₀) but is actually healthy
  (H₀ is true)
Controlled by: Significance level (α)
– Common value: α = 0.05

Type II Error (β)
- Occurs when we fail to reject the null hypothesis (H₀) even though it is
  actually false
- Also called a false negative
- We fail to detect an effect that is really there
Example:
- A person is told they're healthy (fail to reject H₀) but actually has the
  disease (H₀ is false)
Related to: Power of the test
– Power = 1 - β

Error Summary Table
Decision Made     | H₀ is True       | H₀ is False
Reject H₀         | Type I Error (α) | Correct decision
Fail to Reject H₀ | Correct decision | Type II Error (β)

Legend:
– H₀ = Null Hypothesis
– α = Probability of Type I Error
– β = Probability of Type II Error
– Power = Probability of detecting a true effect (1 − β)
Note:
– Lowering α reduces Type I errors but increases the chance of Type II errors
– Increasing sample size helps reduce both errors

Binomial Distribution
– Used when there are repeated trials, each with two possible outcomes:
  success or failure
– The probability of success stays the same for each trial
– Each trial is independent (one doesn't affect the others)
– Think: "yes or no" situations, repeated several times
Example:
– Flipping a coin 10 times and counting how many heads
– Guessing on a multiple-choice quiz with 5 questions
Conditions to use binomial:
– Fixed number of trials (n)
– Two outcomes: success/failure
– Constant probability (p)
– Trials are independent
Formula: P(X = k) = C(n, k) · p^k · (1 − p)^(n − k), where k = number of
successes

Poisson Distribution
– Used to count how often an event happens over time or space
– You don't know the number of trials, but you know the average rate
– Events happen randomly and independently
– Best for rare events
Example:
– Number of emails received in an hour
– Cars arriving at a toll booth in a minute
– Number of errors in a book
Conditions to use Poisson:
– Events occur randomly and independently
– Happen at a constant average rate (λ)
– Based on time, area, volume, or distance

Hypergeometric Distribution
– Used when sampling without replacement from a group
– The probability of success changes with each draw
– The trials are dependent on each other
– Useful when the population is small and known
Example:
– Drawing 5 cards from a deck without replacement and counting the number of
  kings
– Selecting students from a class and counting how many are seniors
– Picking colored balls from a bag and not putting them back
Conditions to use hypergeometric:
– Population is finite (N)
– Known number of successes in the population (K)
– Drawing without replacement (n)

Z-Score
– A Z-score tells you how far a data point is from the mean, in terms of
  standard deviations
– Helps standardize different data values for comparison
– A positive Z-score means the value is above the mean
– A negative Z-score means the value is below the mean
Steps to Calculate Z-score:
1. Find the mean of the data
2. Find the standard deviation
3. Subtract the mean from the data point
4. Divide the result by the standard deviation
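The four steps for a Z-score can be sketched as below. The function name `z_score` and the dataset are hypothetical, and the sketch assumes the population standard deviation (`statistics.pstdev`), since the notes do not say whether the population or sample version is meant.

```python
from statistics import mean, pstdev


def z_score(value, data):
    """Number of standard deviations that value lies from the mean of data."""
    data_mean = mean(data)          # step 1: find the mean
    data_sd = pstdev(data)          # step 2: find the standard deviation
    if data_sd == 0:
        raise ValueError("z-score is undefined when all values are equal")
    deviation = value - data_mean   # step 3: subtract the mean
    return deviation / data_sd      # step 4: divide by the standard deviation


# Hypothetical dataset with mean 5 and population standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print("z for 9:", z_score(9, data))  # above the mean, so positive
print("z for 4:", z_score(4, data))  # below the mean, so negative
```

With this data, 9 lies two standard deviations above the mean and 4 lies half a standard deviation below it.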

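For the binomial, Poisson, and hypergeometric distributions described in these notes, the probability of seeing exactly k successes (or events) can be sketched as below. The formulas are the standard ones, supplied here as an assumption because the formula images in the source were lost. The symbols n, p, λ, N, and K follow the notes, while k and the function names are introduced for illustration.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(exactly k events) when events occur at a constant average rate lam."""
    return lam ** k * exp(-lam) / factorial(k)


def hypergeometric_pmf(k, N, K, n):
    """P(exactly k successes in n draws without replacement).

    N is the population size and K is the number of successes in it.
    """
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


# Flipping a fair coin 10 times: probability of exactly 5 heads.
print(f"binomial: {binomial_pmf(5, 10, 0.5):.4f}")

# Hypothetical average of 2 emails per hour: probability of none in an hour.
print(f"Poisson: {poisson_pmf(0, 2):.4f}")

# Drawing 5 cards from a 52-card deck: probability of exactly 1 king.
print(f"hypergeometric: {hypergeometric_pmf(1, 52, 4, 5):.4f}")
```

The three results are about 0.2461, 0.1353, and 0.2995. Summing any of these functions over every possible k gives 1, which is a quick sanity check on the formulas.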