1. Data Science
Time: Sep 9, 2024
Every week on Monday, until Dec 23, 2024, 16 occurrence(s)
Join Zoom Meeting
https://zoom.us/j/97619662629?pwd=vgjXmYMLObqb6NbVKeUIASv417YrGB.1
Meeting ID: 976 1966 2629
Passcode: 869227
Revised schedule:
Week 1: 2024-09-09: Chapter 1: Introduction to Data Science (Joel_1)
Week 2: 2024-09-16: Chapter 2: Python Foundations - Libraries (Joel_2,3, Wes_1,2)
Week 3: 2024-09-23: ibid. (Joel_3,4, Wes_1,2)
Week 4: 2024-09-30: Chapter 3: Statistics Foundations (Joel_4,5,7, Wes_4)
Week 5: 2024-10-07: ibid.
Week 6: 2024-10-14: Chapter 4: Probability (Joel_6,7)
Week 7: 2024-10-21: ibid.
Week 8: 2024-10-28: Chapter 5: Getting Data (Joel_9)
Week 9: 2024-11-04: Chapter 6: Working with Data (Joel_10)
Week 10: 2024-11-11: Chapter 7: Machine Learning Algorithms (Joel_11,…,19)
Week 11: 2024-11-18: ibid.
Week 12: 2024-11-25: Chapter 8: Network Analysis (Joel_21)
Week 13: 2024-12-02: Chapter 9: Recommender Systems (Joel_22)
Week 14: 2024-12-09: Chapter 10: Databases and SQL (Joel_23)
Week 15: 2024-12-16: Chapter 11: MapReduce (Joel_24)
Course Details
- Chapter 1: Introduction to Data Science
- Chapter 2: Python Foundations - Libraries
Pandas, NumPy, Arrays and Matrix handling, Data Visualization, Exploratory Data Analysis (EDA)
- Chapter 3: Statistics Foundations
Basic/Descriptive Statistics, Distributions (Binomial, Poisson, etc.), Bayes, Inferential Statistics
- Chapter 4: Probability
Dependence and Independence, Conditional Probability, Bayes’s Theorem, Random Variables, The Normal Distribution
- Chapter 5: Getting Data
Reading Files, Scraping the Web, Using APIs
- Chapter 6: Working with Data
Exploring Your Data, Cleaning and Munging, Manipulating Data, Rescaling, Dimensionality Reduction
- Chapter 7: Machine Learning Algorithms
k-Nearest Neighbors, Naive Bayes, Linear Regression, Multiple Regression, Logistic Regression, Decision Trees, Neural
Networks, Clustering.
- Chapter 8: Network Analysis
Examples (data as a network versus a network used to represent dependence among variables), determining important nodes and edges
in a network, clustering in a network
- Chapter 9: Recommender Systems
Manual Curation, Recommending What's Popular, User-Based Collaborative Filtering, Item-Based Collaborative Filtering
- Chapter 10: Databases and SQL
Create Table and Insert, Update, Delete, Select, Group By, Order By, Join, Subqueries, Indexes, Query Optimization, NoSQL.
- Chapter 11: MapReduce
Why MapReduce, Examples in Analyzing Status Updates, Examples in Matrix Multiplication
- Chapter 12: Examples from research and business cases
Week 7: 2024-10-21: (Joel_7)
- Chapter 3: Statistics Foundations
Basic/Descriptive Statistics, Distributions (Binomial, Poisson, etc.), Bayes, Inferential Statistics
[0-1], [0-2] Joel:
7. Hypothesis and Inference
Statistical Hypothesis Testing / Example: Flipping a Coin / p-Values / Confidence Intervals / p-Hacking / Example: Running an A/B Test / Bayesian Inference
https://www.scribbr.com/statistics/type-i-and-type-ii-errors/
Figure 8.2 (drawn under the assumption that H₀ is true, so that the curve centers at μ₀) [4]
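A minimal sketch of the chapter's "flipping a coin" test (assuming 1000 flips under H₀: p = 0.5 and the normal approximation to the binomial; the observed count of 530 heads is illustrative):

```python
from scipy import stats

def normal_approximation_to_binomial(n, p):
    """mu and sigma of the normal approximating Binomial(n, p)."""
    mu = n * p
    sigma = (n * p * (1 - p)) ** 0.5
    return mu, sigma

# H0: the coin is fair (p = 0.5), flipped n = 1000 times.
mu_0, sigma_0 = normal_approximation_to_binomial(1000, 0.5)

# Suppose we observe 530 heads; two-sided p-value under H0.
z = (530 - mu_0) / sigma_0
p_value = 2 * stats.norm.sf(abs(z))
print(f"p-value = {p_value:.3f}")  # about 0.06: not significant at the 5 % level
```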
p-Values [0-2]
Confidence Intervals [0-2]
p-Hacking
https://www.iro.umontreal.ca/~dift3913/cours/papers/cohen1994_The_earth_is_round.pdf
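In the spirit of the book's p-hacking discussion, a sketch of why testing many hypotheses invites spurious findings: simulate 1000 fair coins, test each at the 5 % level, and count how many look "unfair" by chance alone (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def run_experiment(n_flips=1000):
    """Flip a fair coin n_flips times; return the number of heads."""
    return int(rng.integers(0, 2, size=n_flips).sum())

def reject_fairness(num_heads, n_flips=1000):
    """Two-sided 5 % test via the normal approximation:
    reject if the head count falls outside mu +/- 1.96*sigma."""
    mu = 0.5 * n_flips
    sigma = (n_flips * 0.25) ** 0.5
    return abs(num_heads - mu) > 1.96 * sigma

false_positives = sum(reject_fairness(run_experiment()) for _ in range(1000))
print(false_positives)  # around 50: 5 % of perfectly fair coins get "rejected"
```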
Example: Running an A/B Test
A/B testing (also known as bucket testing, split-run testing, or split testing) is
a user experience research method. A/B tests consist of a randomized
experiment that usually involves two variants (A and B), although the concept
can be also extended to multiple variants of the same variable. It includes
application of statistical hypothesis testing or "two-sample hypothesis testing"
as used in the field of statistics. A/B testing is a way to compare multiple
versions of a single variable, for example by testing a subject's response to
variant A against variant B, and determining which of the variants is more
effective.
Example of A/B testing on a website: by randomly serving visitors two versions of a website that differ only in the design of a single button element, the relative efficacy of the two designs can be measured.
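A minimal sketch of the two-sample proportion test behind such an experiment (the visitor and click counts are hypothetical; SciPy's normal distribution supplies the two-sided p-value):

```python
import math
from scipy import stats

def two_sample_proportion_z(n_a, clicks_a, n_b, clicks_b):
    """z statistic and two-sided p-value for the difference of two rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    # standard error of (p_b - p_a), using each group's own variance
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = (p_b - p_a) / se
    return z, 2 * stats.norm.sf(abs(z))

# Hypothetical data: 1000 visitors per variant, 200 vs. 180 clicks.
z, p = two_sample_proportion_z(1000, 200, 1000, 180)
print(f"z = {z:.3f}, p = {p:.3f}")  # p well above 0.05: no significant difference
```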
From Greenland et al. (2016), "Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations":
Abstract: Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been
decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts
that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics
requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand
has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously
so—and yet these misinterpretations dominate much of the scientific literature. In light of this problem, we
provide definitions and a discussion of basic statistics that are more general and critical than typically found in
traditional introductory expositions. Our goal is to provide a resource for instructors, researchers, and consumers
of statistics whose knowledge of statistical theory and technique may be limited but who wish to avoid and spot
misinterpretations. We emphasize how violation of often unstated analysis protocols (such as selecting analyses
for presentation based on the P values they produce) can lead to small P values even if the declared test
hypothesis is correct, and can lead to large P values even if that hypothesis is incorrect. We then provide an
explanatory list of 25 misinterpretations of P values, confidence intervals, and power. We conclude with
guidelines for improving statistical interpretation and reporting.
Common misinterpretations of single P values
1. The P value is the probability that the test hypothesis is true; for example, if a test of the null
hypothesis gave P = 0.01, the null hypothesis has only a 1 % chance of being true; if instead it gave P =
0.40, the null hypothesis has a 40 % chance of being true. No!
2. The P value for the null hypothesis is the probability that chance alone produced the observed
association; for example, if the P value for the null hypothesis is 0.08, there is an 8 % probability that
chance alone produced the association. No!
3. A significant test result (P ≤ 0.05) means that the test hypothesis is false or should be rejected. No!
4. A nonsignificant test result (P > 0.05) means that the test hypothesis is true or should be accepted.
No!
5. A large P value is evidence in favor of the test hypothesis. No!
6. A null-hypothesis P value greater than 0.05 means that no effect was observed, or that absence of an
effect was shown or demonstrated. No!
7. Statistical significance indicates a scientifically or substantively important relation has been detected.
No!
8. Lack of statistical significance indicates that the effect size is small. No!
9. The P value is the chance of our data occurring if the test hypothesis is true; for example, P = 0.05
means that the observed association would occur only 5 % of the time under the test hypothesis. No!
10. If you reject the test hypothesis because P ≤ 0.05, the chance you are in error (the chance your
‘‘significant finding’’ is a false positive) is 5 %. No!
11. P = 0.05 and P ≤ 0.05 mean the same thing. No!
12. P values are properly reported as inequalities (e.g., report ‘‘P < 0.02’’ when P = 0.015 or report
‘‘P > 0.05’’ when P = 0.06 or P = 0.70). No!
13. Statistical significance is a property of the phenomenon being studied, and thus statistical tests detect
significance. No!
14. One should always use two-sided P values. No!
Common misinterpretations of P value comparisons and predictions
15. When the same hypothesis is tested in different studies and none or a minority of the tests are
statistically significant (all P > 0.05), the overall evidence supports the hypothesis. No!
16. When the same hypothesis is tested in two different populations and the resulting P values are on
opposite sides of 0.05, the results are conflicting. No!
17. When the same hypothesis is tested in two different populations and the same P values are obtained,
the results are in agreement. No!
18. If one observes a small P value, there is a good chance that the next study will produce a P value at
least as small for the same hypothesis. No!
Common misinterpretations of confidence intervals
19. The specific 95 % confidence interval presented by a study has a 95 % chance of containing the true
effect size. No!
20. An effect size outside the 95 % confidence interval has been refuted (or excluded) by the data. No!
21. If two confidence intervals overlap, the difference between two estimates or studies is not significant.
No!
22. An observed 95 % confidence interval predicts that 95 % of the estimates from future studies will fall
inside the observed interval. No!
23. If one 95 % confidence interval includes the null value and another excludes that value, the interval
excluding the null is the more precise one. No!
24. If you accept the null hypothesis because the null P value exceeds 0.05 and the power of your test is
90 %, the chance you are in error (the chance that your finding is a false negative) is 10 %. No!
25. If the null P value exceeds 0.05 and the power of this test is 90 % at an alternative, the results support
the null over the alternative. This claim seems intuitive to many, but counterexamples are easy to construct. No!
(The authors add the emphatic "No!" to underscore statements that are not only fallacious but
also not "true enough for practical purposes.")
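To see why item 19 fails, a small simulation sketch (assuming normally distributed data): the 95 % describes the long-run coverage of the interval-constructing procedure, not the chance that one specific interval contains the truth.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_mu, sigma, n, trials = 5.0, 2.0, 30, 10_000

covered = 0
for _ in range(trials):
    sample = rng.normal(true_mu, sigma, size=n)
    # 95 % t-interval for the mean, computed from this one sample
    lo, hi = stats.t.interval(0.95, df=n - 1,
                              loc=sample.mean(), scale=stats.sem(sample))
    covered += (lo <= true_mu <= hi)

print(f"coverage: {covered / trials:.3f}")  # ~0.95 over repeated samples;
# any single interval either contains true_mu or it does not
```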
Bayesian Inference
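The chapter treats the same coin with a Beta prior, where updating is just bookkeeping; a minimal sketch (the uniform prior and the flip counts are illustrative):

```python
def bayesian_update(alpha, beta, heads, tails):
    """Beta(alpha, beta) prior + binomial data -> Beta posterior."""
    return alpha + heads, beta + tails

# Uniform prior Beta(1, 1); observe 530 heads and 470 tails.
alpha_post, beta_post = bayesian_update(1, 1, 530, 470)
posterior_mean = alpha_post / (alpha_post + beta_post)
print(f"posterior mean for p(heads) = {posterior_mean:.3f}")  # ~0.530
```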
Today's quiz, Week 7, 2024/10/21
[0] Comments on this lecture are welcome.
[1] Simulation of the Central Limit Theorem (a solution sketch follows below):
[1-1] Choose and set an initial population by combining distribution functions from SciPy.
[1-2] Draw sets of samples of various sizes, compute the sample means, and plot their distribution, similar to the figure on the next slide.
[1-3] Discuss the result and show that the CLT works.
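One possible solution sketch for item [1] (the exponential-plus-uniform mixture population and the plotting details are assumptions, not the required answer):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def draw_population(size):
    """Initial population: a 50/50 mixture of an exponential and a
    uniform distribution from SciPy, deliberately non-normal."""
    a = stats.expon(scale=2.0).rvs(size=size, random_state=rng)
    b = stats.uniform(loc=0.0, scale=10.0).rvs(size=size, random_state=rng)
    return np.where(rng.random(size) < 0.5, a, b)

n_samples = 5000             # repeated samples per sample size
sample_sizes = [2, 10, 50]   # sample sizes to compare

fig, axes = plt.subplots(1, len(sample_sizes), figsize=(12, 3))
for ax, n in zip(axes, sample_sizes):
    # draw n_samples samples of size n and keep each sample's mean
    means = draw_population((n_samples, n)).mean(axis=1)
    ax.hist(means, bins=50, density=True)
    # overlay the normal density the CLT predicts for large n
    mu, sd = means.mean(), means.std(ddof=1)
    grid = np.linspace(means.min(), means.max(), 200)
    ax.plot(grid, stats.norm(mu, sd).pdf(grid))
    ax.set_title(f"sample size n = {n}")
plt.tight_layout()
plt.show()
```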
[4] Douglas S. Shafer and Zhiyi Zhang, Introductory Statistics
The Central Limit Theorem (revisited)
Definition (characteristic function)
Properties (characteristic function)
The Central Limit Theorem (revisited)
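For reference, the standard definition and the CLT argument that these slide titles point to (a sketch; the slides' own notation may differ):

```latex
% Characteristic function of a random variable X:
\varphi_X(t) = \mathbb{E}\!\left[e^{itX}\right], \qquad t \in \mathbb{R}.

% CLT sketch: for i.i.d. X_1, X_2, \dots with mean \mu and variance \sigma^2,
% let Z_n = \frac{1}{\sigma\sqrt{n}} \sum_{i=1}^{n} (X_i - \mu). Then
\varphi_{Z_n}(t)
  = \left[ \varphi_{(X_1-\mu)/\sigma}\!\left( \tfrac{t}{\sqrt{n}} \right) \right]^{n}
  = \left( 1 - \frac{t^2}{2n} + o\!\left( \tfrac{1}{n} \right) \right)^{n}
  \longrightarrow e^{-t^2/2},
% the characteristic function of N(0, 1); convergence of characteristic
% functions gives Z_n \Rightarrow N(0, 1) by Lévy's continuity theorem.
```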