Quiz2 Source

data mining quiz

Uploaded by

Martina Mendez

Chapter 2
Getting to Know Your Data

2.1 Exercises

1. Give three additional commonly used statistical measures (i.e., not illustrated in this chapter) for the characterization of data dispersion, and discuss how they can be computed efficiently in large databases.

Answer:

Data dispersion, also known as variance analysis, is the degree to which numeric data tend to spread and can be characterized by such statistical measures as mean deviation, measures of skewness, and the coefficient of variation.

The mean deviation is defined as the arithmetic mean of the absolute deviations from the mean and is calculated as:

$$\text{mean deviation} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n} \tag{2.1}$$

where $\bar{x}$ is the arithmetic mean of the values and $n$ is the total number of values. This value will be greater for distributions with a larger spread.

A common measure of skewness is:

$$\frac{\bar{x} - \text{mode}}{s} \tag{2.2}$$

which indicates how far (in standard deviations, $s$) the mean ($\bar{x}$) is from the mode and whether it is greater or less than the mode.

The coefficient of variation is the standard deviation expressed as a percentage of the arithmetic mean and is calculated as:

$$\text{coefficient of variation} = \frac{s}{\bar{x}} \times 100 \tag{2.3}$$

The variability in groups of observations with widely differing means can be compared using this measure.

Note that all of the input values used to calculate these three statistical measures are algebraic measures (Chapter 4). Thus, the value for the entire database can be efficiently calculated by partitioning the database, computing the values for each of the separate partitions, and then merging these values into an algebraic equation that can be used to calculate the value for the entire database.

The measures of dispersion described here were obtained from: Statistical Methods in Research and Production, fourth ed., edited by Owen L. Davies and Peter L. Goldsmith, Hafner Publishing Company, NY:NY, 1972.

2. Suppose that the data for analysis includes the attribute age.
The age values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.

(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data's modality (i.e., bimodal, trimodal, etc.).
(c) What is the midrange of the data?
(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
(e) Give the five-number summary of the data.
(f) Show a boxplot of the data.
(g) How is a quantile-quantile plot different from a quantile plot?

Answer:

(a) What is the mean of the data? What is the median?
The (arithmetic) mean of the data is: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = 809/27 \approx 30$. The median (middle value of the ordered set, as the number of values in the set is odd) of the data is: 25.

(b) What is the mode of the data? Comment on the data's modality (i.e., bimodal, trimodal, etc.).
This data set has two values that occur with the same highest frequency and is, therefore, bimodal. The modes (values occurring with the greatest frequency) of the data are 25 and 35.

(c) What is the midrange of the data?
The midrange (average of the largest and smallest values in the data set) of the data is: (70 + 13)/2 = 41.5.

(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
The first quartile (corresponding to the 25th percentile) of the data is: 20. The third quartile (corresponding to the 75th percentile) of the data is: 35.

(e) Give the five-number summary of the data.
The five-number summary of a distribution consists of the minimum value, first quartile, median value, third quartile, and maximum value. It provides a good summary of the shape of the distribution and for this data is: 13, 20, 25, 35, 70.

(f) Show a boxplot of the data.
See Figure 2.1.

(g) How is a quantile-quantile plot different from a quantile plot?
A quantile plot is a graphical method used to show the approximate percentage of values below or equal to the independent variable in a univariate distribution. Thus, it displays quantile information for all the data, where the values measured for the independent variable are plotted against their corresponding quantile.

A quantile-quantile plot, however, graphs the quantiles of one univariate distribution against the corresponding quantiles of another univariate distribution. Both axes display the range of values measured for their corresponding distribution, and points are plotted that correspond to the quantile values of the two distributions. A line (y = x) can be added to the graph, along with points representing where the first, second, and third quantiles lie, in order to increase the graph's informational value. Points that lie above such a line indicate a correspondingly higher value for the distribution plotted on the y-axis than for the distribution plotted on the x-axis at the same quantile. The opposite effect is true for points lying below this line.

Figure 2.1: A boxplot of the data in Exercise 2.2.

3. Suppose that the values for a given set of data are grouped into intervals. The intervals and corresponding frequencies are as follows.

age       frequency
1-5       200
6-15      450
16-20     300
21-50     1500
51-80     700
81-110    44

Compute an approximate median value for the data.

Answer:

$L_1 = 20$, $n = 3194$, $(\sum \text{freq})_l = 950$, $\text{freq}_{median} = 1500$, width = 30, so median = 20 + ((3194/2 - 950)/1500) × 30 = 32.94 years.

4. Suppose a hospital tested the age and body fat data for 18 randomly selected adults with the following result:

age     23     23     27     27     39     41     47     49     50
%fat    9.5    26.5   7.8    17.8   31.4   25.9   27.4   27.2   31.2

age     52     54     54     56     57     58     58     60     61
%fat    34.6   42.5   28.8   33.4   30.2   34.1   32.9   41.2   35.7

(a) Calculate the mean, median and standard deviation of age and %fat.
(b) Draw the boxplots for age and %fat.
(c) Draw a scatter plot and a q-q plot based on these two variables.

Answer:

(a) Calculate the mean, median and standard deviation of age and %fat.
For the variable age the mean is 46.44, the median is 51, and the standard deviation is 12.85. For the variable %fat the mean is 28.78, the median is 30.7, and the standard deviation is 8.99.

(b) Draw the boxplots for age and %fat.
See Figure 2.2.

Figure 2.2: A boxplot of the variables age and %fat in Exercise 2.4.

(c) Draw a scatter plot and a q-q plot based on these two variables.
See Figure 2.3.

5. Briefly outline how to compute the dissimilarity between objects described by the following:
(a) Nominal attributes
(b) Asymmetric binary attributes
(c) Numeric attributes
(d) Term-frequency vectors

Answer:

(a) Nominal attributes
A categorical variable is a generalization of the binary variable in that it can take on more than two states.
The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:

$$d(i, j) = \frac{p - m}{p} \tag{2.4}$$

where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is the total number of variables.
Alternatively, we can use a large number of binary variables by creating a new binary variable for each of the M nominal states. For an object with a given state value, the binary variable representing that state is set to 1, while the remaining binary variables are set to 0.

(b) Asymmetric binary attributes
If all binary variables have the same weight, we have the contingency Table 2.1.

                      object j
                  1        0        sum
object i    1     q        r        q + r
            0     s        t        s + t
            sum   q + s    r + t    p

Table 2.1: A contingency table for binary variables.
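The counts in Table 2.1 and the mismatch ratio of Equation (2.4) can be sketched in a few lines of Python. This is a minimal illustration, not part of the original solution: the function names and the sample objects are hypothetical, binary attributes are assumed to be coded as 0/1, and the last function uses the asymmetric binary dissimilarity (r + s)/(q + r + s), which leaves out the negative matches t.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Ratio of mismatches (p - m) / p between two objects with nominal attributes."""
    if len(obj_i) != len(obj_j) or not obj_i:
        raise ValueError("objects must have the same non-zero number of attributes")
    p = len(obj_i)                                        # total number of variables
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)    # number of matches
    return (p - m) / p


def one_hot(value, states):
    """Encode one nominal value as binary variables, one per possible state."""
    return [1 if value == state else 0 for state in states]


def contingency(obj_i, obj_j):
    """Return the counts (q, r, s, t) of Table 2.1 for two 0/1 vectors."""
    pairs = list(zip(obj_i, obj_j))
    q = pairs.count((1, 1))   # both objects are 1
    r = pairs.count((1, 0))   # i is 1, j is 0
    s = pairs.count((0, 1))   # i is 0, j is 1
    t = pairs.count((0, 0))   # both objects are 0 (negative matches)
    return q, r, s, t


def asymmetric_binary_dissimilarity(obj_i, obj_j):
    """(r + s) / (q + r + s): negative matches t are ignored."""
    q, r, s, _t = contingency(obj_i, obj_j)
    if q + r + s == 0:
        # Assumption of this sketch: treat the all-negative case as undefined.
        raise ValueError("dissimilarity is undefined when q + r + s = 0")
    return (r + s) / (q + r + s)


# Hypothetical sample objects, not taken from the exercises.
print(nominal_dissimilarity(["red", "small", "round"], ["red", "large", "round"]))
print(one_hot("green", ["red", "green", "blue"]))
print(contingency([1, 0, 1, 0, 0, 0], [1, 0, 1, 0, 1, 0]))
print(asymmetric_binary_dissimilarity([1, 0, 1, 0, 0, 0], [1, 0, 1, 0, 1, 0]))
```

Counting the four cells once and deriving the measure from them keeps the code close to the table, so a symmetric variant could reuse `contingency` unchanged.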
In computing the dissimilarity between asymmetric binary variables, the number of negative matches, t, is considered unimportant and thus is ignored in the computation, that is,

$$d(i, j) = \frac{r + s}{q + r + s} \tag{2.5}$$

(c) Numeric attributes
Use Euclidean distance, Manhattan distance, or supremum distance. Euclidean distance is defined as

$$d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{in} - x_{jn})^2} \tag{2.6}$$

where $i = (x_{i1}, x_{i2}, \ldots, x_{in})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jn})$ are two n-dimensional data objects.
The Manhattan (or city block) distance is defined as

$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}| \tag{2.7}$$

The supremum distance is

$$d(i, j) = \lim_{h \to \infty} \left( \sum_{f=1}^{n} |x_{if} - x_{jf}|^h \right)^{1/h} = \max_{f} |x_{if} - x_{jf}| \tag{2.8}$$

(d) Term-frequency vectors
To measure the distance between complex objects represented by vectors, it is often easier to abandon traditional metric distance computation and introduce a nonmetric similarity function. For example, the similarity between two vectors, x and y, can be defined as a cosine measure, as follows:

$$s(x, y) = \frac{x^t \cdot y}{\|x\| \, \|y\|} \tag{2.9}$$

where $x^t$ is a transposition of vector x, $\|x\|$ is the Euclidean norm of vector x, $\|y\|$ is the Euclidean norm of vector y, and s is essentially the cosine of the angle between vectors x and y. The Euclidean norm of vector $x = (x_1, x_2, \ldots, x_n)$ is defined as $\sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$. Conceptually, it is the length of the vector.

6. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using h = 3.
(d) Compute the supremum distance between the two objects.

Answer:

(a) Compute the Euclidean distance between the two objects.
The Euclidean distance is computed using Equation (2.6). Therefore, we have $\sqrt{(22 - 20)^2 + (1 - 0)^2 + (42 - 36)^2 + (10 - 8)^2} = \sqrt{45} = 6.7082$.

(b) Compute the Manhattan distance between the two objects.
The Manhattan distance is computed using Equation (2.7). Therefore, we have |22 - 20| + |1 - 0| + |42 - 36| + |10 - 8| = 11.

(c) Compute the Minkowski distance between the two objects, using h = 3.
The Minkowski distance is

$$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{in} - x_{jn}|^h} \tag{2.10}$$

where h is a real number such that $h \geq 1$. Therefore, with h = 3, we have $\sqrt[3]{|22 - 20|^3 + |1 - 0|^3 + |42 - 36|^3 + |10 - 8|^3} = \sqrt[3]{233} = 6.1534$.

(d) Compute the supremum distance between the two objects.
The supremum distance is computed using Equation (2.8). Therefore, we have a supremum distance of 6.

7. The median is one of the most important holistic measures in data analysis. Propose several methods for median approximation. Analyze their respective complexity under different parameter settings and decide to what extent the real value can be approximated. Moreover, suggest a heuristic strategy to balance between accuracy and complexity and then apply it to all methods you have given.

Answer:

This question can be dealt with either theoretically or empirically, but doing some experiments to get the result is perhaps more interesting. We can give students some data sets sampled from different distributions, e.g., uniform and Gaussian (both symmetric) and exponential and gamma (both skewed). For example, if we use Equation (2.11) to do approximation as proposed in the chapter, the most straightforward way is to divide all data into k equal-length intervals.

$$\text{median} = L_1 + \left( \frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{median}} \right) \times \text{width} \tag{2.11}$$

where $L_1$ is the lower boundary of the median interval, N is the number of values in the entire data set, $(\sum \text{freq})_l$ is the sum of the frequencies of all of the intervals that are lower than the median interval, $\text{freq}_{median}$ is the frequency of the median interval, and width is the width of the median interval. Obviously, the error incurred will decrease as k becomes larger and larger; however, the time used in the whole procedure will also increase. Let's analyze this kind of relationship more formally. It seems the product of error made and time used is a good optimality measure. From this point, we can do many tests for each type of distribution (so that the result won't be dominated by randomness) and find the k giving the best trade-off. In practice, this parameter value can be chosen to improve system performance.

There are also some other approaches to approximate the median; students can propose them, analyze the best trade-off point, and compare the results among different approaches. A possible way is as follows: hierarchically divide the whole data set into intervals. At first, divide it into k regions and find the region in which the median resides; then divide this particular region into k sub-regions and find the sub-region in which the median resides; and so on. We iterate this until the width of the sub-region reaches a predefined threshold, and then the median approximation formula stated above is applied. By doing this, we can confine the median to a smaller area without globally partitioning all data into shorter intervals, which is expensive (the cost is proportional to the number of intervals).

8. It is important to define or select similarity measures in data analysis. However, there is no commonly accepted subjective similarity measure. Results can vary depending on the similarity measures used. Nonetheless, seemingly different similarity measures may be equivalent after some transformation.
Suppose we have the following two-dimensional data set:

       A1     A2
x1     1.5    1.7
x2     2      1.9
x3     1.6    1.8
x4     1.2    1.5
x5     1.5    1.0

(a) Consider the data as two-dimensional data points. Given a new data point, x = (1.4, 1.6) as a query, rank the database points based on similarity with the query using Euclidean distance, Manhattan distance, supremum distance, and cosine similarity.
(b) Normalize the data set to make the norm of each data point equal to 1. Use Euclidean distance on the transformed data to rank the data points.
Answer:

(a) Use Equation (2.6) to compute the Euclidean distance, Equation (2.7) to compute the Manhattan distance, Equation (2.8) to compute the supremum distance, and Equation (2.9) to compute the cosine similarity between the input data point and each of the data points in the data set. Doing so yields the following table:

       Euclidean dist.   Manhattan dist.   supremum dist.   cosine sim.
x1     0.1414            0.2               0.1              0.99999
x2     0.6708            0.9               0.6              0.99575
x3     0.2828            0.4               0.2              0.99997
x4     0.2236            0.3               0.2              0.99903
x5     0.6083            0.7               0.6              0.96536

These values produce the following rankings of the data points based on similarity:
Euclidean distance: x1, x4, x3, x5, x2
Manhattan distance: x1, x4, x3, x5, x2
Supremum distance: x1, x4, x3, x5, x2 (x3 and x4 are tied, as are x2 and x5)
Cosine similarity: x1, x3, x4, x2, x5

(b) The normalized query is (0.65850, 0.75258). The normalized data set is given by the following table:

       A1        A2
x1     0.66162   0.74984
x2     0.72500   0.68875
x3     0.66436   0.74741
x4     0.62470   0.78087
x5     0.83205   0.55470

Recomputing the Euclidean distances as before yields

       Euclidean dist.
x1     0.00415
x2     0.09217
x3     0.00781
x4     0.04409
x5     0.26320

which results in the final ranking of the transformed data points: x1, x3, x4, x2, x5.

2.2 Supplementary Exercises

1. Briefly outline how to compute the dissimilarity between objects described by ratio-scaled variables.

Answer:

Three methods include:
• Treat ratio-scaled variables as interval-scaled variables, so that the Minkowski, Manhattan, or Euclidean distance can be used to compute the dissimilarity.
• Apply a logarithmic transformation to a ratio-scaled variable f having value $x_{if}$ for object i by using the formula $y_{if} = \log(x_{if})$. The $y_{if}$ values can be treated as interval-valued.
• Treat $x_{if}$ as continuous ordinal data, and treat their ranks as interval-scaled variables.
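As a cross-check on Exercise 8 in Section 2.1, the distances and rankings can be recomputed with a short Python sketch. It assumes the five data points and the query x = (1.4, 1.6) given in that exercise; the helper names are illustrative and not from the original solution. The supremum ranking is left out because it contains exact ties, so its order depends on tie-breaking.

```python
import math

# Data set and query from Exercise 8 (Section 2.1).
points = {
    "x1": (1.5, 1.7),
    "x2": (2.0, 1.9),
    "x3": (1.6, 1.8),
    "x4": (1.2, 1.5),
    "x5": (1.5, 1.0),
}
query = (1.4, 1.6)


def euclidean(a, b):
    """Equation (2.6): square root of the summed squared differences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def manhattan(a, b):
    """Equation (2.7): sum of absolute differences."""
    return sum(abs(u - v) for u, v in zip(a, b))


def supremum(a, b):
    """Equation (2.8): largest absolute difference over all attributes."""
    return max(abs(u - v) for u, v in zip(a, b))


def cosine(a, b):
    """Equation (2.9): dot product divided by the product of Euclidean norms."""
    dot = sum(u * v for u, v in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def normalize(a):
    """Scale a vector so that its Euclidean norm equals 1."""
    norm = math.hypot(*a)
    return tuple(u / norm for u in a)


def rank(score, larger_is_closer=False):
    """Order point names from most to least similar to the query."""
    return sorted(points, key=lambda name: score(points[name], query),
                  reverse=larger_is_closer)


euclidean_rank = rank(euclidean)
manhattan_rank = rank(manhattan)
cosine_rank = rank(cosine, larger_is_closer=True)

# Part (b): Euclidean distance after normalizing every vector to unit length.
unit_query = normalize(query)
normalized_rank = sorted(
    points, key=lambda name: euclidean(normalize(points[name]), unit_query)
)

for name, p in points.items():
    print(name, round(euclidean(p, query), 4), round(manhattan(p, query), 1),
          round(supremum(p, query), 1), round(cosine(p, query), 5))
print(euclidean_rank, manhattan_rank, cosine_rank, normalized_rank)
```

The ranking on the normalized data matches the cosine ranking. This is expected: for unit vectors the squared Euclidean distance equals 2 - 2·cos, so the two measures order the points identically, which illustrates the exercise's remark that seemingly different measures can be equivalent after a transformation.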
