
Dev Unit 4

Unit IV BIVARIATE ANALYSIS

Relationships between Two Variables - Percentage Tables - Analysing Contingency Tables - Handling Several Batches - Scatterplots and Resistant Lines - Transformations.

Data Exploration and Visualization

1. Explain the Relationship between Two Variables

It is very important to understand the relationship between variables in order to draw the right conclusions from a statistical analysis. Without an understanding of this, you can fall into many of the pitfalls that accompany statistical analysis and infer wrong results from your data.

A correlation is a relationship between two variables. The relationship can be expressed in a numerical form called a correlation coefficient, which describes the strength and direction of the relationship. There are three possible results: a positive correlation, a negative correlation, and no correlation.

A positive correlation is a relationship between two variables in which both variables move in the same direction: one variable increases as the other increases, or one decreases while the other decreases. An example of positive correlation is height and weight. Taller people tend to be heavier.

A negative correlation is a relationship between two variables in which an increase in one variable is associated with a decrease in the other. An example of negative correlation is height above sea level and temperature. As you climb a mountain (increase in height) it gets colder (decrease in temperature).

A zero correlation exists when there is no relationship between the two variables.

2. What is a Percentage Table?

In this section we look at percentage tables, their construction and their interpretation, as well as how percentages may be used to summarize the effect of one variable upon another, and at how we can test the relationship between two variables in a table.

Proportions, percentages and probabilities

Two terms that students often get confused in statistics are probability and proportion. Here is the difference:

• Probability represents the chances of some event happening. It is theoretical.
• Proportion summarizes how frequently some event actually happened. It is empirical.

We often use probability when talking about the chances of some event happening in the future. By contrast, we often use proportion when describing how often some event actually happened in the past. A percentage is a number or ratio that can be expressed as a fraction of 100.

The following example illustrates the difference between probabilities and proportions.

Example 1: Probability vs. Proportion in Coin Flips

If we flip a fair coin, the probability that it will land on heads is 0.5, or 50%. However, if we flip a fair coin 20 times, we can actually count the proportion of times it landed on heads. For example, perhaps it landed on heads in 60% of the flips. The probability of landing on heads is theoretical, but the proportion of times the coin landed on heads is empirical: we could actually count it.

NS-SEC: a new measure of social class

The National Statistics Socio-Economic Classification (NS-SEC) is based on the Goldthorpe schema and has been constructed to measure employment relations and the conditions of occupations. The NS-SEC aims to classify positions in terms of the typical 'employment relations' attached to them, covering diverse labour market situations and work situations. The full version has eight classes:

1. Higher managerial and professional occupations
2. Lower managerial and professional occupations
3. Intermediate occupations (clerical, sales, service)
4. Small employers and own account workers
5. Lower supervisory and technical occupations
6. Semi-routine occupations
7. Routine occupations
8. Never worked or long-term unemployed

The three-class version is reduced to the following:

1. Higher occupations
2. Intermediate occupations
3. Lower occupations

Three main forms of employment regulation are distinguished. In a 'service relationship' (1), the employee renders 'service' to the employer in return for 'compensation' in terms of both immediate rewards (e.g. salary) and long-term or prospective benefits. In a 'labour contract' (2), employees give discrete amounts of labour in return for a wage calculated on the amount of work done or by time worked.
The labour contract is typical for Class 7 and, in weaker forms, for Classes 5 and 6. Intermediate forms of employment regulation that combine aspects of both forms (1) and (2) are typical in Class 3.

4. What is a Contingency Table?

A contingency table shows numerically what a chart of the same data shows graphically. It shows the distribution of each variable conditional upon each category of the other. The categories of one of the variables form the rows, and the categories of the other variable form the columns. Each individual case is then tallied in the appropriate cell depending on its value on both variables. The number of cases in each cell is called the cell frequency. Each row and column can have a total, presented at the right-hand end and at the bottom respectively; these are called the marginals, and the univariate distributions can be obtained from them. The figure shows a schematic contingency table with four rows and four columns.

[Figure: Anatomy of a contingency table]

We now know that the 663 respondents with higher professional parents who were in full-time education at age 19 represented 10.8 per cent of the total population aged 19 in 2005. But the table as a whole is scarcely more readable than the raw frequencies were, because there is nothing we can compare this 10.8 per cent with. For this reason, total percentage tables are not often constructed.

Panel (b) of figure 6.6 shows the percentage of young people within each category of social class background who are in each main activity grouping at age 19. The table was constructed by dividing each cell frequency by its appropriate row total.

[Figure 6.6, panel (b): Row percentages. Rows: parental occupation (NS-SEC): Higher professional, Lower professional, Intermediate, Lower supervisory, Routine, Other/unclassified. Columns: main activity at age 19: Full-time education, Govt. supported training, Full-time job, Part-time job, Out of work, Looking after home or family, Other, Total. Each row sums to 100.0.]

Tables that are percentaged along the rows are usually read down the columns, since reading along the rows would probably only confirm things we already know. The row percentages show the different outcomes for individuals with a particular social class background (an 'outflow table'). It is also possible to look at where the people in each main activity group came from: the 'inflow table'. This is shown in panel (c) of figure 6.6. Inflow and outflow tables focus attention on the data in rather different ways, and the researcher has to be clear about what questions are being addressed in the analysis in order to decide which way the percentages should be calculated.

[Figure 6.6, panel (c): Column percentages. Same rows and columns as panel (b), with a Total row; each column sums to 100.0.]

A well-designed table is easy to read, but it takes effort and many drafts to perfect. It pays to take care over the layout even of preliminary tables made for the analyst's own calculations. This can help reveal patterns in the data, and can save time at a later stage. Here are some guidelines on how to construct a lucid table of numerical data.

Labelling: The title of a table should be the first thing the reader looks at. A clear title should summarize the contents. It should be as short as possible, while at the same time making clear when the data were collected, the geographical unit covered, and the unit of analysis.
It is also helpful to number figures so that you can refer to them more succinctly in the text. Other parts of a table also need clear, informative labels. The variables included in the rows and columns must be clearly identified. Don't be tempted to use mnemonics in 'computerese'. You may have called household income 'HHINC' so many times you think everyone will know what it means; they won't.

Sample data: If data are based on a sample drawn from a wider population, this always needs special referencing. The reader must be given enough information to assess the adequacy of the sample.

Missing data: It is important to try to present the whole picture. Don't exclude cases from analysis, miss out particular categories of a variable or ignore particular attitudinal items in a set without good reason and without telling the reader what you are doing and why.

Definitions: There can be no hard and fast rule about how much definitional information to include in your tables. They could become unreadable if too much were included. If complex terms are explained elsewhere in the text, include a precise reference.

Opinion data: When presenting opinion data, always give the exact wording of the question put to respondents, including the response categories if these were read out. There can be big differences in replies to open questions and to forced-choice questions.

It should always be possible to convert a table back into the raw cell frequencies. To retain clarity, present the minimum number of base Ns needed for the entire frequency table to be reconstructed.

Grid lines can make the difference between a table that is easy to read and one which is not. In general, white space is preferable, but grid lines can help indicate how far a heading or subheading extends in a complex table of words or numbers. Clarity is often increased by reordering either the rows or the columns. It can be helpful to arrange them in increasing order of size.
Explain the Fundamentals of Hypothesis Testing

Hypothesis testing is a technique for interpreting and drawing inferences about a population based on sample data. It aids in determining which of two mutually exclusive population claims the sample data best support.

Null Hypothesis (H0): The null hypothesis is the assumption that the event will not occur, typically that there is no effect or no relationship in the population. A null hypothesis has no bearing on the study's outcome unless it is rejected. H0 is the symbol for it, and it is pronounced H-naught.

Alternate Hypothesis (H1 or Ha): The alternate hypothesis is the logical opposite of the null hypothesis. The acceptance of the alternative hypothesis follows the rejection of the null hypothesis. H1 is the symbol for it.

8. How to Analyze the Contingency Tables?

A contingency table displays frequencies for combinations of two categorical variables. Contingency tables classify outcomes for one variable in rows and the other in columns. The values at the row and column intersections are frequencies for each unique combination of the two variables.

For example, is there a relationship between gender and type of computer (Mac/PC)? The contingency table example below describes sales frequencies by the customer's gender and the type of computer purchased. It is a two-way table (2 X 2).

                 PC     Mac    Row Totals
Male             66      40       106
Female           30      87       117
Column Totals    96     127       223

In this contingency table, columns represent computer types and rows represent genders. Cell values are frequencies for each combination of gender and computer type. Totals are in the margins, with the grand total in the bottom-right margin. It is easy to see how two-way tables both organize your data and paint a picture of the results: you can see the frequencies for all possible subset combinations along with the totals.

Marginal and Conditional Distributions in Contingency Tables

Contingency tables are a fantastic way of finding marginal and conditional distributions. These two distributions are types of frequency distributions.

Marginal Distribution

These distributions represent the frequency distribution of one categorical variable without regard for other variables. Unsurprisingly, you can find these distributions in the margins of the contingency table, in the Row Totals and Column Totals.

For example, the marginal distribution of gender without considering computer type is the following:

• Males: 106
• Females: 117

Alternatively, the marginal distribution of computer type is the following:

• PC: 96
• Mac: 127

Conditional Distribution

For these distributions, you specify the value for one of the variables in the contingency table and then assess the distribution of frequencies for the other variable. In other words, you condition the frequency distribution for one variable by setting a value of the other variable. That might sound complicated, but it's easy using a contingency table. Just look across one row or down one column.

For example, the conditional distribution of computer type for females is the following:

• PC: 30
• Mac: 87

Alternatively, the conditional distribution of gender for Macs is the following:

• Males: 40
• Females: 87

Finding Relationships in a Contingency Table

In the contingency table below, the two categorical variables are gender and ice cream flavor preference. This is a two-way table (2 X 3) where each cell represents the number of times males and females prefer a particular ice cream flavor.

Gender    Chocolate   Strawberry   Vanilla   Total
Female       37           17          12       66
Male         21           18          32       71
Total        58           35          44      137

How do we identify a relationship between gender and flavor preference? By looking across the gender rows and comparing the counts. From the contingency table, females are more likely to prefer chocolate (37 vs. 21), while males prefer vanilla (32 vs. 12). The researchers surveyed 66 females and 71 males. Because the groups are roughly equal in size, we can compare the raw counts directly. However, when the groups are unequal, use percentages to compare them.

Row and column percentages help you draw conclusions when you have unequal numbers in the margins. In the contingency table example above, more women than men prefer chocolate, but how do we know that this is not simply an effect of group size? Use percentages to adjust for unequal group sizes.
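Returning to the Mac/PC table, its marginal and conditional distributions can be read off mechanically. The following is a minimal sketch in plain Python; the function names are illustrative, and the two PC cell counts (66 and 30) are derived here from the quoted totals (106, 117, 96) and the Mac column (40, 87), which is an assumption if your copy of the table differs.

```python
# Counts from the Mac/PC contingency table (rows: gender, columns: computer type).
# The PC cells are derived from the margins: 106 - 40 = 66 and 117 - 87 = 30.
table = {
    "Male": {"PC": 66, "Mac": 40},
    "Female": {"PC": 30, "Mac": 87},
}


def marginal_rows(t):
    """Marginal distribution of the row variable: sum across each row."""
    return {row: sum(cells.values()) for row, cells in t.items()}


def marginal_columns(t):
    """Marginal distribution of the column variable: sum down each column."""
    totals = {}
    for cells in t.values():
        for col, count in cells.items():
            totals[col] = totals.get(col, 0) + count
    return totals


def conditional_on_row(t, row):
    """Distribution of the column variable for one fixed row value."""
    return dict(t[row])


def conditional_on_column(t, col):
    """Distribution of the row variable for one fixed column value."""
    return {row: cells[col] for row, cells in t.items()}


gender_marginal = marginal_rows(table)
computer_marginal = marginal_columns(table)
type_given_female = conditional_on_row(table, "Female")
gender_given_mac = conditional_on_column(table, "Mac")

print("Gender marginal:", gender_marginal)
print("Computer marginal:", computer_marginal)
print("Computer type | Female:", type_given_female)
print("Gender | Mac:", gender_given_mac)
```

Marginals ignore the other variable entirely, while a conditional distribution is just one row or one column of the table taken on its own.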
Percentages are relative frequencies. Here's how to calculate row and column percentages in a two-way table:

• Row Percentage: Take a cell value and divide by the cell's row total.
• Column Percentage: Take a cell value and divide by the cell's column total.

For example, the row percentage of females who prefer chocolate is simply the number of observations in the Female/Chocolate cell divided by the row total for women: 37 / 66 = 56%. The column percentage for the same cell is the frequency of the Female/Chocolate cell divided by the column total for chocolate: 37 / 58 = 63.8%.

Interpreting Percentages in a Contingency Table

The contingency table below uses the same raw data as the previous table and adds row and column percentages. Note how the row percentages sum to 100% in the right margin, while the column percentages sum to 100% at the bottom.

Gender             Chocolate   Strawberry   Vanilla    Total
Female   Count        37           17          12        66
         Row %       56.1%        25.8%       18.2%     100%
         Col %       63.8%        48.6%       27.3%     48.2%
Male     Count        21           18          32        71
         Row %       29.6%        25.4%       45.1%     100%
         Col %       36.2%        51.4%       72.7%     51.8%
Total    Count        58           35          44       137
         Row %       42.3%        25.5%       32.1%     100%
         Col %       100%         100%        100%      100%

Whether you focus on row percentages or column percentages in a contingency table depends on the question you are answering. In our case, we want to know whether flavor preference depends on gender. Because the two genders display in separate rows, we'll look for differences in the row percentages. 56% of females prefer chocolate versus only 29.6% of males. Conversely, 45% of males prefer vanilla, while only 18.2% of females prefer it. These results reconfirm our previous findings using the raw counts.

9. How to Graph a Contingency Table?

You can use bar charts to display a contingency table. The following clustered bar chart shows the row percentages for the previous two-way table. The graph clusters the female and male pairs of bars together for each flavor, making comparisons easier.

[Figure: Flavor Preferences by Gender (clustered bar chart)]

The graph supports the conclusions from the contingency table. Women in this sample prefer chocolate, men prefer vanilla, and both genders have an equal preference for strawberry.
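The row and column percentages quoted above can be reproduced with a few lines of plain Python. This is only a sketch: the function names are illustrative, and the vanilla count for females (12) is taken as the remainder of the row total of 66, which matches the 18.2% figure in the text.

```python
# Ice cream preference counts (rows: gender, columns: flavor).
counts = {
    "Female": {"Chocolate": 37, "Strawberry": 17, "Vanilla": 12},
    "Male": {"Chocolate": 21, "Strawberry": 18, "Vanilla": 32},
}


def row_percentages(t):
    """Each cell divided by its row total, times 100."""
    out = {}
    for row, cells in t.items():
        total = sum(cells.values())
        out[row] = {col: 100 * n / total for col, n in cells.items()}
    return out


def column_percentages(t):
    """Each cell divided by its column total, times 100."""
    col_totals = {}
    for cells in t.values():
        for col, n in cells.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {
        row: {col: 100 * n / col_totals[col] for col, n in cells.items()}
        for row, cells in t.items()
    }


row_pct = row_percentages(counts)
col_pct = column_percentages(counts)

for gender in counts:
    rounded = {flavor: round(p, 1) for flavor, p in row_pct[gender].items()}
    print(gender, "row %:", rounded)
```

Row percentages answer "given the gender, which flavor?", while column percentages answer "given the flavor, which gender?", so the choice follows from the question being asked.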
There are further ways to analyze a contingency table. Here are two more that take it to another level.

Contingency tables are a fantastic way to display and find various types of probabilities. Use contingency tables to calculate joint, marginal, and conditional probabilities.

So far, we looked for a relationship between gender and ice cream preference by noting the differences between counts and row percentages in the contingency table. If we are using this sample to draw inferences about the entire population of ice cream consumers, we need a hypothesis test to evaluate the relationship. In other words, are the differences we noticed in the sample large enough to support the notion that a relationship exists in the population? Or can we chalk up the differences to random sampling error? The chi-square test of independence answers this question by analyzing contingency tables.

11. Chi-Square Test - Analysis of Contingency Tables

What is a Chi-Square Test?

The Chi-Square test is a statistical procedure for determining the difference between observed and expected data. This test can also be used to determine whether there is an association between the categorical variables in our data. It helps to find out whether a difference between two categorical variables is due to chance or to a relationship between them.

The original chi-square test, often known as Pearson's chi-square, dates from papers by Karl Pearson in the early 1900s. The test serves both as a "goodness-of-fit" test, where the data are categorized along one dimension, and as a test for the more common "contingency table", in which categorization is across two or more dimensions.

Formula for Chi-Square Test

$$\chi^2_c = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where
c = degrees of freedom
O = observed value(s)
E = expected value(s)

The test is frequently used to compare observed data with the data that would be expected if a particular hypothesis were true. The chi-square test (symbolically represented as χ²) is fundamentally a data analysis based on observations of a random set of variables.
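To make the formula concrete, here is a minimal sketch that applies it to the ice cream table from section 8. It assumes the standard expected count under independence, E = (row total × column total) / grand total, and the standard degrees of freedom for a two-way table, (rows - 1) × (columns - 1); neither is spelled out in the notes above. The function name is illustrative, and 5.991 is the usual tabulated 5% critical value for 2 degrees of freedom.

```python
def chi_square_independence(observed):
    """Return (chi-square statistic, degrees of freedom, expected counts)
    for a two-way table given as a list of rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # Sum of (O - E)^2 / E over every cell.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Rows: Female, Male. Columns: Chocolate, Strawberry, Vanilla.
observed = [[37, 17, 12], [21, 18, 32]]
statistic, dof, expected = chi_square_independence(observed)

CRITICAL_5PCT_DF2 = 5.991  # tabulated chi-square critical value, alpha = 0.05, df = 2
reject_null = statistic > CRITICAL_5PCT_DF2

print(f"chi-square = {statistic:.2f}, df = {dof}, reject H0 at 5%: {reject_null}")
```

Comparing the statistic with a critical value is equivalent to comparing the P-value with the chosen significance level; a statistics library would normally report the P-value directly.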
The chi-square test computes how well a model matches the actual observed data. A chi-square statistic is calculated from data that must be raw, random, drawn from independent variables, drawn from a wide-ranging sample, and mutually exclusive. In simple terms, two sets of statistical data are compared, for instance the results of tossing a fair coin. Karl Pearson introduced this test in 1900 for categorical data analysis and distribution. This test is also known as 'Pearson's Chi-Squared Test'.

Chi-squared tests are most commonly used in hypothesis testing. A hypothesis is an assumption that a given condition might be true, which can be tested afterwards. The chi-square test estimates the size of the inconsistency between the expected results and the actual results, given the size of the sample and the number of variables in the relationship.

These tests use degrees of freedom to determine whether a particular null hypothesis can be rejected based on the total number of observations made in the experiments. The larger the sample size, the more reliable the result.

There are two main types of Chi-Square tests, namely:

1. Independence
2. Goodness-of-Fit

Independence

The Chi-Square Test of Independence is a derivable (also known as inferential) statistical test which examines whether two sets of variables are likely to be related to each other or not. This test is used when we have counts of values for two nominal or categorical variables, and it is considered a non-parametric test. A relatively large sample size and independence of observations are the required criteria for conducting this test.

For example: in a movie theatre, suppose we made a list of movie genres. Let us consider this as the first variable. The second variable is whether or not the people who came to watch those genres of movies bought snacks at the theatre. Here the null hypothesis is that the genre of the film and whether people bought snacks are unrelated. If this is true, the movie genres don't impact snack sales.

Goodness-of-Fit

In statistical hypothesis testing, the Chi-Square Goodness-of-Fit test determines whether a variable is likely to come from a given distribution or not. We must have a set of data values and an idea of the distribution of this data.

We can use this test when we have value counts for categorical variables. The test gives a way of deciding whether the data values are a "good enough" fit for our idea, or whether the sample is representative of the entire population.

The chi-square test gives a P-value to help you judge whether there is an association. A hypothesis is under consideration: that a given condition or statement might be true, which we can test later. For example:

• A very small chi-square test statistic indicates that the collected data matches the expected data extremely well.
• A very large chi-square test statistic indicates that the data does not match very well. If the chi-square value is large, the null hypothesis is rejected.

The probability associated with the chi-square test statistic is called the P-value, which is short for probability value. It is the probability, assuming the null hypothesis is true, of getting a result that is the same as or more extreme than the actual observations. The P-value is used as an alternative to the rejection point: it gives the smallest significance level at which the null hypothesis would be rejected. The smaller the P-value, the stronger the evidence in favor of the alternate hypothesis, given the observed and expected frequencies.

P-value               Description
P-value ≤ 0.05        It indicates the null hypothesis is very unlikely.
P-value > 0.05        The null hypothesis is not rejected.
P-value near 0.05     The hypothesis needs more attention.

Solution: Setting up the following table:

Observed (O)   Expected (E)   |O - E|   (O - E)^2   (O - E)^2 / E
31             30.96          0.04      0.0016      0.0000517
2              20.64          18.64     347.45      16.83
5              15.36          10.36     107.33      6.99

We examine the distribution of local unemployment rates within each region. A graphical device, the boxplot, will be presented which facilitates comparisons between distributions, and the idea of an unusual data value will be given more systematic treatment.

Unemployment: Its Measurement and Types

The unemployment rate is the most commonly used indicator for understanding conditions in the labour market.
The labour market is the term used by economists when talking about the supply of labour (from households) and the demand for labour (by businesses and other organisations). The unemployment rate can also provide insights into how the economy is performing more generally, making it an important factor in thinking about monetary policy.

This section outlines two key topics related to unemployment:

1. How is the unemployment rate measured?
2. What are the main types of unemployment?

How is the unemployment rate measured?

Unemployment occurs when someone is willing and able to work but does not have a paid job. The unemployment rate is the percentage of people in the labour force who are unemployed. Consequently, measuring the unemployment rate requires identifying who is in the labour force. The labour force includes people who are either employed or unemployed. Figuring out who is employed or unemployed involves making practical judgements, such as how much paid work someone needs to undertake to be considered as having a job, as well as actually counting how many people have jobs or not.

Groups in the Labour Market

Three broad categories:

• Employed: includes people who are in a paid job for one hour or more in a week.
• Unemployed: includes people who are not in a paid job, but who are actively looking for work.
• Not in the labour force: includes people not in a paid job, and who are not looking for work. This can include people who are studying, caring for children or family members on a voluntary basis, retired, or who are permanently unable to work.
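Counting these three groups is what makes the headline indicators computable: the labour force is employed plus unemployed, the unemployment rate is the unemployed as a percentage of the labour force, and the participation rate is the labour force as a percentage of the working-age population. A minimal sketch follows; the function names are illustrative, and the inputs are the 12.6 million employed and 0.7 million unemployed used in this unit's worked example.

```python
def labour_force(employed, unemployed):
    """Labour force = employed + unemployed (people outside it are excluded)."""
    return employed + unemployed


def unemployment_rate(employed, unemployed):
    """Unemployed as a percentage of the labour force."""
    return 100 * unemployed / labour_force(employed, unemployed)


def participation_rate(employed, unemployed, working_age_population):
    """Labour force as a percentage of the working-age population."""
    return 100 * labour_force(employed, unemployed) / working_age_population


# Figures in millions of people, taken from the worked example in this unit.
employed_m = 12.6
unemployed_m = 0.7

force_m = labour_force(employed_m, unemployed_m)
rate = unemployment_rate(employed_m, unemployed_m)

print(f"labour force: {force_m:.1f} million")
print(f"unemployment rate: {rate:.1f}%")
```

Because the labour force is the denominator, the rate can change even when the number of unemployed people does not, for example when people stop looking for work and leave the labour force.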
Once the number of people in each of these categories has been estimated, the following labour market indicators can be calculated:

• Labour force: the sum of employed and unemployed people.
• Unemployment rate: the percentage of people in the labour force who are unemployed.
• Participation rate: the percentage of people in the working-age population who are in the labour force.

Calculating the Unemployment Rate: An Example

To understand how the unemployment rate is calculated, we can use an example. In this example 12.6 million people are employed and 0.7 million people are unemployed. The size of the labour force is calculated as the sum of these two groups: 12.6 + 0.7 = 13.3 million people. With the unemployment rate calculated as the percentage of people in the labour force who are unemployed, the equation below gives:

$$\text{Unemployment rate} = \frac{\text{Unemployed}}{\text{Labour force}} \times 100 = \frac{0.7}{13.3} \times 100 \approx 5.3\%$$

As the equation shows, the unemployment rate is affected by changes in the number of unemployed people (the numerator), which can result from cyclical factors, such as the number of people who become unemployed because of an economic downturn, or from more structural factors in the economy (see 'What are the main types of unemployment?' below). The unemployment rate is also affected by changes in the size of the labour force (the denominator).

14. What is a Boxplot?

The boxplot is a device for conveying the information in the five-number summary (minimum, Q1 (first quartile), median, Q3 (third quartile), and maximum) economically and effectively.

Definition

A box and whisker plot is a method of summarizing a set of data measured on an interval scale. It is widely used in data analysis. We use this type of graphical representation to show:

• Distribution shape
• Central value
• Variability

A box plot is a chart that shows data from a five-number summary, including one of the measures of central tendency (the median).
A box plot does not show the distribution in as much detail as a stem and leaf plot or histogram does. It is primarily used to indicate whether a distribution is skewed and whether there are potential unusual observations (also called outliers) in the data set. Boxplots are also very useful when large numbers of data sets are involved or compared.

In simple words, a box plot can be defined in terms of descriptive statistics: a box and whisker plot is a method for depicting groups of numerical data graphically through their quartiles. Lines extending from the boxes (the whiskers) indicate the variability outside the lower and upper quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers can be indicated as individual points.

A box plot helps show how much the data values vary or spread out. Because we often need more information than the measures of central tendency alone, this is where the box plot helps. It also takes less space. Since the centre, spread and overall range are immediately apparent, distributions can be compared easily using boxplots.

Parts of Box Plots

The figure shows the minimum, maximum, first quartile, third quartile, median and outliers.

[Figure: parts of a box plot]

Outliers are values greater than Q3 + (1.5 × IQR) or less than Q1 - (1.5 × IQR).

Boxplot Distribution

The box plot distribution shows how tightly the data is grouped, how the data is skewed, and whether the data is symmetric.

Positively Skewed: If the distance from the median to the maximum is greater than the distance from the median to the minimum, then the box plot is positively skewed.

Negatively Skewed: If the distance from the median to the minimum is greater than the distance from the median to the maximum, then the box plot is negatively skewed.

Symmetric: The box plot is said to be symmetric if the median is equidistant from the maximum and minimum values.

Example: Find the maximum, minimum, median, first quartile and third quartile for the data set 23, 42, 12, 10, 15, 14, 9.

Solution:
Given: 23, 42, 12, 10, 15, 14, 9
Arrange the given dataset in ascending order: 9, 10, 12, 14, 15, 23, 42
Minimum = 9
Maximum = 42
Median = 14
First quartile = 10 (middle value of 9, 10, 12 is 10)
Third quartile = 23 (middle value of 15, 23, 42 is 23)

15. What are Outliers?

In simple terms, an outlier is an extremely high or extremely low data point relative to the nearest data point and the rest of the neighboring values in the graph or dataset you're working with. Outliers are extreme values that stand out greatly from the overall pattern of values in a dataset or graph.

[Figure: graph with an outlier on the far left]

16. How to Identify an Outlier in a Dataset

An outlier has to satisfy either of the following two conditions:

outlier < Q1 - 1.5(IQR)
outlier > Q3 + 1.5(IQR)

The rule for a low outlier is that a data point in a dataset has to be less than Q1 - 1.5×IQR. This means that a data point needs to fall more than 1.5 times the interquartile range below the first quartile to be considered a low outlier.

The rule for a high outlier is that if any data point in a dataset is more than Q3 + 1.5×IQR, it is a high outlier. More specifically, the data point needs to fall more than 1.5 times the interquartile range above the third quartile.

As you can see, there are certain individual values you need to calculate first in a dataset, such as the IQR. But to find the IQR, you first need to find the so-called first and third quartiles, Q1 and Q3.

17. How to Find the Upper and Lower Quartiles in an Odd Dataset

To get started, let's say that you have this dataset:

25, 14, 6, 5, 5, 30, 11, 11, 13, 4, 2

The first step is to sort the values in ascending numerical order, from smallest to largest number:

2, 4, 5, 5, 6, 11, 11, 13, 14, 25, 30

The lowest value (MIN) is 2 and the highest (MAX) is 30.

18. How to calculate Q2 in an odd dataset

The next step is to find the median or quartile 2 (Q2). This particular set of data has an odd number of values, with a total of 11. To find the median in a dataset means that you're finding the middle value, the single middle number in the set.
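The rule from section 16 can be sketched in code, using the eleven values from section 17. This is only an illustration: the function names are hypothetical, and quartiles are computed by the median-of-halves method used in these notes (the middle value is left out of both halves when the count is odd). Statistical software often uses other quartile conventions, so its Q1 and Q3 can differ slightly.

```python
def median(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    lower_half = data[: n // 2]          # values below the middle
    upper_half = data[(n + 1) // 2 :]    # values above the middle
    return median(lower_half), median(data), median(upper_half)


def iqr_outliers(values):
    """Return ((low fence, high fence), outliers) using the 1.5 * IQR rule."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return (low_fence, high_fence), outliers


odd_data = [25, 14, 6, 5, 5, 30, 11, 11, 13, 4, 2]
odd_quartiles = quartiles(odd_data)
odd_fences, odd_outliers = iqr_outliers(odd_data)

print("Q1, Q2, Q3:", odd_quartiles)
print("fences:", odd_fences)
print("outliers:", odd_outliers)
```

The same functions handle a dataset with an even number of values, where each median is the average of the two middle numbers.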
In odd datasets, there is only one middle number. Since there are 11 values in total, an easy way to find it is to split the set into two equal parts, with each side containing 5 values, and take the value left in the middle:

(2, 4, 5, 5, 6), 11, (11, 13, 14, 25, 30)

So Q2 = 11.

19. How to calculate Q1 in an odd dataset

Next, to find the lower quartile, Q1, we need to find the median of the first half of the dataset, which is on the left hand side. As a reminder, the first half is 2, 4, 5, 5, 6. It has 5 values, so its median is the number in the 3rd place:

(2, 4), 5, (5, 6)

So Q1 = 5.

20. How to calculate Q3 in an odd dataset

To find the upper quartile, Q3, take the second half of the dataset. You again want the number in the 3rd place, like you did for the first half:

(11, 13), 14, (25, 30)

So Q3 = 14.

21. How to calculate IQR in an odd dataset

Now, the next step is to calculate the IQR, which stands for interquartile range. This is the difference/distance between the lower quartile (Q1) and the upper quartile (Q3) you calculated above. As a reminder, the formula is the following:

IQR = Q3 - Q1

To find the IQR of the dataset from above:

IQR = 14 - 5
IQR = 9

22. How to find an outlier in an odd dataset

The last step is to find out if there are any outliers in the dataset. As a reminder, an outlier must fulfil one of the following criteria:

outlier < Q1 - 1.5(IQR)
outlier > Q3 + 1.5(IQR)

To see if there is a lowest value outlier, you need to calculate the first part and see if there is a number in the set that satisfies the condition:

Outlier < Q1 - 1.5(IQR)
Outlier < 5 - 1.5(9)
Outlier < 5 - 13.5
Outlier < -8.5

There are no lower outliers, since there isn't a number less than -8.5 in the dataset.

Next, to see if there are any higher outliers:

Outlier > Q3 + 1.5(IQR)
Outlier > 14 + 1.5(9)
Outlier > 14 + 13.5
Outlier > 27.5

And there is a number in the dataset that is more than 27.5:

2, 4, 5, 5, 6, 11, 11, 13, 14, 25, 30

In this case, 30 is the outlier in the dataset.

23. How to calculate Q2 in an even dataset

Say that you have this dataset with 8 numbers:

10, 15, 20, 26, 28, 30, 35, 40

This time, the numbers are already sorted from lowest to highest value. To find the median of an even dataset, take the two numbers in the middle, 26 and 28, add them and divide by two:

Q2 = (26 + 28) / 2
Q2 = 27

Q1 in the even dataset is the median of the first half, (10, 15 | 20, 26). The two numbers in the middle are 15 and 20:

Q1 = (15 + 20) / 2
Q1 = 17.5

Q3 in the even dataset is the median of the second half, (28, 30 | 35, 40). The two numbers in the middle are 30 and 35. You add them and divide them by two, and the result is:

Q3 = (30 + 35) / 2
Q3 = 65 / 2
Q3 = 32.5

27. How to calculate the IQR in an even dataset

The formula for calculating IQR is exactly the same as the one we used to calculate it for the odd dataset:

IQR = Q3 - Q1
IQR = 32.5 - 17.5
IQR = 15

28. How to find an outlier in an even dataset

As a recap, so far the five number summary is the following:

MIN = 10
Q1 = 17.5
MED = 27
Q3 = 32.5
MAX = 40

To find any outliers in the dataset, apply the same two conditions:

outlier < Q1 - 1.5(IQR)
outlier > Q3 + 1.5(IQR)

To find any lower outliers, you calculate Q1 - 1.5(IQR) and see if there are any values less than the result:

outlier < 17.5 - 1.5(15)
outlier < 17.5 - 22.5
outlier < -5

There aren't any values in the dataset that are less than -5.

Finally, to find any higher outliers, you calculate Q3 + 1.5(IQR) and see if there are any values in the dataset that are higher than the result:

outlier > 32.5 + 1.5(15)
outlier > 32.5 + 22.5
outlier > 55

There aren't any values higher than 55, so this dataset doesn't have any outliers.

29. What is a Multiple Boxplot?

Multiple boxplots are quite useful charts when it comes to visualizing several groups or categories, their median and variability, all at once.

[Figure: multiple boxplot chart]

30. What is the T test?

A t-test is a statistical test that is used to compare the means of two groups. It is often used in hypothesis testing to determine whether a process or treatment actually has an effect on the population of interest, or whether two groups are different from one another.

31. When to use a t-test?

A t-test can only be used when comparing the means of two groups (a.k.a. pairwise comparison). If you want to compare more than two groups, or if you want to do multiple pairwise comparisons, use an ANOVA test or a post-hoc test.

32. Explain the correlation coefficient with example.

Here we concentrate on the type of correlation coefficient that describes the relationship between pairs of variables for quantitative data. A correlation coefficient is a number that describes the relationship between pairs of variables, and it is designated as r.
Pearson Correlation Coefficient (r): A number between -1.00 and +1.00 that describes the linear relationship between pairs of quantitative variables. It has the following properties.

Sign of r: A number with a plus sign (or no sign) indicates a positive relationship, and a number with a minus sign indicates a negative relationship.

Numerical Value of r: The more closely a value of r approaches either -1.00 or +1.00, the stronger (more regular) the relationship. Conversely, the more closely the value of r approaches 0, the weaker (less regular) the relationship.

Interpretation of r: Located along a scale from -1.00 to +1.00, the value of r supplies information about the direction of a linear relationship (whether positive or negative) and, generally, information about the relative strength of a linear relationship: whether relatively weak (and a poor describer of the data) because r is in the vicinity of 0, or relatively strong (and a good describer of the data) because r deviates from 0 in the direction of either +1.00 or -1.00.

r Is Independent of Units of Measurement: The value of r is independent of the original units of measurement. In fact, the same value of r describes the correlation between height and weight for a group of adults, regardless of whether height is measured in inches or centimeters or whether weight is measured in pounds or another unit.

Verbal Descriptions: When interpreting a brand new r, you'll find it helpful to translate the numerical value of r into a verbal description of the relationship.
An r of .70 for the height and weight of college students could be translated into "Taller students tend to weigh more" (or some other equally valid statement, such as "Lighter students tend to be shorter"); an r of -.42 for time spent taking an exam and the subsequent exam score could be translated into "Students who take less time tend to make higher scores"; and an r in the neighborhood of 0 for shoe size and IQ could be translated into "Little, if any, relationship exists between shoe size and IQ."

Correlation Not Necessarily Cause-Effect

A correlation coefficient, regardless of size, never provides information about whether an observed relationship reflects a simple cause-effect relationship or some more complex state of affairs. Given a correlation between poverty and crime in cities, you can speculate that poverty causes crime with some degree of inevitability. According to this view, any widespread reduction in poverty should reduce crime. Alternatively, you can speculate that some common cause, or some combination of causes, produces both poverty and crime. According to this view, a widespread reduction in poverty should have no effect on crime. Which speculation is correct? This issue cannot be resolved merely on the basis of an observed correlation.

[FIGURE 6.5: Effect of range restriction on the value of r. Axis label: Height (Y)]

COMPUTATION FORMULA FOR r

Calculate a value for r by using the following computation formula:

CORRELATION COEFFICIENT (COMPUTATION FORMULA)

$$ r = \frac{SP_{xy}}{\sqrt{SS_x \, SS_y}} $$

where the two sum of squares terms in the denominator are defined as

$$ SS_x = \sum (X - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n} $$

$$ SS_y = \sum (Y - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n} $$

and the sum of products term in the numerator is defined as

$$ SP_{xy} = \sum (X - \bar{X})(Y - \bar{Y}) = \sum XY - \frac{(\sum X)(\sum Y)}{n} $$

The sum of products reflects the strength of the relationship: stronger relationships correspond to larger positive or negative sums of products. Table 6.2 illustrates the calculation of r for the original greeting card data by using the computation formula.

Table 6.3 CALCULATION OF r: COMPUTATION FORMULA

Assign a value to n (1), representing the number of pairs of scores.
Sum all scores for X (2) and for Y (3).
Find the product of each pair of X and Y scores (4), one at a time, then add all of these products (5).
Square each X score (6), one at a time, then add all squared X scores (7).
Square each Y score (8), one at a time, then add all squared Y scores (9).
Substitute numbers into formulas (10) and solve for SP_xy, SS_x, and SS_y.
Substitute into formula (11) and solve for r.

[Table: A. DATA AND COMPUTATIONS. Columns include FRIEND, CARDS SENT (X), CARDS RECEIVED (Y) and XY.]

34. Explain the Scatterplots and Resistant Line.

Scatter plots are graphs that present the relationship between two variables in a data set. A scatter plot represents data points on a two-dimensional plane or on a Cartesian system. The independent variable or attribute is plotted on the X-axis, while the dependent variable is plotted on the Y-axis. These plots are often called scatter graphs or scatter diagrams.

Scatter plots instantly report a large volume of data. They are beneficial in the following situations:

- A large set of data points is given
- Each data point comprises a pair of values
- The given data is in numeric form

The line drawn in a scatter plot that is near to almost all the points in the plot is known as the "line of best fit" or "trend line".

[Figure: scatter plot of X and Y values]

Scatter plot Correlation

Correlation is a statistical measure of the relationship between the two variables' relative movements. If the variables are correlated, the points will fall along a line or curve. The better the correlation, the closer the points will touch the line. This cause analysis tool is classified as one of the seven essential quality tools.

The scatter plot explains the correlation between two attributes or variables. It represents how closely the two variables are connected. There can be three such situations to see the relation between the two variables:

1. Positive Correlation
2. Negative Correlation
3. No Correlation

Positive Correlation

When the points in the graph are rising, moving from left to right, then the scatter plot shows a positive correlation. It means the values of one variable are increasing with respect to another.
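The sign of r gives a numerical check on the direction seen in a scatter plot. The sketch below follows the computation steps for r listed earlier, using the standard computational form r = SP_xy / sqrt(SS_x * SS_y). The function name `pearson_r` and the data are hypothetical, chosen only so that the arithmetic is easy to follow by hand; they are not the greeting card data.

```python
import math


def pearson_r(xs, ys):
    """Pearson r via the computation formula r = SP_xy / sqrt(SS_x * SS_y)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least two pairs")
    n = len(xs)                                  # number of pairs of scores
    sum_x, sum_y = sum(xs), sum(ys)              # sum of X and sum of Y
    sum_xy = sum(x * y for x, y in zip(xs, ys))  # sum of the pairwise products
    sum_x2 = sum(x * x for x in xs)              # sum of squared X scores
    sum_y2 = sum(y * y for y in ys)              # sum of squared Y scores
    sp_xy = sum_xy - sum_x * sum_y / n           # sum of products
    ss_x = sum_x2 - sum_x ** 2 / n               # sum of squares for X
    ss_y = sum_y2 - sum_y ** 2 / n               # sum of squares for Y
    return sp_xy / math.sqrt(ss_x * ss_y)


# Hypothetical paired scores: SP_xy = 6, SS_x = 10, SS_y = 6.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 4))  # 0.7746
print("positive" if r > 0 else "negative" if r < 0 else "none")  # positive
```

A positive result matches points that rise from left to right; a negative result matches points that fall. The function divides by zero if either variable is constant, so it assumes both X and Y vary.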
Now positive correlation can further be classified into three categories:

- Perfect Positive: the points form a perfectly straight line
- High Positive: all points are close to the line
- Low Positive: the points are widely scattered

[Figure: perfect positive correlation and high positive correlation]

Negative Correlation

When the points in the graph are falling, moving from left to right, then the scatter plot shows a negative correlation. It means the values of one variable are decreasing with respect to another. These are also classified by strength, for example high negative and low negative correlation.

[Figure: high negative correlation and low negative correlation]

No Correlation

When the points are scattered all over the graph and it is difficult to conclude whether the values are increasing or decreasing, then there is no correlation between the variables.

Scatter plot Example

Let us understand how to construct a scatter plot with the help of the below example.

Example: Draw a scatter plot for the given data that shows the number of games played and scores obtained in each instance.

[Table: No. of games: 3, 5, 2, 6, 7, 1 (remaining values illegible); Scores: (illegible)]

Solution:

X-axis: Number of games
Y-axis: Scores

The graph will be:

[Figure: scatter plot of scores against number of games]

Resistant Line

In exploratory data analysis, a resistant line usually means a line fitted to a scatterplot through medians of groups of points, so that outliers have little influence on it. The description that follows covers the similarly named resistance line used in stock charts.

A resistance line, sometimes also known as a speed line, helps identify stock trends and levels of support and resistance. Resistance lines are technical indication tools used by equity analysts and investors to determine the price trend of a specific stock. They are very useful in predicting the probable movement of stock prices and helping people invest in the right stock.

Resistance lines are usually drawn on a high-to-low basis. They help estimate resistance and support levels, making them a very useful tool in trading. A resistance line in an uptrend movement marks the support area, and in a downtrend movement it marks the resistance area.

The three lines in the graph below indicate a downtrend movement, and reading them in a stock chart to make predictions will help lead to a sound investment decision.

[Figure: stock chart with three resistance lines]

35. What is Transformation?

Data transformation refers to the application of a function to each item in a data set. Here x
is replaced by its transformed value y, where y = f(x). Transformations are generally carried out to make the appearance of graphs more interpretable. There are four major functions used for transformations.

log x - Logarithm transformations. Log transformation is a data transformation method in which each value x is replaced with log(x). For example, sound units are in decibels and are generally represented using log transformations.

1/x - Reciprocal transformations. A transformation of raw data that involves (a) replacing the original data units with their reciprocals and (b) analyzing the modified data. It can be used with nonzero data and is commonly used when distributions have skewness or clear outliers. Unlike other transformations, a reciprocal transformation changes the order of the original data. It is also called inverse transformation. For example, a 'time to complete a race' task can be analyzed using speed: the greater the speed, the less time taken.

√x - Square root transformations. This consists of taking the square root of each observation. The back transformation is to square the number. If you have negative numbers, you can't take the square root; you should add a constant to each number to make them all positive. For example, areas of circular grounds are compared using their radius.

x² - Power transformations. Power transforms are a family of transformations using power laws. The idea is to apply a transformation to each feature. What's the purpose of a power transform? The idea is to increase the symmetry of the distribution of the features. If a feature is asymmetric, applying a power transformation will make it more symmetric.

Logarithm and square root transformations are used in the case of positive numbers, whereas reciprocal and power transformations can be used in the case of both negative and positive numbers. The following diagrams illustrate the transformations graphically.

[Figure: data before and after transformation]
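The four transformations described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the `shift` parameter for the square root and the sample values are assumptions made for the example, and base-10 logarithms are used purely for readability.

```python
import math


def log_transform(values):
    """Replace each positive value x with log10(x)."""
    return [math.log10(x) for x in values]


def reciprocal_transform(values):
    """Replace each nonzero value x with 1/x."""
    return [1.0 / x for x in values]


def sqrt_transform(values, shift=0.0):
    """Replace each value x with sqrt(x + shift).

    A positive shift is the constant added when the data contain
    negative numbers, so that every shifted value is non-negative.
    """
    return [math.sqrt(x + shift) for x in values]


def power_transform(values, power=2):
    """Replace each value x with x raised to the given power."""
    return [x ** power for x in values]


print(log_transform([1, 10, 100]))           # [0.0, 1.0, 2.0]
print(reciprocal_transform([2.0, 4.0, 8.0])) # [0.5, 0.25, 0.125]
print(sqrt_transform([-4, 0, 5], shift=4))   # [0.0, 2.0, 3.0]
print(power_transform([1, 2, 3]))            # [1, 4, 9]
```

The reciprocal output shows the change of order mentioned above: the inputs 2, 4, 8 are ascending, while their reciprocals 0.5, 0.25, 0.125 are descending.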
