Unit 4 covers the processing and analysis of data, detailing the steps involved in data processing, including collection, preparation, input, output, and storage. It discusses various statistical analysis techniques such as univariate, bivariate, and multivariate analysis, as well as measures of central tendency and dispersion. Additionally, it introduces different statistical tools used in research, including regression analysis and both parametric and non-parametric tests.
Unit 4
Processing and Analysis of data
Contents
• Meaning, importance and steps involved in processing of data
• Statistical tools and techniques for analysis of data
• Basic data analysis: frequency distribution
• Analysis and interpretation of data: interpretation of results
• Diagrammatic and graphic representation
• Concept of univariate, bivariate and multivariate analysis

Meaning, importance and steps involved in processing of data

Data processing
Data processing in research is the collection and translation of a data set into valuable, usable information. Through this process, a researcher, data engineer or data scientist takes raw data and converts it into a more readable format, such as a graph, report or chart, either manually or with an automated tool.

Collection: The first and most important step of data processing. Data should be gathered from verified and trustworthy sources.

Preparation: Once the data is collected, it enters the data preparation stage. Data preparation, often referred to as "pre-processing", is the stage at which raw data is cleaned up and organized for the following stage of data processing. During preparation, raw data is carefully checked for errors and missing entries. The purpose of this step is to eliminate bad data and begin to create high-quality data for the best business intelligence. Coding of data is also done at this stage.

Data input: In this step, the data is converted into machine-readable form and fed into the processing unit. This can be done by data entry through a keyboard, a scanner or any other input source.

Information output: The output/interpretation stage is the stage at which data finally becomes usable by people who are not data scientists. Decoding of data is done at this stage. The data is translated into a readable form, often graphs, plain text, etc. Members of the company or institution can now begin to self-serve the data for their own data analytics projects.

Data storage:
After all of the data is processed, it is stored for future use. While some information may be put to use immediately, much of it will serve a purpose later on. When data is properly stored, it can be quickly and easily accessed by members of the organization when needed.

Analysis of different types of data

• Univariate analysis of data: Univariate analysis is a statistical technique that involves the analysis of one variable at a time. It is used to describe the data and to find patterns that exist within the data set. It is also used to identify outliers and to test for normality. Because it looks at a single variable, univariate analysis does not by itself describe relationships between variables; assessing the impact of one variable on a dependent variable requires bivariate or multivariate analysis.

• Descriptive analysis of univariate data is a method of summarizing and describing the data using techniques such as frequency distributions, measures of central tendency (mean, median, and mode), measures of dispersion (range, variance, and standard deviation), and graphical techniques (histograms and box plots). This type of analysis is used to identify patterns and trends in the data. It is also used to describe the characteristics of the data, such as the shape of the distribution, the spread of the data, and outliers. Descriptive analysis of univariate data is an important step in the data analysis process because it provides a basis for further analysis.

For example, if we were analyzing the heights of students in a class, we would use univariate analysis to look at the distribution of heights. We could look at the mean, median, mode, and range of the data. We could also look at the frequency of different heights, and any outliers that exist.
This would give us an overall picture of the data.

• Bivariate analysis of data: Bivariate analysis is a statistical method used to analyze the relationship between two variables. It is used to determine whether there is a correlation between the two variables and to measure the strength of that correlation. It can also be used to identify outliers or trends in the data. Typical uses are relating two variables, such as age and income, or comparing two groups, such as men and women. Relationships among three or more variables, such as age, income, and education, call for multivariate analysis.

Example: You have a dataset that includes information about people's income and education level. You can use bivariate analysis to explore the relationship between these two variables.
• For example, you might use a scatter plot to visualize the data. The scatter plot would show the relationship between income and education level. You might find that people with higher levels of education tend to have higher incomes. This would suggest a positive correlation between education level and income.
• You could also use bivariate analysis to look at the relationship between income and other variables, such as age, gender, or location, taking one pair of variables at a time. This could help you understand how these other factors may influence income.
• Bivariate analysis can be used to explore relationships between any two variables. It is a powerful tool for understanding data and can help you uncover insights that may not be obvious from looking at the raw data alone.

• Multivariate analysis of data: Multivariate analysis involves the examination of more than two variables at a time. It is used to analyze the relationships between multiple variables and to explore the underlying structure of the data. It can be used to identify patterns and trends in the data and to make predictions.
It can be used to identify the most important factors that influence a particular outcome, or to identify clusters of similar individuals or objects. Multivariate analysis can also be used to test hypotheses and to determine the strength of the relationships between variables.

For example, a researcher may want to examine the relationship between age, gender, and income. To do this, they would collect data on these three variables from a sample of individuals. They could then use multivariate analysis to look for patterns in the data. They might find that older individuals tend to have higher incomes, or that men tend to earn more than women. These patterns can then be used to make predictions about the population as a whole.

Measurement: Measurement is the quantification of attributes of an object or event, which can be used to compare it with other objects or events. In other words, measurement is the process of determining how large or small a quantity is compared with a basic reference quantity of the same kind.

Scale: A scale is a device or an object used to measure or quantify an event or another object.

Levels of Measurements: There are four scales of measurement, and any data can be classified under one of them. The four types of scales are:
1-Nominal Scale
2-Ordinal Scale
3-Interval Scale
4-Ratio Scale

Type of measurement    Type of descriptive analysis
Nominal                Frequency table, proportions, percentages, mode
Ordinal                Median, quartiles, percentiles, rank-order correlation
Interval               Arithmetic mean, correlation coefficient
Ratio                  Index numbers, geometric mean, harmonic mean
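
As a rough illustration of the table above, the sketch below picks one summary statistic per level of measurement using Python's standard `statistics` module. The four data lists and their variable names are made up for this example and do not come from the unit; `geometric_mean` requires Python 3.8 or later.

```python
import statistics

# Hypothetical data, one list per level of measurement (illustrative only).
blood_group = ["A", "O", "B", "O", "AB", "O", "A"]   # nominal: labels only
satisfaction_rank = [1, 2, 2, 3, 4, 4, 5]            # ordinal: ordered ranks
temperature_c = [18.0, 21.0, 19.5, 22.5, 24.0]       # interval: no true zero
growth_factor = [1.0, 4.0, 16.0]                     # ratio: true zero, ratios meaningful

# Nominal data: only counting makes sense, so report the mode.
nominal_summary = statistics.mode(blood_group)

# Ordinal data: order is meaningful, so the median can be used.
ordinal_summary = statistics.median(satisfaction_rank)

# Interval data: differences are meaningful, so the arithmetic mean can be used.
interval_summary = statistics.mean(temperature_c)

# Ratio data: ratios are meaningful, so the geometric mean can be used.
ratio_summary = statistics.geometric_mean(growth_factor)

print(nominal_summary, ordinal_summary, interval_summary, ratio_summary)
```

Each higher level also permits the statistics of the levels below it; for instance, the median and mode remain valid for interval and ratio data.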
STATISTICAL TOOLS USED IN RESEARCH
1-Measures of central tendency
2-Measures of dispersion
3-Measures of relationship
   In case of a bivariate population: a-Cross tabulation, b-Charles Spearman's coefficient of correlation, c-Karl Pearson's coefficient of correlation
   In case of a multivariate population: a-Coefficient of multiple correlation, b-Coefficient of partial correlation
4-Regression analysis
5-Parametric tests
6-Non-parametric tests

Measures of central tendency: These tell us the point about which items have a tendency to cluster. The mean, median and mode are the three commonly used measures of central tendency.

• Mean is the average of the given set of values. It is the value each observation would have if the total were shared equally among all observations. To calculate the mean, add all the values in the data set and divide the sum by the number of values.
Example: What is the mean of 2, 4, 6, 8 and 10?
Solution: First, add all the numbers: 2 + 4 + 6 + 8 + 10 = 30. Now divide by 5 (the total number of observations): Mean = 30/5 = 6

• Median is the middle value of the data when all the values are arranged in ascending order. When the number of values is even, the median is the average of the two middle values.
Example: Find the median of 3, 7, 1, 4, 8, 10, 2.
Arrange the data set in ascending order: 1, 2, 3, 4, 7, 8, 10
Median = middle value = 4

• Mode is the value in the list that occurs the greatest number of times.

Measures of dispersion: Two data sets can have the same mean and still be entirely different. Thus, to describe data, one needs to know the extent of variability. This is given by the measures of dispersion. Range, interquartile range, and standard deviation are the three commonly used measures of dispersion.
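
The sketch below first re-checks the mean and median examples from this section, then illustrates the point that two data sets can share a mean yet differ in spread. The lists `tight` and `wide` and the helper `describe` are hypothetical names chosen for this illustration, and the population standard deviation (`pstdev`) is an assumed choice; a sample standard deviation would use `statistics.stdev` instead.

```python
import statistics

# Worked examples from the text.
mean_example = statistics.mean([2, 4, 6, 8, 10])             # (2+4+6+8+10)/5
median_example = statistics.median([3, 7, 1, 4, 8, 10, 2])   # middle of sorted values

# Hypothetical data: same mean, very different variability.
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]


def describe(values):
    """Return the mean plus two measures of dispersion for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "range": max(values) - min(values),     # highest value minus lowest value
        "stdev": statistics.pstdev(values),     # population standard deviation
    }


tight_summary = describe(tight)
wide_summary = describe(wide)

print("mean example:", mean_example)
print("median example:", median_example)
print("tight:", tight_summary)
print("wide:", wide_summary)
```

Both lists have a mean of 50, but the range is 4 for `tight` and 80 for `wide`, so the mean alone would hide the difference.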
Range: (highest value of an item in a series) - (lowest value)

Mean deviation: Mean deviation is the average of the absolute deviations of the values from the mean of the data set.

Standard deviation: In statistics, standard deviation is a measure of how much a random variable varies from its mean. It is calculated as the square root of the variance. A low standard deviation indicates that the values are close to the mean, while a high standard deviation indicates that the values are spread out over a wider range.

Variance: Variance is the average of the squared deviations from the mean. It measures how far the numbers in the set are spread out from the mean (average), and thus from one another.

Measures of Relationship: For a bivariate or multivariate population, we have to answer two types of questions:
Q1: Does there exist an association or correlation between two or more variables, and if so, of what degree?
Q2: Is there any cause-and-effect relationship between the two variables, or between one variable and the others?
The first question is answered by the technique of correlation and the second by the technique of regression.
In case of a bivariate population: a-Cross tabulation, b-Charles Spearman's coefficient of correlation, c-Karl Pearson's coefficient of correlation
In case of a multivariate population: a-Coefficient of multiple correlation, b-Coefficient of partial correlation

REGRESSION ANALYSIS: Regression analysis is a set of statistical methods used to estimate relationships between a dependent variable and one or more independent variables. It can be used to assess the strength of the relationship between variables and to model the future relationship between them. Simple linear regression is used to model the relationship between two continuous variables. Often, the objective is to predict the value of an output variable (or response) based on the value of an input (or predictor) variable.
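
To make the two questions concrete, here is a minimal sketch that computes Karl Pearson's coefficient of correlation and fits a simple linear regression line by ordinary least squares, written out by hand so the arithmetic is visible. The education and income figures and the function names `pearson_r` and `fit_line` are invented for illustration and are not taken from the unit.

```python
import math


def pearson_r(x, y):
    """Karl Pearson's coefficient of correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical sample: years of education (predictor) and income in thousands (response).
years_of_education = [10, 12, 14, 16, 18]
income_thousands = [25, 30, 38, 45, 52]

r = pearson_r(years_of_education, income_thousands)
slope, intercept = fit_line(years_of_education, income_thousands)
predicted_income_at_20 = intercept + slope * 20

print("r:", round(r, 3))
print("slope:", slope, "intercept:", intercept)
print("predicted income at 20 years:", predicted_income_at_20)
```

Here `r` describes the degree of association (Q1), while the fitted line is what allows a prediction for a new value of the predictor. A strong fit on its own does not establish cause and effect.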
Parametric tests: These are tests that make assumptions about the parameters of the population distribution from which the sample is drawn. Often the assumption is that the population data are normally distributed and that the variances of the groups being compared are equal. Examples:
1-t-test: An inferential statistic used to determine whether there is a significant difference between the means of two groups. t-tests are used when the data follow a normal distribution and the population variances are unknown.
2-ANOVA (Analysis of Variance): A parametric statistical method that analyzes the differences between the means of two or more groups or treatments.

Non-parametric tests: These are "distribution-free" and can be used for non-normal variables. Non-parametric tests do not assume that the data follow a particular distribution, such as the normal distribution. Examples:
1-The Mann-Whitney test is a powerful non-parametric test. It is similar to the t-test in that it is designed to test differences between two groups, but it can be used with ordinal data.
2-The chi-square test (chi2) is used when the data are nominal and computation of a mean is not possible. It compares the observed frequencies in each category with the frequencies expected under the hypothesis being tested in order to evaluate group differences.

Diagrammatic and Graphic representation
Data can be represented diagrammatically in various ways, such as bar graphs, line graphs, pie charts, scatter plots, and histograms. These diagrams help to visualize data in a more meaningful way, making it easier to identify patterns and trends. For example, a bar graph can be used to compare the values of different categories, while a line graph can be used to show the changes in a value over time. A pie chart can be used to show the proportions of different values, and a scatter plot can be used to show the relationship between two variables.
Finally, a histogram can be used to show the distribution of a single variable.

Frequency distribution: A frequency distribution is a way of organizing data into categories or groups and counting the number of observations that fall into each category. It summarizes a set of data by showing how often each value or group of values occurs. Frequency distributions can be used to identify patterns in the data, such as the most common value or values, or the shape of the data.

Example:
Age Group    Frequency
0-10         10
11-20        20
21-30        30
31-40        40
41-50        25
51-60        15
61-70        5
71-80        2
81-90        1
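
A frequency distribution like the age-group example can be built by assigning each observation to a group and counting. The sketch below uses the same group boundaries (0-10, 11-20, 21-30, and so on); the list `ages` and the helper `age_group` are hypothetical and only illustrate the counting step.

```python
from collections import Counter


def age_group(age, width=10):
    """Return the label of the group an age falls into: 0-10, 11-20, 21-30, ..."""
    if age <= width:
        return f"0-{width}"
    lower = ((age - 1) // width) * width + 1   # first age of the group
    return f"{lower}-{lower + width - 1}"


# Hypothetical raw observations (ages in years).
ages = [4, 9, 15, 18, 20, 23, 27, 35, 35, 41]

# Count how many observations fall into each group.
frequency = Counter(age_group(a) for a in ages)

# Print the table in order of the lower bound of each group.
print("Age Group    Frequency")
for label in sorted(frequency, key=lambda s: int(s.split("-")[0])):
    print(f"{label:<12} {frequency[label]}")
```

The group with the highest count is the modal class, which is one of the patterns a frequency distribution makes easy to spot.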