Business Analytics Notes
❖ Introduction
Data Science is an interdisciplinary field that combines statistics, data analysis, and machine learning to obtain
meaningful insights and knowledge from data. Data Science is applied in many spheres, including banking,
healthcare, manufacturing, and e-commerce, to serve critical applications such as optimizing routes, forecasting
revenues, creating targeted promotional offers, and even predicting election outcomes.
Data is the raw material from which information is extracted, for example, numbers, text, observations or
recordings. Data can be structured, i.e. organized into predefined categories or concepts, such as lists, tables,
datasets or spreadsheets.
Relevance: The relevance of data or statistical information reflects the degree to which it meets the
needs of data users. Some questions that must be answered are: "Does this information matter?"
and "Does it fill an existing data gap?"
Accuracy: Accurate data give a true reflection of reality; data which is not accurate does not support
any fruitful decision and hence has no value.
Timeliness: It is the time when data is available to the user or decision maker.
Interpretability: Information that people cannot understand has no value and could even be
misleading.
Coherence: It can be split into two concepts: consistency and commonality.
Accessibility: It is defined as how easy it is for people to find, get, understand, and use data. When
determining whether data are
accessible, make sure they are organized, available, accountable, and interpretable.
o Data Collection: Data analysis begins with gathering data from different sources.
o Data Cleaning: Raw data often contains missing values, duplicates, or inconsistencies.
o Data Exploration: After the data has been cleaned, the structure, distribution, and relationships
within the data are examined.
o Data Transformation: Transforming data into a format suitable for analysis involves normalizing
values, aggregating data, or creating new features to enhance the effectiveness of the model.
o Data Modelling: Here, statistical, mathematical, or machine learning models are applied to the data
to answer specific research questions. Common techniques include regression analysis,
classification algorithms, clustering, and time-series analysis.
o Data Interpretation: This is the last stage of data analysis, wherein results are interpreted
and conclusions drawn.
e) Effective supply chain management is crucial to maintaining smooth operations and meeting
customer demand.
Volume: It is the huge amount of data generated, measured in terabytes, petabytes, or even
exabytes. For example, the likes, comments, and posts shared by billions of Facebook users
every day.
Velocity: The speed at which data is generated, processed, and analyzed. Real-time processing is
often necessary; for example, stock market systems process millions of transactions per second to
give real-time updates.
Variety: It refers to the different types of data: structured data, which is ready for modeling and
analysis; unstructured data, which is scattered and cannot be used straight away; and
semi-structured data, which lies in between the two. For example, YouTube videos and tweets
have different structures.
Veracity: It refers to the accuracy, quality, and trustworthiness of data. Data can be messy,
incomplete, or misleading; hence it needs filtering before processing.
Value: The insights and benefits that organizations are able to derive from the analysis of
big data.
Data Type: primarily structured data (rows, columns) versus structured, semi-structured, and
unstructured data.
• Healthcare: Big data is crucial in health care because it allows for predictive analytics.
• Retail and E-commerce: Retailers leverage big data to analyze customer behavior and preferences,
enabling personalized shopping experiences.
• Finance and Banking: Big data analytics enables financial institutions to identify fraud by monitoring
transaction patterns and discovering
anomalies in real time.
• Transportation: Transportation systems use big data to analyze traffic patterns, optimize
routes, and reduce congestion, improving travel efficiency and safety.
• Media and Entertainment: Streaming platforms use big data to analyze viewing
habits and preferences, ensuring users receive personalized content recommendations.
• Education: Educational platforms use big data to track student performance, identify learning gaps,
and customize study materials.
• Manufacturing: Big data allows manufacturers to keep track of the performance of the equipment
and predict potential failure, which decreases downtime and
maintenance costs.
• Government: Governments utilize big data to improve urban planning and optimize resources,
creating smarter and more sustainable cities. For example, Singapore’s
Smart Nation initiative uses data from IoT devices to manage traffic, monitor air quality, and
improve public transport systems.
• Energy and Utilities: Energy providers use big data to analyze consumption patterns, forecast demand,
and improve energy efficiency.
BUSINESS ANALYTICS
UNIT-2 Data Preparation, Summarisation and Visualisation Using Spreadsheet
❖ Data Preparation
Data preparation is one of the major processes in the pipeline of data analysis and machine
learning. This involves tasks such as data cleaning, transforming, and arranging raw data in a
format that enables effective analysis or model training.
Data Collection: This involves collecting raw data from a large number of sources, such as
databases, spreadsheets, APIs, or even sensors.
Data Cleaning: It involves detecting and correcting errors, inconsistencies, and missing
values in a dataset. Treatment of outliers, duplicate entries, or irrelevant data points is
essential in this stage.
Data Transformation: This refers to the conversion of data into a format or structure fit for
analysis.
Data Integration: This is the integration of data from different sources into one dataset. It
may involve table merging, dataset joining, or another kind of data conflict resolution.
Data Reduction: This is a process for reducing either the size or the complexity of the
dataset, and it involves feature selection, dimensionality reduction, and sampling, among
others.
Data Formatting: This ensures consistency in format, including standardized date formats and
variable naming conventions.
Data Splitting: It is the division of data into subsets, usually training, validation, and test
sets. These sets help a model builder to build models with the data, tune their
hyperparameters, and finally estimate their performance.
❖ Data Cleaning
Data cleaning, also known as data cleansing or data scrubbing, is the process of identifying,
correcting, or removing inaccurate, incomplete, or irrelevant data from a dataset.
✓ Correcting Data Errors: To correct commonly occurring data errors, identify and
rectify the following issues:
✓ Inconsistent Data: Fix inconsistencies in formatting (e.g., date formats, text case) or
values (e.g., “NY” vs. “New York”).
✓ Data Entry Errors: Identify and correct typographical errors or misentered data,
such as incorrect numerical values or misspelled words.
✓ Standardizing Data: Data is standardized by normalization or transformation.
✓ Normalization: Ensure that data is consistent in format, especially for categorical
data (e.g., “Male” vs. “M” or “1/1/2024” vs. “01-Jan-2024”).
✓ Transformation: Convert data into a common scale or unit, such as converting all
weights to kilograms or all prices to a single currency.
✓ Outlier Detection and Treatment: Detect outliers that fall outside the expected
range of values.
✓ Validating Data Accuracy: To validate accuracy of the data, check the data against
reliable sources or business rules to ensure accuracy.
✓ Removing Irrelevant Data: This can be done by filtering the data, that is, by removing
data that is not relevant to the analysis or that does not contribute useful information.
✓ Formatting and Structuring Data: This is done by ensuring that data is in the
correct format, such as consistent date formats or proper text casing.
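For instance, Excel's built-in text functions can help standardize messy text; a small sketch, assuming
the raw values sit in a hypothetical column A:
Example: =TRIM(A2) removes extra leading and trailing spaces.
Example: =PROPER(A2) converts text to proper case; =UPPER(A2) and =LOWER(A2) standardize casing.
Example: =SUBSTITUTE(A2, "NY", "New York") replaces one inconsistent value with its standard form.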
❖ Data Summarization
Data summarization is the process of transforming a given large dataset into a smaller form,
usually presentable, for reporting, analysis, and further examination.
❖ Data Sorting
Sorting helps users to organize data in a specific order. You can sort a text column in
alphabetical order (A-Z or Z-A).
❖ Filtering Data
Filters are used to temporarily hide some of the data in a table. This helps users to focus on
the data that is important for the current task at hand.
Example: Filter a range of data
• Step 1: Select the column on which to apply the filter.
• Step 2: Select Data > Filter.
• Step 3: Select the column header arrow.
• Step 4: In case of text data, uncheck the values that you do not want to see.
For filtering data on numeric values, you can even select a comparison, like Between, to
see only the values that lie in a given range.
• Step 5: Click on OK.
❖ Conditional Formatting
Conditional Formatting allows users to fill cells with a certain color depending on a
condition. This enhances data visualization and interpretation.
Example: Highlight cells that have a value greater than 350.
• Step 1: Select the range of cells on which conditional formatting has to be applied.
• Step 2: On the Home tab, under Styles Group, click Conditional Formatting.
• Step 3: Click Highlight Cells Rules > Greater Than....
• Step 4: Enter the desired value and select the formatting style.
• Step 5: Click OK
❖ Text to Column
The Text to Columns feature is used to separate data in a single column into multiple columns. This
enhances readability of the data. For example, if a column contains first name, last name and
profession together, then this information can be separated into different columns.
• Step 1: Select the cell or column that contains the text to be split.
• Step 2: Select Data > Text to Columns.
• Step 3: In the Convert Text to Columns Wizard displayed on the screen, select Delimited > Next.
• Step 4: Select the Delimiters for your data.
• Step 5: Select Next.
• Step 6: Preview the split and select Finish.
❖ Removing Duplicates
Select the range of cells containing duplicate values that should be removed. To do this in MS
Excel:
Step 1: Select the data from which duplicate values have to be removed.
Step 2: Select Data > Remove Duplicates.
Step 3: Check the columns that should be examined for duplicate values, and uncheck the rest.
Step 4: Click OK.
❖ Data Validation
Excel is a powerful tool for data analysis, reporting, and decision-making. But, the reliability
of these activities depends on the accuracy and integrity of the data. Data validation helps
users control the input to ensure accuracy and consistency.
Step 1: Select the Cells for Data Validation
Step 2: In the Data Tab, click on Data Validation to open the Data Validation
Dialog Box
Step 3: In the Data Validation dialog box, under the Settings tab, define
the validation criteria:
Allow: Select the type of data. This data can be Whole Number, Decimal, List (only values
from a predefined list are allowed), Date, Time, Text Length (only text of a certain length is
allowed).
Data: Specify the condition (e.g., between, not between, equal to, not equal to, etc.).
Minimum/Maximum: Enter the acceptable range or limits based on the above selection. For
example, to allow values between 100 and 1000, select “Whole Number,” “between,” and
then set the minimum to 100 and the maximum to 1000.
Show Error Alert after Invalid Data is entered: Check this to enable error alerts.
Style: Choose from Stop, Warning, or Information to indicate the severity of the alert.
Title: Enter a title for the error message box.
Error Message: Type the message to be displayed. It must explain the error and suggest
ways to correct it.
f) Calculate the Upper and Lower Bounds: Defining the upper and lower bounds of the
data allows identification of values that are higher than expected (above the upper bound)
or lower than expected (below the lower bound).
Calculate the upper bound by multiplying the IQR by 1.5 and adding the result to the third
quartile: "= Q3 + (1.5 * IQR)". Similarly, calculate the lower bound by subtracting 1.5 times
the IQR from the first quartile: "= Q1 - (1.5 * IQR)".
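A small worked sketch, assuming the values sit in a hypothetical range A2:A100:
Example: =QUARTILE.INC(A2:A100, 1) returns Q1, and =QUARTILE.INC(A2:A100, 3) returns Q3.
Example: =QUARTILE.INC(A2:A100, 3) + 1.5 * (QUARTILE.INC(A2:A100, 3) - QUARTILE.INC(A2:A100, 1))
computes the upper bound in a single formula.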
g) Remove the Outliers: After defining the upper and lower bounds of data, review the
data to identify values that are higher than the upper bound or lower than the lower
bound.
❖ Covariance
Covariance measures how two variables vary together. Excel's COVARIANCE.P function takes two arguments:
Array1 is a range or array of numeric values.
Array2 is a second range or array of numeric values.
Note the following points:
If the given arrays contain text or logical values, they are ignored by the COVARIANCE
function in Excel.
The data should contain numbers, names, arrays, or references that are numeric. If some
cells do not contain numeric data, they are ignored.
The data sets should be of the same size, with the same number of data points.
The data sets should be neither empty nor should the standard deviation of their values be
zero.
To find covariance in Excel and determine if there is any relation between
the two columns C and D, we can write =COVARIANCE.P(C1:C10, D1:D10).
Mathematically, covariance is calculated as:
COV(X, Y) = Σ (xᵢ – x̄)(yᵢ – ȳ) / n
where x̄ and ȳ are the means of X and Y, and n is the number of data points.
❖ Correlation Matrix
A correlation matrix is a table that displays the correlation coefficients for different variables.
A correlation matrix consists of rows and columns that show the correlation coefficient
between the variables. Correlation is a statistical measure that describes the extent to which
two or more variables are related to each other.
Positive Correlation: When values of two variables increase or decrease together,
they are said to be positively correlated. For example, height and weight are positively
correlated; as height increases, weight tends to increase as well.
Negative Correlation: When two variables are negatively correlated, an increase in one
variable results in a decline in the other. For example, speed and travel time are negatively
correlated: when speed increases, it takes less time to reach the destination.
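In Excel, the correlation coefficient for a pair of variables can be computed with the CORREL
function; for example, assuming two variables in the hypothetical ranges C1:C10 and D1:D10:
Example: =CORREL(C1:C10, D1:D10) returns a coefficient between -1 and +1.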
❖ Moving Average
Moving average, also known as rolling average, running average or moving mean, is defined
as a series of averages for different subsets of the same data set.
To visualize the moving average on a chart by drawing a trendline follow the steps given
below:
Step 1: Click anywhere in the chart.
Step 2: On the Layout tab, in the Analysis group, select the trendline option.
Step 3: Click the desired option.
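A moving average can also be computed directly with a formula; a small sketch, assuming the
values sit in a hypothetical range B2:B10:
Example: =AVERAGE(B2:B4) returns the first 3-period moving average; copying the formula down
the column slides the window one row at a time.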
❖ Identifying Missing Values
Excel does not have any particular function to list missing values, but identifying them is
important because of the following reasons:
o Data Integrity which ensures that the dataset is complete.
o Data Reconciliation that facilitates the reconciliation process (mostly used in
finance).
o Quality Assurance to identify anomalies or data entry errors.
o Efficient Analysis to perform accurate data analysis by spotting and addressing gaps.
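Although there is no dedicated function, missing entries can be counted or flagged with built-in
functions; a small sketch, assuming the data sits in a hypothetical range A2:A100:
Example: =COUNTBLANK(A2:A100) counts the empty cells in the range.
Example: =IF(ISBLANK(A2), "Missing", "OK") flags each row individually.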
❖ Data Summarization
Data summarization in Excel can be done in multiple ways like:
Using Descriptive Statistics: For example, given a list of values in column A, we can use
Excel functions to summarize the values.
▪ SUM, AVERAGE, MEDIAN: Calculate the total, mean, and median of a dataset.
Example: =SUM(A2:A100) sums all values in the range A2 to A100.
Example: =AVERAGE(A2:A100) calculates the average.
▪ COUNT, COUNTA: Count the number of cells with numbers (COUNT) or any data (COUNTA).
Example: =COUNT(A2:A100) counts numeric entries.
❖ Data Visualization
Data visualization helps users transform raw data into meaningful visual stories that
enable them to spot trends in data and communicate complex information effectively.
Step 1: Organize the data in rows and columns within the Excel sheet. Every row and
column should be labelled clearly to identify the data to be visualized.
Step 2: Select the data by clicking and dragging the mouse to highlight the data to be
visualized. Include the row and column headers in this selection.
Step 3: Choose a chart type by clicking on the “Insert” tab. In the “Charts” section,
select the required chart option (Column, Line, Pie, Bar, Area, Scatter, etc.) by
clicking on the dropdown arrow below the chart type.
Step 4: Insert the chart. Once the desired chart is selected, it is automatically created
and inserted in the worksheet.
Step 5: Customize the chart. For this, click on the chart to select it. Now, you would
be able to see two additional tabs: “Design” and “Format”.
1. Column Chart: It displays data using vertical bars. Each bar represents a category.
2. Bar Chart: It is similar to a column chart, but instead of vertical bars, it has horizontal
bars.
3. Line Chart: The line chart plots data points and then connects these points by lines.
4. Pie Chart: A pie chart plots data as slices of a circle. Size of each slice is proportional
to the value it represents.
5. Scatter Plot: A scatter plot displays data points on a Cartesian coordinate system, with
each axis representing a variable.
❖ Pivot Tables
Pivot tables are an important feature of MS Excel that allows users to quickly summarize large
amounts of data, analyze numerical data in detail, and answer unanticipated questions about
the data.
❖ Pivot Chart
Pivot Chart is a dynamic visualization tool that helps users summarize and analyze large
datasets.
Step 1: Click any cell inside the pivot table.
Step 2: On the PivotTable Analyze tab, click on PivotChart in the Tools group.
Step 3: Click OK on the Insert Chart dialog box.
❖ Interactive Dashboard
✓ Step 1: Define the Purpose of the Dashboard.
✓ Step 2: Gather data in the form of a table and then convert this table into a pivot table.
This is done by:
a) Selecting the table.
b) In the Insert Tab, click on Pivot Table.
c) Click on OK and the Pivot Table will be inserted in a new sheet.
✓ Step 3: Create Charts using the Pivot Table.
✓ Step 4: In the PivotTable Analyze group, click on PivotChart and select a suitable
chart from the chart drop-down.
✓ Step 5: Click on OK. The pivot chart will be created.
❖ Introduction
In this chapter we will introduce you to a popular open-source programming language designed primarily for
statistical computing and data analysis, i.e. R programming (referred to as R henceforth). Suppose a retail
company, "ShopSmart," needs to analyse its daily sales; currently it uses Excel for basic data handling,
but this becomes challenging as the size of the data increases.
Statistical software generally has very costly licenses, but R is completely free to use, which makes it
accessible to anyone interested in learning data analysis without needing to invest money.
R is a versatile statistical platform providing a wide range of data analysis techniques, enabling
virtually any type of data analytics to be performed efficiently and having state-of-the-art graphics
capabilities for visualization.
Data is mostly gathered from a variety of sources, and analysing it in one place has its own
challenges; R makes it possible to import and combine data from many different sources in one environment.
R is compatible with a broad range of platforms, including Windows, Unix, and macOS, making it
likely to run on almost any computer you use.
The R community, which provides a wide level of support for R programmers, has developed thousands
of packages, extending R's capabilities into specialized areas, such as quantmod for finance, ggplot2
for visualization, and support for machine learning algorithms as well.
Installation
To begin with R, students need to install both R (the base programming language) and RStudio which is an
Integrated Development Environment (IDE) that makes working with R much easier.
For R
Step 1: Go to CRAN (Comprehensive R Archive Network): https://cran.r-project.org/.
Step 2: Choose your operating system (Windows, macOS, or Linux).
Step 3: Download and run the installer.
R Interface
For RStudio
Visit RStudio's website: https://www.rstudio.com/products/rstudio/download/.
Choose the free version, “RStudio Desktop.”
Follow the installation prompts.
RStudio Interface
Source Editor Pane: In RStudio IDE, you can access the source editor for R code.
Console Pane: This pane (as shown in 3 of Figure 3.1) contains the R interpreter, where R code is processed.
Environment Pane: This pane can be used to access the variables that are created in the current R
session.
Output Pane: This pane contains the Files, Plots, Packages, Help, Viewer, and Presentation tabs.
Packages: When you download and install R for the first time, you install the Base R
software, which contains most of the functions that you will use frequently, like mean()
and hist().
Importing data from spreadsheets is quite common in business analytics because most business data is stored
in formats such as Excel. Using R, you can easily import spreadsheet data into your workspace with
packages like readxl and openxlsx.
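A minimal sketch of importing a spreadsheet with the readxl package; the file name sales.xlsx is a
hypothetical example:

# install.packages("readxl")    # run once if the package is not yet installed
library(readxl)

sales <- read_excel("sales.xlsx")   # read the first worksheet into a data frame
head(sales)                         # inspect the first few rows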
Keywords: These are an integral part of R's syntax; keywords are reserved words used to implement
various functionalities in R.
Data Type
Unlike C or C++, we do not require to declare a variable with data type in R. It supports random assignment
of data type for a variable depending upon the values that it has been initialized to.
Data Types in R
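For example, the type of a variable follows whatever value is currently assigned to it:

x <- 42       # x holds a number
class(x)      # "numeric"
x <- "hello"  # reassigning a string changes the type
class(x)      # "character"
x <- TRUE     # now logical
class(x)      # "logical"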
Operators
Operators are tools that help us perform various operations on data; from basic calculations to more
advanced logical comparisons, operators tell R what action to take on data.
Relational Operators are used to compare values and check for conditions like equality, greater
than, or less than. For instance, 5 > 3 checks whether 5 is greater than 3 and returns TRUE. Similarly,
5 == 3 checks for equality and returns FALSE.
Logical Operators let you combine or modify logical values. You can use & to perform an AND
operation, where it is only true if both the conditions are satisfied.
Assignment Operators are used to store values in variables. The most commonly used operator is <-, which
assigns a value to a variable, like x <- 10. You can also use = for assignment, but <- is preferred in R
because it is clear and consistent with the syntax of the language.
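A short sketch of the three operator groups in base R:

x <- 10            # assignment with <-
y = 3              # assignment with = (works, but <- is preferred)
x + y              # arithmetic: 13
x > y              # relational: TRUE
x == y             # relational equality: FALSE
(x > 5) & (y > 5)  # logical AND: FALSE, since only the first condition holds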
Functions
In R, user-defined functions enable you to create reusable blocks of code to perform specific tasks.
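A minimal sketch of a user-defined function (the name and logic are purely illustrative):

# A reusable block of code that returns the square of its argument
square <- function(x) {
  x^2
}
square(4)   # 16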
BUSINESS ANALYTICS
UNIT-4 Data Structures in R
❖ Vectors
It is one of the basic data structures in R programming languages, it is used to store multiple
values having same type also called modes. It is one-dimensional and can hold numeric,
character, logical or other values, but all the values must have same mode. Vectors are
fundamental to R, hence most of the operations are performed on vectors.
Types of Vectors
❖ Creating a Vector
You can create vectors using the c() function, which stands for combine or concatenate.
Vectors are stored contiguously in memory, just like arrays in C; hence the size of a vector
is determined at the time of creation.
✓ Length: We can obtain the length of a vector using the length() function. This can be
used to iterate over a vector in loops.
✓ You can also filter vectors by applying a logical expression that returns TRUE/FALSE for
each element; the output contains the elements for which the expression is TRUE, as sketched below.
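A minimal sketch of creating, measuring, and filtering a vector:

v <- c(10, 25, 3, 47, 8)   # create a numeric vector with c()
length(v)                  # 5
v[v > 9]                   # logical filtering keeps 10, 25, 47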
Note: One important point when applying an operation to two vectors is that such operations
require both vectors to be the same length. In case of a length mismatch, R automatically
recycles, or repeats, the shorter one.
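For instance:

c(1, 2, 3, 4) + c(10, 20)   # the shorter vector is recycled: 11, 22, 13, 24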
• Miscellaneous Functions: There are certain functions, shown in the table below, which
can be used with vectors as required.
❖ Matrices
Matrices are actually a special type of a broader concept in R called arrays. While matrices
have just two dimensions (rows and columns), arrays can go further and have multiple
dimensions. For instance, a three-dimensional array has rows, columns, and layers, adding an
extra level of organization into your data.
Creation: Matrices are generally created using the matrix() function; the data in matrices is
stored in column-major format by default. The 'nrow' parameter specifies rows, and 'ncol'
specifies columns.
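A minimal sketch:

m <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column by default
m[2, 3]                                # element in row 2, column 3: 6
t(m)                                   # transpose swaps rows and columns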
❖ Arrays
An array in R is a data structure that can store data in more than one dimension; hence,
in R, arrays are an extension of matrices. While a matrix is constrained to two
dimensions, with rows and columns, an array can take three or more dimensions.
An array can be created using the array() function with the arguments data, dimensions and
dimension names.
Array elements can be accessed in the same manner as vectors or matrices. We can also
name the dimensions.
We can also reshape an array's dimensions.
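A minimal sketch:

a <- array(1:24, dim = c(2, 3, 4))   # 2 rows, 3 columns, 4 layers
a[1, 2, 3]                           # row 1, column 2, layer 3: 15
dim(a) <- c(4, 6)                    # reshape the same 24 values into a 4 x 6 matrix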
❖ Lists
In R, a list is an amazingly flexible data structure: it can store any kind of data
together, such as numbers, characters, vectors, matrices, and even other lists.
You create a list by using the list() function, and the elements of the list are accessed
using double square brackets [[ ]]. So, for instance, list(42, "Hello", c(1, 2, 3)) generates
a list that holds a number, a string, and a vector.
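For example:

l <- list(42, "Hello", c(1, 2, 3))
l[[2]]       # "Hello"
l[[3]][1]    # first element of the vector stored in the list: 1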
❖ Factors
Factors are another type of R object, created from a vector; a factor stores the vector as
well as a record of the distinct values in that vector, called levels. Factors are mainly used
for nominal or categorical data.
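A minimal sketch:

sizes <- factor(c("small", "large", "small", "medium"))
levels(sizes)   # the distinct values: "large" "medium" "small"
table(sizes)    # frequency count per level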
❖ Data Frames
A data frame is a two-dimensional, tabular data structure commonly used for storing and
manipulating data. It is very similar to a table or spreadsheet, where each column can store data
of a different type (numeric, character, logical) and each row is an observation or record.
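A minimal sketch with hypothetical columns:

df <- data.frame(
  name   = c("Asha", "Ravi", "Meena"),   # character column
  sales  = c(120, 98, 143),              # numeric column
  active = c(TRUE, FALSE, TRUE)          # logical column
)
df$sales   # access one column
df[2, ]    # access one row (one observation)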
• There are three decision-making constructs in R programming: if, if...else, and switch.
The if statement in R is the simplest form of decision making. It evaluates a
condition; if that condition is TRUE, the code block inside if is executed;
otherwise, the code block is skipped.
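A short illustration of all three constructs:

x <- 7
if (x > 5) {
  print("greater than five")
} else {
  print("five or less")
}

# switch() selects a branch by matching a character value
day_type <- switch("sat", mon = "weekday", sat = "weekend", sun = "weekend")
day_type   # "weekend"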
❖ Loops
Like any other programming language, we have loops in R too. They are basic constructs
allowing a block of code to be executed repeatedly. R implements several kinds of loops: for,
while, and repeat. Each loop type is suited for different tasks, depending on the kind of
control flow needed.
▪ For Loop: It is used to iterate over a sequence of elements (anything iterable), such
as a vector, list, or sequence, using a loop control variable.
▪ Like the for loop, the while loop repeatedly executes a block of code as long as the
condition remains TRUE. But here the loop control variable needs to be initialized
outside the loop.
▪ The third type of iterative statement, the repeat loop, runs indefinitely until explicitly
stopped using a break statement.
▪ We can also have nested loops for complex operations where iteration is needed at
various levels. For example, to print every column value for each row, one loop can be
nested inside another.
▪ The next and break statements can be used to control a loop: next skips the current
iteration and moves to the next one, while break terminates the loop entirely, as seen in the
repeat loop and in the sketch below.
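A minimal sketch of the loop types and control statements:

for (i in 1:3) {     # for: iterate over a sequence
  if (i == 2) next   # skip iteration 2
  print(i)           # prints 1 and 3
}

n <- 1
while (n <= 3) {     # while: condition checked before each pass
  print(n)
  n <- n + 1
}

m <- 1
repeat {             # repeat: runs until break
  m <- m + 1
  if (m > 3) break
}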
❖ Apply Family
The apply family in R includes functions like apply, lapply, sapply, vapply, tapply, mapply,
and rapply. It is a very useful and powerful feature of R. These functions provide alternatives
to loops for applying functions across various data structures like vectors, matrices, arrays,
lists, factors, and data frames.
The apply() function operates on the margins of a matrix or array. It applies a given
function along the rows or columns of a matrix or higher-dimensional array.
lapply() is used to apply a function to each element of the list and it returns a list.
The sapply() function works like lapply() but it attempts to simplify the output into a
vector or matrix when possible.
vapply() is also like lapply() and sapply(), but it lets you specify the expected output
type for better reliability.
tapply() applies a function to subsets of a vector, defined by a factor or a list of
factors. It takes three input parameters: a data vector, factors to group by, and a function
to apply.
mapply() can be used to apply a function to multiple arguments (vectorized).
If you want to recursively apply a function to elements of a list, you can use rapply();
it can also be used to handle nested lists.
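A minimal sketch of the most common members:

m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)                          # row sums: 9 12
lapply(list(1:3, 4:6), mean)              # returns a list: 2 and 5
sapply(list(1:3, 4:6), mean)              # simplified to a vector: 2 5
tapply(c(10, 20, 30, 40),
       c("a", "b", "a", "b"), sum)        # sums by group: a = 40, b = 60
mapply(function(x, y) x + y, 1:3, 4:6)    # element-wise over both inputs: 5 7 9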
BUSINESS ANALYTICS
UNIT-5 Descriptive Statistics Using R
❖ Introduction
Data analysis is an important skill in today’s data-driven world, allowing people and
organizations to extract meaningful insights from raw data.
d) Line Graphs: They are recommended when we want to analyse trends over a
continuous variable or to observe relationships.
e) Scatter Plots: When we need to visualize the relationship between two continuous
variables, we can use scatter plots; they are an ideal choice for identifying trends,
clusters, or correlations. The code to generate both kinds of plot is sketched below.
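A minimal sketch using base R graphics and hypothetical data:

x <- 1:10
y <- c(2, 4, 5, 7, 8, 8, 10, 12, 13, 15)
plot(x, y, type = "l", main = "Line graph")   # trend as a connected line
plot(x, y, main = "Scatter plot")             # the same data as points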
II. Median: The median is the middle value in a sorted dataset; it is the central value
if the dataset has an odd number of observations, and the average of the two central
values if the number of observations is even. Compared to the mean, the median is
less affected by outliers. The example below shows how to compute the median.
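A minimal sketch with hypothetical data:

values <- c(3, 1, 7, 5, 9)
median(values)   # 5, the middle value of the sorted data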
III. Mode: The mode represents the value that appears most frequently in a
dataset. A dataset can be unimodal (one mode), multimodal (more than one mode),
or have no mode at all if no value repeats. The example below shows how the mode
of a dataset can be computed.
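A minimal sketch (note that base R's mode() function reports a variable's storage type, not the
statistical mode, so the mode is computed from a frequency table instead):

values <- c(2, 4, 4, 6, 4, 6)
freq <- table(values)                        # frequency of each distinct value
as.numeric(names(freq)[freq == max(freq)])   # 4 appears most often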
❖ Measure of Dispersion
a. Range: The simplest measure of dispersion is the range, which is the difference between
the maximum and minimum values in a dataset. Although the range is easy to calculate,
it is sensitive to outliers.
Formula:
Range = Maximum Value – Minimum Value
b. Variance: It measures deviation from the mean, i.e. how far each data point is
from the mean, on average. A higher variance indicates greater variability in the data.
Variance is expressed in squared units, which makes it harder to interpret directly. The
formula of variance is given below:
Formula:
Variance (σ²) = Σ (xᵢ – x̄)² / n
d. Interquartile Range (IQR): It measures the spread of the middle 50% of the data values,
indicating how spread out the middle half of a dataset is. For better understanding,
imagine you line up all your data from smallest to largest (sorted). The IQR focuses
on the middle 50% of those numbers, ignoring the smallest and largest values. The
formula is shown below:
formula is shown below:
Formula:
IQR = Q3 – Q1
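A minimal sketch computing all three measures in R with hypothetical data:

values <- c(4, 8, 15, 16, 23, 42)
diff(range(values))   # range: max - min = 38
var(values)           # variance (note: var() divides by n - 1, the sample formula)
IQR(values)           # interquartile range: Q3 - Q1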
BUSINESS ANALYTICS
UNIT-6 Predictive and Textual Analytics
❖ Introduction
Predictive analytics is changing industries because it facilitates data-driven decision-making
and operational efficiency. Textual analysis is the systematic examination and interpretation
of textual data in order to draw meaningful insights, patterns, and trends.
The very first step is to prepare the data; for this we need to ensure that the dataset is
clean and contains no missing values for the variables involved. We load the data into R
using functions like read.csv() or read.table().
Once the data is loaded, we can visualize it using a scatter plot.
After that we can fit the regression model using the lm() function, and then use the
summary() function to understand the details of the model.
We can also make predictions using the predict() function, as sketched below.
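A minimal end-to-end sketch, assuming a hypothetical file sales.csv with columns ad_spend and revenue:

sales <- read.csv("sales.csv")                  # load the data
plot(sales$ad_spend, sales$revenue)             # visualize the relationship first
model <- lm(revenue ~ ad_spend, data = sales)   # fit a simple linear regression
summary(model)                                  # coefficients, R-squared, p-values
predict(model, newdata = data.frame(ad_spend = c(1000, 2000)))   # predictions for new budgets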
• The variance of the errors must be constant across the ranges of the independent variables,
i.e. homoscedasticity.
• Residuals or error terms must be normally distributed.
• Independent variables should not have strong mutual correlations, i.e. no multicollinearity.