
Business Analytics Notes

The document provides an overview of Business Analytics, focusing on Data Science, its applications, and the importance of data preparation, cleaning, and summarization. It discusses various types of data analytics, the characteristics of big data, and the challenges faced in data analytics. Additionally, it outlines key techniques for data summarization and visualization using spreadsheets.

Uploaded by

avnish.kumar1821

BUSINESS ANALYTICS

UNIT-1 INTRODUCTION TO DATA SCIENCE

❖ Introduction

Data Science is an interdisciplinary field that combines statistics, data analysis, and machine learning to obtain meaningful insights and knowledge from data. Data Science is applied in many spheres, including banking, healthcare, manufacturing, and e-commerce, to serve critical applications such as optimizing routes, forecasting revenues, creating targeted promotional offers, and even predicting election outcomes.

➢ Data and Its Types

Data is the raw material for extracting information; examples include numbers, text, observations, and recordings. Data can be structured, i.e. organized into predefined categories or concepts such as lists, tables, datasets, or spreadsheets.

Relevance: The relevance of data or statistical information reflects the degree to which it meets the needs of data users. Questions to ask include, "Does this information matter?" and "Does it fill an existing data gap?"
Accuracy: Accurate data give a true reflection of reality; data that is not accurate does not support fruitful decisions and hence has no value.
Timeliness: The time at which data is available to the user or decision maker.
Interpretability: Information that people cannot understand has no value and could even be misleading.
Coherence: It can be split into two concepts: consistency and commonality.
Accessibility: How easy it is for people to find, get, understand, and use data. When determining whether data are accessible, make sure they are organized, available, accountable, and interpretable.

❖ Data Analytics and Data Analysis


Data analytics and data analysis are two terms frequently used interchangeably, but they have different meanings in the context of working with data and extracting useful insights. Data analytics refers to the whole process of examining datasets in order to find trends, patterns, relationships, and other insights that might help in the decision-making process.

Four main types of Data Analytics:


▪ Descriptive Analytics: This type of analytics focuses on summarizing past data and describing what happened. It includes the use of historical data to identify trends and patterns.
▪ Diagnostic Analytics: It goes a step further, identifying the causes of the trends or patterns found in the descriptive analytics phase.
▪ Prescriptive Analytics: This type of analytics suggests possible actions and outcomes based on the analysis.
▪ Predictive Analytics: It has the ability to make predictions based on historical data through statistical models, and the output could be a machine learning algorithm.

| Type | Focus | Question Answered | Techniques/Methods |
| --- | --- | --- | --- |
| Descriptive Analytics | Historical data analysis | What happened? | Statistical analysis, reporting |
| Diagnostic Analytics | Cause analysis | Why did it happen? | Correlation, regression |
| Predictive Analytics | Future forecasting | What could happen? | Machine learning, forecasting |
| Prescriptive Analytics | Actionable recommendations | What should we do? | Optimization, decision analysis |

o Data Collection: Data analysis begins with generating or gathering data from different sources.
o Data Cleaning: Raw data often contains missing values, duplicates, or inconsistencies.
o Data Exploration: After the data has been cleaned, the structure, distribution, and relationships in the data are examined and understood.
o Data Transformation: Transforming data into a format suitable for analysis involves normalizing values, aggregating data, or creating new features to enhance the effectiveness of the model.
o Data Modelling: Here, statistical, mathematical, or machine learning models are applied to the data to answer specific research questions. Common techniques include regression analysis, classification algorithms, clustering, and time-series analysis.
o Data Interpretation: This is the last stage of data analysis, wherein results are interpreted and conclusions are drawn.
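The stages above can be sketched end to end in plain Python (standard library only); the sales records, the duplicate and missing entries, and the crude trend check are all invented for illustration:

```python
# A minimal sketch of the analysis stages: collection, cleaning,
# exploration, and a very crude interpretation step.
from statistics import mean

# Collection: raw monthly sales records (one duplicate, one missing value)
raw = [
    {"month": 1, "sales": 100.0},
    {"month": 2, "sales": 120.0},
    {"month": 2, "sales": 120.0},   # duplicate entry
    {"month": 3, "sales": None},    # missing value
    {"month": 4, "sales": 160.0},
]

# Cleaning: drop duplicates and rows with missing values
seen, clean = set(), []
for row in raw:
    key = (row["month"], row["sales"])
    if row["sales"] is not None and key not in seen:
        seen.add(key)
        clean.append(row)

# Exploration: basic summary of the cleaned data
sales = [r["sales"] for r in clean]
print(len(clean), mean(sales))

# Interpretation: is the latest month above the overall average?
print(clean[-1]["sales"] > mean(sales))  # True
```

In practice each of these stages is far richer (imputation instead of deletion, proper models instead of a comparison), but the shape of the pipeline is the same.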

❖ Application of Analytics in Business

a) One of the most powerful applications of analytics in business is understanding customer behavior and preferences.
b) Analytics is also important for improving business processes.
c) Companies need to make wise financial decisions to stay ahead. Analytics can help forecast revenue, manage budgets, and assess financial risks so that companies have a well-supported decision-making process regarding investments and expenses. For example, predictive analytics can help a bank assess the creditworthiness of someone applying for a loan.
d) Analytics can also apply to Human Resources (HR) by helping improve employee performance, reduce turnover, and optimize the hiring process.

e) Effective supply chain management is crucial to maintaining smooth operations and meeting
customer demand.

❖ Big Data and its Characteristics:


This huge amount of data is termed big data. Businesses use this data to understand customer preferences, predict trends, and even recommend products to you (the familiar "You might also like"). For example, when Netflix suggests shows based on what you have watched, it is using big data to make smart recommendations tailored just for you.

Volume: The huge amount of data generated, measured in terabytes, petabytes, or even exabytes. For example, the likes, comments, and posts shared by billions of Facebook users every day.
Velocity: The speed at which data is generated, processed, and analyzed. Real-time processing is often necessary; for example, stock market systems process millions of transactions per second to give real-time updates.
Variety: The different types of data. Structured data is ready for modeling and analysis; unstructured data is scattered and cannot be used straight away; semi-structured data lies in between. For example, YouTube videos and tweets have different structures.
Veracity: The accuracy, quality, and trustworthiness of data. Data can be messy, incomplete, or misleading, hence the need for filtering before processing.
Value: The insights and benefits that organizations are able to derive from the analysis of big data.

| Parameter | Traditional Data | Big Data |
| --- | --- | --- |
| Data Size | Limited (gigabytes to terabytes) | Vast (terabytes to petabytes or more) |
| Data Type | Primarily structured data (rows, columns) | Structured, semi-structured, and unstructured |
| Processing Speed | Batch processing; slower insights | Real-time or near-real-time processing |
| Storage | Relational databases (e.g., SQL) | Distributed systems (e.g., Hadoop, NoSQL) |
| Complexity | Manageable with traditional tools and methods | High; requires distributed tools and frameworks |
| Technology | Uses tools like RDBMS (MySQL, Oracle) | Uses tools like Hadoop, Spark, NoSQL |
| Data Sources | Limited (e.g., business transactions, logs) | Multiple (e.g., social media, IoT, sensors) |
| Analysis Focus | Historical data analysis | Predictive, real-time, and trend analysis |
| Scalability | Limited scalability | Highly scalable using distributed systems |

❖ Applications of Big Data

• Healthcare: Big data is crucial in healthcare because it allows for predictive analytics.
• Retail and E-commerce: Retailers leverage big data to analyze customer behavior and preferences,
enabling personalized shopping experiences.
• Finance and Banking: Big data analytics enables financial institutions to identify fraud by monitoring
transaction patterns and discovering
anomalies in real time.
• Transportation: Transportation systems use big data to analyze traffic patterns, optimize
routes, and reduce congestion, improving travel efficiency and safety.
• Media and Entertainment: Streaming platforms use big data to analyze viewing
habits and preferences, ensuring users receive personalized content recommendations.
• Education: Educational platforms use big data to track student performance, identify learning gaps,
and customize study materials.
• Manufacturing: Big data allows manufacturers to keep track of equipment performance and predict potential failures, which decreases downtime and maintenance costs.
• Government: Governments utilize big data to improve urban planning and optimize resources, creating smarter and more sustainable cities. For example, Singapore’s Smart Nation initiative uses data from IoT devices to manage traffic, monitor air quality, and improve public transport systems.
• Energy and Utilities: Energy providers use big data to analyze consumption patterns, forecast demand,
and improve energy efficiency.

❖ Challenges in Data Analytics


a) Poor-quality data, such as outdated data or data that is completely inconsistent.
b) Integrating data from diverse sources, such as databases, IoT devices, and social media, can be complex due to varying formats and structures.
c) The sheer size of big data makes storage, processing, and analysis resource-intensive. Handling petabytes of data requires significant computational power, which can strain infrastructure.
d) Processing data in real time for immediate insights is difficult, especially with high-velocity data streams, yet delayed insights can mean missed opportunities in time-sensitive scenarios like fraud detection. Hence real-time processing is a major challenge.


BUSINESS ANALYTICS
UNIT-2 Data Preparation, Summarisation and Visualisation Using Spreadsheet

❖ Data Preparation
Data preparation is one of the major processes in the pipeline of data analysis and machine
learning. This involves tasks such as data cleaning, transforming, and arranging raw data in a
format that enables effective analysis or model training.

Data Collection: This involves collecting raw data from a large number of sources, such as
databases, spreadsheets, APIs, or even sensors.
Data Cleaning: It involves detecting and correcting errors, inconsistencies, and missing
values in a dataset. Treatment of outliers, duplicate entries, or irrelevant data points is
essential in this stage.
Data Transformation: This refers to the conversion of data into a format or structure fit for
analysis.
Data Integration: This is the integration of data from different sources into one dataset. It
may involve table merging, dataset joining, or another kind of data conflict resolution.
Data Reduction: This is a process for reducing either the size or the complexity of the
dataset, and it involves feature selection, dimensionality reduction, and sampling, among
others.
Data Formatting: Consistency in format, including standardized date formats and variable
naming conventions.
Data Splitting: Basically, it is the division of data into subsets, usually
training, validation, and test sets. These sets help a model builder to build
models with the data, tune their hyperparameters, and finally estimate
their performance.
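The data-splitting step above can be sketched in plain Python; the 80/10/10 ratio, the fixed seed, and the integer stand-in records are assumptions for illustration:

```python
# A minimal sketch of a train/validation/test split using the
# standard library only.
import random

records = list(range(100))          # stand-in for 100 data rows
random.seed(42)                     # fixed seed so the split is reproducible
random.shuffle(records)

n = len(records)
train = records[: int(0.8 * n)]               # 80% for model building
valid = records[int(0.8 * n): int(0.9 * n)]   # 10% for hyperparameter tuning
test  = records[int(0.9 * n):]                # 10% for final performance estimate

print(len(train), len(valid), len(test))  # 80 10 10
```

Shuffling before slicing matters: without it, any ordering in the raw data (for example, by date) would leak into the split.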

❖ Data Cleaning
Data cleaning, also known as data cleansing or data scrubbing, is the process of identifying,
correcting, or removing inaccurate, incomplete, or irrelevant data from a dataset.

❖ Key Steps in Data Cleaning:


✓ Handling Missing Data: Detect missing values in the dataset, which can appear as
blanks, NA, null, or other placeholders. This data can be handled by using the
following strategies.
✓ Removal: Delete rows or columns with missing values if they are not critical.
✓ Imputation: Replace missing values with estimates, such as the mean, median, mode,
or by using more advanced methods like predictive modeling.
✓ Placeholder: Leave the missing values as they are, but flag them for attention in
future analysis.
✓ Removing Duplicate Data: For this identify duplicate records that may occur due to
repeated data entry or merging datasets and then, delete duplicate values to prevent
skewed results in analysis.
2|Page

✓ Correcting Data Errors: To correct commonly occurring data errors, identify and
rectify the following issues:
✓ Inconsistent Data: Fix inconsistencies in formatting (e.g., date formats, text case) or
values (e.g., “NY” vs. “New York”).
✓ Data Entry Errors: Identify and correct typographical errors or misentered data,
such as incorrect numerical values or misspelled words.
✓ Standardizing Data: Data is standardized by normalization or transformation.
✓ Normalization: Ensure that data is consistent in format, especially for categorical
data (e.g., “Male” vs. “M” or “1/1/2024” vs. “01-Jan-2024”).
✓ Transformation: Convert data into a common scale or unit, such as converting all
weights to kilograms or all prices to a single currency.
✓ Outlier Detection and Treatment: Detect outliers that fall outside the expected
range of values.
✓ Validating Data Accuracy: To validate accuracy of the data, check the data against
reliable sources or business rules to ensure accuracy.
✓ Removing Irrelevant Data: This can be done by filtering data. That is by removing
data that is not relevant to the analysis or that does not contribute useful information.
✓ Formatting and Structuring Data: This is done by ensuring that data is in the
correct format, such as consistent date formats or proper text casing.
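Several of the cleaning steps above (mean imputation, normalization of variant spellings, and de-duplication) can be sketched with the standard library; the ages, city codes, and canonical mapping are invented for illustration:

```python
# Imputation: replace missing ages with the mean of the observed values
from statistics import mean

ages = [25, None, 30, 35, None]
observed = [a for a in ages if a is not None]
imputed = [a if a is not None else mean(observed) for a in ages]
print(imputed)  # [25, 30, 30, 35, 30]

# Normalization: map variant spellings ("NY" vs. "New York") to one form,
# then remove duplicates while preserving order
cities = ["NY", "New York", "ny", "Boston", "NY"]
canon = {"ny": "New York", "new york": "New York", "boston": "Boston"}
normalized = [canon[c.lower()] for c in cities]
deduped = list(dict.fromkeys(normalized))
print(deduped)  # ['New York', 'Boston']
```

Note that de-duplication only works after normalization; "NY" and "New York" would otherwise survive as two distinct records.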

❖ Data Summarization

Data summarization is the process of transforming a given large dataset into a smaller form,
usually presentable, for reporting, analysis, and further examination.

❖ Key Techniques in Data Summarization:


Data summarization can be further divided into different categories as given below:
Descriptive Statistics:
Measures of Central Tendency: Summarize data using the mean, median, and mode, which describe the middle of the distribution.
Measures of Dispersion: Describe the spread or variability in data using the range, variance, and standard deviation.
Percentiles and Quartiles: These provide insight into the distribution by indicating the relative standing of individual data points.
Data Aggregation: Combining many data points into summary values, for example, summing sales data by month or averaging scores across different categories.
Data Grouping: Grouping data into categories or segments and summarizing each group in isolation. This can be done using techniques like pivot tables, which summarize data based on different dimensions.
Visualization: Charts and graphs show trends and distributions of data through bar charts, histograms, pie charts, and line graphs. Box plots can be used to visualize the distribution, central value, and variability of the data, and possible outliers.
Dimensionality Reduction: Techniques like PCA or t-SNE reduce the number of variables in a dataset, keeping as much of the variability of the data as possible while summarizing it into lower dimensions.
Text Summarization: Methods for summarizing large documents or data sets, such as rapid keyword extraction, topic modeling, or abstract generation.
Data Profiling: Provides information about the structure of a dataset, such as the count of missing values, the data types, and the distribution of categorical variables.
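The descriptive-statistics measures listed above can be computed with Python's standard library; the dataset is invented for illustration:

```python
# Central tendency, dispersion, and quartiles on a small dataset.
import statistics as st

data = [4, 8, 6, 5, 3, 8, 9, 7, 8, 6]

print(st.mean(data))             # mean: 6.4
print(st.median(data))           # median: 6.5
print(st.mode(data))             # mode: 8
print(max(data) - min(data))     # range: 6
print(st.pvariance(data))        # population variance: 3.44
print(st.pstdev(data))           # population standard deviation
print(st.quantiles(data, n=4))   # quartiles [Q1, Q2, Q3]
```

Together these few numbers summarize where the data is centred and how widely it spreads, which is exactly what a report or dashboard needs.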

Why Summarize Data?


Simplifies Analysis: Summarization makes analysis and interpretation of large datasets easy and helps to quickly identify patterns and trends.
Facilitates Decision-Making: Summaries present data in a way that supports stakeholders in making decisions.
Improves Reporting: Summaries of data are used in most reports, dashboards, and presentations for effective communication.

❖ Data Sorting
Sorting helps users to organize data in a specific order. You can sort a text column in
alphabetical order (A-Z or Z-A).
❖ Filtering Data
Filters are used to temporarily hide some of the data in a table. This helps users to focus on
the data that is important for the current task at hand.
Example: Filter a range of data
• Step 1: Select the column on which to apply the filter.
• Step 2: Select Data > Filter.
• Step 3: Select the column header arrow.
• Step 4: For text data, uncheck the values that you do not want to see. For numeric values, you can also select a comparison, like Between, to see only the values that lie in a given range.
• Step 5: Click on OK.

❖ Conditional Formatting
Conditional Formatting allows users to fill cells with certain color depending on the
condition. This enhances data visualization and its interpretation.
Example: Highlight cells that have a value greater than 350.
• Step 1: Select the range of cells on which conditional formatting has to be applied.
• Step 2: On the Home tab, under Styles Group, click Conditional Formatting.
• Step 3: Click Highlight Cells Rules > Greater Than....
• Step 4: Enter the desired value and select the formatting style.
• Step 5: Click OK

❖ Text to Column
Text to column feature is used to separate a single column data into multiple columns. This
enhances readability of the data. For example, if a column contains first name, last name and
profession in a single column, then this information can be separated in different columns.
• Step 1: Select the cell or column that contains the text to be split.
• Step 2: Select Data > Text to Columns.
• Step 3: In the Convert Text to Columns Wizard displayed on the screen, select Delimited > Next.
• Step 4: Select the Delimiters for your data.
• Step 5: Select Next.
• Step 6: Preview the split and select Finish.

❖ Removing Duplicate Values

Select the range of cells containing duplicate values that should be removed. To do this in MS Excel:

Step 1: Select the data from which duplicate values have to be removed.
Step 2: Select Data > Remove Duplicates.
Step 3: Check the columns that should be compared when identifying duplicate records.
Step 4: Click OK.

❖ Data Validation

Excel is a powerful tool for data analysis, reporting, and decision-making. But, the reliability
of these activities depends on the accuracy and integrity of the data. Data validation helps
users control the input to ensure accuracy and consistency.
Step 1: Select the Cells for Data Validation
Step 2: In the Data Tab, click on Data Validation to open the Data Validation
Dialog Box
Step 3: In the Data Validation dialog box, under the Settings tab, define
the validation criteria:
Allow: Select the type of data. This data can be Whole Number, Decimal, List (only values
from a predefined list are allowed), Date, Time, Text Length (only text of a certain length is
allowed).
Data: Specify the condition (e.g., between, not between, equal to, not equal to, etc.).
Minimum/Maximum: Enter the acceptable range or limits based on the above selection. For
example, to allow values between 100 and 1000, select “Whole Number,” “between,” and
then set the minimum to 100 and the maximum to 1000.
Show Error Alert after Invalid Data is entered: Check this to enable error alerts.
Style: Choose from Stop, Warning, or Information to indicate the severity of the alert.
Title: Enter a title for the error message box.
Error Message: Type the message to be displayed. It must explain the error and suggest
ways to correct it.

❖ Identifying Outliers in Data


When analysing, visualizing, and interpreting data, outliers, if present, impact the accuracy, reliability, and usability of the data.
a) Review the Data: Errors can creep in data while entering or transferring data. So,
review the data to ensure there are no typos or other errors that create inaccuracies.
b) Sort the Data Values: We have already seen how data can be sorted in MS Excel.
c) Analyze Data Values: After sorting the values, identify large data discrepancies and
outliers to eliminate them.
d) Identify Data Quartiles: To find the outliers in the data, calculate quartiles using Excel’s QUARTILE function by typing “=QUARTILE(” in an empty cell. After the left parenthesis, specify the data range and the quartile number, for example =QUARTILE(A2:A100, 1) for the first quartile.
e) Define the Interquartile Range (IQR): IQR represents the expected average range
of the data (without outlier values). It is calculated by subtracting the first quartile
from the third quartile.
5|Page

f) Calculate the Upper and Lower Bounds: Defining the upper and lower bounds of the data allows identification of values that are higher than expected (above the upper bound) or lower than expected (below the lower bound).
Calculate the upper bound by multiplying the IQR by 1.5 and adding the result to the third quartile: “= Q3 + (1.5 * IQR)”. Similarly, the lower bound is “= Q1 - (1.5 * IQR)”.
g) Remove the Outliers: After defining the upper and lower bounds of data, review the
data to identify values that are higher than the upper bound or lower than the lower
bound.
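The whole IQR procedure above can be sketched in standard-library Python (the quartiles here play the role of Excel's QUARTILE results); the values, including the suspect entry 95, are invented:

```python
# IQR-based outlier detection: quartiles, bounds, then filter.
import statistics as st

values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]   # 95 is a suspect entry

q1, q2, q3 = st.quantiles(values, n=4)  # first, second, third quartiles
iqr = q3 - q1                           # interquartile range
lower = q1 - 1.5 * iqr                  # = Q1 - (1.5 * IQR)
upper = q3 + 1.5 * iqr                  # = Q3 + (1.5 * IQR)

outliers = [v for v in values if v < lower or v > upper]
print(outliers)  # [95]
```

Only 95 falls outside the computed bounds, so it is the single value flagged for review or removal.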

❖ Covariance
Covariance measures how two variables vary together. Excel’s COVARIANCE function takes two arguments:
Array1 is a range or array of numeric values.
Array2 is a second range or array of numeric values.
Note the following points:
• If the given arrays contain text or logical values, they are ignored by the COVARIANCE function in Excel.
• The data should contain numbers, names, arrays, or references that are numeric. If some cells do not contain numeric data, they are ignored.
• The data sets should be of the same size, with the same number of data points.
• The data sets should not be empty, nor should the standard deviation of their values be zero.
To find covariance in Excel and determine if there is any relation between the two columns C and D, we can write =COVARIANCE.P(C1:C10, D1:D10).
Mathematically, population covariance is calculated as:
Cov(X, Y) = Σ (xi − x̄)(yi − ȳ) / n
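The population covariance formula can be implemented directly with the standard library (this mirrors what Excel's COVARIANCE.P computes); the two small series are invented:

```python
# Population covariance: average product of deviations from the means.
from statistics import mean

x = [2, 4, 6, 8]
y = [1, 3, 5, 11]

n = len(x)
cov = sum((xi - mean(x)) * (yi - mean(y)) for xi, yi in zip(x, y)) / n
print(cov)  # 8.0 -> positive: x and y tend to move together
```

A positive result means the series rise and fall together; a negative one would mean they move in opposite directions.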

❖ Correlation Matrix

A correlation matrix is a table that displays the correlation coefficients for different variables.
A correlation matrix consists of rows and columns that show the correlation coefficient
between the variables. Correlation is a statistical measure that describes the extent to which
two or more variables are related to each other.
Positive Correlation: When values of two variables increase or decrease together,
they are said to be positively correlated. For example, height and weight are positively
correlated; as height increases, weight tends to increase as well.
Negative Correlation: When two values are negatively correlated, an increase in one
variable results in decline of the other. For example, speed and time are negatively
correlated. When speed increases it takes less time to reach the destination.
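A correlation matrix can be sketched with the standard library: the Pearson coefficient is the covariance divided by the product of the standard deviations. The three columns below are invented so that the positive and negative cases above both appear:

```python
# Build a small correlation matrix by computing Pearson correlation
# for every pair of columns.
from statistics import mean, pstdev

def pearson(a, b):
    n = len(a)
    cov = sum((x - mean(a)) * (y - mean(b)) for x, y in zip(a, b)) / n
    return cov / (pstdev(a) * pstdev(b))

height = [150, 160, 170, 180]
weight = [50, 60, 70, 80]      # rises with height: positive correlation
speed  = [80, 60, 40, 20]      # falls as height rises (illustrative only)

cols = {"height": height, "weight": weight, "speed": speed}
matrix = {r: {c: round(pearson(cols[r], cols[c]), 2) for c in cols}
          for r in cols}
print(matrix["height"]["weight"])  # 1.0
print(matrix["height"]["speed"])   # -1.0
```

The diagonal of such a matrix is always 1.0, since every variable is perfectly correlated with itself.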

❖ Moving Average

Moving average, also known as rolling average, running average, or moving mean, is defined as a series of averages of different subsets of the same data set.

To visualize the moving average on a chart by drawing a trendline follow the steps given
below:
Step 1: Click anywhere in the chart.
Step 2: On the Layout tab, in the Analysis group, select the trendline option.
Step 3: Click the desired option.
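The moving-average calculation itself can be sketched in a few lines of Python; the window size of 3 and the sales figures are assumptions for illustration:

```python
# Simple moving average: average each consecutive window of values.
def moving_average(series, window):
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

sales = [10, 12, 14, 16, 18, 20]
print(moving_average(sales, 3))  # [12.0, 14.0, 16.0, 18.0]
```

Each output value averages one window of three inputs, which is why the result is shorter than the input and smoother than the raw series.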

❖ Finding Missing Values

Excel does not have any particular function to list missing values. But it is important because
of the following reasons:
o Data Integrity which ensures that the dataset is complete.
o Data Reconciliation that facilitates the reconciliation process (mostly used in
finance).
o Quality Assurance to identify anomalies or data entry errors.
o Efficient Analysis to perform accurate data analysis by spotting and addressing gaps.

IF, ISNUMBER and MATCH Functions:


▪ IF: Returns one value if a condition is true and another if it’s false.
▪ ISNUMBER: Checks if a value is a number.
▪ MATCH: Searches for a value in a range and returns its relative position.
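The IF/ISNUMBER/MATCH pattern amounts to checking each expected value against the values actually present; a plain-Python equivalent, with invented ID lists:

```python
# Find which expected IDs are absent from the dataset
# (the role MATCH plays in the Excel formula).
expected = [101, 102, 103, 104, 105]
present  = [101, 103, 105]          # IDs actually found in the data

missing = [i for i in expected if i not in present]
print(missing)  # [102, 104]
```

In Excel the same check would be written per cell, with MATCH searching the data range and ISNUMBER testing whether the match succeeded.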

❖ Data Summarization
Data summarization in Excel can be done in multiple ways like:
Using Descriptive Statistics: For example, given a list of values in column A, we can use
Excel functions to summarize the values.
▪ SUM, AVERAGE, MEDIAN: Calculate the total, mean,
and median of a dataset.
Example: = SUM (A2:A100) sums all values in the range
A2 to A100.
Example: = AVERAGE (A2:A100) calculates the average.
▪ COUNT, COUNTA: Count the number of cells with
numbers (COUNT) or any data (COUNTA).
Example: =COUNT (A2:A100) counts numeric entries.

▪ STDEV.P, VAR.P: Calculate the standard deviation and variance of a dataset.


Example: =STDEV.P (A2:A100) for standard deviation.
▪ MIN, MAX: Find the smallest and largest values
Example: =MIN (A2:A100) and = MAX (A2:A100)

❖ Data Visualization
Data visualization helps users to transform raw data into meaningful visual stories that
enables them to spot trends in data and communicate complex information effectively.
Step 1: Organize the data in rows and columns within the Excel sheet. Every row and
column should be labelled clearly to identify the data to be visualized.
Step 2: Select the data by clicking and dragging mouse to highlight the data to be
visualized. In this selection, include the row and column headers.
Step 3: Choose a chart type by clicking on the “Insert” tab. In the “Charts” section,
select the required chart option (Column, Line, Pie, Bar, Area, Scatter, etc.) by
clicking on the dropdown arrow below the chart type.
Step 4: Insert the chart. Once the desired chart is selected, it is automatically created
and inserted in the worksheet.
Step 5: Customize the chart. For this, click on the chart to select it. Now, you would
be able to see two additional tabs: “Design” and “Format”.

❖ Types of Data Visualizations in Excel

1. Column Chart: It displays data using vertical bars. Each bar represents a category.

2. Bar Chart: It is similar to a column chart, but instead of vertical bars, it has horizontal
bars.

3. Line Chart: The line chart plots data points and then connects these points by lines.

4. Pie Chart: A pie chart plots data as slices of a circle. Size of each slice is proportional
to the value it represents.

5. Scatter Plot: A scatter plot displays data points on a Cartesian coordinate system, with
each axis representing a variable.

❖ Pivot Tables

Pivot tables are an important part of MS Excel that allows users to quickly summarize large
amounts of data, analyze numerical data in detail, and answer unanticipated questions about
the data.

Step 1: Click any single cell inside the data set.


Step 2: Click on the Insert tab, in the Tables group.
Step 3: Click on PivotTable.
Step 4: From the dialog box that appears, Excel automatically selects the data and the
default location set for a new pivot table is New Worksheet.
Step 5: Click on OK.
Step 6: Now, drag the fields.
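A pivot table essentially groups rows by one field and aggregates another. A minimal standard-library sketch, with invented sales rows:

```python
# What a pivot table computes: group rows by a field (region) and
# aggregate another (sum of sales).
from collections import defaultdict

rows = [
    {"region": "North", "sales": 100},
    {"region": "South", "sales": 150},
    {"region": "North", "sales": 200},
    {"region": "South", "sales": 50},
]

pivot = defaultdict(int)
for r in rows:
    pivot[r["region"]] += r["sales"]   # running sum per region

print(dict(pivot))  # {'North': 300, 'South': 200}
```

Dragging fields in Excel's PivotTable pane corresponds to choosing the grouping key and the aggregation here.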

❖ Pivot Chart
Pivot Chart is a dynamic visualization tool that helps users summarize and analyze large
datasets.
Step 1: Click any cell inside the pivot table.
Step 2: On the PivotTable Analyze tab, click on PivotChart in the Tools group.
Step 3: Click OK on the Insert Chart dialog box.

❖ Interactive Dashboard
✓ Step 1: Define the Purpose of the Dashboard.
✓ Step 2: Gather data in the form of a table and then convert this table into a pivot table. This is done by:
a) Selecting the table.
b) In the Insert Tab, click on Pivot Table.
c) Click on OK and the Pivot Table will be inserted in a new sheet.
✓ Step 3: Create Charts using the Pivot Table.
✓ Step 4: In the PivotTable Analyze group, click on PivotChart and select a suitable
chart from the chart drop-down.
✓ Step 5: Click on OK. The pivot chart will be created.

• Chart Title to change the title of the chart.
• Legend to enable, disable, or edit the legend.
• Axes to edit the horizontal axis and vertical axis of the chart.
✓ Step 6: Add Interactive Features to the dashboard design. For this, select any chart
and click on PivotChart Analyze.
✓ Step 7: Click on Insert Timeline. However, to insert a timeline to any pivot chart,
there must be a Date column in the data.
BUSINESS ANALYTICS

UNIT-3 Getting Started with R

 Introduction
In this chapter we will introduce you to a popular opensource programming language designed primarily for
statistical computing and data analysis i.e. R programming (referred as R henceforth). Suppose a retail
company, “ShopSmart,” that needs to analyse its daily sales, currently they use excel for basic data handling,
but it becomes challenging with the increasing size of data.

 Statistical software generally has very costly licenses, but R is completely free to use, which makes it
accessible to anyone interested in learning data analysis without needing to invest money.
 R is a versatile statistical platform providing a wide range of data analysis techniques, enabling virtually any type of data analytics to be performed efficiently, and offering state-of-the-art graphics capabilities for visualization.
 Data is mostly gathered from a variety of sources, and analysing it all in one place has its own challenges, which R helps to address.
 R is compatible with a broad range of platforms, including Windows, Unix, and macOS, making it
likely to run on almost any computer you use.
 The R community, which provides a wide level of support for R programmers, has developed thousands of packages, extending R’s capabilities into specialized areas such as quantmod for finance and ggplot2 for visualization, as well as support for machine learning algorithms.

 Installation
To begin with R, students need to install both R (the base programming language) and RStudio, which is an Integrated Development Environment (IDE) that makes working with R much easier.

Installation of R and RStudio

For R
 Step 1: Go to [CRAN (Comprehensive R Archive Network)] (https://cran.r-project.org/).
 Step 2: Choose your operating system (Windows, macOS, or Linux).
 Step 3: Download and run the installer.
 R Interface

For RStudio
 Visit [RStudio’s website] (https://www.rstudio.com/products/rstudio/download/).
 Choose the free version, “RStudio Desktop.”
 Follow the installation prompts.
 RStudio Interface

 Understanding RStudio IDE


The RStudio display is divided into various panes and tabs; these can be further customized as per your requirement.

Source Editor Pane: In RStudio IDE, you can access the source editor for R code.
Console Pane: This pane (as shown in 3 of Figure 3.1) contains the R interpreter, where R code is processed.
Environment Pane: This pane can be used to access the variables that are created in the current R
session.
Output Pane: This pane contains the Files, Plots, Packages, Help, Viewer, and Presentation tabs.
Packages: When you download and install R for the first time, you install the Base R software, which contains most of the functions that you will use frequently, like mean() and hist().

 Importing Data from Spreadsheet Files

Importing data from spreadsheets is quite common in business analytics because most business data is stored in formats such as Excel. Using R, you can easily import spreadsheet data into your workspace with packages like readxl and openxlsx.

 Commands and Syntax


The most basic program in any programming language is “Hello World”, so we will start with the
basic commands.
There are certain rules for valid variable names in R, as discussed below:
A variable name can include letters (a-z, A-Z), digits (0-9), the dot (.) and the underscore (_), but
cannot start with a digit.
R is case sensitive: var and Var are two different identifiers.
Reserved keywords in R cannot be used as variable names.
No special character other than the underscore and the dot is allowed.
Variable names starting with a dot are allowed, but the dot must not be followed by a digit. Starting
a name with a dot is not advised.
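A first console session illustrating these rules might look like the following sketch (the variable names are illustrative):

```r
# The classic first program
print("Hello World")    # [1] "Hello World"

# Valid variable names
total_sales <- 1500     # letters, digits, underscore
avg.price   <- 29.99    # dot is allowed
.hidden     <- TRUE     # leading dot: allowed, but not advised

# Invalid names (these would raise errors if uncommented):
# 2nd_value <- 5        # cannot start with a digit
# if <- 10              # 'if' is a reserved keyword
```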

 Keywords: These are an integral part of R’s syntax; keywords are used to implement various
functionalities in R.

 Data Type

Unlike C or C++, R does not require a variable to be declared with a data type. A variable's type is
determined dynamically from the value it is initialized with.

Data Types in R
 Operators

Operators are tools that help us perform operations on data, from basic calculations to more
advanced logical comparisons; operators tell R what action to take on the data.

 Arithmetic Operators are the simplest and most frequently used operators.

 Relational Operators are used to compare values and check conditions such as equality, greater
than, or less than. For instance, 5 > 3 checks whether 5 is greater than 3 and returns TRUE. Similarly,
5 == 3 checks for equality and returns FALSE.

 Logical Operators let you combine or modify logical values. You can use & to perform an AND
operation, where it is only true if both the conditions are satisfied.

Operator   Meaning                   Example         Result

&          AND (element-wise)        TRUE & FALSE    FALSE

&&         AND (single comparison)   TRUE && FALSE   FALSE

|          OR (element-wise)         TRUE | FALSE    TRUE

||         OR (single comparison)    TRUE || FALSE   TRUE

!          NOT (negation)            !TRUE           FALSE

Assignment Operators are used to store values in variables. The most commonly used operator is <-, which
assigns a value to a variable, as in x <- 10. You can also use = for assignment, but <- is preferred in R
because it is clearer and more consistent with the syntax of the language.
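A short session demonstrating the main operator groups (the values are illustrative):

```r
# Arithmetic
5 + 3     # 8
2 ^ 3     # 8 (exponentiation)
7 %% 3    # 1 (modulus)

# Relational
5 > 3     # TRUE
5 == 3    # FALSE

# Logical: element-wise vs single comparison
c(TRUE, TRUE) & c(TRUE, FALSE)   # TRUE FALSE
TRUE && FALSE                    # FALSE

# Assignment
x <- 10
x         # 10
```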
 Functions 

In R, user-defined functions enable you to create reusable blocks of code to perform specific tasks.
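A minimal user-defined function (the name and the simple-interest logic are illustrative):

```r
# Define a reusable function that computes simple interest
simple_interest <- function(principal, rate, years) {
  principal * rate * years / 100
}

simple_interest(1000, 5, 2)   # returns 100
```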

BUSINESS ANALYTICS
UNIT-4 Data Structures in R

❖ Vectors
A vector is one of the basic data structures in the R programming language. It stores multiple
values of the same type, also called mode. A vector is one-dimensional and can hold numeric,
character, logical or other values, but all values must have the same mode. Vectors are
fundamental to R; hence most operations are performed on vectors.

Types of Vectors

❖ Creating a Vector
You can create vectors using the c() function, which stands for combine (or concatenate).
Vectors are stored contiguously in memory, like arrays in C, so the size of a vector
is determined at creation time.
✓ Length: We can obtain the length of a vector using the length() function; for example,
length(c(10, 20, 30, 40, 50)) returns 5. This can be used to iterate over a vector in loops.
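For instance, creating a vector of marks and checking its length:

```r
# Create a numeric vector with c() and inspect its length
marks <- c(78, 85, 62, 90, 71)
length(marks)   # [1] 5
```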

✓ Indexing and Subsetting: We can use indexing to refer to a particular element of a
vector, and we can also extract subsets using index ranges. Note that vector indices start at
1 rather than 0, and a subset range is inclusive at both ends.
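Indexing and subsetting in practice:

```r
v <- c(10, 20, 30, 40, 50)
v[1]        # 10  (indexing starts at 1, not 0)
v[2:4]      # 20 30 40  (the range 2:4 is inclusive at both ends)
v[c(1, 5)]  # 10 50  (pick arbitrary positions)
```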

✓ Filtering: You can also filter a vector with a logical expression that returns TRUE or FALSE
for each element; the output consists of the elements where the expression is TRUE.
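Filtering in practice:

```r
v <- c(10, 20, 30, 40, 50)
v > 25      # FALSE FALSE  TRUE  TRUE  TRUE
v[v > 25]   # 30 40 50  (only the TRUE positions are kept)
```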

✓ Element-wise Operations: We can apply simple operations to all the elements of a
vector at once.

✓ Vectorized Functions: R offers many built-in functions that operate on a vector as a
whole (rather than element-wise) and return an aggregate result, such as sum() or mean().

✓ Combining and Modifying Vectors: Apart from applying operations to a single
vector, we can also apply functions to two or more vectors.
3|Page

Note: When applying an operation to two vectors, the operation normally requires both vectors
to have the same length. In case of a length mismatch, R automatically recycles (repeats)
the shorter one.
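Recycling in action (R warns if the longer length is not a multiple of the shorter):

```r
c(1, 2, 3, 4) + c(10, 20)   # 11 22 13 24  (the shorter vector is repeated)
```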
• Miscellaneous Functions: Certain functions, shown in the table below, can be used
with vectors as required.

❖ Matrices
Matrices are actually a special case of a broader concept in R called arrays. While matrices
have just two dimensions (rows and columns), arrays can go further and have multiple
dimensions. For instance, a three-dimensional array has rows, columns, and layers, adding an
extra level of organization to your data.

Creation: Matrices are generally created with the matrix() function; data in a matrix is
stored in column-major order by default. The ‘nrow’ parameter specifies the number of rows, and ‘ncol’
the number of columns.
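Creating and indexing a matrix:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column
m
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m[2, 3]   # 6  (row 2, column 3)
```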

❖ Arrays
An array in R is a data structure that can store data in more than one dimension, hence
in R arrays are an extension of matrix. While a matrix is constrained to two
dimensions, with rows and columns, an array, however, can take three or more
dimensions.
Array can be created using array() function with arguments data, dimensions and
dimension names.
Array elements can be accessed in same manner as vector or matrices. We can also
name the dimensions.
We can reshape arrays dimension.
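Creating, indexing, and reshaping an array:

```r
a <- array(1:24, dim = c(2, 3, 4))   # 2 rows, 3 columns, 4 layers
dim(a)       # 2 3 4
a[1, 2, 3]   # 15  (row 1, column 2, layer 3)
dim(a) <- c(4, 6)                    # reshape into a 4 x 6 matrix
```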

❖ Lists
In R, a list is an amazingly flexible data structure, meaning it can store any kind of data
together - numbers, characters, vectors, matrices, and even other lists.

You create a list by using the “list()” function, and any of the elements in the list
are accessed using double square brackets “[[ ]]”. So for instance, “list (42,
“Hello”, c(1, 2, 3))” generates a list that has an integer, a string, and a vector.

We can index, subset, or access elements of a list.


We can find the size of a list using length(), and we can also add or delete elements.
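A short list session (the element names are illustrative):

```r
person <- list(name = "Asha", scores = c(80, 92), active = TRUE)
person[[1]]      # "Asha"  (access by position)
person$scores    # 80 92   (access by name)
length(person)   # 3

person$city   <- "Delhi"   # add an element
person$active <- NULL      # delete an element
length(person)   # 3  (one added, one removed)
```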

❖ Factors
Factors are another type of R objects that are created by using a vector, it stores the vector as
well as a record of distinct values in that vector called level. Factors are majorly used for
nominal or categorical data.
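Creating a factor and inspecting its levels:

```r
sizes <- factor(c("small", "large", "small", "medium"))
levels(sizes)   # "large"  "medium" "small"  (distinct values, sorted)
table(sizes)    # count of observations per level
```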

❖ Data Frames
A data frame is a two-dimensional, tabular data structure commonly used for storing and
manipulating data. It is very similar to table or spreadsheet, where each column can store data
of various types-numeric, character, logical-and each row is an observation or record.
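A small data frame (the column names and values are illustrative):

```r
df <- data.frame(
  name       = c("Asha", "Ravi", "Meera"),
  sales      = c(120, 95, 140),
  met_target = c(TRUE, FALSE, TRUE)
)
str(df)                # structure: 3 observations of 3 variables
df$sales               # 120 95 140
df[df$sales > 100, ]   # rows where sales exceed 100
```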

❖ Conditionals and Control Flows


Decision making refers to the process of choosing amongst several alternative actions or
courses of action based on certain conditions or criteria. It allows programs to make choices
based on logical evaluations.

• There are three decision-making constructs in R programming: if, if…else, and switch.
The if statement is the simplest form of decision making: it evaluates a
condition, and if that condition is TRUE the code block inside the if is
executed; otherwise, the code block is skipped.
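The if…else and switch constructs in practice:

```r
x <- 7
if (x %% 2 == 0) {
  print("even")
} else {
  print("odd")    # printed, since 7 %% 2 == 1
}

# switch() chooses a branch by name
day_type <- switch("sat", mon = "weekday", sat = "weekend")
day_type   # "weekend"
```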

❖ Loops
Like any other programming language, we have loops in R too. They are basic constructs
allowing a block of code to be executed repeatedly. R implements several kinds of loops: for,
while, and repeat. Each loop type is suited for different tasks, depending on the kind of
control flow needed.
▪ For Loop: It is used to iterate over a sequence of elements (anything iterable), such
as a vector, list, or sequence, using a loop control variable.
▪ Like the for loop, the while loop repeatedly executes a block of code as long as the
condition remains TRUE; but here the loop control variable must be initialized
outside the loop.
▪ The third type of iterative statement, the repeat loop, loops indefinitely until explicitly
stopped with a break statement.
▪ We can also have nested loops for complex operations where iterations are needed at
various levels. For example, if you want to print columns for each row, nested code.

▪ The next and break statements can be used to control a loop: next skips the current
iteration and moves to the next one, while break terminates the loop entirely, as seen in
the repeat loop.
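The three loop types, with next and break:

```r
# for loop over a sequence
for (i in 1:3) print(i * 10)   # 10 20 30

# while loop: control variable initialized outside the loop
n <- 1
while (n <= 3) { n <- n + 1 }

# repeat with next and break
i <- 0
repeat {
  i <- i + 1
  if (i == 2) next    # skip the iteration where i is 2
  if (i > 4) break    # stop entirely once i exceeds 4
  print(i)            # prints 1, 3, 4
}
```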

❖ Apply Family
The apply family in R includes functions like apply, lapply, sapply, vapply, tapply, mapply,
and rapply. It is a very useful and powerful feature of R. These functions provide alternatives
to loops for applying functions across various data structures like vectors, matrices, arrays,
lists, factors, and data frames.
apply() operates on the margins of a matrix or array: it applies a given
function along the rows or columns of a matrix or higher-dimensional array.
lapply() applies a function to each element of a list and returns a list.
sapply() works like lapply() but attempts to simplify the output into a
vector or matrix when possible.
vapply() is also like lapply() and sapply(), but it lets you specify the expected output
type for better reliability.
tapply() applies a function to subsets of a vector, defined by a factor or a list of
factors. It takes three input parameters: the data vector, the factor(s) to group by, and
the function to apply.
mapply() can be used to apply a function to multiple arguments (a multivariate,
vectorized version of sapply()).
If you want to recursively apply a function to the elements of a list you can use rapply();
it can also handle nested lists.
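The main apply-family functions side by side:

```r
m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)   # row sums: 9 12
apply(m, 2, sum)   # column sums: 3 7 11

lst <- list(a = 1:3, b = 4:6)
lapply(lst, mean)   # returns a list: $a 2, $b 5
sapply(lst, mean)   # simplified to a named vector: 2 5

scores <- c(80, 90, 70, 60)
group  <- factor(c("x", "x", "y", "y"))
tapply(scores, group, mean)   # x = 85, y = 65

mapply(function(a, b) a + b, 1:3, 4:6)   # 5 7 9
```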

Apply Family Functions



BUSINESS ANALYTICS
UNIT-5 Descriptive Statistics Using R

❖ Introduction
Data analysis is an important skill in today’s data-driven world, allowing people and
organizations to extract meaningful insights from raw data.

❖ Importing Data File


• To import data from a CSV file we use the read.csv() function. The syntax is
read.csv(file, header, sep), where file specifies the location of the file, header
specifies whether the first row contains column names (TRUE/FALSE), and sep
provides the delimiter (like “,” for CSV).
• To import other formats (such as Excel or JSON), we need to load the desired package
from the library.
• R can also interact with databases using packages like DBI and RMySQL.
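A sketch of the common import calls (the file names are illustrative, and the readxl and jsonlite packages are assumed to be installed):

```r
# CSV with base R
sales <- read.csv("sales.csv", header = TRUE, sep = ",")

# Excel, via the readxl package
library(readxl)
budget <- read_excel("budget.xlsx", sheet = 1)

# JSON, via the jsonlite package
library(jsonlite)
orders <- fromJSON("orders.json")
```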

❖ Data Visualisation Using Charts


R provides a rich ecosystem of libraries (such as ggplot2, plotly, lattice, and cowplot), each offering
unique capabilities for creating a variety of charts, plots, and interactive visualizations.

Columns of mpg Dataset

a) Histograms: A histogram visualizes the distribution of a single continuous
variable by binning, i.e. dividing its values into intervals (bins). It is useful for
identifying patterns such as skewness, spread, or unusual gaps.
b) Bar Chart: Bar charts represent categorical data; they are ideal for
comparing discrete groups, as they can show counts or proportions for each category.
c) Box Plot: A box plot summarizes the distribution of a continuous variable by displaying the
median, quartiles, and potential outliers. It is useful when we need to compare distributions
across multiple groups.
d) Line Graphs: Line graphs are recommended when we want to analyse trends over a
continuous variable or to observe relationships.
e) Scatter Plots: When we need to visualize the relationship between two continuous
variables we can use scatter plots; they are an ideal choice for identifying trends,
clusters, or correlations.
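The five chart types above can be sketched with ggplot2, using its built-in mpg and economics datasets (assuming ggplot2 is installed):

```r
library(ggplot2)   # provides the mpg and economics datasets used below

# a) Histogram of highway mileage
ggplot(mpg, aes(x = hwy)) + geom_histogram(bins = 20)

# b) Bar chart of counts per vehicle class
ggplot(mpg, aes(x = class)) + geom_bar()

# c) Box plot of highway mileage by class
ggplot(mpg, aes(x = class, y = hwy)) + geom_boxplot()

# d) Line graph of US unemployment over time
ggplot(economics, aes(x = date, y = unemploy)) + geom_line()

# e) Scatter plot: engine displacement vs highway mileage
ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
```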

❖ Measure of Central Tendency


I. Mean: The arithmetic average, or mean, is the sum of all observations divided
by their number:
Formula:
Mean (x̄) = (x₁ + x₂ + ... + xₙ) / n
II. Median: The median is the middle value in a sorted dataset. It is the
central value if the dataset has an odd number of observations, and the
average of the two central values if the number of observations is even.
Compared to the mean, the median is less affected by outliers.

III. Mode: The mode is the value that appears most frequently in a
dataset. A dataset can be unimodal (one mode), multimodal (more
than one mode), or have no mode at all if no value repeats.
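Computing all three in R (note that base R has no built-in statistical mode, so a small helper function is sketched):

```r
x <- c(2, 4, 4, 7, 9)
mean(x)     # 5.2
median(x)   # 4

# Helper: value(s) with the highest frequency
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
stat_mode(x)   # 4
```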

❖ Measure of Dispersion
a. Range: The simplest measure of dispersion is the range, which is the difference between
the maximum and minimum values in a dataset. Although the range is easy to calculate,
it is sensitive to outliers.
Formula:
Range = Maximum Value – Minimum Value

b. Variance: Variance measures deviation from the mean, i.e. how far each data point is
from the mean on average. A higher variance indicates greater variability in the data.
Variance is expressed in squared units, which makes it harder to interpret directly. The
formula for the sample variance (the version computed by R) is:
Formula:
s² = Σ(xᵢ – x̄)² / (n – 1)

c. Standard Deviation: Since variance is difficult to interpret directly, we often use the
standard deviation, which is the square root of the variance and therefore measures
dispersion in the same units as the original data. A smaller standard deviation
means the data points are closer to the mean.
Formula:
s = √(s²)

d. Interquartile Range (IQR): The IQR measures the spread of the middle 50% of the data.
For better understanding, imagine lining up all your data from smallest to largest (sorted):
the IQR focuses on the middle half of those numbers, ignoring the lowest 25% and the
highest 25% of values. The formula is shown below:
Formula:
IQR = Q3 – Q1
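All four measures of dispersion in R:

```r
x <- c(2, 4, 4, 7, 9, 10)
max(x) - min(x)   # range: 8
var(x)            # sample variance: 10
sd(x)             # standard deviation: sqrt(10), about 3.16
IQR(x)            # interquartile range: 4.5
```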

❖ Relationship between Variables


Covariance is a statistical measure that indicates how two variables change together: it shows
whether an increase in one variable is accompanied by an increase in the other, or whether they
move inversely. The formula for the sample covariance is:
Cov(X, Y) = Σ(xᵢ – x̄)(yᵢ – ȳ) / (n – 1)
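Covariance and the related (scale-free) correlation in R:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)
cov(x, y)   # 5  (positive: x and y rise together)
cor(x, y)   # 1  (perfect positive linear relationship)
```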

BUSINESS ANALYTICS
UNIT-6 Predictive and Textual Analytics

❖ Introduction
Predictive analytics is changing industries because it facilitates data-driven decision-making
and operational efficiency. Textual analysis is the systematic examination and
interpretation of textual data in order to draw meaningful insights, patterns, and trends.

❖ Simple Linear Regression Models


Simple linear regression is a statistical learning method used to
examine or predict the quantitative relationship between two continuous variables: an
independent variable called the predictor (X) and a dependent variable called the response (Y).
This method models the linear relationship between the variables and makes
predictions, assuming that the relationship between X and Y is approximately linear.
Mathematically, we can write this linear relationship as:
y = β₀ + β₁x + ε
where β₀ is the intercept, β₁ the slope, and ε the error term.

The very first step is to prepare the data: ensure that the dataset is clean and
contains no missing values for the variables involved. Load the data into R using
functions like read.csv() or read.table().
Once the data is loaded, visualize it using a scatter plot.
After that, fit the regression model using the lm() function, and then use the
summary() function to understand the details of the model.
We can also make predictions using the predict() function.
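The steps above can be sketched with the built-in mtcars dataset:

```r
# Fit mpg (miles per gallon) as a linear function of car weight
plot(mtcars$wt, mtcars$mpg)          # scatter plot first
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)                         # coefficients, R-squared, p-values

# Predict mpg for a car weighing 3000 lbs (wt is in 1000s of lbs)
predict(fit, newdata = data.frame(wt = 3))
```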

❖ Confidence and Prediction Interval


In predictive analysis, confidence intervals and prediction intervals are two critical tools used
to quantify the uncertainty surrounding statistical estimates or predictions.
Both give a quantitative indication of the range in which the true value is expected to
lie, yet they have different uses: a confidence interval bounds the mean response, while a
prediction interval bounds a single new observation and is therefore always wider.
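Both intervals can be obtained from predict() on a fitted model (mtcars used for illustration):

```r
fit <- lm(mpg ~ wt, data = mtcars)
new <- data.frame(wt = 3)

# Confidence interval: uncertainty about the MEAN mpg at wt = 3
predict(fit, new, interval = "confidence", level = 0.95)

# Prediction interval: range for a single NEW car (always wider)
predict(fit, new, interval = "prediction", level = 0.95)
```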

❖ Multiple Linear Regression


Multiple Linear Regression (MLR) is an extension of simple linear regression that
models the relationship between two or more independent variables and a dependent variable.
The mathematical equation for MLR is:
y = β₀ + β₁x₁ + β₂x₂ + ... + βₖxₖ + ε
where y is the target variable, β₀ is the intercept term representing the value of y when all
independent variables are 0, x₁, x₂, ..., xₖ are the independent variables with coefficients
β₁, β₂, ..., βₖ, and ε is the error term.
For an MLR model to give valid results, a few assumptions must hold:
• The relationship between the independent and dependent variables must be linear.
• Observations must be independent.
• The variance of the errors must be constant across the ranges of the independent
variables, i.e. homoscedasticity.
• Residuals (error terms) must be normally distributed.
• Independent variables should not be strongly correlated with each other, i.e. no
multicollinearity.
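An MLR fit with two predictors, again using mtcars for illustration:

```r
# Predict mpg from both weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)   # each coefficient: the effect of that predictor,
               # holding the other predictor constant
```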

❖ Interpretation of Regression Coefficients


When working with a regression model, the coefficients hold very important meaning: they
determine how each independent variable moves the dependent variable. In simple
terms, they describe the relationship of the predictor variables to the outcome we are
trying to predict. Interpretation in multiple linear regression is more complex, but the
same idea still applies:
y = β₀ + β₁x₁ + β₂x₂ + β₃x₃

❖ Heteroscedasticity and Multi-Collinearity


Heteroscedasticity is the case where the variance of the errors (the
differences between observed and predicted values) changes with the levels of the
independent variables. In simple words, as the value of the
independent variable changes, the spread or dispersion of the residuals also varies.
Consider predicting house prices from square footage: heteroscedasticity
occurs if the error variance for larger houses is greater than for smaller
houses. Heteroscedasticity affects the standard errors of the coefficients, biasing test
statistics.

Multicollinearity occurs when two or more independent variables in a regression


model are highly correlated with each other. In simple terms, the independent
variables are stepping on each other’s toes and giving redundant information, making
it impossible to separate the individual effect of each variable on the
dependent variable. For example, if we attempt to forecast a person’s income from
education level and years of work experience, these two predictors tend to be
correlated, so it is hard to attribute income differences to either one individually.
Both heteroscedasticity and multicollinearity can reduce the effectiveness of a
regression model. Heteroscedasticity interferes with the estimation of the standard
errors of the coefficients, which may lead to misleading significance tests, while
multicollinearity prevents the assessment of the effects of each predictor. Knowledge
of these issues and their detection and resolution is crucial in developing more reliable
and interpretable regression models.
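A common detection sketch uses the Breusch-Pagan test and variance inflation factors, assuming the lmtest and car packages are installed:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Heteroscedasticity: Breusch-Pagan test (lmtest package)
library(lmtest)
bptest(fit)    # a small p-value suggests heteroscedasticity

# Multicollinearity: variance inflation factors (car package)
library(car)
vif(fit)       # values well above 5-10 signal problematic collinearity
```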

❖ Basics of Textual Data Analysis


Textual analysis includes information extraction from text, such as emails, reviews, or posts
on social media. It is commonly applied in tasks such as sentiment analysis, keyword
extraction, and text classification. Basic steps of text analysis are:
▪ Text Preprocessing, i.e. cleaning and preparing the text for analysis, for
example converting to lowercase and removing stop words (such as “the” and “is”),
punctuation, and special characters.
▪ Tokenization means dividing the text into smaller tokens or units such as words,
phrases, etc.
▪ Text Representation means converting the text into a form ready to be analysed, for
instance word frequency counts or a bag-of-words representation.
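These three steps can be sketched with base R alone (the sentence is illustrative):

```r
txt <- "R makes Text Analysis EASY, doesn't it?"

# 1. Preprocessing: lowercase, strip punctuation
clean <- tolower(gsub("[[:punct:]]", "", txt))

# 2. Tokenization: split on whitespace
tokens <- strsplit(clean, "\\s+")[[1]]
tokens   # "r" "makes" "text" "analysis" "easy" "doesnt" "it"

# 3. Representation: word frequency counts (bag-of-words)
table(tokens)
```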

❖ Methods and Techniques of Textual Analysis


There are three methods of textual analysis: text mining, categorization, and sentiment
analysis.
Text Mining
The process of extracting useful information and knowledge from unstructured text is called text
mining.
Categorization
It refers to the process of assigning text into predefined categories or labels based on the
content. This method is applied in many applications, including email filtering (spam vs. non-
spam), document classification (business, sports, tech), and sentiment analysis (positive,
negative, neutral).
❖ Sentiment Analysis
Sentiment analysis is the process of determining the emotional tone or sentiment behind a
piece of text.
✓ Lexicon-based Methods: These use pre-defined dictionaries of words with positive,
negative, or neutral sentiments.
✓ Machine Learning-based Approaches: These involve training a model on labelled
text data, wherein the sentiment is known in advance, and then applying that model to
classify new text.
