Probability Statistics Report

Uploaded by

Nguyễn Duy

VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY

UNIVERSITY OF TECHNOLOGY
FACULTY OF APPLIED SCIENCE

PROBABILITY AND STATISTICS (MT2013)

Assignment Report - Group 10

Analyzing some technical specifications affecting Memory Bandwidth of GPUs

Advisor: Ph.D Nguyen Tien Dung


Students: Pham Thai An - 2252011 (CC02, Leader)
Nguyen Viet Hung - 2252272 (CC02)
Khuu Vinh Kien - 2211720 (CC02)

HO CHI MINH CITY, MARCH 2024


University of Technology, Ho Chi Minh City
Faculty of Applied Science

Contents
1 Member List & Workload

2 Data and Code Availability
2.1 Dataset
2.2 Code R

3 Data Overview

4 Theory
4.1 Sampling Theory
4.1.1 Some Definitions
4.1.2 The Characteristics of A Random Sample
4.2 Exploratory data analysis (EDA)
4.3 Multiple Linear Regression (MLR)
4.4 Statistical Hypothesis Testing

5 Data Preprocessing
5.1 Data Importing: "All_GPUs.csv"
5.2 Data Cleaning
5.2.1 Extracting data
5.2.2 Handling variable format
5.2.3 Accounting for missing data
5.2.4 Handling missing data

6 Descriptive Statistics
6.1 Calculating Characteristic Values
6.2 Plotting Histogram for the Distribution of Memory Bandwidth
6.3 Plotting Box Plots for the Distribution of Memory Bandwidth by Qualitative Variables
6.4 Plotting Scatter Plot for the Linear Relationship between Memory_Bandwidth and Quantitative Variables
6.5 Plotting a Correlation Matrix to Check for Multicollinearity between Independent Variables

7 Inferential Statistics
7.1 Confidence Interval
7.2 Multiple Linear Regression Model
7.2.1 Model Building
7.2.2 Homoscedasticity Checking
7.2.3 Model Prediction

8 Discussion and Extension
8.1 Advantages
8.2 Disadvantages

Assignment for Probability and Statistics - Academic Year 2023 - 2024 Page 1/33

1 Member List & Workload

No.  Full name         Student ID  Percentage of work  Workload
1    Pham Thai An      2252011     33.33%              Data Overview; Multiple Linear Regression (Theory); Data Preprocessing; Descriptive Statistics; Inferential Statistics
2    Nguyen Viet Hung  2252272     33.33%              Statistical Hypothesis Testing (Theory); Data Preprocessing; Descriptive Statistics; Discussion and Extension
3    Khuu Vinh Kien    2211720     33.33%              The Theoretical Basis of Sampling (Theory); Descriptive Statistics; R Script File; Discussion and Extension

2 Data and Code Availability


2.1 Dataset
[Link]

2.2 Code R


3 Data Overview
Today, users' computing needs go well beyond calculation and running simple programs.
Manufacturers are always trying to create systems that can handle a variety of tasks, such as
research and design. Among these, graphics is a field of particular interest.
The CPU's architecture is not designed to process such tasks efficiently; therefore, the
Graphics Processing Unit (GPU) was created and has become one of the most important components
of most modern computer systems.

The GPU is a processor designed to accelerate the creation and rendering of images and
videos. In graphics workloads, the GPU processes large volumes of visual data and turns
them into rendered images. With the increasing demand for high-resolution, realistic visual
experiences in applications ranging from gaming to professional graphic design and virtual reality,
the role of the GPU has become ever more central in both creating and rendering digital graphics.

To summarize, GPUs play a vital role in modern computing. Memory Bandwidth is one of
the key characteristics of a strong GPU. Therefore, the goal of this project is to analyze some
technical specifications affecting the Memory Bandwidth of GPUs. The dataset used in this
project is the file "All_GPUs.csv", originally available in the Computer Parts
(CPUs and GPUs) Dataset (Kaggle) by author Ilissek, which contains information about various
technical specifications, release dates, and launch prices of GPUs. We use the dataset to
get a general overview of the factors that influence GPUs.

The dataset includes 34 variables and 3406 observations. However, as mentioned above,
this analysis focuses on the technical specifications that affect Memory Bandwidth. The
data mainly pertain to Intel, AMD, and related companies involved in producing
these components. The main variables selected are:
• Manufacturer: The company that manufactured the GPU, such as Nvidia or AMD.
• Dedicated: Indicates whether the GPU is a dedicated (discrete) card, as opposed to an
integrated GPU.

• Core Speed (MHz): The base clock speed of the GPU operating at standard performance.
• Memory (MB): The memory capacity of the GPU represents its ability to store data.
• Memory Bus (Bit): The width of the data transfer pathway between the GPU and its memory,
which determines the data transmission capability.

• Memory Speed (MHz): The speed at which the GPU's memory operates.
• OpenGL: Support for important graphics APIs to run new applications and games.
• Pixel Rate (GPixel/s) and Texture Rate (GTexel/s): Measured to evaluate the
ability to process pixels and textures per second.

• TMUs (Texture Mapping Unit): These units are responsible for applying texture maps
(images) to 3D models during rendering.
• Memory Bandwidth (GB/sec): The ability to transfer data between the GPU and
memory.


4 Theory
4.1 Sampling Theory
4.1.1 Some Definitions
• The sample is a number of units selected from the population according to a certain
sampling method. The sample characteristics are used to infer the characteristics of the
overall population.
• Primary data is data that is collected directly from the research object according to the
requirements of the researcher.
• Secondary data is data from existing sources, which has usually been processed. It helps
the researcher save time, effort, and cost compared to collecting primary data. However, it
should be noted that this data may not always meet the more detailed requirements of the
research.

4.1.2 The Characteristics of A Random Sample


• The sample mean: Consider a random sample $(X_1, ..., X_n)$ of the random variable $X$.
The sample statistic $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is called the sample mean. For a specific sample
$(x_1, ..., x_n)$, the value $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the value that the sample mean takes on for the
given sample.
• Sample variance: Similar to the sample mean, the sample variance is defined as the
average of the squared deviations of the sample elements from the sample mean,
and is denoted as:

$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$ and $\hat{S}^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2$,

where $\hat{S}^2$ is called the uncorrected sample variance and $S^2$ is called the corrected
sample variance. Their square roots, $\hat{S}$ and $S$, are the uncorrected and corrected sample
standard deviations, respectively.
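• Sample proportion: $F = \frac{M}{n}$, with observed value $f \equiv p \equiv \frac{m}{n}$, where $M$ is the number of
sample elements having the characteristic of interest and $m$ is its observed value.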

• The median: Assume a sample of size n is arranged in increasing order of the values
being observed: $x_1 \le x_2 \le ... \le x_{n-1} \le x_n$.
If n = 2k + 1, then the sample median is $x_{k+1}$.
If n = 2k, then the sample median is $\frac{x_k + x_{k+1}}{2}$.
• Quartiles: The median divides the ordered data sample into 2 sets of equal size. The
median of the lower set of data is called the first quartile, Q1 (the lower quartile). The
median of the upper set of data is called the third quartile, Q3 (the upper quartile). The
second quartile, Q2 , is taken to be the median value.
• Outlier points: Also called anomalous points, aberrant points, or outliers. These are the
elements of the sample whose values lie outside the range $(Q_1 - 1.5\,IQR;\ Q_3 + 1.5\,IQR)$,
where $IQR = Q_3 - Q_1$ is the interquartile range.
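
The characteristics above can be computed directly from their definitions. The following is a minimal sketch on a small made-up sample (the vector x is hypothetical and not taken from the GPU dataset). Note that R's built-in quantile() uses a different default rule for quartiles, so its results may differ slightly from the "median of each half" definition used here.

```r
# Hypothetical sample; the values are made up for illustration only.
x <- c(12, 15, 9, 21, 14, 100, 13, 16, 11, 18)
n <- length(x)

x_bar  <- sum(x) / n                     # sample mean
s2     <- sum((x - x_bar)^2) / (n - 1)   # corrected sample variance
s2_hat <- sum((x - x_bar)^2) / n         # uncorrected sample variance

# Median from the ordered sample, following the two cases (odd / even n)
xs  <- sort(x)
k   <- n %/% 2
med <- if (n %% 2 == 1) xs[k + 1] else (xs[k] + xs[k + 1]) / 2

# Quartiles as the medians of the lower and upper halves of the ordered sample
q1  <- median(xs[1:k])
q3  <- median(xs[(n - k + 1):n])
iqr <- q3 - q1

# Outliers: values outside (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)
outliers <- x[x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr]
outliers
```

In this sample the single very large value is the only point flagged as an outlier, and it also pulls the mean well above the median.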


4.2 Exploratory data analysis (EDA)


Exploratory data analysis (EDA) is a method for drawing insights from data, often
utilizing data visualization and statistical graphics to reveal relationships between variables,
identify patterns and trends, and detect outliers. It is crucial for extracting important features
for predictive models. By plotting the raw data, we can gain an understanding of the general
behavior and distribution of the variables:
• Histograms are used to visualize the distribution of a numerical variable.
• Box plots are used to display the distributions of numerical data values, particularly when
comparing them across several groups.

• Scatter plots are used to examine the relationship between two numerical variables
and to reveal clusters in the data.
• Correlation Matrix: The correlation coefficients between variables are displayed in a
table called a correlation matrix. The association between two variables is displayed in
each cell of the table.
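
As a small illustration of the correlation matrix, the sketch below builds one with cor() on simulated data. The data frame df and its columns a, b, c are hypothetical; b is constructed to depend on a, while c is independent noise.

```r
set.seed(1)
# Simulated (hypothetical) data, not the GPU dataset
df   <- data.frame(a = rnorm(50))
df$b <- 2 * df$a + rnorm(50, sd = 0.5)   # strongly related to a
df$c <- rnorm(50)                        # unrelated noise

# Each cell holds the Pearson correlation between the row and column variables
cm <- cor(df)
round(cm, 2)
```

The matrix is symmetric with ones on the diagonal, and the cell for the pair (a, b) is close to 1 because b was built from a.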

4.3 Multiple Linear Regression (MLR)


In statistical analysis, the data rarely consist of just one response variable and one explanatory
variable. The multiple regression method measures the association between a continuous dependent
variable and two or more independent variables. When this association is modeled as a linear
function of the independent variables, the method is called multiple linear regression. After fitting
the model to a dataset, we use it to predict the behavior of the response variable from its predictors.

Three assumptions need to be met when fitting an MLR model:

• Normality: the residuals (the differences between the observed values and the predicted
values) follow a normal distribution.
• No multicollinearity: The independent variables in the regression model are not highly
correlated with each other. Multicollinearity occurs when two or more independent vari-
ables are strongly correlated, making it difficult to determine their individual effects on
the dependent variable.
• Linearity: The relationship between the independent variables and the dependent variable
should be linear. This means that the expected value of the dependent variable changes in
a straight line as the independent variables change, holding other variables constant.
The general equation of MLR:

Y = β0 + β1 · X1 + β2 · X2 + ... + βn · Xn + ε
where:
• Y is the dependent variable.
• Xi is the ith independent variable.
• β0 is the intercept, i.e. the expected value of Y when all Xi are zero.
• βi is the coefficient of each Xi .
• ε is the random error term of the model.
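
To make the equation concrete, here is a minimal sketch that simulates data from a model with two predictors and recovers the coefficients with lm(). The variable names and the "true" coefficients (1, 2, -3) are assumptions chosen for illustration.

```r
set.seed(42)
n  <- 200
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 5)

# Assumed true model: beta0 = 1, beta1 = 2, beta2 = -3, plus random error
y <- 1 + 2 * x1 - 3 * x2 + rnorm(n, sd = 0.5)

# Least-squares fit of Y on X1 and X2
fit <- lm(y ~ x1 + x2)
coef(fit)
```

The estimated coefficients should be close to the assumed values, with the remaining difference due to the random error term.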


Model performance metrics:


• R-squared ($R^2$): the squared correlation between the observed outcome values and the
values predicted by the model. It measures the proportion of the variance in the
outcome that is explained by the predictors. The higher the value, the better the model fits.
• Root mean square error (RMSE): the square root of the mean squared difference between the
observed and predicted values, i.e. the typical error the model makes when predicting an
observation. The lower the RMSE, the better the model.
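
Both metrics can be computed by hand from the fitted values. The sketch below uses simulated data (the variables x and y are hypothetical) and checks that the squared correlation agrees with the usual "1 minus residual over total sum of squares" form of R-squared for a least-squares fit with an intercept.

```r
set.seed(7)
x <- runif(100, 0, 10)
y <- 3 + 1.5 * x + rnorm(100)   # simulated data, for illustration only

fit   <- lm(y ~ x)
y_hat <- fitted(fit)

r2_cor <- cor(y, y_hat)^2                                   # squared correlation
r2_ss  <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)     # 1 - SSE / SST
rmse   <- sqrt(mean((y - y_hat)^2))                         # root mean square error

c(r2_cor = r2_cor, r2_ss = r2_ss, rmse = rmse)
```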

4.4 Statistical Hypothesis Testing


Law of small probabilities: If an event has a very small probability of occurring, it can be
considered not to occur in a single related trial.
Method of contradiction: If assuming that a hypothesis is true leads to a contradiction, we reject
the hypothesis (and accept the alternative).

• General principles of statistical hypothesis testing:


– Standard statistical hypothesis testing criterion: From the population with distribution
X, a random sample $X_1, ..., X_n$ is selected, and a statistic $T = T(X_1, ..., X_n)$ is formed,
which may depend on known parameters in the null hypothesis $H_0$. If the null hypothesis
$H_0$ is true, the distribution law of T must be completely determined. Such a statistic
is called a testing standard (test statistic).

– Testing rule: The range of values of the testing standard T is divided into two parts,
$R_\alpha$ and $\bar{R}_\alpha$, where $R_\alpha$ is the rejection region and $\bar{R}_\alpha$ is the acceptance
region of $H_0$. We reject $H_0$ if the observed value of T falls in $R_\alpha$.

– Type I and Type II errors: With the testing rule as above, two types of errors can be
made:
∗ (i) Type I error: rejecting a true hypothesis. The probability of committing a Type I
error is exactly the significance level α: $P(T \in R_\alpha \mid H_0) = \alpha$. Type I errors can
arise from a sample size that is too small, the sampling method, and so on.
∗ (ii) Type II error: accepting a false hypothesis. The probability β of a Type II error
is defined as follows: $P(T \notin R_\alpha \mid H_1) = \beta$
– Procedure for statistical hypothesis testing: Based on the above content, we can build
a procedure for statistical hypothesis testing including the following steps:
∗ (i) State the null hypothesis H0 and the alternative hypothesis H1 .

∗ (ii) Randomly sample from the population with a sample size of n.

∗ (iii) Choose the testing standard T and determine the probability distribution
law of T under the condition that the null hypothesis H0 is true.

∗ (iv) Based on the probability distribution law of T, find the rejection region $R_\alpha$
such that: $P(T \in R_\alpha \mid H_0) = \alpha$
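
The procedure can be sketched with a two-sided one-sample t-test. Everything in this example is an assumption for illustration: the simulated sample x, the hypothesized mean mu0, and the significance level alpha. The final comparison of the observed statistic with the rejection region is the usual decision step.

```r
set.seed(3)
# (i) H0: mu = mu0 versus H1: mu != mu0
mu0   <- 10
alpha <- 0.05

# (ii) A simulated random sample of size n
x <- rnorm(30, mean = 10.5, sd = 2)
n <- length(x)

# (iii) Testing standard: under H0, T follows a Student t distribution with n - 1 df
t_obs <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# (iv) Rejection region at level alpha: |T| > t_crit
t_crit <- qt(1 - alpha / 2, df = n - 1)

# Decision: reject H0 only if the observed value falls in the rejection region
reject <- abs(t_obs) > t_crit
c(t_obs = t_obs, t_crit = t_crit, reject = reject)
```

The same statistic and decision can be obtained from the built-in t.test(x, mu = mu0), which reports a p-value instead of a critical value.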


5 Data Preprocessing
5.1 Data Importing: "All_GPUs.csv"
Code:
# Import the dataset and inspect its structure
All_GPUs <- read.csv("~/Downloads/HK232/Probability _ Statistics (XSTK)/BTL/All_GPUs.csv")
str(All_GPUs)
Output:

Figure 1: The structure of "All_GPUs" R object

The dataset includes 3406 observations with 34 different parameters related to the GPU. The
variables consist of different data types:
• chr (character): Textual data, which is used for categorical or descriptive information.
Examples include Architecture, Best_Resolution, Name, ...

• int (integer): Integer numbers, used for counts or other numerical data that do not have
decimals. Some variables such as HDMI_Connection and DVI_Connection are of this type.
• num (numeric): This can include real numbers (integers and floating-point). For instance,
Open_GL and PSU are numeric.


• Some columns, like Direct_X and SLI_Support, contain factors or ordered data represented
as characters.

• NA (Not Available) appears in several places, indicating missing or unavailable data for
those entries.

5.2 Data Cleaning


5.2.1 Extracting data
Code:
# Keep only the variables used in the analysis
Extract_Data <- All_GPUs[, c("Manufacturer", "Dedicated", "Core_Speed", "Memory",
                             "Memory_Bus", "Memory_Speed", "Open_GL", "Pixel_Rate",
                             "TMUs", "Texture_Rate", "Memory_Bandwidth")]
str(Extract_Data)
Output:

Figure 2: The structure of the extracting dataframe

Based on initial observations of the dataset, the team did not choose qualitative variables such
as Architecture, Name, PSU, ROPs, ... because they have many categories and a high missing
rate, and there is no basis for filling in their missing values. Quantitative variables such as
Boost_Clock, DisplayPort_Connection, Release_Price, ... are also ignored since their missing
rate is considerably high (more than 50%), which would greatly affect the overall results of the
analysis. Therefore, guided by articles about the factors affecting Memory Bandwidth, the variables
extracted above were selected because they suit the main purpose of the analysis and satisfy the
two conditions above. The articles are: Computing GPU memory bandwidth with Deep
Learning Benchmarks and What Is Memory Bandwidth? Complete Guide.

Figure 2 shows the structure of the data frame "Extract_Data", including the first few
values of each selected variable. "Extract_Data" is a structured subset
of "All_GPUs" consisting of the 11 columns relevant to this
analysis. The data contain a mixture of numeric and character types, with some missing values
that need to be addressed before further analysis.

5.2.2 Handling variable format


Code:


# Strip the unit suffix from each column, then convert the result to numeric.
# Entries that are not valid numbers after stripping become NA.
Extract_Data$Core_Speed <- as.numeric(sub("MHz", "", Extract_Data$Core_Speed))
Extract_Data$Memory <- as.numeric(sub("MB", "", Extract_Data$Memory))
Extract_Data$Memory_Bus <- as.numeric(sub("Bit", "", Extract_Data$Memory_Bus))
Extract_Data$Memory_Speed <- as.numeric(sub("MHz", "", Extract_Data$Memory_Speed))
Extract_Data$Pixel_Rate <- as.numeric(sub("GPixel/s", "", Extract_Data$Pixel_Rate))
Extract_Data$Texture_Rate <- as.numeric(sub("GTexel/s", "", Extract_Data$Texture_Rate))
Extract_Data$Memory_Bandwidth <- as.numeric(sub("GB/sec", "", Extract_Data$Memory_Bandwidth))
str(Extract_Data)
Output:

Figure 3: The result after data formatting

Based on the output in the previous section, some variables in the data frame
"Extract_Data" are stored with the wrong data type. The qualitative variable "Manufacturer",
which contains the names of GPU manufacturers, is correctly stored as "character",
while several quantitative variables holding GPU parameters are also stored as "character"
because their values include units. To facilitate data processing, we delete the characters
representing the units of these variables and then convert them to the correct type, "numeric".

This conversion is essential for any numeric computation or statistical analysis, as arithmetic
operations cannot be performed while the data are in character format. However,
this code assumes that all values in these columns are properly formatted with the units at the
end. If there are missing values or inconsistencies in formatting, the "as.numeric()" function will
return "NA" for those entries, and additional data cleaning may be required.
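
The behavior described above can be seen on a few made-up strings. The vector raw below is hypothetical and only mimics the kind of values such a column might contain: two well-formed entries, an empty string, a malformed placeholder, and a missing value.

```r
# Hypothetical raw values, for illustration only
raw <- c("738 MHz", "1024 MHz", "", "\n- ", NA)

# Remove the unit, then convert; as.numeric() ignores surrounding whitespace
# and returns NA (with a warning, silenced here) for anything non-numeric.
clean <- suppressWarnings(as.numeric(sub("MHz", "", raw, fixed = TRUE)))

clean
is.na(clean)
```

Only the first two entries survive as numbers. The other three become NA, which is why a missing-data check is needed after the conversion.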

5.2.3 Accounting for missing data


Code:
# Number and proportion of missing values in each column
apply(is.na(Extract_Data), 2, sum)
apply(is.na(Extract_Data), 2, mean)
Output:


Figure 4: Checking the total number and the rate of missing data

After formatting, we check for missing data so that it can be handled properly and the
results remain accurate. There are two common ways to handle missing values:
• Replace them with the mean or median of that column.

• Remove the observations containing N/A values from the data frame.


We choose between these two options based on the missing rate. Specifically, if the missing rate
of a variable is less than 10%, its missing cells do not significantly affect the overall result of the
analysis, so we can remove the corresponding observations from the data frame. If the missing rate
is greater than 10%, the missing cells would have a noticeable effect on the overall result, so we
replace them with the mean or median of that column. For these variables, we rely on
the histogram to decide whether to fill the missing values with the mean or the median. This is an
EDA (Exploratory Data Analysis) method for better understanding the distribution of the data.
• If the variables follow Normal Distribution or closely resemble Normal Distribution, using
the mean is appropriate as the mean is a good estimate for the location of the center of
the Normal Distribution.
• If the data distribution is skewed or has outliers, it makes more sense to use the median
because the median is less affected by extreme values and is a better estimate of the center
position of the skewed distribution.

Columns like "Core_Speed", "Memory", and "Pixel_Rate" have a high number of "NA" values,
with "Core_Speed" having 936 missing values, which is quite significant. Other columns such
as "Manufacturer", "Open_GL", and "Dedicated" have fewer "NA" values, with "Dedicated"
having no missing values at all.
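
The 10% rule can be sketched on a small made-up data frame. The data frame df, its columns a and b, and the use of the median for every imputed column are assumptions for illustration; in the report, the choice between mean and median is made per variable from its histogram.

```r
# Hypothetical data: column a has 3 of 20 values missing (15%), column b has 1 of 20 (5%)
df <- data.frame(
  a = c(1, 2, NA, 4, 5, 6, 7, 8, 9, 10, NA, 12, 13, 14, NA, 16, 17, 18, 19, 20),
  b = c(NA, 2:20)
)

# Missing rate per column
rate <- colMeans(is.na(df))

# High missing rate: impute; low (but non-zero) missing rate: drop the rows
to_impute <- names(rate)[rate >= 0.10]
to_drop   <- names(rate)[rate > 0 & rate < 0.10]

for (v in to_impute) {
  df[[v]][is.na(df[[v]])] <- median(df[[v]], na.rm = TRUE)
}
df <- na.omit(df)   # removes the rows still containing NA (here, from column b)

rate
nrow(df)
```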

5.2.4 Handling missing data


After considering the rate of missing data, we found that the missing rates of "Core_Speed",
"Memory", "Pixel_Rate", "TMUs", and "Texture_Rate" are high enough to noticeably affect the result
of the analysis; therefore, we fill their missing cells with the mean or median of the column.
Figure 5 contains histograms used to check whether those variables follow a Normal Distribution.
Only "Core_Speed" closely follows a Normal Distribution, so we replace its missing
data with the mean. The other four variables have skewed distributions, so we replace their
missing cells with the median of their column. On the other hand, for the variables with a low
missing rate, we simply remove the observations containing missing cells, as they do not
significantly affect the overall result.


Figure 5: The histograms of variables having high missing rate

Code:
# Impute Core_Speed with the mean and the skewed variables with the median
Extract_Data$Core_Speed[is.na(Extract_Data$Core_Speed)] <- mean(Extract_Data$Core_Speed, na.rm = TRUE)
Extract_Data$Memory[is.na(Extract_Data$Memory)] <- median(Extract_Data$Memory, na.rm = TRUE)
Extract_Data$Pixel_Rate[is.na(Extract_Data$Pixel_Rate)] <- median(Extract_Data$Pixel_Rate, na.rm = TRUE)
Extract_Data$TMUs[is.na(Extract_Data$TMUs)] <- median(Extract_Data$TMUs, na.rm = TRUE)
Extract_Data$Texture_Rate[is.na(Extract_Data$Texture_Rate)] <- median(Extract_Data$Texture_Rate, na.rm = TRUE)
# Drop the rows that still contain missing values, then re-count the NAs
Extract_Data <- na.omit(Extract_Data)
apply(is.na(Extract_Data), 2, sum)
Output:


Figure 6: The result after handling missing data

The top part of Figure 6 shows the first rows of the cleaned data frame of GPU
specifications. Notably, the "Core_Speed" for entries 2 through 5 (all corresponding to the
manufacturer AMD) has been set to approximately 946.8939 MHz,
which indicates that the missing values were replaced with the mean
"Core_Speed" of the dataset. The bottom part of the figure is the console output after executing
the "apply()" function. It shows 0 "NA" values in each of the
listed columns, confirming that the data frame no longer has any missing values in
these columns, which is essential for analyses that cannot handle "NA" values.

In conclusion, the "NA" values were either replaced with the mean or median of their
column or removed together with their observations, and the figure shows the end result of these
data cleaning steps. Now that the data frame has been cleansed of "NA" values, it is ready for
analysis without the risk of NA-related computation errors.


6 Descriptive Statistics
6.1 Calculating Characteristic Values
Code:
# Quantitative variables only
Desc_Data <- Extract_Data[, c("Core_Speed", "Memory", "Memory_Bus", "Memory_Speed",
                              "Open_GL", "Pixel_Rate", "TMUs", "Texture_Rate", "Memory_Bandwidth")]
Mean <- apply(Desc_Data, 2, mean)
Standard_Deviation <- apply(Desc_Data, 2, sd)
Median <- apply(Desc_Data, 2, median)
Min <- apply(Desc_Data, 2, min)
Max <- apply(Desc_Data, 2, max)
Q1 <- apply(Desc_Data, 2, quantile, probs = 0.25)
Q3 <- apply(Desc_Data, 2, quantile, probs = 0.75)
# After transposing: one row per statistic, one column per variable
t(data.frame(Mean, Standard_Deviation, Median, Min, Max, Q1, Q3))
Output:

Figure 7: Characteristic values of the dataset

The output provides descriptive statistics for various GPU specifications. Here’s an interpre-
tation of the trends for each variable based on the descriptive statistics:
• Core_Speed: There is a significant range in core speeds from 100 to 1784 MHz, with
the average being around 946 MHz. This suggests that the dataset includes a wide variety
of GPU models, from lower-end to high-performance ones. The relatively high standard
deviation indicates that core speed values are quite spread out.

• Open_GL: The minimum and maximum values are very close, and the median matches
the mean closely, suggesting a less variable and more consistent support level for the
OpenGL standard across GPUs in the dataset.
• Memory: The wide range from 16 MB to 32000 MB, with a large standard deviation,
indicates a significant diversity in memory capacities, possibly due to a mix of older and
newer GPU models.
• Memory_Bus: The range from 32 Bit to 8192 Bit and a high mean suggests that the
dataset contains both older GPUs with narrower buses and modern GPUs with wider buses,
which significantly impact overall memory bandwidth and performance.

• Memory_Speed: There is also a wide range in memory speeds, from 110 MHz to 2127
MHz. The average speed is about 1180 MHz, but the standard deviation is quite large,
indicating diverse performance capabilities.
• Pixel_Rate: With a range from 1 to 384 GPixel/s, the dataset spans GPUs that are
possibly several generations apart. The mean is relatively low compared to the maximum,
suggesting that most GPUs have modest pixel processing capabilities, with a few
high-performance outliers.

• TMUs (Texture Mapping Units): The TMUs range from 1 to 384, which is a broad
spread, reflecting varied texture processing powers within the dataset. The higher mean
relative to the median indicates there are some GPUs with very high TMU counts pulling
the average up.
• Memory_Bandwidth: There’s a substantial range from 1 to 1280 GB/sec, indicating a
significant variation in memory data transfer rates. The average bandwidth is somewhat
low compared to the maximum value, suggesting that the majority of GPUs have moderate
bandwidth with some exceptional high-end models.
Overall, the trends suggest a dataset comprising a wide array of GPUs, from basic to high-end
models, which is reflected in the substantial variability and range of values in the core speed,
memory specifications, and processing rates. The high standard deviations of many of the
variables indicate that the data are spread out over a wide range of values and not clustered around
the mean. This variety is typical for a dataset that includes multiple generations and levels of
GPU hardware.

Code:
# Number of observations in each category
table(Extract_Data$Manufacturer)
table(Extract_Data$Dedicated)

Output:

Figure 8: The numbers of observations divided by Manufacturers and Dedicated

6.2 Plotting Histogram for the Distribution of Memory Bandwidth


Code:
# Memory_Bandwidth is measured in GB/sec
hist(Extract_Data$Memory_Bandwidth, xlab = "Memory_Bandwidth (GB/sec)", labels = TRUE, col = "#99CCFF", breaks = 20)
Output:


Figure 9: The Histogram of Memory_Bandwidth

The histogram shows the frequency distribution of the Memory_Bandwidth variable, which is
measured in GB/sec. The distribution is skewed to the right, indicating that most of the GPUs in the
dataset have a lower memory bandwidth, with a decreasing number of GPUs as the memory
bandwidth increases.
There are a few key observations:
• Concentration in Lower Bins: A large concentration of GPUs have a memory bandwidth
of 200 GB/sec or less. This might suggest common lower-end or older models.
• Long Tail to the Right: There are fewer GPUs with higher memory bandwidths, but
they do exist up to 1200 GB/sec, which is the farthest bin on the right. This indicates some
high-performance GPUs are present in the data.
• Possible Outliers: There are a few GPUs with very high memory bandwidth (800 GB/sec
and above), which could be considered outliers compared to the rest of the data.
• Bins and Range: The range of "Memory_Bandwidth" values is divided into equal-width
intervals ("breaks = 20" asks hist() for about 20 bins, which R treats as a suggestion).
Many of these intervals, especially at the higher end, have very
few or no GPUs, leading to a sparse right tail.
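
The same reading of a histogram can be checked numerically. The sketch below uses simulated right-skewed values (the vector bw is hypothetical, not the real Memory_Bandwidth column) and inspects the bin counts without drawing the plot.

```r
set.seed(10)
# Simulated right-skewed data standing in for a bandwidth-like variable
bw <- rlnorm(500, meanlog = 4, sdlog = 1)

# plot = FALSE returns the bin edges and counts instead of drawing the histogram
h <- hist(bw, breaks = 20, plot = FALSE)

length(h$counts)          # number of bins actually used (may differ from 20)
h$counts                  # counts per bin: large at the low end, sparse on the right
mean(bw) > median(bw)     # a mean above the median is a simple sign of right skew
```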


6.3 Plotting Box Plots for the Distribution of Memory Bandwidth by Qualitative Variables
Code:
1 library(ggplot2)
2 ggplot(Extract_Data, aes(x = Manufacturer, y = Memory_Bandwidth, fill = Manufacturer
)) + geom_boxplot() +
3 stat_summary(fun.y = "mean", geom = "point", color = "red") +
4 scale_fill_brewer(palette = "Blues") +
5 theme_minimal()
6 ggplot(Extract_Data, aes(x = Dedicated, y = Memory_Bandwidth, fill = Dedicated)) +
geom_boxplot() +
7 stat_summary(fun.y = "mean", geom = "point", color = "red") +
8 scale_fill_brewer(palette = "Blues") +
9 theme_minimal()
Output:

Figure 10: The Boxplot of Memory_Bandwidth corresponding to Manufacturer and Dedicated

These box plots visualize the memory bandwidth (in GB/s) across different GPU manufac-
turers and whether the GPUs are dedicated or not. A few observations:
• Among the manufacturers, AMD and Nvidia have significantly higher memory bandwidth
compared to Arm, ATI, and Intel GPUs.
• Nvidia GPUs appear to have a wider range of memory bandwidth values, with some high-
end models reaching over 1000 GB/s.


• There is a distinct difference in memory bandwidth between dedicated GPUs and non-
dedicated GPUs, with dedicated GPUs having substantially higher bandwidth capabilities,
as expected.

• The boxes and whiskers indicate the distributions of memory bandwidth within each cat-
egory, allowing for comparison of central tendencies and spread.
• For the manufacturer box plots, there are several outliers visible above the main distri-
bution for AMD, Nvidia, and Intel GPUs. These likely represent high-end or specialized
GPU models from those manufacturers that have exceptionally high memory bandwidth
compared to their typical product lines. On the dedicated GPU box plot, there is one
significant outlier below the main distribution for non-dedicated GPUs. This could repre-
sent a lower-end or older integrated GPU model with relatively poor memory bandwidth
performance. Outliers in box plots can provide insight into extreme values or potential
anomalies in the data. Their presence indicates there are some GPU configurations that
deviate substantially from the typical ranges seen for each category. Examining these outlier
models specifically could reveal interesting technical factors behind their outlier memory
bandwidth capabilities.
Overall, these box plots effectively highlight the performance differences in terms of memory
bandwidth across GPU manufacturers and dedicated vs. non-dedicated GPU configurations.
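
The outliers discussed above can also be listed numerically rather than read off the plot. The following is a minimal sketch, assuming a small synthetic data frame (`gpu_demo`, a hypothetical stand-in for Extract_Data) and the 1.5 × IQR whisker rule implemented by base R's boxplot.stats(); ggplot2 applies a similar rule, so the flagged points may differ slightly from those drawn in the figure.

```r
# Hypothetical stand-in for Extract_Data: two manufacturers, bandwidth in GB/s
gpu_demo <- data.frame(
  Manufacturer = rep(c("AMD", "Nvidia"), each = 6),
  Memory_Bandwidth = c(20, 25, 30, 35, 40, 300,   # 300 lies far above the AMD box
                       50, 60, 70, 80, 90, 100)
)

# boxplot.stats() returns, in $out, the points beyond 1.5 * IQR from the hinges
outliers_by_group <- lapply(
  split(gpu_demo$Memory_Bandwidth, gpu_demo$Manufacturer),
  function(x) boxplot.stats(x)$out
)
print(outliers_by_group)
```

Applied to the real data, the same call on split(Extract_Data$Memory_Bandwidth, Extract_Data$Manufacturer) would list the candidate outliers per manufacturer.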

6.4 Plotting Scatter Plot for the Linear Relationship between Memory_Bandwidth and Quantitative Variables
Code:
plot(Extract_Data$Core_Speed, Extract_Data$Memory_Bandwidth,
     xlab = "Core_Speed (MHz)", ylab = "Memory_Bandwidth (GB/sec)", pch = 1)
plot(Extract_Data$Memory, Extract_Data$Memory_Bandwidth,
     xlab = "Memory (MB)", ylab = "Memory_Bandwidth (GB/sec)", pch = 1)
plot(Extract_Data$Memory_Bus, Extract_Data$Memory_Bandwidth,
     xlab = "Memory_Bus (Bit)", ylab = "Memory_Bandwidth (GB/sec)", pch = 1)
plot(Extract_Data$Memory_Speed, Extract_Data$Memory_Bandwidth,
     xlab = "Memory_Speed (MHz)", ylab = "Memory_Bandwidth (GB/sec)", pch = 1)
plot(Extract_Data$Open_GL, Extract_Data$Memory_Bandwidth,
     xlab = "Open_GL", ylab = "Memory_Bandwidth (GB/sec)", pch = 1)
plot(Extract_Data$Pixel_Rate, Extract_Data$Memory_Bandwidth,
     xlab = "Pixel_Rate (GPixel/s)", ylab = "Memory_Bandwidth (GB/sec)", pch = 1)
plot(Extract_Data$TMUs, Extract_Data$Memory_Bandwidth,
     xlab = "TMUs", ylab = "Memory_Bandwidth (GB/sec)", pch = 1)
plot(Extract_Data$Texture_Rate, Extract_Data$Memory_Bandwidth,
     xlab = "Texture_Rate (GTexel/s)", ylab = "Memory_Bandwidth (GB/sec)", pch = 1)
Output:

Figure 11: The Scatter Plots of Memory_Bandwidth corresponding to Independent Variables

Scatter Plot 1: displays the relationship between core speed and memory bandwidth.
The data points show a rough positive correlation, indicating that GPUs with higher core
speeds tend to have higher memory bandwidth as well. However, there is substantial variation
in the data, with some GPUs having relatively low memory bandwidth despite their high core
speeds, and vice versa. The plot also shows a dense cluster of data points around the core
speed range of 800-1200 MHz, suggesting that this is a common range for the GPUs represented
in the data.

Scatter Plot 2: shows the relationship between the memory and the memory bandwidth.
The data points appear to form a distinct pattern, with a dense cluster of points at the lower
end of the memory range (around 0 to 4,000 MB) and relatively high memory bandwidth values
(around 200-600 GB/sec). As the memory increases beyond 4,000 MB, the data points become
more sparse and spread out, with memory bandwidth values ranging from very low to relatively
high, indicating a weaker correlation between memory size and bandwidth for systems with larger
memory capacities. The plot has a significant number of outliers, especially at higher memory
sizes, where some data points have unexpectedly low or high bandwidth values compared to the
general trend.

Scatter Plot 3: displays the relationship between the memory bus and the memory band-
width. The data points form a distinct pattern, with a dense cluster at the lower end of the mem-
ory bus range (around 0 to 1,000 bits) and relatively high memory bandwidth values (around
200-600 GB/sec). As the memory bus width increases beyond 1,000 bits, the data points be-
come more sparse and spread out vertically, indicating a weaker correlation between bus width
and bandwidth for systems with wider memory buses. There are several outliers, particularly at
higher bus widths, where some points have unexpectedly low or high bandwidth values compared
to the general trend.

Scatter Plot 4: shows the relationship between memory speed and memory bandwidth.
There appears to be a positive correlation between the two variables, with higher memory
speeds generally corresponding to higher memory bandwidth values. However, the correlation
is not perfectly linear, and there is substantial variation in the data points. The data points form
a somewhat triangular or fan-shaped pattern, with a dense cluster of points at lower memory
speeds (around 500-800 MHz) and a wider spread of bandwidth values as the memory speed
increases. This suggests that while higher memory speeds tend to enable higher bandwidths,
other factors likely influence the actual bandwidth achieved. There are also several outliers, par-
ticularly at higher memory speeds, where some points have unexpectedly low or high bandwidth
values compared to the general trend.

Scatter Plot 5: depicts the relationship between OpenGL and memory bandwidth. The
data points appear to form distinct horizontal bands or clusters, suggesting that memory band-
width tends to group around certain values corresponding to specific OpenGL versions. Notably,
there is a dense cluster of points around the OpenGL version 2.0, with varying but generally
lower memory bandwidth values. As the OpenGL version increases beyond 2.0, there are sep-
arate clusters with progressively higher memory bandwidth capabilities. However, within each
cluster corresponding to a specific OpenGL version, there is still some variation in memory
bandwidth, indicating that other factors beyond just the OpenGL version likely influence the
achievable memory performance. It’s interesting to note the presence of a few outliers, partic-
ularly at higher OpenGL versions, where some data points exhibit unexpectedly low or high
memory bandwidth compared to the general trend for that version.

Scatter Plot 7: shows the relationship between TMUs and memory bandwidth. The data points
are densely clustered at lower TMU values, indicating that a large number of GPUs have
relatively few TMUs. As the TMU count increases, the data points become more sparse,
suggesting that fewer GPUs reach these higher levels. There are also a few distinct outlier
points at very high TMU and bandwidth values, potentially representing high-end or
specialized GPUs optimized for memory performance.

Scatter Plot 8: shows the relationship between texture rate and memory bandwidth. The data
points form an elongated cluster with a tail extending towards higher texture rates. This
shape suggests that as the texture rate increases, the corresponding memory bandwidth tends
to plateau, with few data points achieving very high bandwidth at the highest texture rates.
There are also several outlier points scattered above the main cluster, indicating GPUs that
achieve higher memory bandwidth than is typical for their texture rate. These could represent
specialized or optimized GPU setups.
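
The visual impressions from the scatter plots can be backed by Pearson correlation coefficients between each quantitative variable and the response. The sketch below is illustrative only: it uses simulated data (`demo`, with made-up coefficients and noise) instead of Extract_Data, and the column names are borrowed from the report purely for readability.

```r
set.seed(1)
n <- 200
demo <- data.frame(
  Memory_Speed = runif(n, 400, 2000),                             # MHz (simulated)
  Memory_Bus   = sample(c(64, 128, 256, 384), n, replace = TRUE)  # bits (simulated)
)
# Simulated response: linear in both predictors plus noise (illustrative only)
demo$Memory_Bandwidth <- 0.05 * demo$Memory_Speed +
  0.4 * demo$Memory_Bus + rnorm(n, sd = 20)

# Pearson correlation of each predictor with the response
r <- sapply(demo[c("Memory_Speed", "Memory_Bus")],
            function(x) cor(x, demo$Memory_Bandwidth))
print(round(r, 2))
```

A coefficient near 0 for a panel that shows a clear curved or banded pattern would signal that the relationship exists but is not linear.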

6.5 Plotting a Correlation Matrix to Check for Multicollinearity between Independent Variables
Code:
library(corrplot)
# Keep only the quantitative independent variables
Indepen_Data <- Extract_Data[, c("Core_Speed", "Memory", "Memory_Bus", "Memory_Speed",
                                 "Open_GL", "Pixel_Rate", "TMUs", "Texture_Rate")]
Correlation <- cor(Indepen_Data)
corrplot(Correlation, method = "number")
Output:

Figure 12: The Correlation Matrix of Independent Variables

The correlation coefficients show that Texture_Rate has a relatively strong linear
relationship with Memory, Pixel_Rate, and TMUs, with r of 0.81, 0.89, and 0.81, respectively.
However, because none of these values exceeds 0.9, we temporarily keep these variables when
building the multiple regression model. The correlation coefficients for the remaining pairs
of variables do not indicate any strong linear relationship. Therefore, we treat the
independent variables as satisfying the condition that no serious multicollinearity occurs.
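
Pairwise correlations can miss collinearity that involves three or more variables, so a common complement is the variance inflation factor, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the other predictors. The following is a minimal sketch on simulated predictors (all values and relationships are assumptions made for illustration, not the report's data); the frequently quoted thresholds of 5 or 10 are rules of thumb only.

```r
set.seed(42)
n <- 300
TMUs         <- rnorm(n, 80, 30)
Pixel_Rate   <- 0.5 * TMUs + rnorm(n, 0, 8)    # correlated with TMUs by construction
Texture_Rate <- 1.2 * TMUs + rnorm(n, 0, 10)   # strongly correlated with TMUs
Core_Speed   <- rnorm(n, 1000, 150)            # unrelated to the others
X <- data.frame(TMUs, Pixel_Rate, Texture_Rate, Core_Speed)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing X_j on the remaining predictors
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

In this simulation the independent Core_Speed column should have a VIF close to 1, while TMUs and Texture_Rate are inflated because each is largely predictable from the other.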

7 Inferential Statistics
7.1 Confidence Interval
PROBLEM: Construct a 95% two-sided confidence interval for the mean memory bandwidth
based on the above dataset.
First, we have to check whether the Memory_Bandwidth variable follows a Normal Distribution.
There are two methods that we can use in this situation:

• Method 1: Using a Q-Q plot.

• Method 2: Using the Shapiro-Wilk test. Based on the p-value returned from the test, we
draw a conclusion as follows:
– Case 1: If p-value ≤ α, then we reject H0 .
– Case 2: If p-value > α, then we fail to reject H0 .
Code (Method 1):
qqnorm(Extract_Data$Memory_Bandwidth)
qqline(Extract_Data$Memory_Bandwidth)
Output:

Figure 13: The Q-Q Plot of Memory_Bandwidth

Based on the Q-Q Plot in Figure 13, the observations deviate greatly from the reference
line of the Normal Distribution. Therefore, Memory_Bandwidth does not appear to follow the
Normal Distribution. However, a visual check is subjective, so we also use the Shapiro-Wilk
test to reach a more reliable conclusion.

Code (Method 2):


shapiro.test(Extract_Data$Memory_Bandwidth)
Output:

Figure 14: The result of Shapiro-Wilk Test

As in other testing problems, we first formulate the hypotheses:
• H0 : Memory_Bandwidth follows the Normal Distribution.
• H1 : Memory_Bandwidth does not follow the Normal Distribution.
From the output of the test, the p-value ($< 2.2 \times 10^{-16}$) is far smaller than the
significance level α = 0.05. Hence, we reject H0 , which means Memory_Bandwidth does not
follow the Normal Distribution.
=⇒ Thus, the problem is to find a confidence interval for the mean of Memory_Bandwidth,
which follows an arbitrary distribution, with a large sample size (n > 30). In this case the
Central Limit Theorem justifies an interval based on the z-score.

Then we have to calculate some sample statistics, including sample size, sample mean, and
sample standard deviation.
Code:
n <- length(Extract_Data$Memory_Bandwidth)
x_bar <- mean(Extract_Data$Memory_Bandwidth)
s <- sd(Extract_Data$Memory_Bandwidth)
data.frame(n, x_bar, s)
Output:

Figure 15: Some sample statistics

After that, we calculate the margin of error using the formula $\varepsilon = Z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$,
where $Z_{\alpha/2}$ is the z-score corresponding to the desired level of confidence, s is the
sample standard deviation, and n is the sample size.
Code:
Error <- qnorm(p = 0.05/2, lower.tail = FALSE) * s / sqrt(n)
print(Error)

Output:

Figure 16: The Error of the Testing Problem

Finally, we can make a conclusion about the confidence interval of the Memory_Bandwidth
variable as follows:
Code:
data.frame(Left_Bound = x_bar - Error, Right_Bound = x_bar + Error)
Output:

Figure 17: The Confidence Interval of Memory_Bandwidth
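
The steps above can be wrapped into a small reusable function. This is a sketch under stated assumptions: `mean_ci` is a hypothetical helper name, and the input is a synthetic right-skewed sample rather than the Memory_Bandwidth column. The comparison with t.test() illustrates that, for a large sample, the z-based and t-based intervals are nearly the same.

```r
# Large-sample (z-based) confidence interval for a mean; `x` is any numeric vector
mean_ci <- function(x, conf_level = 0.95) {
  n <- length(x)
  alpha <- 1 - conf_level
  margin <- qnorm(alpha / 2, lower.tail = FALSE) * sd(x) / sqrt(n)
  c(lower = mean(x) - margin, upper = mean(x) + margin)
}

set.seed(7)
skewed_sample <- rexp(500, rate = 1 / 100)   # synthetic right-skewed data

ci_z <- mean_ci(skewed_sample)
ci_t <- t.test(skewed_sample)$conf.int       # t-based interval for comparison
print(ci_z)
print(ci_t)
```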

7.2 Multiple Linear Regression Model


7.2.1 Model Building
We are interested in which technical specifications affect the Memory Bandwidth of
GPUs. To answer this, we build a multiple linear regression model, in which:

• Dependent variable: Memory_Bandwidth


• Independent variables: The remaining variables
The model is represented as follows:
Memory_Bandwidth = β0 + β1 ·ManufacturerATI + β2 ·ManufacturerIntel + β3 ·ManufacturerNvidia
+ β4 ·DedicatedYes + β5 ·Core_Speed + β6 ·Memory + β7 ·Memory_Bus + β8 ·Memory_Speed
+ β9 ·Open_GL + β10 ·Pixel_Rate + β11 ·TMUs + β12 ·Texture_Rate + ε

Then we have to estimate the coefficients βi :


Code:
MLR_Model <- lm(Memory_Bandwidth ~ ., data = Extract_Data)
summary(MLR_Model)
Output:

Figure 18: The result of the Multiple Linear Regression Model

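From the output of the model, we have the following estimated coefficients $\hat{\beta}_i$:
\[
\begin{aligned}
&\hat{\beta}_0 = -1.699 \times 10^{1}, \quad \hat{\beta}_1 = -3.999 \times 10^{1}, \quad \hat{\beta}_2 = -3.610 \times 10^{1}, \quad \hat{\beta}_3 = -1.976 \times 10^{1}, \quad \hat{\beta}_4 = 1.956 \times 10^{1}, \\
&\hat{\beta}_5 = -9.813 \times 10^{-2}, \quad \hat{\beta}_6 = 3.362 \times 10^{-3}, \quad \hat{\beta}_7 = 9.342 \times 10^{-2}, \quad \hat{\beta}_8 = 3.556 \times 10^{-2}, \quad \hat{\beta}_9 = 1.448 \times 10^{1}, \\
&\hat{\beta}_{10} = 6.777 \times 10^{-1}, \quad \hat{\beta}_{11} = 4.120 \times 10^{-2}, \quad \hat{\beta}_{12} = 1.018.
\end{aligned}
\]
From there, we can obtain the estimated regression equation:
\[
\begin{aligned}
\widehat{\text{Memory\_Bandwidth}} ={} & -1.699 \times 10^{1} - 3.999 \times 10^{1} \cdot \text{ManufacturerATI} - 3.610 \times 10^{1} \cdot \text{ManufacturerIntel} \\
& - 1.976 \times 10^{1} \cdot \text{ManufacturerNvidia} + 1.956 \times 10^{1} \cdot \text{DedicatedYes} - 9.813 \times 10^{-2} \cdot \text{Core\_Speed} \\
& + 3.362 \times 10^{-3} \cdot \text{Memory} + 9.342 \times 10^{-2} \cdot \text{Memory\_Bus} + 3.556 \times 10^{-2} \cdot \text{Memory\_Speed} \\
& + 1.448 \times 10^{1} \cdot \text{Open\_GL} + 6.777 \times 10^{-1} \cdot \text{Pixel\_Rate} + 4.120 \times 10^{-2} \cdot \text{TMUs} + 1.018 \cdot \text{Texture\_Rate}
\end{aligned}
\]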

After that, we check whether each variable has a significant effect on Memory_Bandwidth
by testing the regression coefficients:

• H0 : βi = 0 ←→ The regression coefficient is not significant.

• H1 : βi ≠ 0 ←→ The regression coefficient is significant.
Based on the p-value column (Pr(>|t|)), the p-value corresponding to the variable TMUs is
0.1340, which is greater than the 5% significance level, so we fail to reject H0 : β11 = 0.
In other words, there is not enough evidence that TMUs affects Memory_Bandwidth when the
other variables are in the model.
The Adjusted R-squared = 0.9138 shows that 91.38% of the variation in Memory_Bandwidth is
explained by the independent variables in the model.

Next, we build a second multiple linear regression model by removing the TMUs variable
from the first one.
Code:
MLR_Model2 <- lm(Memory_Bandwidth ~ . - TMUs, data = Extract_Data)
summary(MLR_Model2)
Output:

Figure 19: The result of the Second Multiple Linear Regression Model

The Adjusted R-squared = 0.9138 shows that 91.38% of the variation in Memory_Bandwidth is
explained by the independent variables in the model.

After that, we compare the two models.

Code:
anova(MLR_Model, MLR_Model2)
Output:

Figure 20: The Comparison of 2 Models

We test whether the full model (Model 1) fits significantly better than the reduced model
(Model 2):
• H0 : The coefficient of TMUs is zero, so Model 2 fits as well as Model 1.
• H1 : The coefficient of TMUs is not zero, so Model 1 fits better than Model 2.

As the p-value (Pr(>F)) = 0.134 is greater than the 5% significance level, we fail to
reject H0 : there is no evidence that including TMUs improves the fit. In addition, Model 2
uses fewer variables than Model 1, while both models have the same Adjusted R-squared. Thus,
we choose Model 2 as the better model.
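
The p-value of this comparison (0.134) matches the p-value of TMUs in the first model (0.1340). This is expected: when two nested models differ by a single coefficient, the partial F-test and the t-test for that coefficient are equivalent (F = t²). The sketch below demonstrates this on R's built-in mtcars data, which is used here only as an illustrative stand-in for Extract_Data.

```r
full_model    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced_model <- lm(mpg ~ wt + hp, data = mtcars)   # same model without qsec

# Partial F-test: H0 says the coefficient of the dropped term (qsec) is zero
f_test   <- anova(reduced_model, full_model)
p_from_F <- f_test[["Pr(>F)"]][2]

# t-test p-value for the same coefficient in the full model
p_from_t <- summary(full_model)$coefficients["qsec", "Pr(>|t|)"]

print(c(F_test = p_from_F, t_test = p_from_t))
```

The anova() comparison becomes more informative than the individual t-tests when several variables are removed at once.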
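The equation showing the relationship between Memory_Bandwidth and the independent
variables is presented as follows:
\[
\begin{aligned}
\widehat{\text{Memory\_Bandwidth}} ={} & -1.608 \times 10^{1} - 3.984 \times 10^{1} \cdot \text{ManufacturerATI} - 3.626 \times 10^{1} \cdot \text{ManufacturerIntel} \\
& - 1.961 \times 10^{1} \cdot \text{ManufacturerNvidia} + 1.975 \times 10^{1} \cdot \text{DedicatedYes} - 9.859 \times 10^{-2} \cdot \text{Core\_Speed} \\
& + 3.264 \times 10^{-3} \cdot \text{Memory} + 9.538 \times 10^{-2} \cdot \text{Memory\_Bus} + 3.680 \times 10^{-2} \cdot \text{Memory\_Speed} \\
& + 1.423 \times 10^{1} \cdot \text{Open\_GL} + 6.643 \times 10^{-1} \cdot \text{Pixel\_Rate} + 1.040 \cdot \text{Texture\_Rate}
\end{aligned}
\]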

We continue to assess the impact of the independent variables on Memory_Bandwidth:

• Based on p-value: The smaller the p-value, the stronger the evidence that the variable
has a non-zero effect on Memory_Bandwidth. From the output, Core_Speed, ManufacturerATI,
ManufacturerIntel, and ManufacturerNvidia show the strongest evidence of an effect on
Memory_Bandwidth, followed by Memory, Memory_Speed, ...

• Based on the estimated regression coefficients: For example, $\hat{\beta}_5 = -9.859 \times 10^{-2}$
in Model 2 shows that when the core speed of a GPU increases by 1 MHz, we expect the memory
bandwidth to decrease by $9.859 \times 10^{-2}$ GB/s (provided the other variables remain
unchanged). The remaining coefficients are interpreted similarly.
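
The "one-unit increase, other variables unchanged" reading of a coefficient can be checked directly with predict(). The following minimal sketch uses the built-in mtcars data as an assumed stand-in; the variable names (`wt`, `hp`) and the chosen values are illustrative only.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

base_case <- data.frame(wt = 3, hp = 120)
plus_one  <- data.frame(wt = 3, hp = 121)   # hp increased by 1 unit, wt unchanged

# The change in the prediction equals the estimated coefficient of hp
change <- unname(predict(fit, plus_one) - predict(fit, base_case))
print(change)
print(coef(fit)["hp"])
```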

7.2.2 Homoscedasticity Checking


Finally, we have to check the assumptions of the model:
• Y and the independent variables X have a linear relationship.
• The residual errors are independent of each other.

• The residual errors follow a Normal Distribution with an expectation of 0 and a constant
variance.
Code:
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))   # arrange the four diagnostic plots in a 2 x 2 grid
plot(MLR_Model2)
Output:

Figure 21: Homoscedasticity Checking

Plot 1: Shows the Residuals against the Fitted values, which is used to check
the following assumptions:
• Y and the independent variables X have a linear relationship.
• The residual errors have an expectation of 0.
• The variance of the residual errors is constant.
Based on the results:
• The red line is close to a straight horizontal line, so the assumption that Y and the
independent variables X have a linear relationship is satisfied.
• The red line is close to the line y = 0, so the assumption that the expected error is 0
is satisfied.
• The residuals are not randomly scattered around the red line, so the assumption that the
error variance is constant is not satisfied.

Plot 2: Shows the Q-Q plot of the Standardized Residuals, used to check the assumption
that the errors follow a Normal Distribution.
Based on the results, many points deviate from the expected normal distribution line, so
the assumption that the errors are normally distributed is not satisfied.

Plot 3: Shows the square root of the absolute Standardized Residuals against the fitted
values (the Scale-Location plot), which is used to check the assumption that the variance of
the errors is constant.
Based on the results, the points are not randomly dispersed around the red line, so the
assumption that the variance of the errors is constant is not really satisfied.

Plot 4: Shows the Standardized Residuals against leverage and highlights the highly
influential points in the dataset, specifically points 950, 60, and 62. However, only point 62
lies beyond the Cook's distance contour. Therefore, we need to remove point 62 from the
data set.
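
The visual checks above can be complemented with numeric ones. The sketch below is an illustration on the built-in mtcars data, not on MLR_Model2: it runs the Shapiro-Wilk test on the residuals, computes Cook's distance for every observation, and refits the model without the flagged rows. The 4/n cut-off is a rule of thumb assumed here for illustration; it is not the contour drawn by plot().

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Normality of the residuals (complements the Q-Q plot)
normality <- shapiro.test(residuals(fit))
print(normality$p.value)

# Cook's distance for every observation (complements Residuals vs Leverage)
cooks_d <- cooks.distance(fit)
# Rule-of-thumb cut-off 4/n (an assumption for this sketch)
influential <- names(cooks_d)[cooks_d > 4 / nrow(mtcars)]
print(influential)

# Refit without the flagged rows to see how much the coefficients move
refit <- lm(mpg ~ wt + hp, data = mtcars[!rownames(mtcars) %in% influential, ])
print(rbind(original = coef(fit), refit = coef(refit)))
```

If the coefficients change little after the refit, the flagged observations are unusual but not decisive for the conclusions.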

7.2.3 Model Prediction


Code:
Compare <- Extract_Data['Memory_Bandwidth']
Compare['Predicted_MB'] <- as.data.frame(predict(MLR_Model2, newdata = Extract_Data))
# Plotting predicted against observed values
# (Compare is assumed to be the data frame intended here; the name is garbled in the source)
library(ggplot2)
ggplot(Compare, aes(x = Memory_Bandwidth, y = Predicted_MB)) +
  geom_point(shape = 1) +
  geom_abline(intercept = 0, slope = 1, color = "red") +  # reference line y = x
  labs(x = "Memory_Bandwidth (GB/sec)", y = "Predicted_Memory_Bandwidth")
Output:

Figure 22: The Scatter Plot shows the relationship of Predicted and Real Memory_Bandwidth

The red line in the graph is (d) : y = x. The more closely the points concentrate around
this line, the more accurate the model's predictions are. To be more specific, we show the
first 10 observed values together with their predictions:
Code:
Pred <- as.data.frame(predict(MLR_Model2, newdata = Extract_Data))
Compare <- cbind(Extract_Data$Memory_Bandwidth, Pred)
colnames(Compare) <- c("Memory_Bandwidth", "Prediction")
head(Compare, 10)
Output:
Memory_Bandwidth  Prediction
            64.0   79.677021
           106.0   55.851926
            51.2   26.993941
            36.8   21.903033
            22.4    1.457792
            35.2   19.022765
           134.4   80.299337
            51.2   21.044412
           160.0  129.703299
             2.9   29.706407

Finally, we check the goodness of fit of the model by computing R² = 1 − SSE/SST:


Code & Output:
SSE <- sum((Extract_Data$Memory_Bandwidth - Pred)^2)
# SST is measured around the mean of the response variable
SST <- sum((Extract_Data$Memory_Bandwidth - mean(Extract_Data$Memory_Bandwidth))^2)
cat("The accuracy of the model on test set: ", round((1 - SSE/SST)* 100, 2), "%")
#Output: The accuracy of the model on test set: 91.41%
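
The value above is computed on Extract_Data, the same data used to fit MLR_Model2, so it measures in-sample fit. A genuine test-set figure requires holding out rows before fitting. The following is a minimal sketch of that procedure on the built-in mtcars data; the 10-row hold-out size, the seed, and the model formula are arbitrary assumptions for illustration.

```r
set.seed(123)
test_rows <- sample(nrow(mtcars), size = 10)   # hypothetical 10-row hold-out set
train <- mtcars[-test_rows, ]
test  <- mtcars[test_rows, ]

# Fit on the training rows only, then predict the held-out rows
fit  <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = test)

sse <- sum((test$mpg - pred)^2)
sst <- sum((test$mpg - mean(test$mpg))^2)
r2_test <- 1 - sse / sst
cat("Hold-out R-squared:", round(r2_test, 4), "\n")
```

A hold-out R-squared that is much lower than the in-sample value would suggest overfitting.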

8 Discussion and Extension


8.1 Advantages
Multiple linear regression offers several advantages in statistical modeling and analysis:
• Flexibility: Multiple linear regression allows for the incorporation of multiple independent
variables, enabling the exploration of complex relationships between the predictors and the
response variable.
• Interpretability: It provides coefficients for each independent variable, which can be inter-
preted to understand the strength and direction of their relationship with the dependent
variable.
• Prediction: It can be used for prediction purposes, allowing researchers to estimate the
value of the dependent variable based on the values of the independent variables.
• Variable Selection: Through techniques like stepwise regression or regularization methods,
multiple linear regression facilitates the selection of the most influential variables, improving
model efficiency and interpretability.

• Assumption Testing: Multiple linear regression allows for the testing of various assumptions,
such as linearity, independence of errors, homoscedasticity, and normality of residuals,
providing insights into the validity of the model.
• Model Comparison: It enables the comparison of different models using metrics like R-
squared, adjusted R-squared, and AIC/BIC, helping researchers identify the most suitable
model for their data.
• Inference: Multiple linear regression provides inferential capabilities, allowing researchers
to draw conclusions about population parameters based on sample data.
• Control of Confounding Variables: By including relevant independent variables in the
model, multiple linear regression helps control for confounding factors, thereby enhanc-
ing the accuracy of the estimated relationships.
• Hypothesis Testing: Researchers can use multiple linear regression to test specific hypothe-
ses about the relationships between independent and dependent variables, providing em-
pirical support for theoretical constructs.

• Applications: It finds applications in various fields such as economics, social sciences,
medicine, engineering, and business, making it a versatile and widely used statistical
technique.
Linear regression offers a straightforward yet powerful approach for modeling the relationship
between a dependent variable and one or more independent variables. Its simplicity allows for
easy interpretation of results and coefficients, making it accessible even to non-statisticians.
Additionally, linear regression provides a solid foundation for more complex modeling techniques
and hypothesis testing. Its versatility extends across various fields, from economics and social
sciences to engineering and healthcare, making it a fundamental tool for understanding and
predicting real-world phenomena.

8.2 Disadvantages
Multiple linear regression, like any statistical method, has its limitations and disadvantages. Here
are some of them:
• Assumption of Linearity: Multiple linear regression assumes that the relationship between
the independent variables and the dependent variable is linear. If this assumption is vio-
lated, the model’s predictions may be inaccurate.
• Assumption of Independence: Multiple linear regression assumes that the independent vari-
ables are independent of each other. If there is multicollinearity (high correlation) among
the independent variables, it can lead to unreliable estimates of the regression coefficients.
• Overfitting: When too many independent variables are included in the model relative to
the number of observations, the model may overfit the data. Overfitting occurs when the
model captures noise in the data rather than the underlying relationship.
• Interpretability: As the number of independent variables increases, it becomes more chal-
lenging to interpret the coefficients of the model. Understanding the individual impact of
each independent variable on the dependent variable becomes less straightforward.
• Sensitive to Outliers: Multiple linear regression can be sensitive to outliers, which are data
points that deviate significantly from the rest of the data. Outliers can have a dispro-
portionate influence on the estimated regression coefficients and may lead to misleading
results.
• Assumption of Homoscedasticity: Multiple linear regression assumes that the residuals (the
differences between the observed and predicted values) have constant variance across all
levels of the independent variables. Violation of this assumption, known as heteroscedas-
ticity, can lead to biased standard errors and confidence intervals.
• Non-linearity of the Relationship: While multiple linear regression assumes a linear relation-
ship between the independent variables and the dependent variable, this may not always
be the case in reality. In such situations, the model may not accurately capture the true
relationship between the variables.
• Limited to Linear Relationships: Multiple linear regression is not suitable for modeling non-
linear relationships between variables. If the true relationship is nonlinear, the model may
provide poor predictions.
Linear regression may not be the appropriate choice in certain situations, particularly when
the assumptions of the model are violated or when the relationship between the variables is
not linear. For instance, when dealing with data that exhibit a nonlinear relationship, such as
exponential or polynomial patterns, linear regression may yield inaccurate predictions and un-
reliable parameter estimates. Additionally, if the independent variables are highly correlated
(multicollinearity), it can lead to inflated standard errors and difficulties in interpreting the co-
efficients. Moreover, when the data contain outliers or influential observations, linear regression
may produce biased estimates. In cases where the relationship between the variables is better
represented by a different model, such as logistic regression for binary outcomes or time se-
ries models for temporal data, linear regression should not be used. It’s crucial to assess the
appropriateness of linear regression based on the specific characteristics of the data and the re-
search question at hand, considering alternative methods when necessary to ensure accurate and
meaningful analysis.

