Probability Statistics Report
UNIVERSITY OF TECHNOLOGY
FACULTY OF APPLIED SCIENCE
Contents
1 Member List & Workload
3 Data Overview
4 Theory
4.1 Sampling Theory
4.1.1 Some Definitions
4.1.2 The Characteristics of A Random Sample
4.2 Exploratory data analysis (EDA)
4.3 Multiple Linear Regression (MLR)
4.4 Statistical Hypothesis Testing
5 Data Preprocessing
5.1 Data Importing: "All_GPUs.csv"
5.2 Data Cleaning
5.2.1 Extracting data
5.2.2 Handling variable format
5.2.3 Accounting for missing data
5.2.4 Handling missing data
6 Descriptive Statistics
6.1 Calculating Characteristic Values
6.2 Plotting Histogram for the Distribution of Memory Bandwidth
6.3 Plotting Box Plots for the Distribution of Memory Bandwidth follows Qualitative Variables
6.4 Plotting Scatter Plot for the Linear Relationship between Memory_Bandwidth and Quantitative Variables
6.5 Plotting a Correlation Matrix to Check for Multicollinearity between Independent Variables
7 Inferential Statistics
7.1 Confidence Interval
7.2 Multiple Linear Regression Model
7.2.1 Model Building
7.2.2 Homoscedasticity Checking
7.2.3 Model Prediction
Assignment for Probability and Statistics - Academic Year 2023 - 2024 Page 1/33
University of Technology, Ho Chi Minh City
Faculty of Applied Science
2.2 Code R
3 Data Overview
Today, users' computing needs go well beyond calculation and running simple programs. Manufacturers are always trying to create systems that can handle a variety of tasks, such as research and design. Among these, graphics is a field of particular interest. The CPU's architecture is not designed to process such tasks efficiently; therefore, the Graphics Processing Unit (GPU) was created and has become one of the most important parts of most modern computer systems.
The GPU is a processor designed to accelerate the creation and rendering of images and videos. In the graphics field, the GPU processes huge chunks of visual data and translates them into images. With the increasing demand for high-resolution, realistic visual experiences in applications ranging from gaming to professional graphic design and virtual reality, the role of the GPU has become ever more central in both creating and rendering digital graphics.
To summarize, the vital role of GPUs in modern life cannot be denied. A strong GPU is one with high Memory Bandwidth. Therefore, the goal of this project is to analyze some technical specifications affecting the Memory Bandwidth of GPUs. The dataset we use in this project is the file "All_GPUs.csv", originally available at: Computer Parts (CPUs and GPUs) Dataset (Kaggle) by author Ilissek, containing information about various technical specifications, release dates, and launch prices of GPUs. The dataset is used to understand, analyze, and evaluate the factors that influence GPUs.
The dataset includes 34 variables and 3406 observations. However, as mentioned above, this analysis focuses on the technical specifications that affect the Memory Bandwidth. The data mainly pertains to Intel, AMD, and related companies involved in producing these components. The main variables selected are:
• Manufacturer: The company that manufactured the GPU, such as Nvidia or AMD.
• Dedicated: Indicates whether the GPU is a discrete (dedicated) card, as opposed to an integrated GPU.
• Core Speed (MHz): The base clock speed of the GPU operating at standard performance.
• Memory (MB): The memory capacity of the GPU represents its ability to store data.
• Memory Bus (Bit): The width of the data transfer pathway between the GPU and memory, which determines the data transmission capability.
• Memory Speed (MHz): The speed at which the GPU's memory operates.
• OpenGL: Support for important graphics APIs to run new applications and games.
• Pixel Rate (GPixel/s) and Texture Rate (GTexel/s): Measured to evaluate the
ability to process pixels and textures per second.
• TMUs (Texture Mapping Unit): These units are responsible for applying texture maps
(images) to 3D models during rendering.
• Memory Bandwidth (GB/sec): The ability to transfer data between the GPU and
memory.
4 Theory
4.1 Sampling Theory
4.1.1 Some Definitions
• The sample is a number of units selected from the population according to a certain
sampling method. The sample characteristics are used to infer the characteristics of the
overall population.
• Primary data is data that is collected directly from the research object according to the
requirements of the researcher.
• Secondary data is data from existing sources, which has usually been processed. It helps
the researcher save time, effort, and cost compared to collecting primary data. However, it
should be noted that this data may not always meet the more detailed requirements of the
research.
• Sample variance:
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2 \quad\text{and}\quad \hat{S}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2,$$
where $\hat{S}$ is called the uncorrected sample standard deviation, and the statistic $S$ is called the corrected sample standard deviation.
• Sample proportion: $F = \frac{M}{n}$ and $f \equiv p \equiv \frac{m}{n}$.
• The median: Assume a sample of size $n$ is arranged in increasing order of the values being observed: $x_1 \le x_2 \le \dots \le x_{n-1} \le x_n$.
If $n = 2k + 1$, then the sample median is $x_{k+1}$.
If $n = 2k$, then the sample median is $\frac{x_k + x_{k+1}}{2}$.
• Quartiles: The median divides the ordered data sample into 2 sets of equal size. The
median of the lower set of data is called the first quartile, Q1 (the lower quartile). The
median of the upper set of data is called the third quartile, Q3 (the upper quartile). The
second quartile, Q2 , is taken to be the median value.
• Outlier points: Also called anomalous points, aberrant points, or outliers. These are the elements of the sample whose values lie outside the range $(Q_1 - 1.5\,\mathrm{IQR};\ Q_3 + 1.5\,\mathrm{IQR})$, where $\mathrm{IQR} = Q_3 - Q_1$ is the interquartile range.
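The definitions of the median, quartiles, and outlier range above can be sketched in R as follows. This is only an illustration: the function names and the sample values are made up, and for odd n the sketch assumes the overall median is left out of both halves. R's built-in quantile() uses a different default rule, so its quartiles may differ slightly.

```r
# Sample median: the middle value (odd n) or the mean of the two middle values (even n).
sample_median <- function(x) {
  x <- sort(x)
  n <- length(x)
  if (n %% 2 == 1) x[(n + 1) / 2] else (x[n / 2] + x[n / 2 + 1]) / 2
}

# Quartiles as the medians of the lower and upper halves of the ordered sample.
# Assumption: for odd n the overall median belongs to neither half.
quartiles <- function(x) {
  x <- sort(x)
  n <- length(x)
  half <- n %/% 2
  lower <- x[seq_len(half)]
  upper <- x[(n - half + 1):n]
  c(Q1 = sample_median(lower), Q2 = sample_median(x), Q3 = sample_median(upper))
}

# Elements outside (Q1 - 1.5 * IQR; Q3 + 1.5 * IQR) are reported as outliers.
find_outliers <- function(x) {
  q <- quartiles(x)
  iqr <- q[["Q3"]] - q[["Q1"]]
  x[x < q[["Q1"]] - 1.5 * iqr | x > q[["Q3"]] + 1.5 * iqr]
}

x <- c(7, 1, 3, 100, 5, 2, 6, 4)  # made-up sample
print(quartiles(x))
print(find_outliers(x))
```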
• Scatter plots are used to visualize the relationship between two variables and to reveal distinct clusters in the data.
• Correlation Matrix: The correlation coefficients between variables are displayed in a
table called a correlation matrix. The association between two variables is displayed in
each cell of the table.
There are 3 assumptions that need to be met when fitting an MLR model:
• Normality: The residuals (the differences between the observed values and the predicted values) follow a normal distribution.
• No multicollinearity: The independent variables in the regression model are not highly correlated with each other. Multicollinearity occurs when two or more independent variables are strongly correlated, making it difficult to determine their individual effects on the dependent variable.
• Linearity: The relationship between the independent variables and the dependent variable should be linear. This means that the expected value of the dependent variable changes in a straight line as the independent variables change, holding other variables constant.
The general equation of MLR:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$$
where:
• $Y$ is the dependent variable.
• $X_i$ is the $i$-th independent variable.
• $\beta_0$ is the intercept, the expected value of $Y$ when all $X_i$ are zero.
• $\beta_i$ is the coefficient of each $X_i$.
• $\varepsilon$ is the independent error term of the model.
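As a minimal sketch of the MLR equation above, the following R code fits a model with two predictors on synthetic data. The data, the "true" coefficients (2, 3, and -1.5), and the variable names are assumptions chosen for illustration; they are not taken from the GPU dataset.

```r
set.seed(1)                       # reproducible synthetic data
n  <- 200
X1 <- runif(n)
X2 <- runif(n)
# Assumed true model for illustration: beta0 = 2, beta1 = 3, beta2 = -1.5
Y  <- 2 + 3 * X1 - 1.5 * X2 + rnorm(n, sd = 0.1)

fit <- lm(Y ~ X1 + X2)            # least-squares estimates of beta0, beta1, beta2
print(coef(fit))

# Quick look at two of the assumptions listed above
res <- residuals(fit)
print(shapiro.test(res)$p.value)  # normality of the residuals
print(cor(X1, X2))                # correlation between the two predictors
```

The estimated coefficients should land close to the assumed values, since the noise in this toy example is small.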
– Testing rule: The set of possible values of the testing standard $T$ is divided into two parts, $R_\alpha$ and $\bar{R}_\alpha$, where $R_\alpha$ is the rejection region and the rest, $\bar{R}_\alpha$, is the acceptance region of $H_0$. $H_0$ is rejected when the observed value of $T$ falls in $R_\alpha$.
– Type I and Type II errors: With the testing rule above, two types of errors can be made:
∗ (i) Type I error: rejecting a true hypothesis. The probability of committing a Type I error is exactly the significance level $\alpha$. Type I errors can arise from causes such as a sample size that is too small, the sampling method...
∗ (ii) Type II error: accepting a false hypothesis. The probability of a Type II error, $\beta$, is defined as follows: $P(T \notin R_\alpha \mid H_1) = \beta$
– Procedure for statistical hypothesis testing: Based on the above content, we can build a procedure for statistical hypothesis testing including the following steps:
∗ (i) State the null hypothesis $H_0$ and the alternative hypothesis $H_1$.
∗ (iii) Choose the testing standard $T$ and determine the probability distribution law of $T$ under the condition that the null hypothesis $H_0$ is true.
∗ (iv) Based on the probability distribution law of $T$, find the rejection region $R_\alpha$ such that: $P(T \in R_\alpha \mid H_0) = \alpha$
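The rejection region and the two error probabilities can be illustrated with a short R sketch. It assumes a two-sided test whose statistic T is standard normal under H0 and normal with mean delta under H1; the value of delta and all variable names are hypothetical.

```r
alpha  <- 0.05
z_crit <- qnorm(1 - alpha / 2)    # two-sided critical value

# Rejection region R_alpha = {T < -z_crit} or {T > z_crit}; T ~ N(0, 1) under H0 (assumption)
in_rejection_region <- function(t) abs(t) > z_crit

# P(T in R_alpha | H0): probability of a Type I error
type1 <- pnorm(-z_crit) + pnorm(z_crit, lower.tail = FALSE)

# P(T not in R_alpha | H1): probability of a Type II error when T ~ N(delta, 1) under H1
delta <- 2                        # hypothetical shift under H1
type2 <- pnorm(z_crit - delta) - pnorm(-z_crit - delta)

print(c(type1 = type1, type2 = type2))
```

The first value reproduces the chosen significance level, which is the defining property of the rejection region; the second depends on how far H1 is from H0.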
5 Data Preprocessing
5.1 Data Importing: "All_GPUs.csv"
Code:
# Read the raw dataset and inspect its structure (column names and types)
All_GPUs <- read.csv("~/Downloads/HK232/Probability _ Statistics (XSTK)/BTL/All_GPUs.csv")
str(All_GPUs)
Output:
The dataset includes 3406 observations with 34 different parameters related to the GPU. The
variables consist of different data types:
• chr (character): Textual data, which is used for categorical or descriptive information.
Examples include Architecture, Best_Resolution, Name, ...
• int (integer): Integer numbers, used for counts or other numerical data that do not have
decimals. Some variables such as HDMI_Connection and DVI_Connection are of this type.
• num (numeric): This can include real numbers (integers and floating-point). For instance,
Open_GL and PSU are numeric.
• Some columns, like Direct_X and SLI_Support, contain factors or ordered data represented
as characters.
• NA (Not Available) appears in several places, indicating missing or unavailable data for
those entries.
Based on initial observations of the dataset, the team did not choose qualitative variables such as Architecture, Name, PSU, ROPs, ... because they have many categories, their missing rate is high, and there is no basis for filling in the missing values. In addition, quantitative variables such as Boost_Clock, DisplayPort_Connection, Release_Price, ... are ignored since their missing rate is considerably high (more than 50%), which would greatly affect the overall results of the analysis. Therefore, drawing also on articles about factors affecting the Memory Bandwidth, the variables below are those that suit the main purpose of the analysis and satisfy the two conditions above. The articles are available at: Computing GPU memory bandwidth with Deep Learning Benchmarks and What Is Memory Bandwidth? Complete Guide
Figure 2 shows a table view of the data frame "Extract_Data", with a few rows of the data mentioned above. "Extract_Data" is a structured subset of "All_GPUs" consisting of 11 specific columns that are relevant to the analysis. The data contains a mixture of numeric and character types, with some missing values that need to be addressed before further analysis.
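The extraction step can be sketched as below. The toy "All_GPUs" data frame and all of its values are made up so that the example runs on its own; only the 11 column names follow the variables used in this report, and the report's actual extraction code may differ.

```r
# Toy stand-in for All_GPUs: values are invented, only the column names follow the report.
All_GPUs <- data.frame(
  Name = c("GPU A", "GPU B"), Architecture = c("X", "Y"),
  Manufacturer = c("Nvidia", "AMD"), Dedicated = c("Yes", "No"),
  Core_Speed = c("738 MHz", "- "), Memory = c("1024 MB", "512 MB"),
  Memory_Bus = c("256 Bit", "64 Bit"), Memory_Speed = c("1000 MHz", "800 MHz"),
  Open_GL = c(4.5, 3.3), Pixel_Rate = c("12 GPixel/s", "3 GPixel/s"),
  TMUs = c(64, 16), Texture_Rate = c("47 GTexel/s", "13 GTexel/s"),
  Memory_Bandwidth = c("64GB/sec", "25.6GB/sec"),
  stringsAsFactors = FALSE
)

# The 11 columns kept for the analysis
keep <- c("Manufacturer", "Dedicated", "Core_Speed", "Memory", "Memory_Bus",
          "Memory_Speed", "Open_GL", "Pixel_Rate", "TMUs", "Texture_Rate",
          "Memory_Bandwidth")
Extract_Data <- All_GPUs[, keep]
str(Extract_Data)
```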
Based on the output in the previous section, some variables in the subset "Extract_Data" have the wrong data type. Only the qualitative variable "Manufacturer", which contains the names of GPU manufacturers, has the correct type ("character"); the remaining quantitative variables, which contain GPU parameters, are stored as text. To facilitate data processing, we delete the characters representing the units of these variables and then convert them to the correct type, "numeric".
This conversion is essential for any numeric computation or statistical analysis, since arithmetic operations cannot be performed on data in character format. However, this code assumes that all values in these columns are properly formatted with the units at the end. If there are missing values or inconsistencies in formatting, the "as.numeric()" function will return "NA" for those entries, and additional data cleaning may be required.
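One possible way to strip trailing units and convert to numeric is sketched below. The helper name strip_unit and the sample strings are hypothetical; the regular expression assumes the unit consists only of letters and "/" at the end of the value, as in "738 MHz" or "64GB/sec".

```r
# Remove a trailing unit made of letters and "/" (e.g. " MHz", "GB/sec"), then convert.
# Entries that are still not numbers after stripping become NA.
strip_unit <- function(x) {
  suppressWarnings(as.numeric(sub("\\s*[A-Za-z/]+\\s*$", "", x)))
}

raw <- c("738 MHz", "64GB/sec", "25.6GB/sec", "- ", NA)  # made-up examples
print(strip_unit(raw))
```

Badly formatted entries such as "- " and missing entries both end up as NA, which is why the missing-data check comes after this step.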
Figure 4: Checking the total number and the rate of missing data
After formatting, we have to check for missing data so that we can handle it and obtain accurate results. There are several ways to handle missing values:
• Replace the missing value with the mean or median of that column.
Columns like "Core_Speed", "Memory", and "Pixel_Rate" have a high number of "NA" values,
with "Core_Speed" having 936 missing values, which is quite significant. Other columns such
as "Manufacturer", "Open_GL", and "Dedicated" have fewer "NA" values, with "Dedicated"
having no missing values at all.
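Counting missing values and their rate per column can be sketched as follows. The toy data frame is made up for illustration, and the report's own code for Figure 4 may differ.

```r
# Toy data frame with a few missing entries (invented values)
toy <- data.frame(
  Core_Speed = c(738, NA, 900, NA),
  Memory     = c(1024, 512, NA, 2048),
  Dedicated  = c("Yes", "No", "Yes", "Yes")
)

na_count <- colSums(is.na(toy))    # number of NA values per column
na_rate  <- colMeans(is.na(toy))   # share of NA values per column
print(data.frame(na_count, na_rate_percent = 100 * na_rate))
```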
Code:
# Impute Core_Speed with the column mean
Extract_Data$Core_Speed[is.na(Extract_Data$Core_Speed)] <-
  mean(Extract_Data$Core_Speed, na.rm = TRUE)
# Impute the remaining columns with the column median
Extract_Data$Memory[is.na(Extract_Data$Memory)] <-
  median(Extract_Data$Memory, na.rm = TRUE)
Extract_Data$Pixel_Rate[is.na(Extract_Data$Pixel_Rate)] <-
  median(Extract_Data$Pixel_Rate, na.rm = TRUE)
Extract_Data$TMUs[is.na(Extract_Data$TMUs)] <-
  median(Extract_Data$TMUs, na.rm = TRUE)
Extract_Data$Texture_Rate[is.na(Extract_Data$Texture_Rate)] <-
  median(Extract_Data$Texture_Rate, na.rm = TRUE)
# Drop rows that still contain NA, then count the NA values left in each column
Extract_Data <- na.omit(Extract_Data)
apply(is.na(Extract_Data), 2, sum)
Output:
The top part of Figure 6 shows a segment of the data frame containing the GPU specifications. Notably, the "Core_Speed" for entries 2 through 5 (all AMD GPUs) has been set to approximately 946.8939 MHz, which indicates that the missing values have been replaced with the mean "Core_Speed" of the dataset. The bottom part of the figure is the console output after executing the "apply()" function. The output shows 0 "NA" values in each of the listed columns, confirming that the data frame no longer has any missing values in these columns, which is essential for analyses that cannot handle "NA" values.
In conclusion, the "NA" values in the selected columns were replaced with the mean or median, and the rows that still contained "NA" values were removed; the figure shows the end result of these data cleaning steps. Now that the data frame has been cleansed of "NA" values, it is ready for analysis without the risk of NA-related computation errors.
6 Descriptive Statistics
6.1 Calculating Characteristic Values
Code:
# Keep only the quantitative variables
Desc_Data <- Extract_Data[, c("Core_Speed", "Memory", "Memory_Bus", "Memory_Speed",
                              "Open_GL", "Pixel_Rate", "TMUs", "Texture_Rate",
                              "Memory_Bandwidth")]
# Characteristic values of each column
Mean <- apply(Desc_Data, 2, mean)
Standard_Deviation <- apply(Desc_Data, 2, sd)
Median <- apply(Desc_Data, 2, median)
Min <- apply(Desc_Data, 2, min)
Max <- apply(Desc_Data, 2, max)
Q1 <- apply(Desc_Data, 2, quantile, probs = 0.25)
Q3 <- apply(Desc_Data, 2, quantile, probs = 0.75)
# One column per variable, one row per statistic
t(data.frame(Mean, Standard_Deviation, Median, Min, Max, Q1, Q3))
Output:
The output provides descriptive statistics for the GPU specifications. The trends for each variable can be interpreted as follows:
• Core_Speed: There is a significant range in core speeds from 100 to 1784 MHz, with
the average being around 946 MHz. This suggests that the dataset includes a wide variety
of GPU models, from lower-end to high-performance ones. The relatively high standard
deviation indicates that core speed values are quite spread out.
• Open_GL: The minimum and maximum values are very close, and the median matches
the mean closely, suggesting a less variable and more consistent support level for the
OpenGL standard across GPUs in the dataset.
• Memory: The wide range from 16 MB to 32000 MB, with a large standard deviation,
indicates a significant diversity in memory capacities, possibly due to a mix of older and
newer GPU models.
• Memory_Bus: The range from 32 Bit to 8192 Bit and a high mean suggests that the
dataset contains both older GPUs with narrower buses and modern GPUs with wider buses,
which significantly impact overall memory bandwidth and performance.
• Memory_Speed: There is also a wide range in memory speeds, from 110 MHz to 2127
MHz. The average speed is about 1180 MHz, but the standard deviation is quite large,
indicating diverse performance capabilities.
• Pixel_Rate: With a range from 1 to 384 GPixel/s, the dataset spans GPUs that are possibly several generations apart. The mean is relatively low compared to the maximum, suggesting that most GPUs have modest pixel processing capabilities, with a few high-performance outliers.
• TMUs (Texture Mapping Units): The TMUs range from 1 to 384, which is a broad
spread, reflecting varied texture processing powers within the dataset. The higher mean
relative to the median indicates there are some GPUs with very high TMU counts pulling
the average up.
• Memory_Bandwidth: There’s a substantial range from 1 to 1280 GB/sec, indicating a
significant variation in memory data transfer rates. The average bandwidth is somewhat
low compared to the maximum value, suggesting that the majority of GPUs have moderate
bandwidth with some exceptional high-end models.
Overall, the trends suggest a dataset comprising a wide array of GPUs, from basic to high-end models, which is reflected in the substantial variability and range of values in the core speed, memory specifications, and processing rates. The high standard deviations of many of the variables indicate that the data are spread out over a wide range of values and not clustered around the mean. This variety is typical for a dataset that includes multiple generations and levels of GPU hardware.
Code:
# Frequency tables for the two qualitative variables
table(Extract_Data$Manufacturer)
table(Extract_Data$Dedicated)
Output:
The histogram shows the frequency distribution of the Memory_Bandwidth variable, measured in GB/sec. The distribution is skewed to the right, indicating that most of the GPUs in the dataset have a lower memory bandwidth, with a decreasing number of GPUs as the memory bandwidth increases.
There are a few key observations:
• Concentration in Lower Bins: A large concentration of GPUs have a memory bandwidth of 200 GB/sec or less. This might suggest common lower-end or older models.
• Long Tail to the Right: There are fewer GPUs with higher memory bandwidths, but they do exist up to 1200 GB/sec, which is the farthest bin on the right. This indicates some high-performance GPUs are present in the data.
• Possible Outliers: There are a few GPUs with very high memory bandwidth (800 GB/sec and above), which could be considered outliers compared to the rest of the data.
• Bins and Range: The range of "Memory_Bandwidth" values is divided into 20 equally
sized intervals. However, many of these intervals, especially at the higher end, have very
few or no GPUs, leading to a sparse right tail.
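A histogram with exactly 20 equally sized intervals, as described above, can be sketched in R like this. The data are synthetic right-skewed values standing in for Memory_Bandwidth, and the way the breaks are built is an assumption; the report's own plotting code may differ.

```r
set.seed(42)
# Synthetic right-skewed values standing in for Memory_Bandwidth (not the report's data)
bw <- rexp(1000, rate = 1 / 150)

# 21 equally spaced edges give exactly 20 equal-width bins
breaks <- seq(0, ceiling(max(bw)), length.out = 21)
h <- hist(bw, breaks = breaks,
          xlab = "Memory_Bandwidth (GB/sec)",
          main = "Histogram of Memory_Bandwidth (synthetic)")
print(h$counts)  # number of observations in each of the 20 bins
```

Passing explicit break points matters here because a single number given to breaks is treated by hist() only as a suggestion.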
These box plots visualize the memory bandwidth (in GB/s) across different GPU manufac-
turers and whether the GPUs are dedicated or not. A few observations:
• Among the manufacturers, AMD and Nvidia have significantly higher memory bandwidth
compared to Arm, ATI, and Intel GPUs.
• Nvidia GPUs appear to have a wider range of memory bandwidth values, with some high-
end models reaching over 1000 GB/s.
• There is a distinct difference in memory bandwidth between dedicated GPUs and non-
dedicated GPUs, with dedicated GPUs having substantially higher bandwidth capabilities,
as expected.
• The boxes and whiskers indicate the distributions of memory bandwidth within each cat-
egory, allowing for comparison of central tendencies and spread.
• For the manufacturer box plots, there are several outliers visible above the main distri-
bution for AMD, Nvidia, and Intel GPUs. These likely represent high-end or specialized
GPU models from those manufacturers that have exceptionally high memory bandwidth
compared to their typical product lines. On the dedicated GPU box plot, there is one
significant outlier below the main distribution for non-dedicated GPUs. This could repre-
sent a lower-end or older integrated GPU model with relatively poor memory bandwidth
performance. Outliers in box plots can provide insight into extreme values or potential
anomalies in the data. Their presence indicates there are some GPU configurations that
deviate substantially from the typical ranges seen for each category. Examining these outlier
models specifically could reveal interesting technical factors behind their outlier memory
bandwidth capabilities.
Overall, these box plots effectively highlight the performance differences in terms of memory
bandwidth across GPU manufacturers and dedicated vs. non-dedicated GPU configurations.
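Box plots of this kind can be drawn with the formula interface of boxplot(). The sketch below uses a small synthetic data frame with made-up values; only the column names follow the report.

```r
set.seed(7)
# Synthetic data: three manufacturers with different typical bandwidths (invented values)
toy <- data.frame(
  Manufacturer = rep(c("AMD", "Intel", "Nvidia"), each = 30),
  Memory_Bandwidth = c(rexp(30, 1 / 150), rexp(30, 1 / 30), rexp(30, 1 / 180))
)

bp <- boxplot(Memory_Bandwidth ~ Manufacturer, data = toy,
              xlab = "Manufacturer", ylab = "Memory_Bandwidth (GB/sec)")
print(bp$names)  # group labels, in plotting order
print(bp$out)    # points drawn individually beyond the whiskers
```

The points in bp$out are the ones that fall beyond the whiskers and are drawn as individual outliers in the plot.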
6.4 Plotting Scatter Plot for the Linear Relationship between Memory_Bandwidth and Quantitative Variables
Code:
plot(Extract_Data$Core_Speed, Extract_Data$Memory_Bandwidth,
     xlab = "Core_Speed (MHz)", ylab = "Memory_Bandwidth (GB/sec)", pch = 1)
plot(Extract_Data$Memory, Extract_Data$Memory_Bandwidth,
     xlab = "Memory (MB)", ylab = "Memory_Bandwidth (GB/sec)", pch = 1)
plot(Extract_Data$Memory_Bus, Extract_Data$Memory_Bandwidth,
     xlab = "Memory_Bus (Bit)", ylab = "Memory_Bandwidth (GB/sec)", pch = 1)
plot(Extract_Data$Memory_Speed, Extract_Data$Memory_Bandwidth,
     xlab = "Memory_Speed (MHz)", ylab = "Memory_Bandwidth (GB/sec)", pch = 1)
plot(Extract_Data$Open_GL, Extract_Data$Memory_Bandwidth,
     xlab = "Open_GL", ylab = "Memory_Bandwidth (GB/sec)", pch = 1)
plot(Extract_Data$Pixel_Rate, Extract_Data$Memory_Bandwidth,
     xlab = "Pixel_Rate (GPixel/s)", ylab = "Memory_Bandwidth (GB/sec)", pch = 1)
plot(Extract_Data$TMUs, Extract_Data$Memory_Bandwidth,
     xlab = "TMUs", ylab = "Memory_Bandwidth (GB/sec)", pch = 1)
plot(Extract_Data$Texture_Rate, Extract_Data$Memory_Bandwidth,
     xlab = "Texture_Rate (GTexel/s)", ylab = "Memory_Bandwidth (GB/sec)", pch = 1)
Output:
Scatter Plot 1: displays the relationship between core speed and memory bandwidth. The data points appear to form a rough positive correlation, indicating that GPUs with higher core speeds tend to have higher memory bandwidth capabilities as well. However, there is substantial variation in the data, with some GPUs having relatively low memory bandwidth despite their high core speeds, and vice versa. The plot also shows a dense cluster of data points around the core speed range of 800-1200 MHz, suggesting that this is a common range for the GPUs represented in the data.
Scatter Plot 2: shows the relationship between the memory and the memory bandwidth.
The data points appear to form a distinct pattern, with a dense cluster of points at the lower
end of the memory range (around 0 to 4,000 MB) and relatively high memory bandwidth values
(around 200-600 GB/sec). As the memory increases beyond 4,000 MB, the data points become
more sparse and spread out, with memory bandwidth values ranging from very low to relatively
high, indicating a weaker correlation between memory size and bandwidth for systems with larger
memory capacities. The plot has a significant number of outliers, especially at higher memory
sizes, where some data points have unexpectedly low or high bandwidth values compared to the
general trend.
Scatter Plot 3: displays the relationship between the memory bus and the memory band-
width. The data points form a distinct pattern, with a dense cluster at the lower end of the mem-
ory bus range (around 0 to 1,000 bits) and relatively high memory bandwidth values (around
200-600 GB/sec). As the memory bus width increases beyond 1,000 bits, the data points be-
come more sparse and spread out vertically, indicating a weaker correlation between bus width
and bandwidth for systems with wider memory buses. There are several outliers, particularly at
higher bus widths, where some points have unexpectedly low or high bandwidth values compared
to the general trend.
Scatter Plot 4: shows the relationship between memory speed and memory bandwidth.
There appears to be a positive correlation between the two variables, with higher memory
speeds generally corresponding to higher memory bandwidth values. However, the correlation
is not perfectly linear, and there is substantial variation in the data points. The data points form
a somewhat triangular or fan-shaped pattern, with a dense cluster of points at lower memory
speeds (around 500-800 MHz) and a wider spread of bandwidth values as the memory speed
increases. This suggests that while higher memory speeds tend to enable higher bandwidths,
other factors likely influence the actual bandwidth achieved. There are also several outliers, par-
ticularly at higher memory speeds, where some points have unexpectedly low or high bandwidth
values compared to the general trend.
Scatter Plot 5: depicts the relationship between OpenGL and memory bandwidth. The
data points appear to form distinct horizontal bands or clusters, suggesting that memory band-
width tends to group around certain values corresponding to specific OpenGL versions. Notably,
there is a dense cluster of points around the OpenGL version 2.0, with varying but generally
lower memory bandwidth values. As the OpenGL version increases beyond 2.0, there are sep-
arate clusters with progressively higher memory bandwidth capabilities. However, within each
cluster corresponding to a specific OpenGL version, there is still some variation in memory
bandwidth, indicating that other factors beyond just the OpenGL version likely influence the
achievable memory performance. It’s interesting to note the presence of a few outliers, partic-
ularly at higher OpenGL versions, where some data points exhibit unexpectedly low or high
memory bandwidth compared to the general trend for that version.
Scatter Plot 6: appears to show the relationship between TMUs and memory bandwidth. The data points are densely clustered at lower TMU values, indicating a large number of devices or configurations with relatively low performance. As the TMU value increases, the data points become more sparse, suggesting fewer devices or configurations achieve higher levels of performance. There are also a few distinct outlier points at very high TMU and bandwidth values, potentially representing high-end or specialized systems optimized for memory performance.
Scatter Plot 7: appears to show the relationship between texture rate and memory band-
width. The data points form a distinct pattern resembling an elongated cluster with a tail ex-
tending towards higher texture rates. This shape suggests that as the texture rate increases, the
corresponding memory bandwidth tends to plateau or reach a maximum level, with fewer data
points achieving very high bandwidth at the highest texture rates. There are also several outlier
points scattered above the main cluster, indicating configurations that achieve higher memory
bandwidth than typical for their texture rate. These could represent specialized or optimized
GPU setups.
The correlation coefficients for each pair of variables show that Texture_Rate has a relatively strong linear relationship with Memory, Pixel_Rate, and TMUs, with r of 0.81, 0.89, and 0.81, respectively. However, because r is not greater than 0.9, we temporarily keep these variables when building the multivariate regression model. The correlation coefficients for the remaining pairs of variables show no strong linear relationship. Therefore, the independent variables we consider satisfy the condition that no multicollinearity occurs.
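The check described above, computing the correlation matrix and looking for pairs with |r| above 0.9, can be sketched as follows. The data are synthetic and the values are invented; only the variable names and the 0.9 threshold come from the text.

```r
set.seed(3)
n <- 300
# Synthetic predictors: Texture_Rate is built to correlate with TMUs, Core_Speed is independent
TMUs         <- rexp(n, 1 / 50)
Texture_Rate <- TMUs + rnorm(n, sd = 20)
Core_Speed   <- rnorm(n, mean = 950, sd = 200)
toy <- data.frame(TMUs, Texture_Rate, Core_Speed)

R <- cor(toy)                      # Pearson correlation matrix
print(round(R, 2))

# List each pair of predictors whose |r| exceeds the 0.9 threshold (upper triangle only)
high <- which(abs(R) > 0.9 & upper.tri(R), arr.ind = TRUE)
print(data.frame(var1 = rownames(R)[high[, "row"]],
                 var2 = colnames(R)[high[, "col"]]))
```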
7 Inferential Statistics
7.1 Confidence Interval
PROBLEM: Construct a 95% two-sided confidence interval for the mean of memory bandwidth based on the above dataset.
First, we have to check whether the Memory_Bandwidth variable follows a Normal Distribution. There are 2 methods that we can use in this situation:
Based on the Q-Q Plot in Figure 13, the observations deviate greatly from the reference line
of the Normal Distribution, which suggests that Memory_Bandwidth does not follow the Normal
Distribution. However, a visual check of this kind is not always reliable, so we use the
Shapiro-Wilk test to reach a more accurate conclusion.
As in other testing problems, we first formulate the hypotheses:
• H0 : Memory_Bandwidth follows the Normal Distribution.
• H1 : Memory_Bandwidth does not follow the Normal Distribution.
From the output of the test, the p-value ($p < 2.2 \times 10^{-16}$) is far smaller than the
significance level $\alpha = 0.05$. Hence, we reject H0, which means Memory_Bandwidth does not
follow the Normal Distribution.
=⇒ Thus, this is the problem of finding a confidence interval for the mean of
Memory_Bandwidth, which follows an arbitrary distribution, with a large sample size (n > 30).
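As a rough illustration of the two normality checks, the sketch below runs a Q-Q plot and the Shapiro-Wilk test on a simulated right-skewed sample. The vector x is a hypothetical stand-in for Extract_Data$Memory_Bandwidth, so the printed p-value will not match the report's output.

```r
# Simulated, right-skewed stand-in for Memory_Bandwidth (an assumption).
set.seed(42)
x <- rlnorm(500, meanlog = 4, sdlog = 0.8)

# Visual check: points far from the red line suggest non-normality.
qqnorm(x)
qqline(x, col = "red")

# Formal check: H0 is that the sample comes from a Normal Distribution.
sw <- shapiro.test(x)
print(sw)

alpha <- 0.05
reject_normality <- sw$p.value < alpha  # TRUE means H0 is rejected
print(reject_normality)
```

Note that shapiro.test() in base R only accepts between 3 and 5000 observations, so a larger dataset would need a subsample or a different test.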
Next, we calculate the required sample statistics: the sample size, the sample mean, and the
sample standard deviation.
Code:
n <- length(Extract_Data$Memory_Bandwidth)     # sample size
x_bar <- mean(Extract_Data$Memory_Bandwidth)   # sample mean
s <- sd(Extract_Data$Memory_Bandwidth)         # sample standard deviation
data.frame(n, x_bar, s)
Output:
After that, we calculate the margin of error using the formula
$\varepsilon = z_{\alpha/2} \cdot \dfrac{s}{\sqrt{n}}$,
where $z_{\alpha/2}$ is the z-score corresponding to the desired level of confidence, $s$ is the
sample standard deviation, and $n$ is the sample size.
Code:
Error <- qnorm(p = 0.05/2, lower.tail = FALSE) * s/sqrt(n)   # upper-tail z-score times standard error
print(Error)
Output:
Finally, we obtain the confidence interval for the mean of the Memory_Bandwidth variable as
follows:
Code:
data.frame(Left_Bound = x_bar - Error, Right_Bound = x_bar + Error)
Output:
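For readers who want to reproduce the whole procedure end to end, the following sketch applies the same large-sample formula to a simulated skewed sample. The vector x and the names z, margin and ci are assumptions made for illustration; the bounds it prints are not the report's results.

```r
# Simulated skewed sample standing in for Memory_Bandwidth (an assumption).
set.seed(7)
x <- rexp(400, rate = 1 / 120)

n     <- length(x)  # sample size (large, n > 30)
x_bar <- mean(x)    # sample mean
s     <- sd(x)      # sample standard deviation

conf_level <- 0.95
alpha <- 1 - conf_level
z <- qnorm(alpha / 2, lower.tail = FALSE)  # upper-tail critical value z_{alpha/2}

margin <- z * s / sqrt(n)                  # margin of error
ci <- c(Left_Bound = x_bar - margin, Right_Bound = x_bar + margin)
print(ci)
```

Because the sample is large, the normal critical value is used even though the data are not normal; for small samples from a normal population one would use qt() instead.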
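From the output of the model, we have the following estimated coefficients $\hat{\beta}_i$:
$\hat{\beta}_0 = -1.699 \times 10^{1}$, $\hat{\beta}_1 = -3.999 \times 10^{1}$, $\hat{\beta}_2 = -3.610 \times 10^{1}$,
$\hat{\beta}_3 = -1.976 \times 10^{1}$, $\hat{\beta}_4 = 1.956 \times 10^{1}$, $\hat{\beta}_5 = -9.813 \times 10^{-2}$,
$\hat{\beta}_6 = 3.362 \times 10^{-3}$, $\hat{\beta}_7 = 9.342 \times 10^{-2}$, $\hat{\beta}_8 = 3.556 \times 10^{-2}$,
$\hat{\beta}_9 = 1.448 \times 10^{1}$, $\hat{\beta}_{10} = 6.777 \times 10^{-1}$, $\hat{\beta}_{11} = 4.120 \times 10^{-2}$,
$\hat{\beta}_{12} = 1.018$.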
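From there, we can obtain the estimated regression equation:
\[
\begin{aligned}
\widehat{\text{Memory\_Bandwidth}} ={}& -1.699 \times 10^{1} - 3.999 \times 10^{1} \cdot \text{ManufacturerATI} - 3.610 \times 10^{1} \cdot \text{ManufacturerIntel} \\
& - 1.976 \times 10^{1} \cdot \text{ManufacturerNvidia} + 1.956 \times 10^{1} \cdot \text{DedicatedYes} - 9.813 \times 10^{-2} \cdot \text{Core\_Speed} \\
& + 3.362 \times 10^{-3} \cdot \text{Memory} + 9.342 \times 10^{-2} \cdot \text{Memory\_Bus} + 3.556 \times 10^{-2} \cdot \text{Memory\_Speed} \\
& + 1.448 \times 10^{1} \cdot \text{Open\_GL} + 6.777 \times 10^{-1} \cdot \text{Pixel\_Rate} + 4.120 \times 10^{-2} \cdot \text{TMUs} + 1.018 \cdot \text{Texture\_Rate}
\end{aligned}
\]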
After that, we check whether each variable affects the changes in Memory_Bandwidth by
testing the regression coefficients:
Here, we build a second multiple linear regression model by removing the TMUs variable
from the first one.
Code:
Figure 19: The result of the Second Multiple Linear Regression Model
The adjusted coefficient of determination (adjusted $R^2$) is 0.9138, indicating that about 91.38%
of the variation in Memory_Bandwidth is explained by the independent variables in the model.
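Dropping one predictor and comparing the reduced model with the full one can be sketched as follows. The data frame d, its columns, and the object names model1, model2 and cmp are hypothetical; the sketch uses update() and a partial F-test via anova(), which is one common way to compare nested models and is not necessarily the exact procedure behind the report's figures.

```r
# Simulated data: 'noise' plays the role of a predictor with no real effect.
set.seed(3)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
d$y <- 5 + 2 * d$x1 - 3 * d$x2 + rnorm(n)

model1 <- lm(y ~ x1 + x2 + noise, data = d)  # full model
model2 <- update(model1, . ~ . - noise)      # reduced model without 'noise'

# Adjusted R-squared penalises extra predictors, so it suits this comparison.
adj1 <- summary(model1)$adj.r.squared
adj2 <- summary(model2)$adj.r.squared
print(c(Model1 = adj1, Model2 = adj2))

# Partial F-test: a large p-value means the dropped term adds little.
cmp <- anova(model2, model1)
print(cmp)
```

If the two adjusted R-squared values are nearly equal and the F-test is not significant, the simpler model is usually preferred.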
We then compare the effectiveness of the two models with the following test:
• H0 : Model 2 is more effective than Model 1.
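$\ldots \; 3.264 \times 10^{-3} \cdot \text{Memory} + 9.538 \times 10^{-2} \cdot \text{Memory\_Bus} + 3.680 \times 10^{-2} \cdot \text{Memory\_Speed} + 1.423 \times 10^{1} \cdot \text{Open\_GL} + 6.643 \times 10^{-1} \cdot \text{Pixel\_Rate} + 1.040 \cdot \text{Texture\_Rate}$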
• Based on p-value: The smaller the p-value, the stronger the evidence that the variable
has an effect on Memory_Bandwidth. On this basis, Core_Speed, ManufacturerATI,
ManufacturerIntel, and ManufacturerNvidia show the strongest evidence of an effect
on Memory_Bandwidth, followed by Memory, Memory_Speed, ...
• The residual errors follow Normal Distribution with an expectation of 0 and a constant
variance.
Code:
par(mfrow=c(2,2), mar=c(4,4,2,1))   # arrange the four diagnostic plots in a 2 x 2 grid
plot(MLR_Model2)
Output:
Plot 1: Shows the Residuals corresponding to the Fitted values, which is used to check
the following assumptions:
• Y and the independent variables X have a linear relationship.
• The residual errors have an expectation of 0.
• The variance of the residual errors is constant.
The results show that:
• The red line is nearly a straight horizontal line, so the assumption that Y and the
independent variables X have a linear relationship is satisfied.
• The red line is close to the line y = 0, so the assumption that the errors have an
expectation of 0 is satisfied.
• The residuals are not randomly scattered around the red line, so the assumption that the
error variance is constant is not satisfied.
Plot 2: Shows the Standardized Residuals, used to check the assumption that the errors
follow Normal Distribution.
Based on the results, many points deviate from the expected normal distribution line, so the
assumption that the errors are normally distributed is not satisfied.
Plot 3: Shows the square root of the absolute Standardized Residuals against the fitted values,
which is used to check the assumption that the variance of the errors is constant.
Based on the results, the points are not randomly dispersed around the red line, so the
assumption that the variance of the errors is constant is not really satisfied.
Plot 4: Shows the highly influential points in the dataset, specifically points 950, 60, and 62.
However, only point 62 lies beyond the Cook's distance boundary. Therefore, we need to remove
point 62 from the dataset.
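Identifying and removing an influential observation can be sketched as below. The data are simulated with one deliberately extreme point, and the names d, fit, cd, worst and fit_clean are hypothetical; the sketch simply drops the observation with the largest Cook's distance, which is an assumed rule for illustration rather than the exact criterion read off Plot 4.

```r
# Simulated data with one artificial high-leverage outlier in the last row.
set.seed(11)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
x[n] <- 8      # far from the other x values (high leverage)
y[n] <- -20    # far from the true line (large residual)
d <- data.frame(x, y)

fit <- lm(y ~ x, data = d)
cd <- cooks.distance(fit)       # influence of each observation on the fit
worst <- unname(which.max(cd))  # row index of the most influential point
print(worst)

# Refit without that observation and compare the slope estimates.
fit_clean <- lm(y ~ x, data = d[-worst, ])
print(rbind(with_outlier = coef(fit), without_outlier = coef(fit_clean)))
```

Comparing the two sets of coefficients shows how much a single point can move the fitted line, which is the reason for removing it before interpreting the model.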
Figure 22: The Scatter Plot shows the relationship of Predicted and Real Memory_Bandwidth
The line plotted in the graph is (d): y = x. The more closely the points cluster around this line,
the more accurate the model's predictions. To be more specific, we show the first 10 observed
values alongside their predictions:
Code:
Pred <- data.frame(predict(MLR_Model2, newdata = Extract_Data))   # predicted values
Compare <- cbind(Extract_Data$Memory_Bandwidth, Pred)             # observed next to predicted
colnames(Compare) <- c("Memory_Bandwidth", "Prediction")
head(Compare, 10)
Output:
Memory_Bandwidth   Prediction
            64.0    79.677021
           106.0    55.851926
            51.2    26.993941
            36.8    21.903033
            22.4     1.457792
            35.2    19.022765
           134.4    80.299337
            51.2    21.044412
           160.0   129.703299
             2.9    29.706407
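To put a number on how far these ten predictions are from the observed values, one could compute simple error summaries. The sketch below re-enters the ten rows shown above by hand; the names observed, predicted, errors, mae and rmse are illustrative, and the summaries describe only these ten rows, not the whole dataset.

```r
# The ten observed and predicted values copied from the output above.
observed  <- c(64.0, 106.0, 51.2, 36.8, 22.4, 35.2, 134.4, 51.2, 160.0, 2.9)
predicted <- c(79.677021, 55.851926, 26.993941, 21.903033, 1.457792,
               19.022765, 80.299337, 21.044412, 129.703299, 29.706407)

errors <- observed - predicted  # positive means the model under-predicts
mae  <- mean(abs(errors))       # mean absolute error
rmse <- sqrt(mean(errors^2))    # root mean squared error

print(data.frame(observed, predicted, error = round(errors, 3)))
print(c(MAE = mae, RMSE = rmse))
```

The sign of each error shows whether the model under-predicts or over-predicts that observation, which complements the visual impression from the scatter plot.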
• Assumption Testing: Multiple linear regression allows for the testing of various assumptions,
such as linearity, independence of errors, homoscedasticity, and normality of residuals,
providing insights into the validity of the model.
• Model Comparison: It enables the comparison of different models using metrics like R-
squared, adjusted R-squared, and AIC/BIC, helping researchers identify the most suitable
model for their data.
• Inference: Multiple linear regression provides inferential capabilities, allowing researchers
to draw conclusions about population parameters based on sample data.
• Control of Confounding Variables: By including relevant independent variables in the
model, multiple linear regression helps control for confounding factors, thereby enhanc-
ing the accuracy of the estimated relationships.
• Hypothesis Testing: Researchers can use multiple linear regression to test specific hypothe-
ses about the relationships between independent and dependent variables, providing em-
pirical support for theoretical constructs.
8.2 Disadvantages
Multiple linear regression, like any statistical method, has its limitations and disadvantages. Here
are some of them:
• Assumption of Linearity: Multiple linear regression assumes that the relationship between
the independent variables and the dependent variable is linear. If this assumption is vio-
lated, the model’s predictions may be inaccurate.
• Assumption of Independence: Multiple linear regression assumes that the independent vari-
ables are independent of each other. If there is multicollinearity (high correlation) among
the independent variables, it can lead to unreliable estimates of the regression coefficients.
• Overfitting: When too many independent variables are included in the model relative to
the number of observations, the model may overfit the data. Overfitting occurs when the
model captures noise in the data rather than the underlying relationship.
• Interpretability: As the number of independent variables increases, it becomes more chal-
lenging to interpret the coefficients of the model. Understanding the individual impact of
each independent variable on the dependent variable becomes less straightforward.
• Sensitive to Outliers: Multiple linear regression can be sensitive to outliers, which are data
points that deviate significantly from the rest of the data. Outliers can have a dispro-
portionate influence on the estimated regression coefficients and may lead to misleading
results.
• Assumption of Homoscedasticity: Multiple linear regression assumes that the residuals (the
differences between the observed and predicted values) have constant variance across all
levels of the independent variables. Violation of this assumption, known as heteroscedas-
ticity, can lead to biased standard errors and confidence intervals.
• Non-linearity of the Relationship: While multiple linear regression assumes a linear relation-
ship between the independent variables and the dependent variable, this may not always
be the case in reality. In such situations, the model may not accurately capture the true
relationship between the variables.
• Limited to Linear Relationships: Multiple linear regression is not suitable for modeling non-
linear relationships between variables. If the true relationship is nonlinear, the model may
provide poor predictions.
Linear regression may not be the appropriate choice in certain situations, particularly when
the assumptions of the model are violated or when the relationship between the variables is
not linear. For instance, when dealing with data that exhibit a nonlinear relationship, such as
exponential or polynomial patterns, linear regression may yield inaccurate predictions and un-
reliable parameter estimates. Additionally, if the independent variables are highly correlated
(multicollinearity), it can lead to inflated standard errors and difficulties in interpreting the co-
efficients. Moreover, when the data contain outliers or influential observations, linear regression
may produce biased estimates. In cases where the relationship between the variables is better
represented by a different model, such as logistic regression for binary outcomes or time se-
ries models for temporal data, linear regression should not be used. It’s crucial to assess the
appropriateness of linear regression based on the specific characteristics of the data and the re-
search question at hand, considering alternative methods when necessary to ensure accurate and
meaningful analysis.
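The effect of multicollinearity on the estimates can be quantified with the variance inflation factor (VIF), defined for predictor j as 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The sketch below computes it with base R on simulated predictors; the variables x1, x2, x3 and the helper object vif are hypothetical, and the often quoted cut-off of 10 is a rule of thumb rather than a fixed standard.

```r
# Simulated predictors: x2 is almost a copy of x1, x3 is unrelated.
set.seed(5)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
X <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the other predictors.
vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif, 2))
```

Large values for x1 and x2 signal that their coefficients would have inflated standard errors, while a value near 1 for x3 indicates no such problem.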