Assignment 4

The assignment focuses on predicting the prices of used Toyota Corollas using a dataset from 2004, requiring students to perform multiple linear regression and assess model accuracy through metrics like mean absolute percentage error and root mean squared error. Additionally, it involves clustering analysis using the Framingham Heart Study dataset, where students will prepare data, determine the optimal number of clusters, and evaluate clustering quality. The tasks are to be completed using R Markdown in RStudio Cloud/Blackboard.

MBA 739 – Advanced Analytics

Week 4 Classification, Numeric Prediction, and Clustering


Assignment

Predicting Prices of Used Cars. The file ToyotaCorolla.csv contains data on used cars (Toyota
Corolla) on sale during late summer of 2004 in the Netherlands. It has 1436 records containing
details on 38 attributes, including Price, Age, Kilometers, HP, and other specifications. The goal is to
predict the price of a used Toyota Corolla based on its specifications.
• Split the data into training (60%) and validation (40%) datasets. Use the seed 739 to ensure consistent
output.
• Run a multiple linear regression with the outcome variable Price and predictor variables Age_08_04,
KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfr_Guarantee, Guarantee_Period, Airco,
Automatic_airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar.
• Which factors appear to have no predictive power in assessing price (p > 0.05)?
• Run the predictive model using the above variables. Assess the accuracy of the model in predicting
prices. What is the model’s mean absolute percentage error and root mean squared error? Interpret what
these values mean in real dollars and in plain language.
• Remove the factors which previously appeared to have no predictive power based on statistical
significance. Run the prediction and assess the accuracy of the model in predicting prices. What is the
model’s mean absolute percentage error and root mean squared error? Interpret what these values mean
in real dollars and in plain language. Is the model improved?
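
The split, regression, and accuracy steps above can be sketched in base R as follows. This is a minimal sketch, not the course's prescribed solution: the helper names mape, rmse, and fit_and_score are hypothetical, the 60/40 split is done with sample(), and the file is assumed to sit in the working directory with the column names listed in the assignment.

```r
# Error metrics (helper names are illustrative, not from the assignment).
mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual)) * 100   # in percent
}
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))               # in the units of the outcome
}

# Split into training/validation, fit lm() on training rows,
# and score the predictions on the validation rows.
fit_and_score <- function(df, formula, seed = 739, train_frac = 0.6) {
  set.seed(seed)
  train_idx <- sample(nrow(df), size = floor(train_frac * nrow(df)))
  train <- df[train_idx, , drop = FALSE]
  valid <- df[-train_idx, , drop = FALSE]
  model <- lm(formula, data = train)
  pred <- predict(model, newdata = valid)
  outcome <- all.vars(formula)[1]                  # e.g. "Price"
  list(model = model,
       mape = mape(valid[[outcome]], pred),
       rmse = rmse(valid[[outcome]], pred))
}

# Usage, assuming ToyotaCorolla.csv is in the working directory.
if (file.exists("ToyotaCorolla.csv")) {
  cars <- read.csv("ToyotaCorolla.csv")
  full <- fit_and_score(
    cars,
    Price ~ Age_08_04 + KM + Fuel_Type + HP + Automatic + Doors +
      Quarterly_Tax + Mfr_Guarantee + Guarantee_Period + Airco +
      Automatic_airco + CD_Player + Powered_Windows + Sport_Model + Tow_Bar
  )
  # The p-values are in the "Pr(>|t|)" column of the coefficient table.
  print(summary(full$model)$coefficients)
  cat("MAPE (%):", full$mape, " RMSE:", full$rmse, "\n")
}
```

For the reduced model, the same function can be called again with a formula that omits the predictors whose p-values exceed 0.05. Because the seed is reset inside the function, both models are scored on the same validation rows, which keeps the comparison fair.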

Framingham Heart Study – Clustering

When it launched in 1948, the original goal of the Framingham Heart Study (FHS) was to identify
common factors or characteristics that contribute to cardiovascular disease. Over the years, the FHS
has become a successful multigenerational study that analyzes family patterns of cardiovascular and
other diseases. We will use a small subset of that dataset for cluster analysis.

• To prepare the data, remove any NA values. Then remove the TenYearCHD outcome variable and
normalize the remaining data to 1.
• Using a seed of 10, produce an initial kmeans() clustering with three clusters. Then graph the clusters and
answer the following:
o Determine the optimal number of clusters using a silhouette plot. Produce the plot. What is the
optimal number?
o Replicate the clustering process with the optimal number of clusters. Assess the cluster plot.
Does this appear to meaningfully improve the quality of the clustering?
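
The preparation and clustering steps above could look roughly like this. It is a sketch under stated assumptions: "normalize to 1" is read here as min-max scaling to the range 0 to 1, the file name framingham.csv is a placeholder (the assignment does not name the file), all columns are assumed numeric, and the helper names prepare_for_clustering and avg_silhouette are hypothetical. It uses silhouette() and clusplot() from the cluster package; the provided R Markdown file may rely on different plotting functions.

```r
library(cluster)  # silhouette(), clusplot()

# Min-max scaling to [0, 1]; one reading of "normalize to 1" (assumption).
# Note: a constant column would divide by zero here.
normalize01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Drop rows with NA, drop the outcome column, scale what remains.
prepare_for_clustering <- function(df, outcome = "TenYearCHD") {
  df <- na.omit(df)
  df <- df[, setdiff(names(df), outcome), drop = FALSE]
  as.data.frame(lapply(df, normalize01))
}

# Average silhouette width for each candidate k; higher is better.
avg_silhouette <- function(x, k_values = 2:8, seed = 10) {
  d <- dist(x)
  sapply(k_values, function(k) {
    set.seed(seed)
    km <- kmeans(x, centers = k, nstart = 25)
    mean(silhouette(km$cluster, d)[, "sil_width"])
  })
}

# Usage, assuming a file named framingham.csv (hypothetical name).
if (file.exists("framingham.csv")) {
  fhs <- prepare_for_clustering(read.csv("framingham.csv"))

  set.seed(10)
  km3 <- kmeans(fhs, centers = 3)                  # initial three-cluster fit
  clusplot(fhs, km3$cluster, color = TRUE, main = "k = 3")

  ks <- 2:8
  widths <- avg_silhouette(fhs, ks)
  plot(ks, widths, type = "b",
       xlab = "Number of clusters k", ylab = "Average silhouette width")
  best_k <- ks[which.max(widths)]                  # candidate optimal k

  set.seed(10)
  km_best <- kmeans(fhs, centers = best_k)
  clusplot(fhs, km_best$cluster, color = TRUE, main = paste("k =", best_k))
}
```

The k with the largest average silhouette width is the usual candidate for the optimal number of clusters. Whether the refit is a meaningful improvement still has to be judged from the cluster plot, for example by how much the clusters overlap.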

Use the R Markdown file available in RStudio Cloud/Blackboard to complete the homework.
