
Assignment2 IndividualReport

This report analyzes customer satisfaction in a travel booking company using data from multiple waves. It involves data cleaning, model selection, and the application of statistical techniques to identify factors influencing satisfaction, with a focus on Net Promoter Score and Review Sentiment. The final model chosen for predicting customer satisfaction is a linear regression model that incorporates several key predictors.

Assignment 2 - Individual Report

This assignment delves into understanding customer satisfaction within a travel booking company. It focuses
on data collected during waves 1, 2, and 5 to gain insights into traveler sentiments over time.

Our initial task was to prepare the dataset for analysis through a careful cleaning process. We aimed to
ensure the integrity and reliability of the data by identifying and correcting errors and by handling missing
values. Rows with placeholder ages (-99 or 99) or with missing spending, region, or state values were
removed, and review sentiment scores recorded as 99 were replaced with values predicted by a linear
regression, limited to the valid range of 0 to 5.
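
The cleaning steps can be illustrated on a small example. The sketch below is not the script used for the report (that is in the appendix); it uses an invented toy data frame with the appendix's column names, and assumes that -99 and 99 are placeholder codes for Age and that 99 marks a missing ReviewSentiment.

```r
# Toy data standing in for the survey file. Column names follow the appendix;
# the values are invented for illustration.
toy <- data.frame(
  Age             = c(34, -99, 51, 27, 99, 45, 38, 62),
  EaseOfUse       = c(4, 3, 5, 2, 4, 1, 5, 3),
  ReviewSentiment = c(3.5, 2.0, 99, 1.5, 4.0, 1.0, 99, 2.5)
)

# 1. Drop rows whose Age is a placeholder code (-99 or 99).
toy <- toy[!(toy$Age %in% c(-99, 99)), ]

# 2. Split on whether ReviewSentiment is valid (0 to 5) or the missing code 99.
valid     <- toy[toy$ReviewSentiment >= 0 & toy$ReviewSentiment <= 5, ]
to_impute <- toy[toy$ReviewSentiment == 99, ]

# 3. Fit a regression on the valid rows, predict the missing scores,
#    and clamp the predictions to the valid 0 to 5 range.
fit <- lm(ReviewSentiment ~ EaseOfUse, data = valid)
to_impute$ReviewSentiment <- pmax(pmin(predict(fit, newdata = to_impute), 5), 0)

cleaned <- rbind(valid, to_impute)
print(cleaned)
```

Clamping with pmin and pmax keeps every imputed score on the same scale as the observed ones, so the imputed rows cannot reintroduce out-of-range values.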

Throughout this assignment, we will delve into the intricacies of customer satisfaction, exploring the
multitude of variables that may influence this crucial aspect of the travel booking experience. By leveraging
robust statistical techniques and machine learning algorithms, we aim to uncover valuable insights that can
inform strategic decision-making within the travel industry.

Figure 1: BoxPlot of the Variables before cleaning process to understand them

Before turning to the selected model, we review the predictions and hypotheses we made prior to the
analysis. We expected traveler satisfaction to be influenced by total spending, ease of use, promotions, age,
gender, booking time, device type, flight availability, review sentiment, repeat behavior, and home region.
Higher spending should bring higher expectations and a more critical view of the service, while higher Ease
of Use ratings should go with higher satisfaction. Promotions offer better value for money, and demographic
differences may affect satisfaction. Booking time (peak versus off-peak hours), device type, flight
availability, review sentiment, and repeat behavior were also expected to reflect overall customer
contentment.

In more detail, the hypotheses were as follows. Higher spending reflects a greater investment in the booking
process and therefore higher expectations of service quality. Travelers who rate the website's ease of use
higher should be more satisfied. Promotional offers add value and should increase satisfaction.
Demographic factors may matter, with younger and female travelers possibly differing in satisfaction.
Booking during off-peak hours can lead to faster response times and smoother transactions. Device and
operating system differences may affect the experience. Wider flight availability gives travelers more
options. Positive reviews and mentions of the website should go with higher satisfaction. Finally, repeat
behavior and engagement with the platform are indicators of satisfaction.

Before settling on a model for the study of customer satisfaction, I compared several candidates in detail.
Model Comparison Explanation: In our pursuit of the most effective model for predicting customer
satisfaction, we examined several approaches, including simple and multiple linear models, model selection
with the Bayesian Information Criterion (BIC), decision tree models, and others.

The analysis started with simple linear regression, which looked at the association between individual
predictors and satisfaction. We then broadened it to multiple linear regression, which incorporates many
predictors at the same time to improve model accuracy.

Additionally, BIC was used to pick the most dependable model among competing alternatives, taking into
account both model complexity and goodness of fit. To assess model performance, we used the Mean
Squared Error (MSE), a prediction accuracy measure that quantifies the average squared difference
between observed and predicted values.
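
To make the two criteria concrete, here is a minimal sketch of how BIC and test-set MSE can be computed for two competing linear models. It is an illustration only: the built-in mtcars data stands in for the survey data, mpg plays the role of the satisfaction score, and the helper mse() is a hypothetical name introduced here.

```r
# mtcars (shipped with R) stands in for the survey data; mpg plays the
# role of the satisfaction score. Illustration only.
set.seed(123)
train_idx <- sample(seq_len(nrow(mtcars)), size = floor(0.8 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Two competing linear models: one predictor versus several predictors.
simple_fit   <- lm(mpg ~ wt, data = train)
multiple_fit <- lm(mpg ~ wt + hp + qsec, data = train)

# BIC penalises each extra estimated parameter by log(n); lower is better.
bic_values <- c(simple = BIC(simple_fit), multiple = BIC(multiple_fit))

# MSE on held-out rows: mean squared gap between observed and predicted values.
mse <- function(fit, newdata, response) {
  mean((newdata[[response]] - predict(fit, newdata = newdata))^2)
}
mse_values <- c(simple   = mse(simple_fit, test, "mpg"),
                multiple = mse(multiple_fit, test, "mpg"))

print(bic_values)
print(mse_values)
```

BIC is computed on the data used for fitting, while MSE here is computed on rows the models never saw, so the two criteria can disagree about which model is preferable.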

After this analysis, we determined that the multiple linear regression model defined by the formula below
was the best option. It showed good predictive performance and provided useful insights into the factors
that influence customer satisfaction in our dataset.

model <- lm(Satisfaction ~ BookingCount + TotalSpending + ReviewSentiment +
            NetPromoterScore + WebsiteReviewed, data = final_data)
summary(model)

The fitted model identifies Net Promoter Score as the key predictor of TotalSatisfaction: its coefficient
estimate indicates that an increase in Net Promoter Score is associated with a substantial increase in
TotalSatisfaction. Review sentiment is also statistically significant, so the sentiment expressed in reviews
plays a role in determining satisfaction. BookingCount, TotalSpending, and WebsiteReviewed, by contrast,
do not have a statistically significant effect on TotalSatisfaction.

The model fits the data reasonably well, with a residual standard error of 1.803, a multiple R-squared of
0.6223, and an adjusted R-squared of 0.6189, meaning that the predictors explain about 62% of the variance
in TotalSatisfaction. The F-statistic of 180.6 indicates that the predictor variables jointly contribute to
explaining that variance.
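
The reported fit statistics are linked by simple formulas, which the short sketch below works through. It assumes five predictors that each use one degree of freedom (an assumption, since the coefficient table is not reproduced here); under that assumption the F-statistic implies roughly 548 residual degrees of freedom, and the adjusted R-squared recomputed from it is close to the reported 0.6189.

```r
# Figures reported in the model summary.
r_squared   <- 0.6223
f_statistic <- 180.6

# Assumption: five predictors, each using one degree of freedom.
p <- 5

# F = (R^2 / p) / ((1 - R^2) / df_resid), solved for df_resid.
df_resid <- f_statistic * (1 - r_squared) * p / r_squared

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / df_resid, with n = df_resid + p + 1.
n_implied     <- df_resid + p + 1
adj_r_squared <- 1 - (1 - r_squared) * (n_implied - 1) / df_resid

cat("Implied residual df:", round(df_resid), "\n")
cat("Implied adjusted R-squared:", round(adj_r_squared, 4), "\n")
```

Because the inputs are rounded, the implied sample size is approximate; the point is only that the three reported statistics are mutually consistent.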

The results suggest that NetPromoterScore and ReviewSentiment play a significant role in determining
TotalSatisfaction. BookingCount, TotalSpending, and WebsiteReviewed show no discernible individual
effect, but they were kept because the model that includes them performed best in the comparison. Further
investigation is needed to explore why certain predictors lack significance and to identify additional factors
influencing TotalSatisfaction.

The study compared several models by Mean Squared Error (MSE) and chose the one with the lowest MSE,
which indicates better predictive ability. The full model's results shed light on the relationships between the
predictor variables and customer satisfaction: some factors were strongly associated with it and others had
minimal effect. In the confusion matrix analysis, the chosen model showed greater sensitivity, specificity,
positive predictive value (PPV), negative predictive value (NPV), and balanced accuracy than the
alternatives. The decision tree model was interpretable, but the linear regression model outperformed it and
was more accurate at classifying instances with different satisfaction levels. After a review that covered a
variety of modeling approaches and performance criteria, the linear regression model was found to be the
most appropriate for predicting customer satisfaction in the dataset.
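
For reference, the five confusion matrix metrics named above all derive from the four cells of a two-class confusion matrix. The sketch below computes them in base R on invented labels; it is not the report's data, and it treats "Satisfied" as the positive class by assumption.

```r
# Invented labels for illustration; "Satisfied" is treated as the positive class.
lv <- c("Satisfied", "Unsatisfied")
actual    <- factor(c("Satisfied", "Satisfied", "Satisfied", "Satisfied",
                      "Unsatisfied", "Unsatisfied", "Unsatisfied", "Unsatisfied",
                      "Satisfied", "Unsatisfied"), levels = lv)
predicted <- factor(c("Satisfied", "Satisfied", "Satisfied", "Unsatisfied",
                      "Unsatisfied", "Unsatisfied", "Satisfied", "Satisfied",
                      "Satisfied", "Unsatisfied"), levels = lv)

# Rows are predictions, columns are the true labels.
cm <- table(Predicted = predicted, Actual = actual)

tp <- cm["Satisfied", "Satisfied"]      # correctly predicted Satisfied
fn <- cm["Unsatisfied", "Satisfied"]    # Satisfied predicted as Unsatisfied
fp <- cm["Satisfied", "Unsatisfied"]    # Unsatisfied predicted as Satisfied
tn <- cm["Unsatisfied", "Unsatisfied"]  # correctly predicted Unsatisfied

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
metrics <- c(
  sensitivity       = sensitivity,
  specificity       = specificity,
  ppv               = tp / (tp + fp),
  npv               = tn / (tn + fn),
  balanced_accuracy = (sensitivity + specificity) / 2
)

print(cm)
print(round(metrics, 3))
```

Balanced accuracy averages sensitivity and specificity, so it is less flattering than plain accuracy when one satisfaction class is much more common than the other.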
APPENDIX
Prediction tree
#TREE PREDICTION MODEL
tree_model <- rpart(Satisfaction ~ BookingCount + TotalSpending + ReviewSentiment + NetPromoterScore +
WebsiteReviewed, data = final_data)
predictions <- predict(tree_model, final_data, type = "class")
rpart.plot(tree_model)

Confusion matrix
conf_matrix <- confusionMatrix(predictions, final_data$Satisfaction)
print(conf_matrix)

Full model:
full_model <- lm(Satisfaction ~ ., data = final_data)
CODE

team_data <-
read.csv("[Link]
[Link]", encoding="UTF-8")

summary(team_data)

names(team_data)

print(colSums(is.na(team_data)))

#CLEANING DATA PROCESS

boxplot(team_data[, c("TotalSpending", "InternationalSpend", "DomesticSpend", "Female", "Age",
                      "PromotionsUsed", "EaseOfUse", "UsageDuration", "BookingCount",
                      "AvailableFlights", "Satisfaction", "NetPromoterScore", "TotalSatisfaction",
                      "ReviewSentiment", "WebsiteReturned")])

summary(team_data[, c("TotalSpending", "InternationalSpend", "DomesticSpend", "Female", "Age",
                      "PromotionsUsed", "EaseOfUse", "UsageDuration", "BookingCount",
                      "AvailableFlights", "Satisfaction", "NetPromoterScore", "TotalSatisfaction",
                      "ReviewSentiment")])

#CHECKING FOR OUTLIERS AND DELETING THOSE THAT AREN'T RELEVANT
install.packages("outliers")
library(outliers)
[Link](team_data$Age)
team_data_clean <- subset(team_data, Age != -99)
[Link](team_data_clean$Age)
team_data_clean <- subset(team_data_clean, Age != 99)
[Link](team_data_clean$Age)
team_data_clean <- team_data_clean[complete.cases(team_data_clean$TotalSpending), ]
team_data_clean <- team_data_clean[!is.na(team_data_clean$HomeRegion), ]
team_data_clean <- team_data_clean[!is.na(team_data_clean$HomeState), ]

summary(team_data[, c("ReviewSentiment")])
summary(team_data[, c("ReviewMentionsWebsite")])

# Separate data into two subsets: Subset A (valid data) and Subset B (missing/outlier data)
subset_A <- team_data_clean[team_data_clean$ReviewSentiment >= 0 &
team_data_clean$ReviewSentiment <= 5, ]
subset_B <- team_data_clean[team_data_clean$ReviewSentiment == 99, ]

# CLEANING AND PREDICTING DATA FROM THE COLUMN ReviewSentiment

model <- lm(ReviewSentiment ~ ., data = subset_A)


predictors_B <- subset_B[, !(colnames(subset_B) %in% "ReviewSentiment")]
predicted_values <- predict(model, newdata = predictors_B)
adjusted_values <- pmax(pmin(predicted_values, 5), 0)
subset_B$ReviewSentiment <- adjusted_values
updated_data <- rbind(subset_A, subset_B)

# CLEANING AND PREDICTING DATA FROM THE COLUMN ReviewMentionsWebsite


subset_C <- updated_data[updated_data$ReviewMentionsWebsite != 99, ]
subset_D <- updated_data[updated_data$ReviewMentionsWebsite == 99, ]
predicted_values <- ifelse(subset_D$WebsiteReviewed == 0, 0, subset_D$WebsiteReviewed)
subset_D$ReviewMentionsWebsite <- predicted_values
updated_data_2 <- rbind(subset_C, subset_D)

install.packages("dplyr")
library(dplyr)

# Filter rows where Wave is 1, 2, or 5


team_data_filtered <- updated_data_2 %>%
filter(Wave == 1 | Wave == 2 | Wave == 5)

# Check the first few rows of the filtered data


head(team_data_filtered)

#STORING THE FINAL DATA FOR THE WAVES I'M GOING TO BE WORKING ON
final_data <- team_data_filtered

summary(final_data)
print(colSums(is.na(final_data)))

#PRELIMINARY ANALYSIS
library(ggplot2)

# Bar chart of summed Satisfaction scores by HomeRegion
# (geom_col is the idiomatic form of a histogram drawn with stat = "identity")
ggplot(final_data, aes(x = HomeRegion, y = Satisfaction)) +
  geom_col(fill = "skyblue", color = "skyblue") +
  labs(title = "Distribution of Satisfaction by Home Region",
       x = "Home Region",
       y = "Satisfaction Score") +
  theme_minimal()

# Bar chart of summed Satisfaction scores by HomeState
ggplot(final_data, aes(x = HomeState, y = Satisfaction)) +
  geom_col(fill = "lightgreen", color = "lightgreen") +
  labs(title = "Distribution of Satisfaction by Home State",
       x = "Home State",
       y = "Satisfaction Score") +
  theme_minimal()

# Explore the structure of the dataset


str(final_data)

# Check the first few rows of the dataset


head(final_data)

# Summary statistics of numerical variables


summary(final_data)

# Check for missing values


colSums(is.na(final_data))

# Split the data into training and testing sets (80-20 split)
set.seed(123) # for reproducibility
train_index <- sample(1:nrow(final_data), 0.8 * nrow(final_data))
train_data <- final_data[train_index, ]
test_data <- final_data[-train_index, ]

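# Fit a linear model on the training set.
# Assumption: lm_model is not defined anywhere else in this script, so the
# final model's predictors are used here.
lm_model <- lm(Satisfaction ~ BookingCount + TotalSpending + ReviewSentiment +
               NetPromoterScore + WebsiteReviewed, data = train_data)

# Make predictions on the test set
predictions <- predict(lm_model, newdata = test_data)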

# Evaluate model performance


mse <- mean((test_data$Satisfaction - predictions)^2)
rmse <- sqrt(mse)
mae <- mean(abs(test_data$Satisfaction - predictions))
r_squared <- summary(lm_model)$r.squared

# Print model evaluation metrics


cat("Mean Squared Error (MSE):", mse, "\n")
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
cat("Mean Absolute Error (MAE):", mae, "\n")
cat("R-squared:", r_squared, "\n")

#PREDICTIVE MODELS
library(dplyr)

lmAll <- lm(TotalSpending ~ ., data =final_data)


summary(lmAll)

#MODEL 1
model_data1 <- lm(TotalSatisfaction ~ TotalSpending + DomesticSpend + ReviewSentiment , data =
final_data)
summary(model_data1)

# MODEL 2
model_data2 <- final_data %>%
select(TotalSatisfaction, TotalSpending, PromotionsUsed, DeviceTypeMobileTablet, Female)
lm_model_2 <- lm(TotalSatisfaction ~ TotalSpending + PromotionsUsed + DeviceTypeMobileTablet + Female,
data = model_data2)
summary(lm_model_2)

# MODEL 3
model_data3 <- final_data %>%
select(TotalSatisfaction, TotalSpending, PromotionsUsed, EaseOfUse, AvailableFlights)
lm_model_3 <- lm(TotalSatisfaction ~ TotalSpending + PromotionsUsed + EaseOfUse + AvailableFlights, data
= model_data3)
summary(lm_model_3)

# MODEL 4
model_data4 <- final_data %>%
select(Satisfaction, TotalSpending, PromotionsUsed, AvailableFlights)
lm_model_4 <- lm(Satisfaction ~ TotalSpending + PromotionsUsed + AvailableFlights, data = model_data4)
summary(lm_model_4)

# MODEL 5
model_data5 <- final_data %>%
select(TotalSatisfaction, TotalSpending, PromotionsUsed, AvailableFlights)
lm_model_5 <- lm(TotalSatisfaction ~ TotalSpending + PromotionsUsed + AvailableFlights, data =
model_data5)
summary(lm_model_5)

# MODEL 5 Variation
model_data5_variation <- final_data %>%
select(TotalSatisfaction, TotalSpending, PromotionsUsed, EaseOfUse)

# Filter out rows with missing values in TotalSpending column
model_data5_variation_clean <-
model_data5_variation[complete.cases(model_data5_variation$TotalSpending), ]

# Build a linear regression model


lm_model_5_variation <- lm(TotalSatisfaction ~ TotalSpending + PromotionsUsed + EaseOfUse, data =
model_data5_variation_clean)

# Summary of the model


summary(lm_model_5_variation)

# MODEL 6
model_data6 <- final_data %>%
select(TotalSatisfaction, AvailableFlights, TotalSpending, UsageDuration)
lm_model_6 <- lm(TotalSatisfaction ~ AvailableFlights + TotalSpending + UsageDuration, data =
model_data6)
summary(lm_model_6)

confint(lm_model_6)

# MODEL 8
model_data8 <- final_data %>%
select(Satisfaction, DomesticSpend, AvailableFlights)
lm_model_8 <- lm(Satisfaction ~ DomesticSpend + AvailableFlights , data = model_data8)
summary(lm_model_8)

# BEST MODEL
model <- lm(Satisfaction ~ BookingCount + TotalSpending + UsageDuration + NetPromoterScore , data =
final_data)
summary(model)

model2 <- lm(Satisfaction ~ BookingCount + TotalSpending + ReviewSentiment + NetPromoterScore , data =


final_data)
summary(model2)

#BEST ONE
model <- lm(Satisfaction ~ BookingCount + TotalSpending + ReviewSentiment + NetPromoterScore +
WebsiteReviewed, data = final_data)
summary(model)

# Order the Satisfaction levels as preferred (3 down to 1)
final_data$Satisfaction <- factor(final_data$Satisfaction, levels = 3:1)
# Create a mosaic plot with the new order
mosaic_model <- mosaicplot(table(final_data$Satisfaction, final_data$ReviewSentiment),
                           main = "Mosaic Plot of Satisfaction and Review Sentiment")
# Filter the data for waves 1, 2, and 5
wave_subset <- subset(final_data, Wave %in% c(1, 2, 5))

# Create a mosaic plot


mosaic_model <- mosaicplot(table(wave_subset$Satisfaction, wave_subset$ReviewSentiment), main =
"Satisfaction vs Review Sentiment (Waves 1, 2, 5)")

install.packages("rpart")
library(rpart)
install.packages("rpart.plot")
library(rpart.plot)

#TREE PREDICTION MODEL


tree_model <- rpart(Satisfaction ~ BookingCount + TotalSpending + ReviewSentiment + NetPromoterScore +
WebsiteReviewed, data = final_data)
predictions <- predict(tree_model, final_data, type = "class")
rpart.plot(tree_model)

install.packages("ggplot2")
library(ggplot2)

install.packages("lattice")
library(lattice)

install.packages("caret")
library(caret)

predictions2 <- factor(predictions)

#DECISION TREE
library(rpart)

# Recode Satisfaction variable


final_data$Satisfaction_category <- ifelse(final_data$Satisfaction <= 5, 0, 1)

# Split data into training and testing sets


set.seed(123) # For reproducibility
train_indices <- sample(1:nrow(final_data), 0.8 * nrow(final_data))
train_data <- final_data[train_indices, ]
test_data <- final_data[-train_indices, ]

# Fit decision tree model


tree_model <- rpart(Satisfaction_category ~ ., data = train_data, method = "class")
# Evaluate model performance on test data
predicted <- predict(tree_model, test_data, type = "class")
accuracy <- mean(predicted == test_data$Satisfaction_category)
accuracy

library(rpart.plot)

# Plot decision tree
rpart.plot(tree_model, type = 4, extra = 2)

# Step 1: Run a multi-linear regression with all predictor variables


full_model <- lm(Satisfaction ~ BookingCount + ReviewSentiment + NetPromoterScore + WebsiteReviewed ,
data = final_data)

# Step 2: Select the most reliable model using BIC


reduced_model <- step(full_model, k = log(nrow(final_data)), direction = "both")

# Step 3: Compare performance using Mean Squared Error (MSE) over test data
# Split data into training and testing sets
set.seed(123) # For reproducibility
train_indices <- sample(1:nrow(final_data), 0.8 * nrow(final_data))
train_data <- final_data[train_indices, ]
test_data <- final_data[-train_indices, ]

# Predict using reduced model


predicted <- predict(reduced_model, newdata = test_data)

# Calculate MSE
mse <- mean((predicted - test_data$Satisfaction)^2)

# Encode satisfaction variable into binary outcome


final_data$Satisfaction_binary <- ifelse(final_data$Satisfaction >= 6, "Satisfied", "Unsatisfied")

# Run classification tree model


tree_model <- rpart(Satisfaction_binary ~ ., data = final_data, method = "class")

# Evaluate performance using accuracy scores


# Predict using the decision tree model
predictions <- predict(tree_model, final_data, type = "class")

# Compute accuracy
accuracy <- sum(predictions == final_data$Satisfaction_binary) / nrow(final_data)

# Fit initial full model


full_model <- lm(Satisfaction ~ ., data = final_data)
summary(full_model)
