Students Exam Scores Analysis - Ipynb
3","language":"python","name":"python3"},"language_info":{"codemirror_mode":
{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-
python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","
version":"3.12.3"},"kaggle":{"accelerator":"none","dataSources":
[{"sourceId":5399169,"sourceType":"datasetVersion","datasetId":3128523}],"dockerIma
geVersionId":30761,"isInternetEnabled":false,"language":"python","sourceType":"note
book","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":
[{"cell_type":"markdown","source":"# Understand the Data\n","metadata":{}},
{"cell_type":"markdown","source":"## Import libraries","metadata":{}},
{"cell_type":"code","source":"# type: ignore\nimport numpy as np \nimport pandas
as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nimport warnings\
nwarnings.filterwarnings('ignore')","metadata":{},"outputs":
[],"execution_count":null},{"cell_type":"markdown","source":"## Input
Data","metadata":{}},{"cell_type":"code","source":"df =
pd.read_csv(\"./Expanded_data_with_more_features.csv\", encoding=
'unicode_escape')\ndf.head(2)","metadata":{},"outputs":[],"execution_count":null},
{"cell_type":"code","source":"df.shape","metadata":{},"outputs":
[],"execution_count":null},{"cell_type":"code","source":"df.size","metadata":
{},"outputs":[],"execution_count":null},
{"cell_type":"code","source":"df.info()","metadata":{},"outputs":
[],"execution_count":null},
{"cell_type":"code","source":"df.describe(include='all').T","metadata":
{},"outputs":[],"execution_count":null},
{"cell_type":"code","source":"df.columns","metadata":{},"outputs":
[],"execution_count":null},{"cell_type":"markdown","source":"`Data Dictionary`\n\n|
Column Name | Description
|\
n|----------------------|----------------------------------------------------------
-------------------|\n| **Gender** | Gender of the student (male/female)
|\n| **EthnicGroup** | Ethnic group of the student (group A to E)
|\n| **ParentEduc** | Parent(s)
education background (from some_highschool to master's degree) |\n|
**LunchType** | School lunch type (standard or free/reduced)
|\n| **TestPrep** | Test preparation course followed (completed or none)
|\n| **ParentMaritalStatus** | Parent(s) marital status
(married/single/widowed/divorced) |\n| **PracticeSport** | How
often the student practices sport (never/sometimes/regularly) |\n|
**IsFirstChild** | If the child is the first in the family (yes/no)
|\n| **NrSiblings** | Number of siblings the student has (0 to 7)
|\n| **TransportMeans** | Means of transport to school (schoolbus/private)
|\n| **WklyStudyHours** | Weekly self-study hours (less than 5 hrs; between 5
and 10 hrs; more than 10 hrs) |\n| **MathScore** | Math test score (0-100)
|\n| **ReadingScore** | Reading test score (0-100)
|\n| **WritingScore** | Writing test score (0-100)
|\n","metadata":{}},{"cell_type":"markdown","source":"# Data Cleaning","metadata":
{}},{"cell_type":"code","source":"df.columns","metadata":{},"outputs":
[],"execution_count":null},
{"cell_type":"code","source":"df.isnull().sum()","metadata":{},"outputs":
[],"execution_count":null},{"cell_type":"code","source":"df.fillna({\n
'EthnicGroup': 'Unknown',\n 'ParentEduc': 'No Edu info',\n
'ParentMaritalStatus': 'No info',\n \n}, inplace=True)","metadata":{},"outputs":
[],"execution_count":null},{"cell_type":"code","source":"df.drop(columns=['Unnamed:
0'], inplace=True)","metadata":{},"outputs":[],"execution_count":null},
{"cell_type":"code","source":"df","metadata":{},"outputs":
[],"execution_count":null},{"cell_type":"code","source":"df.info()","metadata":
{},"outputs":[],"execution_count":null},
{"cell_type":"code","source":"df[df['WklyStudyHours']=='05-Oct']","metadata":
{},"outputs":[],"execution_count":null},{"cell_type":"markdown","source":"## Add
new columns","metadata":{}},{"cell_type":"code","source":"# Percentage column: combined score as a share of the 300 available points\ndf['Percentage'] = (df['WritingScore'] + df['MathScore'] +
df['ReadingScore']) / 300 * 100\n# Round numerically: the string round-trip was redundant and float16 keeps only about 3 significant digits\ndf['Percentage'] =
df['Percentage'].round(2)","metadata":{},"outputs":
[],"execution_count":null},{"cell_type":"code","source":"# grade col\ndef
grade(score):\n \n if score >= 80.0:\n return 'A'\n elif score >=
60.0:\n return 'B'\n elif score >= 40.0:\n return 'C'\n elif
score >= 30.0:\n return 'D'\n else:\n return 'F'\n","metadata":
{},"outputs":[],"execution_count":null},{"cell_type":"code","source":"df['Grade'] =
df['Percentage'].apply(grade)","metadata":{},"outputs":[],"execution_count":null},
{"cell_type":"code","source":"df","metadata":{},"outputs":
[],"execution_count":null},{"cell_type":"code","source":"# for future reference\n\
ndef all_mean_score_set():\n return {'MathScore':'mean', 'ReadingScore':
'mean','WritingScore':'mean'}","metadata":{},"outputs":[],"execution_count":null},
{"cell_type":"markdown","source":"# EDA","metadata":{}},
{"cell_type":"markdown","source":"## Gender","metadata":{}},
{"cell_type":"code","source":"gender_count = df['Gender'].value_counts()\
nplt.pie(gender_count, labels=gender_count.index, autopct=lambda p : '{:.1f}%
({:,.1f})'.format(p,p * sum(gender_count)/100))\nplt.title('Gender Distribution')\
nplt.show()","metadata":{},"outputs":[],"execution_count":null},
{"cell_type":"code","source":"ax = sns.countplot(data=df, x='Gender', hue='Grade',
palette='viridis')\n\nfor container in ax.containers:\n
plt.bar_label(container)\n\nplt.title('Male & Female Grade') \
nplt.show()","metadata":{},"outputs":[],"execution_count":null},
{"cell_type":"markdown","source":"<div class=\"alert alert-block alert-info\">\
n<b>Info : </b> Both males and females have nearly equal
participation.\n</div>","metadata":{}},{"cell_type":"markdown","source":"## Parent
Education vs Score","metadata":{}},{"cell_type":"code","source":"par_edu =
df.groupby(['ParentEduc', ]).agg({'MathScore':'mean', \n
'ReadingScore': 'mean',\n
'WritingScore':'mean'})\n\npar_edu","metadata":{},"outputs":
[],"execution_count":null},{"cell_type":"code","source":"df.groupby(['ParentEduc'])
[['MathScore', 'ReadingScore', 'WritingScore']].agg('mean') \\\n    .style.background_gradient(cmap='RdPu')","metadata":{},"outputs":
[],"execution_count":null},{"cell_type":"code","source":"# Does parental education
affect scores differently by gender?\ndf.groupby(['Gender', 'ParentEduc'])
[['MathScore', 'ReadingScore', 'WritingScore']].agg('mean') \\\n    .style.background_gradient(cmap='RdPu')","metadata":{},"outputs":
[],"execution_count":null},
{"cell_type":"code","source":"sns.clustermap(data=par_edu, cmap='viridis',
annot=True)\nplt.title('Relationship between Student Scores and Parent Education',
size=19)\nplt.show()","metadata":{},"outputs":[],"execution_count":null},
{"cell_type":"markdown","source":"<div class=\"alert alert-block alert-info\">\
n<b>Info : </b>Children of parents who have a master's degree are more likely to
have better scores.\n</div>","metadata":{}},{"cell_type":"markdown","source":"##
Parent Marital Status vs Score","metadata":{}},
{"cell_type":"code","source":"par_mar =
df.groupby(['ParentMaritalStatus', ]).agg({'MathScore':'mean', \n
'ReadingScore': 'mean',\n
'WritingScore':'mean'})\n\
npar_mar.style.background_gradient(cmap='RdPu')","metadata":{},"outputs":
[],"execution_count":null},
{"cell_type":"code","source":"sns.clustermap(data=par_mar, cmap='viridis',
annot=True)\nplt.title('Relationship between Student Scores and Parent Marital Status',
size=19)\nplt.show()","metadata":{},"outputs":[],"execution_count":null},
{"cell_type":"markdown","source":"<div class=\"alert alert-block alert-info\">\
n<b>Info : </b>There is no significant difference in children's scores due to their
parents' marital status.\n</div>","metadata":{}},
{"cell_type":"markdown","source":"## All Scores","metadata":{}},
{"cell_type":"code","source":"# df[df[\"ReadingScore\"] < 10].count()\nfig =
plt.figure(figsize=(20, 5))\n\nfor index, one in
enumerate([\"MathScore\", \"ReadingScore\", \"WritingScore\"]):\n
fig.add_subplot(1, 3, index + 1)\n sns.boxplot(x=df[one])","metadata":
{},"outputs":[],"execution_count":null},{"cell_type":"code","source":"#math\
nsns.catplot(data=df, kind='boxen', x='MathScore', palette='Set2')\nplt.title('Math
Boxen plot')\nfor x in [20, 40, 60, 80, 100]:\n plt.axvline(x=x, color='black',
linestyle='--', linewidth=0.7)\n \n#reading\nsns.catplot(data=df, kind='boxen',
x='ReadingScore', palette='Set1')\nplt.title('Reading Boxen plot')\nfor x in [20,
40, 60, 80, 100]:\n plt.axvline(x=x, color='black', linestyle='--',
linewidth=0.7)\n \n \n#writing\nsns.catplot(data=df, kind='boxen',
x='WritingScore', palette='Set3')\nplt.title('Writing Boxen plot')\nfor x in [20,
40, 60, 80, 100]:\n plt.axvline(x=x, color='black', linestyle='--',
linewidth=0.7)\n\n\nplt.show()","metadata":{},"outputs":[],"execution_count":null},
{"cell_type":"markdown","source":"## Ethnic group vs Score","metadata":{}},
{"cell_type":"code","source":"group_counts = df['EthnicGroup'].value_counts()\
nlabels = group_counts.index\n\nplt.pie(group_counts, labels=labels,
autopct='%1.1f%%')\nplt.title('Ethnic Groups')\nplt.show()","metadata":
{},"outputs":[],"execution_count":null},
{"cell_type":"code","source":"df.columns","metadata":{},"outputs":
[],"execution_count":null},{"cell_type":"markdown","source":"## Sport vs
Score","metadata":{}},{"cell_type":"code","source":"sport =
df.groupby(['PracticeSport']).agg({'MathScore':'mean', \n
'ReadingScore': 'mean',\n
'WritingScore':'mean'})\n\
nsport.style.background_gradient(cmap='RdPu')","metadata":{},"outputs":
[],"execution_count":null},{"cell_type":"code","source":"sns.clustermap(data=sport,
annot=True)\nplt.show()","metadata":{},"outputs":[],"execution_count":null},
{"cell_type":"markdown","source":"## Test Preparation vs Score","metadata":{}},
{"cell_type":"code","source":"df.groupby(['TestPrep']).agg(all_mean_score_set()) \\\n    .style.background_gradient(cmap='RdPu')","metadata":{},"outputs":
[],"execution_count":null},
{"cell_type":"code","source":"df.groupby(['PracticeSport',
'TestPrep']).agg(all_mean_score_set()) \\\
n .style.background_gradient(cmap='RdPu')","metadata":{},"outputs":
[],"execution_count":null},{"cell_type":"markdown","source":"## Lunch vs
Score","metadata":{}},{"cell_type":"code","source":"df.groupby(['LunchType',
'Gender']).agg(all_mean_score_set()) \\\
n .style.background_gradient(cmap='RdPu')","metadata":{},"outputs":
[],"execution_count":null},{"cell_type":"markdown","source":"## First Child vs
Score","metadata":{}},
{"cell_type":"code","source":"df.groupby(['IsFirstChild']).agg(all_mean_score_set()
) \\\n .style.background_gradient(cmap='RdPu')","metadata":{},"outputs":
[],"execution_count":null},{"cell_type":"markdown","source":"## Siblings vs
Score","metadata":{}},
{"cell_type":"code","source":"df['NrSiblings'].value_counts().plot(kind='bar')\
nplt.title('Nr of Siblings')\nplt.show()","metadata":{},"outputs":
[],"execution_count":null},
{"cell_type":"code","source":"df.groupby(['NrSiblings']).agg(all_mean_score_set())
\\\n .style.background_gradient(cmap='RdPu')","metadata":{},"outputs":
[],"execution_count":null},{"cell_type":"markdown","source":"## Transportation vs
Score","metadata":{}},
{"cell_type":"code","source":"df.groupby(['TransportMeans']).agg(all_mean_score_set
()) \\\n .style.background_gradient(cmap='RdPu')","metadata":{},"outputs":
[],"execution_count":null},
{"cell_type":"code","source":"df.groupby(['TransportMeans',
'TestPrep']).agg(all_mean_score_set()) \\\
n .style.background_gradient(cmap='RdPu')\n","metadata":{},"outputs":
[],"execution_count":null},
{"cell_type":"code","source":"df.groupby(['TransportMeans',
'PracticeSport']).agg(all_mean_score_set()) \\\
n .style.background_gradient(cmap='RdPu')\n","metadata":{},"outputs":
[],"execution_count":null},
{"cell_type":"code","source":"df.groupby(['TransportMeans',
'WklyStudyHours']).agg(all_mean_score_set()) \\\
n .style.background_gradient(cmap='RdPu')","metadata":{},"outputs":
[],"execution_count":null},{"cell_type":"markdown","source":"## Weekly Study Hours vs
Score","metadata":{}},{"cell_type":"code","source":"df.groupby(['WklyStudyHours',
'TestPrep']).agg(all_mean_score_set()) \\\
n .style.background_gradient(cmap='RdPu')","metadata":{},"outputs":
[],"execution_count":null},{"cell_type":"code","source":"# Compare the target's distribution across each categorical feature. Clear shifts between categories suggest the feature carries signal a linear regression could use.\ntarget = 'MathScore'\n\n# Identify categorical features\ncategorical_features =
df.select_dtypes(include=['object']).columns\n\n# Create one box plot per feature\nfor feature in
categorical_features:\n    plt.figure(figsize=(10, 6))\n    sns.boxplot(x=feature,
y=target, data=df)\n    plt.title(f'Box Plot of {target} by {feature}')\n    plt.xlabel(feature)\n    plt.ylabel(target)\n    plt.show()","metadata":
{},"outputs":[],"execution_count":null}]}