
Finance and Risk

Analytics

Name: Sweta Kumari


PGP-DSBA Online
July’ 21
Date: 08/05/2022

Table of Contents
Problem: Company Analysis
1. Outlier Treatment
2. Missing Value Treatment
3. Transform Target Variable into 0 and 1
4. Univariate & Bivariate Analysis with Proper Interpretation (only variables significant in the model building are included)
5. Train-Test Split
6. Build Logistic Regression Model (using the statsmodels library) on the Most Important Variables on the Train Dataset and Choose the Optimum Cutoff; Showcase the Model-Building Approach
7. Validate the Model on the Test Dataset and State the Performance Metrics and Model Interpretation
List of Figures
Figure 1: Outlier Graph
Figure 2: Company Data Before Scaling
Figure 3: Company Data After Scaling
Figure 4: Visual Inspection of Missing Data
Figure 5: Correlation Graph
Figure 6: Filtered Correlation Graph
Figure 7: Univariate Analysis
Figure 8: Bivariate Analysis
Problem Statement:

Businesses or companies can fall prey to default if they are not able to keep up with their debt obligations. Defaults lead to a lower credit rating for the company, which in turn reduces its chances of getting credit in the future, and the company may have to pay higher interest on existing debts as well as on any new obligations. From an investor's point of view, a company is worth investing in only if it is capable of handling its financial obligations, can grow quickly, and is able to manage the scale of its growth.

A balance sheet is a financial statement of a company that provides a snapshot of what a company
owns, owes, and the amount invested by the shareholders. Thus, it is an important tool that helps
evaluate the performance of a business.

The available data includes information from the companies' financial statements for the previous year (2015). Information about the net worth of the company in the following year (2016) is also provided, which can be used to derive the labeled field.

An explanation of the data fields is available in the data dictionary, 'Credit Default Data Dictionary.xlsx'.

Introduction:

We need to create a default variable that should take the value of 1 when net worth next year is
negative & 0 when net worth next year is positive.

Question 1.1 Outlier Treatment

Ans: We import all the necessary libraries in the Jupyter notebook and create a DataFrame for the dataset. The original Excel file was used for importing and analysis; converting it to CSV altered data types and corrupted some values, which created issues during the analysis, so it is advisable to work with the original dataset file. The Excel file was therefore used throughout.

A DataFrame named df was created from the dataset for further analysis.
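A minimal sketch of this loading and inspection step, assuming an Excel file named 'Company_Data.xlsx' (the actual file name may differ):

import pandas as pd

# Read the original Excel file directly; avoid the CSV conversion mentioned above.
df = pd.read_excel("Company_Data.xlsx")

print(df.head())    # top 5 records
print(df.tail())    # bottom 5 records
print(df.shape)     # expected (3586, 67)
df.info()           # data types and non-null counts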

Checking top 5 records :


Checking bottom 5 records:

Checking the shape of the data frame:


The number of rows (observations) is 3586
The number of columns (variables) is 67

Information about the data frame:


The dataset has a total of 3586 rows. The data types are 63 float64, 3 int64 and 1 object; a few columns contain missing values, as shown by the non-null counts below.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3586 entries, 0 to 3585
Data columns (total 67 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Co_Code 3586 non-null int64
1 Co_Name 3586 non-null object
2 Networth_Next_Year 3586 non-null float64
3 Equity_Paid_Up 3586 non-null float64
4 Networth 3586 non-null float64
5 Capital_Employed 3586 non-null float64
6 Total_Debt 3586 non-null float64
7 Gross_Block_ 3586 non-null float64
8 Net_Working_Capital_ 3586 non-null float64
9 Current_Assets_ 3586 non-null float64
10 Current_Liabilities_and_Provisions_ 3586 non-null float64
11 Total_Assets_to_Liabilities_ 3586 non-null float64
12 Gross_Sales 3586 non-null float64
13 Net_Sales 3586 non-null float64
14 Other_Income 3586 non-null float64
15 Value_Of_Output 3586 non-null float64
16 Cost_of_Production 3586 non-null float64
17 Selling_Cost 3586 non-null float64
18 PBIDT 3586 non-null float64
19 PBDT 3586 non-null float64
20 PBIT 3586 non-null float64
21 PBT 3586 non-null float64
22 PAT 3586 non-null float64
23 Adjusted_PAT 3586 non-null float64
24 CP 3586 non-null float64
25 Revenue_earnings_in_forex 3586 non-null float64
26 Revenue_expenses_in_forex 3586 non-null float64
27 Capital_expenses_in_forex 3586 non-null float64
28 Book_Value_Unit_Curr 3586 non-null float64
29 Book_Value_Adj._Unit_Curr 3582 non-null float64
30 Market_Capitalisation 3586 non-null float64
31 CEPS_annualised_Unit_Curr 3586 non-null float64
32 Cash_Flow_From_Operating_Activities 3586 non-null float64
33 Cash_Flow_From_Investing_Activities 3586 non-null float64
34 Cash_Flow_From_Financing_Activities 3586 non-null float64
35 ROG-Net_Worth_perc 3586 non-null float64
36 ROG-Capital_Employed_perc 3586 non-null float64
37 ROG-Gross_Block_perc 3586 non-null float64
38 ROG-Gross_Sales_perc 3586 non-null float64
39 ROG-Net_Sales_perc 3586 non-null float64
40 ROG-Cost_of_Production_perc 3586 non-null float64
41 ROG-Total_Assets_perc 3586 non-null float64
42 ROG-PBIDT_perc 3586 non-null float64
43 ROG-PBDT_perc 3586 non-null float64
44 ROG-PBIT_perc 3586 non-null float64
45 ROG-PBT_perc 3586 non-null float64
46 ROG-PAT_perc 3586 non-null float64
47 ROG-CP_perc 3586 non-null float64
48 ROG-Revenue_earnings_in_forex_perc 3586 non-null float64
49 ROG-Revenue_expenses_in_forex_perc 3586 non-null float64
50 ROG-Market_Capitalisation_perc 3586 non-null float64
51 Current_Ratio[Latest] 3585 non-null float64
52 Fixed_Assets_Ratio[Latest] 3585 non-null float64
53 Inventory_Ratio[Latest] 3585 non-null float64
54 Debtors_Ratio[Latest] 3585 non-null float64
55 Total_Asset_Turnover_Ratio[Latest] 3585 non-null float64
56 Interest_Cover_Ratio[Latest] 3585 non-null float64
57 PBIDTM_perc[Latest] 3585 non-null float64
58 PBITM_perc[Latest] 3585 non-null float64
59 PBDTM_perc[Latest] 3585 non-null float64
60 CPM_perc[Latest] 3585 non-null float64
61 APATM_perc[Latest] 3585 non-null float64
62 Debtors_Velocity_Days 3586 non-null int64
63 Creditors_Velocity_Days 3586 non-null int64
64 Inventory_Velocity_Days 3483 non-null float64
65 Value_of_Output_to_Total_Assets 3586 non-null float64
66 Value_of_Output_to_Gross_Block 3586 non-null float64
dtypes: float64(63), int64(3), object(1)
memory usage: 1.8+ MB

Checking Missing Values :

Summary of the dataset:

Observation: Using the company data, we compiled a descriptive summary. We can see the mean,
standard deviation, and percentile details for all columns since most of the data is continuous.

Checking duplicate value in the dataset :

There are no duplicate records present in the dataset .
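The checks above (missing values, descriptive summary, duplicates) can be reproduced with a short sketch like the following:

print(df.isnull().sum().sort_values(ascending=False).head(10))  # columns with the most nulls
print(df.isnull().sum().sum())                                  # total missing cells in the dataset
print(df.describe().T)                                          # mean, std and percentiles per column
print(df.duplicated().sum())                                    # duplicate rows (0 in this dataset)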

Outliers:

• Below, we removed the default column and created two separate datasets, df_X and df_Y.
• Values that lie outside the upper limit (UL) and lower limit (LL) are detected and treated by bringing them back to these limits (a sketch of this capping step follows this list).
• The datasets are then concatenated, and 'Networth_Next_Year' is removed, as it would significantly influence the default variable, which is derived from that column.
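A sketch of the capping step, assuming the usual IQR-based limits (Q1 - 1.5*IQR and Q3 + 1.5*IQR); the exact limits used in the report are not shown, so treat the 1.5 multiplier as an assumption:

import numpy as np
import pandas as pd

def cap_outliers_iqr(s: pd.Series) -> pd.Series:
    # Bring values outside the lower/upper limits back to those limits.
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    ll, ul = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return s.clip(lower=ll, upper=ul)

# Apply to the numeric predictors only (df_X holds the independent variables).
num_cols = df_X.select_dtypes(include=np.number).columns
df_X[num_cols] = df_X[num_cols].apply(cap_outliers_iqr)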

Fig 1
Observation: A significant number of outliers is present for almost all variables; a noticeable percentage of the data lies above the third quartile and below the first quartile.

Dataset before scaling

Fig 2

Dataset after scaling


Fig 3

Question 1.2 Missing Value Treatment


Ans:
• There are some missing values in the dataset; these are handled in the next steps.
• The dataset, which consists of 3586 rows, did not have many missing values to begin with.
• The total number of missing records was 118 in the entire dataset.

Observation:
• Null values are present in many columns; however, a significant number was present in "Inventory_Velocity_Days".

Visual Inspect of Missing data :


Fig 4
Missing value treatment: Many columns had null values, but by far the largest number was in the "Inventory_Velocity_Days" column. The missing values in this column were imputed with the average (mean) value.
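A sketch of the mean imputation described above, shown for 'Inventory_Velocity_Days' and then generalized to all numeric columns with nulls:

# Fill the column with the largest number of nulls using its mean.
df["Inventory_Velocity_Days"] = df["Inventory_Velocity_Days"].fillna(
    df["Inventory_Velocity_Days"].mean()
)

# Apply the same treatment to any remaining numeric columns with missing values.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())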

Correlation Matrix on Scaled Data :
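A sketch of how the correlation plots can be produced, assuming the scaled predictors are held in a DataFrame called df_scaled (an assumed name):

import seaborn as sns
import matplotlib.pyplot as plt

corr = df_scaled.corr()                    # correlation matrix of the scaled data
plt.figure(figsize=(18, 14))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation matrix on scaled data")
plt.show()

# A seaborn clustermap (used in the multivariate analysis below) groups correlated variables.
sns.clustermap(corr, cmap="coolwarm", center=0, figsize=(18, 18))
plt.show()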


Fig 5

Fig 6: Filtered correlation graph
Question 1.3 Transform Target variable into 0 and 1
Ans: Creating a target variable 'Default' using 'Networth_Next_Year'.
• Based on the project notes, a new dependent variable named "Default" was created, with the criteria:
  1 denotes that the company's net worth next year is negative.
  0 denotes that the company's net worth next year is positive.
• The np.where function was used to achieve this (see the sketch after this list), creating a binary target variable from 'Networth_Next_Year'.
• After generating the dependent column, we checked how the data splits on this variable; this is illustrated in the chart below.
• The distinct values of the dependent variable are 0 and 1.
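A sketch of the target creation with np.where, following the criteria above:

import numpy as np

# Default = 1 when next year's net worth is negative, 0 when it is positive.
df["Default"] = np.where(df["Networth_Next_Year"] < 0, 1, 0)

# Check how the data splits on the new dependent variable.
print(df["Default"].value_counts())
print(df["Default"].value_counts(normalize=True))   # roughly 11% defaults, per the report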

Let's see what the "Default" variable looks like:

Fig 6

Question 1.4 Univariate (4 marks) & Bivariate (6 marks) analysis with proper interpretation. (You may choose to include only those variables which were significant in the model building.)
Ans :
• Univariate analysis is the simplest form of analyzing data. 'Uni' means one, so the data has only one variable, and each variable is analyzed separately. The data is gathered for the purpose of answering a question, or more specifically, a research question.
• Bivariate analysis is one of the simplest forms of quantitative (statistical) analysis. It involves the analysis of two variables (often denoted X and Y) for the purpose of determining the empirical relationship between them.
Univariate Analysis:

Fig 7
• Histograms (histplot) and boxplots have been created for the numerical variables that are important features in the dataset (a sketch follows).
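A sketch of these univariate plots; the column list below is illustrative, while the report restricts itself to the variables relevant to the model:

import seaborn as sns
import matplotlib.pyplot as plt

for col in ["Networth", "Total_Debt", "PBIDT", "Current_Ratio[Latest]"]:
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    sns.histplot(df[col], kde=True, ax=axes[0])   # distribution and skewness
    sns.boxplot(x=df[col], ax=axes[1])            # outliers
    fig.suptitle(col)
    plt.show()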

Skewness in the data before imputation

Skewness in the data after imputation


Bivariate Analysis:

Gross_Sales vs Net_Sales


Networth vs Capital_Employed

Networth vs Cost_of_Production


Fig 8
Multivariate Analysis:
1. Furthermore, we conducted a multivariate analysis on the data to see whether any correlations can be observed within the data.
2. Using the correlation function and a seaborn clustermap, we plotted the correlations and obtained a better understanding of the data.
3. Our observations were as follows: Networth and Networth_Next_Year were highly correlated, and the variables pertaining to rate of growth (ROG) were also highly correlated with one another.
4. The analysis shows that there is a collinearity issue with this dataset.
Before we proceed to the logistic regression model-building exercise, the dataset is scaled. Scaling is a data pre-processing step applied to the independent variables to normalize the data within a specific range; here it brings all of the data into the range 0 to 1 (a sketch follows).
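A sketch of the scaling step; MinMaxScaler matches the 0-to-1 range described, but the specific scaler used in the report is an assumption:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Keep only the predictors: drop identifiers, the raw target and the derived label.
X = df.drop(columns=["Co_Code", "Co_Name", "Networth_Next_Year", "Default"])
scaler = MinMaxScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)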

Question 1.5 Train Test Split


Ans: The train-test split procedure is used to estimate the performance of machine learning
algorithms when they are used to make predictions on data not used to train the model.

It is a fast and easy procedure to perform, and its results allow you to compare the performance of machine learning algorithms on your predictive modeling problem. Although it is simple to use and interpret, there are situations where it should not be used, such as when the dataset is small, or where additional configuration is required, for example when it is used for classification and the dataset is not balanced.

The dataset is split into the predictor variables (the independent variables) and the response variable (the target). After splitting the data into training and testing sets in a 67:33 ratio (sketched below), we fit the model on the training set and evaluate its performance on both sets.
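A sketch of the 67:33 split; stratifying on the target (an assumption, since the report only states the ratio) keeps the roughly 11% default rate in both sets:

from sklearn.model_selection import train_test_split

y = df["Default"]
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.33, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)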

After splitting the dataset, let's look at the confusion matrix and classification report.

Confusion matrix and classification report of train data:


Confusion matrix and classification report of test data:

Question 1.6 Build Logistic Regression Model (using the statsmodels library) on the most important variables on the Train Dataset and choose the optimum cutoff. Also showcase your model-building approach.

Ans : Logistic regression is a process of modeling the probability of a discrete outcome given an
input variable. The most common logistic regression models a binary outcome; something that can
take two values such as true/false, yes/no, and so on.
For model building, we use recursive feature elimination (RFE) and select the top 15 features that contribute most to the model.

For Modeling we will use Logistic Regression with recursive feature elimination.

Below are the highest contributing independent variables to the model building.
Now let's fit the model and examine the confusion matrix and classification report (a sketch of the approach follows):
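A sketch of the model-building approach outlined above: RFE selects the top 15 features and a statsmodels Logit is fitted on them. The estimator wiring and the 0.5 cutoff shown here are placeholders for the report's actual choices:

import statsmodels.api as sm
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Recursive feature elimination to keep the 15 most important predictors.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=15)
rfe.fit(X_train, y_train)
selected = X_train.columns[rfe.support_]

# Logistic regression on the selected features using statsmodels.
logit_model = sm.Logit(y_train, sm.add_constant(X_train[selected])).fit()
print(logit_model.summary())

# Apply a probability cutoff and report train-set performance.
cutoff = 0.5
train_prob = logit_model.predict(sm.add_constant(X_train[selected]))
train_pred = (train_prob > cutoff).astype(int)
print(confusion_matrix(y_train, train_pred))
print(classification_report(y_train, train_pred))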
Question 1.7 Validate the Model on the Test Dataset and state the performance metrics. Also state the interpretation from the model.

Ans: We train the model and then validate it on both the training and testing sets.

• For both sets, we plot the confusion matrix and the classification report. The training data shows high precision and accuracy, but recall appears to be lower.
• As recall improves, we capture more True Positives (TP), i.e., correctly identified defaulters; if we miss a defaulter, the bank bears the cost of the bad debt and its cash flow is not regularized.

Confusion matrix and Classification Report for the training set:


Confusion Matrix :
[[2134 17]
[ 81 170]]

Classification Report :
Confusion matrix and Classification Report for the test set:

Confusion matrix :

Classification Report

Observations:
• Over 94% accuracy was achieved, while recall, precision and F1 score were also very high at 99%, 94% and 97% respectively.
• We observed that accuracy and precision (the ratio of True Positives to all predicted Positives) are on the higher side for both sets, but recall for the default (minority) class seems to be the main problem. It is likely that we had an imbalanced dataset for our model.
• Therefore, we balance the classes (the ratio of 0's to 1's is heavily skewed); only about 11% of the records in our dataset were defaults, so we apply SMOTE before fitting the model again.

We see a poor recall score for the default class on both train and test. Since only 10.8% of the total data had defaults, we will now try to balance the data before fitting the model.

I have imported SMOTE from the imblearn library for better handling of the class imbalance.
pred_train_smote = selector_smote.predict(X_res)
pred_test_smote = selector_smote.predict(X_test)
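A fuller sketch around the two prediction lines above, assuming 'selector_smote' is an RFE-wrapped logistic regression refitted on the SMOTE-resampled training data (X_res, y_res):

from imblearn.over_sampling import SMOTE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Oversample the minority (default) class on the training data only.
sm_over = SMOTE(random_state=42)
X_res, y_res = sm_over.fit_resample(X_train, y_train)

# Refit the feature selector / logistic regression on the balanced data.
selector_smote = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=15)
selector_smote.fit(X_res, y_res)

pred_train_smote = selector_smote.predict(X_res)
pred_test_smote = selector_smote.predict(X_test)
print(classification_report(y_res, pred_train_smote))
print(classification_report(y_test, pred_test_smote))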

• Recall has improved greatly, which means the chances of identifying defaulters have increased.
• Detection of defaulters has improved significantly, and there is less chance of the model missing any potential default candidates/companies for our bank.

Classification Report for the Training Set:

Classification Report for the Test Set:

Finally, we are able to achieve a decent recall value without overfitting. Considering issues such as outliers, missing values and correlated features, this is a fairly good model. It could be improved with better-quality data in which the features explaining default are not missing to this extent. Of course, we can also try other techniques that are less sensitive to missing values and outliers.
