DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
LABORATORY MANUAL
COURSE CODE : AD3301
COURSE NAME : Data Exploration and Visualization
REGULATION : R2021
CLASS : II
SEMESTER : III
Ex.No:1 Installation of Data Analysis and Visualization Tool: Python
DATE
Packages that we will need
Python 3 and the following Python libraries/packages are needed for data exploration and visualization:
• jupyter
• jupyterlab
• numpy
• scipy
• pandas
• matplotlib
• seaborn
How to install Python and the packages
Install Anaconda, which provides a Python 3 environment together with all of the packages listed above. After you have installed Anaconda, verify the installation.
Optionally, install Altair and its example datasets as well:
$ conda install -c conda-forge altair vega_datasets
How to verify your installation
1. Open the Anaconda Navigator.
2. Find the JupyterLab tile and “launch” it.
3. Open a new notebook and enter the following code in a cell:
import numpy
import scipy
import pandas
import matplotlib
import seaborn
print("all good")
4. Click the "Run" icon; if the cell prints all good, the installation is verified.
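Optionally, print the installed version of each package as well (a small sketch; run it in the same notebook):
import numpy, scipy, pandas, matplotlib, seaborn
for mod in (numpy, scipy, pandas, matplotlib, seaborn):
    print(mod.__name__, mod.__version__)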
Result:
Thus, the Python tool was installed and verified successfully.
Ex.No:2 Exploratory Data Analysis
DATE
Aim:
To perform exploratory data analysis (EDA) on a dataset.
Procedure:
1. Import the dataset
2. View the head of the data
3. View the basic information and description of the data
4. Find the unique values of the data and check for duplicates
5. Plot a graph of the unique values
6. Verify the presence of null values and replace them
7. Visualize the needed data
Program:
#Load the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
#Load the data
df = pd.read_csv('titanic.csv')
#View the data
df.head()
df.info()
df.describe()
#Find the duplicates
df.duplicated().sum()
#unique values
df['Pclass'].unique()
df['Survived'].unique()
df['Sex'].unique()
array([3, 1, 2], dtype=int64)
array([0, 1], dtype=int64)
array(['male', 'female'], dtype=object)
#Plot the unique values
sns.countplot(x='Pclass', data=df)
#Find null values
df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
#Replace null values
df.replace(np.nan, 0, inplace=True)
#Check the changes now
df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 0
Embarked 0
dtype: int64
#Filter data
df[df['Pclass']==1].head()
#Boxplot
df[['Fare']].boxplot()
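As a refinement (not part of the procedure above), missing values can be imputed per column instead of replacing every NaN globally; a minimal sketch assuming the same titanic.csv:
import pandas as pd
df = pd.read_csv('titanic.csv')
df['Age'] = df['Age'].fillna(df['Age'].median())                   # numeric column: median
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])   # categorical column: mode
print(df[['Age', 'Embarked']].isnull().sum())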
Result:
Thus, the program to perform exploratory data analysis (EDA) on a dataset was executed successfully.
Ex.No:3.1 NumPy Arrays
DATE
Aim:
To write a program to work with NumPy arrays.
Procedure:
1. Create array using numpy
2. Access the element in the array
3. Retrieve element using slice operation
4. Compute calculation in the array
Program:
import numpy as np
a = np.array([1, 2, 3])   # Create a rank 1 array
print(type(a))            # Prints "<class 'numpy.ndarray'>"
print(a.shape)            # Prints "(3,)"
print(a[0], a[1], a[2])   # Prints "1 2 3"
a[0] = 5                  # Change an element of the array
print(a)                  # Prints "[5 2 3]"
b = np.array([[1,2,3],[4,5,6]])    # Create a rank 2 array
print(b.shape)                     # Prints "(2, 3)"
print(b[0, 0], b[0, 1], b[1, 0])   # Prints "1 2 4"
# Create the following rank 2 array with shape (3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
# [[2 3]
#  [6 7]]
b = a[:2, 1:3]
# A slice of an array is a view into the same data, so modifying it
# will modify the original array.
print(a[0, 1])   # Prints "2"
b[0, 0] = 77     # b[0, 0] is the same piece of data as a[0, 1]
print(a[0, 1])   # Prints "77"
a = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
print(a)   # prints "array([[ 1,  2,  3],
           #                [ 4,  5,  6],
           #                [ 7,  8,  9],
           #                [10, 11, 12]])"
# Create an array of indices
b = np.array([0, 2, 0, 1])
# Select one element from each row of a using the indices in b
print(a[np.arange(4), b])   # Prints "[ 1  6  7 11]"
# Mutate one element from each row of a using the indices in b
a[np.arange(4), b] += 10
print(a)
x = np.array([1, 2])   # Let numpy choose the datatype
print(x.dtype)         # Prints "int64" (or "int32" on Windows)
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)
# Elementwise sum; both produce the array
# [[ 6.0  8.0]
#  [10.0 12.0]]
print(x + y)
print(np.add(x, y))
x = np.array([[1,2],[3,4]])
print(np.sum(x))           # Compute sum of all elements; prints "10"
print(np.sum(x, axis=0))   # Compute sum of each column; prints "[4 6]"
print(np.sum(x, axis=1))   # Compute sum of each row; prints "[3 7]"
Output:
<class 'numpy.ndarray'>
(3,)
1 2 3
[5 2 3]
(2, 3)
1 2 4
2
77
[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]
[ 1  6  7 11]
[[11  2  3]
 [ 4  5 16]
 [17  8  9]
 [10 21 12]]
int32
[[ 6.  8.]
 [10. 12.]]
[[ 6.  8.]
 [10. 12.]]
10
[4 6]
[3 7]
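Beyond elementwise sums, NumPy also broadcasts arrays of compatible shapes; a small illustrative sketch (not part of the recorded output above):
import numpy as np
x = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
v = np.array([10, 20, 30])             # shape (3,)
print(x + v)   # v is broadcast across each row: [[11 22 33] [14 25 36]]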
Result:
Thus, the program using NumPy arrays was executed successfully.
Ex.No:3.2 Pandas Data Frames
DATE
Aim:
To write a program for working with pandas data frames.
Procedure:
1. Import the pandas library
2. Construct a pandas DataFrame
3. Modify and drop columns in the DataFrame
4. Calculate the median of a column in the DataFrame
Program:
import pandas as pd
data = pd.DataFrame({"x1":["y", "x", "y", "x", "x", "y"],   # Construct a pandas DataFrame
                     "x2":range(16, 22),
                     "x3":range(1, 7),
                     "x4":["a", "b", "c", "d", "e", "f"],
                     "x5":range(30, 24, -1)})
print(data)
data_row = data[data.x2 < 20]          # Keep only rows where x2 < 20
print(data_row)                        # Print pandas DataFrame subset
data_col = data.drop("x1", axis = 1)   # Drop a variable from the DataFrame
print(data_col)
data_med = data["x5"].median()         # Calculate the median of x5
print(data_med)
Output:
27.5
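Step 3 of the procedure also mentions modifying columns, which the program above does not show; a minimal standalone sketch:
import pandas as pd
data = pd.DataFrame({"x2": range(16, 22), "x3": range(1, 7)})
data["x3"] = data["x3"] + 1   # modify an existing column in place
data["x6"] = data["x2"] * 2   # add a derived column
print(data)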
Result:
Thus, the program to work with pandas data frames was executed successfully.
Ex.No:3.3 Basic Plots Using Matplotlib
DATE
Aim:
To write a program to visualize basic plots using Matplotlib.
Procedure:
1. Import the matplotlib library
2. Define the x and y data
3. Label the axes
4. Visualize the data using a line plot
Program:
from matplotlib import pyplot as plt
x = [20, 25, 37]
y = [25000, 40000, 60000]
plt.plot(x,y)
plt.xlabel("Age")
plt.ylabel('salary')
plt.title('Salary by age')
plt.show()
Output: (line plot of Salary by age)
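Other basic plots follow the same pattern; for example, the same data as a bar chart (a sketch, output not recorded here):
from matplotlib import pyplot as plt
x = [20, 25, 37]
y = [25000, 40000, 60000]
plt.bar(x, y)
plt.xlabel("Age")
plt.ylabel("Salary")
plt.title("Salary by age")
plt.show()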
Result:
Thus, the program to plot the basic plots using Matplotlib was executed.
Ex.No:4 Data Cleaning using R
DATE
Aim:
To explore variable and row filters and plotting features in R for cleaning and visualizing data.
Procedure:
1. Import the dplyr and ggplot2 libraries
2. Load the iris dataset
3. Rearrange the data using the dplyr select and filter functions
4. Plot the selected data
Program:
library(dplyr)   # select() and filter() helpers named in the procedure
iris_sub <- iris %>% filter(Sepal.Length > 5) %>% select(Sepal.Length, Species)
plot(iris_sub$Sepal.Length)
Result:
Thus, the program for cleaning and visualizing the data using R was executed.
Ex.No: 5 Time Series
DATE
Aim:
To write a program to visualize time series analysis.
Procedure:
1. Import the temperature dataset
2. Import the pandas and matplotlib libraries
3. Visualize the data using a line plot, histogram and boxplot
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# reading the dataset using read_csv
df = pd.read_csv("stock_data.csv",
parse_dates=True,
index_col="Date")
# displaying the first five rows of dataset
df.head()
from pandas import read_csv
from matplotlib import pyplot
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
# Note: pandas 2.x removed squeeze=True; use read_csv(...).squeeze("columns") instead
print(series.head())
Date
1981-01-01 20.7
1981-01-02 17.9
1981-01-03 18.8
1981-01-04 14.6
1981-01-05 15.8
Name: Temp, dtype: float64
from pandas import read_csv
from matplotlib import pyplot
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
series.plot()
pyplot.show()
from pandas import read_csv
from matplotlib import pyplot
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
series.plot(style='k.')
pyplot.show()
from pandas import read_csv
from matplotlib import pyplot
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
series.hist()
pyplot.show()
from pandas import read_csv
from pandas import DataFrame
from pandas import Grouper
from matplotlib import pyplot
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
groups = series.groupby(Grouper(freq='A'))
years = DataFrame()
for name, group in groups:
    years[name.year] = group.values
years.boxplot()
pyplot.show()
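A common follow-up is to smooth the series by resampling; a minimal sketch assuming the same CSV (the 'M' alias gives month-end means; recent pandas prefers 'ME'):
from pandas import read_csv
from matplotlib import pyplot
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze("columns")
monthly = series.resample('M').mean()   # mean temperature per month
monthly.plot()
pyplot.show()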
Result:
Thus, the program to visualize time series analysis was executed.
Ex.No:6 Interactive Map Visualization
DATE
Aim:
To represent data on a map using various map data sets, with a mouse rollover effect.
Procedure:
1. Import the library
2. Import map dataset
3. Specify the width, height and title for the mouse rollover
4. Visualize the map
Program:
pip install pyecharts
pip install echarts-countries-pypkg
pip install echarts-china-provinces-pypkg
pip install echarts-china-cities-pypkg
pip install echarts-china-counties-pypkg
import pyecharts
print(pyecharts.__version__)
import pandas as pd
from pyecharts.charts import Map
from pyecharts import options as opts
data = pd.read_excel('GDP.xlsx')
province = list(data["province"])
gdp = list(data["2019_gdp"])
data_pair = [list(z) for z in zip(province, gdp)]   # (province, value) pairs; avoids shadowing the built-in list
c = (
    Map(init_opts=opts.InitOpts(width="1000px", height="600px"))   # Initialize map size
    .set_global_opts(
        title_opts=opts.TitleOpts(title="2019 Provinces in GDP Distribution unit: 100 million yuan"),   # Configure title
        visualmap_opts=opts.VisualMapOpts(
            type_="scatter"   # Scatter type
        )
    )
    .add("GDP", data_pair, maptype="china")   # Pass the data pairs in; map type is China map
    .render("Map1.html")
)
Output: (interactive map of China with GDP values shown on mouse rollover, saved as Map1.html)
Result:
Thus, the program for mouse rollover in map visualization was executed successfully.
Ex.No:7 Cartographic Visualization
DATE
Aim:
To build cartographic visualization for multiple datasets involving states and districts in India.
Procedure:
1. Import the Basemap and geopandas libraries
2. Import the state data and map shapefile
3. Using matplotlib, add the title and attributes and display the map
Program:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
map = Basemap()
map.drawcoastlines()
plt.savefig('test.png')   # save before plt.show(), which clears the figure
plt.show()
pip install geopandas
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import shapefile as shp
from shapely.geometry import Point
sns.set_style('whitegrid')
fp = r'Maps_with_python\india-polygon.shp'
map_df = gpd.read_file(fp)
map_df_copy = gpd.read_file(fp)
map_df.head()
map_df.plot()
df = pd.read_csv('globallandslides.csv')
pd.set_option('display.max_columns', None)
df = df[df.country_name=="India"]
df["Year"] = pd.to_datetime(df["event_date"]).dt.year
ls_df = df[df.landslide_category=="landslide"]   # the name ls_df is used below
ls_df["admin_division_name"].replace("Nāgāland", "Nagaland",inplace = True)
ls_df["admin_division_name"].replace("Meghālaya", "Meghalaya",inplace = True)
ls_df["admin_division_name"].replace("Tamil Nādu", "Tamil Nadu",inplace = True)
ls_df["admin_division_name"].replace("Karnātaka", "Karnataka",inplace = True)
ls_df["admin_division_name"].replace("Gujarāt", "Gujarat",inplace = True)
ls_df["admin_division_name"].replace("Arunāchal Pradesh", "Arunachal Pradesh",inplace = True)
state_df = ls_df["admin_division_name"].value_counts()
state_df = state_df.to_frame()
state_df.reset_index(level=0, inplace=True)
state_df.columns = ['State', 'Count']
state_df.at[15,"Count"] = 69
state_df.at[0,"State"] = "Jammu and Kashmir"
state_df.at[20,"State"] = "Delhi"
state_df = state_df.drop(7)   # drop() returns a copy, so assign it back
#Merging the data
merged = map_df.set_index('st_nm').join(state_df.set_index('State'))
merged['Count'] = merged['Count'].replace(np.nan, O)
merged.head()
#Create figure and axes for Matplotlib and set the title
fig, ax = plt.subplots(1, figsize=(10, 10))
ax.axis('off')
ax.set_title('Number of landslides in India state-wise', fontdict={'fontsize': '20', 'fontweight': '10'})
# Plot the figure
merged.plot(column='Count', cmap='YlOrRd', linewidth=0.8, ax=ax, edgecolor='0',
            legend=True, legend_kwds={'label': "Number of landslides"})
Result:
Thus, the program to display the cartographic visualization of India was executed successfully.
Ex.No:8 EDA on Wine Quality Data Set
DATE
Aim:
To write a python program for EDA on Wine Quality Data Set.
Procedure:
1. Import the libraries
2. Import the wine dataset
3. Perform EDA to display the information and description of the data
4. Analyse the alcohol content and visualize it
Program:
import pandas as pd
df_red = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", delimiter=";")
df_white = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", delimiter=";")
df_red.columns
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')
df_red.iloc[100:110]
df_red.dtypes
fixed acidity float64
volatile acidity float64
citric acid float64
residual sugar float64
chlorides float64
free sulfur dioxide float64
total sulfur dioxide float64
density float64
pH float64
sulphates float64
alcohol float64
quality int64
dtype: object
df_red.describe()
df_red.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, O to 1598
Data columns (total 12 columns):
fixed acidity 1599 non-null float64
volatile acidity 1599 non-null float64
citric acid 1599 non-null float64
residual sugar 1599 non-null float64
chlorides 1599 non-null float64
free sulfur dioxide 1599 non-null float64
total sulfur dioxide 1599 non-null float64
density 1599 non-null float64
pH 1599 non-null float64
sulphates 1599 non-null float64
alcohol 1599 non-null float64
quality 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
import seaborn as sns
sns.set(rc={'figure.figsize': (14, 8)})
sns.countplot(x='quality', data=df_red)
sns.distplot(df_red['alcohol'])
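To relate alcohol content to quality (step 4 of the procedure), a grouped mean is one simple summary; a minimal sketch continuing from df_red above:
# Average alcohol content for each quality score
print(df_red.groupby('quality')['alcohol'].mean())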
Result:
Thus, the EDA on the wine quality data set was executed successfully.
Ex.No:9 Case Study on a Data Set to present an Analysis Report
DATE
Aim:
To analyse the diabetes data set from UCI and the Pima Indians Diabetes data set by performing the following analyses.
Procedure:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness and Kurtosis.
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets.
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness and Kurtosis.
import pandas as pd
import numpy as np
import statistics as st
# Load the data
df = pd.read_csv("data_desc.csv")
print(df.shape)
print(df.info())
Output:
(600, 10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 10 columns):
Marital_status 600 non-null object
Dependents 600 non-null int64
Is_graduate 600 non-null object
Income 600 non-null int64
Loan_amount 600 non-null int64
Term_months 600 non-null int64
Credit_score 600 non-null object
approval_status 600 non-null object
Age 600 non-null int64
Sex 600 non-null object
dtypes: int64(5), object(5)
memory usage: 47.0+ KB
None
Measures of Central Tendency
Measures of central tendency describe the center of the data, and are often represented by the
mean, the median, and the mode.
Mean
df.mean()
Output:
Dependents        0.748333
Income       705541.333333
Loan_amount  323793.666667
Term_months     183.350000
Age              49.450000
dtype: float64
It is also possible to calculate the mean of a particular variable in a data set, as shown below, where we calculate the mean of the variables 'Age' and 'Income'.
print(df.loc[:,'Age'].mean())
print(df.loc[:,'Income'].mean())
Output:
49.45
705541.33
It is also possible to calculate the mean of the rows by specifying the (axis = 1) argument. The code
below calculates the mean of the first five rows.
df.mean(axis = 1)[0:5]
Output:
0     70096.0
1    161274.0
2    125113.4
3    119853.8
4    120653.8
dtype: float64
Median
df.median()
Output:
Dependents        0.0
Income       508350.0
Loan_amount   76000.0
Term_months     192.0
Age              51.0
dtype: float64
Mode
df.mode()
Output:
  Marital_status  Dependents Is_graduate  Income  Loan_amount  Term_months Credit_score approval_status  Age Sex
0            yes           0         Yes   33330        70000        192.0 satisfactory             yes   55   M
Measures of Dispersion
The most popular measures of dispersion are standard deviation, variance, and the interquartile
range.
Standard Deviation
df.std()
Output:
Dependents        1.026362
Income       711421.814154
Loan_amount  724293.480782
Term_months      31.933949
Age              14.728511
dtype: float64
Variance
df.var()
Output:
Dependents     1.053420e+00
Income         5.061210e+11
Loan_amount    5.246010e+11
Term_months    1.019777e+03
Age            2.169290e+02
dtype: float64
Interquartile Range (IQR)
from scipy.stats import iqr
iqr(df['Age'])
Output:
25.0
Skewness
print(df.skew())
Output:
Dependents     1.169632
Income         5.344587
Loan_amount    5.006374
Term_months   -2.471879
Age           -0.055537
dtype: float64
The skewness values can be interpreted in the following manner:
• Highly skewed distribution: If the skewness value is less than −1 or greater than +1.
• Moderately skewed distribution: If the skewness value is between −1 and −½ or between +½
and +1.
• Approximately symmetric distribution: If the skewness value is between −½ and +½.
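Kurtosis, also listed in the aim, is computed the same way; a minimal sketch on the same data frame (pandas reports excess kurtosis, so a normal distribution is approximately 0):
print(df.kurt())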
Putting Everything Together
df.describe()
Output:
       Dependents        Income   Loan_amount  Term_months         Age
count  600.000000  6.000000e+02  6.000000e+02   600.000000  600.000000
mean     0.748333  7.055413e+05  3.237937e+05   183.350000   49.450000
std      1.026362  7.114218e+05  7.242935e+05    31.933949   14.728511
min      0.000000  3.000000e+04  1.090000e+04    18.000000   22.000000
25%      0.000000  3.849750e+05  6.100000e+04   192.000000   36.000000
50%      0.000000  5.083500e+05  7.600000e+04   192.000000   51.000000
75%      1.000000  7.661000e+05  1.302500e+05   192.000000   61.000000
max      6.000000  8.444900e+06  7.780000e+06   252.000000   76.000000
df.describe(include='all')
b. Bivariate analysis: Linear and logistic regression modeling
Linear Regression
import matplotlib.pyplot as plt
from scipy import stats
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
return slope * x + intercept
mymodel = list(map(myfunc, x))
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
Output: (scatter plot of x and y with the fitted regression line)
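The fitted parameters can also be inspected directly (a small sketch continuing from the variables above; values depend on the data):
print("slope:", slope, "intercept:", intercept, "r:", r)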
Logistic Regression
import numpy
from sklearn import linear_model
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
logr = linear_model.LogisticRegression()
logr.fit(X,y)
def logit2prob(logr, X):
log_odds = logr.coef_ * X + logr.intercept_
odds = numpy.exp(log_odds)
probability = odds / (1 + odds)
return(probability)
print(logit2prob(logr, X))
OUTPUT:
[[0.60749955]
[0.19268876]
[0.12775886]
[0.00955221]
[0.08038616]
[0.07345637]
[0.88362743]
[0.77901378]
[0.88924409]
[0.81293497]
[0.57719129]
[0.96664243]]
Results Explained
3.78 0.61 The probability that a tumor with the size 3.78cm is cancerous is 61%.
2.44 0.19 The probability that a tumor with the size 2.44cm is cancerous is 19%.
2.09 0.13 The probability that a tumor with the size 2.09cm is cancerous is 13%.
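The fitted model can also classify new observations directly (a sketch continuing from logr above; the input size 3.46 is an arbitrary example):
print(logr.predict(numpy.array([[3.46]])))   # predicts class 0 or 1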
c. Multiple Regression analysis
Multiple regression predicts the value of one dependent variable from the values of several independent variables.
import pandas as pd
from sklearn import linear_model
import statsmodels.api as sm
data = {'year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,
                 2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
        'month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
        'interest_rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,
                          2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
        'unemployment_rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,
                              6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
        'index_price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,
                        1047,965,943,958,971,949,884,866,876,822,704,719]
        }
df = pd.DataFrame(data)
x = df[['interest_rate','unemployment_rate']]
y = df['index_price']
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(x, y)
print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)
# with statsmodels
x = sm.add_constant(x) # adding a constant
model = sm.OLS(y, x).fit()
predictions = model.predict(x)
print_model = model.summary()
print(print_model)
OUTPUT:
Intercept:
1798.4039776258564
Coefficients:
[ 345.54008701 -250.14657137]
                            OLS Regression Results
==============================================================================
Dep. Variable:            index_price   R-squared:                       0.898
Model:                            OLS   Adj. R-squared:                  0.888
Method:                 Least Squares   F-statistic:                     92.07
Date:                Sat, 30 Jul 2022   Prob (F-statistic):           4.04e-11
Time:                        13:47:01   Log-Likelihood:                -134.61
No. Observations:                  24   AIC:                             275.2
Df Residuals:                      21   BIC:                             278.8
Df Model:                           2
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const             1798.4040    899.248      2.000      0.059     -71.685    3668.493
interest_rate      345.5401    111.367      3.103      0.005     113.940     577.140
unemployment_rate -250.1466    117.950     -2.121      0.046    -495.437      -4.856
==============================================================================
Omnibus:                        2.691   Durbin-Watson:                   0.530
Prob(Omnibus):                  0.260   Jarque-Bera (JB):                1.551
Skew:                          -0.612   Prob(JB):                        0.461
Kurtosis:                       3.226   Cond. No.                         394.
==============================================================================
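The fitted sklearn model can then predict the index price for hypothetical inputs (a sketch continuing from regr above; the rates are arbitrary examples):
new_x = [[2.75, 5.3]]   # interest_rate, unemployment_rate
print(regr.predict(new_x))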
Result:
Thus, the Univariate, Bivariate and Multiple Regression Analysis using the diabetes data set from
UCI and Pima Indians Diabetes data set was completed and verified successfully.