Machine_learning_lab_manual r
University Bhopal
Index
S.No. List of Experiments
1 Installation of Anaconda Distribution in Windows Operating System.
2 Case study of how to get various Datasets for Training.
3 Import Dataset, using Dataset loading utilities (sklearn).
4 Implementation of Linear Regression Algorithm.
5 Implementation of K-Means Clustering Algorithm.
6 Study of Training and Testing Datasets.
7 Implementation of Logistic Regression.
8 Visualization of Data using Matplotlib.
9 Implementation of Decision Trees.
10 Implementation of Support Vector Machines (SVM).
Aim:- Installation of Anaconda and Python in Windows Operating System
To learn machine learning, we will use the Python programming language. So,
in order to use Python for machine learning, we need to install it on our
computer system along with a compatible IDE (Integrated Development Environment).
In this practical, we will learn to install Python and an IDE with the help of
the Anaconda distribution.
Anaconda distribution is a free and open-source platform for the Python and R
programming languages. It can be installed on any OS such as Windows,
Linux, and macOS. It provides more than 1500 Python/R data science packages
which are suitable for developing machine learning and deep learning models.
Anaconda distribution installs Python together with tools such as
Jupyter Notebook, Spyder, and Anaconda Prompt. Hence it is a very convenient
packaged solution which you can easily download and install on your computer. It
will automatically install Python along with some basic IDEs and libraries.
The steps below show the process of downloading and installing
Anaconda and an IDE:
Step-1: Download Anaconda Python:
To download Anaconda on your system, first open your favorite browser and
search for "Download Anaconda Python", then click on the first link as given in the
below image. Alternatively, you can download it directly from this link:
https://www.anaconda.com/distribution/#download-section
• Since Anaconda is available for Windows, Linux, and macOS, you can download
it as per your OS type by clicking on the available options shown in the below image.
It will offer Python 2.7 and Python 3.7 versions; the latest version is 3.7, hence
we install the Python 3.7 version.
Step-2: Install Anaconda Python (Python 3.7 version):
Once the download is complete, go to Downloads → double click on the ".exe" file
(Anaconda3-2019.03-Windows-x86_64.exe) of Anaconda. It will open a setup window for
the Anaconda installation as given in the below image, then click on Next.
• Once the installation is complete, click on Next.
Step-3: Open Anaconda Navigator
• After successful installation of Anaconda, use Anaconda Navigator to launch a Python IDE
such as Spyder or Jupyter Notebook.
• To open Anaconda Navigator, press the Windows key, search for Anaconda Navigator,
and click on it. Consider the below image:
Run your Python program in Spyder IDE.
• Open Spyder IDE; it will look like the below image:
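A quick way to confirm that the installation works is to run a small script in Spyder. The sketch below is only an illustration: the function name describe_environment is hypothetical, and the check for "conda" in the version string is a heuristic that may not hold for every Anaconda release.

```python
import platform
import sys


def describe_environment():
    """Return basic facts about the Python interpreter that is running."""
    version_text = sys.version.lower()
    executable = (sys.executable or "").lower()
    return {
        "python_version": platform.python_version(),
        "executable": sys.executable,
        # Heuristic only: Anaconda builds often mention "conda" in the
        # version banner or in the interpreter path.
        "looks_like_conda": ("conda" in version_text) or ("conda" in executable),
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

If the script prints a Python version and an interpreter path, the interpreter launched by Spyder is working.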
Aim:- Case Study of how to get various Datasets for Training.
The key to success in the field of machine learning or to become a great data scientist is to
practice with different types of datasets. But discovering a suitable dataset for each kind of
machine learning project is a difficult task. So, we will provide the detail of the sources from
where you can easily get the dataset according to your project.
Before knowing the sources of machine learning datasets, let's discuss datasets.
What is a dataset?
A dataset is a collection of data in which the data is arranged in some order. A dataset can contain
any data, from a series or an array to a database table. The table below shows an example of a
dataset:
Types of data in datasets
• Numerical data: Such as house price, temperature, etc.
• Categorical data: Such as Yes/No, True/False, Blue/Green, etc.
• Ordinal data: These data are similar to categorical data but can be measured on the basis of
comparison.
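The three kinds of data can be illustrated with a small pandas table. This is a minimal sketch, assuming pandas is installed; the column names and values are hypothetical and chosen only to show one column of each kind.

```python
import pandas as pd

# Hypothetical table with one column of each kind of data.
df = pd.DataFrame({
    "price": [250000.0, 310000.0, 180000.0],     # numerical
    "has_garage": ["Yes", "No", "Yes"],          # categorical (no order)
    "condition": ["good", "poor", "excellent"],  # ordinal (ordered levels)
})

# Plain categorical data: the categories carry no order.
df["has_garage"] = df["has_garage"].astype("category")

# Ordinal data: declare the order of the levels explicitly.
df["condition"] = pd.Categorical(
    df["condition"],
    categories=["poor", "good", "excellent"],
    ordered=True,
)

condition_codes = df["condition"].cat.codes.tolist()
better_than_poor = (df["condition"] > "poor").tolist()

print(df.dtypes)
print(condition_codes)    # [1, 0, 2]
print(better_than_poor)   # [True, False, True]
```

Because the condition column is declared as ordered, comparisons such as "better than poor" are meaningful, which is not the case for the unordered has_garage column.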
Need of Dataset
To work on machine learning projects, we need a large amount of data, because without
data one cannot train ML/AI models. Collecting and preparing the dataset is one of the most
crucial parts of creating an ML/AI project.
The technology behind any ML project cannot work properly if the dataset is not well
prepared and pre-processed.
During the development of an ML project, the developers rely completely on the datasets. In
building ML applications, datasets are divided into two parts:
• Training dataset
• Test dataset
Note: The datasets are large, so to download them you should have a fast
internet connection on your computer.
Popular sources for Machine Learning datasets
Below is the list of dataset sources which are freely available for the public to work on:
1. Kaggle Datasets
2. UCI Machine Learning Repository
3. Datasets via AWS
We can search, download, access, and share the datasets that are publicly available via AWS
resources. These datasets can be accessed through AWS resources but are provided and maintained
by different government organizations, researchers, businesses, or individuals.
Anyone can analyze and build various services using shared data via AWS resources. A shared
dataset on the cloud helps users spend more time on data analysis rather than on acquisition of
data.
This source provides various types of datasets with examples and ways to use each dataset. It
also provides a search box with which we can search for the required dataset. Anyone can add
any dataset or example to the Registry of Open Data on AWS.
4. Google's Dataset Search Engine
Google Dataset Search is a search engine launched by Google on September 5, 2018.
This source helps researchers to get online datasets that are freely available for use.
The link for the Google dataset search engine is https://toolbox.google.com/datasetsearch.
5. Microsoft Datasets
6. Awesome Public Dataset Collection
Awesome public dataset collection provides high-quality datasets that are arranged in a well-
organized manner within a list according to topics such as Agriculture, Biology, Climate,
Complex networks, etc. Most of the datasets are available free, but some may not, so it is better
to check the license before downloading the dataset.
The link to download datasets from the Awesome public dataset collection is
https://github.com/awesomedata/awesome-public-datasets.
7. Government Datasets
There are different sources to get government-related data. Various countries publish government
data for public use, collected by them from different departments.
The goal of providing these datasets is to increase transparency of government work among the
people and to encourage innovative use of the data. Below are some links to government
datasets:
• Indian Government dataset
8. Computer Vision Datasets
VisualData provides a number of datasets that are specific to computer vision tasks
such as image classification, video classification, image segmentation, etc. Therefore, if you
want to build a project on deep learning or image processing, you can refer to this source.
The link for downloading datasets from this source is https://www.visualdata.io/.
9. Scikit-learn dataset
Aim:- Import Dataset, using Dataset loading utilities (sklearn).
scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library
for the Python programming language. scikit-learn makes available a host of datasets for testing learning
algorithms.
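As an illustration of the dataset loading utilities, the sketch below loads the Iris dataset that ships with scikit-learn. The choice of Iris is an assumption made for this example; the manual does not say which bundled dataset was used.

```python
from sklearn.datasets import load_iris

# load_iris returns a Bunch object holding the data and its description.
iris = load_iris()

X = iris.data      # feature matrix, one row per flower
y = iris.target    # integer class label for each row

print(X.shape)               # (150, 4)
print(iris.feature_names)    # names of the four measurements
print(iris.target_names)     # names of the three species
print(X[:5])                 # first five samples
```

Other loaders in sklearn.datasets follow the same pattern and expose the data through the data and target attributes.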
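What is linear regression?
Linear regression models the relationship between an input feature and a continuous target by
fitting a straight line to the data; the fitted line is then used to predict the target for new inputs.
        2    2
        3    3
        4    4
In [5]: price = df.price
        price
Out[5]: 0    2
        1    3
        2    5
        3    4
        4    6
        Name: price, dtype: int64
Out[6]: LinearRegression()
In [7]: reg.predict([[10]])
C:\Users\Aftab\anaconda3\lib\site-packages\sklearn\base.py:450: UserWarning: X does not have valid feature
names, but LinearRegression was fitted with feature names
  warnings.warn(
Out[7]: array([11.2])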
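The cells above are incomplete, so the following is a self-contained sketch of the same experiment rather than the original notebook. The price values are taken from Out[5]. The feature values 0 to 4 and the column name "x" are assumptions: they are consistent with the surviving output rows and with the prediction 11.2 in Out[7], but the original column is not shown.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# price comes from the notebook output; "x" and its values are assumed.
df = pd.DataFrame({
    "x": [0, 1, 2, 3, 4],
    "price": [2, 3, 5, 4, 6],
})

reg = LinearRegression()
reg.fit(df[["x"]], df["price"])   # features must be 2-D, target 1-D

# Passing a DataFrame with the same column name avoids the
# "X does not have valid feature names" warning seen with [[10]].
query = pd.DataFrame({"x": [10]})
prediction = reg.predict(query)

print(reg.coef_, reg.intercept_)  # slope about 0.9, intercept about 2.2
print(prediction)                 # about 11.2
```

The warning in the notebook is harmless: it appears because the model was fitted on a DataFrame with column names and then asked to predict from a plain nested list.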
What is K-Means Clustering?
K-Means Clustering is an unsupervised learning algorithm which groups an unlabeled dataset into
different clusters. Here K defines the number of clusters that need to be created in the process:
if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.
"It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that
each data point belongs to only one group that has similar properties."
In [5]: df['cluster'] = predicted
        df.head()
        df1 = df[df.cluster == 0]
        df2 = df[df.cluster == 1]
        df3 = df[df.cluster == 2]
        plt.scatter(df1.rollno, df1['marks'], color='green')
        plt.scatter(df2.rollno, df2['marks'], color='red')
        plt.scatter(df3.rollno, df3['marks'], color='blue')
        plt.xlabel('rollno')
        plt.ylabel('marks')
Out[5]: Text(0, 0.5, 'marks')
[Figure: scatter plot of marks (y-axis, about 40 to 160) against rollno (x-axis, about 27.5 to 42.5), with the three clusters shown in green, red, and blue]
        scale.fit(df[['rollno']])
        df['rollno'] = scale.transform(df[['rollno']])
In [7]: km = KMeans(n_clusters=3)
        predicted = km.fit_predict(df[['rollno', 'marks']])
        predicted
Out[7]: array([2, 2, 2, 2, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, , 1])
In [8]: df = df.drop(['cluster'], axis='columns')
In [9]: df['cluster'] = predicted
        df.head()
        df1 = df[df.cluster == 0]
        df2 = df[df.cluster == 1]
        df3 = df[df.cluster == 2]
        plt.scatter(df1.rollno, df1['marks'], color='green')
        plt.scatter(df2.rollno, df2['marks'], color='red')
        plt.scatter(df3.rollno, df3['marks'], color='blue')
        plt.xlabel('rollno')
        plt.ylabel('marks')
Out[9]: Text(0, 0.5, 'marks')
[Figure: scatter plot of marks against rollno after scaling (x-axis from 0.0 to 1.0), colored by cluster]
In [10]: km.cluster_centers_
Out[10]: array([[0.B72549, 0.11585945],
                [0.72268908, 0.8974359],
                [0.86764106, 0.1965812]])
[Figure: scatter plot on the scaled axes (x-axis from 0.0 to 1.0)]
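The notebook cells for this experiment are fragmentary, so the workflow they suggest (scale the features, fit K-Means with three clusters, inspect the labels and centres) can be sketched end to end as follows. The rollno and marks values are hypothetical, and the use of MinMaxScaler is an assumption; the source only shows a scaler object named scale.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: three loose groups of students.
raw = pd.DataFrame({
    "rollno": [27, 28, 29, 30, 35, 36, 37, 38, 40, 41, 42, 43],
    "marks": [45, 50, 55, 60, 150, 155, 160, 145, 50, 60, 65, 55],
})

# Scale both columns to the range [0, 1] so that neither dominates
# the distance computation (MinMaxScaler is an assumed choice).
scaler = MinMaxScaler()
df = pd.DataFrame(scaler.fit_transform(raw), columns=raw.columns)

# n_init and random_state are fixed only to make the sketch repeatable.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(df[["rollno", "marks"]])
df["cluster"] = labels

print(labels.tolist())
print(km.cluster_centers_)   # one (rollno, marks) centre per cluster
```

The cluster numbers themselves are arbitrary: a different random_state may assign the label 0 to a different group, while the grouping of the points stays the same.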
Training and Testing Dataset:-
The training data is the largest subset of the original dataset and is used to train or fit the
machine learning model. The training data is fed to the ML algorithm first, which lets it learn how to
make predictions for the given task.
The test dataset is another subset of the original data which is used to check the accuracy of the model.
     0    5000
     1    4500
     2    7000
     3    6000
     4    3500
     5    2000
     Name: price, dtype: int64
[13] from sklearn.model_selection import train_test_split
     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)
[14] X_train
     1    600    3
[15] X_test
[16] y_train
     0    5000
     3    6000
     4    3500
     1    4500
     Name: price, dtype: int64
     y_test
     7000
     2000
• This data consists of 6 samples and 3 columns. The data is about scooter price, which depends on
the distance travelled and the number of years the scooter has been used. Hence price is the dependent
variable, and distance and years are the independent variables.
• X holds the independent variables (distance and years) and y holds the dependent
variable (price).
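The split described above can be sketched as a self-contained script. The price values and the row with distance 600 and 3 years come from the notebook output; all other distance and years values, and the column names distance and years, are hypothetical because the original table is not shown.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    # Only the value 600 (row 1) is from the source; the rest are assumed.
    "distance": [500, 600, 300, 400, 900, 1200],
    # Only the value 3 (row 1) is from the source; the rest are assumed.
    "years": [2, 3, 1, 2, 5, 7],
    # Prices are taken from the notebook output.
    "price": [5000, 4500, 7000, 6000, 3500, 2000],
})

X = df[["distance", "years"]]   # independent variables
y = df["price"]                 # dependent variable

# test_size=0.3 on 6 samples gives 2 test rows and 4 training rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10
)

print(len(X_train), len(X_test))   # 4 2
print(X_train)
print(y_test)
```

Fixing random_state makes the split repeatable, so the same rows land in the training and test sets on every run.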
What is Logistic Regression?
Logistic Regression is similar to Linear Regression except in how it is used. Linear Regression is
used for solving regression problems, whereas Logistic Regression is used for solving classification
problems.
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression()
    model.fit(X_train, y_train)
    LogisticRegression()
[8] y_predicted = model.predict(X_test)
    y_predicted
    array([1, 1, 0, 0, 0, 0])
[9] model.score(X_test, y_test)
    1.0
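The cells above rely on X_train, X_test, y_train, and y_test that are created in cells not shown here. The following sketch fills that gap with hypothetical data (hours studied versus pass or fail), so its predictions and score will not necessarily match the outputs above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: students who studied more than 10 hours passed (1).
hours = np.arange(1, 21).reshape(-1, 1)       # 20 samples, 1 feature
passed = (hours.ravel() > 10).astype(int)     # binary class labels

X_train, X_test, y_train, y_test = train_test_split(
    hours, passed, test_size=0.3, random_state=0, stratify=passed
)

model = LogisticRegression()
model.fit(X_train, y_train)

y_predicted = model.predict(X_test)           # predicted class per test row
accuracy = model.score(X_test, y_test)        # fraction predicted correctly

print(y_predicted)
print(accuracy)
print(model.predict_proba([[10.5]]))          # class probabilities near the boundary
```

Unlike linear regression, the model outputs a class label (0 or 1), and predict_proba exposes the estimated probability of each class.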
• Plot Line