
Cheat Sheet: Python For Data Science

This document provides a cheat sheet on using Python and the Scikit-learn library for machine learning. It summarizes the main steps in a machine learning workflow: loading and preparing data, choosing a model, training and testing it, tuning hyperparameters, and evaluating performance. Key estimators for supervised learning, unsupervised learning, and dimensionality reduction are listed.

Uploaded by Shishir Ray

PYTHON FOR DATA SCIENCE CHEAT SHEET

Working On Model

Model Choosing

Supervised Learning Estimators:
• Linear Regression:
>>> from sklearn.linear_model import LinearRegression
>>> new_lr = LinearRegression()  # normalize=True was removed in scikit-learn 1.2; scale features with StandardScaler instead
• Naive Bayes:
>>> from sklearn.naive_bayes import GaussianNB
>>> new_gnb = GaussianNB()
• KNN:
>>> from sklearn import neighbors
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=1)
• Support Vector Machine:
>>> from sklearn.svm import SVC
>>> new_svc = SVC(kernel='linear')

Unsupervised Learning Estimators:
• Principal Component Analysis (PCA):
>>> from sklearn.decomposition import PCA
>>> new_pca = PCA(n_components=0.95)
• K Means:
>>> from sklearn.cluster import KMeans
>>> k_means = KMeans(n_clusters=5, random_state=0)

Model Training

Supervised:
>>> new_lr.fit(X, y)
>>> knn.fit(X_train, y_train)
>>> new_svc.fit(X_train, y_train)
Unsupervised:
>>> k_means.fit(X_train)
>>> pca_model_fit = new_pca.fit_transform(X_train)
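The fit calls above assume the estimators and data already exist. As a self-contained sketch, the same choose-then-fit pattern can be run end to end; the synthetic data and shapes below are illustrative assumptions, not part of the original sheet:

```python
# Fitting a supervised and two unsupervised estimators on synthetic data.
import numpy as np
from sklearn import neighbors
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_train = rng.random_sample((100, 5))   # 100 samples, 5 features
y_train = rng.randint(0, 2, 100)        # binary labels

# Supervised: fit() learns a mapping from X_train to y_train.
knn = neighbors.KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# Unsupervised: fit() uses X_train only, no labels.
k_means = KMeans(n_clusters=5, random_state=0, n_init=10)
k_means.fit(X_train)

# PCA keeping enough components to explain 95% of the variance.
new_pca = PCA(n_components=0.95)
pca_model_fit = new_pca.fit_transform(X_train)
print(pca_model_fit.shape)
```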
Introduction
Scikit-learn ("sklearn") is a machine learning library for the Python programming language. It provides simple and efficient tools for data mining, data analysis, and machine learning.
Importing convention: import sklearn
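The workflow this sheet summarizes (load, split, scale, fit, predict, evaluate) can be sketched in one short script; the iris dataset and the linear-kernel SVC below are illustrative choices, not prescribed by the sheet:

```python
# End-to-end sketch: load data, split, scale, train, predict, evaluate.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the scaler on training data only, then apply to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Choose, train, and evaluate a model.
model = SVC(kernel='linear')
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)
print(accuracy_score(y_test, y_pred))
```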

Preprocessing

Data Loading

• Using NumPy:
>>> import numpy as np
>>> a = np.array([(1, 2, 3, 4), (7, 8, 9, 10)], dtype=int)
>>> data = np.loadtxt('file_name.csv', delimiter=',')
• Using Pandas:
>>> import pandas as pd
>>> df = pd.read_csv('file_name.csv', header=0)

Train-Test Data

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Prediction

Supervised:
>>> y_predict = new_svc.predict(np.random.random((3, 5)))
>>> y_predict = new_lr.predict(X_test)
>>> y_predict = knn.predict_proba(X_test)
Unsupervised:
>>> y_pred = k_means.predict(X_test)

Model Tuning

Grid Search:
>>> from sklearn.model_selection import GridSearchCV  # replaces the old sklearn.grid_search module
>>> params = {"n_neighbors": np.arange(1, 3), "metric": ["euclidean", "cityblock"]}
>>> grid = GridSearchCV(estimator=knn, param_grid=params)
>>> grid.fit(X_train, y_train)
>>> print(grid.best_score_)
>>> print(grid.best_estimator_.n_neighbors)

Randomized Parameter Optimization:
>>> from sklearn.model_selection import RandomizedSearchCV
>>> params = {"n_neighbors": range(1, 5), "weights": ["uniform", "distance"]}
>>> rsearch = RandomizedSearchCV(estimator=knn, param_distributions=params, cv=4, n_iter=8, random_state=5)
>>> rsearch.fit(X_train, y_train)
>>> print(rsearch.best_score_)
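The tuning snippets above reference data defined elsewhere; a runnable version on a real dataset looks like this (the iris dataset and the specific parameter grids are illustrative assumptions):

```python
# Exhaustive grid search vs. randomized search over KNN hyperparameters.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier()

# Grid search tries every combination in the grid (2 x 2 = 4 fits per fold).
params = {"n_neighbors": np.arange(1, 3), "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=knn, param_grid=params)
grid.fit(X, y)
print(grid.best_score_, grid.best_estimator_.n_neighbors)

# Randomized search samples n_iter combinations from the distributions.
params = {"n_neighbors": range(1, 5), "weights": ["uniform", "distance"]}
rsearch = RandomizedSearchCV(estimator=knn, param_distributions=params,
                             cv=4, n_iter=8, random_state=5)
rsearch.fit(X, y)
print(rsearch.best_score_)
```

Grid search is exhaustive but grows combinatorially with the grid; randomized search caps the budget at `n_iter` fits, which scales better when the parameter space is large.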
Data Preparation

• Standardization:
>>> from sklearn.preprocessing import StandardScaler
>>> get_names = df.columns
>>> scaler = StandardScaler()
>>> scaled_df = scaler.fit_transform(df)
>>> scaled_df = pd.DataFrame(scaled_df, columns=get_names)
• Normalization:
>>> from sklearn import preprocessing
>>> df = pd.read_csv("File_name.csv")
>>> x_array = np.array(df['Column1'])  # normalize Column1
>>> normalized_X = preprocessing.normalize([x_array])

Evaluate Performance

Classification:
1. Confusion Matrix:
>>> from sklearn.metrics import confusion_matrix
>>> print(confusion_matrix(y_test, y_pred))
2. Accuracy Score:
>>> knn.score(X_test, y_test)
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_test, y_pred)

Regression:
1. Mean Absolute Error:
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2]
>>> mean_absolute_error(y_true, y_predict)
2. Mean Squared Error:
>>> from sklearn.metrics import mean_squared_error
>>> mean_squared_error(y_test, y_predict)
3. R² Score:
>>> from sklearn.metrics import r2_score
>>> r2_score(y_true, y_predict)

Clustering:
1. Homogeneity:
>>> from sklearn.metrics import homogeneity_score
>>> homogeneity_score(y_true, y_predict)
2. V-measure:
>>> from sklearn.metrics import v_measure_score
>>> v_measure_score(y_true, y_predict)

Cross-validation:
>>> from sklearn.model_selection import cross_val_score  # replaces the old sklearn.cross_validation module
>>> print(cross_val_score(knn, X_train, y_train, cv=4))
>>> print(cross_val_score(new_lr, X, y, cv=2))
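The metric calls above can be exercised in one self-contained sketch; the toy label and value arrays below are illustrative assumptions chosen so the calls run as-is:

```python
# Classification, regression, and clustering metrics, plus cross-validation.
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             mean_absolute_error, mean_squared_error,
                             r2_score, homogeneity_score, v_measure_score)
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

# Classification metrics on toy labels.
y_test = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))  # 3 of 4 correct -> 0.75

# Regression metrics on toy values.
y_true = [3, -0.5, 2]
y_predict = [2.5, 0.0, 2.0]
print(mean_absolute_error(y_true, y_predict))
print(mean_squared_error(y_true, y_predict))
print(r2_score(y_true, y_predict))

# Clustering metrics compare cluster assignments to reference labels;
# they are invariant to label permutation, so a relabeled identical
# partition scores 1.0.
print(homogeneity_score([0, 0, 1, 1], [1, 1, 0, 0]))
print(v_measure_score([0, 0, 1, 1], [1, 1, 0, 0]))

# Cross-validation returns one score per fold.
X, y = load_iris(return_X_y=True)
print(cross_val_score(KNeighborsClassifier(), X, y, cv=4))
```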