Unit 3

This document discusses various topics related to handling missing data in Python for data science. It covers identifying missing data, represented as None or NaN in Pandas, and imputing missing values by replacing them using statistical techniques. The document also discusses dealing with missing data as it is a common problem in machine learning and data analysis due to incomplete data collection. Handling missing data appropriately is important for building accurate predictive models.

Uploaded by

mr explorer

L. J. Institute of Engineering & Technology
Department of Computer Engineering

Python for Data Science (3150713)

Unit-3
Getting your hands dirty with data

INSTRUCTOR:
Vishal Parikh
Assistant Professor
[email protected]
Outline
▪ Jupyter Notebook
▪ Ipython Notebook
▪ Ipython help
▪ Magic functions
Jupyter/Ipython Notebook

Jupyter Notebook, or IPython Notebook, is a web application that allows you to run live code and combine it with visualizations and explanatory text, all in one place.
Ipython Help
Magic functions

▪ Magic commands, or magic functions, are one of the important enhancements that IPython offers compared to the standard Python shell.
▪ These magic commands are intended to solve common problems in data analysis using Python.
▪ There are two types of magic functions:
▪ Line Magics
▪ Cell Magics
Line Magic functions
▪ Prefix : %
▪ The rest of the line is the argument, passed without parentheses or quotes.
▪ Can be used as an expression, and the return value can be assigned to a variable.
Cell Magic functions
▪ Prefix : %%
▪ Operate on multiple lines (the whole cell).
▪ Information about a specific magic function is obtained by appending ? to its name, for example %timeit?
Outline
▪ Working with styles
▪ Multimedia and Graphics Integration
▪ Plots and images
▪ Loading data from online sites
▪ Accessing data in structured flat-file form
Multimedia and Graphics Integration
Data
▪ There are two types of data
1. Structured Data
• Data that is already in a row-and-column format, or that can easily be converted to rows and columns so that it fits nicely into a database, is known as structured data.
• E.g.: CSV, TXT, XLS files, etc.
2. Unstructured Data
• Data that has no fixed row-and-column layout (for example, lines that are not of fixed width) is known as unstructured data.
• E.g.: HTML, image or PDF files, etc.
Text File
▪ Opening a text file
• We use the built-in function open().
• The open() function returns a file object whose methods and attributes are used to perform various operations on the file.
• Syntax:
• file_object = open("filename", "mode")
➢ filename : the name (path) of the file to open.
➢ mode : the mode in which the file is opened, such as reading or writing.
Text File
▪ Modes :
• r : Opens a file for reading only. The file pointer is placed at the beginning
of the file. This is the default mode.
• r+ : Opens a file for both reading and writing. The file pointer placed at
the beginning of the file.
Text File
▪ Reading a text file
• Open the file using open() in r mode.
• If you have to both read and write data, open it in r+ mode.
• Read data from the file using the read(), readline() or readlines() methods.
1. read(size) :
• Returns at most the specified number of characters (bytes in binary mode) from the file.
• size : Optional. Default is -1, which means the whole file.
• Syntax : file_object.read(size)
2. readline(size)
• Returns one line from the file.
• size : Optional. Default is -1, which means the whole line.
• Syntax : file_object.readline(size)
3. readlines()
• Returns all the remaining lines of the file as a list of strings.
• Syntax : file_object.readlines()
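The reading methods above can be tried out with a short sketch. The file name sample.txt and its contents are assumptions made only for this illustration; the file is created in a temporary folder so the example is self-contained.

```python
import os
import tempfile

# Create a small throwaway file (name and contents are illustrative assumptions).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\nthird line\n")

# Mode "r" (the default) opens the file for reading only.
with open(path, "r") as f:
    first_five = f.read(5)        # read(size): at most 5 characters
    rest_of_line = f.readline()   # readline(): up to and including the next newline
    remaining = f.readlines()     # readlines(): list of all remaining lines

print(repr(first_five))
print(repr(rest_of_line))
print(remaining)
```

Using the with statement closes the file automatically, even if an error occurs while reading.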
CSV File
▪ Reading a CSV file
• CSV : Comma Separated Values
• We use the Pandas library to read CSV files.
• To read CSV files, pandas provides read_csv("filename"), which returns a DataFrame.
• Syntax
• data_frame = pandas.read_csv("filename")
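As a minimal sketch of read_csv, the example below reads CSV text from memory instead of a file on disk, so that it runs without any external file. The column names and values are made up for illustration.

```python
import io

import pandas as pd

# In-memory CSV text standing in for a real file (illustrative data).
csv_text = "name,age\nAsha,21\nRavi,23\n"

# read_csv accepts a file path or any file-like object.
data_frame = pd.read_csv(io.StringIO(csv_text))

print(data_frame.shape)           # (number of rows, number of columns)
print(list(data_frame.columns))   # column names taken from the header row
```

With a real file, the call would simply be pd.read_csv("filename.csv").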
Excel File
▪ Reading an Excel file
• We use the Pandas library to read Excel files.
• To read Excel files, pandas provides read_excel("filename"), which returns a DataFrame.
• Syntax
• data_frame = pandas.read_excel("filename")
HTML File
▪ Reading an HTML file
• We use the Pandas library to read tables from HTML files.
• To read HTML tables, pandas provides read_html("filename"), which returns a list of DataFrames, one per table found.
• Syntax
• list_of_data_frames = pandas.read_html("filename")
Interacting data from Relational Database

▪ To load data from an RDBMS for analysis we use the pandas library, and to connect to the RDBMS we use SQLAlchemy.
▪ SQLAlchemy supports MySQL, Oracle, PostgreSQL and Microsoft SQL Server.
Interacting data from NOSQL Database

▪ As more and more data becomes available in unstructured or semi-structured form, the need to manage it through NoSQL databases increases.
▪ We will use Python to interact with MongoDB as a NoSQL database.
▪ To connect to MongoDB, Python uses a library known as pymongo.
▪ Installation :
• conda install pymongo
Outline
▪ Accessing data in structured flat-file form
▪ Kernel
▪ Restoring a checkpoint
CSV File
▪ CSV stands for Comma Separated Values.
▪ A CSV file stores data in a tabular format as plain text, with the values in each row separated by commas.
▪ The file extension is .csv
Kernel
▪ Behind every notebook, a kernel is running.
▪ When you run a code cell, that code is executed within the kernel and any output is returned to the cell to be displayed.
▪ The kernel's state persists between cells: for example, if you import libraries or declare variables in one cell, they will be available in another.
▪ There are several options available for kernels:
▪ Interrupt : Stops the code that is currently running, keeping the variables already defined.
▪ Restart : Restarts the kernel, thus clearing all the variables etc. that were defined.
▪ Restart & Clear Output : Same as above, but also wipes the output displayed below your code cells.
▪ Restart & Run All : Same as above, but also runs all your cells in order from first to last.
Outline
▪ Dealing with Missing Data
▪ Finding the Missing Data
▪ Imputing the Missing Data
Dealing with Missing Data
▪ Missing data is always a problem in real life scenarios.
▪ Areas like machine learning and data mining face severe issues
in the accuracy of their model predictions because of poor
quality of data caused by missing values.
▪ In these areas, missing value treatment is a major point of focus
to make their models more accurate and valid.
▪ When and why does data go missing?
▪ Let us consider an online survey for a product.
▪ Many times, people do not share all the information related to them.
Dealing with Missing Data
▪ Missing data can occur when no information is provided for one or more items or for a whole unit.
▪ Missing data is also referred to as NA (Not Available).
▪ In Pandas, missing data is represented by two values:
• None : None is a Python singleton object that is often used for missing data in Python code.
• NaN : NaN (an acronym for Not a Number) is a special floating-point value.
▪ Functions such as isnull(), notnull(), dropna() and fillna() are available for detecting, removing, and replacing null values in a DataFrame.
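A minimal sketch of finding missing data is given below. The DataFrame and its column names are assumptions for illustration; isnull() marks both None and NaN as missing.

```python
import numpy as np
import pandas as pd

# Illustrative data: one missing value in each column.
df = pd.DataFrame({
    "age": [25, np.nan, 31],
    "city": ["Pune", None, "Surat"],
})

missing_mask = df.isnull()               # True wherever a value is missing
missing_per_column = df.isnull().sum()   # count of missing values per column
complete_rows = df.dropna()              # rows that contain no missing value

print(missing_per_column)
print(complete_rows)
```

dropna() returns a new DataFrame and leaves df unchanged.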
Imputing Missing Data
▪ Imputing refers to using a model to replace missing values.
▪ There are many options we could consider when replacing a
missing value, for example:
• A constant value that has meaning within the domain, such as 0, distinct
from all other values.
• A value from another randomly selected record.
• A mean, median or mode value for the column.
• A value estimated by another predictive model.
▪ Pandas provides the fillna() function for replacing missing values
with a specific value.
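As a small sketch of two of the options listed above, the example below fills missing values first with a constant and then with the column mean. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative series with two missing values.
scores = pd.Series([10.0, np.nan, 30.0, np.nan])

# Option 1: replace missing values with a constant.
filled_constant = scores.fillna(0)

# Option 2: replace missing values with the mean of the observed values.
# mean() skips NaN by default, so the mean here is (10 + 30) / 2 = 20.
filled_mean = scores.fillna(scores.mean())

print(filled_constant.tolist())
print(filled_mean.tolist())
```

fillna() returns a new Series, so the original scores still contains its missing values.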
Outline
▪ Slicing and dicing
▪ Filtering and selecting data
Slicing and Dicing
▪ In pandas, .loc, .iloc and .ix are three ways to select rows and columns of a DataFrame.
1. .loc[]
▪ .loc provides purely label-based indexing.
▪ When slicing by label, both the start and the stop bound are included.
▪ Integers are valid labels, but they refer to the label and not the position.
▪ .loc accepts −
• A single scalar label
• A list of labels
• A slice object
• A Boolean array
▪ .loc takes two arguments separated by a comma.
▪ The first one indicates the rows and the second one indicates the columns.
Slicing and Dicing
2. .iloc[]
▪ .iloc provides purely integer-position-based indexing.
▪ Indexes are 0-based, and when slicing the stop bound is excluded.
▪ Integers refer to the position and not the label.
▪ The various access methods are as follows −
• An integer
• A list of integers
• A range (slice) of values
▪ .iloc also takes two arguments separated by a comma: rows first, then columns.

3. .ix[]
▪ Based on both label and integer position.
▪ Pandas provided .ix as a hybrid indexer for selecting and subsetting the object.
▪ Deprecated (and removed in pandas 1.0); use .loc or .iloc instead.
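The difference between label-based and position-based selection can be seen in a short sketch. The DataFrame, its row labels and its column names are assumptions made for illustration.

```python
import pandas as pd

# Illustrative marks table with string row labels.
df = pd.DataFrame(
    {"math": [80, 65, 92], "physics": [70, 88, 75]},
    index=["asha", "ravi", "meera"],
)

by_label = df.loc["ravi", "math"]             # row label, column label
label_slice = df.loc["asha":"ravi", "math"]   # label slice: both ends included
by_position = df.iloc[1, 0]                   # second row, first column
position_slice = df.iloc[0:2, 0]              # position slice: stop excluded
bool_filter = df.loc[df["math"] > 70]         # Boolean array selects matching rows

print(by_label, by_position)
print(list(label_slice.index), list(position_slice.index))
print(list(bool_filter.index))
```

Here both slices return the rows for asha and ravi, but for different reasons: the label slice includes its stop label, while the position slice stops before position 2.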
Filtering and Selecting Data
Outline
▪ Concatenation and Transformation
▪ Adding new cases and variables
▪ Removing Data
▪ Sorting Data
▪ Aggregating Data
Concatenation and Transformation
Adding new cases and variables
Removing Data
Sorting Data
Aggregating Data
Outline
▪ Regular Expression
Regular Expression (RE)
▪ A regular expression is a string that contains special symbols and characters, used to find and extract the information we need from the given data.
▪ Often, we need to extract only the required information from the given data.
Regular Expression (RE)
▪ Python provides the re module for working with regular expressions.
▪ A regular expression is also simply called a regex.
▪ This module contains functions such as
1. compile()
2. search()
3. match()
4. findall()
5. split()
which are used to find information in the available data.
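The five functions listed above can be sketched on one sample string. The text being searched is made up for illustration.

```python
import re

text = "Order 66 shipped on 2021-05-04, order 7 is pending"

# compile(): build a reusable pattern object (one or more digits).
pattern = re.compile(r"\d+")

first = pattern.search(text)          # search(): first match anywhere in the string
at_start = pattern.match(text)        # match(): only matches at the very beginning
all_numbers = pattern.findall(text)   # findall(): list of every match
parts = re.split(r",\s*", text)       # split(): break the string at each comma

print(first.group())
print(at_start)
print(all_numbers)
print(parts)
```

match() returns None here because the string begins with a letter, not a digit, whereas search() scans forward until it finds 66.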
Sequence Character in RE
▪ Some of the special sequences beginning with "\" represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn't whitespace.
Symbol   Description                                          Equivalent (ASCII)
\d       Matches any digit.                                   [0-9]
\D       Matches any non-digit character.                     [^0-9]
\s       Matches any whitespace character.                    [ \t\n\r\f\v]
\S       Matches any non-whitespace character.                [^ \t\n\r\f\v]
\w       Matches any alphanumeric character or underscore.    [a-zA-Z0-9_]
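A short sketch of the special sequences in use is given below; the sample string is an assumption for illustration.

```python
import re

sample = "Room 42, Block B\tFloor 3"

digits = re.findall(r"\d", sample)         # every single digit
words = re.findall(r"\w+", sample)         # runs of letters, digits or underscore
whitespace = re.findall(r"\s", sample)     # spaces and the tab character
only_digits = re.sub(r"\D", "", sample)    # delete everything that is not a digit

print(digits)
print(words)
print(whitespace)
print(only_digits)
```

Writing the patterns as raw strings (the r prefix) keeps Python from interpreting the backslash before the re module sees it.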
Special Character and Pattern matching
Character       Meaning                                          Example
*               Zero or more occurrences of a character          ab*c matches ac, abc, abbc, and so on
+               One or more occurrences of a character           ab+c matches abc, abbc, and so on
?               Zero or one occurrence of a character            ab?c matches ac and abc
.               Any character                                    a.*c matches any substring starting with a and ending with c
[chars]         Any character inside the brackets                a[bB]c matches abc and aBc
[^chars]        Any character not inside the brackets            a[^bB]c matches a, followed by anything but b or B, followed by c
[char1-char2]   A range of characters                            a[a-z]c matches a, followed by any non-capitalized letter, followed by c
{num}           An exact number of occurrences of a character    ab{3}c matches abbbc
{num1,num2}     A number of occurrences of a character in a      ab{1,3}c matches abc, abbc and abbbc
                specified range
|               Matches either of two alternatives               abc|aBc matches abc or aBc
^               Matches the start of the string only             ^abc matches abc in abcd, but does not match abc in dabc
$               Matches the end of the string only               abc$ matches abc in dabc, but does not match abc in abcd
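The examples in the table can be checked with a small sketch. The helper fully_matches is a hypothetical function written for this illustration; it is not part of the re module.

```python
import re

def fully_matches(pattern, candidates):
    """Return the candidate strings that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = fully_matches(r"ab*c", ["ac", "abc", "abbc", "adc"])
plus = fully_matches(r"ab+c", ["ac", "abc", "abbc"])
optional = fully_matches(r"ab?c", ["ac", "abc", "abbc"])
negated = fully_matches(r"a[^bB]c", ["abc", "aBc", "axc"])
counted = fully_matches(r"ab{1,3}c", ["ac", "abc", "abbbc", "abbbbc"])

# ^ and $ anchor the pattern to the start and the end of the string.
starts_with = [s for s in ["abcd", "dabc"] if re.search(r"^abc", s)]
ends_with = [s for s in ["abcd", "dabc"] if re.search(r"abc$", s)]

print(star, plus, optional, negated, counted)
print(starts_with, ends_with)
```

re.fullmatch is used for the quantifier examples so that a candidate counts only if the whole string fits the pattern; re.search is used for the anchors because they look inside a longer string.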
Accessing data from Database

Mr. Vishal Parikh

Outline
❑ Interacting data from Relational Database
RDBMS
❑ The Python standard for database interfaces is the
Python DB-API.
❑ Python Database API supports a wide range of
database servers such as −
• MySQL
• PostgreSQL
• Microsoft SQL Server 2000
• Informix
• Interbase
• Oracle
• Sybase
Database Operations
❑ With the help of MySQL Database we can perform
following operations
– Creating Database
– Creating Database Table
– Insert Operation
– Retrieve Operation
– Update Operation
– Delete Operation
Creating Database Table
❑ Once a database connection is established, we are ready to create tables, or records in the database tables, using the execute() method of the created cursor.
❑ Syntax
• CREATE TABLE tablename (column_name data_type)
• To create a table inside the database, we pass this statement to the execute() method.
Insertion Operation
❑ We can easily insert a record into our table using an INSERT query.
❑ Syntax
• INSERT INTO table_name (list of columns) VALUES (list of values)
• To insert a record into our table, we pass this statement to the execute() method.
Retrieve Operation
❑ A retrieve operation on any database means fetching some useful information from the database.
❑ The following cursor methods and attributes can be used to extract data from the database.
– fetchone() : Fetches the next row of a query result set.
– fetchall() : Fetches all the remaining rows in a result set.
– rowcount : A read-only attribute that returns the number of rows affected by the last execute() call.
❑ Syntax
• SELECT * FROM table_name [WHERE condition]
Update Operation
❑ UPDATE Operation on any database means to
update one or more records, which are already
available in the database.
❑ Syntax
• UPDATE table_name SET column_name = value
WHERE column_name = value
Delete Operation
❑ DELETE operation is required when you want to
delete some records from your database.
❑ Syntax
• DELETE FROM table_name WHERE
column_name = value
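The operations above can be sketched end to end with a cursor. This sketch deliberately uses Python's built-in sqlite3 module with an in-memory database instead of MySQL, so that it runs without a database server; sqlite3 follows the same DB-API pattern of connection, cursor, execute() and fetch methods. The table and column names are assumptions for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for a MySQL server (assumption).
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# Create a table.
cursor.execute(
    "CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, marks INTEGER)"
)

# Insert records; the ? placeholders keep values separate from the SQL text.
cursor.execute("INSERT INTO student (name, marks) VALUES (?, ?)", ("Asha", 82))
cursor.execute("INSERT INTO student (name, marks) VALUES (?, ?)", ("Ravi", 67))
conn.commit()

# Retrieve records.
cursor.execute("SELECT name, marks FROM student WHERE marks > ? ORDER BY id", (60,))
first_row = cursor.fetchone()        # next row of the result set
remaining_rows = cursor.fetchall()   # all rows that are left

# Update a record.
cursor.execute("UPDATE student SET marks = ? WHERE name = ?", (75, "Ravi"))
updated = cursor.rowcount            # rows affected by the UPDATE

# Delete a record.
cursor.execute("DELETE FROM student WHERE name = ?", ("Asha",))
deleted = cursor.rowcount            # rows affected by the DELETE
conn.commit()

cursor.execute("SELECT name, marks FROM student")
final_rows = cursor.fetchall()
conn.close()

print(first_row, remaining_rows, updated, deleted, final_rows)
```

With a MySQL driver the SQL statements stay the same, but the connect() call and the placeholder style depend on the driver being used.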
Outline
❑ Stemming and Removing stop words
Stemming & removing stop words
❑ Stemming is the process of reducing words to their stem (or root) form, for example “running” to “run”.
❑ Stop words are very common words, such as “the”, “is” and “of”, that carry little meaning on their own.
❑ Stemming and removing stop words simplifies the text and reduces the number of textual elements so that only the essential elements remain.
❑ We just need to keep the terms that are nearest to the true sense of the phrase.
❑ With the phrases reduced, a computational algorithm can work faster and process the text more effectively.
Natural Language Toolkit (NLTK)
❑ The Natural Language Toolkit (NLTK) library is used whenever we want to perform stemming and remove stop words.
❑ We need to install NLTK and download its data:
1. Instructions for installing the NLTK data are available at
http://www.nltk.org/data.html
2. Import the package and download the NLTK data
import nltk
nltk.download()
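A minimal sketch of stemming and stop-word removal with NLTK is given below. To keep it runnable without downloading any data, it uses a tiny hand-written stop-word list, which is an assumption for illustration; NLTK's own list is available from nltk.corpus.stopwords after running nltk.download("stopwords"). The sentence is also made up.

```python
from nltk.stem import PorterStemmer

# Tiny hand-written stop-word list (assumption; not NLTK's full list).
stop_words = {"the", "is", "are", "in", "and", "a", "of"}

sentence = "the students are running in the gardens"
tokens = sentence.lower().split()

# Step 1: drop the stop words.
kept = [t for t in tokens if t not in stop_words]

# Step 2: reduce each remaining word to its stem.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in kept]

print(kept)
print(stems)
```

The Porter stemmer is only one of the stemmers that NLTK provides; it works by stripping common suffixes, so a stem is not always a dictionary word.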
Outline
❑ Bag of Words Model
Bag of Words
❑ The bag-of-words model is a way of representing
text data when modelling text with machine
learning algorithms.
❑ Simple and easy to implement.
❑ Bag-of-words is successful in problems such as
• Language Modelling
• Document Classification
The problem with Text
❑ A problem with modelling text is that it is messy,
and techniques like machine learning algorithms
prefer well defined fixed-length inputs and
outputs.
❑ Machine learning algorithms cannot work with
raw text directly; the text must be converted into
numbers. Specifically, vectors of numbers.
❑ We need to extract features from our text.
❑ A popular and simple method of feature extraction with text data is called the bag-of-words model of text.
Bag of Words Model
❑ It is a representation of text that describes the
occurrences of words within a document.
❑ Bag-of-Words involves two things
1. A vocabulary of known words.
2. A measure of presence of known words.
❑ It is called a “bag” of words, because any
information about the order or structure of words
in the document is discarded.
❑ The model is only concerned with whether known
words occur in the document, not where in the
document.
Steps for Bag of Words
❑ Step 1: Collect Data
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
❑ Here let us treat each line as a separate
“document” and the 4 lines as our entire corpus of
documents.
Steps for Bag of Words
❑ Step 2: Design the Vocabulary
❑ We can make a list of all of the words in our model
vocabulary.
❑ The unique words here are as follows (ignoring
case and punctuation marks)
1. it
2. was
3. the
4. best
5. of
6. times
7. worst
8. age
9. wisdom
10. foolishness
❑ Total 10 words from corpus containing 24 words.
Steps for Bag of Words
❑ Step 3: Create Document Vectors
❑ Here we score the words in each document.
❑ The simplest scoring method is to mark the
presence of words as a Boolean value
• 0 for absence
• 1 for presence
Example of Bag of Words
❑ Consider the document “It was the best of times”
❑ Scored against the designed vocabulary (it, was, the, best, of, times, worst, age, wisdom, foolishness), the document would look as follows:
• “it” = 1
• “was” = 1
• “the” = 1
• “best” = 1
• “of” = 1
• “times” = 1
• “worst” = 0
• “age” = 0
• “wisdom” = 0
• “foolishness” = 0
Example of Bag of Words
❑ Binary Vector representation:
• it was the best of times : [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
• it was the worst of times : [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
• it was the age of wisdom : [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
• it was the age of foolishness : [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
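The three steps can be sketched in plain Python, using the same four lines as the corpus. The helper tokenize is a hypothetical function written for this illustration.

```python
import string

# Step 1: collect the data (each line is one document).
corpus = [
    "It was the best of times,",
    "it was the worst of times,",
    "it was the age of wisdom,",
    "it was the age of foolishness,",
]

def tokenize(document):
    """Lower-case the text, strip punctuation, and split it into words."""
    cleaned = document.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# Step 2: design the vocabulary (unique words in order of first appearance).
vocabulary = []
for document in corpus:
    for word in tokenize(document):
        if word not in vocabulary:
            vocabulary.append(word)

# Step 3: create one binary vector per document (1 = present, 0 = absent).
vectors = [
    [1 if word in tokenize(document) else 0 for word in vocabulary]
    for document in corpus
]

print(vocabulary)
for vector in vectors:
    print(vector)
```

The printed vectors are the same binary vectors as listed above, because the vocabulary is built in the same order.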
Working with n-grams
❑ A more sophisticated approach is to create a vocabulary of grouped words.
❑ In this approach, each word or token is called a “gram”.
❑ Creating a vocabulary of two-word pairs is, in turn, called a bigram model.
❑ An N-gram is a sequence of N tokens (words):
• 2-gram : a sequence of two words.
Example : “please turn”, “turn your”, or “your homework”
• 3-gram : a sequence of three words.
Example : “please turn your”, or “turn your homework”.
Example n-grams
❑ Consider the document “It was the best of times”
❑ The bigrams of the document are as follows:
• “it was”
• “was the”
• “the best”
• “best of”
• “of times”
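The bigrams above can be generated with a few lines of Python. The function ngrams is a hypothetical helper written for this sketch, assuming that words are separated by spaces.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in the text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams("It was the best of times", 2)
trigrams = ngrams("It was the best of times", 3)

print(bigrams)
print(trigrams)
```

A document of 6 words gives 5 bigrams and 4 trigrams, since each n-gram starts one word later than the previous one.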
Outline
❑ Working with HTML Pages
❑ Parsing HTML Document
HTML
❑ HTML : Hyper Text Markup Language
❑ It is a standard markup language for Web pages.
❑ It describes structure of a Web page.
❑ It consists of a series of elements.
❑ HTML elements tell the browser how to display the
content.
A Simple HTML Document
❑ <!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
A Simple HTML Document
❑ <!DOCTYPE html> : Declaration defines that
this document is an HTML5 document.
❑ <html> : The root element of an HTML page.
❑ <head> : Contains meta information about the
HTML page.
❑ <title> : Specifies a title for HTML page.
❑ <body> : Defines the document’s body and it is
a container for all the visible content.
❑ <h1> : Defines a large heading.
❑ <p> : Defines a paragraph.
Parsing HTML Document

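❑ The Beautiful Soup library is used for parsing HTML documents.
❑ It turns the document into a tree of objects that can be navigated and searched.
❑ It also handles encodings for us, e.g. it automatically converts incoming documents to Unicode and outgoing documents to UTF-8.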
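As a minimal sketch, the simple HTML document shown earlier can be parsed with Beautiful Soup as follows. It assumes the beautifulsoup4 package is installed and uses Python's built-in html.parser as the underlying parser.

```python
from bs4 import BeautifulSoup

html_doc = """<!DOCTYPE html>
<html>
<head><title>Page Title</title></head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>"""

# Build the parse tree from the HTML text.
soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string                             # text inside <title>
heading_text = soup.find("h1").get_text()                  # first <h1> element
paragraphs = [p.get_text() for p in soup.find_all("p")]    # every <p> element

print(title_text)
print(heading_text)
print(paragraphs)
```

find() returns the first matching element, while find_all() returns a list of all matching elements.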
Outline
❑ Working with XML
❑ Parsing XML Document
XML
❑ XML : eXtensible Markup Language
❑ It is designed to store and transport data.
❑ It is designed to be both human-readable and machine-readable.
❑ It is used to distribute data over the internet.
XML Document
❑ XML creates a tree-like structure that is easy to
interpret.
❑ XML documents have sections called elements.
❑ A tag is a markup that begins with < and ends with
>.
❑ The top-level element is called root which contains
all other elements.
❑ Attributes are name-value pairs that exist within a start-tag or empty-element tag.
A Simple XML Document
Parsing XML Document
❑ Python has a built-in library, ElementTree, which provides functions to read and manipulate XML documents.
❑ Syntax
❑ import xml.etree.ElementTree as ET
❑ Beautiful Soup is a Python library that can also parse XML data.
❑ Installation
❑ pip install beautifulsoup4
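A minimal sketch of parsing XML with the built-in ElementTree library is given below. The XML text, with its library, book, title and price elements, is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative XML document: a root element containing two child elements.
xml_text = """<library>
  <book id="1"><title>Python Basics</title><price>250</price></book>
  <book id="2"><title>Data Science</title><price>400</price></book>
</library>"""

root = ET.fromstring(xml_text)   # parse the text and return the root element
root_tag = root.tag

books = root.findall("book")                          # direct children named book
titles = [book.find("title").text for book in books]  # text inside each <title>
ids = [book.get("id") for book in books]              # value of the id attribute
total_price = sum(int(book.find("price").text) for book in books)

print(root_tag, titles, ids, total_price)
```

To read from a file instead of a string, ET.parse("filename.xml").getroot() gives the same kind of root element.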
Outline
❑ TF IDF
TF
❑ TF : Term Frequency
❑ It measures the frequency of a word in a document.
❑ The raw count depends on the length of the document and the generality of the word.
❑ For example, a very common word such as “was” can appear multiple times in a document, and it will usually appear more often in a document of 10,000 words than in one of 100 words.
❑ We cannot conclude from this that the longer document is more important than the shorter one, so the count is normalised by the document length.
TF
❑ The normalised TF value lies in the range [0, 1], with both 0 and 1 inclusive.
❑ TF is specific to each document and word, hence we can formulate TF as follows.
❑ tf(t,d) = count of t in d / number of words in d
❑ t = term (word); d = document (set of words)
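The formula can be written directly as a small function. The name term_frequency is hypothetical, and the sketch assumes that words are separated by spaces.

```python
def term_frequency(term, document):
    """tf(t, d) = count of t in d / number of words in d."""
    words = document.lower().split()
    return words.count(term.lower()) / len(words)

# "was" occurs once among the 6 words of the document.
tf_was = term_frequency("was", "it was the best of times")

# A word that does not occur in the document has a TF of 0.
tf_worst = term_frequency("worst", "it was the best of times")

print(tf_was)
print(tf_worst)
```

Because the count is divided by the document length, the result always stays between 0 and 1.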
IDF
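❑ IDF : Inverse Document Frequency
❑ It measures how rare a word is across the collection of documents: words that occur in many documents get a low IDF.
❑ TF-IDF (term frequency-inverse document frequency) combines the two into a statistical measure that evaluates how relevant a word is to a document in a collection of documents.
❑ TF-IDF was invented for document search and information retrieval.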
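A minimal sketch of TF-IDF is given below. It assumes one common definition, idf(t) = log(N / number of documents containing t), where N is the number of documents; this formula is not stated in the slides, and libraries often use smoothed variants. The function names are hypothetical.

```python
import math

documents = [
    "it was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]
tokenized = [document.split() for document in documents]

def tf(term, words):
    """Term frequency: count of the term divided by the document length."""
    return words.count(term) / len(words)

def idf(term, all_docs):
    """Inverse document frequency; assumes the term occurs in at least one document."""
    containing = sum(1 for words in all_docs if term in words)
    return math.log(len(all_docs) / containing)

def tf_idf(term, words, all_docs):
    return tf(term, words) * idf(term, all_docs)

# "was" occurs in all 4 documents, so its IDF is log(4 / 4) = 0.
score_was = tf_idf("was", tokenized[0], tokenized)

# "best" occurs in only 1 document, so it scores higher for that document.
score_best = tf_idf("best", tokenized[0], tokenized)

print(score_was)
print(score_best)
```

A word that appears in every document gets a score of 0, which is how TF-IDF discounts very common words such as “was”.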
Outline
❑ Working with Graph Data
❑ Understanding Adjacency Matrix
❑ NetworkX Library
Graph
❑ A graph is a non-linear data structure consisting of nodes and edges.
❑ Nodes are also referred to as vertices, and the set of vertices is represented by V.
❑ Edges are the lines or arcs that connect any two nodes in the graph, and the set of edges is represented by E.

❑ Set of Vertices : {1,2,3,4,5,6,7,8,9}


Graph Representation
❑ The most commonly used representations of a graph are :
1. Adjacency Matrix
2. Adjacency List
Adjacency Matrix Representation
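As a minimal sketch of the two representations, the code below builds an adjacency matrix and an adjacency list for a small undirected graph. The vertices and edges are hypothetical and are not the graph pictured on the slide.

```python
# Hypothetical undirected graph with 4 vertices and 4 edges.
vertices = [1, 2, 3, 4]
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]

# Map each vertex to its row/column position in the matrix.
index = {vertex: position for position, vertex in enumerate(vertices)}

# Adjacency matrix: matrix[i][j] is 1 if an edge joins vertex i and vertex j.
matrix = [[0] * len(vertices) for _ in vertices]

# Adjacency list: each vertex maps to the list of its neighbours.
adjacency_list = {vertex: [] for vertex in vertices}

for u, v in edges:
    # The graph is undirected, so every edge is recorded in both directions.
    matrix[index[u]][index[v]] = 1
    matrix[index[v]][index[u]] = 1
    adjacency_list[u].append(v)
    adjacency_list[v].append(u)

for row in matrix:
    print(row)
print(adjacency_list)
```

For an undirected graph the adjacency matrix is symmetric, and the matrix always needs one cell per pair of vertices, whereas the adjacency list only stores the edges that exist.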
NetworkX

❑ NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.
❑ It provides Python data structures for graphs, digraphs, and multigraphs.
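A short sketch of creating and inspecting a graph with NetworkX is given below. It assumes the networkx package is installed, and the edges are hypothetical.

```python
import networkx as nx

# Build an undirected graph; nodes are created automatically from the edges.
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (2, 3), (3, 4)])

node_count = G.number_of_nodes()
edge_count = G.number_of_edges()
degree_of_3 = G.degree[3]                        # number of edges touching node 3
path = nx.shortest_path(G, source=1, target=4)   # fewest-edge path from 1 to 4

print(node_count, edge_count, degree_of_3)
print(path)
```

For a directed graph, nx.DiGraph() is used in place of nx.Graph().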
