Unit-3
Getting your hands dirty with data
INSTRUCTOR:
Vishal Parikh
Assistant Professor
[email protected]
Outline
▪ Jupyter Notebook
▪ IPython Notebook
▪ IPython help
▪ Magic functions
Jupyter/IPython Notebook
Outline
▪ Working with styles
▪ Multimedia and Graphics Integration
▪ Plots and images
▪ Loading data from online sites
▪ Accessing data in structured flat-file form
Multimedia and Graphics Integration
Data
▪ There are two types of data
1. Structured Data
• Data that is already in a row-and-column format, or that can be
easily converted to rows and columns so that it fits neatly into a
database, is known as structured data.
• E.g.: CSV, TXT, XLS files, etc.
2. Unstructured Data
• Data that has no predefined row-and-column layout (for example, the lines
are not of fixed width) is known as unstructured data.
• E.g.: HTML, image or PDF files, etc.
Text File
▪ Opening a text file
• We use the built-in function open().
• The open() function returns a file object that contains methods and
attributes to perform various operations on the file.
• Syntax:
• file_object = open("filename", "mode")
➢ filename : the name (or path) of the file to open.
➢ mode : a string that specifies how the file is opened, for example for reading or for writing.
Text File
▪ Modes :
• r : Opens a file for reading only. The file pointer is placed at the beginning
of the file. This is the default mode.
• r+ : Opens a file for both reading and writing. The file pointer is placed at
the beginning of the file.
Text File
▪ Reading a text file
• Open the file using open() in r mode.
• If you have to both read and write data, open the file in r+ mode.
• Read data from the file using the read(), readline() or readlines() methods.
1. read(size)
• Returns up to size characters from the file (bytes, if the file was opened in binary mode).
• size : Optional. Default is -1, which means the whole file.
• Syntax : file_object.read(size)
2. readline(size)
• Returns one line from the file.
• size : Optional. Default is -1, which means the whole line.
• Syntax : file_object.readline(size)
3. readlines()
• Returns all the remaining lines of the file as a list of strings.
• Syntax : file_object.readlines()
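The three reading methods can be tried out with a short sketch like the one below. The file name sample.txt and its contents are made up for illustration; the file is written to a temporary folder first so that the example runs on its own.

```python
import os
import tempfile

# Create a small sample file (hypothetical name and contents) so the
# example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\nthird line\n")

# "with" closes the file automatically when the block ends.
with open(path, "r") as file_object:
    first_five = file_object.read(5)        # the next 5 characters
    rest_of_line = file_object.readline()   # the remainder of the current line
    remaining = file_object.readlines()     # all remaining lines, as a list

print(first_five)
print(rest_of_line)
print(remaining)
```

Each call continues from where the previous one stopped, because the file pointer moves forward as data is read.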
CSV File
▪ Reading a CSV file
• CSV : Comma Separated Values
• We use the Pandas library to read CSV files.
• To read CSV files, pandas provides read_csv("filename"), which returns a DataFrame.
• Syntax
• data_frame = pandas.read_csv("filename")
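A minimal sketch of read_csv() is shown below. The column names and values are invented for illustration, and io.StringIO stands in for a file on disk so the example runs without any external file; with a real file you would pass its name instead.

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO behaves like an open text file.
csv_text = "name,age,city\nAsha,23,Ahmedabad\nRavi,31,Surat\n"

# The first line is used as the column header by default.
data_frame = pd.read_csv(io.StringIO(csv_text))

print(data_frame.shape)            # (rows, columns)
print(list(data_frame.columns))    # column names taken from the header line
print(data_frame.head())           # first few rows
```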
Excel File
▪ Reading an Excel file
• We use the Pandas library to read Excel files.
• To read Excel files, pandas provides read_excel("filename"), which returns a DataFrame.
• Syntax
• data_frame = pandas.read_excel("filename")
HTML File
▪ Reading an HTML file
• We use the Pandas library to read tables from HTML files.
• To read HTML tables, pandas provides read_html("filename"), which returns a list of DataFrames, one for each table found.
• Syntax
• list_of_data_frames = pandas.read_html("filename")
Interacting with data from a Relational Database
Outline
▪ Accessing data in structured flat-file form
▪ Kernel
▪ Restoring a checkpoint
CSV File
▪ CSV stands for Comma Separated Values.
▪ A CSV file stores data in a tabular format as plain text: each line is one
row, and the values in a row are separated by commas.
▪ The file extension is .csv
Kernel
▪ Behind every notebook, a kernel is running.
▪ When you run a code cell, that code is executed within the
kernel and any output is returned to the cell to be
displayed.
▪ The kernel's state persists between cells: for example, if you import
libraries or declare variables in one cell, they will be available in another.
▪ There are several options available for kernels:
• Interrupt : Stops the code that is currently running; variables defined so far are kept.
• Restart : Restarts the kernel, thus clearing all the variables etc. that were
defined.
• Restart & Clear Output : Same as Restart, but also wipes the output
displayed below your code cells.
• Restart & Run All : Same as Restart, but also runs all your cells in order from
first to last.
L. J. Institute of Engineering & Technology
Department of Computer Engineering
Outline
▪ Dealing with Missing Data
▪ Finding the Missing Data
▪ Imputing the Missing Data
Dealing with Missing Data
▪ Missing data is always a problem in real-life scenarios.
▪ Areas like machine learning and data mining face severe issues
in the accuracy of their model predictions because of poor
quality of data caused by missing values.
▪ In these areas, missing value treatment is a major point of focus
to make models more accurate and valid.
▪ When and why is data missing?
• Consider an online survey for a product.
• Many times, people do not share all the information
related to them, so some fields are left empty.
Dealing with Missing Data
▪ Missing data can occur when no information is provided for one
or more items or for a whole unit.
▪ Missing data is also referred to as NA (Not Available).
▪ In Pandas, missing data is represented by two values:
• None : None is a Python singleton object that is often used for missing
data in Python code.
• NaN : NaN (an acronym for Not a Number) is a special floating-point
value.
▪ Functions such as isnull(), notnull(), dropna() and fillna() are available for
detecting, removing, and replacing null values in a DataFrame.
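Finding and removing missing values can be sketched as follows. The survey-style data is invented for illustration; isnull() and dropna() are standard pandas DataFrame methods.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with missing entries (both NaN and None).
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Surat", "Rajkot", None, "Vadodara"],
})

missing_mask = df.isnull()               # True wherever a value is missing
missing_per_column = df.isnull().sum()   # number of missing values per column
complete_rows = df.dropna()              # keep only rows with no missing value

print(missing_mask)
print(missing_per_column)
print(complete_rows)
```

Note that pandas treats None and NaN alike here: both are reported as missing by isnull().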
Imputing Missing Data
▪ Imputing refers to using a model to replace missing values.
▪ There are many options we could consider when replacing a
missing value, for example:
• A constant value that has meaning within the domain, such as 0, distinct
from all other values.
• A value from another randomly selected record.
• A mean, median or mode value for the column.
• A value estimated by another predictive model.
▪ Pandas provides the fillna() function for replacing missing values
with a specific value.
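A small sketch of some of the options listed above, using fillna(). The data is invented for illustration, and the choice of mean for the numeric column and mode for the text column is only one reasonable assumption, not a general rule.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "age": [20.0, np.nan, 30.0, 40.0],
    "city": ["Surat", None, "Surat", "Rajkot"],
})

filled = df.copy()
# Numeric column: replace missing values with the column mean
# (mean() skips missing values by default).
filled["age"] = df["age"].fillna(df["age"].mean())
# Text column: replace missing values with the most frequent value (mode).
filled["city"] = df["city"].fillna(df["city"].mode()[0])

# Alternative: a constant value that has meaning in the domain.
constant_filled = df["age"].fillna(0)

print(filled)
print(constant_filled)
```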
Outline
▪ Slicing and dicing
▪ Filtering and selecting data
Slicing and Dicing
▪ In pandas, .loc, .iloc and .ix are three ways to select rows and columns:
by label, by integer position, or (with the deprecated .ix) by a mix of both.
1. .loc[]
▪ .loc provides purely label-based indexing.
▪ When slicing, both the start and the stop bound are included.
▪ Integers are valid labels, but they refer to the label and not the position.
▪ .loc accepts several kinds of indexer:
• A single scalar label
• A list of labels
• A slice object
• A Boolean array
▪ .loc takes two indexers inside square brackets, separated by a comma.
▪ The first one indicates the row and the second one indicates the column.
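The four access methods of .loc can be sketched as below. The marks table and its row labels a to d are invented for illustration.

```python
import pandas as pd

# Hypothetical marks table with string row labels.
df = pd.DataFrame(
    {"maths": [70, 85, 90, 60], "physics": [65, 80, 95, 55]},
    index=["a", "b", "c", "d"],
)

single = df.loc["b", "maths"]                  # a single scalar label
rows_list = df.loc[["a", "c"], ["physics"]]    # a list of labels
sliced = df.loc["b":"d", "maths"]              # a slice: "b" and "d" both included
high = df.loc[df["maths"] > 75, :]             # a Boolean array

print(single)
print(rows_list)
print(sliced)
print(high)
```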
Slicing and Dicing
2. .iloc[]
▪ .iloc provides purely integer-position-based indexing.
▪ Indexes are 0-based.
▪ Integers refer to the position and not the label; when slicing, the stop bound is excluded.
▪ The various access methods are as follows:
• An integer
• A list of integers
• A slice of integers (a range of positions)
▪ .iloc takes two indexers separated by a comma: first the row, then the column.
3. .ix[]
▪ Based on both label and integer position.
▪ A hybrid indexer for selecting and subsetting the object.
▪ Deprecated since pandas 0.20 and removed in pandas 1.0; use .loc or .iloc instead.
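A short sketch of position-based selection follows. The table is invented for illustration, and the row labels are deliberately integers (10, 20, 30, 40) to show that position-based and label-based selection treat integers differently.

```python
import pandas as pd

# Hypothetical marks table whose row labels are integers, not positions.
df = pd.DataFrame(
    {"maths": [70, 85, 90, 60], "physics": [65, 80, 95, 55]},
    index=[10, 20, 30, 40],
)

by_position = df.iloc[0, 1]       # row at position 0, column at position 1
picked = df.iloc[[0, 2], [0]]     # a list of integer positions
sliced = df.iloc[1:3, :]          # positions 1 and 2; the stop bound 3 is excluded
by_label = df.loc[10, "physics"]  # same cell, selected by label instead

print(by_position, by_label)
print(picked)
print(sliced)
```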
Filtering and Selecting Data
Outline
▪ Concatenation and Transformation
▪ Adding new cases and variables
▪ Removing Data
▪ Sorting Data
▪ Aggregating Data
Concatenation and Transformation
Adding new cases and variables
Removing Data
Sorting Data
Aggregating Data
Outline
▪ Regular Expression
Regular Expression (RE)
▪ A regular expression is a string that contains special symbols and
characters used to find and extract the information we need from the
given data.
▪ Often we need to extract specific information from given data, for example
all the numbers in a block of text.
Regular Expression (RE)
▪ Python provides the re module for working with regular expressions.
▪ A regular expression is also simply called a regex.
▪ This module contains functions like
1. compile()
2. search()
3. match()
4. findall()
5. split()
etc., which are used to find information in the available data.
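The five functions listed above can be tried on a single string, as in the sketch below. The sample text is invented for illustration; the pattern \d+ means one or more digits.

```python
import re

# Hypothetical sample text.
text = "Order 12 shipped on 2024-01-15, order 7 pending"

pattern = re.compile(r"\d+")          # compile(): build a reusable pattern object
first = pattern.search(text)          # search(): first match anywhere in the string
at_start = re.match(r"Order", text)   # match(): matches only at the start of the string
no_match = re.match(r"\d+", text)     # None, because the text does not start with a digit
all_numbers = pattern.findall(text)   # findall(): list of every match
parts = re.split(r",\s*", text)       # split(): break the string at each match

print(first.group())
print(at_start.group())
print(no_match)
print(all_numbers)
print(parts)
```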
Sequence Character in RE
▪ Some special sequences beginning with "\" represent predefined sets of
characters that are often useful, such as the set of digits, the set of letters, or the
set of anything that isn't whitespace.
Symbol   Description
\d       Matches any digit. [0-9]
\D       Matches any non-digit character. [^0-9]
\s       Matches any whitespace character. [ \t\n\r\f\v]
\S       Matches any non-whitespace character. [^ \t\n\r\f\v]
\w       Matches any alphanumeric character or underscore. [a-zA-Z0-9_]
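The sequences in the table can be checked with findall(), as in this sketch. The sample string is invented for illustration and contains a space, a tab, a comma and an underscore on purpose.

```python
import re

# Hypothetical sample: "Room 42," then a tab character, then "Block_B".
sample = "Room 42,\tBlock_B"

digits = re.findall(r"\d", sample)        # each single digit
words = re.findall(r"\w+", sample)        # runs of letters, digits or underscore
whitespace = re.findall(r"\s", sample)    # each whitespace character
non_space = re.findall(r"\S+", sample)    # runs of non-whitespace characters

print(digits)
print(words)
print(whitespace)
print(non_space)
```

The comma is kept by \S+ but dropped by \w+, because a comma is neither whitespace nor a word character.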
Special Character and Pattern matching
Character   Meaning                                         Example
*           Zero or more occurrences of a character         ab*c matches ac, abc, abbc, and so on
+           One or more occurrences of a character          ab+c matches abc, abbc, and so on
?           Zero or one occurrence of a character           ab?c matches ac and abc
.           Any character (except a newline, by default)    a.*c matches any substring starting with a and ending with c
[chars]     Any character inside the brackets               a[bB]c matches abc and aBc
Special Character and Pattern matching
Character       Meaning                                      Example
[char1-char2]   A range of characters                        a[a-z]c matches a, followed by any lowercase letter, followed by c
[^chars]        Any character not inside the brackets        a[^bB]c matches a, followed by anything but b or B, followed by c
{num}           An exact number of occurrences of a          ab{3}c matches abbbc
                character
Special Character and Pattern matching
Character     Meaning                                        Example
{num1,num2}   A number of occurrences of a character in      ab{1,3}c matches abc, abbc and abbbc
              a specified range
|             Matches either of two alternatives             abc|aBc matches abc or aBc
^             Matches the start of the string only           ^abc matches abc in abcd, but does not match abc in dabc
$             Matches the end of the string only             abc$ matches abc in dabc, but does not match abc in abcd
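The examples in the tables above can be verified with a short sketch. The helper name matches_fully is made up for this illustration; it uses re.fullmatch() so that the whole string must fit the pattern, while ^ and $ are checked with re.search().

```python
import re


def matches_fully(pattern, text):
    """Return True if the whole of text fits the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None


# Repetition: *, +, ?, {num}, {num1,num2}
star = [matches_fully(r"ab*c", s) for s in ["ac", "abc", "abbc"]]
plus = [matches_fully(r"ab+c", s) for s in ["ac", "abc", "abbc"]]
optional = [matches_fully(r"ab?c", s) for s in ["ac", "abc", "abbc"]]
exact = matches_fully(r"ab{3}c", "abbbc")
ranged = [matches_fully(r"ab{1,3}c", s) for s in ["abc", "abbbc", "abbbbc"]]

# Character sets and alternatives: [chars], [^chars], [char1-char2], |
in_set = matches_fully(r"a[bB]c", "aBc")
not_in_set = [matches_fully(r"a[^bB]c", s) for s in ["abc", "axc"]]
in_range = [matches_fully(r"a[a-z]c", s) for s in ["akc", "aKc"]]
either = [matches_fully(r"abc|aBc", s) for s in ["abc", "aBc", "aXc"]]

# Anchors: ^ (start of string) and $ (end of string)
starts = [re.search(r"^abc", s) is not None for s in ["abcd", "dabc"]]
ends = [re.search(r"abc$", s) is not None for s in ["dabc", "abcd"]]

print(star, plus, optional, exact, ranged)
print(in_set, not_in_set, in_range, either)
print(starts, ends)
```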
Accessing data from Database
2. Adjacency List
Adjacency Matrix Representation
NetworkX