
VIRTUAL INTERNSHIP REPORT

ON
DATA SCIENCE, MACHINE LEARNING, AI
A Report submitted in partial fulfilment of the requirements for the Award of Degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
BY

NEELAM SASI PRIYA (20B81A05C0)


Under the Esteemed Supervision of

MR. PAVAN CHALAMALASETTI


DATA VALLEY IT SOLUTIONS Vijayawada
(Duration: 12th February 2024 to 16th April 2024)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

SIR C R REDDY COLLEGE OF ENGINEERING

ACCREDITED BY NBA

Approved by AICTE, permanently affiliated to JNTUK, Kakinada

ELURU, ANDHRA PRADESH

2023-24
SIR C R REDDY COLLEGE OF ENGINEERING

ACCREDITED BY NBA

Approved by AICTE, permanently affiliated to JNTUK, Kakinada

ELURU, ANDHRA PRADESH

2023-24

CERTIFICATE

This is to certify that the Internship report entitled “DATA SCIENCE, MACHINE LEARNING, AI” submitted by NEELAM SASI PRIYA (20B81A05C0) in partial fulfillment for
the award of degree of BACHELOR OF TECHNOLOGY in COMPUTER SCIENCE AND
ENGINEERING, at SIR C R REDDY COLLEGE OF ENGINEERING, ELURU affiliated to
Jawaharlal Nehru Technological University, Kakinada during the academic year 2023-2024.

Internship Coordinator: Dr. B. Madhava Rao, M.Tech, Ph.D, Associate Professor, CSE

Internship Guide: Mr. V. Pranav, M.Tech, Assistant Professor, CSE

Head of the Department: Dr. A. Yesu Babu, M.Tech, Ph.D, Professor & HOD, CSE

External Examiner
INTERNSHIP ACCEPTANCE LETTER
INTERNSHIP CERTIFICATE
ACKNOWLEDGEMENT

First, I would like to thank Mr. Pavan Chalamalasetti, CEO of Data Valley Pvt. Ltd., Vijayawada, for giving me the opportunity to do an internship within the organization.

I would also like to thank all the people who worked along with me at Data Valley Pvt. Ltd., Vijayawada; with their patience and openness they created an enjoyable working environment.

It is indeed with a great sense of pleasure and immense sense of gratitude that I acknowledge
the help of these individuals.

I am highly indebted to the management and principal, Dr. K. Venkateswar Rao, for the facilities provided to accomplish this internship.

I would like to thank my Head of the Department DR. A. YESU BABU, for his constructive
criticism throughout my internship.

I would like to thank DR.B. MADHAVA RAO, Internship coordinator, Department of CSE
for his support to complete my internship.

I would like to thank MR. V. PRANAV, Internship guide, Department of CSE for his support
and advice to get and complete internship in DATA VALLEY.

I am extremely grateful to my department staff members and friends who helped me in the successful completion of this internship.

NEELAM SASI PRIYA


(20B81A05C0)
ABSTRACT

Data Science:

Data Science involves the collection, processing, and analysis of large volumes of data to extract
meaningful insights and patterns. It combines techniques from statistics, mathematics, computer
science, and domain knowledge to uncover valuable information that can be used for decision-
making and problem-solving. Data scientists use various tools and algorithms to clean and
preprocess data, perform exploratory data analysis, build predictive models, and communicate
findings to stakeholders.

Machine Learning (ML):

Machine Learning is a subset of Artificial Intelligence that focuses on developing algorithms and
statistical models that enable computers to learn from and make predictions or decisions based on
data, without being explicitly programmed. ML algorithms are categorized into supervised
learning (where models learn from labelled data), unsupervised learning (for finding patterns in
unlabeled data), and reinforcement learning (where agents learn to make decisions through trial
and error based on rewards). Common ML techniques include regression, classification,
clustering, and deep learning.

Artificial Intelligence (AI):

Artificial Intelligence encompasses a broader range of technologies and methods that enable
machines to simulate human intelligence. This includes not only machine learning but also areas
such as natural language processing (NLP), computer vision, robotics, expert systems, and
knowledge representation. AI systems aim to perform tasks that typically require human cognitive
abilities, such as understanding language, recognizing objects in images, making decisions, and
solving complex problems.
ORGANIZATION INFORMATION

Organization Information:

Datavalley.ai is a leading provider of top-notch training and consulting services in the cutting-edge fields of Big Data, Data Engineering, Data Architecture, DevOps, Data Science, Machine Learning, IoT, and Cloud Technologies.

Training:

Data Valley training programs, led by industry experts, are tailored to equip professionals and
organizations with the essential skills and knowledge needed to thrive in the rapidly evolving
data landscape. We believe in continuous learning and growth, and our commitment to staying
on top of emerging trends and technologies ensures that our clients receive the most cutting-edge
training possible.
WEEKLY OVERVIEW OF INTERNSHIP ACTIVITIES

WEEK        DATE        NAME OF THE TOPIC/MODULE COMPLETED

1st WEEK    14/02/24    Introduction to Data Science
            15/02/24    Data Science Overview
            16/02/24    Introduction to Python
            17/02/24    Python Data types, operators
            20/02/24    Python lists, Dictionaries
            21/02/24    Standard Libraries in Python

2nd WEEK    22/02/24    Data Frames, Indexing Data Frames
            23/02/24    Reading a csv file, basic operations
            24/02/24    Statistics for Data Science introduction
            26/02/24    Measures of Central Tendency
            27/02/24    Understanding spread of data
            28/02/24    Examples on statistics

3rd WEEK    1/03/24     Predictive Modeling
            2/03/24     Introduction to Predictive Modeling
            4/03/24     Understanding the types of Predictive Models
            5/03/24     Stages of Predictive Models
            6/03/24     Hypothesis Generation
            8/03/24     Data Extraction

4th WEEK    9/03/24     Data Exploration
            11/03/24    Reading Data into Python
            12/03/24    Variable identification
            14/03/24    Linear Regression
            15/03/24    Logistic, Decision Trees, K-means
            18/03/24    Project introduction

5th WEEK    20/03/24    Design & Analysis
            22/03/24    Coding
                        Project, quizzes
Contents

S.NO List of Contents Page. No

Module 1 Introduction to Data Science 1


1.1 Data Science Overview 1
Module 2 Python for Data Science 2
2.1 Introduction to Python 2
2.2 Lists 2
2.3 Dictionaries 3
2.4 Understanding Standard Libraries in Python 4
2.5 Reading a CSV File in Python 5
2.6 Data Frames and basic operations with Data Frames 5
2.7 Indexing Data Frame 6
Module 3 Understanding the Statistics for Data Science 8
3.1 Introduction to Statistics 8
3.2 Measures of Central Tendency 9
3.3 Understanding the spread of data 10
Module 4 Predictive Modeling and Basics of Machine Learning 11
4.1 Introduction to Predictive Modeling 11
4.2 Understanding the types of Predictive Models 11
4.3 Stages of Predictive Models 11
4.4 Hypothesis Generation 12
4.5 Data Extraction 13
4.6 Data Exploration 13
4.7 Reading the data into Python 13
4.8 Variable Identification 14
4.9 Linear Regression 15
4.10 Logistic Regression 15
4.11 Decision Trees 16
4.12 K-Means 17

Module 5 Video Screenshots 25


Module 6 Quiz Screenshots 27
Module 7 Internship Project 28
Module 8 List of References 38
Module 9 Conclusion 39

List of Figures

S.no Figure no Figure Names Page No

1 2.6 Data Frames 6

2 3.1 Types of Statistics 7

3 4.12.1 K means 15

4 4.12.2 Elbow method 16

5 4.12.3 Elbow method 2 16

6 4.12.4 Silhouette method 17

7 4.12.5 K means plot Difference 18


Module-1: Introduction to Data Science

1.1. Data Science Overview

Data science is the study of data, just as the biological sciences study biology and the physical sciences study physical phenomena. Data is real, data has real properties, and we need to study those properties if we are going to work with data. Data science involves data and some science; it is a process, not an event. It is the process of using data to understand many different things, to understand the world.
Suppose you have a model or proposed explanation of a problem, and you try to validate that proposed explanation or model with your data. Data science is the skill of unfolding the insights and trends that are hiding (or abstract) behind data. It is when you translate data into a story, using storytelling to generate insight. And with these insights, you can make strategic choices for a company or an institution.

Predictive modeling:
Predictive modeling is a form of artificial intelligence that uses data mining and probability to forecast or estimate more granular, specific outcomes. For example, predictive modeling could help identify customers who are likely to purchase our new One AI software over the next 90 days.
Machine Learning:
Machine learning is a branch of artificial intelligence (AI) where computers learn to act and adapt to new data without being explicitly programmed to do so. The computer is able to act independently of human interaction.
Forecasting:
Forecasting is the process of predicting or estimating future events based on past and present data, most commonly by analysis of trends. "Guessing" doesn't cut it. A forecast, unlike a prediction, must have logic to it; it must be defendable. This logic is what differentiates it from the magic 8-ball's lucky guess. After all, even a broken watch is right twice a day.

Module-2: Python for Data Science

2.1 Introduction to Python
Python is a high-level, general-purpose, and very popular programming language. It is used in web development, machine learning applications, and many other cutting-edge areas of the software industry.
Python is well suited for beginners as well as for experienced programmers coming from other languages such as C++ and Java.

2.2 Lists
Lists in Python are the most versatile data structure. They are used to store heterogeneous data
items, from integers to strings or even another list! They are also mutable, which means that their
elements can be changed even after the list is created.
Creating Lists
Lists are created by enclosing elements within [square] brackets, with each item separated by a comma.

Creating lists in Python


Since each element in a list has its own distinct position, having duplicate values in a list is not a
problem.

Accessing List elements


To access elements of a list, we use Indexing. Each element in a list has an index related to it
depending on its position in the list. The first element of the list has the index 0, the next element
has index 1, and so on. The last element of the list has an index of one less than the length of the
list.

Indexing in Python lists


While positive indexes return elements from the start of the list, negative indexes return values from the end of the list. This saves us the trivial calculation we would otherwise have to perform to return the nth element from the end of the list. So instead of writing list_name[len(list_name) - 1], we can simply write list_name[-1].

Using negative indexes, we can return the nth element from the end of the list easily. If we want to return the first element from the end (the last element), the associated index is -1. Similarly, the index for the second last element is -2, and so on. Remember, the 0th index still refers to the very first element in the list.
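A small illustration of list creation and indexing; the list name and values below are made up for this sketch.

marks = [67, 88, 88, "absent", 92.5]   # heterogeneous items; duplicate values are allowed

print(marks[0])     # 67 -> first element (index 0)
print(marks[3])     # 'absent'
print(marks[-1])    # 92.5 -> last element via negative indexing
print(marks[-2])    # 'absent' -> second last element
print(len(marks))   # 5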

Appending values in Lists


We can add new elements to an existing list using the append() or insert() methods.

• append() – adds an element to the end of the list.
• insert() – adds an element at a specific position in the list, which needs to be specified along with the value.

Removing elements from Lists

Removing elements from a list is as easy as adding them and can be done using the remove() or pop() methods:

• remove() – removes the first occurrence from the list that matches the given value.
• pop() – used when we want to remove an element at a specified index from the list. If we don't provide an index value, the last element is removed from the list.
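A short sketch of these four methods in action; the fruit names are illustrative only.

fruits = ["apple", "banana", "cherry"]

fruits.append("mango")        # adds to the end
fruits.insert(1, "orange")    # adds at index 1
print(fruits)                 # ['apple', 'orange', 'banana', 'cherry', 'mango']

fruits.remove("banana")       # removes the first occurrence of the value
last = fruits.pop()           # removes and returns the last element ('mango')
first = fruits.pop(0)         # removes and returns the element at index 0 ('apple')
print(fruits)                 # ['orange', 'cherry']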

Sorting Lists
On comparing two strings, we just compare the integer values of each character from the
beginning. If we encounter the same characters in both the strings, we just compare the next
character until we find two differing characters.
Concatenating Lists
We can even concatenate two or more lists by simply using the + symbol. This will return a new
list containing elements from both the lists:

List comprehensions
A very interesting application of lists is list comprehension, which provides a neat way of creating new lists. These new lists are created by applying an operation on each element of an existing list.
It is easy to see their impact if we first check how the same thing is done using the good old for-loop.
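A minimal sketch comparing the two approaches; the numbers are made up for illustration.

numbers = [1, 2, 3, 4, 5]

# the good old for-loop way
squares = []
for n in numbers:
    squares.append(n ** 2)

# the same result with a list comprehension
squares = [n ** 2 for n in numbers]
print(squares)   # [1, 4, 9, 16, 25]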

Stacks & Queues using Lists

A list is an in-built data structure in Python, but we can use it to create user-defined data structures. Two very popular user-defined data structures built using lists are stacks and queues.

Stacks are a list of elements in which the addition or deletion of elements is done from the end of the list. Think of it as a stack of books: whenever you need to add or remove a book from the stack, you do it from the top. It uses the simple concept of Last-In-First-Out.
Queues, on the other hand, are a list of elements in which the addition of elements takes place at the end of the list, but the deletion of elements takes place from the front of the list. You can think of it as a queue in the real world: the queue becomes shorter when people at the front exit, and longer when someone new joins at the end. It uses the concept of First-In-First-Out.
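A hedged sketch of both structures built on plain lists; the item names are illustrative.

# Stack: Last-In-First-Out, built on a list
stack = []
stack.append("book 1")    # push onto the top
stack.append("book 2")
top = stack.pop()         # pop from the top -> 'book 2'

# Queue: First-In-First-Out, built on a list
queue = []
queue.append("person 1")  # join at the end
queue.append("person 2")
front = queue.pop(0)      # leave from the front -> 'person 1'

print(top, front)

For long queues, collections.deque is usually preferred in practice, since pop(0) on a plain list is slow.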

2.3 Dictionaries
A dictionary is another Python data structure, used to store heterogeneous data as key-value pairs. Keys must be unique and immutable (for example, strings or numbers), while values can be of any type and are accessed through their keys rather than by position.

Generating Dictionary
Dictionaries are generated by writing keys and values within {curly} brackets, with each key separated from its value by a colon, and each key-value pair separated by a comma:

Using the key of the item, we can easily extract the associated value of the item:

Dictionaries are very useful to access items quickly because, unlike lists and tuples, a dictionary
does not have to iterate over all the items finding a value. Dictionary uses the item key to quickly
find the item value. This concept is called hashing.

Accessing keys and values


You can access the keys from a dictionary using the keys() method and the values using the values()
method. These we can view using a for-loop or turn them into a list using list():

We can even access these simultaneously using the items() method, which returns the respective key and value pair for each element of the dictionary.
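A compact sketch of dictionary creation and the keys(), values(), and items() methods; the student record is made up for illustration.

student = {"name": "Sasi", "branch": "CSE", "year": 4}   # key: value pairs

print(student["branch"])            # fast lookup by key -> 'CSE'

print(list(student.keys()))         # ['name', 'branch', 'year']
print(list(student.values()))       # ['Sasi', 'CSE', 4]

for key, value in student.items():  # key-value pairs together
    print(key, "->", value)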

2.4 Understanding Standard Libraries in Python

PANDAS
When it comes to data manipulation and analysis, nothing beats Pandas. It is the most
popular Python library, period. Pandas is written in the Python language especially for
manipulation and analysis tasks.
Pandas provides features like:
• Dataset joining and merging.
• Data Structure column deletion and insertion
• Data filtration
• Reshaping datasets
• Data Frame objects to manipulate data, and much more!

NUMPY
NumPy, like Pandas, is an incredibly popular Python library. NumPy brings in functions to
support large multi-dimensional arrays and matrices. It also brings in high-level mathematical
functions to work with these arrays and matrices. NumPy is an open-source library and has
multiple contributors.

MATPLOTLIB
Matplotlib is the most popular data visualization library in Python. It allows us to generate
and build plots of all kinds. This is my go-to library for exploring data visually along with
Seaborn.

2.5 Reading a CSV File in Python


A CSV (Comma Separated Values) file is a form of plain text document which uses a particular format to organize tabular information. The CSV format is a bounded text format that uses a comma to separate values. Every row in the document is a data record, and each record is composed of one or more fields divided by commas. It is the most popular file format for importing and exporting spreadsheets and databases.
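A minimal sketch of reading a CSV file with pandas; the file name "students.csv" is an assumption for illustration, not a file from the internship.

import pandas as pd

df = pd.read_csv("students.csv")   # assumed file in the working directory

print(df.head())       # first five rows
print(df.shape)        # (number of rows, number of columns)
print(df.describe())   # summary statistics of the numeric columns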

2.6 Data Frames and basic operations with Data Frames

A Pandas Data Frame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). In other words, data is aligned in a tabular fashion in rows and columns. A Pandas Data Frame consists of three principal components: the data, the rows, and the columns.

Fig 2.6 Data Frame
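As a small illustration of this structure, the sketch below builds a Data Frame from a dictionary of columns; the names and marks are made-up values.

import pandas as pd

data = {"name": ["Asha", "Ravi", "Kiran"],     # illustrative column data
        "marks": [78, 85, 62]}
df = pd.DataFrame(data)

print(df)                    # the data, the row labels (index), and the column labels
print(df["marks"].mean())    # a basic column operation -> 75.0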

2.7 Indexing Data Frame

The df.iloc indexer allows us to retrieve rows and columns by position. To do that, we need to specify the positions of the rows that we want, and the positions of the columns that we want as well. df.iloc is very similar to df.loc, but it only uses integer locations to make its selections, whereas df.loc selects by label.
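A short sketch contrasting iloc and loc; it reuses the small, made-up Data Frame from the previous example.

import pandas as pd

df = pd.DataFrame({"name": ["Asha", "Ravi", "Kiran"],
                   "marks": [78, 85, 62]})

print(df.iloc[0])         # first row, selected by integer position
print(df.iloc[0:2, 1])    # first two rows of the second column (marks)
print(df.loc[0, "name"])  # .loc selects by label instead -> 'Asha'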

Module-3: Understanding the Statistics for Data Science

3.1 Introduction to Statistics

Statistics simply means numerical data and is the field of mathematics that generally deals with the collection, tabulation, and interpretation of numerical data. It is a form of mathematical analysis that uses different quantitative models on experimental data or studies of real life. It is an area of applied mathematics concerned with data collection, analysis, interpretation, and presentation. Statistics deals with how data can be used to solve complex problems. Some people consider statistics to be a distinct mathematical science rather than a branch of mathematics. Statistics makes work easy and simple and provides a clear and clean picture of the work you do on a regular basis.

Basic terminology of Statistics:

• Population – A collection or set of individuals, objects, or events whose properties are to be analyzed.
• Sample – A subset of the population.

Types of Statistics:

3.2 Measures of Central Tendency

(i) Mean:
The mean is the measure of the average of all values in a sample set.

(ii) Median:
The median is the measure of the central value of a sample set. The data set is ordered from lowest to highest value and the exact middle value is then found.

(iii) Mode:
The mode is the value that occurs most frequently in the sample set; the value repeated most often is the mode.
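A small worked sketch of the three measures using Python's standard statistics module; the data values are made up for illustration.

import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

print(statistics.mean(data))     # 5   -> average of all values
print(statistics.median(data))   # 4.5 -> middle of the ordered data
print(statistics.mode(data))     # 4   -> most frequent value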

3.3 Understanding the spread of data

Measure of Variability is also known as measure of dispersion and used to describe variability in a
sample or population. In statistics, there are three common measures of variability as shown below:

(i) Range:

The range is a measure of how spread apart the values in a sample or data set are.

Range = Maximum value - Minimum value

(ii) Variance:

The variance describes how much a random variable differs from its expected value; it is computed as the average of the squared deviations of the individual data points from the mean:

S² = (1/n) · Σᵢ₌₁ⁿ (xᵢ - x̄)²

(iii) Standard Deviation:

The standard deviation is a measure of the dispersion of a set of data from its mean; it is the square root of the variance:

σ = √[ (1/n) · Σᵢ₌₁ⁿ (xᵢ - μ)² ]
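A small sketch computing all three measures of spread with the standard statistics module; the data values are the same made-up ones as in the central tendency example.

import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)      # Range = maximum value - minimum value
variance = statistics.pvariance(data)   # population variance: mean of squared deviations
std_dev = statistics.pstdev(data)       # population standard deviation: square root of the variance

print(data_range, variance, std_dev)    # range = 7, variance = 4, standard deviation = 2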

Module-4: Predictive Modeling and Basics of Machine Learning

4.1 Introduction to Predictive Modeling

Predictive analytics involves certain manipulations on data from existing data sets with the goal
of identifying some new trends and patterns. These trends and patterns are then used to predict
future outcomes and trends. By performing predictive analysis, we can predict future trends and
performance. It is also defined as the prognostic analysis; the word prognostic means prediction.
Predictive analytics uses the data, statistical algorithms and machine learning techniques to
identify the probability of future outcomes based on historical data.
4.2 Understanding the types of Predictive Models

Supervised learning
Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher. Basically, supervised learning is learning in which we teach or train the machine using data that is well labelled, meaning some data is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data) so that the supervised learning algorithm analyses the training data (the set of training examples) and produces a correct outcome from the labelled data.

Unsupervised learning
Unsupervised learning is the training of a machine using information that is neither classified nor labelled, allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior training on the data.

4.3 Stages of Predictive Models

Steps To Perform Predictive Analysis:

Some basic steps should be performed in order to perform predictive analysis.

1. Define Problem Statement:

Define the project outcomes, the scope of the effort, and the objectives, and identify the data sets that are going to be used.
2. Data Collection:
Data collection involves gathering the necessary details required for the analysis.
3. Data Cleaning:
Data cleaning is the process in which we refine our data sets. In the process of data cleaning, we remove unnecessary and erroneous data; this involves removing redundant and duplicate data from our data sets.
4. Data Analysis:
It involves the exploration of data. We explore the data and analyze it thoroughly in order to
identify some patterns or new outcomes from the data set. In this stage, we discover useful
information and conclude by identifying some patterns or trends.
5. Build Predictive Model:
In this stage of predictive analysis, we use various algorithms to build predictive models based on the patterns observed. It requires knowledge of Python, R, statistics, MATLAB, and so on. We also test our hypotheses using standard statistical models.
6. Validation:
It is a very important step in predictive analysis. In this step, we check the efficiency of our
model by performing various tests. Here we provide sample input sets to check the validity of
our model. The model needs to be evaluated for its accuracy in this stage.
7. Deployment:
In deployment, we make the model work in a real environment so that it supports everyday decision-making and is available for use.
8. Model Monitoring:
Regularly monitor your models to check performance and ensure that we have proper results. This means seeing how model predictions are performing against actual data sets.
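A minimal, hedged sketch mapping the stages above onto scikit-learn; the file name "historical_data.csv" and the "target" column are assumptions for illustration, and the features are assumed to be numeric.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("historical_data.csv")          # 2. data collection (assumed file)
df = df.dropna().drop_duplicates()               # 3. data cleaning
print(df.describe())                             # 4. data analysis / exploration

X = df.drop(columns=["target"])                  # predictor variables (assumed column names)
y = df["target"]                                 # outcome to be predicted
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)        # 5. build predictive model
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))   # 6. validation on held-out data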

4.4 Hypothesis Generation


A hypothesis is a function that best describes the target in supervised machine learning. The hypothesis that an algorithm comes up with depends upon the data and also upon the restrictions and bias that we have imposed on the data. To better understand the hypothesis space and hypothesis, consider a coordinate plot that shows the distribution of some data.

4.5 Data Extraction

In general terms, “mining” is the process of extracting some valuable material from the earth, e.g. coal mining, diamond mining, etc. In the context of computer science, “Data Mining” refers to the extraction of useful information from a bulk of data or data warehouses. One can see that the term itself is a little confusing. In the case of coal or diamond mining, the result of the extraction process is coal or diamond. But in the case of Data Mining, the result of the extraction process is not data! Instead, the result of data mining is the patterns and knowledge that we gain at the end of the extraction process. In that sense, Data Mining is also known as Knowledge Discovery or Knowledge Extraction.

Data Mining as a whole process


The whole process of Data Mining comprises three main phases:

➢ Data Pre-processing – Data cleaning, integration, selection and transformation takes place.
➢ Data Extraction – Occurrence of exact data mining
➢ Data Evaluation and Presentation – Analyzing and presenting results.

4.6 Data Exploration

Steps of Data Exploration and Preparation

Remember, the quality of your inputs decides the quality of your output. So, once you have your business hypothesis ready, it makes sense to spend a lot of time and effort here. By my personal estimate, data exploration, cleaning, and preparation can take up to 70% of your total project time.

Below are the steps involved to understand, clean, and prepare your data for building your predictive model:

• Variable Identification
• Univariate Analysis
• Bi-variate Analysis
• Missing values treatment
• Outlier treatment
• Variable transformation
• Variable creation

Finally, we will need to iterate over steps 4-7 multiple times before we come up with our refined model.

4.7 Reading the data into Python

Python provides built-in functions for creating, writing, and reading files. There are two types of files that can be handled in Python: normal text files and binary files (written in binary language, 0s and 1s).

• Text files: In this type of file, each line of text is terminated with a special character called EOL (End of Line), which is the newline character ('\n') in Python by default.
• Binary files: In this type of file, there is no terminator for a line, and the data is stored after converting it into machine-understandable binary language.

Access modes govern the type of operations possible in the opened file. They refer to how the file will be used once it is opened. These modes also define the location of the file handle in the file. The file handle is like a cursor, which defines from where the data has to be read or written in the file. Different access modes for reading a file are:

Read Only ('r'): Open a text file for reading. The handle is positioned at the beginning of the file. If the file does not exist, an I/O error is raised. This is also the default mode in which a file is opened.
Read and Write ('r+'): Open the file for reading and writing. The handle is positioned at the beginning of the file. Raises an I/O error if the file does not exist.
Append and Read ('a+'): Open the file for reading and writing. The file is created if it does not exist. The handle is positioned at the end of the file, and the data being written is inserted at the end, after the existing data.
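A small sketch of these access modes in use; the file name "notes.txt" is an assumption for illustration.

with open("notes.txt", "w") as f:     # 'w' creates or overwrites the file for writing
    f.write("first line\nsecond line\n")

with open("notes.txt", "r") as f:     # 'r' (the default) opens for reading from the start
    for line in f:                    # each line ends with the EOL character '\n'
        print(line.strip())

with open("notes.txt", "a+") as f:    # 'a+' opens for reading and appending at the end
    f.write("third line\n")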

4.8 Variable Identification

First, identify Predictor (Input) and Target (output) variables. Next, identify the data type and
category of the variables.
Basics of Model Building
Lifecycle of Model Building:
• Define success
• Explore data
• Condition data
• Select variables
• Balance data
• Build models
• Validate
• Deploy
• Maintain

Data exploration is used to figure out the gist of the data and to develop a first-pass assessment of its quality, quantity, and characteristics. Visualization techniques can also be applied. However, this can be a difficult task in high-dimensional spaces with many input variables. In the conditioning of data, we group the functional data to which the modeling techniques are applied, after which rescaling is done; in some cases rescaling is an issue if variables are coupled. Variable selection is very important for developing a quality model.
This process is implicitly model-dependent, since it is used to configure which combination of variables should be used in ongoing model development. Data balancing partitions the data into appropriate subsets for training, test, and validation. Model building focuses on the desired algorithms; the most famous technique is symbolic regression, but other techniques can also be preferred.

4.9 Linear Regression

Linear regression is a machine learning algorithm based on supervised learning that performs a regression task. Regression models a target prediction value based on independent variables. It is mostly used for finding out the relationship between variables and for forecasting. Different regression models differ based on the kind of relationship between the dependent and independent variables they consider, and on the number of independent variables being used.
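A minimal sketch of fitting a linear regression with scikit-learn; the data values are made up for illustration (e.g. years of experience versus salary).

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # independent variable (illustrative)
y = np.array([30, 35, 42, 48, 55])        # dependent variable (illustrative)

model = LinearRegression()
model.fit(X, y)

print(model.coef_, model.intercept_)      # slope and intercept of the fitted line
print(model.predict([[6]]))               # forecast for an unseen input value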

4.10 Logistic Regression

Logistic regression is basically a supervised classification algorithm. In a classification problem, the target variable (or output), y, can take only discrete values for a given set of features (or inputs), X.

Any change in the coefficient leads to a change in both the direction and the steepness of the logistic function: positive slopes result in an S-shaped curve and negative slopes result in a Z-shaped curve.
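A short sketch of logistic regression for a binary outcome; the hours-studied data below is made up for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]])   # e.g. hours studied (illustrative)
y = np.array([0, 0, 0, 1, 1, 1])               # discrete target: fail (0) / pass (1)

clf = LogisticRegression()
clf.fit(X, y)

print(clf.predict([[2.5], [4.5]]))     # predicted class labels
print(clf.predict_proba([[4.5]]))      # probabilities behind the S-shaped curve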

4.11 Decision Trees

A decision tree is one of the most powerful and popular tools for classification and prediction. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.

Decision Tree Representation:


Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, and then moving down the tree branch corresponding to the value of the attribute. This process is then repeated for the subtree rooted at the new node.

Strengths and Weakness of Decision Tree approach:

The strengths of decision tree methods are:

• Decision trees are able to generate understandable rules.


• Decision trees perform classification without requiring much computation.
• Decision trees are able to handle both continuous and categorical variables.
• Decision trees provide a clear indication of which fields are most important for prediction or
classification.
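A hedged sketch of training a decision tree classifier and printing its rules; it uses scikit-learn's built-in iris dataset purely for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)   # each internal node tests one attribute
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))   # classification accuracy on unseen data
print(export_text(tree))            # the understandable if/else rules mentioned above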

4.12 K-Means

K-means clustering tries to group similar kinds of items in the form of clusters. It finds the similarity between the items and groups them into clusters. The k-means clustering algorithm works in three steps:

➢ Select the k value.
➢ Initialize the centroids.
➢ Select the group and find the average.

Let us understand the above steps with the help of the figures, because a good picture is better than a thousand words.

fig 4.12.1 K-means

We will understand each figure one by one.

• Figure 1 shows the representation of data for two different items: the first item is shown in blue and the second item is shown in red. Here I am choosing the value of k randomly as 2; there are different methods by which we can choose the right k value.
• In Figure 2, we join the two selected points and draw a perpendicular line to that line to find the centroids. The points then move to their nearest centroid. If you look closely, you will see that some of the red points have moved to the blue points and now belong to the group of blue items.
• The same process continues in Figure 3: we join the two points, draw a perpendicular line, and find the new centroids. The points again move to their centroids, and some more red points get converted to blue points.
• The same process happens in Figure 4. This process continues until we get two completely separate clusters of these groups.

How to choose the value of K?

One of the most challenging tasks in this clustering algorithm is to choose the right values of k.
What should be the right k-value? How to choose the k-value? Let us find the answer to these
questions. If you are choosing the k values randomly, it might be correct or may be wrong. If you
will choose the wrong value, then it will directly affect your model performance. So there are two
methods by which you can select the right value of k.

1. Elbow Method
2. Silhouette Method

Now, Let’s understand the concept one by one in detail.

Elbow Method:
The elbow method is one of the most famous methods by which you can select the right value of k and boost your model performance. We also perform hyperparameter tuning to choose the best value of k. Let us see how this elbow method works.
It is an empirical method to find the best value of k: it picks a range of candidate values and takes the best among them. For each value of k, it calculates the within-cluster sum of the squared distances of the points from their cluster centre.

When the value of k is 1, the within-cluster sum of squares is high. As the value of k increases, the within-cluster sum of squares decreases.

Fig 4.12.2 Elbow method

Finally, we plot a graph between the k values and the within-cluster sum of squares to get the best k value. We examine the graph carefully: at some point, the graph drops abruptly, and that point is taken as the value of k.

Fig 4.12.3 Elbow method
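A minimal sketch of the elbow method; the two blobs of points are synthetic data generated only for illustration.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, (50, 2)),    # two illustrative blobs of points
               rng.normal(5, 1, (50, 2))])

# within-cluster sum of squares (inertia) for a range of k values
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, km.inertia_)   # the k where the drop flattens out is the "elbow"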

Silhouette Method

The silhouette method is somewhat different.

Like the elbow method, it also picks a range of k values, and it draws the silhouette graph by calculating the silhouette coefficient of every point. For each point it calculates the average distance to the other points within its own cluster, a(i), and the average distance to the points of the next closest cluster, b(i).

Note: for a good clustering, the a(i) value should be less than the b(i) value, that is, a(i) << b(i).
Fig 4.12.4 Silhouette method

Now that we have the values of a(i) and b(i), we calculate the silhouette coefficient for each point using the formula below.

We can then calculate the silhouette coefficient of all the points in the clusters and plot the silhouette graph. This plot is also helpful in detecting outliers; silhouette values lie between -1 and 1.

Also, check for the plot which has fewer outliers, which means fewer negative values, and then choose that value of k for your model to tune.
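A short sketch of computing the silhouette score for several k values; it reuses the same synthetic blobs as in the elbow sketch.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(5, 1, (50, 2))])

for k in range(2, 7):   # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))   # values close to 1 indicate well-separated clusters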

fig 4.12.5 K-means plot difference

Advantages of K-means

1. It is very simple to implement.
2. It is scalable to huge data sets and is also fast on large data sets.
3. It adapts to new examples very easily.
4. It generalizes to clusters of different shapes and sizes.

Disadvantages of K-means

1. It is sensitive to outliers.
2. Choosing the k values manually is a tough job.
3. As the number of dimensions increases, its scalability decreases.

Software Requirement Specifications

For data science, you typically need a combination of software tools to perform various tasks
such as data manipulation, analysis, visualization, and machine learning. Here's a list of essential
software requirements for data science:

1. Programming Languages:

• Python: It's the most widely used language in data science due to its extensive
libraries for data manipulation (e.g., Pandas), visualization (e.g., Matplotlib,
Seaborn), and machine learning (e.g., Scikit-learn, TensorFlow, PyTorch).

• R: Another popular language for statistical analysis and visualization, particularly


in academia.

2. Integrated Development Environments (IDEs):

• Jupyter Notebook: A web-based interactive computing environment that allows you to create and share documents containing live code, equations, visualizations, and narrative text.

• Spyder: A powerful IDE for Python that provides a MATLAB-like interface for
data analysis.

3. Data Manipulation and Analysis:

• Pandas: A Python library for data manipulation and analysis.

• NumPy: Provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.

• R Studio: An integrated development environment (IDE) for R that makes data analysis easier with its intuitive interface.

4. Data Visualization:

• Matplotlib: A plotting library for Python that provides a MATLAB-like interface for creating static, interactive, and animated visualizations.

• Seaborn: A Python visualization library based on Matplotlib that provides a high-level interface for drawing attractive statistical graphics.
• ggplot2: A plotting system for R, based on the grammar of graphics, which provides a highly customizable approach to data visualization.

5. Machine Learning:

• Scikit-learn: A simple and efficient tool for data mining and data analysis, built
on NumPy, SciPy, and Matplotlib.

• TensorFlow / Keras: TensorFlow is an open-source machine learning library developed by Google. Keras is a high-level neural networks API, which can run on top of TensorFlow.

• PyTorch: An open-source machine learning library developed by Facebook's AI Research lab that provides flexibility and speed.

6. Deep Learning (Optional):

• TensorFlow / Keras: Widely used for deep learning tasks due to their flexibility and performance.

• PyTorch: Another popular choice for deep learning, known for its dynamic computational graph and ease of use.

7. Data Storage and Management:

• SQL: Understanding of SQL is essential for querying databases and extracting relevant data.

• SQLite: A C-language library that implements a small, fast, self-contained, high-reliability, full-featured SQL database engine.

• NoSQL Databases (Optional): Depending on the project requirements, familiarity with NoSQL databases like MongoDB or Cassandra might be necessary.

8. Version Control:

• Git: Essential for tracking changes in code and collaborating with other team
members. Platforms like GitHub, GitLab, or Bitbucket are commonly used for
hosting Git repositories.

9. Text Editors:

• VS Code: A lightweight and powerful source code editor that comes with built-in support for Python and many other languages.

• Atom, Sublime Text, etc.: Other popular text editors with extensive support for
various programming languages.

10. Cloud Computing Platforms (Optional but increasingly important):

• AWS, Azure, Google Cloud Platform (GCP): Familiarity with cloud services
for deploying, managing, and scaling data science applications can be
advantageous.

Module 5. Video Screenshots

Module 6. Quiz Screenshots

Module 7. Internship Project

Title: Lung Cancer Prediction

1. Introduction
Lung cancer is a leading cause of cancer-related deaths worldwide, and early detection plays a crucial role in improving patient outcomes. This project aims to develop a predictive model using data science, machine learning, and AI techniques to assist in the early detection and prediction of lung cancer based on patient data.

2. Objectives
• Utilize data science methodologies to preprocess and analyze large datasets containing patient demographics, medical history, lifestyle factors, and potentially genetic information.
• Apply machine learning algorithms to build a predictive model capable of identifying patterns and features associated with lung cancer risk.
• Develop an AI-driven system that can predict the likelihood of lung cancer based on input variables.
• Evaluate the model's performance metrics such as accuracy, sensitivity, specificity, and area under the ROC curve (AUC) to assess its reliability and effectiveness.

3. Methodology

Data Collection and Preprocessing:


• Gather a comprehensive dataset comprising patient records including demographics (age, gender), medical history (smoking status, exposure to carcinogens), genetic markers (if available), and diagnostic tests (CT scans, X-rays).
• Perform data cleaning to handle missing values and outliers and to ensure data consistency.

Feature Engineering:
• Extract relevant features from the dataset that are indicative of lung cancer risk (e.g., smoking history, age, exposure duration).

Model Development:
• Implement machine learning algorithms such as logistic regression, decision trees, random forests, support vector machines (SVM), or deep learning techniques like convolutional neural networks (CNNs).
• Split the dataset into training and testing sets for model training and validation.

Model Evaluation and Optimization:

• Evaluate the performance of the developed models using appropriate metrics (accuracy, precision, recall, F1-score).
• Optimize hyperparameters using techniques like grid search or random search to improve model performance, as in the sketch after this list.
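A minimal, hedged sketch of the modeling and evaluation steps above. The file name "lung_cancer.csv", its column names, and the assumption that the features are already numerically encoded and the target is 0/1 are all illustrative; this is not the actual internship dataset or code.

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

df = pd.read_csv("lung_cancer.csv").dropna()      # assumed file with encoded columns
X = df.drop(columns=["LUNG_CANCER"])              # assumed feature columns
y = df["LUNG_CANCER"]                             # assumed 0/1 target column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# grid search over a small hyperparameter grid
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    {"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
                    scoring="roc_auc", cv=5)
grid.fit(X_train, y_train)

best = grid.best_estimator_
print(classification_report(y_test, best.predict(X_test)))        # accuracy, precision, recall, F1-score
print(roc_auc_score(y_test, best.predict_proba(X_test)[:, 1]))    # area under the ROC curve (AUC)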
4. AI Implementation

• Develop an AI-driven system that takes input data (patient attributes) and provides a predicted probability or risk score of developing lung cancer.
• Integrate the predictive model into a user-friendly interface (web application or API) for easy accessibility and use by healthcare professionals.

5. Ethical Considerations
• Ensure patient data privacy and confidentiality throughout the project lifecycle.
• Address bias and fairness issues in the dataset and model predictions to ensure equitable outcomes.

6. Expected Outcomes
• A robust predictive model capable of accurately identifying individuals at high risk of developing lung cancer.
• Insights into key risk factors and predictors associated with lung cancer development.
• A scalable AI solution that can be further extended or integrated into clinical decision support systems.

7. Summary

This project aims to leverage the power of data science and machine learning to enhance early detection and prediction of lung cancer, ultimately contributing to improved patient outcomes and healthcare decision-making. By integrating AI-driven technologies into cancer screening processes, this work has the potential to revolutionize preventive healthcare strategies.

IMPLEMENTATION

Module 8. List of References

Schapire, R.E. (2003). The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification, pp. 149-172. Springer.

Schapire, R.E., Freund, Y., Bartlett, P. and Lee, W.S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics 26(5):1651-1686.

Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.

Ragavan, H. and Rendell, L.A. (1993). Lookahead feature construction for learning hard concepts. In Proceedings of the Tenth International Conference on Machine Learning (ICML 1993), pp. 252-259. Morgan Kaufmann.

Rajnarayan, D.G. and Wolpert, D. (2010). Bias-variance trade-offs: Novel applications. In C. Sammut and G.I. Webb (eds.), Encyclopedia of Machine Learning, pp. 101-110. Springer.

Module 9. Conclusion

In summary, the convergence of data science, machine learning, and artificial intelligence
represents a transformative force in our digital landscape. This interdisciplinary fusion empowers us to
extract valuable insights from vast datasets, automate processes, and create intelligent systems capable of learning and adapting. By leveraging advanced algorithms, statistical techniques, and computational
power, practitioners in these fields can tackle complex problems, drive innovation, and enhance
decision-making across diverse domains. As we continue to advance, interdisciplinary collaboration
and ongoing research will further propel the capabilities of data science, machine learning, and AI,
ushering in a new era of technological sophistication and societal impact.

