C0 Report
ON
DATA SCIENCE MACHINE LEARNING, AI
A Report submitted in partial fulfilment of the requirements for the Award of Degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
BY
SIR C R REDDY COLLEGE OF ENGINEERING
ACCREDITED BY NBA
2023-24
CERTIFICATE
This is to certify that the Internship report entitled “DATA SCIENCE MACHINE
LEARNING, AI”, submitted by NEELAM SASI PRIYA (20B81A05C0) in partial fulfillment of
the requirements for the award of the degree of BACHELOR OF TECHNOLOGY in COMPUTER
SCIENCE AND ENGINEERING at SIR C R REDDY COLLEGE OF ENGINEERING, ELURU, affiliated to
Jawaharlal Nehru Technological University, Kakinada, is a bonafide record of work carried out
during the academic year 2023-2024.
External Examiner
INTERNSHIP ACCEPTANCE LETTER
INTERNSHIP CERTIFICATE
ACKNOWLEDGEMENT
First, I would like to thank Mr. Pavan Chalamalasetti, CEO of Data Valley Pvt. Ltd.,
Vijayawada, for giving me the opportunity to do an internship within the organization.
I would also like to thank all the people who worked along with me at Data Valley Pvt. Ltd.,
Vijayawada; with their patience and openness they created an enjoyable working environment.
It is indeed with a great sense of pleasure and immense sense of gratitude that I acknowledge
the help of these individuals.
I am highly indebted to the management and the principal, DR. K. VENKATESWAR RAO, for
the facilities provided to accomplish this internship.
I would like to thank my Head of the Department DR. A. YESU BABU, for his constructive
criticism throughout my internship.
I would like to thank DR.B. MADHAVA RAO, Internship coordinator, Department of CSE
for his support to complete my internship.
I would like to thank MR. V. PRANAV, Internship guide, Department of CSE, for his support
and advice in getting and completing my internship at DATA VALLEY.
Data Science:
Data Science involves the collection, processing, and analysis of large volumes of data to extract
meaningful insights and patterns. It combines techniques from statistics, mathematics, computer
science, and domain knowledge to uncover valuable information that can be used for decision-
making and problem-solving. Data scientists use various tools and algorithms to clean and
preprocess data, perform exploratory data analysis, build predictive models, and communicate
findings to stakeholders.
Machine Learning is a subset of Artificial Intelligence that focuses on developing algorithms and
statistical models that enable computers to learn from and make predictions or decisions based on
data, without being explicitly programmed. ML algorithms are categorized into supervised
learning (where models learn from labelled data), unsupervised learning (for finding patterns in
unlabeled data), and reinforcement learning (where agents learn to make decisions through trial
and error based on rewards). Common ML techniques include regression, classification,
clustering, and deep learning.
Artificial Intelligence encompasses a broader range of technologies and methods that enable
machines to simulate human intelligence. This includes not only machine learning but also areas
such as natural language processing (NLP), computer vision, robotics, expert systems, and
knowledge representation. AI systems aim to perform tasks that typically require human cognitive
abilities, such as understanding language, recognizing objects in images, making decisions, and
solving complex problems.
ORGANIZATION INFORMATION
Datavalley.ai is a leading provider of top-notch training and consulting services in the cutting-edge fields of Big Data, Data Engineering, Data Architecture, DevOps, Data Science, Machine
Learning, IoT, and Cloud Technologies.
Training:
Data Valley training programs, led by industry experts, are tailored to equip professionals and
organizations with the essential skills and knowledge needed to thrive in the rapidly evolving
data landscape. We believe in continuous learning and growth, and our commitment to staying
on top of emerging trends and technologies ensures that our clients receive the most cutting-edge
training possible.
WEEKLY OVERVIEW OF INTERNSHIP ACTIVITIES
22/03/24: Coding, project, quizzes
Data science is the study of data, just as the biological sciences are the study of biology and the
physical sciences are the study of physical phenomena. Data is real, data has real properties, and we
need to study those properties if we are going to work with data. Data science is a process, not an
event: it is the process of using data to understand many different things, to understand the world.

Suppose you have a model or proposed explanation of a problem, and you try to validate that
proposed explanation or model with your data. Data science is the skill of unfolding the insights and
trends that are hiding (or abstract) behind data. It is when you translate data into a story, and use
that storytelling to generate insight. With these insights, you can make strategic choices for a
company or an institution.
Predictive modeling:
Predictive modeling is a form of artificial intelligence that uses data mining and probability to forecast or
estimate more granular, specific outcomes.
For example, predictive modeling could help identify customers who are likely to purchase our new One AI
software over the next 90 days.
Machine Learning:
Machine learning is a branch of artificial intelligence (AI) in which computers learn to act and adapt
to new data without being explicitly programmed to do so. The computer is able to act independently of
human interaction.
Forecasting:
Forecasting is the process of predicting or estimating future events based on past and present data,
most commonly by the analysis of trends. "Guessing" doesn't cut it. A forecast, unlike a prediction,
must have logic to it. It must be defendable. This logic is what differentiates it from the magic
8-ball's lucky guess. After all, even a broken watch is right twice a day.
Module-2: Python for Data Science
Introduction to Python
Python is a high-level, general-purpose, and very popular programming language.
Python is used in web development, Machine Learning applications, and many other
cutting-edge areas of the software industry.
Python is very well suited for beginners, and also for programmers experienced in
other languages such as C++ and Java.
2.2 Lists
Lists in Python are the most versatile data structure. They are used to store heterogeneous data
items, from integers to strings or even another list! They are also mutable, which means that their
elements can be changed even after the list is created.
Creating Lists
Lists are created by enclosing elements within [square] brackets, with each item separated
by a comma.
Using negative indexes, we can return the nth element from the end of the list easily. If we
wanted to return the first element from the end, or the last index, the associated index is -1.
Similarly, the index for the second last element will be -2, and so on. Remember, the 0th index
will still refer to the very first element in the list.
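As a small sketch of both positive and negative indexing (the list contents here are hypothetical):

```python
# Lists hold heterogeneous items and are created with [square] brackets.
items = [10, "apple", 3.14, [1, 2]]

first = items[0]         # 0th index -> the very first element: 10
last = items[-1]         # -1 -> the last element: [1, 2]
second_last = items[-2]  # -2 -> the second last element: 3.14
```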
Removing elements from Lists
Removing elements from a list is as easy as adding them; it can be done with list methods such as remove() and pop().
Sorting Lists
On comparing two strings, we just compare the integer values of each character from the
beginning. If we encounter the same characters in both the strings, we just compare the next
character until we find two differing characters.
Concatenating Lists
We can even concatenate two or more lists by simply using the + symbol. This will return a new
list containing elements from both the lists:
List Comprehensions
A very interesting application of Lists is List comprehension which provides a neat way of
creating new lists. These new lists are created by applying an operation on each element of an
existing list.
It is easy to see their impact if we first check out how the same thing can be done using good old for-loops.
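For instance, with a hypothetical squaring task, the for-loop version and the list-comprehension version produce the same result:

```python
numbers = [1, 2, 3, 4, 5]

# The good old for-loop: build the new list element by element.
squares_loop = []
for n in numbers:
    squares_loop.append(n ** 2)

# The list comprehension: the same operation in a single expression.
squares_comp = [n ** 2 for n in numbers]
```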
Stacks are a list of elements in which the addition or deletion of elements is done from the end of the
list. Think of it as a stack of books: whenever you need to add or remove a book from the stack,
you do it from the top. It uses the simple concept of Last-In-First-Out.
Queues, on the other hand, are a list of elements in which the addition of elements takes place at
the end of the list, but the deletion of elements takes place from the front of the list. You can think
of it as a queue in the real world: the queue becomes shorter when people at the front exit it, and
longer when someone new joins it at the end. It uses the concept of First-In-First-Out.
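A minimal sketch of both structures using a plain list and collections.deque (the element names are made up):

```python
from collections import deque

# Stack: Last-In-First-Out -- add and remove at the end (the "top").
stack = []
stack.append("book1")
stack.append("book2")
top = stack.pop()        # removes "book2", the most recently added book

# Queue: First-In-First-Out -- add at the end, remove from the front.
# deque gives O(1) removal from the front, unlike list.pop(0).
queue = deque()
queue.append("person1")
queue.append("person2")
front = queue.popleft()  # removes "person1", the person who joined first
```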
2.3 Dictionaries
A dictionary is another Python data structure, which stores data as key-value pairs. Keys must be
immutable (hashable) objects, while the dictionary itself is mutable and, since Python 3.7, preserves
insertion order.
Creating Dictionaries
Dictionaries are created by writing key-value pairs within {curly} braces: each key is separated
from its value by a colon, and the pairs are separated by commas:
Using the key of the item, we can easily extract the associated value of the item:
Dictionaries are very useful for accessing items quickly because, unlike lists and tuples, a dictionary
does not have to iterate over all the items to find a value. It uses the item's key to locate the
item's value directly; this concept is called hashing.
We can even access keys and values together using the items() method, which returns the respective
key-value pair for each element of the dictionary.
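A short sketch with a hypothetical dictionary of student marks:

```python
# Key-value pairs inside {curly} braces; a colon separates key from value.
marks = {"alice": 85, "bob": 72, "carol": 91}

# Extract a value directly by its key -- no iteration needed (hashing).
alice_score = marks["alice"]

# items() returns each key-value pair together.
pairs = list(marks.items())
```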
2.4 Understanding Standard Libraries in Python
PANDAS
When it comes to data manipulation and analysis, nothing beats Pandas. It is the most
popular Python library, period. Pandas is built especially for data manipulation and
analysis tasks.
Pandas provides features like:
• Dataset joining and merging.
• Data Structure column deletion and insertion
• Data filtration
• Reshaping datasets
• Data Frame objects to manipulate data, and much more!
NUMPY
NumPy, like Pandas, is an incredibly popular Python library. NumPy brings in functions to
support large multi-dimensional arrays and matrices. It also brings in high-level mathematical
functions to work with these arrays and matrices. NumPy is an open-source library and has
multiple contributors.
MATPLOTLIB
Matplotlib is the most popular data visualization library in Python. It allows us to generate
and build plots of all kinds. This is my go-to library for exploring data visually along with
Seaborn.
The iloc indexer allows us to retrieve rows and columns by position. To do that, we need to
specify the positions of the rows that we want, as well as the positions of the columns. The
df.iloc indexer is very similar to df.loc but only uses integer locations to make its
selections.
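A sketch with a small made-up DataFrame, showing iloc (position-based) next to loc (label-based):

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"],
                   "age": [25, 30, 35],
                   "city": ["X", "Y", "Z"]})

# iloc: rows 0-1 (end exclusive) and columns 0 and 2, by integer position.
subset = df.iloc[0:2, [0, 2]]

# loc: the same selection by label (note: loc slices are end inclusive).
by_label = df.loc[0:1, ["name", "city"]]
```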
Module-3: Understanding the Statistics for Data Science
3.1 Introduction to Statistics
Statistics simply means numerical data and is the field of mathematics that deals with the collection,
tabulation, and interpretation of numerical data. It is a form of mathematical analysis that uses
different quantitative models on experimental data or studies of real life. It is an area of applied
mathematics concerned with data collection, analysis, interpretation, and presentation. Statistics
deals with how data can be used to solve complex problems. Some people consider statistics to be a
distinct mathematical science rather than a branch of mathematics. Statistics makes work easy and
simple and provides a clear and clean picture of the work you do on a regular basis.
3.2 Measures of Central Tendency
(i) Mean:
It is a measure of the average of all values in a sample set.
(ii) Median:
It is a measure of the central value of a sample set. The data set is ordered from lowest to
highest value, and the exact middle value is then taken as the median.
(iii) Mode:
It is the value that occurs most frequently in the sample set.
Measure of Variability is also known as measure of dispersion and is used to describe the variability
in a sample or population. In statistics, there are three common measures of variability, as shown below:
(i) Range:
It is a measure of how spread apart the values in a data set are: the difference between the largest
and smallest values.
(ii) Variance:
It describes how much a random variable differs from its expected value, and is computed as the
average of the squared deviations from the mean:
S² = (1/n) ∑ᵢ₌₁ⁿ (xᵢ − x̄)²
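These measures can be checked with Python's built-in statistics module on a small hypothetical sample:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample, already sorted

mean = statistics.mean(data)           # sum of values / n = 40 / 8 = 5
median = statistics.median(data)       # middle of the ordered data: (4 + 5) / 2
mode = statistics.mode(data)           # most frequent value: 4
rng = max(data) - min(data)            # range: 9 - 2 = 7
variance = statistics.pvariance(data)  # S^2: mean of the squared deviations
```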
Module-4: Predictive Modeling and Basics of Machine Learning
Predictive analytics involves certain manipulations on data from existing data sets with the goal
of identifying some new trends and patterns. These trends and patterns are then used to predict
future outcomes and trends. By performing predictive analysis, we can predict future trends and
performance. It is also defined as the prognostic analysis; the word prognostic means prediction.
Predictive analytics uses the data, statistical algorithms and machine learning techniques to
identify the probability of future outcomes based on historical data.
4.2 Understanding the Types of Predictive Models
Supervised learning
Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher.
Basically, supervised learning is learning in which we teach or train the machine using data that is
well labeled, meaning the data is already tagged with the correct answer. After that, the machine is
provided with a new set of examples (data) so that the supervised learning algorithm analyses the
training data (the set of training examples) and produces a correct outcome from the labeled data.
Unsupervised learning
Unsupervised learning is the training of a machine using information that is neither classified nor
labeled, allowing the algorithm to act on that information without guidance. Here the task of the
machine is to group unsorted information according to similarities, patterns, and differences
without any prior training on the data.
4.5 Data Extraction
In general terms, “mining” is the process of extracting some valuable material from the earth,
e.g. coal mining, diamond mining, etc. In the context of computer science, “Data Mining” refers
to the extraction of useful information from a bulk of data or from data warehouses. One can see that
the term itself is a little confusing: in the case of coal or diamond mining, the result of the
extraction process is coal or diamond, but in the case of Data Mining, the result of the extraction
process is not data! Instead, the result is the patterns and knowledge that we gain at the end of
the extraction process. In that sense, Data Mining is also known as Knowledge Discovery or
Knowledge Extraction.
➢ Data Pre-processing – Data cleaning, integration, selection and transformation takes place.
➢ Data Extraction – Occurrence of exact data mining
➢ Data Evaluation and Presentation – Analyzing and presenting results.
Remember, the quality of your inputs decides the quality of your output. So, once you have your
business hypothesis ready, it makes sense to spend a lot of time and effort here. By my personal
estimate, data exploration, cleaning, and preparation can take up to 70% of your total
project time.
Below are the steps involved to understand, clean and prepare your data for building your
predictive model:
• Variable Identification
• Univariate Analysis
• Bi-variate Analysis
• Missing values treatment
• Outlier treatment
• Variable transformation
• Variable creation
Finally, we will need to iterate over steps 4-7 multiple times before we come up with our
refined model.
Python provides inbuilt functions for creating, writing, and reading files. There are two types of
files that can be handled in Python: normal text files and binary files (written in binary language,
0s and 1s).
• Text files: In this type of file, each line of text is terminated with a special character
called EOL (End of Line), which is the newline character (‘\n’) in Python by default.
• Binary files: In this type of file, there is no terminator for a line, and the data is
stored after converting it into machine-understandable binary language.
Access modes govern the type of operations possible in the opened file. The mode refers to how the
file will be used once it is opened. These modes also define the location of the file handle in the
file. A file handle is like a cursor that defines from where the data has to be read or written in
the file. Some common access modes are:
Read Only (‘r’): Open a text file for reading. The handle is positioned at the beginning of the file.
If the file does not exist, an I/O error is raised. This is also the default mode in which a file is opened.
Read and Write (‘r+’): Open the file for reading and writing. The handle is positioned at the
beginning of the file. Raises an I/O error if the file does not exist.
Append and Read (‘a+’): Open the file for reading and writing. The file is created if it does not exist.
The handle is positioned at the end of the file, and the data being written will be inserted at the
end, after the existing data.
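A brief sketch of these modes in action, writing to a temporary file (the file name is arbitrary):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as f:    # 'w': create/truncate the file for writing
    f.write("first line\n")

with open(path, "r") as f:    # 'r': read only, handle at the beginning
    content = f.read()

with open(path, "a+") as f:   # 'a+': writes always go to the end
    f.write("second line\n")
    f.seek(0)                 # move the handle back to read everything
    all_lines = f.readlines()
```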
First, identify Predictor (Input) and Target (output) variables. Next, identify the data type and
category of the variables.
Basics of Model Building
Lifecycle of Model Building:
• Define success
• Explore data
• Condition data
• Select variables
• Balance data
• Build models
• Validate
• Deploy
• Maintain
Data exploration is used to figure out the gist of the data and to develop a first-step assessment of
its quality, quantity, and characteristics. Visualization techniques can also be applied. However,
this can be a difficult task in high-dimensional spaces with many input variables. In data
conditioning, we group the functional data to which the modeling techniques are applied and then do
rescaling; in some cases rescaling is an issue if variables are coupled. Variable selection is very
important to develop a quality model.
This process is implicitly model-dependent, since it is used to configure which combination of
variables should be used in ongoing model development. Data balancing partitions the data into
appropriate subsets for training, test, and validation. Model building focuses on the desired
algorithms. The most famous technique is symbolic regression; other techniques, such as linear
regression, can also be preferred.
4.10 Logistic Regression
Any change in the coefficient leads to a change in both the direction and the steepness of the
logistic function: positive slopes result in an S-shaped curve and negative slopes result in a
Z-shaped curve.
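This effect of the coefficient's sign can be seen with a tiny sketch of the logistic (sigmoid) function (the coefficient values here are arbitrary):

```python
import math

def logistic(x, coef, intercept=0.0):
    # 1 / (1 + e^-(coef*x + intercept)), always between 0 and 1
    return 1.0 / (1.0 + math.exp(-(coef * x + intercept)))

# Positive coefficient: S-shaped curve, rising as x grows.
s_curve = [logistic(x, coef=1.0) for x in (-4, 0, 4)]

# Negative coefficient: Z-shaped curve, falling as x grows.
z_curve = [logistic(x, coef=-1.0) for x in (-4, 0, 4)]
```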
Decision Tree: The decision tree is one of the most powerful and popular tools for classification and
prediction. A decision tree is a flowchart-like tree structure, where each internal node denotes a
test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal
node) holds a class label.
4.12 K-means
K-means clustering tries to group similar kinds of items in the form of clusters: it finds the
similarity between the items and groups them into clusters. The K-means clustering algorithm works
in three steps. Let's see what these three steps are.
Let us understand these steps with the help of the figure, because a good picture is better than a
thousand words.
fig 4.12.1 K-means
• Figure 1 shows the representation of data for two different items: the first item is
shown in blue and the second item in red. Here I am choosing the value of K as 2 at
random; there are different methods by which we can choose the right k value.
• In figure 2, we join the two selected points. To find the centroid, we draw a
perpendicular line to that line. The points then move to their centroid. If you look
closely, you will see that some of the red points have now moved to the blue points;
these points now belong to the group of blue items.
• The same process continues in figure 3: we join the two points, draw a perpendicular
line to that, and find the centroid. The two points again move to their centroid, and
again some of the red points get converted to blue points.
• The same process happens in figure 4. This process continues until we get two
completely distinct clusters of these groups.
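The iterative assign-then-recompute idea behind those figures can be sketched in a few lines of NumPy. The data points and k=2 below are made up for illustration; a production system would typically use a library implementation such as scikit-learn's KMeans:

```python
import numpy as np

def kmeans(points, k, iters=10, seed=0):
    """Minimal K-means sketch: pick k starting centroids from the data,
    assign every point to its nearest centroid, move each centroid to
    the mean of its assigned points, and repeat."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # distance of every point to every centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its cluster
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# Two well-separated hypothetical groups of points.
pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans(pts, k=2)
```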
How to choose the value of K?
One of the most challenging tasks in this clustering algorithm is to choose the right value of k.
What should the right k-value be? How do we choose it? Let us find the answer to these questions.
If you choose the k value randomly, it might be correct or it might be wrong; if you choose the
wrong value, it will directly affect your model performance. There are two methods by which you
can select the right value of k:
1. Elbow Method
2. Silhouette Method
Elbow Method:
The elbow method is one of the most famous methods by which you can select the right value of k and
boost your model performance. We also perform hyperparameter tuning to choose the best value of k.
Let us see how this elbow method works.
It is an empirical method to find the best value of k: it picks a range of candidate values and takes
the best among them. For each k, it calculates the within-cluster sum of squares, i.e. the total
squared distance of the points from their cluster centroids.
When the value of k is 1, the within-cluster sum of squares will be high. As the value of k
increases, the within-cluster sum of squares will decrease.
Fig 4.12.2 Elbow method
Finally, we plot a graph between the k-values and the within-cluster sum of squares and examine it
carefully. At some point the graph stops decreasing abruptly: that point, the "elbow", is taken as
the value of k.
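The quantity being plotted can be sketched directly. A hypothetical 1-D data set with two natural groups shows the within-cluster sum of squares dropping sharply when k goes from 1 to 2:

```python
import numpy as np

def wcss(points, labels, centroids):
    # total squared distance of each point to its own cluster centroid
    return sum(((points[labels == j] - c) ** 2).sum()
               for j, c in enumerate(centroids))

pts = np.array([[0.0], [0.2], [5.0], [5.2]])

# k = 1: a single centroid at the overall mean -> large WCSS
w1 = wcss(pts, np.zeros(4, dtype=int), pts.mean(axis=0, keepdims=True))

# k = 2: one centroid per natural group -> WCSS drops sharply (the elbow)
labels2 = np.array([0, 0, 1, 1])
c2 = np.array([pts[:2].mean(axis=0), pts[2:].mean(axis=0)])
w2 = wcss(pts, labels2, c2)
```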
Silhouette Method
For each point i, a(i) is the mean distance from i to the other points in its own cluster, and b(i)
is the mean distance from i to the points in the nearest other cluster. For a well-clustered point,
a(i) should be less than b(i), ideally a(i) << b(i).
Fig 4.12.4 Silhouette method
With the values of a(i) and b(i), we calculate the silhouette coefficient using the formula
s(i) = (b(i) − a(i)) / max(a(i), b(i)).
We can then calculate the silhouette coefficient of all the points in the clusters and plot the
silhouette graph. This plot is also helpful in detecting outliers. The silhouette coefficient always
lies between -1 and 1.
Also, check for the plot that has fewer outliers, i.e. fewer negative values, and choose that value
of k for your model to tune.
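A sketch of the computation for a single, hypothetical point (distances are plain Euclidean):

```python
import numpy as np

def silhouette(point, own_cluster, other_cluster):
    """Silhouette coefficient for one point:
    a = mean distance to the points of its own cluster,
    b = mean distance to the points of the nearest other cluster,
    s = (b - a) / max(a, b), always between -1 and 1."""
    a = np.mean([np.linalg.norm(point - p) for p in own_cluster])
    b = np.mean([np.linalg.norm(point - p) for p in other_cluster])
    return (b - a) / max(a, b)

own = [np.array([0.0, 0.0]), np.array([0.0, 1.0])]
other = [np.array([5.0, 5.0]), np.array([6.0, 5.0])]

# A point sitting inside its own tight cluster: a is small, b is large,
# so the coefficient is close to +1 (well clustered).
s = silhouette(np.array([0.0, 0.5]), own, other)
```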
Advantages of K-means
Disadvantages of K-means
Software Requirement Specifications
For data science, you typically need a combination of software tools to perform various tasks
such as data manipulation, analysis, visualization, and machine learning. Here's a list of essential
software requirements for data science:
1. Programming Languages:
• Python: It's the most widely used language in data science due to its extensive
libraries for data manipulation (e.g., Pandas), visualization (e.g., Matplotlib,
Seaborn), and machine learning (e.g., Scikit-learn, TensorFlow, PyTorch).
• Spyder: A powerful IDE for Python that provides a MATLAB-like interface for
data analysis.
• NumPy: Provides support for large, multi-dimensional arrays and matrices, along
with a collection of mathematical functions to operate on these arrays.
4. Data Visualization:
5. Machine Learning:
• Scikit-learn: A simple and efficient tool for data mining and data analysis, built
on NumPy, SciPy, and Matplotlib.
• TensorFlow / Keras: Widely used for deep learning tasks due to their flexibility
and performance.
• PyTorch: Another popular choice for deep learning, known for its dynamic
computational graph and ease of use.
8. Version Control:
• Git: Essential for tracking changes in code and collaborating with other team
members. Platforms like GitHub, GitLab, or Bitbucket are commonly used for
hosting Git repositories.
9. Text Editors:
• VS Code: A lightweight and powerful source code editor that comes with built-in
support for Python and many other languages.
• Atom, Sublime Text, etc.: Other popular text editors with extensive support for
various programming languages.
• AWS, Azure, Google Cloud Platform (GCP): Familiarity with cloud services
for deploying, managing, and scaling data science applications can be
advantageous.
Module 5. Video Screenshots
Module 6. Quiz Screenshots
Module 7. Internship Project
1. Introduction
Lung cancer is a leading cause of cancer-related deaths worldwide, and early detection plays a
crucial role in improving patient outcomes. This project aims to develop a predictive model using
data science, machine learning, and AI techniques to assist in the early detection and prediction of
lung cancer based on patient data.
2. Objectives
• Utilize data science methodologies to preprocess and analyze large datasets containing
patient demographics, medical history, lifestyle factors, and potentially genetic
information.
• Apply machine learning algorithms to build a predictive model capable of identifying
patterns and features associated with lung cancer risk.
• Develop an AI-driven system that can predict the likelihood of lung cancer based on input variables.
• Evaluate the model's performance metrics, such as accuracy, sensitivity, specificity, and
area under the ROC curve (AUC), to assess its reliability and effectiveness.
3. Methodology
Feature Engineering:
• Extract relevant features from the dataset that are indicative of lung cancer
risk (e.g., smoking history, age, exposure duration).
Model Development:
• Implement machine learning algorithms such as logistic regression, decision trees, random
forests, support vector machines (SVM), or deep learning techniques like convolutional neural
networks (CNNs).
• Split the dataset into training and testing sets for model training and validation.
• Develop an AI-driven system that takes input data (patient attributes) and provides a
predicted probability or risk score of developing lung cancer.
• Integrate the predictive model into a user-friendly interface (web application or API) for
easy accessibility and use by healthcare professionals.
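As an illustrative sketch of the train/test split step, using random stand-in data rather than real patient records:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical patient features (e.g. age, smoking years, exposure, BMI)
# and binary labels (1 = lung cancer, 0 = healthy) -- synthetic stand-ins.
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# Shuffle the row indices, then hold out 20% of the rows for testing.
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = idx[:split], idx[split:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```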
5. Ethical Considerations
• Ensure patient data privacy and confidentiality throughout the project lifecycle.
• Address bias and fairness issues in the dataset and model predictions to ensure equitable outcomes.
6. Expected Outcomes
• A robust predictive model capable of accurately identifying individuals at high risk of
developing lung cancer.
• Insights into key risk factors and predictors associated with lung cancer development.
• A scalable AI solution that can be further extended or integrated into clinical decision
support systems.
7. Summary
This project aims to leverage the power of data science and machine learning to enhance early
detection and prediction of lung cancer, ultimately contributing to improved patient outcomes and
healthcare decision-making. By integrating AI-driven technologies into cancer screening processes,
this work has the potential to revolutionize preventive healthcare strategies.
IMPLEMENTATION
Module 9. Conclusion
In summary, the convergence of data science, machine learning, and artificial intelligence
represents a transformative force in our digital landscape. This interdisciplinary fusion empowers us to
extract valuable insights from vast datasets, automate processes, and create intelligent systems capable
of learning and adapting. By leveraging advanced algorithms, statistical techniques, and computational
power, practitioners in these fields can tackle complex problems, drive innovation, and enhance
decision-making across diverse domains. As we continue to advance, interdisciplinary collaboration
and ongoing research will further propel the capabilities of data science, machine learning, and AI,
ushering in a new era of technological sophistication and societal impact.