Machine Learning for Breast Cancer Diagnosis

The document discusses breast cancer as a leading cause of cancer deaths and emphasizes the importance of early diagnosis through machine learning algorithms. It outlines various types of breast cancer, risk factors, and existing detection systems, while proposing a new system using support vector machines for improved accuracy. Additionally, it details the project’s objectives, feasibility studies, and the tools and technologies utilized in the development process.


CHAPTER 1

INTRODUCTION

1.1 PROJECT DESCRIPTION

Breast cancer is regarded as the leading cause of cancer fatalities globally, because it is difficult to detect early and treatment options are limited once it is found. Early diagnosis is therefore strongly recommended to help people lead healthy lives.

Machine learning algorithms can give a near-accurate indication of cancer from the patient's condition and the available datasets even before a thorough medical checkup. This has developed into a procedure that is ahead of its time, assessing the condition and helping to anticipate it with the necessary precision. Many lives have been saved thanks to this computer-generated response.

Research has been conducted to anticipate abnormal tumors and to accurately identify and classify cases as malignant or benign, using techniques such as magnetic resonance imaging (MRI), ultrasonography, diagnostic mammography, and biopsy.

A cancer that originates in the breast tissue is called breast cancer. Breast cancer symptoms include breast lumps, changes in breast shape, skin dimpling, milk rejection, fluid emerging from the nipple, recently inverted nipples, and red or scaly skin patches. Those in whom the disease has spread widely may experience yellow skin, shortness of breath, enlarged lymph nodes, and bone discomfort.

Obesity, inactivity, alcohol consumption, hormone replacement therapy following menopause, ionizing radiation, early age at menarche, late or no childbirth, advanced age, a prior history of breast cancer, and a family history of breast cancer are all risk factors for this illness.

Hereditary genes, including BRCA mutations, account for roughly five to ten percent of cases. The cells that line the milk ducts and the lobules that supply them with milk are the most prevalent sources of breast cancer cells: ductal carcinomas are tumors that form in the ducts, whereas lobular carcinomas form in the lobules. There are over 18 further subtypes of breast cancer. Some forms of cancer, such as ductal carcinoma in situ, develop from pre-invasive lesions. Breast cancer is confirmed by biopsy of the suspicious tissue.
The three types of tumours are as follows:

a. Benign tumours grow slowly, cannot spread, and are not cancerous. They generally do not harm the body, even when doctors do not remove them.

b. Premalignant tumours have the potential to develop into cancer but are not already malignant.

c. Malignant tumours are cancerous and can spread swiftly throughout the body.
OBJECTIVE

• Recognize breast cancer risk factors.


• Differentiate between the different types of breast cancer.
• Compare the suggested breast cancer therapy options.
• Plan strategies with members of the interprofessional team to enhance patient care and optimize
results for breast cancer patients.
CHAPTER 2

LITERATURE SURVEY

Related work

• According to Anika Singh from UEM, Kolkata (2020), breast cancer risk may be estimated using both eager and lazy learners, although this cannot reach the highest accuracy achievable with current algorithms. According to their findings, the necessary accuracy of roughly 88 percent can be achieved by employing lazy learners to identify only the cell growth.
• According to Nitasha (2019), both eager and lazy learners can be trained to reach the required accuracy of about 90 percent by applying the necessary precision.
• In 2019, Navya Sri published a cross-comparison study of Bayesian classifiers and decision tree algorithms using the Waikato Environment for Knowledge Analysis, reporting a decision tree accuracy of 75.875% and a Bayesian result of 75.27%.
• In 2016, a professor at Brown University suggested that deep belief networks (DBN) and artificial neural networks (ANN) might work together to give more effective outcomes. To ensure more consistent results, the professor underlined the necessity of thorough training and experimentation.
• Shannon (2011) employed a metaplasticity artificial neural network to train an image-scanning algorithm, targeting a correctness close to 95%. It returned a result of 99.26%, which would be regarded as overfitting in current parlance. It was nevertheless a renowned breakthrough, because the value continually encouraged researchers to train and evaluate their data in ways that raised accuracy expectations above 95%, with capacities above 90%.
• The assertion made by Jiaxin Li in 2020 that training and testing are required for obtaining the true accuracy, irrespective of the algorithm, served as a major source of inspiration for this work. As a result, their accuracy standard was raised to nearly 95%.
• Nam Nhut Phan (2021) proposes employing convolutional neural networks in three stages: training (50%), testing (43%), and the remaining share for validation, to produce results free of redundant empty values or spurious correlations.
• N Gupta used an ensemble-based training strategy in 2021 to obtain the intended results. The notable breakthrough was an accuracy of 96.77%, obtained without the use of gradient boosting techniques.
• In 2020, Md. Milon Islam reported that linking artificial neural networks with SVM produced 96.82 percent accuracy (0.9777). They paid close attention to the earlier finding by Shannon [2011], who used a comparative analysis of artificial neural networks and deep belief networks to obtain an accuracy of more than 95% using only ANNs.
• In 2019, Chang Ming employed BOADICEA and BCRAT in conjunction with eight simulated datasets containing cancer carriers and their cancer-free relatives. The startling discovery was that one of the cancer-free patients received a positive prediction with 97% accuracy, indicating a reasonably high susceptibility to cancer.

2.1 EXISTING SYSTEM

Tumors fall into two categories: benign and malignant. A benign tumor is not cancerous, whereas a malignant tumor is. A number of techniques and algorithms, such as decision trees, logistic regression, and k-means clustering, are available for the detection of breast cancer.

A decision tree is a machine learning approach commonly used to forecast both continuous and categorical data. For good accuracy, the data must undergo considerable preprocessing before the decision tree is deployed. First, the dataset's features are ranked using a recursive feature elimination approach; the top 16 features are then chosen, and the decision tree is applied to them, as sketched below.
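As a rough, hedged illustration of this pipeline (not the project's exact code), the sketch below applies recursive feature elimination and a decision tree in scikit-learn; the built-in breast cancer dataset, the random seeds, and the 80/20 split are assumptions for demonstration only.

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative data: scikit-learn's built-in breast cancer dataset (30 features)
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target  # 0 = malignant, 1 = benign

# Recursive feature elimination keeps the 16 most informative features,
# then a decision tree is fitted on the reduced feature set
selector = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=16)
X_selected = selector.fit_transform(X, y)

X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Selected features:", list(X.columns[selector.support_]))
print("Decision tree test accuracy:", tree.score(X_test, y_test))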

DISADVANTAGES

• Dependency on hardware: decision tree training relies on capable (often parallel) processors for good efficiency; performance is therefore sensitive to the hardware on which it is realized.
• Inexplicable behavior: this is the main issue with decision trees here. The model does not explain why or how it arrived at a particular solution, which reduces its credibility.
2.2 PROPOSED SYSTEM

In machine learning, classifying data is a typical problem. The objective is to determine which of two classes the given data points belong to, and to decide in which of the two a new data point should be placed. In the context of support vector machines, a data point is viewed as a p-dimensional vector (a list of p numbers). Our goal is to determine whether we can separate such points with a (p-1)-dimensional hyperplane; this is called a linear classifier. Numerous hyperplanes might be used to separate the data, and the hyperplane that shows the greatest margin, or distance, between the two classes is a plausible candidate for the best hyperplane (a minimal sketch follows the list of advantages below).

• Applications such as text and photo categorization frequently deal with high-dimensional data, and support vector machines (SVMs) are well suited to handling this kind of data.
• This machine learning method is highly accurate, although it is a complex and mathematical method.
• It is a flexible method that can handle a variety of problems, including binary, binomial, and multi-class classification as well as regression, and both linear and non-linear problems.
• SVM employs the notion of margins to optimize the differentiation between two classes; it decreases the likelihood of model overfitting, making the model extremely stable.
• The SVM algorithm is highly accurate in high dimensions and can compete with algorithms such as Naïve Bayes, which specialize in classification problems of extremely high dimensionality. This is due to the availability of kernels and the fundamental principles upon which SVM is built.
• SVM is renowned for its memory management and processing speed. Memory use is lower, particularly when compared with the deep learning algorithms with which SVM frequently competes and which it occasionally even outperforms.
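A minimal sketch of such a maximum-margin (linear) SVM classifier in scikit-learn follows; the built-in dataset, the scaling step, and the parameter C = 1.0 are illustrative assumptions rather than the project's final model.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative data: scikit-learn's built-in breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# A linear kernel looks for the separating hyperplane with the largest margin;
# C controls the trade-off between a wide margin and misclassified training points
svm = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1.0))
svm.fit(X_train, y_train)
print("SVM test accuracy:", svm.score(X_test, y_test))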

2.3 FEASIBILITY STUDY

This business proposal outlines the project's main objectives and provides some initial cost estimates in an effort to assess the project's feasibility and enhance system performance. The feasibility of the suggested system can be evaluated once a thorough investigation has been completed; before starting the feasibility study, a complete grasp of the system's fundamental needs is necessary. The feasibility study covers the following primary considerations.
• ECONOMICAL FEASIBILITY

Economic feasibility is the practice of understanding the financial requirements of the project, including the longer-term monetary needs for managing comparable conditions in the worst-case scenarios, given that predictions will be made and analytical revisions may follow from them. As part of the economic feasibility study, return-on-investment calculations are carried out, since understanding how much benefit can be gained from the arrangement is essential.

• OPERATIONAL FEASIBILITY

Operational knowledge is crucial, since it is needed to understand the different categories of incoming data, which allows us to generate a complete reporting channel. Reports will be structured so that information processing can be completed with ease.

Additionally, as part of the operational feasibility assessment, we will determine real-time guidance to support the appropriate course of action, because issues may occasionally arise during the examination of trend indications and heterogeneous media analysis.

• TECHNICAL FEASIBILITY

Technical feasibility requires matching the appropriate level of technological expertise so that we can give customers structured analysis reports and diverse comparative studies. Finding any issues that may be connected to the system design is broadly what the technical feasibility documentation covers.

• SOCIAL FEASIBILITY

An examination of the project's social feasibility looks at how it could change the community. This is carried out in order to gauge public interest in the project. It is possible that long-standing institutional systems and cultural norms will make a certain kind of worker hard to locate or nonexistent.
2.4 TOOLS AND TECHNOLOGIES USED
Machine Learning
Within the field of artificial intelligence (AI), machine learning (ML) entails creating algorithms that let computers analyze, interpret, and forecast data in order to make judgments or predictions. In contrast to traditional programming, which requires explicit instructions for each job, ML enables systems to learn from experience and adjust to new data without explicit programming.

Libraries:

• Pandas: an easy-to-use Python library for data handling. It is simple to learn, produces fast results, and is freely available as open source. We have used it to read the dataset and to carry out the data analysis.

• Matplotlib -is a Python library that is used to visualize data using various graphs and
scatterplots. It has been applied to data visualization in this case.

• Numpy: a Python library utilized for array processing. It serves many purposes; using the ravel function, we have utilized this module to convert a two-dimensional array into a contiguous flattened array.

• Flask: a lightweight web framework with which developers can create web apps quickly and effortlessly. Armin Ronacher, of the international group of Python enthusiasts Pocoo, created it. The Jinja2 templating engine and the Werkzeug WSGI toolkit serve as its fundamental foundations.

• Pickle: the pickle module implements binary protocols for serializing and de-serializing a Python object structure.
Pickling is the process of transforming a Python object hierarchy into a stream of bytes.
Unpickling is the reverse process, transforming a byte stream back into an object hierarchy. A short sketch combining Flask and pickle follows.
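To illustrate how Pickle and Flask fit together in a prediction web app of this kind, the sketch below pickles a trained model and serves predictions through a small Flask route; the file name model.pkl, the /predict endpoint, and the nine-value input format are illustrative assumptions, not the project's actual code.

import pickle
from flask import Flask, request, jsonify

# Pickling: save a fitted classifier (assumed here to be named 'clf') to disk
# with open('model.pkl', 'wb') as f:
#     pickle.dump(clf, f)

app = Flask(__name__)

# Unpickling: load the saved model once when the web app starts
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

@app.route('/predict', methods=['POST'])
def predict():
    # Expects JSON such as {"features": [4, 2, 1, 1, 1, 2, 3, 2, 1]}
    features = request.get_json()['features']
    prediction = model.predict([features])[0]
    return jsonify({'prediction': int(prediction)})

if __name__ == '__main__':
    app.run(debug=True)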
PROGRAMMING LANGUAGES
Python

Python is a high-level, general-purpose programming language. Its design philosophy puts code
readability first, with a focus on indentation.

Python offers garbage collection and dynamic typing. It is compatible with several programming
paradigms, such object-oriented, functional, and structured. Because of its wide standard library, it is
often referred to as a "batteries included" language.

Features of python

1. Open Source and Free

The official website offers the Python language for free; it can be obtained from the download link provided there. The source code is likewise available to the general public, because Python is open source. Thus, you may download, use, and distribute it freely.

2. Simple to write

Python is a high-level programming language. In contrast to other programming languages like C, C#, JavaScript, and Java, Python is incredibly simple to learn.

3. Easily Debugged

Python provides helpful information for identifying errors. Once you learn to analyze Python's error traces, you will be able to rapidly find and fix most of the problems in your application. You can often tell what the code is intended to do just by looking at it.

Advantages of python
• Existence of third-party modules: Python boasts a vast third-party module and library
ecosystem that expands its capabilities for a wide array of applications.
• Large support library: Python is appropriate for scientific and data-related applications due
to its large support library, which includes NumPy for numerical computations and Pandas for
data analytics.
• Open source and having a sizable, vibrant community: Python is open source and has a
sizable, vibrant community that supports and advances its development.
• Simplicity and readability: Python is renowned for its simplicity and readability, which makes it a great option for both novice and seasoned programmers. It is also versatile and easy to learn and use.
• User-friendly data structures: Python provides simple clear data structures, making data
administration and manipulation simple.
• High-level language: Python is a high-level language that makes things easier to use by
abstracting away low-level features.
• Language with dynamic typing: Python has dynamic typing, which eliminates the need for
explicit data type declarations and offers flexibility without sacrificing reliability
• Versatility: Python supports both procedural and object-oriented programming, allowing for a wide range of coding styles. It is also interactive and portable, running on many operating systems and enabling real-time code execution and testing.

Disadvantage of python
• Performance: Python, as an interpreted language, can't execute as swiftly as compiled
languages like Java or C. This can be problematic for activities requiring a lot of performance.
Python's Global Interpreter Lock (GIL) mechanism forbids the simultaneous execution of
Python code by many threads. Certain apps may have their parallelism and concurrency limited
as a result.
• Versioning and packaging: Python contains a lot of libraries and packages, which can
occasionally cause versioning problems and package conflicts.
• Absence of rigidity: Python's adaptability might have unintended consequences. It might result
in code that is challenging to comprehend and maintain, even while it can be excellent for quick
development and experimentation.
• High learning curve: Python is mostly seen as a language that is easy to learn, but beginners, particularly those with no technical background, may still face a steep learning curve.
• Memory usage: Python may use a lot of memory, particularly when executing sophisticated algorithms or working with big datasets. Python is a dynamically typed language, meaning that variable types can change while the program runs; this may cause issues and make it challenging to identify mistakes.
2.5 HARDWARE AND SOFTWARE REQUIREMENTS

SOFTWARE REQUIREMENTS:

• Python

• Anaconda

• Jupyter Notebook

HARDWARE REQUIREMENTS:

Processor : Intel Core i5

RAM : 8 GB

OS : Windows
CHAPTER 3
SOFTWARE REQUIREMENTS SPECIFICATION

A software requirements specification (SRS) describes a software system that is to be developed. It is modeled on the business requirements specification (CONOPS). Functional and non-functional needs are outlined in the software requirements specification, which may also contain a set of use cases that illustrate the kinds of user interactions the program must support for an optimal user experience.
The appropriate and essential requirements for the project development are listed in the software
requirements specification document. The developer must have a comprehensive grasp of the goods
that are being developed in order to extract the requirements.

Functional Requirements

Functional requirements are the specifications for the features, functions, and behaviors that the system must provide. The completed outline confirms that the system's particular criteria are satisfied.

Examples of functional requirements are user identification, data management, workflow, reporting and analytics, and system connection with third parties.

User Requirements

User requirements aim to meet the needs and demands of clients; they form a subset of the functional requirements. They result from knowing the customer's goals, responsibilities, and actions.

They focus on the important features that the customer expects from the system. Use cases and user stories, typically written in plain English, are important methods for documenting these features.

System Requirements

System requirements specify how the components of the system interact and how the system should operate under various circumstances. They are broader and more technical in extent than user requirements, and they give further direction on system performance, behavior, security, and other areas.
There are two common methods in System Requirements:
• System requirements specifications
• Functional specifications

Nonfunctional Requirements
The traits and restrictions that the system must adhere to are outlined in non-functional requirements.
Performance, dependability, security, usability, scalability, maintainability, and compatibility are
some of the factors they emphasize. The effectiveness and efficiency of the system are guaranteed by
non-functional criteria.

Performance goals, system availability, data security protocols, user interface friendliness, and system
scalability are a few examples of non-functional criteria.

Completeness: A checklist ensuring that all needs are taken into account and recorded. It serves as a guide to cover all the many facets of the system, making sure that no crucial requirement is missed.

Consistency: Promotes uniformity in the documentation and requirement collection processes. It helps to maintain a consistent strategy, making it easy for stakeholders to comprehend and review.

Clarity: Establishes a clear structure for arranging and presenting needs. It guarantees that
requirements are expressed clearly and succinctly and helps to prevent ambiguity.

Coverage: Guarantees that requirements, whether functional or non-functional, are adequately addressed. It assists in recognizing and capturing all of the essential characteristics, traits, and features that the system ought to have.
CHAPTER 4
SYSTEM DESIGN

4.1 SYSTEM PERSPECTIVE

This figure gives a clear and succinct overview of every entity currently incorporated into the system, along with the associated options and actions. It can be regarded as a single picture of the entire process and its implementation. The functional connections between the various entities are depicted in the following figure.

Fig: 4.1 – Architecture Diagram


Data Collection
The breast cancer data came from the UCI machine learning repository. It consists of 699 instances and 10 attributes, and it contains a few missing values, indicated by "?". A class attribute records whether each case is benign or malignant. The class distribution is as follows: 241 instances (34.5%) are malignant and 458 (65.5%) are benign. The class takes the value 2 or 4, where 2 represents the benign condition and 4 represents the malignant condition. The table below lists every attribute found in the data. Malignant instances are treated as the positive class (class 4), while benign cases are treated as the negative class (class 2).

Table 1: Attributes of the dataset

Attribute                     Domain
ID                            1–10
Clump thickness               1–10
Uniformity of cell size       1–10
Uniformity of cell shape      1–10
Marginal adhesion             1–10
Single epithelial cell size   1–10
Bare nuclei                   1–10
Bland chromatin               1–10
Normal nuclei                 1–10
Mitoses                       1–10

INSERTING THE DATASET AND LIBRARIES

I started by importing the required libraries and the breast cancer data. I then defined the names of all the attributes, because the raw data lacks column names. The data (not a ready-made dataset file) was taken from the UCI repository, so I converted it to a CSV file using the data.to_csv function. Now that I have the dataset, I can use it for further EDA and modeling.
DATA PRE-PROCESSING
Data pre-processing serves to identify and handle inconsistencies, outliers, and missing values. The sample code number was deleted from the dataset since it is unrelated to the illness. The dataset contains sixteen missing values, each represented by a "?".

Since "?" is not a numerical value, the program reads it as a string when the data is loaded directly, so we need to translate it to a numeric quantity; the missing value ("?") is replaced with -99999. In terms of labeling, benign cases fall under the negative class (class 2) and malignant cases under the positive class (class 4).

Pre-processing entails the following three crucial and typical steps:

• Formatting: the process of organizing information so that it may be used appropriately. The format of data files should be determined by their needs; the most frequently recommended file format is .csv.
• Cleaning: since it makes up the majority of the effort, data cleaning is a crucial step in the data science process. It entails handling missing data, complicated category names, and other issues. For the majority of data scientists, 80% of their labor consists of data cleaning.

Finding the missing variable

Processing missing data can be done in two main ways:

o By eliminating the particular row: the first way of handling null values. We simply remove the particular row or column that has missing data. However, this method is not very efficient, and deleting the data may result in information loss that produces inaccurate results.

o By computing the mean: to replace a missing value, we compute the average of the rows and columns that contain the missing values. This strategy is helpful for attributes that contain numerical data, such as age, income, or year, and it is the strategy applied here (see the sketch after this list).

• Sampling: a technique that examines portions of bigger datasets to obtain better findings and to help build a coherent understanding of the behavior and patterns in the data.
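A small sketch of the two missing-value options above, using a toy DataFrame with '?' markers similar to the bare_nuclei column (illustrative only):

import numpy as np
import pandas as pd

# Toy frame standing in for the UCI data; '?' marks the missing bare_nuclei values
df_example = pd.DataFrame({'clump_thickness': [5, 3, 8, 1],
                           'bare_nuclei': ['1', '?', '10', '2']})
df_example.replace('?', np.nan, inplace=True)
df_example['bare_nuclei'] = pd.to_numeric(df_example['bare_nuclei'])

# Option 1: drop the rows that contain missing values (may lose information)
dropped = df_example.dropna()

# Option 2: replace each missing value with its column mean (the strategy used here)
imputed = df_example.fillna(df_example.mean())
print(imputed)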
Transforming categorical data

Data with specific categories, such as whether an individual smokes or does not smoke, is referred to as categorical data. The machine learning model is entirely based on mathematics and statistics, so it may encounter difficulties during model construction if the dataset contains categorical variables. These categorical variables therefore need to be numerically encoded; we encode the categorical values in a one-hot fashion.
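A brief sketch of one-hot encoding with pandas, using the hypothetical smoker example from the paragraph above (the project's own attributes are already numeric, so this is purely illustrative):

import pandas as pd

# Hypothetical categorical column, as in the smoker / non-smoker example above
patients = pd.DataFrame({'age': [52, 40, 63], 'smoker': ['yes', 'no', 'yes']})

# One-hot encoding turns each category into its own 0/1 indicator column
encoded = pd.get_dummies(patients, columns=['smoker'])
print(encoded)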

Feature Extraction

Feature extraction is the process of assessing the patterns and behaviour of the data and building characteristics for further testing and training. In the end, I train the model using classifier approaches. I applied natural language toolkit packages to locate the required Python modules and used the gathered, labelled data to build the model.

I will examine the model using labelled datasets. The pre-processed data is divided by the machine learning process, and classifiers such as k-nearest neighbours are selected; classification jobs are frequently handled by these techniques.

Splitting the data

Data splitting takes place when data is divided into two or more subsets. Normally a two-way split is used, where the first part is used for model building and the second part is used for testing. Data splitting is a crucial part of data science, particularly for building models from the data. It makes possible the development of accurate data models and of processes that apply those models, such as machine learning.

Usually a two-way split is used, pairing model creation and training. Using the training partition to evaluate metrics or examine the efficiency of candidate models is the classic procedure. Once training is completed, the test dataset is used: to know whether the final model is working correctly, results on the training and testing data are compared. Data is sometimes divided into three or more sets.
Training and Validating

Train set: the data given to train the model, allowing it to find the hidden patterns and characteristics in the data, is referred to as the train set. The training dataset is fed to the model repeatedly, allowing it to continually learn the features; the training partition provides a wide variety of inputs so that the model is trained for all cases.

Test set: following training, the model is assessed on a new dataset called the test set. In terms of accuracy, it supplies an impartial final performance measurement.
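A sketch of such a two-way split with scikit-learn; the 80/20 ratio, the stratification, and the built-in dataset are assumptions for illustration, since the exact ratio used in the project is not stated above:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the rows as an unseen test set; stratify keeps the
# benign/malignant class proportions the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print("Training rows:", X_train.shape[0], " Test rows:", X_test.shape[0])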
Algorithms are used for model building

Classification and Regression Tree (CART)

This approach is applicable to regression as well as classification. The CART algorithm divides a node into sub-nodes using the Gini Index criterion. It begins with the training set as a root node. Once the root node has been successfully split in two, it uses the same logic to split the subsets, and then splits the sub-subsets again. This process is repeated until splitting any further would exceed the maximum number of leaves of the growing tree or the sub-nodes are pure. This procedure is referred to here as "tree pruning."

The term CART serves as a catch-all for the following categories of decision trees:

Classification trees: used for identifying which "class" the target variable is most likely to fall into when that variable is categorical.

Regression trees: These are employed to predict a continuous variable's value.

The decision tree's nodes divide into sub-nodes according to an attribute's threshold value. The Gini Index criterion is used by the CART algorithm to find the optimal homogeneity of the sub-nodes.

The training set constitutes the root node, which is split into two halves based on the threshold value of the best attribute. The subsets are then divided based on the same logic. This continues until the tree either reaches its last pure subset or reaches its maximum number of leaves; this is also known as tree trimming.
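A short sketch of the Gini Index computation that CART uses to score a candidate split, written from the standard definition (the helper names and toy labels are illustrative):

from collections import Counter

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def gini_of_split(left_labels, right_labels):
    # Weighted average of the impurity of the two child nodes
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + \
           (len(right_labels) / n) * gini(right_labels)

# Toy split using the class codes of this dataset: 2 = benign, 4 = malignant
print(gini_of_split([2, 2, 2, 4], [4, 4, 4, 2]))  # lower values mean purer sub-nodes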
Support Vector Machine (SVM)

Support Vector Machine (SVM) is a classifier that uses the closest data points to find a maximum marginal hyperplane (MMH) dividing the dataset into classes.

Classifying data is a typical machine learning task. The objective is to determine which of two classes the given data points belong to, and to decide in which group a new data point should be placed. In support vector machines, a data point is viewed as a p-dimensional vector, and our goal is to see whether we can separate such points using a (p-1)-dimensional hyperplane; this is termed a linear classifier. Different hyperplanes might be used to separate the data; the hyperplane with the greatest margin, or distance, between the two classes is an excellent choice for the best hyperplane.

k-Nearest Neighbors (K-NN)


A supervised classification technique is called k-Nearest Neighbors (K-NN). It utilizes a large number
of labeled points as input and learns how to categorize new input. In order to designate a new point,
its nearest neighbors—the labeled points closest to the new point—are considered, and their vote is
solicited.

Among machine learning's most fundamental but crucial classification algorithms is KNN. Pattern
recognition, data mining, and intrusion detection are three major applications for this supervised
learning domain member.

Each vector in the training examples has a class label and is located in a multidimensional feature
space. The feature vectors and class labels of the training samples are the only data that has to be
preserved throughout the algorithm's training phase.

In the classification phase, an unlabeled vector (a query or test point) is assigned the label that appears most often among the k training samples closest to the query point, where k is a user-defined constant.
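A minimal k-NN sketch in scikit-learn; the value k = 5, the feature scaling step, and the built-in dataset are illustrative assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Each query point is labelled by a majority vote of its k nearest training points;
# scaling matters because k-NN relies on distances between feature vectors
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print("k-NN test accuracy:", knn.score(X_test, y_test))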

Naïve Bayes
The classification techniques that use Bayes' theorem as a foundation are called naïve Bayes classifiers. This is not a single algorithm but a family of algorithms united by the assumption that every pair of features being classified is independent of the others. The naïve Bayes classifier can be applied to easily create machine learning models with strong forecasting abilities, and it is one of the most efficient classification approaches.

Classification issues are handled by the Naïve Bayes method. Text categorization makes extensive use
of it. Since each word in the text represents a single feature, data for text classification tasks has a high
dimension. It is applied in sentiment analysis, rating categorization, spam filtering, and other areas.
One benefit of naïve Bayes is its quickness. With a large data dimension, prediction is simple and
quick.

This model forecasts the probability that an instance belongs to a class given a collection of feature values; it is a probability-based classifier. To do this, it assumes that each feature in the model is independent of the others. Stated differently, each feature contributes to the prediction independently of the others. In the real world, this condition is rarely true. Bayes' theorem is used in both the training and the prediction procedure.
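A minimal naïve Bayes sketch in scikit-learn; the Gaussian variant, the built-in dataset, and the split are illustrative assumptions, since the text above does not specify which variant was used:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Applies Bayes' theorem with the "naive" assumption that features are
# conditionally independent given the class
nb = GaussianNB()
nb.fit(X_train, y_train)
print("Naive Bayes test accuracy:", nb.score(X_test, y_test))
print("Class probabilities for the first test row:", nb.predict_proba(X_test[:1]))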

Data visualization

The aim of data visualization is to schematically display sets of raw data, mostly quantitative, in a
visual format. Tables, charts, graphs (such as bar, area, cone, pyramid, donut, histogram, spectrogram,
cohort, waterfall, funnel, and bullet charts), diagrams, plots (such as scatter, distribution, and box-and-
whisker plots), geospatial maps (such as proportional symbol, choropleth, isopleth, and heat maps),
figures, correlation matrices, percentage gauges, and so on are instances of diagram formats used in
data visualization. These visual instruments may be combined at times to create a dashboard.

Evaluation model

Model evaluation is the process of assessing the model using metrics to examine its performance. Developing a model commonly involves several steps, and its capacity to predict unseen data needs to be examined frequently.

As a result, reviewing the model is essential for assessing how well it performs. Many measures are available, such as mean squared error, accuracy, precision, recall, F1 score, area under the curve, and the confusion matrix. Cross-validation is a model evaluation technique used during training.
4.2 FLOW CHART

A flowchart is a type of diagram that represents a workflow or process. It can also be defined as a diagrammatic representation of an algorithm, a step-by-step approach to completing a task. The flowchart shows the steps as boxes of various kinds and their order by connecting the boxes with arrows. This diagrammatic representation illustrates a solution model for a particular problem. Flowcharts are used in many different industries for process or program analysis, design, documentation, and management.

Flow Chart symbols


Flowline
Shows the order of operations in the procedure: a line pointing from one symbol to another. Arrowheads are added if the flow is not the standard top-to-bottom, left-to-right direction.

Terminal
Shows the beginning and end of a program or sub-process. Represented as a stadium shape (an oval or rounded, filleted rectangle). It usually contains the word "Start" or "End", or another phrase signalling the start or end of a process, such as "submit inquiry" or "receive product".

Process
Represents a set of operations that changes the value, form, or location of data. Represented as a rectangle.

Decision
Shows a conditional operation that determines which of two paths the program will take; the operation is commonly a yes/no question or a true/false test. Represented as a diamond (rhombus).

Input/Output
Shows the process of inputting and outputting data, as when entering data or displaying results [17]. Represented as a rhomboid (parallelogram).

Flow Chart

Fig 4.2: Flow chart


CHAPTER 5

DETAILED DESIGN

5.1 CLASS DIAGRAM

Essentially, this may also be referred to as a "context diagram," or contextual diagram. It merely denotes the highest level of the procedure, Level 0: the relationship to external entities is abstracted, and the system is shown as a single process.

In the class diagram notation, visibility markers indicate who can access an attribute or operation:

• A + indicates a public attribute or operation, available to everyone.
• A - indicates a private attribute or operation, available only within the class itself.
• A # indicates a protected attribute or operation.

Fig 5.1: Class Diagram


5.2 DATA FLOW DIAGRAM

The data flow diagram (DFD) describes how data moves through a system or process (typically an information system). The DFD provides information on the inputs and outputs of each entity and on the process itself. A data flow diagram has no control flow, decision rules, or loops; a flowchart can be employed to represent specific operations that depend on the data.

Various notations are available for presenting data-flow diagrams. Tom DeMarco defined the notation used here as a component of structured analysis in 1979.

Every data flow must have at least one process as an endpoint (source or destination). A process can be refined by another data-flow diagram that splits it into sub-processes, allowing it to be described more precisely.

The data-flow diagram is one tool used in structured analysis and data modeling. When using UML, the activity diagram usually takes over the role of the data-flow diagram. Site-oriented data-flow plans are a specialized form of data-flow diagram.

Data Flow Diagram Symbols

Process

A process, function, or transformation of the system converts inputs into outputs. Depending on the notation used, the process symbol can be a circle, an oval, a rectangle, or a rectangle with rounded corners. The name of the process might be one word, a brief statement, or a phrase that captures its essence.

Data Flow

Data flow (flow, dataflow) shows the movement of information, and occasionally of material, from one part of the system to another. The flow is represented by an arrow and should be labelled with a name describing the information (or material) being transferred; flows linked to entities that already make clear what passes through them are an exception. Material flows are modeled only in systems that are more than purely informational. Only one kind of information (or material) should be transferred per flow. The arrow indicates the direction of the flow; it may also be bi-directional when the data going to and from the entity belong together logically, for example a query and its response. Flows connect processes, warehouses, and terminators.

Data Store

The warehouse (datastore, data store, file, database) keeps data for subsequent use. The store symbol is two horizontal lines; other notations display it differently. The warehouse's name, derived from its input and output streams, is a plural noun (e.g., orders). The warehouse need not be just a data file; it can also be a filing cabinet, a set of optical disks, or a folder of papers. Consequently, how the warehouse is physically implemented has no impact on how it is shown in a DFD.

Terminator

The terminator lies outside the system but interacts with it. It might be other departments (such as the human resources department) within the same firm, groups of individuals (such as customers), authorities (such as the tax office), or even other companies (such as a bank) that are not part of the modeled system. The terminator may also be another system with which the modeled system communicates.

DATA FLOW DIAGRAM

Multi-level DFDs can be created to make the DFD more transparent (i.e., to avoid too many processes in one diagram). Higher-level DFDs are less detailed. According to the DFD creation rules, the context DFD comes first, followed by the so-called zero level (DFD 0), with numbered processes (e.g., process 1, process 2). The numbering continues at the so-called first level (DFD 1): for instance, the sub-processes of process 1 are numbered 1.1, 1.2, and 1.3. Comparable numbering applies to the processes of the second level (DFD 2). The number of levels depends on the size of the modeled system.

DFD 0 processes need not all have the same number of decomposition levels. The most crucial (aggregated) system functions are found in DFD 0. The lowest level should contain processes whose mini-specification fits on about one A4 page; if a mini-specification is longer, it is appropriate to add a level in which the process is broken down into several processes. An easily understood overview of the whole DFD hierarchy can be produced by drawing a vertical (cross-sectional) figure. A warehouse is shown at every level from the topmost level at which it is first used downwards.

Fig: 5.2 – Data Flow Diagram Level 0


Fig:5.2-Data Flow Diagram Level 1

5.3 Activity Diagram

Activity diagrams, which allow for choice, iteration, and concurrency, are visual depictions of processes composed of sequential activities and actions. The Unified Modeling Language uses activity diagrams to present organizational and computational processes (workflows), as well as the data flows that intersect the associated activities.

Object nodes hold the data that is input to and output from executable nodes and that travels across object flow edges. Control nodes dictate, through control flow edges, the order in which executable nodes are executed. Put differently, although activity diagrams mainly depict the overall control flow, they can also incorporate elements that illustrate the data flow between activities via one or more data stores.
5.4 Sequence Diagram

Developers use sequence diagrams to show the interactions between the parts involved in a specific use case. They represent the way different parts of a system interact with one another to perform a function, as well as the order in which those interactions occur when the application runs.

Sequence diagrams are often utilized in software development to clarify system behavior or to help developers design and understand complex systems. They can capture both simple and complex object interactions, making them a crucial resource for software architects, designers, and developers.

5.5 Use Case Diagram

A use case diagram depicts the potential user interactions with a system. It describes the various use cases and types of user the system has, and it is frequently accompanied by other types of diagrams. The use cases are denoted by circles or ellipses, and the actors are usually drawn as stick figures.
CHAPTER 6

IMPLEMENTATION

6.1 CODE SNIPPET

# Assumed imports for the snippets below; 'url' and 'names' are taken to hold the
# UCI data location and the column names defined earlier in the notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

df = pd.read_csv(url, names=names)

df.head()

#Shape of the Dataset

df.shape

df.drop(['id'],axis=1,inplace = True)

# Columns in the dataset

df.columns

df.info()

df.isna().sum()

msno.bar(df,color="red")

#Diagnosis class Malignant = 4 and Benign = 2

#The number of Benign and Maglinant cases from the dataset

df['class'].value_counts()

plt.hist(df['class'])

plt.title('diagnosis in class (benign=2, malignant=4)')

plt.xlabel('class')

plt.ylabel('count')

plt.show()

df['bare_nuclei'].value_counts()

df[df['bare_nuclei'] == '?']

df[df['bare_nuclei'] == '?'].sum()
df.replace('?', np.nan, inplace=True)

# bare_nuclei was read as text because of the '?' entries; convert it to numeric and
# fill the missing values with the column mean, as described in the pre-processing section
df['bare_nuclei'] = pd.to_numeric(df['bare_nuclei'])
df['bare_nuclei'] = df['bare_nuclei'].fillna(df['bare_nuclei'].mean())

sns.displot(df['class'],kde=True)

ax = df[df['class'] == 4][0:50].plot(kind='scatter', x='clump_thickness', y='uniform_cell_size',


color='DarkBlue', label='malignant');

df[df['class'] == 2][0:50].plot(kind='scatter', x='clump_thickness', y='uniform_cell_size',


color='Yellow', label='benign', ax=ax);

plt.show()

sns.set_style('darkgrid')

df.hist(figsize=(30,30))

plt.show()

plt.figure(figsize=(10,10))

sns.boxplot(data=df,orient='h')

df.corr()

plt.figure(figsize=(30,20))

cor = df.corr()

sns.heatmap(cor,vmax=1,square = True,annot=True, cmap=plt.cm.Blues)

plt.title('Correlation between different attributes')

plt.show()

sns.pairplot(df,diag_kind='kde')

cor_target = abs(cor["class"])

#Selecting highly correlated features

relevant_features = cor_target[cor_target>0]

relevant_features

# Assumed setup (defined in cells not shown here): split the data and list the candidate models
X = df.drop(['class'], axis=1)
Y = df['class']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
models = [('SVM', SVC()), ('CART', DecisionTreeClassifier()),
          ('KNN', KNeighborsClassifier()), ('NB', GaussianNB())]
scoring, results, names = 'accuracy', [], []

for name, model in models:
    kfold = KFold(n_splits=10)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "For %s Model: Mean accuracy is %f (Std accuracy is %f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

fig = plt.figure(figsize=(10,10))

fig.suptitle('Performance Comparison')

ax = fig.add_subplot(111)

plt.boxplot(results)

ax.set_xticklabels(names)

plt.show()

model.fit(X_train, Y_train)

predictions = model.predict(X_test)

print("\nModel:",name)

print("Accuracy score:",accuracy_score(Y_test, predictions))

print("Classification report:\n",classification_report(Y_test, predictions))

clf = SVC()

clf.fit(X_train, Y_train)

accuracy = clf.score(X_test, Y_test)

print("Test Accuracy:",accuracy)

predict = clf.predict(X_test)

predict

from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.1, 0.01, 0.001],
    'kernel': ['rbf', 'linear', 'poly']
}

grid_search = GridSearchCV(estimator=SVC(), param_grid=param_grid, cv=5)

# Fit the grid search to the data

grid_search.fit(X_train, Y_train)

print("Best Parameters:", grid_search.best_params_)

print("Best Score:", grid_search.best_score_)

best_model = grid_search.best_estimator_

test_accuracy = best_model.score(X_test, Y_test)

print("Test Accuracy with best model:", test_accuracy)

inputs = [[4,2,1,1,1,2,3,2,1]]

prediction = clf.predict(inputs)

print(prediction)

import itertools

sns.set_theme(style="dark")

def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix',
                          cmap=plt.cm.Blues):
    # Optionally normalize each row of the confusion matrix to proportions
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

cnf_matrix = confusion_matrix(Y_test, predict, labels=[2,4])

np.set_printoptions(precision=2)

# Plot the (non-normalized) confusion matrix; labels=[2,4] puts benign first, malignant second
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Benign (2)', 'Malignant (4)'])
plt.show()

print(classification_report(Y_test, predict))

RESULTS AND DISCUSSION


After the learning algorithms were set up and trained on the breast cancer dataset, the confusion matrix, accuracy, precision, sensitivity, F1 score, and AUC were employed as performance measures to compare the models and determine the best algorithm for the most accurate cancer prediction.

A confusion matrix is a method of determining the effectiveness of a classification model when the output can belong to two or more classes. It is a table with the dimensions "Actual" and "Predicted", whose cells hold the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). For classification algorithms, accuracy is the most frequently used performance parameter.

Accuracy may be defined as the ratio of correct predictions to all predictions made. Precision can be determined from the number of relevant items correctly returned by the machine learning model, as in document retrieval. Sensitivity is the proportion of actual positive cases that the machine learning model identifies. The F1 score is the harmonic mean of precision and sensitivity; mathematically, it is a weighted average of precision and sensitivity.
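As a worked example of these definitions, the figures below are derived from the SVM row of the confusion matrix in Table 2, treating malignant as the positive class (small differences from the reported values are expected, since the report's metrics may be class-weighted):

# SVM confusion matrix from Table 2 (actual rows: malignant, benign)
TP, FN = 201, 11   # malignant cases predicted as malignant / as benign
FP, TN = 1, 356    # benign cases predicted as malignant / as benign

accuracy    = (TP + TN) / (TP + TN + FP + FN)   # 557 / 569 ≈ 0.979
precision   = TP / (TP + FP)                    # 201 / 202 ≈ 0.995
sensitivity = TP / (TP + FN)                    # 201 / 212 ≈ 0.948
f1 = 2 * precision * sensitivity / (precision + sensitivity)  # ≈ 0.971

print(accuracy, precision, sensitivity, f1)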

Algorithm   Accuracy, Training Set (%)   Accuracy, Testing Set (%)

SVM         98.4                         97.2
CART        99.8                         96.5
KNN         95.5                         95.8
NB          98.8                         95.1

Each row of a confusion matrix shows the counts for an actual class, while each column gives the predictions, which makes confusion matrices a useful tool for evaluating a classifier. Table 2 lists the confusion matrices of the classification models; from them, performance metrics covering sensitivity, precision, and F1 score can be computed for both the benign and the malignant class.

Table 2: Confusion Matrix (rows: actual class, columns: predicted class)

            Predicted Malignant   Predicted Benign
SVM   Malignant      201                 11
      Benign           1                356
CART  Malignant      196                 16
      Benign           7                350
KNN   Malignant      201                 11
      Benign           5                324
NB    Malignant      195                 17
      Benign          22                335
Table 2 demonstrates that the Support Vector Machine correctly predicts 557 of the 569 cases and incorrectly predicts 12 cases: 11 cases of the malignant class predicted as benign and 1 case of the benign class predicted as malignant. The 557 correct cases comprise 201 malignant cases that are actually malignant and 356 benign cases that are actually benign. Because of this, the Support Vector Machine's accuracy surpasses that of the other classification methods.

The table suggests that the SVM beats the other classifiers in terms of accuracy (0.98), sensitivity (0.94), and F-measure (0.96). On the breast cancer dataset, SVM consistently beats the alternative classifiers for both benign and malignant cases.

The figure displays the ROC curves for the machine learning algorithms. The ROC curve is a crucial indicator of how effective a classifier is: the area under the ROC curve (AUC) is assessed, and the larger the area, the better the classifier performs. The support vector machine offers the highest AUC score (0.966), while the other classifiers score between 0.945 and 0.960.

Area under the ROC curve (AUC)

Algorithm                              AUC
SVM                                    0.966
Classification and Regression Tree     0.960
KNN                                    0.947
Naïve Bayes                            0.945
6.2 SCREEN SHOTS
CHAPTER 7

SOFTWARE TESTING

Software testing is the process of determining whether software meets its requirements. Software testing can give a user or sponsor unbiased, impartial information about the software's quality and failure risk.

Software testing may assess if a piece of software is proper in certain situations, but not in others.
Not all bugs can be found using it.

Software testing uses concepts and procedures that may identify an issue, based on the standards for
gauging accuracy from an oracle. Contracts, specifications, comparable items, previous iterations of
the same product, inferences regarding intended or expected purpose, user or customer expectations,
pertinent standards, and applicable laws are a few examples of oracles.

Software testing is frequently dynamic, involving executing the program to ensure that the actual output matches the expected output. It might also take the form of static code reviews and documentation reviews.

Program testing is frequently used to answer the following question: does the program perform as needed and as intended?

It is possible to enhance the software development process by utilizing the knowledge gained from
software testing.

Software testing should be done in a "pyramid" fashion, with unit tests making up the majority of the
tests, integration tests coming in second, and end-to-end (e2e) tests making up the least amount.

TYPES OF TESTING

Unit Testing

Unit testing involves examining isolated pieces of source code to confirm their intended behaviour. It is also called component or module testing.

Test Cases

1. Test case for input data: examine the model's ability to process different types of input data, such as photos, text, and categories.
2. Test case for missing data: intentionally remove data from the input to verify how well the model handles missing data.
3. Test case for model accuracy: examine the accuracy of the model built with the data and check the result (a sketch follows this list).
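A minimal sketch of how such unit test cases might look with pytest; the fixture, the stand-in model, and the 90% accuracy threshold are assumptions, not the project's actual test suite, and only the input-data and accuracy cases are shown:

import pytest
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

@pytest.fixture
def fitted_model():
    # Illustrative stand-in for the project's trained classifier
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)
    return model, X_test, y_test

def test_input_data(fitted_model):
    # The model should return one label per input row
    model, X_test, _ = fitted_model
    assert len(model.predict(X_test[:5])) == 5

def test_model_accuracy(fitted_model):
    # Held-out accuracy should clear an assumed 90% threshold
    model, X_test, y_test = fitted_model
    assert model.score(X_test, y_test) >= 0.90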
Integration Testing
Integration testing involves examining various components of a system together; it is also referred to as integration and testing.

Test Cases

1. Test case for data integration: confirm that the model properly processes and combines several forms of data.
2. Test case for model integration: test every model or algorithm used by the system to predict breast cancer.
3. Test case for system compatibility: assess how the breast cancer prediction system works with other systems.

Functional Testing

Functional testing is the name given to work that verifies a specific code activity or function. Although some development methods work from use cases or user stories, these checks are usually found in the code requirements documentation. Functional testing commonly answers the questions "can the customer do this?" and "does this particular feature work?".

Test Cases

1. Test case for input validation: assess how well the model handles data inputs that do not fit the expected range or format.
2. Test case for output validation: assess the model's ability to provide output data, including binary classification results.
3. Test case for model accuracy: assess the model's accuracy by comparing the desired outcomes with the actual results from a test dataset.

TESTING TECHNIQUES

Black box Testing


Black box testing is a type of software testing in which the tester has no knowledge of the software's internal structure or implementation details and instead focuses on verifying the functionality against the provided requirements.

White box testing

White box testing is the process of evaluating the program's internal workings rather than the functionality visible to the customer. It calls for designing test cases using knowledge of the programming constructs and the internal structure of the system.
CHAPTER 8
CONCLUSION

We have used four algorithm’s such as SVM, Classification and Regression Tree, Navi Bayes, K-
NN—on the breast cancer dataset. We calculated, evaluated and compared on various forms of results
based upon the confusion matrix, accuracy, sensitivity, precision, and AUC to estimate which machine
learning algorithm is the most accurate, dependable, and precise.

For programming each algorithm python is used, with the package called sklearn in the Anaconda
environment. we found that svm overcomes all the other techniques and achieves a efficiency of
97.2%,precision of 97.5%, and AUC of 96.6% after a precise model comparison between them.

In conclusion, SVM provides best accuracy and precision and have proved to be effective in the
detection and prediction of breast cancer. As a result we can apply for future work with the same
algorithms and techniques on the another databases to confirm the results obtained

Additionally, in our future work, we can plan to use same or other machine learning algorithms using
new parameters on a huge datasets with large disease classes to obtain good accuracy, it should be
noticed that all the results obtained are related only to the dataset of the breast cancer, it may be
considered as drawback of our work.
CHAPTER 9
FUTURE ENHANCEMENT

Personalized risk model development appears to be greatly enhanced by integrating AI techniques, particularly CNNs and DL, with digital mammography-based breast cancer risk assessment. These AI-driven methods have the potential to greatly improve the efficacy and accuracy of breast cancer risk prediction by using medical imaging and patient-specific data.

The literature now in publication shows promising outcomes; still, more investigation and validation are required to determine the practical applicability and dependability of these models. As studies and implementations continue to progress, the integration of DL approaches in breast cancer risk modelling could transform screening methods and enable customized risk management for women globally.

Considering the significant effects of breast cancer on women's health, utilizing AI in risk assessment
is essential for early identification and better patient care.
APPENDIX B
USER MANUAL
User Manual for Breast Cancer Prediction Web App

Introduction
The Breast Cancer Prediction Web App user manual offers comprehensive guidance on how to use the app. The purpose of the app is to help users estimate their risk of developing breast cancer from a set of medical data.

System Requirements
• Web browser
• Internet connection

To access the app:
1. Open your web browser.
2. Enter the URL provided for the breast cancer prediction app.
3. Press Enter to load the app.
4. Upon loading, you will be directed to the prediction page.

An overview of the app

Navigate to the Prediction Page:


Enter the data
Input the required details
It includes:
• Clump thickness
• Uniformity of cell size
• Uniformity of cell shape
• Marginal adhesion
• Single epithelial cell size
• Bare nuclei
• Bland chromatin
• Normal nuclei
• Mitoses
Submit the data for prediction
Once the details are filled in, click the Submit button to predict.

View results:
The prediction findings will be shown along with the likelihood of breast cancer.
Suggestions or further details may also be offered.

Understanding the Results


The outcomes will show the chance of developing breast cancer.
The results may be explained and potential courses of action suggested in an interpretation section.
