Data Quality
1. What is data quality?
Data quality measures the condition of your data
using factors such as accuracy, consistency, integrity,
and usability.
2. How do we measure data quality?
• Consistency: When one piece of data is stored in
multiple locations, do they have the same values?
• Accuracy: Does the data accurately describe the
properties of the object it is meant to model?
• Relevance: Is the data appropriate to support the
objective?
• Existence: Does the organization have the right data?
• Integrity: How accurate are the relationships between
data elements and data sets?
• Validity: Are the values acceptable?
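A minimal sketch, assuming a small pandas DataFrame of customer records (the column names and the age rule here are hypothetical, not from any particular standard), of how a few of these dimensions can be checked programmatically:

    import pandas as pd

    # Hypothetical customer records; names and rules are illustrative only.
    df = pd.DataFrame({
        "customer_id": [1, 2, 2, 4],
        "age": [25, -3, 40, 130],
        "country": ["US", "US", "UK", None],
    })

    # Validity: are the values acceptable? (assume ages must be 0-120)
    invalid_ages = df[(df["age"] < 0) | (df["age"] > 120)]

    # Existence: does the organization have the right data at all?
    missing_countries = df["country"].isna().sum()

    # Integrity: customer_id should uniquely identify a record.
    duplicate_ids = df["customer_id"].duplicated().sum()

    print(len(invalid_ages), missing_countries, duplicate_ids)  # 2 1 1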
Noisy data
• Noisy data is meaningless data: data that contains
errors, outliers, or missing values that make it difficult
to find patterns or trends in the data.
• It includes any data that cannot be understood and
interpreted correctly by machines, such as
unstructured text.
• Noisy data unnecessarily increases the amount of
storage space required and can distort the results of
any data mining analysis.
• Noisy data can be caused by human error,
measurement error, or data processing errors.
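A minimal sketch, assuming hypothetical numeric sensor readings, of one simple way to reduce the effect of noise: smoothing with a rolling median in pandas.

    import pandas as pd

    # Hypothetical sensor readings; the spike at index 3 is measurement noise.
    readings = pd.Series([20.1, 20.3, 20.2, 55.0, 20.4, 20.2, 20.3])

    # A centered rolling median suppresses isolated noisy spikes.
    smoothed = readings.rolling(window=3, center=True).median()
    print(smoothed)  # the 55.0 spike no longer appears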
Outliers
What are outliers?
• Outliers are a very important aspect of data
analysis.
• An outlier is an observation that appears to deviate
markedly from other members of the sample
in which it occurs.
• Outlier detection is extensively used in many
application domains, such as:
-Fraud detection for credit cards, insurance, and
healthcare
-Telecom fraud detection
-Intrusion detection in cyber-security
-Medical analysis
-Fault detection in safety-critical systems
• Outliers can be classified into three categories:
• 1. Global outliers (or point outliers): A global outlier
is a data point whose value is significantly
higher or lower than the rest of the data in the set.
• For example, intrusion detection in computer
networks often looks for global outliers.
• 2. Contextual outliers: Contextual outliers are data
points that are significantly different from other
data points within a specific context. They are also
known as conditional outliers.
• The attributes of the data objects are divided into
two groups:
⦁ Contextual attributes: define the context, e.g.,
time and location
⦁ Behavioral attributes: characteristics of the object
used in outlier evaluation, e.g., temperature
• 3. Collective outliers: If a collection of data points is
anomalous with respect to the entire data set, it is
termed a collective outlier.
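A minimal sketch, assuming a small numeric sample, of detecting a global (point) outlier with the common z-score rule; the threshold is a conventional choice, not part of the definitions above:

    import numpy as np

    data = np.array([10, 12, 11, 13, 12, 95, 11, 12])

    # Standardize each value; points far from the mean get large |z|.
    z = (data - data.mean()) / data.std()

    # |z| > 3 is the textbook rule of thumb; with a sample this small,
    # a lower threshold such as 2 is often used instead.
    print(data[np.abs(z) > 2])  # -> [95]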
Missing values
• Missing data are values that are not available but
would be meaningful if observed.
• Missing data can be anything from a missing
sequence, an incomplete feature, missing files,
incomplete information, or data entry errors.
Why Do They Happen?
• Data might not be collected or recorded for certain variables.
• People might skip questions in surveys or forms.
Common Ways to Handle Missing Values:
• Remove Missing Data:
– Delete rows with missing values.
• Fill Missing Data (Imputation):
– Replace missing values with the mean, median, or mode (for
numbers).
– For categories, replace with the most common value.
• Use a Placeholder:
– Fill missing values with something like 0, "Unknown", or "NA".
• Leave It:
– Some algorithms can handle missing values directly without filling
them in.
• Example:
Name    Age    Salary
Alice   25     50000
Bob            60000
Carol   30
Dave    28     45000
Bob’s age is missing.
Carol’s salary is missing.
You can:
Remove Bob and Carol’s rows.
Fill Bob's age with the average of the observed ages ((25 + 30 + 28) / 3 ≈ 27.7).
Fill Carol’s salary with the average of the observed salaries ((50000 + 60000 + 45000) / 3 ≈ 51,667).
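A minimal sketch of both options on the example table above, using pandas (the data is exactly the table shown):

    import pandas as pd

    # The example table; None marks the missing values.
    df = pd.DataFrame({
        "Name":   ["Alice", "Bob", "Carol", "Dave"],
        "Age":    [25, None, 30, 28],
        "Salary": [50000, 60000, None, 45000],
    })

    # Option 1: remove rows with missing values (drops Bob and Carol).
    dropped = df.dropna()

    # Option 2: fill (impute) with each column's mean.
    filled = df.copy()
    filled["Age"] = filled["Age"].fillna(filled["Age"].mean())          # ~27.7
    filled["Salary"] = filled["Salary"].fillna(filled["Salary"].mean()) # ~51667
    print(filled)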
Duplicate Data
Duplicate data occurs when the same information appears
more than once in a dataset.
Why is it a Problem?
• It can lead to incorrect analysis and wrong
conclusions.
• It slows down processing and analysis.
Causes of Duplicate Data:
• Mistakes during data entry.
• Merging data from different sources without
checking.
• System errors or bugs.
How to Handle It:
• Identify duplicates using tools like Excel, Python, or
SQL.
• Remove duplicates by filtering or deleting repeated
entries.
• Set up rules to prevent duplicate entries during
data input.
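A minimal sketch, on a hypothetical dataset, of identifying and removing duplicates with pandas:

    import pandas as pd

    # Hypothetical dataset with a repeated record for Bob.
    df = pd.DataFrame({
        "Name": ["Alice", "Bob", "Bob", "Carol"],
        "City": ["NY", "LA", "LA", "SF"],
    })

    # Identify duplicates: True marks a row seen earlier in the data.
    print(df.duplicated())

    # Remove duplicates, keeping the first occurrence of each row.
    deduped = df.drop_duplicates()
    print(deduped)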