DSA2

Data cleaning is the process of correcting or removing inaccurate, corrupted, or incomplete data to ensure reliable outcomes. It involves steps like importing data, creating backups, and using Excel functions to clean and standardize data formats. Data preprocessing enhances data quality by eliminating errors, handling missing values, and removing duplicates, but it may not be suitable for large datasets and often requires manual execution.

Uploaded by

davenguting20
Copyright
© All Rights Reserved

What is data cleaning?

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset. When combining multiple data sources, there are
many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes and
algorithms are unreliable, even though they may look correct. There is no one absolute way to
prescribe the exact steps in the data cleaning process because the processes will vary from
dataset to dataset. But it is crucial to establish a template for your data cleaning process so you
know you are doing it the right way every time.

Advantages and benefits of data cleaning

Having clean data will ultimately increase overall productivity and allow for the highest quality
information in your decision-making. Benefits include:

• Removal of errors when multiple sources of data are at play.
• Fewer errors make for happier clients and less-frustrated employees.
• Ability to map the different functions and what your data is intended to do.
• Monitoring errors and better reporting to see where errors are coming from, making it
easier to fix incorrect or corrupt data for future applications.
• Using tools for data cleaning will make for more efficient business practices and quicker
decision-making.

The basic steps for cleaning data are as follows:

1. Import the data from an external data source.
2. Create a backup copy of the original data in a separate workbook.
3. Ensure that the data is in a tabular format of rows and columns with: similar data in
each column, all columns and rows visible, and no blank rows within the range. For
best results, use an Excel table.
4. Do tasks that don't require column manipulation first, such as spell-checking or
using the Find and Replace dialog box.
5. Next, do tasks that do require column manipulation. The general steps for
manipulating a column are:
   a. Insert a new column (B) next to the original column (A) that needs cleaning.
   b. Add a formula that will transform the data at the top of the new column (B).
   c. Fill down the formula in the new column (B). In an Excel table, a calculated
   column is automatically created with values filled down.
   d. Select the new column (B), copy it, and then paste as values into the new
   column (B).
   e. Remove the original column (A), which converts the new column from B to A.

To periodically clean the same data source, consider recording a macro or writing code to
automate the entire process. There are also a number of external add-ins written by third-party
vendors that you can consider using if you don't have the time or resources to automate the
process on your own.
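
The column steps above can also be scripted. The following is a minimal Python sketch, not Excel's own macro mechanism. It assumes the data is already loaded as a list of dictionaries (one per row); the function name clean_column, the column name "name", and the sample values are hypothetical and only serve the illustration.

```python
import copy

def clean_column(rows, column, transform):
    """Return (backup, cleaned), where `cleaned` has `column` transformed.

    Mirrors the manual workflow: keep an untouched backup, compute the
    new values, then let them take the place of the original column.
    """
    backup = copy.deepcopy(rows)  # step 2: backup copy of the original data
    cleaned = []
    for row in rows:
        new_row = dict(row)  # work on a copy so the input rows stay unchanged
        new_row[column] = transform(row[column])  # steps a-e: new values replace the old ones
        cleaned.append(new_row)
    return backup, cleaned

# Made-up sample data with stray spaces around the names.
rows = [{"name": "  alice "}, {"name": "BOB  "}]
backup, cleaned = clean_column(rows, "name", str.strip)
print(cleaned)  # [{'name': 'alice'}, {'name': 'BOB'}]
```

Returning the backup alongside the cleaned rows keeps the original values available for comparison, which is the role the separate workbook plays in step 2.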

Excel data cleaning is an important skill for business and data analysts. In the current era of
data analytics, data is expected to meet high standards of accuracy and quality. A major part of
Excel data cleaning involves eliminating blank spaces and incorrect or outdated information.

Data Preprocessing
Data preprocessing is a stage of data analysis in which raw data is cleaned and transformed
into a form that software can work with. Before analyzing the data, we need to make sure
that it is clean and usable. Data preprocessing helps to improve the quality, consistency,
and compatibility of the data.

Data preprocessing helps in many ways:

• It helps in eliminating errors.
• It helps in handling missing values.
• It helps in removing duplicates.
• It helps in standardizing formats.

Steps in Data Preprocessing


1. Collection of the Data
In this step, we need to collect the raw data. We can collect this data from various sources such
as spreadsheets, online repositories, etc.
2. Cleaning of the Data
In this step, we clean the data before using it by identifying and addressing data quality
issues. Excel provides features such as Find and Replace, Text to Columns, and
conditional formatting to help with this.
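
For readers who script their cleaning, the two text features named above have rough counterparts in plain Python. This is only a sketch under assumptions: the helper names find_and_replace and text_to_columns are hypothetical, the sample names are made up, and the behavior is a simplified analogue rather than a reproduction of Excel's dialogs.

```python
def find_and_replace(values, old, new):
    """Replace every occurrence of `old` with `new` in each text value."""
    return [value.replace(old, new) for value in values]

def text_to_columns(values, delimiter=","):
    """Split each text value on `delimiter` and trim spaces around each part."""
    return [[part.strip() for part in value.split(delimiter)] for value in values]

# Made-up sample: names separated by a semicolon instead of a comma.
raw = ["Smith; John", "Doe; Jane"]
fixed = find_and_replace(raw, ";", ",")
columns = text_to_columns(fixed)
print(columns)  # [['Smith', 'John'], ['Doe', 'Jane']]
```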
3. Handling Missing Values
In this step, we handle the missing values. A missing value can cause major problems when
the data is transformed or analyzed. We can identify and handle missing values using
functions such as:
• IF, to return a substitute when a test detects a missing value
• ISNA, to test for the #N/A error, or ISBLANK, to test for an empty cell

We can select the rows that contain missing values and either remove them or replace the
missing values with appropriate substitutes.
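
The same idea, flagging incomplete rows and filling in substitutes, can be sketched in Python. The helpers is_blank, rows_with_missing, and fill_missing are hypothetical names, and the sample rows and default values are made up. Note that is_blank is looser than Excel's ISBLANK, because it also treats whitespace-only text as missing.

```python
def is_blank(value):
    """Treat None and empty or whitespace-only text as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")

def rows_with_missing(rows):
    """Return the rows that contain at least one missing value."""
    return [row for row in rows if any(is_blank(v) for v in row.values())]

def fill_missing(rows, defaults):
    """Return new rows where missing values are replaced by per-column defaults."""
    filled = []
    for row in rows:
        filled.append(
            {key: (defaults.get(key, value) if is_blank(value) else value)
             for key, value in row.items()}
        )
    return filled

# Made-up sample data: one empty city and one missing sales figure.
rows = [
    {"city": "Manila", "sales": 120},
    {"city": "", "sales": 95},
    {"city": "Cebu", "sales": None},
]
incomplete = rows_with_missing(rows)
filled = fill_missing(rows, {"city": "Unknown", "sales": 0})
print(len(incomplete))  # 2
print(filled[1]["city"], filled[2]["sales"])  # Unknown 0
```

Whether a default such as 0 is an appropriate substitute depends on the analysis; in some cases dropping the incomplete rows is the safer choice.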
4. Removing Duplicates
In this step, we remove duplicates from the data, because duplicates can skew analysis
results. Excel offers a simple way to do this. First, select the data range and go to the
Data tab. Then click the Remove Duplicates button and choose the columns to check for
duplicates. Excel keeps the first occurrence of each row and deletes later rows whose
values match in the chosen columns.
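
The keep-the-first-occurrence behavior can be sketched in Python as follows. The function name remove_duplicates and the sample rows are hypothetical; the sketch assumes rows are dictionaries and that the values in the checked columns are hashable.

```python
def remove_duplicates(rows, columns):
    """Keep the first row for each distinct combination of values in `columns`."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[column] for column in columns)  # values that define a duplicate
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Made-up sample data: rows 1 and 3 share the same email.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "a@example.com"},
]
deduped = remove_duplicates(rows, ["email"])
print([row["id"] for row in deduped])  # [1, 2]
```

Passing every column name makes only fully identical rows count as duplicates, while passing a subset, as here, treats rows as duplicates when they agree on just those columns.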
5. Standardizing Formats
In this step, we standardize the formats. Inconsistent data formats make analysis harder,
for example when the same name appears in different letter cases. To standardize formats,
we can use Excel features such as cell formatting, text functions (e.g., PROPER, UPPER,
LOWER), and data validation rules.
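
As a rough illustration of the text functions mentioned above, Python's built-in string methods title(), upper(), and lower() play roles similar to PROPER, UPPER, and LOWER, although their handling of edge cases is not guaranteed to match Excel's. The function name standardize_record, the field names, and the sample record are hypothetical.

```python
def standardize_record(record):
    """Return a copy of `record` with consistent spacing and letter case."""
    return {
        # Collapse repeated spaces, then capitalize each word (similar to PROPER).
        "name": " ".join(record["name"].split()).title(),
        # Trim and upper-case short codes (similar to UPPER).
        "country_code": record["country_code"].strip().upper(),
        # Trim and lower-case email addresses (similar to LOWER).
        "email": record["email"].strip().lower(),
    }

# Made-up sample record with inconsistent case and spacing.
record = {
    "name": "  jOHN   smith ",
    "country_code": "ph ",
    "email": " John.Smith@Example.COM",
}
clean = standardize_record(record)
print(clean["name"])          # John Smith
print(clean["country_code"])  # PH
print(clean["email"])         # john.smith@example.com
```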

6. Filtering and Sorting
In this step, we filter and sort the data. Excel's filtering and sorting capabilities help
explore and organize large datasets. The Filter feature displays only the subset of the
data that meets the criteria you set. The Sort feature arranges the data in ascending or
descending order.
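
Filtering and sorting translate directly into code as well. The following Python sketch uses hypothetical helper names (filter_rows, sort_rows) and made-up sample rows; it assumes rows are dictionaries whose sort column holds comparable values.

```python
def filter_rows(rows, predicate):
    """Keep only the rows for which `predicate` returns True."""
    return [row for row in rows if predicate(row)]

def sort_rows(rows, column, descending=False):
    """Return the rows ordered by `column`, ascending unless `descending` is set."""
    return sorted(rows, key=lambda row: row[column], reverse=descending)

# Made-up sample data.
rows = [
    {"product": "A", "units": 30},
    {"product": "B", "units": 5},
    {"product": "C", "units": 12},
]
popular = filter_rows(rows, lambda row: row["units"] >= 10)
ranked = sort_rows(popular, "units", descending=True)
print([row["product"] for row in ranked])  # ['A', 'C']
```

Both helpers return new lists and leave the input untouched, in the same way that filtering in Excel hides rows without deleting them.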

Advantages of Data Preprocessing

There are several advantages of data preprocessing in Excel:
• Excel provides a user-friendly interface, so data preprocessing and other data analysis
tasks are easy to carry out.
• Excel offers a wide range of functions and features that cover different data
preprocessing needs.
• Excel is widely available, which is why it is commonly used for data preprocessing.
• Excel integrates well with other Microsoft Office applications, facilitating seamless data
transfer and collaboration.

Disadvantages of Data Preprocessing

Along with the advantages, there are some disadvantages of data preprocessing in Excel:
• Excel may not be suitable for handling large datasets.
• Excel's analytical capabilities are robust but may not match those offered by specialized
statistical or data analysis software.
• Data preprocessing tasks in Excel often require manual execution.
