What Is Data Preparation? An In-Depth Guide - TechTarget
What Is Data Preparation? An In-Depth Guide - TechTarget
Business
9 Analytics
Tech Accelerator
DEFINITION
IT, BI and data management teams do data preparation work as they integrate data sets
to load into data warehouses, data
TechTarget and ourlakes or other
partners repositories. They then refine the
use cookies
prepared data sets to
asimprove
needed our sites,new
when analyze traffic and
analytics applications are developed. In
performance,
addition, data scientists, and serve personalized
data engineers, data analysts and business users
content and advertising. You can update
increasingly use self-service data preparation tools to collect and prepare data
your cookie preferences at any time.
themselves.
See our Privacy Policy for more
information.
OK SETTINGS
https://www.techtarget.com/searchbusinessanalytics/definition/data-preparation 7/20/24, 08 57
Page 1 of 14
:
ADVERTISEMENT
Data preparation is often referred to informally as data prep. Alternatively, it's also
known as data wrangling. But some practitioners use the latter term in a narrower sense
to refer to cleansing, structuring and transforming data, which distinguishes data
wrangling from the data preprocessing stage.
This comprehensive guide to data preparation further explains what it is, how to do it
and the benefits it provides in organizations. You'll also find information on data
preparation tools, best practices and common challenges faced in preparing data.
Throughout the guide, hyperlinks point to related articles that provide more information
on the covered topics.
Data preparation also involves finding relevant data to ensure that analytics applications
deliver meaningful information and actionable insights for business decision-making.
The data often is enriched and optimized to make it more informative and useful -- for
https://www.techtarget.com/searchbusinessanalytics/definition/data-preparation 7/20/24, 08 57
Page 2 of 14
:
example, by blending internal and external data sets, creating new data fields,
eliminating outlier values and addressing imbalanced data sets that could skew
analytics results.
In addition, BI and data management teams use the data preparation process to curate
data sets for business users to analyze. Doing so helps streamline and guide self-
service BI applications for business analysts, executives and workers.
Done properly, data preparation also helps an organization do the following to gain
business benefits:
https://www.techtarget.com/searchbusinessanalytics/definition/data-preparation 7/20/24, 08 57
Page 3 of 14
:
Identify and fix data issues that otherwise might not be detected.
Enable more informed decision-making by business executives and operational
workers.
Reduce data management and analytics costs.
Avoid duplication of effort in preparing data for use in multiple applications.
Get a higher ROI from BI and data science initiatives.
Effective data preparation is particularly beneficial in big data environments that store a
combination of structured, semistructured and unstructured data to support machine
learning (ML), predictive analytics and other forms of advanced analytics. Those
applications typically involve large amounts of data, which is often stored in raw form in
a data lake until it's needed for specific analytics uses. As a result, preparing data for
machine learning can be more time-consuming than creating the ML algorithms to run
against the data -- a situation that a well-managed data prep process helps rectify.
1. Data collection
Relevant data is gathered from operational systems, data warehouses, data lakes and
other data sources. During the data collection step, data scientists, data engineers, BI
team members, other data professionals and end users should confirm that the data
they're gathering is a good fit for the objectives of planned analytics applications.
https://www.techtarget.com/searchbusinessanalytics/definition/data-preparation 7/20/24, 08 57
Page 4 of 14
:
3. Data cleansing
Next, the identified data errors and issues are corrected to create complete and
accurate data sets. For example, as part of data cleansing work, faulty data is removed
or fixed, missing values are filled in and inconsistent entries are harmonized.
4. Data structuring
At this point, the data needs to be modeled and organized to meet analytics
requirements. For example, data stored in comma-separated values files or other file
formats must be converted into tables to make it accessible to BI and analytics tools.
Data preparation can also incorporate or feed into data curation work that creates
ready-to-use data sets for BI and analytics applications. Data curation involves tasks
such as indexing, cataloging and maintaining data sets and their associated metadata
to help users find and access the data. In some organizations, data curator is a formal
role that works collaboratively with data scientists, business analysts, other users and
the IT and data management teams. In others, data might be curated by data stewards,
data engineers, database administrators or data scientists and business users
themselves.
https://www.techtarget.com/searchbusinessanalytics/definition/data-preparation 7/20/24, 08 57
Page 5 of 14
:
k The data preparation process includes these primary steps.
As noted above, doing so is often a lengthy process: In the past, a common maxim was
that data scientists spent about 80% of their time collecting and preparing data and only
20% analyzing it. That might not be the case now, partly due to the increased availability
of data preparation tools. But in the 2023 edition of an annual survey conducted by data
science platform vendor Anaconda, 1,071 data science practitioners ranked data
preparation and data cleansing as the two most time-consuming tasks in analytics
applications.
The following are seven common data preparation challenges faced by data scientists
and others involved in the process:
https://www.techtarget.com/searchbusinessanalytics/definition/data-preparation 7/20/24, 08 57
Page 6 of 14
:
invalid entries that frequently occur in data and must be fixed to ensure analytics
accuracy.
Name and address standardization. Names and addresses might be inconsistent
in different systems, with variations that can affect views of customers and other
entities if the data isn't standardized.
Inconsistent data across enterprise systems. Other inconsistencies in data sets
drawn from multiple source systems, such as different terminology and unique
identifiers, are also pervasive issues to contend with in data preparation efforts.
Data enrichment issues. Deciding how to enrich a data set -- for example, what to
add to it -- is a complex task that requires a strong understanding of business needs
and analytics goals.
Maintaining and expanding data prep processes. Data preparation work often
becomes a recurring process that needs to be sustained and enhanced on an
ongoing basis.
k These issues complicate the process of preparing data for BI and analytics applications.
https://www.techtarget.com/searchbusinessanalytics/definition/data-preparation 7/20/24, 08 57
Page 7 of 14
:
The tools run data sets through a workflow that follows the steps of the data preparation
process. They also feature GUIs designed to further simplify the required tasks and
functions. In addition to speeding up data preparation and leaving more time for related
analytics work, the self-service software might help organizations increase the number
of BI and data science applications they're able to run, thus opening up new analytics
scenarios.
In 2023, consulting firm Gartner removed data preparation tools from its annual Hype
Cycle report on emerging data management technologies, saying they had reached full
maturity and mainstream adoption. Initially sold as separate products by several
vendors that focused on data preparation, the tools have now largely been incorporated
into broader data management software suites and BI or data science platforms. For
example, Gartner lists data preparation as a core capability of BI platforms.
The following list includes some prominent BI, analytics and data management vendors
that offer data preparation tools or capabilities -- it's based on market reports from
Gartner and Forrester Research plus additional research by TechTarget editors:
Altair.
Alteryx.
AWS.
Boomi.
Datameer.
IBM.
Informatica.
Microsoft.
Oracle.
Precisely.
Qlik.
SAP.
SAS.
Tableau.
https://www.techtarget.com/searchbusinessanalytics/definition/data-preparation 7/20/24, 08 57
Page 8 of 14
:
Tamr.
Tibco Software.
One caveat from data management consultants and practitioners: Don't look at self-
service data preparation software as a replacement for traditional data
integration technologies, particularly extract, transform and load (ETL) tools. While data
prep tools enable users to integrate relevant data sets for analytics applications, ETL
ones provide heavy-duty capabilities for integrating large amounts of data, transforming
it and moving it into a data store. The two technologies often complement one another:
A data management team might use ETL software to combine and initially prepare data
sets, then data scientists or business analysts can use a self-service tool to do more
specific data preparation work.
GARTNER
https://www.techtarget.com/searchbusinessanalytics/definition/data-preparation 7/20/24, 08 57
Page 9 of 14
:
automatically profile data, fix errors and recommend other data cleansing,
transformation and enrichment measures. In its 2023 Hype Cycle report, Gartner said
organizations should make such augmented features a must-have item when buying
new data management tools.
Automated data prep features are also included in the augmented analytics
technologies now offered by many BI vendors. The automation is particularly helpful for
self-service BI users and citizen data scientists -- business analysts and other workers
who don't have formal data science training but do some advanced analytics work. But it
also speeds up data preparation by skilled data scientists and data engineers.
1. Think of data preparation as part of data analysis. Data preparation and analysis
are "two sides of the same coin," according to Farmer. That means data can't be
properly prepared without knowing what analytics use it needs to fit.
2. Define what data preparation success means. Desired data accuracy levels and
other data quality metrics should be set as goals and then balanced against
projected costs to create a data prep plan that's appropriate to each use case.
3. Prioritize data sources based on the application. Resolving differences in data
from multiple source systems is an important element of data preparation that also
should be based on the planned analytics use case.
https://www.techtarget.com/searchbusinessanalytics/definition/data-preparation 7/20/24, 08 57
Page 10 of 14
:
4. Use the right tools for the job and your skill level. Self-service data preparation
tools aren't the only option available -- other tools and technologies can also be
used, depending on an organization's internal skills and data needs.
5. Be prepared for failures when preparing data. Error-handling capabilities need to
be built into the data preparation process to prevent it from going awry or getting
bogged down when problems occur.
6. Keep an eye on data preparation costs. The cost of software licenses, data
processing and storage resources, and the people involved in preparing data should
be watched closely to ensure that expenses don't get out of hand.
Ed Burns, a former executive editor at TechTarget, and freelance journalist Mary K. Pratt
contributed to this article.
ADVERTISEMENT
https://www.techtarget.com/searchbusinessanalytics/definition/data-preparation 7/20/24, 08 57
Page 11 of 14
:
By: Donald Farmer
-ADS BY GOOGLE
https://www.techtarget.com/searchbusinessanalytics/definition/data-preparation 7/20/24, 08 57
Page 12 of 14
:
DATA MANAGEMENT AWS CONTENT MANAGEMENT ORACLE SAP
Data Management
About Us Editorial Ethics Policy Meet The Editors Contact Us Advertisers Partner with Us Media Kit
https://www.techtarget.com/searchbusinessanalytics/definition/data-preparation 7/20/24, 08 57
Page 13 of 14
:
Corporate Site
Privacy Policy
Cookie Preferences
Do Not Sell or Share My Personal Information
https://www.techtarget.com/searchbusinessanalytics/definition/data-preparation 7/20/24, 08 57
Page 14 of 14
: