Basic Data Profiling

Data profiling involves analyzing data at the column, row, table, and cross-table levels to identify issues such as invalid values, inconsistent representations, missing data, and violations of structural or logical rules. It is a multi-stage process that examines individual values, value distributions within columns, data structures, and relationships within and between tables to validate data quality and uncover potential problems. Examples of issues found include empty columns, values outside expected ranges, inconsistent formatting, and violations of dependencies or business rules.

Uploaded by

tab12345

Column Examination

Identify all values in the column along with their frequency of occurrence
Identify min and max values
Determine true data type
Determine degree of uniqueness
Determine encoding patterns used, frequency of each pattern
Compute values: AVG, SUM, MEDIAN, STD DEVIATION
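
The column checks above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the column arrives as a list of strings, `None` or a blank string counts as missing, and the names `shape_of` and `profile_column` are hypothetical helpers, not part of any profiling tool. The "true data type" test here is deliberately crude: a column is called numeric only if every present value parses as a float.

```python
import re
import statistics
from collections import Counter


def shape_of(value):
    """Encode a value's pattern: every digit becomes 9, every letter becomes A."""
    text = re.sub(r"[0-9]", "9", str(value))
    return re.sub(r"[A-Za-z]", "A", text)


def profile_column(values):
    """Profile one column given as a list of strings (None or blank = missing)."""
    present = [v for v in values if v is not None and str(v).strip() != ""]
    freq = Counter(present)
    profile = {
        "count": len(values),
        "missing": len(values) - len(present),
        "frequencies": freq,  # each distinct value with its count
        "uniqueness": len(freq) / len(present) if present else 0.0,
        "patterns": Counter(shape_of(v) for v in present),
    }
    # Crude type inference (an assumption of this sketch): numeric only if
    # every present value parses as a float.
    try:
        numbers = [float(v) for v in present]
    except (TypeError, ValueError):
        numbers = []
    if numbers:
        profile["inferred_type"] = "numeric"
        profile["min"] = min(numbers)
        profile["max"] = max(numbers)
        profile["sum"] = sum(numbers)
        profile["avg"] = statistics.mean(numbers)
        profile["median"] = statistics.median(numbers)
        # Sample standard deviation needs at least two values.
        profile["stdev"] = statistics.stdev(numbers) if len(numbers) > 1 else 0.0
    else:
        profile["inferred_type"] = "text"
        profile["min"] = min(map(str, present), default=None)
        profile["max"] = max(map(str, present), default=None)
    return profile


amounts = profile_column(["10", "20", "20", None, "40"])
phones = profile_column(["555-1234", "5551234", "555-9876"])
```

In the `phones` example the pattern counts expose two different encodings of the same kind of value, which is the sort of inconsistency the pattern-frequency step is meant to surface.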

Row Examination
Find all primary key candidates (single or multi-column)
Find intra-row column dependencies (find de-normalization instances)
Find multi-column value relationships
Value ordering rules
NULL value dependencies
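
As a rough illustration of the first two row checks, the sketch below searches for primary key candidates and tests one intra-row dependency. It assumes rows are plain dictionaries and that a brute-force scan is acceptable for the table size; `key_candidates` and `holds_dependency` are hypothetical names introduced here. Candidates found this way are only as trustworthy as the sample they were found in.

```python
from itertools import combinations


def key_candidates(rows, columns, max_width=2):
    """Return minimal column combinations that are unique and non-null in every row."""
    found = []
    for width in range(1, max_width + 1):
        for combo in combinations(columns, width):
            # A superset of an existing key is unique too, but not minimal.
            if any(set(key) <= set(combo) for key in found):
                continue
            seen = set()
            is_key = True
            for row in rows:
                value = tuple(row[c] for c in combo)
                if None in value or value in seen:
                    is_key = False
                    break
                seen.add(value)
            if is_key:
                found.append(combo)
    return found


def holds_dependency(rows, determinant, dependent):
    """True if each combination of determinant values maps to one dependent value."""
    mapping = {}
    for row in rows:
        lhs = tuple(row[c] for c in determinant)
        if mapping.setdefault(lhs, row[dependent]) != row[dependent]:
            return False
    return True


# Hypothetical order-line rows used only for illustration.
rows = [
    {"order_id": 1, "line": 1, "zip": "10001", "city": "New York"},
    {"order_id": 1, "line": 2, "zip": "10001", "city": "New York"},
    {"order_id": 2, "line": 1, "zip": "60601", "city": "Chicago"},
    {"order_id": 3, "line": 1, "zip": "10001", "city": "New York"},
]
keys = key_candidates(rows, ["order_id", "line", "zip", "city"])
zip_determines_city = holds_dependency(rows, ["zip"], "city")
line_determines_city = holds_dependency(rows, ["line"], "city")
```

A dependency such as `zip` determining `city` inside a single table is one sign of de-normalization, since the pair could live in its own lookup table.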

Multi-table Examination
Find matching columns across tables
Match by column name, data type
Match by values
Find primary/foreign key pairs (single and multi-column)
Determine cardinality rules (1-1, 1-M, 1-0, M-1, M-M, 0-1)
Find primary values not found in secondary tables
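
A minimal sketch of the value-based checks, assuming the key columns of both tables have already been pulled into plain Python lists. The function names and the sample `customer_ids` and `order_customer_ids` lists are hypothetical and exist only for illustration.

```python
from collections import Counter


def value_overlap(column_a, column_b):
    """Fraction of distinct values in column_a that also appear in column_b."""
    distinct_a = set(column_a)
    if not distinct_a:
        return 0.0
    return len(distinct_a & set(column_b)) / len(distinct_a)


def unmatched_values(parent_keys, child_keys):
    """Return (parent values with no child rows, child values with no parent row)."""
    parents, children = set(parent_keys), set(child_keys)
    return parents - children, children - parents


def relationship_shape(parent_keys, child_keys):
    """Summarize how many child rows each parent value has."""
    counts = Counter(child_keys)
    per_parent = [counts.get(key, 0) for key in set(parent_keys)]
    return {
        "optional": min(per_parent, default=0) == 0,  # some parent has no child
        "many": max(per_parent, default=0) > 1,       # some parent has several
    }


customer_ids = [1, 2, 3]
order_customer_ids = [1, 1, 3, 4]

overlap = value_overlap(order_customer_ids, customer_ids)
no_orders, no_customer = unmatched_values(customer_ids, order_customer_ids)
shape = relationship_shape(customer_ids, order_customer_ids)
```

A high overlap between two columns suggests a primary/foreign key pair worth confirming; the unmatched sets then show which values break it in either direction.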

Invalid Values
Values missing where they are required
Values out of range or not in the domain of expected values
Value in one column not possible when combined with values in one or more other columns
Example: obviously wrong values
Name = Donald Duck
Address = 1600 Pennsylvania Avenue
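
The first three kinds of invalid value can be checked mechanically. The sketch below assumes a hypothetical employee record with `name`, `age`, `status`, and `termination_date` fields; the field names, the age range, and the status domain are all illustrative assumptions, not rules taken from the text.

```python
VALID_STATUSES = {"active", "terminated"}  # assumed domain, for illustration


def find_invalid(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    # Missing when it should not be missing.
    if not record.get("name"):
        problems.append("name: missing")
    # Out of range.
    age = record.get("age")
    if age is not None and not 0 <= age <= 120:
        problems.append("age: out of range")
    # Not in the domain of expected values.
    if record.get("status") not in VALID_STATUSES:
        problems.append("status: not in domain")
    # Impossible in combination with another column.
    if record.get("status") == "active" and record.get("termination_date"):
        problems.append("termination_date: set for an active employee")
    return problems


bad = find_invalid(
    {"name": "", "age": 150, "status": "active", "termination_date": "2020-01-01"}
)
good = find_invalid(
    {"name": "Ann Lee", "age": 41, "status": "active", "termination_date": None}
)
```

Obviously wrong values such as the name and address in the example pass checks like these, because they are well-formed; this sketch does not attempt to catch them.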

The Basics of Data Profiling
Data profiling consists of multiple analyses to investigate the structure and
content of data and make inferences about data.
Examples of problems easily uncovered through data profiling analysis:
Data elements used for purposes other than those intended
Empty columns; columns containing no data at all
Invalid values in columns
Inconsistent methods of representing the same value
Missing values
Violation of structural dependencies
Violation of expected column relationships, such as missing date values
Violation of business rules
Unrealistic percentages of specific values appearing in a column
Data profiling is an organized methodology that analyzes the data in stages to produce a
thorough result. The stages that an analyst typically works through are:
Analyze individual values to determine if they are valid values for a column
Analyze all the values in a column together to find problems with uniqueness rules,
consecutive-value rules and unexpected frequencies of specific values
Analyze structure rules governing functional dependencies, primary keys, foreign keys,
synonyms and duplicate columns
Validate data rules that must hold true within a row of data
Validate data rules that must hold true over all rows for a single business object
Validate data rules that must hold true over collections of a business object
Validate data rules that must hold true between collections of different types of
business objects
Data rules are a subset of business rules that define relationships between sets of columns or
rows that must always hold true within the data. A violation may mean either that the data
contains inaccuracies or that the business rules it is based on are not being followed in the
real world. In the first case the data was entered inaccurately. In the second case the data was
entered correctly, but the transaction itself was handled outside of the corporation's business
policies. Both situations are important to expose.
Examples of data rules are:
Employees must be at least 18 years old.
Part-time employees are paid hourly.
Checkout periods for tools cannot overlap for the same tool.
Customers with more than $50,000 in sales last quarter get a 5 percent discount.
Suppliers cannot supply radioactive part numbers unless certified.
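
Three of these rules can be expressed as executable checks. The following is a minimal sketch, assuming employees are dictionaries with hypothetical `birth_date`, `part_time`, and `pay_type` fields, and checkouts are `(tool_id, start, end)` tuples whose end date is treated as exclusive (a tool returned on a given day may be checked out again that day). None of these representations come from the text.

```python
from datetime import date


def employee_rule_violations(emp, today):
    """Check two row-level rules for one employee record."""
    violations = []
    born = emp["birth_date"]
    birthday_passed = (today.month, today.day) >= (born.month, born.day)
    age = today.year - born.year - (0 if birthday_passed else 1)
    if age < 18:
        violations.append("employee is under 18")
    if emp["part_time"] and emp["pay_type"] != "hourly":
        violations.append("part-time employee is not paid hourly")
    return violations


def overlapping_checkouts(checkouts):
    """Flag each checkout that starts before an earlier one for the same tool ends."""
    by_tool = {}
    for tool_id, start, end in checkouts:
        by_tool.setdefault(tool_id, []).append((start, end))
    overlaps = []
    for tool_id, periods in by_tool.items():
        periods.sort()
        latest = periods[0]  # earlier period with the latest end date so far
        for period in periods[1:]:
            if period[0] < latest[1]:
                overlaps.append((tool_id, latest, period))
            if period[1] > latest[1]:
                latest = period
    return overlaps


minor = employee_rule_violations(
    {"birth_date": date(2010, 6, 1), "part_time": True, "pay_type": "salary"},
    today=date(2024, 1, 1),
)
conflicts = overlapping_checkouts([
    ("drill", date(2024, 1, 1), date(2024, 1, 5)),
    ("drill", date(2024, 1, 4), date(2024, 1, 8)),
    ("saw", date(2024, 1, 1), date(2024, 1, 2)),
    ("saw", date(2024, 1, 2), date(2024, 1, 3)),
])
```

The employee checks need only one row, while the checkout rule has to look across all rows for the same tool, which mirrors the distinction between row-level rules and rules over a whole business object. A reported violation still has to be investigated to decide whether the data or the real-world transaction was at fault.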