Ref1.Richardson2023.DataAnalytics for Accounting - Chapt 1-2
Chapter 1
Data Analytics for Accounting and
Identifying the Questions
A Look Ahead
Chapter 2 provides a description of how data are prepared and scrubbed to
be ready for analysis to address accounting questions. We explain how to
extract, transform, and load data and then how to validate and normalize the
data. In addition, we explain how data standards are used to facilitate the
exchange of data between data sender and receiver. We finalize the chapter
by emphasizing the need for ethical data collection and data use to maintain
data privacy.
OBJECTIVES
After reading this chapter, you should be able to:
DATA ANALYTICS
LO 1-1
Define Data Analytics.
Data surround us! By the year 2024, it is expected that the volume of data
created, captured, copied, and consumed worldwide will be 149 zettabytes
(compared to 2 zettabytes in 2010 and 59 zettabytes in 2020).1 In fact, more
data have been created in the last 2 years than in the entire previous history
of the human race.2 With so much data available about each of us (e.g., how
we shop, what we read, what we’ve bought, what music we listen to, where
we travel, whom we trust, what devices we use, etc.), arguably, there is the
potential for analyzing those data in a way that can answer fundamental
business questions and create value.
We define Data Analytics as the process of evaluating data with the
purpose of drawing conclusions to address business questions. Indeed,
effective Data Analytics provides a way to search through large structured
data (data that adheres to a predefined data model in a tabular format) and
unstructured data (data that does not adhere to a predefined data format)
to discover unknown patterns or relationships.3 In other words, Data
Analytics often involves the technologies, systems, practices,
methodologies, databases, statistics, and applications used to analyze
diverse business data to give organizations the information they need to
make sound and timely business decisions.4 That is, the process of Data
Analytics aims to transform raw data into knowledge to create value.
Big Data refers to datasets that are too large and complex for
businesses’ existing systems to capture, store, manage, and analyze
with their traditional capabilities. Another way to
describe Big Data (or frankly any available data source) is by use of four
Vs: its volume (the sheer size of the dataset), velocity (the speed of data
processing), variety (the number of types of data), and veracity (the
underlying quality of the data). While sometimes Data Analytics and Big
Data are terms used interchangeably, we will use the term Data Analytics
throughout and focus on the possibility of turning data into knowledge and
that knowledge into insights that create value.
PROGRESS CHECK
1. How does having more data around us translate into value for a company?
2. Assume a bank is deciding whether to extend a loan to a customer. How would
you suggest the bank use Data Analytics to get a more complete view of its
customers’ creditworthiness? Assume the bank has access to a customer’s loan
history, credit card transactions, deposit history, and direct deposits.
There is little question that the impact of data and Data Analytics on
business is overwhelming. In fact, in PwC’s 18th Annual Global CEO
Survey, 86 percent of chief executive officers (CEOs) say they find it
important to champion digital technologies and emphasize a clear vision of
using technology for a competitive advantage, while 85 percent say they put
a high value on Data Analytics. In fact, per PwC’s 6th Annual
Digital IQ survey of more than 1,400 leaders from digital
businesses, the area of investment that tops CEOs’ list of priorities is
business analytics.5
A recent study from McKinsey Global Institute estimates that Data
Analytics and technology could generate up to $2 trillion in value per year
in just a subset of the total possible industries affected.6 Data Analytics
could very much transform the manner in which companies run their
businesses in the near future because the real value of data comes from Data
Analytics. With a wealth of data on their hands, companies use Data
Analytics to discover the various buying patterns of their customers,
investigate anomalies that were not anticipated, forecast future possibilities,
and so on. For example, with insight provided through Data Analytics,
companies could execute more directed marketing campaigns based on
patterns observed in their data, giving them a competitive advantage over
companies that do not use this information to improve their marketing
strategies. By pairing structured data with unstructured data, patterns could
be discovered that create new meaning, creating value and competitive
advantage. In addition to producing more value externally, studies show
that Data Analytics affects internal processes, improving productivity,
utilization, and growth.7
And increasingly, data analytic tools are available as self-service
analytics allowing users the capabilities to analyze data by aggregating,
filtering, analyzing, enriching, sorting, visualizing, and dashboarding for
data-driven decision making on demand.
PwC notes that while data has always been important, executives are
more frequently being asked to make data-driven decisions in high-stress
and high-change environments, making the reliance on Data Analytics even
greater these days!8
PROGRESS CHECK
3. Let’s assume a brand manager at Procter and Gamble identifies that an older
demographic might be concerned with the use of Tide Pods to do their laundry.
How might Procter and Gamble use Data Analytics to assess if this is a
problem?
4. How might Data Analytics assess the decision to either grant overtime to
existing employees or hire additional employees?
Auditing
Data Analytics plays an increasingly critical role in the future of audit. In a
recent Forbes Insights/KPMG report, “Audit 2020: A Focus on Change,”
the vast majority of survey respondents believe both that:
We address auditing questions and Data Analytics in Chapters 5 and 6.
Lab Connection
Lab 1-3 has you explore questions auditors would answer with
Data Analytics.
Management Accounting
Of all the fields of accounting, it would seem that the aims of Data
Analytics are most akin to management accounting. Management
accountants (1) are asked questions by management, (2) find data to address
those questions, (3) analyze the data, and (4) report the results to
management to aid in their decision making. The description of the
management accountant’s task and that of the data analyst appear to be
quite similar, if not identical in many respects.
Whether it be understanding costs via job order costing, understanding
the activity-based costing drivers, forecasting future sales on which to base
budgets, or determining whether to sell or process further or make or
outsource its production processes, analyzing data is critical to management
accountants.
As information providers for the firm, it is imperative for management
accountants to understand the capabilities of data and Data Analytics to
address management questions.
We address management accounting questions and Data Analytics in
Chapter 7.
Lab Connection
Lab 1-2 and Lab 1-4 have you explore questions managers
would answer with Data Analytics.
Financial Reporting and Financial Statement
Analysis
Data Analytics also potentially has an impact on financial reporting. With
the use of so many estimates and valuations in financial accounting, some
believe that employing Data Analytics may substantially improve the
quality of the estimates and valuations. Data from within an enterprise
system and external to the company and system might be used to address
many of the questions that face financial reporting. Many financial
statement accounts are just estimates, and so accountants often ask
themselves questions like this to evaluate those estimates:
1. How much of the accounts receivable balance will ultimately be
collected? What should the allowance for loan losses look like?
2. Is any of our inventory obsolete? Should our inventory be valued at
market or cost (applying the lower-of-cost-or-market rule)? When will it
be out of date? Do we need to offer a discount on it now to get it sold?
3. Has our goodwill been impaired due to the reduction in profitability
from a recent merger? Will it regain value in the near future?
4. How should we value contingent liabilities like warranty claims or
litigation? Do we have the right amount?
Lab Connection
Tax
Traditionally, tax work dealt with compliance issues based on data from
transactions that have already taken place. Now, however, tax executives
must develop sophisticated tax planning capabilities that assist the company
with minimizing its taxes in such a way to avoid or prepare for a potential
audit. This shift in focus makes tax data analytics valuable for its ability to
help tax staffs predict what will happen rather than react to what just did
happen. Arguably, one of the things that Data Analytics does best is
predictive analytics—predicting the future! An example of how tax data
analytics might be used is predicting the potential tax consequences of an
international transaction, an R&D investment, or a proposed merger or
acquisition as part of one of the tax function’s most value-adding tasks:
tax planning.
One of the issues of performing predictive Data Analytics is the efficient
organization and use of data stored across multiple systems on varying
platforms that were not originally designed for use in the tax department.
Organizing tax data into a data warehouse to be able to consistently model
and query the data is an important step toward developing the capability to
perform tax data analytics. This issue is exemplified by the 29 percent of
tax departments that find the biggest challenge in executing an analytics
strategy is integrating the strategy with the IT department and gaining
access to available technology tools.12
We address tax questions and Data Analytics in Chapter 9.
PROGRESS CHECK
5. Why are management accounting and Data Analytics considered similar in
many respects?
6. How specifically will Data Analytics change the way a tax staff does its taxes?
EXHIBIT 1-1
The IMPACT Cycle
Source: Isson, J. P., and J. S. Harriott. Win with Advanced Business Analytics: Creating Business
Value from Your Data. Hoboken, NJ: Wiley, 2013.
We explain the full IMPACT cycle briefly here, but in more detail later
in Chapters 2, 3, and 4. We use its approach for thinking about the steps
included in Data Analytics throughout this textbook, all the way from
carefully identifying the question to accessing and analyzing the data to
communicating insights and tracking outcomes.13
Step 1: Identify the Questions (Chapter 1)
It all begins with understanding a business problem that needs addressing.
Questions can arise from many sources, including how to better attract
customers, how to price a product, how to reduce costs, or how to find
errors or fraud. Having a concrete, specific question that is potentially
answerable by Data Analytics is an important first step.
Indeed, accountants often possess a unique skillset to improve an
organization’s Data Analytics by their ability to ask the right questions,
especially since they often understand a company’s financial data. In other
words, “Your Data Won’t Speak Unless You Ask It the Right Data Analysis
Questions.”14 We could ask any question in the world, but if we don’t
ultimately have the right data to address the question, there really isn’t
much use for Data Analytics for those questions.
Audience: Who is the audience that will use the results of the analysis
(internal auditor, CFO, financial analyst, tax professional, etc.)?
Scope: Is the question too narrow or too broad?
Use: How will the results be used? Is it to identify risks? Is it to make
data-driven business decisions?
Here are examples of potential questions accountants might address
using Data Analytics:
Amazon Inc.
EXHIBIT 1-3
Example of Link Prediction on Facebook
Back to Step 1
Since the IMPACT cycle is iterative, once insights are gained and outcomes
are tracked, new more refined questions emerge that may use the same or
different data sources with potentially different analyses and thus, the
IMPACT cycle begins anew.
PROGRESS CHECK
7. Let’s say we are trying to predict how much money college students spend on
fast food each week. What would be the response, or dependent, variable?
8. How might a data reduction approach be used in auditing to allow the auditor to
spend more time and effort on the most important (e.g., most risky, largest
dollar amount) transactions?
We address these seven skills throughout the first four chapters in the
text in hopes that the analytic-minded accountant will develop and practice
these skills to be ready to address business questions. We then demonstrate
these skills in the labs and hands-on analysis throughout the rest of the
book.
EXHIBIT 1-4
Gartner Magic Quadrant for Business Intelligence and Analytics
Platforms
EXHIBIT 1-6
Tableau Data Analytics Tools
PROGRESS CHECK
9. Given the “magic quadrant” in Exhibit 1-4, why are the software tools in the
Microsoft and Tableau tracks considered to be innovative?
10. Why is having the Tableau software tools fully available on both Windows and
Mac operating systems helpful to the analyst?
EXHIBIT 1-7
LendingClub Statistics
EXHIBIT 1-8
LendingClub Statistics by Reported Loan Purpose
42.33% of LendingClub borrowers report using their loans to refinance
existing loans as of September 30, 2020.
Source: Accessed December 2020, https://www.lendingclub.com/info/statistics.action
LendingClub provides datasets on the loans it approved and funded as
well as data for the loans that were declined. To address the question posed,
“What are some characteristics of rejected loans?,” we’ll use the dataset of
rejected loans.
The rejected loan datasets and related data dictionary are available from
your instructor or from Connect (in Additional Student Resources).
As we learn about the data, it is important to know what is available to
us. To that end, there is a data dictionary that provides descriptions for all
of the data attributes of the dataset. A cut-out of the data dictionary for the
rejected stats file (i.e., the statistics about those loans rejected) is shown in
Exhibit 1-9.
EXHIBIT 1-9
2007–2012 LendingClub Data Dictionary for Declined Loan Data
Source: LendingClub
RejectStats file attributes:
Amount Requested: Total requested loan amount.
Application Date: Date of borrower application.
Zip Code: The first 3 numbers of the zip code the borrower provided on the loan application.
State: Two-letter state abbreviation provided on the loan application.
Employment Length: Employment length in years, where 0 means less than 1 year and 10 means greater than 10 years.
We could also take a look at the data files available for the funded loan
data. However, for our analysis in the rest of this chapter, we use the Excel
file “DAA Chapter 1-1 Data” that has rejected loan statistics from
LendingClub for the time period of 2007 to 2012. It is a cleaned-up,
transformed file ready for analysis. We’ll learn more about data scrubbing
and preparation of the data in Chapter 2.
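If you prefer to explore the same file programmatically rather than in Excel, a minimal Python/pandas sketch such as the one below will load it. The file path and sheet name are assumptions, so point them at your own copy of “DAA Chapter 1-1 Data.”

```python
import pandas as pd

# Hypothetical path and sheet name; point these at your copy of "DAA Chapter 1-1 Data".
rejected = pd.read_excel("DAA Chapter 1-1 Data.xlsx", sheet_name=0)

# Get oriented: how many rejected applications are there, and which attributes are available?
print(rejected.shape)    # roughly 645,414 rows for the 2007-2012 file used in the chapter
print(rejected.dtypes)   # e.g., Amount Requested, Application Date, Risk_Score, State, Employment Length
print(rejected.head())
```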
Exhibit 1-10 provides a cut-out of the 2007–2012 “Declined Loan”
dataset provided.
EXHIBIT 1-10
2007–2012 Declined Loan Applications (DAA Chapter 1-1 Data)
Dataset
EXHIBIT 1-11
LendingClub Declined Loan Applications by DTI (Debt-to-Income)
DTI bucket includes high (debt > 20 percent of income), medium (“mid”)
(debt between 10 and 20 percent of income), and low (debt < 10 percent of
income).
Microsoft Excel, 2016
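The DTI bucketing behind Exhibit 1-11 can also be reproduced outside of Excel. The following sketch uses pandas and assumes the data frame loaded earlier; the column label “Debt-To-Income Ratio” follows the data dictionary, but confirm the exact names against your file.

```python
import pandas as pd

# Assumes `rejected` is the DataFrame loaded earlier; the column label is an assumption.
# Some exports store DTI as text such as "15.3%", so strip the percent sign before converting.
dti = pd.to_numeric(
    rejected["Debt-To-Income Ratio"].astype(str).str.rstrip("%"), errors="coerce"
)

# High: debt > 20 percent of income; Mid: 10-20 percent; Low: < 10 percent (the chapter's cutoffs).
rejected["DTI bucket"] = pd.cut(
    dti, bins=[float("-inf"), 10, 20, float("inf")], labels=["Low", "Mid", "High"]
)

# The equivalent of the Exhibit 1-11 PivotTable: count of rejected applications per bucket.
print(rejected["DTI bucket"].value_counts())
```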
The second analysis is on the length of employment and its relationship
with rejected loans (see Exhibit 1-12). Arguably, the longer the
employment, the more stable the job and income stream available to
ultimately repay the loan. LendingClub reports the number of years of
employment for each of the rejected applications. The PivotTable analysis
lists the number of loans by the length of employment. Almost 77 percent
(495,109 of 645,414) of the total rejected loan applications came from
borrowers who had worked at a job for less than 1 year, suggesting a
potentially important reason for rejecting the requested loan. Perhaps some
had worked only a week, or just a month, and still wanted a sizable loan.
EXHIBIT 1-12
LendingClub Declined Loan Applications by Employment Length
(Years of Experience)
EXHIBIT 1-13
Breakdown of Customer Credit Scores (or Risk Scores)
Source: Cafecredit.com
We will classify the sample according to this breakdown into excellent,
very good, good, fair, poor, and very bad credit according to their credit
score noted in Exhibit 1-13.
As part of the analysis of credit score and rejected loans, we again perform
a PivotTable analysis (as seen in Exhibit 1-14) by counting the number of
rejected loan applications by credit (risk) score. We note that nearly 82
percent [(167,379 + 151,716 + 207,234)/645,414] of the rejected applicants
have either very bad, poor, or fair credit ratings, suggesting this might be a
good reason for a loan rejection. We also note that only about 0.4 percent
(2,494/645,414) of those rejected loan applications had excellent credit.
EXHIBIT 1-14
The Count of LendingClub Rejected Loan Applications by Credit or
Risk Score Classification Using PivotTable Analysis
(PivotTable shown here required manually sorting rows to get in proper
order.)
Microsoft Excel, 2016
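As a cross-check on the percentages just quoted, the same counts can be tallied in pandas once each application carries a risk score bucket. The bucket labels below mirror Exhibit 1-13, and the column name “Risk Bucket” is an assumption made for illustration.

```python
# Assumes `rejected` already carries a "Risk Bucket" column with the Exhibit 1-13 labels
# (Excellent, Very Good, Good, Fair, Poor, Very Bad); the column name is illustrative.
counts = rejected["Risk Bucket"].value_counts()
total = counts.sum()

# Share of rejected applications with fair, poor, or very bad credit (~82 percent in the chapter).
weak_share = counts.reindex(["Fair", "Poor", "Very Bad"]).sum() / total

# Share with excellent credit (about 0.4 percent in the chapter: 2,494 of 645,414).
excellent_share = counts.get("Excellent", 0) / total

print(counts)
print(f"Fair/Poor/Very Bad: {weak_share:.1%}   Excellent: {excellent_share:.1%}")
```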
EXHIBIT 1-15
The Count of LendingClub Declined Loan Applications by Credit (or
Risk Score), Debt-to-Income (DTI Bucket), and Employment Length
Using PivotTable Analysis (Highlighting Added)
EXHIBIT 1-16
The Average Debt-to-Income Ratio (Shown as a Percentage) by Credit
(Risk) Score for LendingClub Declined Loan Applications Using
PivotTable Analysis
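The two-way summaries in Exhibits 1-15 and 1-16 translate directly into pandas pivot tables. In the sketch below, the column names (“Risk Bucket,” “DTI bucket,” “Employment Length,” “Amount Requested,” “Debt-To-Income Ratio”) are assumptions based on the data dictionary and the buckets created earlier, and the DTI column is assumed to have already been converted to a numeric percentage.

```python
import pandas as pd

# Count of declined applications by risk bucket and DTI bucket, spread across employment
# length (the pandas analogue of Exhibit 1-15). Any always-populated column works for counting.
counts = pd.pivot_table(
    rejected,
    index=["Risk Bucket", "DTI bucket"],
    columns="Employment Length",
    values="Amount Requested",
    aggfunc="count",
    fill_value=0,
)

# Average debt-to-income ratio (as a percentage) by risk bucket (the analogue of Exhibit 1-16).
avg_dti = pd.pivot_table(
    rejected, index="Risk Bucket", values="Debt-To-Income Ratio", aggfunc="mean"
)

print(counts.head())
print(avg_dti)
```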
Communicate Insights
Certainly further and more sophisticated analysis could be performed, but at
this point we have a pretty good idea of what LendingClub uses to decide
whether to extend or reject a loan to a potential borrower. We can
communicate these insights either by showing the PivotTables or simply
stating what the three determinants are. What is the most effective
communication: just showing the PivotTables themselves, showing a graph
of the results, or simply sharing the names of these three determinants with the
decision makers? Knowing the decision makers and how they like to
receive this information will help the analyst determine how to
communicate insights.
Track Outcomes
There are a wide variety of outcomes that could be tracked. But in this case,
it might be best to see if we could predict future outcomes. For example, the
data we analyzed were from 2007 to 2012. We could make our predictions
for subsequent years based on what we had found in the past and then test
to see how accurate we are with those predictions. We could also change
our prediction model when we learn new insights and additional data
become available.
PROGRESS CHECK
11. Lenders often use the data item of whether a potential borrower rents or owns
their house. Beyond the three characteristics of rejected loans analyzed in this
chapter, how might this data item help a lender evaluate a loan application?
12. Performing your own analysis, download the rejected loans dataset titled “DAA
Chapter 1-1 Data” and perform an Excel PivotTable analysis by state (including
the District of Columbia) and figure out the number of rejected applications for
the state of California. That is, count the loans by state and see what
percentage of the rejected loans came from California. How close is that
percentage to the relative proportion of California’s population compared to the
overall U.S. population?
13. Performing your own analysis, download the rejected loans dataset titled “DAA
Chapter 1-1 Data” and run an Excel PivotTable by risk (or credit) score
classification and DTI bucket to determine the number (or percentage) of rejected
loan applications with an excellent credit score that fall into the high DTI bucket.
Summary
In this chapter, we discussed how businesses and accountants derive
value from Data Analytics. We gave some specific examples of how
Data Analytics is used in business, auditing, managerial accounting,
financial accounting, and tax accounting.
We introduced the IMPACT model and explained how it is used to
address accounting questions. And then we talked specifically about the
importance of identifying the question. We walked through the first few
steps of the IMPACT model and introduced eight data approaches that
might be used to address different accounting questions. We also
discussed the data analytic skills needed by analytic-minded accountants.
We followed this up with a hands-on example of the IMPACT
model, namely examining the characteristics of rejected loans at
LendingClub. We performed this analysis using various filtering and
PivotTable tasks.
Key Words
Big Data (4) Datasets that are too large and complex for businesses’ existing systems to handle
utilizing their traditional capabilities to capture, store, manage, and analyze these datasets.
classification (11) A data approach that attempts to assign each unit in a population into a few
categories potentially to help with predictions.
clustering (11) A data approach that attempts to divide individuals (like customers) into groups
(or clusters) in a useful or meaningful way.
co-occurrence grouping (11) A data approach that attempts to discover associations between
individuals based on transactions involving them.
Data Analytics (4) The process of evaluating data with the purpose of drawing conclusions to
address business questions. Indeed, effective Data Analytics provides a way to search through
large structured and unstructured data to identify unknown patterns or relationships.
data dictionary (19) Centralized repository of descriptions for all of the data attributes of the
dataset.
data reduction (12) A data approach that attempts to reduce the amount of information that
needs to be considered to focus on the most critical items (i.e., highest cost, highest risk, largest
impact, etc.).
link prediction (12) A data approach that attempts to predict a relationship between two data
items.
predictor (or independent or explanatory) variable (11) A variable that predicts or
explains the value of another (response or dependent) variable.
profiling (11) A data approach that attempts to characterize the “typical” behavior of an
individual, group, or population by generating summary statistics about the data (including mean,
standard deviations, etc.).
regression (11) A data approach that attempts to estimate or predict, for each unit, the
numerical value of some variable using some type of statistical model.
response (or dependent) variable (10) A variable that responds to, or is dependent on,
another.
similarity matching (11) A data approach that attempts to identify similar individuals based on
data known about them.
structured data (4) Data that are organized and reside in a fixed field with a record or a file.
Such data are generally contained in a relational database or spreadsheet and are readily
searchable by search algorithms.
unstructured data (4) Data that do not adhere to a predefined data model in a tabular format.
ANSWERS TO PROGRESS CHECKS
1. The plethora of data alone does not necessarily translate into value. However, if we
carefully analyze the data to help address critical business problems and questions,
the data can be turned into insights that create value.
2. If banks have access to all of their customers’ banking information, Data Analytics
would allow them to evaluate their customers’ creditworthiness. Banks would know
how much money their customers have and how they spend it. Banks would know if
customers had prior loans and if those loans were paid in a timely manner. Banks
would know where customers work and the size and stability of monthly income via
the direct deposits. All of these data would give the bank a more complete view of
creditworthiness.
3. The brand manager at Procter and Gamble might use Data Analytics to see what
is being said about Procter and Gamble’s Tide Pods product on social media
websites (e.g., Snapchat, Twitter, Instagram, and Facebook), particularly those that
attract an older demographic. This will help the manager assess if there is a problem.
4. Data Analytics might be used to collect information on the amount of overtime. Who
worked overtime? What were they working on? Do we actually need more full-time
employees to reduce the level of overtime (and its related costs to the company)
instead of paying overtime? How much will costs increase just to pay for fringe
benefits (health care, retirement, etc.) for new employees versus just paying
existing employees for their overtime? All of these questions could be addressed by
Data Analytics.
5. Management accounting and Data Analytics both (1) address questions asked by
management, (2) find data to address those questions, (3) analyze the data, and
(4) report the results to management to aid in its decision making.
6. The tax staff would become much more adept at efficiently organizing data from
multiple systems across an organization and performing Data Analytics to help with
tax planning.
7. The dependent variable could be the amount of money spent on fast food.
Independent variables could be proximity of the fast food, ability to cook one’s own
food, and so on.
8. The data reduction approach might help auditors spend more time and effort on the
most risky transactions or on those that might be anomalous in nature. This will
help them more efficiently spend their time on items that may well be of highest
importance.
9. According to the “magic quadrant,” the software tools represented by the Microsoft
and Tableau tracks are considered innovative because they lead the market in both
completeness of vision and ability to execute.
10. Having Tableau software tools available on both the Mac and Windows computers
gives the analyst needed flexibility that is not available for the Microsoft track,
whose desktop tools run primarily on Windows.
11. The use of the data item whether a potential borrower owns or rents their house
would be expected to complement the risk score, debt levels (DTI bucket), and
length of employment, since it can give a potential lender additional data on the
borrower.
12. An analysis of the rejected loans suggests that 85,793 of the total 645,414 rejected
loans were from the state of California. That represents 13.29 percent of the total
rejected loans. This is greater than the relative population of California compared to
the United States, which is roughly 12 percent per the 2010 census.
13. A PivotTable analysis of the rejected loans suggests that about 30.6 percent
(762/2,494) of those in the excellent risk (credit) score range asked for a loan with a
high debt-to-income ratio (the high DTI bucket).
2. (LO 1-4) Which data approach attempts to assign each unit in a population into a
small set of classes (or groups) where the unit best fits?
a. Regression
b. Similarity matching
c. Co-occurrence grouping
d. Classification
3. (LO 1-4) Which data approach attempts to identify similar individuals based on data
known about them?
a. Classification
b. Regression
c. Similarity matching
d. Data reduction
4. (LO 1-4) Which data approach attempts to predict connections between two data
items?
a. Profiling
b. Classification
c. Link prediction
d. Regression
a. Big Data
b. Data warehouse
c. Data dictionary
d. Data Analytics
6. (LO 1-5) Which skills were not emphasized that analytic-minded accountants
should have?
7. (LO 1-5) In which areas were skills not emphasized for analytic-minded
accountants?
a. Data quality
8. (LO 1-4) The IMPACT cycle includes all except the following steps:
d. track outcomes.
9. (LO 1-4) The IMPACT cycle specifically includes all except the following steps:
a. data preparation.
b. communicate insights.
10. (LO 1-1) By the year 2024, the volume of data created, captured, copied, and
consumed worldwide is expected to reach 149 ________.
a. zettabytes
b. petabytes
c. exabytes
d. yottabytes
suggested that Data Analytics would be increasingly implementing Big Data in their
business processes. Why is that? How can Data Analytics help accountants do
their jobs?
2. (LO 1-1) Define Data Analytics and explain how a university might use its
businesses.
4. (LO 1-3) Give a specific example of how Data Analytics creates value for auditing.
5. (LO 1-3) How might Data Analytics be used in financial reporting? And how might it
6. (LO 1-3) How is the role of management accounting similar to the role of the data
analyst?
7. (LO 1-4) Describe the IMPACT cycle. Why does its order of the processes and its
8. (LO 1-4) Why is identifying the question such a critical first step in the IMPACT
process cycle?
9. (LO 1-4) What is included in mastering the data as part of the IMPACT cycle
10. (LO 1-4) What data approach mentioned in the chapter might be used by Facebook
to find friends?
11. (LO 1-4) Auditors will frequently use the data reduction approach when considering
the total number of transactions might be important for auditors to assess risk.
12. (LO 1-4) Which data approach might be used to assess the appropriate level of the
13. (LO 1-6) Why might the debt-to-income attribute included in the declined loans
dataset considered in the chapter be a predictor of declined loans? How about the
14. (LO 1-6) To address the question “Will I receive a loan from LendingClub?” we had
available data to assess the relationship among (1) the debt-to-income ratios and
number of rejected loans, (2) the length of employment and number of rejected
loans, and (3) the credit (or risk) score and number of rejected loans. What
additional data would you recommend to further assess whether a loan would be
granted or rejected?
Problems
1. (LO 1-4) Match each specific Data Analytics test to a specific test approach, as part
Classification
Regression
Similarity Matching
Clustering
Co-occurrence Grouping
Profiling
Link Prediction
Data Reduction
Specific Data Analytics Test | Test Approach
2. (LO 1-4) Match each of the specific Data Analytics tasks to the stage of the
IMPACT cycle:
Communicate Insights
Track Outcomes
Specific Data Analytics Task | Stage of IMPACT Cycle
Excel
Power Query
Power BI
Power Automate
Specific Analysis Need/Characteristic | Microsoft Track Tool
1. Basic visualization
2. Robotics process automation
3. Data joining
4. Advanced visualization
5. Works on Windows/Mac/Online
platforms
6. Dashboards
7. Collect data from multiple sources
8. Data cleaning
4. (LO 1-5) Match the specific analysis need/characteristic to the appropriate Tableau
Tableau Desktop
Tableau Public
1. Advanced visualization
2. Analyze and share public datasets
3. Data joining
Specific Analysis Need/Characteristic | Tableau Track Tool
4. Presentations
5. Data transformation
6. Dashboards
7. Data cleaning
5. (LO 1-6) Navigate to the Connect Additional Student Resources page. Under
Chapter 1 Data Files, download and consider the LendingClub data dictionary file,
specifically the data dictionary for the loans that were funded. Choose among these
attributes in the data dictionary and indicate which are likely to be predictive that
loans will go delinquent or will ultimately be fully repaid, and which are not predictive.
Predictive Attributes | Predictive? (Yes/No)
6. (LO 1-6) Navigate to the Connect Additional Student Resources page. Under
Chapter 1 Data Files, download and consider the rejected loans dataset of
LendingClub data titled “DAA Chapter 1-1 Data.” Choose among these attributes
in the data dictionary, and indicate which are likely to be predictive of loan rejection
and which are not.
1. Amount Requested
2. Zip Code
3. Loan Title
4. Debt-To-Income Ratio
5. Application Date
6. Risk_Score
7. Employment Length
7. (LO 1-6) Navigate to the Connect Additional Student Resources page. Under
Chapter 1 Data Files, download and consider the rejected loans dataset of
LendingClub data titled “DAA Chapter 1-1 Data” from the Connect website and
perform an Excel PivotTable by state; then figure out the number of rejected
applications for the state of Arkansas. That is, count the loans by state and
compute the percentage of the total rejected loans in the United States that came
from Arkansas. How close is that to the relative proportion of the population of
Arkansas as compared to the overall U.S. population (per 2010 census)? Use your
browser to find the population of Arkansas and the United States and calculate the
7A. Multiple Choice: What is the percentage of total loans rejected in the United
States that came from Arkansas?
7B. Multiple Choice: Is this loan rejection percentage greater than the percentage
of the U.S. population that lives in Arkansas (per 2010 census)?
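For readers working in Python instead of Excel, the state-by-state counts used in Problems 7 and 8 reduce to a value count over the State column. The data frame and column names below are assumptions consistent with the data dictionary, and the resulting share can then be compared to the state’s share of the 2010 census population as the problem asks.

```python
# Rejected applications by state, as raw counts and as a share of all rejected loans.
by_state = rejected["State"].value_counts()
state_share = rejected["State"].value_counts(normalize=True)

print(by_state.get("AR", 0))              # count of rejected applications from Arkansas
print(f"{state_share.get('AR', 0):.2%}")  # Arkansas's share of all rejected applications
```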
8. (LO 1-6) Download the rejected loans dataset of LendingClub data titled “DAA
Chapter 1-1 Data” from Connect Additional Student Resources and do an Excel
PivotTable by state; then figure out the number of rejected applications for each
state.
8A. Put the following states in order of their loan rejection percentage based on the
count of rejected loans (from high [1] to low [11]) of the total rejected loans.
Does each state’s loan rejection percentage roughly correspond to its relative
proportion of the U.S. population?
1. Arkansas (AR)
2. Hawaii (HI)
3. Kansas (KS)
4. New Hampshire (NH)
5. New Mexico (NM)
6. Nevada (NV)
7. Oklahoma (OK)
8. Oregon (OR)
9. Rhode Island (RI)
10. Utah (UT)
11. West Virginia (WV)
8B. What is the state with the highest percentage of rejected loans?
8C. What is the state with the lowest percentage of rejected loans?
8D. Analysis: Does each state’s loan rejection percentage roughly correspond to its
relative proportion of the U.S. population (by 2010 U.S. census at
https://en.wikipedia.org/wiki/2010_United_States_census)?
For Problems 9, 10, and 11, we will be cleaning a data file in preparation for
subsequent analysis.
The analysis performed on LendingClub data in the chapter (“DAA Chapter 1-1
Data”) was for the years 2007–2012. For this and subsequent problems, please
download the rejected loans table for 2013 from Connect Additional Student Resources.
9. (LO 1-6) Consider the 2013 rejected loan data from LendingClub titled “DAA
Chapter 1-2 Data” from Connect Additional Student Resources. Browse the file in
Excel to ensure there are no missing data. Because our analysis requires risk
scores, debt-to-income data, and employment length, we need to make sure each
of them has valid data. There should be 669,993 observations.
a. Assign each risk score to a risk score bucket similar to the chapter. That is,
classify the sample according to this breakdown into excellent, very good, good,
fair, poor, and very bad credit according to their credit score noted in Exhibit 1-
13. Classify those with a score greater than 850 as “Excellent.” Consider using
nested if–then statements to complete this. Or sort by risk score and manually
input into appropriate risk score buckets.
b. Run a PivotTable analysis that shows the number of loans in each risk score
bucket.
Which risk score bucket had the most rejected loans (most observations)? Which
risk score bucket had the least rejected loans (least observations)? Is it similar to
the results for the 2007–2012 rejected loans analyzed in the chapter?
10. (LO 1-6) Consider the 2013 rejected loan data from LendingClub titled “DAA
Chapter 1-2 Data.” Browse the file in Excel to ensure there are no missing data.
Because our analysis requires risk scores, debt-to-income data, and employment
length, we need to make sure each of them has valid data. There should be
669,993 observations.
a. Assign each valid debt-to-income ratio into three buckets (labeled DTI bucket) by
classifying each debt-to-income ratio into high (>20.0 percent), medium (10.0–
20.0 percent), and low (<10.0 percent) buckets. Consider using nested if–then
statements to complete this. Or sort the row and manually input.
b. Run a PivotTable analysis that shows the number of loans in each DTI bucket.
Which DTI bucket had the highest and lowest grouping for this rejected Loans
dataset? Any interpretation of why these loans were rejected based on debt-to-
income ratios?
11. (LO 1-6) Consider the 2013 rejected loan data from LendingClub titled “DAA
Chapter 1-2 Data.” Browse the file in Excel to ensure there are no missing data.
Because our analysis requires risk scores, debt-to-income data, and employment
length, we need to make sure each of them has valid data. There should be
669,993 observations.
a. Assign each risk score to a risk score bucket similar to the chapter. That is,
classify the sample according to this breakdown into excellent, very good, good,
fair, poor, and very bad credit according to their credit score noted in Chapter 1.
Classify those with a score greater than 850 as “Excellent.” Consider using
nested if-then statements to complete this. Or sort by risk score and manually
input into appropriate risk score buckets (similar to Problem 9).
b. Assign each debt-to-income ratio into three buckets (labeled DTI bucket) by
classifying each debt-to-income ratio into high (>20.0 percent), medium (10.0–
20.0 percent), and low (<10.0 percent) buckets. Consider using
nested if-then statements to complete this. Or sort the row and
manually classify into the appropriate bucket.
c. Run a PivotTable analysis to show the number of excellent risk scores but high
DTI bucket loans in each employment year bucket.
Which employment length group had the most observations to go along with
excellent risk scores but high debt-to-income? Which employment year group had
the least observations to go along with excellent risk scores but high debt-to-income?
LABS
Microsoft | Excel
Tableau | Desktop
Throughout the lab you will be asked to answer questions about the
process and the results. Add your screenshots to your screenshot lab
document. All objective and analysis questions should be answered in
Connect or included in your screenshot lab document, depending on your
instructor’s preferences.
5. Close the Power Query Editor and close your Excel workbook.
6. Open Power BI Desktop and close the welcome screen.
7. Take a screenshot (label it 1-0MB) of the Power BI Desktop
workspace and paste it into your lab document.
8. Close Power BI Desktop.
Tableau | Prep, Desktop
Case Summary: Let’s see how we might perform some simple Data
Analytics. The purpose of this lab is to help you identify relevant questions
that may be answered using Data Analytics.
You were just hired as an analyst for a credit rating agency that evaluates
publicly listed companies in the United States. The agency already has
some Data Analytics tools that it uses to evaluate financial statements and
determine which companies have higher risk and which companies are
growing quickly. The agency uses these analytics to provide ratings that
will allow lenders to set interest rates and determine whether to lend money
in the first place. As a new analyst, you’re determined to make a good first
impression.
Lab 1-1 Part 1 Identify the Questions
Think about ways that you might analyze data from a financial statement.
You could use a horizontal analysis to view trends over time, a vertical
analysis to show account proportions, or ratios to analyze relationships.
Before you begin the lab, you should create a new blank Word document
where you will record your screenshot and save it as Lab 1-1 [Your name]
[Your email address].docx.
Attribute Description
id Loan identification number
member_id Membership identification number
addr_state State
Case Summary: ABC Company is a large retailer that collects its order-to-
cash data in a large ERP system that was recently updated to comply with
the AICPA’s audit data standards. ABC Company currently collects all
relevant data in the ERP system and digitizes any contracts, orders, or
receipts that are completed on paper. The credit department reviews
customers who request credit. Sales orders are approved by managers
before being sent to the warehouse for preparation and shipment. Cash
receipts are collected by a cashier and applied to a customer’s outstanding
balance by an accounts receivable clerk.
You have been assigned to the audit team that will perform the internal
controls audit of ABC Company. In this lab, you should identify appropriate
questions and develop a hypothesis for each question. Then you should
translate questions into target fields and values in a database and perform a
simple analysis.
Lab 1-3 Part 1 Identify the Questions
Your audit team has been tasked with identifying potential internal control
weaknesses within the order-to-cash process. You have been asked to
consider what the risk of internal control weakness might look like and how
the data might help identify it.
Before you begin the lab, you should create a new blank Word document
where you will record your screenshot and save it as Lab 1-3 [Your name]
[Your email address].docx.
1. Open your web browser and search for “Audit data standards order to
cash.” Follow the link to the “Audit Data Standards Library—AICPA,”
then look for the “Audit Data Standard—Order to Cash Subledger
Standard” PDF document.
2. Quickly scroll through the document and evaluate the tables (e.g.,
Sales_Orders_YYYYMMDD_YYYYMMDD), field names (e.g.,
Sales_Order_ID), and descriptions (e.g., “Unique identifier for each
sales order.”).
3. Take a screenshot (label it 1-3A) of the page showing 2.1
Sales_Orders_YYYYMMDD_YYYYMMDD.
4. As you skim the tables, make note of any data elements you identified in
Part 1 that don’t appear in the list of fields in the audit data standard.
This is a gifted dataset that is based on real operational data. Like any
real database, integrity problems may be noted. This can provide a unique
opportunity not only to be exposed to real data, but also to illustrate the
effects of data integrity problems.
For this lab, you should rely on your creativity and prior business
knowledge to answer the following analysis questions. Answer these
questions in your lab doc or in Connect and then continue to the next part of
this lab.
Attribute | Description | Sample values
CUST_ID | Unique identifier representing a customer instance | 219948527, 219930818
CITY | City where the customer lives | HOUSTON, COOS BAY
STATE | State where the customer lives | FL, TX

DEPARTMENT table:
Attribute | Description | Sample values
DEPT | The Dillard's unique identifier for a collection of merchandise within a store format | 0471, 0029
DEPTCENT | The first two digits of a department code, a way to classify departments at a higher level | 04XX, 00XX

SKU table:
Attribute | Description | Sample values
SKU | Unique identifier for an item; identifies the item by size within a color and style for a particular vendor | 0557578, 6383039
CLASSIFICATION | Category used to sort products into logical groups | Dress Shoe
PACKSIZE | Number that describes how many of the product come in a package | 001, 002

SKU_STORE table:
Attribute | Description | Sample values
STORE | The numerical identifier for a Dillard's store | 915, 701
SKU | Unique identifier for an item; identifies the item by size within a color and style for a particular vendor | 4305296, 6137609

STORE table:
Attribute | Description | Sample values
STORE | The numerical identifier for any type of Dillard's location | 767, 460

TRANSACT table:
Attribute | Description | Sample values
TRANSACTION_ID | Unique numerical identifier for each scan of an item at a register | 40333797, 15129264
TRAN_DATE | Calendar date the transaction occurred in a store | 1/1/2015, 5/19/2014
STORE | The numerical identifier for any type of Dillard's location | 716, 205
REGISTER | The numerical identifier for the register where the item was scanned | 91, 55, 12
Lab Note: The tools presented in this lab periodically change. Updated
instructions, if applicable, can be found in the eBook and lab walkthrough
videos in Connect.
Case Summary: Dillard’s is a department store with approximately 330
stores in 29 states in the United States. Its headquarters is located in Little
Rock, Arkansas. You can learn more about Dillard’s by looking at
finance.yahoo.com (ticker symbol = DDS) and the Wikipedia site for DDS.
You’ll quickly note that William T. Dillard II is an accounting grad of the
University of Arkansas and the Walton College of Business, which may be
why he shared transaction data with us to make available for this lab and
labs throughout this text. In this lab, you will learn how to load Dillard’s
data into the tools used for data analysis.
Data: Dillard’s sales data are available only on the University of
Arkansas Remote Desktop (waltonlab.uark.edu). See your instructor for
login credentials.
Lab 1-5 Part 1 Load the Dillard’s Data in
Excel + Power Query and Tableau Prep
Before you begin the lab, you should create a new blank Word document
where you will record your screenshots and save it as Lab 1-5 [Your name]
[Your email address].docx.
From the Walton College website, we note the following:
This is a gifted dataset that is based on real operational data. Like any
real database, integrity problems may be noted. This can provide a unique
opportunity not only to be exposed to real data, but also to illustrate the
effects of data integrity problems. The TRANSACT table itself contains
107,572,906 records. Analyzing the entire population would take a
significant amount of computational time, especially if multiple users are
querying it at the same time.
In Part 1 of this lab, you will learn how to load the Dillard’s data into
either Excel + Power Query or Tableau Prep so that you can extract,
transform, and load the data for later assignments. You will also filter the
data to a more manageable size. In Part 2, you will learn how to load the
Dillard’s data into either Power BI Desktop or Tableau Desktop to prepare
your data for visualization and Data Analytics models.
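As an alternative to Power Query or Tableau, a filtered extract of the TRANSACT table can also be pulled with Python. The sketch below is only illustrative: the server and database names come from the lab, but the ODBC driver, user, and password are placeholders you would replace with whatever your environment provides.

```python
import pandas as pd
from sqlalchemy import create_engine

# Server and database names come from the lab (essql1.walton.uark.edu, WCOB_DILLARDS);
# the ODBC driver, user, and password are placeholders for your own environment.
engine = create_engine(
    "mssql+pyodbc://USER:PASSWORD@essql1.walton.uark.edu/WCOB_DILLARDS"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

# Pull only one week of transactions (1/1/2014-1/7/2014) so the 107-million-row
# TRANSACT table stays manageable, mirroring the date filter applied in the lab.
query = """
    SELECT TRANSACTION_ID, TRAN_DATE, STORE, REGISTER
    FROM TRANSACT
    WHERE TRAN_DATE BETWEEN '2014-01-01' AND '2014-01-07'
"""
transact_week = pd.read_sql(query, engine)
print(transact_week.shape)
```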
Tableau | Prep
8. Now that you have loaded your data into Power BI Desktop,
continue to explore the data:
Tableau | Desktop
1. Create a new workbook in Tableau.
2. Go to Connect > To a Server > Microsoft SQL Server.
3. Enter the following and click Sign In:
a. Server: essql1.walton.uark.edu
b. Database: WCOB_DILLARDS
a. Click Add….
b. Choose Tran Date and click OK.
c. Choose Range of Dates and click Next.
d. Drag the sliders to limit the data from 1/1/2014 to 1/7/2014 and
click OK. Note: Future labs may ask you to load different date
ranges.
e. Take a screenshot (label it 1-5TD).
f. Click OK to return to the Data Source screen.
8. Click the TRANSACT table and then click Update Now to
preview the data.
9. When you are finished answering the lab questions you may close
Tableau. Save your file as Lab 1-5 Dillard’s Filter.twb.
Note: Tableau will try to query the server after each change you
make and will take up to a minute. After each change, click Cancel
to stop the query until you’re ready to prepare the final report.
Chapter 2
Mastering the Data
A Look Back
Chapter 1 defined Data Analytics and explained that the value of Data Analytics is in the
insights it provides. We described the Data Analytics Process using the IMPACT cycle
model and explained how this process is used to address both business and accounting
questions. We specifically emphasized the importance of identifying appropriate questions
that Data Analytics might be able to address.
A Look Ahead
Chapter 3 describes how to go from defining business problems to analyzing data,
answering questions, and addressing business problems. We identify four types of Data
Analytics (descriptive, diagnostic, predictive, and prescriptive analytics) and describe
various approaches and techniques that are most relevant to analyzing accounting data.
We are lucky to live in a world in which data are abundant. However, even with rich sources
of data, when it comes to being able to analyze data and turn them into useful information
and insights, very rarely can an analyst hop right into a dataset and begin analyzing.
Datasets almost always need to be cleaned and validated before they can be used. Not
knowing how to clean and validate data can, at best, lead to frustration and poor insights
and, at worst, lead to horrible security violations. While this text takes advantage of open
source datasets, these datasets have all been scrubbed not only for accuracy, but also to
protect the security and privacy of any individual or company whose details were in the
original dataset.
In 2015, a pair of researchers named Emil Kirkegaard and Julius Daugbjerg Bjerrekaer
scraped data from OkCupid, a free dating website, and posted the data to the “Open
Science Framework,” a platform researchers use to obtain and share raw data. While the
aim of the Open Science Framework is to increase transparency, the researchers in this
instance took that a step too far—and a step into illegal territory. Kirkegaard and Bjerrekaer
did not obtain permission from OkCupid or from the 70,000 OkCupid users whose
identities, ages, genders, religions, personality traits, and other personal details maintained
by the dating site were provided to the public without any work being done to anonymize or
sanitize the data. If the researchers had taken the time to not just validate that the data were
complete, but also to sanitize them to protect the individuals’ identities, this would not have
been a threat or a news story. On May 13, 2015, the Open Science Framework removed the
OkCupid data from the platform, but the damage of the privacy breach had already been
done.1
A 2020 report suggested that “Any consumer with an average number of apps on their
phone—anywhere between 40 and 80 apps—will have their data shared with hundreds or
perhaps thousands of actors online,” said Finn Myrstad, the digital policy director for the
Norwegian Consumer Council, commenting specifically about dating apps.2
All told, data privacy and ethics will continue to be an issue for data providers and data
users. In this chapter, we look at the ethical considerations of data collection and data use
as part of mastering the data.
OBJECTIVES
After reading this chapter, you should be able to:
LO 2-1 Understand available internal and external data sources and how data
are organized in an accounting information system.
LO 2-4 Describe the ethical considerations of data collection and data use.
As you learned in Chapter 1, Data Analytics is a process, and we follow an established
Data Analytics model called the IMPACT cycle.3 The IMPACT cycle begins with
identifying business questions and problems that can be, at least partially, addressed with
data (the “I” in the IMPACT model). Once the opportunity or problem has been identified,
the next step is mastering the data (the “M” in the IMPACT model), which requires you
to identify and obtain the data needed for solving the problem. Mastering the data requires
a firm understanding of what data are available to you and where they are stored, as well
as being skilled in the process of extracting, transforming, and loading (ETL) the data in
preparation for data analysis. While the extraction piece of the ETL process may often be
completed by the information systems team or a database administrator, it is also possible
that you will have access to raw data that you will need to extract out of the source
database. Both methods of requesting data for extraction and of extracting data yourself
are covered in this chapter. The mastering the data step can be described via the ETL
process. The ETL process is made up of the following five steps:
Step 1 Determine the purpose and scope of the data request (extract).
Step 2 Obtain the data (extract).
Step 3 Validate the data for completeness and integrity (transform).
Step 4 Clean the data (transform).
Step 5 Load the data in preparation for data analysis (load).
This chapter will provide details for each of these five steps.
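To make the five steps concrete, here is a minimal pandas sketch of the ETL flow under assumed file and column names (in practice the extract would often come from the source system or be requested from the database administrator rather than read from a CSV export).

```python
import pandas as pd

# Steps 1-2: determine the purpose/scope and obtain the data (extract).
# Here the extract is a hypothetical CSV export of rejected-loan data.
loans = pd.read_csv("rejected_loans_export.csv", parse_dates=["Application Date"])

# Step 3: validate the data for completeness and integrity (transform),
# for example by checking row counts and profiling missing values.
assert len(loans) > 0, "Extract returned no rows"
print(loans.isna().sum())

# Step 4: clean the data (transform): drop rows missing fields the analysis requires.
required = ["Risk_Score", "Debt-To-Income Ratio", "Employment Length"]
clean = loans.dropna(subset=required)

# Step 5: load the data in preparation for analysis (load): here, to a local file.
clean.to_csv("rejected_loans_clean.csv", index=False)
```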
Before you can identify and obtain the data, you must have a comfortable grasp on what
data are available to you and where such data are stored.
Exhibit 2-1 provides an example of different categories of external data sources
including economic, financial, governmental, and other sources. Each of these may be
useful in addressing accounting and business questions.
EXHIBIT 2-1
Potential External Data Sources Available to Address Business and Accounting
Questions
Category | Dataset Description | Website
Economics | BRICS World Bank Indicators (Brazil, Russia, India, China and South Africa) | https://www.kaggle.com/docstein/brics-world-bank-indicators
Economics | Bureau of Economic Analysis data | https://www.bls.gov/data/
Financial | Financial statement data | https://www.calcbench.com/
Financial | Financial statement data, EDGAR, Securities and Exchange Commission | https://www.sec.gov/edgar.shtml
Financial | Analyst forecasts | Yahoo! Finance (finance.yahoo.com), Analysis Tab
Financial | Stock market dataset | https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs
Financial | Credit card fraud detection | https://www.kaggle.com/mlg-ulb/creditcardfraud
Financial | Daily News/Stock Market Prediction | https://www.kaggle.com/aaron7sun/stocknews
Financial | Retail Data Analytics | https://www.kaggle.com/manjeetsingh/retaildataset
Financial | Peer-to-peer lending data of approved and rejected loans | lendingclub.com (requires login)
Financial | Daily stock prices (and weekly and monthly) | Yahoo! Finance (finance.yahoo.com), Historical Data Tab
Financial | Financial and economic summaries by industry | https://pages.stern.nyu.edu/~adamodar/New_Home_Page/datacurrent.html
General | data.world | https://data.world/
General | kaggle.com | https://www.kaggle.com/datasets
Government | State of Ohio financial data (Data Ohio) | https://data.ohio.gov/wps/portal/gov/data/
Government | City of Chicago financial data | https://data.cityofchicago.org
Government | City of New York financial data | https://www.checkbooknyc.com/spending_landing/yeartype/B/year/119
Marketing | Amazon product reviews | https://data.world/datafiniti/consumer-reviews-of-amazon-products
Other | Restaurant safety | https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j
Other | Citywide payroll data | https://data.cityofnewyork.us/City-Government/Citywide-Payroll-Data-Fiscal-Year-/k397-673e
Other | Property valuation/assessment | https://data.cityofnewyork.us/City-Government/Property-Valuation-and-Assessment-Data/yjxr-fw8i
Other | USA facts—our country in numbers | https://www.irs.gov/uac/tax-stats
Other | Interesting fun datasets—14 data science projects with data | https://towardsdatascience.com/14-data-science-projects-to-do-during-your-14-day-quarantine-8bd60d1e55e1
Other | Links to Big Data Sets—Amazon Web Services | https://aws.amazon.com/public-datasets/
Real Estate | New York Airbnb data explanation | https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
Real Estate | U.S. Airbnb data | https://www.kaggle.com/kritikseth/us-airbnb-open-data/tasks?taskId=2542
Real Estate | TripAdvisor hotel reviews | https://www.kaggle.com/andrewmvd/trip-advisor-hotel-reviews
Retail | Retail sales forecasting | https://www.kaggle.com/tevecsystems/retail-sales-forecasting
EXHIBIT 2-2
Procure-to-Pay Database Schema (Simplified)
In this text, we will work with data in a variety of forms, but regardless of the tool we use
to analyze data, structured data should be stored in a normalized relational database.
There are occasions for working with data directly in the relational database, but many
times when we work with data analysis, we’ll prefer to export the data from the relational
database and view it in a more user-friendly form. The benefit of storing data in a
normalized database outweighs the downside of having to export, validate, and
sanitize the data every time you need to analyze the information.
Storing data in a normalized, relational database instead of a flat file ensures that data
are complete, not redundant, and that business rules and internal controls are enforced; it
also aids communication and integration across business processes. Each one of these
benefits is detailed here:
Completeness. Ensures that all data required for a business process are included in the
dataset.
No redundancy. Storing redundant data should be avoided for several reasons: It takes up unnecessary space (which is expensive), it requires extra processing to reconcile multiple versions of the truth, and it increases the risk of data-entry errors. Storing data in flat files yields a great deal of redundancy, whereas normalized relational databases require one version of the truth, with each element of data stored in only one place.
Business rules enforcement. As will become increasingly evident as we progress
through the material in this text, relational databases can be designed to aid in the
placement and enforcement of internal controls and business rules in ways that flat
files cannot.
Communication and integration of business processes. Relational databases should be
designed to support business processes across the organization, which results in
improved communication across functional areas and more integrated business
processes.4
It is worth taking a moment to appreciate the benefits of storing data in a relational database, because doing so is not necessarily easier when it comes to building the data model or understanding the structure. It is arguably more complex to normalize your data than it is to throw redundant data, without business rules or internal controls, into a spreadsheet.
Columns in a Table: Primary Keys, Foreign Keys, and
Descriptive Attributes
When requesting data, it is critical to understand how the tables in a relational database are
related. This is a brief overview of the different types of attributes in a table and how these
attributes support the relationships between tables. It is certainly not a comprehensive take
on relational data modeling, but it should be adequate in preparing you for creating data
requests.
Every column in a table must be both unique and relevant to the purpose of the table.
There are three types of columns: primary keys, foreign keys, and descriptive attributes.
Each table must have a primary key. The primary key is typically made up of one
column. The purpose of the primary key is to ensure that each row in the table is unique,
so it is often referred to as a “unique identifier.” It is rarely truly descriptive; instead, a
collection of letters or simply sequential numbers are often used. As a student, you are
probably already very familiar with your unique identifier—your student ID number at the
university is the way you as a student are stored as a unique record in the university’s data
model! Other examples of unique identifiers that you are familiar with would be Amazon
order numbers, invoice numbers, account numbers, Social Security numbers, and driver’s
license numbers.
One of the biggest differences between a flat file and a relational database is simply
how many tables there are—when you request your data into a flat file, you’ll receive one
big table with a lot of redundancy. While this is often ideal for analyzing data, when the data are stored in the database, each group of information is stored in
a separate table. Then, the tables that are related to one another are identified (e.g.,
Supplier and Purchase Order are related; it’s important to know which Supplier the
Purchase Order is from). The relationship is created by placing a foreign key in one of the
two tables that are related. The foreign key is another type of attribute, and its function is
to create the relationship between two tables. Whenever two tables are related, one of
those tables must contain a foreign key to create the relationship.
The other columns in a table are descriptive attributes. For example, Supplier Name
is a critical piece of data when it comes to understanding the business process, but it is not
necessary to build the data model. Primary and foreign keys facilitate the structure of a
relational database, and the descriptive attributes provide actual business information.
Refer to Exhibit 2-2, the database schema for a typical procure-to-pay process. Each
table has an attribute with the letters “PK” next to it—these are the primary keys for
each table. The primary key for the Materials Table is “Item_Number,” the primary key
for the Purchase Order Table is “PO_Number,” and so on. Several of the tables also have
attributes with the letters “FK” next to them—these are the foreign keys that create the
relationship between pairs of tables. For example, look at the relationship between the
Supplier Table and the Purchase Order Table. The primary key in the Supplier Table is
“Supplier ID.” The line between the two tables links the primary key to a foreign key in
the Purchase Order Table, also named “Supplier ID.”
The Line Items Table in Exhibit 2-3 has so much detail in it that it requires two
attributes to combine as a primary key. This is a special case of a primary key often
referred to as a composite primary key, in which the two foreign keys from the tables
that it is linking combine to make up a unique identifier. The theory and details that
support the necessity of this linking table are beyond the scope of this text—if you can
identify the primary and foreign keys, you’ll be able to identify the data that you need to
request. Exhibit 2-4 shows a subset of the data that are represented by the Purchase Order
table. You can see that each of the attributes listed in the class diagram appears as a
column, and the data for each purchase order are accounted for in the rows.
EXHIBIT 2-3
Line Items Table: Purchase Order Detail Table
EXHIBIT 2-4
Purchase Order Table
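To make these ideas concrete, here is a minimal SQL sketch of how three of the tables in Exhibit 2-2 might be declared. The table and column names follow the exhibit, but the data types and exact constraint syntax are illustrative assumptions rather than the textbook database’s actual definition; the point is simply where the primary keys, foreign keys, and composite primary key sit.

    -- Supplier: one row per supplier, uniquely identified by its primary key
    CREATE TABLE Supplier (
        Supplier_ID    INTEGER PRIMARY KEY,      -- primary key (PK)
        Supplier_Name  VARCHAR(100)              -- descriptive attribute
    );

    -- Purchase_Order: carries a foreign key that points back to Supplier
    CREATE TABLE Purchase_Order (
        PO_Number      INTEGER PRIMARY KEY,      -- primary key (PK)
        Supplier_ID    INTEGER,                  -- foreign key (FK) to Supplier
        PO_Date        DATE,                     -- descriptive attribute
        FOREIGN KEY (Supplier_ID) REFERENCES Supplier (Supplier_ID)
    );

    -- Line_Items: linking table whose two foreign keys combine into a
    -- composite primary key (Materials table omitted from this sketch)
    CREATE TABLE Line_Items (
        PO_Number        INTEGER,                -- FK to Purchase_Order
        Item_Number      INTEGER,                -- FK to Materials
        Quantity_Ordered INTEGER,                -- descriptive attribute
        PRIMARY KEY (PO_Number, Item_Number),
        FOREIGN KEY (PO_Number) REFERENCES Purchase_Order (PO_Number)
    );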
PROGRESS CHECK
1. Referring to Exhibit 2-2, locate the relationship between the Supplier and Purchase Order tables.
What is the unique identifier of each table? (The unique identifier attribute is called the primary
key—more on how it’s determined in the next learning objective.) Which table contains the
attribute that creates the relationship? (This attribute is called the foreign key—more on how it’s determined in the next learning objective.)
2. Referring to Exhibit 2-2, review the attributes in the Purchase Order table. There are two foreign
keys listed in this table that do not relate to any of the tables in the diagram. Which tables do you
think they are? What type of data would be stored in those two tables?
3. Refer to the two tables that you identified in Progress Check 2 that would relate to the Purchase
Order table, but are not pictured in this diagram. Draw a sketch of what the UML Class Diagram
would look like if those tables were included. Draw the two classes to represent the two tables
(i.e., rectangles), the relationships that should exist, and identify the primary keys for the two new
tables.
DATA DICTIONARIES
In the previous section, you learned about how data are stored by focusing on the procure-
to-pay database schema. Viewing schemas and processes in isolation clarifies each
individual process, but it can also distort reality—these schemas typically do not represent
their own separate databases. Rather, each process-specific database schema is a piece of a
greater whole, all combining to form one integrated database.
As you can imagine, once these processes come together to be supported in one
database, the amount of data can be massive. Understanding the processes and the basics
of how data are stored is critical, but even with a sound foundation, it would be nearly
impossible for an individual to remember where each piece of data is stored, or what each
piece of data represents.
Creating and using a data dictionary is paramount in helping database administrators
maintain databases and analysts identify the data they need to use. In Chapter 1, you were
introduced to the data dictionary for the LendingClub data for rejected loans (DAA
Chapter 1-1 Data). The same cut-out of the LendingClub data dictionary is provided in
Exhibit 2-5 as a reminder.
EXHIBIT 2-5
LendingClub Data Dictionary for Rejected Loan Data (DAA Chapter 1-1 Data)
Source: LendingClub Data
Because the LendingClub data are provided in a flat file, the only information
necessary to describe the data are the attribute name (e.g., Amount Requested) and a
description of that attribute. The description ensures that the data in each attribute are used
and analyzed in the appropriate way—it’s always important to remember that technology
will do exactly what you tell it to, so you must be smarter than the computer! If you run
analysis on an attribute thinking it means one thing, when it actually means another, you
could make some big mistakes and bad decisions even when you are working with data
validated for completeness and integrity. It’s critical to get to know the data through
database schemas and data dictionaries thoroughly before attempting to do any data
analysis.
When you are working with data stored in a relational database, you will have more
attributes to keep track of in the data dictionary. Exhibit 2-6 provides an example of a data
dictionary for a generic Supplier table:
EXHIBIT 2-6
Supplier Data Dictionary
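When the data sit in a relational database, much of what a data dictionary records can also be read from the database’s own catalog. As a minimal sketch, assuming a database that exposes the standard INFORMATION_SCHEMA views (SQL Server, MySQL, and PostgreSQL all do) and a table named Supplier, the following query lists each attribute, its data type, and whether it can be left blank:

    SELECT column_name,        -- attribute name
           data_type,          -- how the attribute is stored (text, number, date, etc.)
           is_nullable         -- whether the field is required
    FROM   information_schema.columns
    WHERE  table_name = 'Supplier'
    ORDER  BY ordinal_position;

A catalog query like this tells you what is physically stored, but it does not replace the business descriptions in a data dictionary such as Exhibit 2-6.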
PROGRESS CHECK
4. What is the purpose of the primary key? A foreign key? A nonkey (descriptive) attribute?
5. How do data dictionaries help you understand the data from a database or flat file?
Once you have familiarized yourself with the data via data dictionaries and schemas, you
are prepared to request the data from the database manager or extract the data yourself.
The ETL process begins with identifying which data you need and is complete when the
clean data are loaded in the appropriate format into the tool to be used for analysis.
This process involves:
1. Determining the purpose and scope of the data request.
2. Obtaining the data.
3. Validating the data for completeness and integrity.
4. Cleaning the data.
5. Loading the data for analysis.
Extract
Determine exactly what data you need in order to answer your business questions.
Requesting data is often an iterative process, but the more prepared you are when
requesting data, the more time you will save for yourself and the database team in the long
run.
Requesting the data involves the first two steps of the ETL process. Each step has
questions associated with it that you should try to answer.
Once the purpose of the data request is determined and scoped, as well as any risks and
assumptions documented, the next step is to determine whom to ask and specifically what
is needed, what format is needed (Excel, PDF, database), and by what deadline.
Lab Connection
Lab 2-1 has you work through the process of requesting data from IT.
In a later chapter, you will be provided a deep dive into the audit data standards (ADS)
developed by the American Institute of Certified Public Accountants (AICPA).6 The aim
of the ADS is to alleviate the headaches associated with data requests by serving as a
guide to standardize these requests and specify the format an auditor desires from the
company being audited. These include the following:
1. Order-to-Cash subledger standards
2. Procure-to-Pay subledger standards
3. Inventory subledger standards
4. General Ledger standards
While the ADS provide an opportunity for standardization, they are voluntary.
Regardless of whether your request for data will conform to the standards, a data request
form template (as shown in Exhibit 2-7) can make communication easier between data
requester and provider.
EXHIBIT 2-7
Example Standard Data Request Form
Requester Name:
Requester Contact Number:
Requester Email Address:
Please provide a description of the information needed (indicate which tables and which fields you require):
Intended Audience:
Customer (if not requester):
Once the data are received, you can move on to the transformation phase of the ETL
process. The next step is to ensure the completeness and integrity of the extracted data.
After identifying the goal of the data analysis project in the first step of the IMPACT
cycle, you can follow a similar process to how you would request the data if you were
going to extract it yourself:
1. Identify the tables that contain the information you need. You can do this by looking
through the data dictionary or the relationship model.
2. Identify which attributes, specifically, hold the information you need in each table.
3. Identify how those tables are related to each other.
Once you have identified the data you need, you can start gathering the information.
There are a variety of methods that you could take to retrieve the data. Two will be
explained briefly here—SQL and Excel—and there is a deep dive into SQL in Appendices
D and E, as well as a deep dive into Excel’s VLookup and Index/Match in Appendix B.
SQL: “Structured Query Language” (SQL, often pronounced sequel) is a computer language used to interact with data (tables, records, and attributes) in a database by creating, updating, deleting, and extracting them. For Data Analytics, we only need to focus on extracting
data that match the criteria of our analysis goals. Using SQL, we can combine data from
one or more tables and organize the data in a way that is more intuitive and useful for data
analysis than the way the data are stored in the relational database. A firm understanding
of the data—the tables, how they are related, and their respective primary and foreign keys
—is integral to extracting the data.
Typically, data should be stored in the database and analyzed in another tool such as
Excel, IDEA, or Tableau. However, you can choose to extract only the portion of the data
that you wish to analyze via SQL instead of extracting full tables and transforming the
data in Excel, IDEA, or Tableau. This is especially preferable when the raw data stored in
the database are large enough to overwhelm Excel. Excel 2016 can hold only 1,048,576
rows on one worksheet. When you attempt to bring in full tables that exceed that amount, even with Excel’s powerful Power BI tools, your analysis will slow down if the full table isn’t necessary.
As you will explore in labs throughout this textbook, SQL isn’t limited to use directly within the database. When you plan to perform your analysis in Excel, Power BI, or Tableau, each
tool has an SQL option for you to directly connect to the database and pull in a subset of
the data.
There is more description about writing queries and a chance to practice creating joins
in Appendix E.
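As a minimal sketch of extracting only the portion of data you need, the following query pulls one year of purchase order lines for a single supplier from the procure-to-pay tables in Exhibit 2-2. The join logic follows the primary key/foreign key links described above, but the exact column names, the date range, and the supplier name are illustrative assumptions:

    SELECT po.PO_Number,
           po.PO_Date,
           s.Supplier_Name,
           li.Item_Number,
           li.Quantity_Ordered
    FROM   Purchase_Order AS po
           JOIN Supplier   AS s   ON s.Supplier_ID = po.Supplier_ID   -- PK/FK link
           JOIN Line_Items AS li  ON li.PO_Number  = po.PO_Number     -- PK/FK link
    WHERE  po.PO_Date BETWEEN '2024-01-01' AND '2024-12-31'
      AND  s.Supplier_Name = 'Hypothetical Supplier Co.';

Because the filtering happens in the database, only the rows that matter ever reach Excel or whatever tool you use for the analysis.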
Microsoft Excel or Power BI: When data are not stored in a relational database, or are
not too large for Excel, the entire table can be analyzed directly in a spreadsheet. The
advantage is that further analysis can be done in Excel or Power BI and it is beneficial to
have all the data to drill down into more detail once the initial question is answered. This
approach is often simpler for doing exploratory analysis (more on this in a later chapter).
Understanding the primary key and foreign key relationships is also integral to working
with the data directly in Excel.
When your data are stored directly in Excel, you can also use Excel functions and
formulas to combine data from multiple Excel tables into one table, similar to how you
can join data with SQL in Access or another relational database. Two of Excel’s most useful techniques for looking up data from two separate tables and matching them based on a primary key/foreign key relationship are the VLookup and Index/Match functions. There are a variety of ways that the VLookup or Index/Match functions can be used, but for extracting and transforming data they are best used to add a column to a table.
More information about using VLookup and Index/Match functions in Excel is
provided in Appendix B.
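For comparison with the database approach, here is a minimal SQL sketch of what a VLookup or Index/Match accomplishes: adding one descriptive column from a lookup table to every row of another table by matching the primary key/foreign key pair. The table and column names are assumed from Exhibit 2-2:

    -- Add Supplier_Name to each purchase order row, the same result a
    -- VLookup on Supplier_ID would return as a new column in Excel
    SELECT po.PO_Number,
           po.PO_Date,
           po.Supplier_ID,
           s.Supplier_Name
    FROM   Purchase_Order AS po
           LEFT JOIN Supplier AS s
                  ON s.Supplier_ID = po.Supplier_ID;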
The question of whether to use SQL or Excel’s tools (such as VLookup) is primarily
answered by where the data are stored. Because data are most frequently stored in a
relational database (as discussed earlier in this chapter, due to the efficiency and data integrity benefits relational databases provide), SQL will often be the
best option for retrieving data, after which those data can be loaded into Excel or another
tool for further analysis. Another benefit of SQL queries is that they can be saved and
reproduced at will or at regular intervals. Having a saved SQL query can make it much
easier and more efficient to re-create data requests. However, if the data are already stored
in a flat file in Excel, there is little reason to use SQL. Sometimes when you are
performing exploratory analysis, even if the data are stored in a relational database, it can
be beneficial to load entire tables into Excel and bypass the SQL step. This should be
considered carefully before doing so, though, because relational databases handle large
amounts of data much better than Excel can. Writing SQL queries can also make it easier
to load only the data you need to analyze into Excel so that you do not overwhelm Excel’s
resources.
Source: Robert Half Associates, “Survey: Finance Leaders Report Technology Skills Most Difficult to
Find When Hiring,” August 22, 2019, http://rh-us.mediaroom.com/2019-08-22-Survey-Finance-
Leaders-Report-Technology-Skills-Most-Difficult-To-Find-When-Hiring (accessed January 22, 2021).
Transform
Step 3: Validating the Data for Completeness and Integrity
Anytime data are moved from one location to another, it is possible that some of the data
could have been lost during the extraction. It is critical to ensure that the extracted data are
complete (that the data you wish to analyze were extracted fully) and that the integrity of
the data remains (that none of the data have been manipulated, tampered with, or
duplicated during the extraction). Being able to validate the data successfully requires you
to not only have the technical skills to perform the task, but also to know your data well. If
you know what to reasonably expect from the data in the extraction then you have a higher
likelihood of identifying errors or issues from the extraction. Examples of data validation
questions are: “How many records should have been extracted?” “What checksums or
control totals can be performed to ensure data extraction is accurate?”
The following four steps should be completed to validate the data after extraction:
1. Compare the number of records that were extracted to the number of records in the
source database: This will give you a quick snapshot of whether any data were
skipped or didn’t extract properly due to an error or data type mismatch. This is a
critical first step, but it will not provide information about the data themselves other
than ensuring that the record counts match.
If an error is found, depending on the size of the dataset, you may be able to easily find
the missing or erroneous data by scanning the information with your eyes. However, if the
dataset is large, or if the error is difficult to find, it may be easiest to go back to the
extraction and examine how the data were extracted, fix any errors in the SQL code, and
re-run the extraction.
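As a minimal sketch of these checks, and assuming you can still query the source system, a record count and a control total can be computed on both ends and compared; the PO_Total_Amount column here is a hypothetical monetary field chosen only to illustrate a checksum:

    -- Run once against the source table and once against the extracted data;
    -- both the record count and the control total should match exactly
    SELECT COUNT(*)             AS record_count,
           SUM(PO_Total_Amount) AS control_total   -- hypothetical amount column
    FROM   Purchase_Order;

If the two results differ, the size and direction of the difference often point to the rows that were dropped or duplicated during extraction.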
Lab Connection
Lab 2-5, Lab 2-6, Lab 2-7, and Lab 2-8 explore the process of loading and
validating data.
Step 4: Cleaning the Data
1. Remove headings or subtotals: Depending on the extraction technique used and the file type of the extraction, it is possible that your data could contain headings or subtotals that are not useful for analysis. Of course, these issues could be overcome in the extraction steps of the ETL process if you are careful to request the data in the correct format or to extract only the data you need.
2. Clean leading zeroes and nonprintable characters: Sometimes data will contain
leading zeroes or “phantom” (nonprintable) characters. This will happen particularly
when numbers or dates were stored as text in the source database but need to be
analyzed as numbers. Nonprintable characters can be white spaces, page breaks, line
breaks, tabs, and so on, and can be summarized as characters that our human eyes can’t
see, but that the computer interprets as a part of the string. These can cause trouble
when joining data because, while two strings may look identical to our eyes, the
computer will read the nonprintable characters and will not find a match.
3. Format negative numbers: If there are negative numbers in your dataset, ensure that the formatting will work for your analysis. For example, if your data contain negative numbers formatted in parentheses and you would prefer a negative sign instead, the formatting needs to be corrected and made consistent.
4. Correct inconsistencies across data, in general: If the source database did not enforce certain rules around data entry, it is possible that there are inconsistencies across the data—for example, if there is a state field, Arkansas could be formatted as “AR,” “Ark,” “Ar.,” and so on. These will need to be replaced with a common value before you begin your analysis if you are interested in grouping data geographically. (A brief SQL sketch of these cleanup steps follows this list.)
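Here is a minimal SQL sketch of the cleanup steps above, written against a hypothetical staging table named Raw_Invoices; the column names, the parenthesized negatives, and the state spellings are illustrative assumptions, and tools such as Power Query or Tableau Prep accomplish the same fixes through menus rather than code:

    SELECT
        -- strip spaces and leading zeroes by casting the text to a number
        CAST(TRIM(invoice_number) AS INTEGER) AS invoice_number,
        -- convert "(500.00)"-style negatives into -500.00
        CASE WHEN amount_text LIKE '(%)'
             THEN -CAST(REPLACE(REPLACE(amount_text, '(', ''), ')', '') AS DECIMAL(12, 2))
             ELSE  CAST(amount_text AS DECIMAL(12, 2))
        END AS amount,
        -- standardize inconsistent state entries to one common value
        CASE WHEN UPPER(TRIM(state)) IN ('AR', 'ARK', 'AR.', 'ARKANSAS') THEN 'AR'
             ELSE UPPER(TRIM(state))
        END AS state
    FROM Raw_Invoices;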
Lab Connection
Lab 2-2 and Lab 2-3 walk through how to prepare data for analysis and
resolve common data quality issues.
1. Dates: The most common problems revolve around the date format because there are
so many different ways a date can be presented. For example, look at the different ways
you can show July 6, 2024: 6-Jul-2024; 6.7.2024; 45479 (in Excel); 07/06/2024 (in the
United States); 06/07/2024 (in Europe); and the list goes on. You need to format the
date to match the acceptable format for your tool. The ISO 8601 standard indicates you
should format dates in the year-month-day format (2024-07-06), and most professional
query tools accept this format. If you use Excel to transform dates to this format,
highlight your dates and go to Home > Number > Format Cells and choose Custom.
Then type in YYYY-MM-DD and click OK.
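The same standardization can be scripted when the dates live in a database rather than a worksheet. As a minimal sketch in SQLite syntax, assuming a text column named order_date stored in the U.S. MM/DD/YYYY format in a hypothetical Sales_Orders table, the pieces of the date can be rearranged into the ISO 8601 order:

    SELECT order_date,
           substr(order_date, 7, 4) || '-' ||   -- year
           substr(order_date, 1, 2) || '-' ||   -- month
           substr(order_date, 4, 2)             -- day
               AS order_date_iso                -- '07/06/2024' becomes '2024-07-06'
    FROM   Sales_Orders;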
Load
Step 5: Loading the Data for Data Analysis
If the extraction and transformation steps have been done well by the time you reach this
step, the loading part of the ETL process should be the simplest step. It is so simple, in
fact, that if your goal is to do your analysis in Excel and you have already transformed and
cleaned your data in Excel, you are finished. There should be no additional loading
necessary.
However, it is possible that Excel is not the last step for analysis. The data analysis
technique you plan to implement, the subject matter of the business questions you intend
to answer, and the way in which you wish to communicate results will all drive the choice
of which tool you use to perform your analysis.
Throughout the text, you will be introduced to a variety of different tools to use for analyzing data, including Excel, Power BI, Tableau Prep, and Tableau Desktop. As
these tools are introduced to you, you will learn how to load data into them.
ETL or ELT?
If loading the data into Excel is indeed the last step, are you actually “extracting,
transforming, and loading,” or is it “extracting, loading, and transforming”?
The term ETL has been in popular use since the 1970s, and even though methods for
extracting and transforming data have become easier to use, more accessible, and more robust, the term has stuck. Increasingly, however, the procedure is shifting toward
ELT. Particularly with tools such as Microsoft’s Power BI suite, all of the loading and
transforming can be done within Excel, with data directly loaded into Excel from the database, and then transformed (also within Excel). The most
common method for mastering the data that we use throughout this textbook is more in
line with ELT than ETL; however, even when the order changes from ETL to ELT, it is
still more common to refer to the procedure as ETL.
PROGRESS CHECK
6. Describe two different methods for obtaining data for analysis.
7. What are four common data quality issues that must be fixed before analysis can take place?
Mastering the data goes beyond just ETL processes. Mastering the data also includes
having some assurance that the data collection is not only secure, but also that the ethics of
data collection and data use have been considered.
In the past, the scope of digital risk was limited to cybersecurity threats and keeping the data secure; increasingly, however, the concern is the risk of lacking ethical data practices. Indeed, the concern regarding data gleaned from traditional and nontraditional sources is whether they are used in an ethical manner and for their intended purpose.
Potential ethical issues include an individual’s right to privacy and whether assurance
is offered that certain data are not misused. For example, is the individual about whom
data has been collected able to limit who has access to her personal information, and how
those data are used or shared? If an individual’s credit card is submitted for an e-
commerce transaction, does the customer have assurance that the credit card number will
not be misused?
To address these and other concerns, the Institute of Business Ethics suggests that companies consider the following six questions to allow a business to create value from data use and analysis, and still protect the privacy of stakeholders7:
1. How does the company use data, and to what extent are they integrated into firm
strategy? What is the purpose of the data? Are they accurate or reliable? Will they
benefit the customer or the employee?
2. Does the company send a privacy notice to individuals when their personal data
are collected? Is the request to use the data clear to the user? Do they agree to the
terms and conditions of use of their personal data?
3. Does the company assess the risks linked to the specific type of data the company
uses? Have the risks of data use or data breach of potentially sensitive data been
considered?
4. Does the company have safeguards in place to mitigate the risks of data misuse?
Are preventive controls on data access in place and are they effective? Are penalties
established and enforced for data misuse?
5. Does the company have the appropriate tools to manage the risks of data misuse?
Is the feedback from these tools evaluated and measured? Does internal audit regularly
evaluate these tools?
6. Does our company conduct appropriate due diligence when sharing with or
acquiring data from third parties? Do third-party data providers follow similar
ethical standards in the acquisition and transmission of the data?
The user of the data must continue to recognize the potential risks associated with data
collection and data use, and work to mitigate those risks in a responsible way.
PROGRESS CHECK
8. A firm purchases data from a third party about customer preferences for laundry detergent. How
would you recommend that this firm conduct appropriate due diligence about whether the third-
party data provider follows ethical data practices? An audit? A questionnaire? What questions
should be asked?
Summary
■ The first step in the IMPACT cycle is to identify the questions that you
intend to answer through your data analysis project. Once a data analysis
problem or question has been identified, the next step in the IMPACT
cycle is mastering the data, which includes obtaining the data needed and
preparing it for analysis. We often call the processes associated with
mastering the data ETL, which stands for extract, transform, and load.
(LO 2-2, 2-3)
■ In order to obtain the right data, it is important to have a firm grasp of
what data are available to you and how that information is stored. (LO 2-
2)
◦ Data are often stored in a relational database, which helps to ensure that an
organization’s data are complete and to avoid redundancy. Relational
databases are made up of tables with rows of data that represent records.
Each record is uniquely identified with a primary key. Tables are related to
other tables by using the primary key from one table as a foreign key in
another table.
■ Extract: To obtain the data, you will either have access to extract the data
yourself or you will need to request the data from a database administrator
or the information systems team. If the latter is the case, you will complete
a data request form, indicating exactly which data you need and why. (LO
2-3)
■ Transform: Once you have the data, they will need to be validated for
completeness and integrity—that is, you will need to ensure that all of the
data you need were extracted and that all data are correct. Sometimes
when data are extracted some formatting or sometimes even entire records
will get lost, resulting in inaccuracies. Correcting the errors and cleaning
the data is an integral step in mastering the data. (LO 2-3)
■ Load: Finally, after the data have been cleaned, there may be one last step
of mastering the data, which is to load them into the tool that will be used
for analysis. Often, the cleaning and correcting of data occur in Excel, and
the analysis will also be done in Excel. In this case, there is no need to
load the data elsewhere. However, if you intend to do more rigorous
statistical analysis than Excel provides, or if you intend to do more robust
data visualization than can be done in Excel, it may be necessary to load
the data into another tool following the transformation process. (LO 2-3)
■ Mastering the data goes beyond just the ETL processes. Those who collect
and use data also have the responsibility of being good stewards,
providing some assurance that the data collection is not only secure, but
also that the ethics of data collection and data use have been considered.
(LO 2-4)
Key Words
accounting information system (54) A system that records, processes, reports, and communicates the results of
business transactions to provide financial and nonfinancial information for decision-making purposes.
composite primary key (58) A special case of a primary key that exists in linking tables. The composite primary
key is made up of the two primary keys in the table that it is linking.
customer relationship management (CRM) system (54) An information system for managing all interactions
between the company and its current and potential customers.
data dictionary (59) Centralized repository of descriptions for all of the data attributes of the dataset.
data request form (62) A method for obtaining data if you do not have access to obtain the data directly yourself.
descriptive attributes (58) Attributes that exist in relational databases that are neither primary nor foreign keys.
These attributes provide business information, but are not required to build a database. An example would be
“Company Name” or “Employee Address.”
Enterprise Resource Planning (ERP) (54) Also known as Enterprise Systems, a category of business
management software that integrates applications from throughout the business (such as manufacturing, accounting,
finance, human resources, etc.) into one system.
ETL (60) The extract, transform, and load process that is integral to mastering the data.
flat file (57) A means of storing data in one place, such as in an Excel spreadsheet, as opposed to storing the data in
multiple tables, such as in a relational database.
foreign key (58) An attribute that exists in relational databases in order to carry out the relationship between two
tables. This does not serve as the “unique identifier” for each record in a table. These must be identified when
mastering the data from a relational database in order to extract the data correctly from more than one table.
human resource management (HRM) system (54) An information system for managing all interactions
between the company and its current and potential employees.
mastering the data (54) The second step in the IMPACT cycle; it involves identifying and obtaining the data
needed for solving the data analysis problem, as well as cleaning and preparing the data for analysis.
primary key (57) An attribute that is required to exist in each table of a relational database and serves as the
“unique identifier” for each record in a table.
relational database (56) A means of storing data in order to ensure that the data are complete, not redundant, and
to help enforce business rules. Relational databases also aid in communication and integration of business processes
across an organization.
supply chain management (SCM) system (54) An information system that helps manage all the company’s
interactions with suppliers.
Answers to Progress Checks
1. The unique identifier (primary key) of the Supplier table is [Supplier ID], and the unique identifier of the Purchase Order table is [PO Number]. The Purchase Order table contains the foreign key.
2. The foreign key attributes in the Purchase Order table that do not relate to any tables in the view are
EmployeeID and CashDisbursementID. These attributes probably relate to the Employee table (so
that we can tell which employee was responsible for each Purchase Order) and the Cash
Disbursement table (so that we can tell if the Purchase Orders have been paid for yet, and if so, on
which check). The Employee table would be a complete listing of each employee, as well as
containing the details about each employee (for example, phone number, address, etc.). The Cash
Disbursement table would be a listing of the payments the company has made.
3.
4. The purpose of the primary key is to uniquely identify each record in a table. The purpose of a foreign
key is to create a relationship between two tables. The purpose of a descriptive attribute is to provide
meaningful information about each record in a table. Descriptive attributes aren’t required for a
database to run, but they are necessary for people to gain business information about the data stored
in their databases.
5. Data dictionaries provide descriptions of the function (e.g., primary key or foreign key when
applicable), data type, and field names associated with each column (attribute) of a database. Data
dictionaries are especially important when databases contain several different tables and many
different attributes in order to help analysts identify the information they need to perform their
analysis.
6. Depending on the level of security afforded to a business analyst, she can either obtain data directly
from the database herself or she can request the data. When obtaining data herself, the analyst must
have access to the raw data in the database and a firm knowledge of SQL and data extraction
techniques. When requesting the data, the analyst doesn’t need the same level of extraction skills,
but she still needs to be familiar enough with the data to identify which tables and attributes contain the information needed for the request.
7. Four common issues that must be fixed are removing headings or subtotals, cleaning leading zeroes
or nonprintable characters, formatting negative numbers, and correcting inconsistencies across the
data.
8. Firms can ask to see the terms and conditions of their third-party data supplier, and ask questions to
come to an understanding regarding if and how privacy practices are maintained. They also can
evaluate what preventive controls on data access are in place and assess whether they are followed.
Generally, an audit does not need to be performed, but requesting a questionnaire be filled out would
be appropriate.
2. (LO 2-3) Which of the following describes part of the goal of the ETL process?
3. (LO 2-2) The advantages of storing data in a relational database include which of the following?
e. a and b
f. b and c
g. a and c
5. (LO 2-2) Which attribute is required to exist in each table of a relational database and serves as the “unique identifier” for each record in a table?
a. Foreign key
b. Unique identifier
c. Primary key
d. Key attribute
6. (LO 2-2) The metadata that describe each attribute in a database are which of the following?
b. Data dictionary
c. Descriptive attributes
d. Flat file
7. (LO 2-3) As mentioned in the chapter, which of the following is not a common way that data will need to be cleaned?
8. (LO 2-2) Why is Supplier ID considered to be a primary key for a Supplier table?
b. It is a 10-digit number.
9. (LO 2-2) What are attributes that exist in a relational database that are neither primary nor foreign
keys?
a. Nondescript attributes
b. Descriptive attributes
c. Composite keys
10. (LO 2-4) Which of the following questions are not suggested by the Institute of Business Ethics to
allow a business to create value from data use and analysis, and still protect the privacy of
stakeholders?
a. How does the company use data, and to what extent are they integrated into firm strategy?
b. Does the company send a privacy notice to individuals when their personal data are collected?
c. Does the data used by the company include personally identifiable information?
d. Does the company have the appropriate tools to mitigate the risks of data misuse?
are stored in a database. Why is this an important advantage? What can go wrong when redundant
preferable to integrate business processes in one information system, rather than store different
3. (LO 2-2) Even though it is preferable to store data in a relational database, storing data across
separate tables can make data analysis cumbersome. Describe three reasons it is worth the trouble
4. (LO 2-2) Among the advantages of using a relational database is enforcing business rules. Based on
your understanding of how the structure of a relational database helps prevent data redundancy and
other advantages, how does the primary key/foreign key relationship structure help enforce a
business rule that indicates that a company shouldn’t process any purchase orders from suppliers that do not exist in the Supplier table?
5. (LO 2-2) What is the purpose of a data dictionary? Identify four different attributes that could be
6. (LO 2-3) In the ETL process, the first step is extracting the data. When you are obtaining the data
yourself, what are the steps to identifying the data that you need to extract?
7. (LO 2-3) In the ETL process, if the analyst does not have the security permissions to access the data
directly, then he or she will need to fill out a data request form. While this doesn’t necessarily require
the analyst to know extraction techniques, why does the analyst still need to understand the raw data
8. (LO 2-3) In the ETL process, when an analyst is completing the data request form, there are a
number of fields that the analyst is required to complete. Why do you think it is important for the
analyst to indicate the frequency of the report? How do you think that would affect what the database
9. (LO 2-3) Regarding the data request form, why do you think it is important to the database
administrator to know the purpose of the request? What would be the importance of the “To be used
10. (LO 2-3) In the ETL process, one important step to process when transforming the data is to work
with null, n/a, and zero values in the dataset. If you have a field of quantitative data (e.g., number of
years each individual in the table has held a full-time job), what would be the effect of the following?
c. Deleting records that have null and n/a values from your dataset
(Hint: Think about the impact on different aggregate functions, such as COUNT and AVERAGE.)
11. (LO 2-4) What is the theme of each of the six questions proposed by the Institute of Business Ethics?
Which one addresses the purpose of the data? Which one addresses how the risks associated with
data use and collection are mitigated? How could these two specific objectives be achieved at the
same time?
Problems
1. (LO 2-2) Match the relational database function to the appropriate relational database term:
Descriptive attribute
Foreign key
Primary key
Relational database
Relational Database Function | Relational Database Term
2. (LO 2-3) Identify the order sequence in the ETL process as part of mastering the data (i.e., 1 is first;
5 is last).
Steps of the ETL Process | Sequence Order (1 to 5)
3. (LO 2-3) Identify which ETL tasks would be considered “Validating” the data, and which would be considered “Cleaning” the data.
ETL Task | Validating or Cleaning
4. (LO 2-3) Match each ETL task to the stage of the ETL process:
Determine purpose
Obtain
Validate
Clean
Load
ETL Task | Stage of ETL Process
5. (LO 2-4) For each of the six questions suggested by the Institute of Business Ethics to evaluate data
C. Evaluate the due diligence of the company’s data vendors in preventing misuse of the data
Institute of Business Ethics Questions regarding Data Use and Privacy | Category A, B, or C?
3. How does the company use data, and to what extent are
they integrated into firm strategy?
4. Does our company conduct appropriate due diligence
when sharing with or acquiring data from third parties?
5. Does the company have the appropriate tools to
manage the risks of misuse?
6. Does the company have safeguards in place to mitigate
these risks of misuse?
6. (LO 2-2) Which of the following are useful, established characteristics of using a relational database?
1. Completeness
2. Reliable
3. No redundancy
4. Communication and integration
of business processes
5. Less costly to purchase
6. Less effort to maintain
7. Business rules are enforced
7. (LO 2-3) As part of mastering the data, analysts must make certain trade-offs when they consider which
b. Analysis: What are the trade-offs an analyst should consider between data that are very
expensive to acquire and analyze, but will most directly address the question at hand? How would
you assess whether they are worth the extra cost?
c. Analysis: What are the trade-offs between extracting needed data by yourself, or asking a data
scientist to get access to the data?
8. (LO 2-4) The Institute of Business Ethics proposes that a company protect the privacy of stakeholders and asks, among other questions:
Does our company conduct appropriate due diligence when sharing with or acquiring data from
third parties?
Do third-party data providers follow similar ethical standards in the acquisition and transmission of
the data?
a. Analysis: What type of due diligence with regard to a third party sharing and acquiring data would
be appropriate for the company (or company accountant or data scientist) to perform? An audit? A
questionnaire? Standards written in to a contract?
b. Analysis: How would you assess whether the third-party data provider follows ethical standards in
the acquisition and transmission of the data?
LABS
Case Summary: Sláinte is a fictional brewery that has recently gone through big changes.
Sláinte sells six different products. The brewery has only recently expanded its business to
distributing from one state to nine states, and now its business has begun stabilizing after
the expansion. With that stability comes a need for better analysis. You have been hired by
Sláinte to help management better understand the company’s sales data and provide input
for its strategic decisions. In this lab, you will identify appropriate questions and develop a
hypothesis for each question, generate a data request, and evaluate the data you receive.
Data: Lab 2-1 Data Request Form.zip - 10KB Zip / 13KB Word
Lab 2-1 Part 1 Identify the Questions and Generate a
Data Request
Before you begin the lab, you should create a new blank Word document where you will
record your screenshots and save it as Lab 2-1 [Your name] [Your email address].docx.
One of the biggest challenges you face with data analysis is getting the right data. You
may have the best questions in the world, but if there are no data available to support your
hypothesis, you will have difficulty providing value. Additionally, there are instances in
which the IT workers may be reluctant to share data with you. They may send incomplete
data, the wrong data, or ignore your request completely. Be persistent, but you may have to look for creative ways to find insight with an incomplete picture.
One of Sláinte’s first priorities is to identify its areas of success as well as areas of
potential improvement. Your manager has asked you to focus specifically on sales data at
this point. This includes data related to sales orders, products, and customers.
Answer the Lab 2-1 Part 1 Analysis Questions and then complete a data request form
for those data you have identified for your analysis.
1. Open the Data Request Form.
2. Enter your contact information.
3. In the description field, identify the tables and fields that you’d like to analyze, along
with the time periods (e.g., past month, past year, etc.).
4. Indicate what the information will be used for in the appropriate box (internal
analysis).
5. Select a frequency. In this case, this is a “One-off request.”
6. Choose a format (spreadsheet).
7. Enter a request date (today) and a required date (one week from today).
8. Take a screenshot (label it 2-1A) of your completed form.
EXHIBIT 2-1A
Sales_Orders Table
Sales_Order_Lines Table
Finished_Goods_Products Table
You may notice that while there are a few attributes that may be useful in your sales
analysis, the list may be incomplete and missing several values. This is normal with
data requests.
Lab Note: The tools presented in this lab periodically change. Updated instructions, if
applicable, can be found in the eBook and lab walkthrough videos in Connect.
Case Summary: Sláinte is a fictional brewery that has recently gone through big
changes. Sláinte sells six different products. The brewery has only recently expanded its
business to distributing from one state to nine states, and now its business has begun
stabilizing after the expansion. Sláinte has brought you in to help determine potential areas
for sales growth in the next year. Additionally, management has noticed that the
company’s margins aren’t as high as they had budgeted and would like you to help
identify some areas where they could improve their pricing, marketing, or strategy.
Specifically, they would like to know how many of each product were sold, the product’s
actual name (not just the product code), and the months in which different products were
sold.
Data: Lab 2-2 Slainte Dataset.zip - 83KB Zip / 90KB Excel
Microsoft Excel
Tableau | Desktop
Tableau Software, Inc. All rights reserved.
In this lab, you will learn how to connect to data in Microsoft Power BI or Excel using
the Internal Data Model and how to connect to data and build relationships among tables
in Tableau. This will prepare you for future labs that require you to transform data, as well
as aid in understanding of primary and foreign key relationships.
a. Finished_Goods_Products
b. Sales_Order
c. Sales_Order_Lines
7. Take a screenshot (label it 2-2MA) of the Power Query Editor window with
your changes.
8. At this point, we are ready to connect the data to our Excel sheet. We will only
create a connection so we can pull it in for specific analyses. Click the Home tab
and choose the arrow below Close & Load > Close & Load To…
9. Choose Only Create Connection and Add this data to the Data Model and
click OK. The three queries will appear in a tab on the right side of your sheet.
10. Save your workbook as Lab 2-2 Slainte Model.xlsx, and continue to Part 2.
Tableau | Desktop
Lab 2-2 Part 2 Validate the Data
Now that the data have been prepared and organized, you’re ready for some basic analysis.
Given the sales data, management has asked you to prepare a report showing the total
number of each item sold each month between January and April 2020. This means that
we should create a PivotTable with a column for each month, a row for each product, and
the sum of the quantity sold where the two intersect.
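If the same data were still sitting in a relational database, the totals behind this PivotTable could be produced with a grouped query before the data ever reach Excel or Tableau. The sketch below uses SQLite date syntax, and the table, column, and key names (Sales_Order, Sales_Order_Lines, Sales_Order_ID, Product_Code, Sales_Order_Quantity_Sold) are assumptions based on the Sláinte tables named in this lab, not the dataset’s exact field names:

    SELECT sol.Product_Code,
           strftime('%Y-%m', so.Sales_Order_Date)  AS sales_month,
           SUM(sol.Sales_Order_Quantity_Sold)      AS units_sold
    FROM   Sales_Order_Lines AS sol
           JOIN Sales_Order  AS so
                  ON so.Sales_Order_ID = sol.Sales_Order_ID
    WHERE  so.Sales_Order_Date BETWEEN '2020-01-01' AND '2020-04-30'
    GROUP  BY sol.Product_Code, strftime('%Y-%m', so.Sales_Order_Date)
    ORDER  BY sol.Product_Code, sales_month;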
Note: If at any point while working with your PivotTable, your PivotTable
Fields list disappears, you can make it reappear by ensuring that your active
cell is within the PivotTable itself. If the Field List still doesn’t reappear,
navigate to the Analyze tab in the Ribbon, and select Field List.
4. Click the > next to each table to show the available fields. If you don’t see your
three tables, click the All option directly below the PivotTable Fields pane title.
5. Drag Sales_Order.Sales_Order_Date to the Columns pane. Note: When you
add a date, Excel will automatically try to group the data by Year, Quarter, and so
on.
10. Clean up your PivotTable. Rename labels and the title of the report to something
more useful, like “Sales by Month”.
11. Take a screenshot (label it 2-2MC).
12. When you are finished answering the lab questions, you may close Excel. Save
your file as Lab 2-2 Slainte Pivot.xlsx.
Tableau | Desktop
a. In the Columns pane, drill down on the date to show the quarters and months
[click the + next to YEAR(Sales Order Date) to show the Quarters, etc.].
b. Click QUARTER(Sales Order Date) and choose Remove.
Lab Note: The tools presented in this lab periodically change. Updated instructions, if
applicable, can be found in the eBook and lab walkthrough videos in Connect.
Case Summary: LendingClub is a peer-to-peer marketplace where borrowers and
investors are matched together. The goal of LendingClub is to reduce the costs associated
with these banking transactions and make borrowing less expensive and investment more
engaging. LendingClub provides data on loans that have been approved and rejected
since 2007, including the assigned interest rate and type of loan. This provides several
opportunities for data analysis. There are several issues with this dataset that you have
been asked to resolve before you can process the data. This will require you to perform
some cleaning, reformatting, and other transformation techniques.
Data: Lab 2-3 Lending Club Approve Stats.zip - 120MB Zip / 120MB Excel
Tableau | Prep
Tableau Software, Inc. All rights reserved.
Attribute Description
loan_amnt Requested loan amount
term Length of the loan in months
int_rate Interest rate of the loan
grade Quality of the loan: e.g. A, B, C
addr_state State
dti Debt-to-income ratio
delinq_2y Late payments within the past 2 years
earliest_cr_line Oldest credit account
open_acc Number of open credit accounts
revol_bal Total balance of all credit accounts
revol_util Percentage of available credit in use
total_acc Total number of credit accounts
Note: When you use Power Query or a Tableau Prep flow, you create a set of steps that will be used to transform the data. When you receive new data, you can run the new data through those same steps (or flows) without having to recreate them each time.
a. loan_amnt
b. term
c. int_rate
d. grade
e. emp_length
f. home_ownership
g. annual_inc
h. issue_d
i. loan_status
j. title
k. zip_code
l. addr_state
m. dti
n. delinq_2y
o. earliest_cr_line
p. open_acc
q. revol_bal
r. revol_util
s. total_acc
7. Take a screenshot (label it 2-3MA) of your reduced columns.
Next, remove text values from numerical values and replace values so we can do
calculations and summarize the data. These extraneous text values include
months, <1, n/a, +, and years:
Tableau | Prep
Lab Note: Tableau Prep takes extra time to process large datasets.
1. Open Tableau Prep Builder.
2. Click Connect to Data > To a File > Microsoft Excel.
3. Locate the Lab 2-3 Lending Club Approve Stats.xlsx file on your computer
and click Open.
4. Drag LoanStats3c to your flow. Notice that all of the Field Names are incorrect.
First we have to fix the column headers and remove unwanted data.
5. Check Use Data Interpreter in the pane on the left to automatically fix the Field
Names.
6. Uncheck the box next to any attribute that is NOT in the following list to remove
it from our analysis. Hint: Once you get to initial_list_status, all of the remaining
fields can be removed.
a. loan_amnt
b. term
c. int_rate
d. grade
e. emp_length
f. home_ownership
g. annual_inc
h. issue_d
i. loan_status
j. title
k. zip_code
l. addr_state
m. dti
n. delinq_2y
o. earliest_cr_line
p. open_acc
q. revol_bal
r. revol_util
s. total_acc
7. Take a screenshot (label it 2-3TA) of your corrected and reduced list of Field
Names.
Next, remove text values from numerical values and replace values so we can do
calculations and summarize the data. These extraneous text values include
months, <1, n/a, +, and years:
8. Click the + next to LoanStats3c in the flow and choose Add Clean Step. It may
take a minute or two to load.
9. An Input step will appear in the top half of the workspace, and the details of that
step are in the bottom of the workspace in the Input Pane. Every flow requires at
least one Input step at the beginning of the flow.
10. In the Input Pane, you can further limit which fields you bring into Tableau Prep,
as well as seeing details about each field including:
a. Type: this indicates the data type of each field (for example, numeric, date, or
short text).
b. Linked Keys: this indicates whether or not the field is a primary or a foreign
key.
c. Sample Values: provides a few example values from that field so you can see
how the data are formatted.
11. In the term pane:
a. Right-click the header or click the three dots and choose Clean > Remove
Letters.
b. Click the Data Type (Abc) button in the top-left corner and change the data
type to Number (Whole).
1. Double-click <1 year in the list and type “0” to replace those values with 0.
2. Double-click n/a in the list and type “0” to replace those values with 0.
3. While you are in the Group Values window, you could quickly replace all of
the year values with single numbers (e.g., 10+ years becomes “10”) or you
can move to the next step to remove extra characters.
4. Click Done.
b. If you didn’t remove the “years” text in the previous step, right-click the
emp_length header or click the three dots and choose Clean > Remove
Letters and then Clean > Remove All Spaces.
c. Finally, click the Data Type (Abc) button in the top-left corner and change the
data type to Number (Whole).
13. In the flow pane, right-click Clean 1 and choose Rename and name the step
“Remove text”.
14. Take a screenshot (label it 2-3TB) of your cleaned data file, showing the
term and emp_length columns.
15. Click the + next to your Remove text task and choose Output.
16. In the Output pane, click Browse:
a. Navigate to your preferred location to save the file.
b. Name your file Lab 2-3 Lending Club Transform.hyper.
c. Click Accept.
17. Click Run Flow. When it is finished processing, click Done.
18. When you are finished answering the lab questions you may close Tableau Prep.
Save your file as Lab 2-3 Lending Club Transform.tfl.
Lab Note: The tools presented in this lab periodically change. Updated instructions, if
applicable, can be found in the eBook and lab walkthrough videos in Connect.
Case Summary: When you’re working with a new or unknown set of data, validating
the data is very important. When you make a data request, the IT manager who fills the
request should also provide some summary statistics that include the total number of
records and mathematical sums to ensure nothing has been lost in the transmission. This
lab will help you calculate summary statistics in Power BI and Tableau Desktop.
Data: Lab 2-4 Lending Club Transform.zip - 29MB Zip / 26MB Excel / 6MB Tableau
Microsoft Excel
Tableau | Desktop
Tableau Software, Inc. All rights reserved.
In this part we are interested in understanding more about the loan amounts, interest
rates, and annual income by looking at their summary statistics. This process can be used
for data validation and later for outlier detection.
a. Drag loan_amnt to the Fields box. Click the drop-down menu next to
loan_amnt and choose Sum.
b. Drag loan_amnt to the Fields box below the existing field. Click the drop-
down menu next to the new loan_amnt and choose Average.
c. Drag loan_amnt to the Fields box below the existing field. Click the drop-
down menu next to the new loan_amnt and choose Count.
d. Drag loan_amnt to the Fields box below the existing field. Click the drop-
down menu next to the new loan_amnt and choose Max.
15. Add two new Multi-row cards showing the same values (Sum, Average, Count,
Max) for int_rate and annual_inc.
16. Take a screenshot (label it 2-4MB) of the column statistics and value
distribution.
17. When you are finished answering the lab questions, you may close Power BI
Desktop. Save your file as Lab 2-4 Lending Club Summary.pbix.
Tableau | Desktop
Lab Note: The tools presented in this lab periodically change. Updated instructions, if
applicable, can be found in the eBook and lab walkthrough videos in Connect.
Case Summary: Your college admissions department is interested in determining the
likelihood that a new student will complete their 4-year program. They have tasked you
with analyzing data from the U.S. Department of Education to identify some variables that
may be predictive of the completion rate. The data used in this lab are a subset of the
College Scorecard dataset that is provided by the U.S. Department of Education. These
data provide federal financial aid and earnings information, insights into the performance
of schools eligible to receive federal financial aid, and the outcomes of students at those
schools.
Data: Lab 2-5 College Scorecard Dataset.zip - 0.5MB Zip / 1.4MB Txt
Microsoft Excel
6. Take a screenshot (label it 2-5MA) of your columns with the proper data
types.
7. From the Home tab, click Close & Load.
8. To ensure that you captured all of the data through the extraction from the txt file,
we need to validate them:
a. In the Queries & Connections pane, verify that there are 7,703 rows loaded.
b. Compare the attribute names (column headers) to the attributes listed in the data
dictionary (found in Appendix K of the textbook). There should be 30 columns
(the last column in Excel should be AD).
c. Click Column H for the SAT_AVG attribute. In the summary statistics at the
bottom of your worksheet, the overall average SAT score should be 1,059.07.
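A scripted version of these validation checks, shown here only as a sketch outside the lab's Excel workflow, would compare the same record count, column count, and average. The file name and delimiter are assumptions; adjust them to the extracted txt file.

import pandas as pd

# Assumed file name and default comma delimiter -- adjust to the extracted txt file.
df = pd.read_csv("Lab 2-5 College Scorecard Dataset.txt")

print(len(df))               # expect 7,703 rows
print(len(df.columns))       # expect 30 columns
print(df["SAT_AVG"].mean())  # expect an overall average of roughly 1,059.07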
Tableau | Desktop
16. From the menu bar, click Analysis > Aggregate Measures to remove the check mark. To show each unique entry, you have to disable aggregate measures.
17. To show the summary statistics, go to the menu bar and click Worksheet > Show
Summary. A Summary card appears on the right side of the screen with the
Count, Sum, Average, Minimum, Maximum, and Median values.
18. Drag Unitid to the Rows shelf and note the summary statistics.
19. Take a screenshot (label it 2-5TB) of the Unitid stats in your worksheet.
20. Create two new sheets and repeat steps 16–18 for Sat Avg and C150 4, noting
the count, sum, average, minimum, maximum, and median of each.
21. When you are finished answering the lab questions, you may close Tableau
Desktop. Save your file as Lab 2-5 College Scorecard Transform.twb. Your
data are now ready for the test plan. This lab will continue in Lab 3-3.
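For reference, the same six summary statistics can be reproduced with a short pandas sketch (not part of the lab). The raw attribute names UNITID, SAT_AVG, and C150_4 are assumptions based on the data dictionary and may differ from Tableau's display names.

import pandas as pd

# Assumed file name and raw column names -- adjust to your extract.
df = pd.read_csv("Lab 2-5 College Scorecard Dataset.txt")

# Count, Sum, Average, Minimum, Maximum, and Median for the three fields.
stats = df[["UNITID", "SAT_AVG", "C150_4"]].agg(
    ["count", "sum", "mean", "min", "max", "median"]
)
print(stats)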
Lab Note: The tools presented in this lab periodically change. Updated instructions, if
applicable, can be found in the eBook and lab walkthrough videos in Connect.
Case Summary: You are a brand-new analyst and you just got assigned to work on the
Dillard’s account. You were provided an ER Diagram (available in Appendix J), but you
still aren’t sure what all of the different tables and fields represent. Before diving into
problem solving or even transforming the data to prepare them for analysis, it is important
to gain an understanding of what data are available to you. One of the steps in doing so is
connecting to the database and analyzing the way the tables relate.
Data: Dillard’s sales data are available only on the University of Arkansas Remote
Desktop (waltonlab.uark.edu). See your instructor for login credentials.
Microsoft Excel
Tableau | Desktop
a. Server: essql1.walton.uark.edu
b. Database: WCOB_Dillards
c. Data Connectivity: DirectQuery
4. If prompted to enter credentials, you can keep the default of “Use my current credentials” and click Connect.
5. If prompted with an Encryption Support warning, click OK to move past it.
6. Take a screenshot (label it 2-6MA) of the navigator window.
Learn about Power BI!
There are two ways to connect to data: Import or DirectQuery. Each has pros and cons, and the right choice depends on a few factors, including the size of the dataset and the type of analysis you intend to do.
Import: Will pull in all data at once. This can take a long time, but once they are
imported, your analysis can be more efficient if you know that you plan to use
each piece of data that you import. This is also beneficial for some of the
analyses you will learn about in future chapters, such as clustering.
DirectQuery: Only creates a connection to the data. This is more efficient if you
are exploring all of the tables in a large database and are comfortable working
with only a sample of data. Note: Unless directed otherwise, you should always
use DirectQuery with Dillard’s data to prevent the remote desktop from running
out of storage space.
7. Place a check mark next to each of the following tables and click Load:
a. Customer, Department, SKU, SKU_Store, Store, Transact
8. Click the Model button (the icon with three connected boxes) in the toolbar on
the left to view the tables and relationships and note the following:
a. All the tables that you selected should appear in the Modeling tab with table
names, attributes, and relationships.
b. When you hover over any of the relationships, the keys that are common between the two tables are highlighted.
1. Something important to consider is that in the raw data, the primary key is
typically the first attribute listed. In this Power BI modeling window, the
attributes have been re-ordered to appear in alphabetical order. For example,
SKU is the primary key of the SKU table, and it exists in the Transact table
as a foreign key.
9. Take a screenshot (label it 2-6MB) of the All tables sheet.
10. When you are finished answering the lab questions, you may close Power BI
Desktop. Save your file as Lab 2-6 Dillard’s Diagram.pbix.
Note: While it may seem easier and faster to rely on the automatically created
data model in Power BI, you should review the table relationships to make sure
the appropriate keys match.
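To make the primary key/foreign key relationship concrete, here is a small self-contained pandas sketch (illustration only; the values and the TRANSACTION_ID column name are made up). It checks the two properties the data model relies on: SKU is unique in the SKU table, and every SKU referenced by Transact exists in the SKU table.

import pandas as pd

# Toy tables standing in for SKU and TRANSACT; the values are made up.
sku = pd.DataFrame({"SKU": [1, 2, 3]})
transact = pd.DataFrame({"TRANSACTION_ID": [101, 102, 103], "SKU": [1, 1, 3]})

# Primary key check: SKU values in the SKU table should be unique.
assert sku["SKU"].is_unique

# Foreign key check: every SKU in TRANSACT should exist in the SKU table.
assert transact["SKU"].isin(sku["SKU"]).all()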
Tableau | Desktop
a. The field names will appear in the data grid section on the bottom of the screen,
but the data themselves will not automatically load. If you click Update Now,
you can get a preview of the data held in the Transact table. You can do some
light data transformation at this point, but if your data transformation needs are
heavy, it would be better to perform that transformation in Tableau Prep before
bringing the data into Tableau Desktop.
6. Double-click the CUSTOMER table to add it to the data model in the top pane.
a. In the Edit Relationship window that pops up, confirm that the appropriate keys
are identified (Cust ID and Cust ID) and close the window.
7. Double-click each of the remaining tables that relate directly to the Transact table
from the list on the left:
8. Finally, double-click the SKU_STORE table from the list on the left.
a. The SKU_Store table is related to both the SKU and the Store tables, but
Tableau will likely default to connecting it to the Transact table, resulting in a
broken relationship.
b. To fix the relationship,
1. Close the Edit Relationships window without making changes.
2. Right-click SKU_STORE in the top pane and choose Move to > SKU.
3. Verify the related keys and close the Edit Relationships window.
4. Note: It is not necessary to also relate the SKU_Store table to the Store table
in Tableau; that is only a database requirement.
Lab Note: The tools presented in this lab periodically change. Updated instructions, if
applicable, can be found in the eBook and lab walkthrough videos in Connect.
Case Summary: You are a brand-new analyst and you just got assigned to work on the
Dillard’s account. After analyzing the ER Diagram to gain a bird’s-eye view of all the
different tables and fields in the database, you are ready to further explore the data in each
table and how the fields are formatted. In particular, you will connect to the Dillard’s
database using Tableau Prep or Microsoft Power BI, explore the data types and the primary and foreign keys, and preview individual tables.
In Lab 2-6, the Tableau Track had you focus on Tableau Desktop. In this lab, you will
connect to Tableau Prep instead. Tableau Desktop shows the table relationships more quickly, but Tableau Prep makes it easier to preview and clean the data prior to analysis.
Data: Dillard’s sales data are available only on the University of Arkansas Remote
Desktop (waltonlab.uark.edu). See your instructor for login credentials.
Microsoft Excel
Tableau | Prep
a. Server: essql1.walton.uark.edu
b. Database: WCOB_DILLARDS
c. Data Connectivity: DirectQuery
4. If prompted to enter credentials, keep the default of “Use my current credentials”
and click Connect.
5. If prompted with an Encryption Support warning, click OK to move past it.
Tableau | Prep
1. Type: this indicates the data type of each field (for example, numeric, date,
or short text).
2. Linked Keys: this indicates whether or not the field is a primary or a foreign
key. In the Transact table, we can see that the Transaction_ID is the primary
key, and that there are three foreign keys in this table: Store, Cust_ID, and
SKU.
3. Sample Values: provides a few example values from that field so you can see
how the data are formatted.
6. Double-click the CUSTOMER table to add a new Input step to your flow.
7. Take a screenshot (label it 2-7TA).
8. Answer the lab questions and continue to Part 2.
1. Place check marks next to the TRANSACT and CUSTOMER tables in the Navigator window.
2. Click Transform Data.
a. This will open a new window for the Power Query Editor (this is the same
interface that you will encounter in Excel’s Get & Transform).
b. On the left side of the Power Query Editor, you can click through the different
queries to see previews of each table’s data. Similar to Tableau Prep, you are
provided only a sample of the dataset.
c. Click the Transact query to preview the data from the Transact table.
d. Scroll the main view to the right to see more of the attributes.
3. Power Query does not default to providing data profiling information the way
Tableau Prep’s Clean step does, but we can activate those options.
4. Click the View tab and place check marks in the Column Distribution and
Column Profile boxes.
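A scripted analogue of this profiling (a sketch only, not part of the lab) is a one-line describe plus a value count per column; it assumes a sample of the Transact table has been loaded into a DataFrame, here faked with a few made-up rows.

import pandas as pd

# Toy rows standing in for a sample of the Transact table; values are made up.
transact = pd.DataFrame({
    "STORE": [102, 102, 405],
    "TRAN_AMT": [25.00, 40.00, 18.50],
})

# Per-column summaries, similar to Column Profile, and a value distribution,
# similar to Column Distribution.
print(transact.describe(include="all"))
print(transact["STORE"].value_counts())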
Tableau | Prep
1. Add a new Clean step extending from the TRANSACT table (click the + icon
next to TRANSACT and choose Clean Step from the menu). A phantom step for
View and Clean may already exist. If so, just click that step to add it:
a. The Clean step provides many different options for preparing your data, which
we will get to in future labs. In this lab, you will use it as a means for
familiarizing yourself with the dataset.
b. Beneath the Flow Pane, you can see two new panes: the Profile Pane and the
Data Grid.
1. The Data Grid provides a more robust sample of data values than you were
able to see in the Input Pane from the Input step.
2. The Profile Pane provides summary visualizations of each attribute in the
table. Note: When datasets are large, these summary values are calculated
only from the first several thousand records in the original table, so be
cautious about using these visualizations to drive insights! In this instance,
we can see a good example of this being merely a sample by looking at the
TRAN_DATE visual summary. It shows only dates from 12/30/2013 to
01/27/2014, but we know the dataset has transactions through 2016.
c. Some of the attributes are straightforward in what they represent, but others
aren’t as clear. For instance, you may be curious about what TRAN_TYPE
represents. Look at the data visualization provided for TRAN_TYPE in the
Profile Pane and click P. This will filter the results in the Data Grid.
1. Look at the results in the TRAN_AMT field and note whether they are
positive or negative (you can do so by looking at the data grid or by looking
at the filtered visualization for TRAN_AMT).
2. Adjust the filter so that you see only R transaction types. Note the values in the TRAN_AMT field again.
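The same exploration can also be scripted. The pandas sketch below uses made-up rows (deliberately not the real pattern) to show how you could filter by TRAN_TYPE and summarize TRAN_AMT by type; run it against a real extract to see the relationship the lab asks you to observe.

import pandas as pd

# Toy rows standing in for a Transact extract; the values are made up and are
# NOT the answer -- query the real data to see the actual signs by type.
transact = pd.DataFrame({
    "TRAN_TYPE": ["P", "R", "P", "R", "P"],
    "TRAN_AMT": [25.00, 12.00, -40.00, -15.00, 8.00],
})

# Distribution of transaction types (similar to the Profile Pane visual).
print(transact["TRAN_TYPE"].value_counts())

# Minimum, maximum, and mean of TRAN_AMT for each transaction type.
print(transact.groupby("TRAN_TYPE")["TRAN_AMT"].agg(["min", "max", "mean"]))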
Lab Note: The tools presented in this lab periodically change. Updated instructions, if
applicable, can be found in the eBook and lab walkthrough videos in Connect.
Case Summary: You are a brand-new analyst and you just got assigned to work on the
Dillard’s account. So far you have analyzed the ER Diagram to gain a bird’s-eye view of
all the different tables and fields in the database, and you have explored the data in each
table to gain a glimpse of sample values from each field and how they are all formatted.
You also gained a little insight into the distribution of sample values across each field, but
at this point you are ready to dig into the data a bit more. In the previous comprehensive
labs, you connected to full tables in Tableau or Power BI to explore the data. In this lab,
instead of connecting to full tables, we will write a SQL query to pull only a subset of data
into Tableau or Excel. This tactic is more effective when the database is very large and
you can derive insights from a sample of the data. We will analyze 5 days’ worth of
transaction data from September 2016. In this lab we will look at the distribution of
transactions across different states in order to get to know our data a little better.
Data: Dillard’s sales data are available only on the University of Arkansas Remote
Desktop (waltonlab.uark.edu). See your instructor for login credentials.
Microsoft Excel
Tableau | Desktop
Tableau | Desktop
1. Open Tableau Desktop and click Connect to Data > To a Server > Microsoft
SQL Server.
2. Enter the following:
a. Server: essql1.walton.uark.edu
b. Database: WCOB_Dillards
c. All other fields can be left as is; click Sign In.
d. Instead of connecting to a table, you will create a New Custom SQL query.
Double-click New Custom SQL and input the following query:
SELECT TRANSACT.*, STATE
FROM TRANSACT
INNER JOIN STORE
ON TRANSACT.STORE = STORE.STORE
WHERE TRAN_DATE BETWEEN '20160901' AND '20160905'
e. Click Preview Results… to test your query on a sample data set.
f. If everything looks good, close the preview and click OK.
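If you ever need this same subset outside of Tableau (for example, in a Python notebook), a hypothetical scripted pull of the identical query might look like the sketch below. The ODBC driver name and trusted-connection setting are assumptions about the remote desktop environment, not something the lab specifies.

import pandas as pd
import pyodbc

# Assumed driver and authentication settings -- adjust to your environment.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=essql1.walton.uark.edu;"
    "DATABASE=WCOB_Dillards;"
    "Trusted_Connection=yes;"
)

query = """
SELECT TRANSACT.*, STATE
FROM TRANSACT
INNER JOIN STORE ON TRANSACT.STORE = STORE.STORE
WHERE TRAN_DATE BETWEEN '20160901' AND '20160905'
"""

# Pull the five days of transactions into a DataFrame.
df = pd.read_sql(query, conn)
print(len(df))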
Lab 2-8 Part 2 View the Distribution of Transaction
Amounts across States
In addition to data from the Transact table, our query also pulled in the attribute State from the Store table. We can use this attribute to examine how transaction amounts are distributed across states; a scripted version of this aggregation is sketched at the end of this part.
1. We will perform this analysis using a PivotTable. Return to the worksheet in your
Excel workbook titled Sheet1.
2. From the Insert tab in the ribbon, click PivotTable.
3. Check Use this workbook’s Data Model and click OK.
4. Expand Query1 and place check marks next to TRAN_AMT and STATE.
5. The TRAN_AMT default aggregation will likely be SUM. Change it by right-clicking one of the TRAN_AMT values in the PivotTable and selecting Summarize Values By > Average.
6. To make this output easier to interpret, you can sort the data so that you see the
states that have the highest average transaction amount first. To do so, have your
active cell anywhere in the Average of TRAN_AMT column, right-click the cell,
select Sort, then select Sort Largest to Smallest.
7. To view a visualization of these results, click the PivotTable Analyze tab in the
ribbon and click PivotChart.
8. The default will be a column chart, which is great for visualizing these data.
Click OK.
9. Take a screenshot (label it 2-8MB) of your PivotTable and PivotChart and
click OK.
10. When you are finished answering the lab questions, you may close Excel. Save
your file as Lab 2-8 Dillard’s Stats.xlsx.
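For comparison, the PivotTable aggregation can be reproduced in a few lines of pandas. This sketch continues from the hypothetical read_sql pull shown at the end of Part 1, so df is assumed to already hold the query results with STATE and TRAN_AMT columns.

# Average transaction amount by state, sorted largest to smallest
# (df comes from the read_sql sketch at the end of Part 1).
avg_by_state = (
    df.groupby("STATE")["TRAN_AMT"]
      .mean()
      .sort_values(ascending=False)
)
print(avg_by_state)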
Tableau | Desktop