
Chapter 1
Data Analytics for Accounting and
Identifying the Questions

A Look at This Chapter


Data Analytics is changing both business and accounting. In this chapter,
we define Data Analytics and explain its impact on business and the
accounting profession, noting that the value of Data Analytics is derived
from the insights it provides. We also describe the need for an analytics
mindset in the accounting profession. We next describe the Data Analytics
Process using the IMPACT cycle and explain how this process is used to
address both business and accounting questions. We then emphasize the
skills accountants need as well as the tools available for their use. In this
chapter, we specifically emphasize the importance of identifying
appropriate accounting questions that Data Analytics might be able to
address.

A Look Ahead
Chapter 2 provides a description of how data are prepared and scrubbed to
be ready for analysis to address accounting questions. We explain how to
extract, transform, and load data and then how to validate and normalize the
data. In addition, we explain how data standards are used to facilitate the
exchange of data between data sender and receiver. We finalize the chapter
by emphasizing the need for ethical data collection and data use to maintain
data privacy.


As access to accounting data proliferates and tools and accountant skills advance, accountants are relying more on Big Data to address accounting questions. Whether those questions relate to audit, tax, or other accounting areas, value will increasingly be created by performing Data
Analytics. In this chapter, we introduce you to the need for Data Analytics
in accounting, and how accounting professionals are increasingly asked to
develop an analytics mindset for any and all accounting roles.

Cobalt S-Elinoi/Shutterstock

Technology such as Data Analytics, artificial intelligence, machine learning, blockchain, and robotic process automation will be playing a greater role in the accounting profession this year, according to a recent report from the Institute of Management Accountants.
The report indicates that finance and accounting professionals are
increasingly implementing Big Data in their business processes, and the
pattern is likely to continue in the future. The IMA surveyed its members for
the report and received 170 responses from CFOs and other management
accountants. Many of the CFOs are predicting big changes for 2020 in
their businesses.

Sources: M. Cohn, “Accountants to Rely More on Big Data in 2020,” Accounting Today, January 4, 2020, https://www.accountingtoday.com/news/accountants-to-rely-more-on-big-data-in-2020 (accessed December 2020).

OBJECTIVES
After reading this chapter, you should be able to:

LO 1-1 Define Data Analytics.

LO 1-2 Understand why Data Analytics matters to business.

LO 1-3 Explain why Data Analytics matters to accountants.

LO 1-4 Describe the Data Analytics Process using the IMPACT cycle.

LO 1-5 Describe the skills needed by accountants.

LO 1-6 Explain how the IMPACT model may be used to address a specific business question.


DATA ANALYTICS
LO 1-1
Define Data Analytics.

Data surround us! By the year 2024, it is expected that the volume of data
created, captured, copied, and consumed worldwide will be 149 zettabytes
(compared to 2 zettabytes in 2010 and 59 zettabytes in 2020).1 In fact, more
data have been created in the last 2 years than in the entire previous history
of the human race.2 With so much data available about each of us (e.g., how
we shop, what we read, what we’ve bought, what music we listen to, where
we travel, whom we trust, what devices we use, etc.), arguably, there is the
potential for analyzing those data in a way that can answer fundamental
business questions and create value.
We define Data Analytics as the process of evaluating data with the
purpose of drawing conclusions to address business questions. Indeed,
effective Data Analytics provides a way to search through large structured
data (data that adheres to a predefined data model in a tabular format) and
unstructured data (data that does not adhere to a predefined data format)
to discover unknown patterns or relationships.3 In other words, Data
Analytics often involves the technologies, systems, practices,
methodologies, databases, statistics, and applications used to analyze
diverse business data to give organizations the information they need to
make sound and timely business decisions.4 That is, the process of Data
Analytics aims to transform raw data into knowledge to create value.
Big Data refers to datasets that are too large and complex for
businesses’ existing systems to handle utilizing their traditional capabilities
to capture, store, manage, and analyze these datasets. Another way to
describe Big Data (or frankly any available data source) is by use of four
Vs: its volume (the sheer size of the dataset), velocity (the speed of data
processing), variety (the number of types of data), and veracity (the
underlying quality of the data). While sometimes Data Analytics and Big
Data are terms used interchangeably, we will use the term Data Analytics
throughout and focus on the possibility of turning data into knowledge and
that knowledge into insights that create value.

PROGRESS CHECK
1. How does having more data around us translate into value for a company?

What must we do with those data to extract value?


2. Banks know a lot about us, but they have traditionally used externally

generated credit scores to assess creditworthiness when deciding whether to

extend a loan. How would you suggest a bank use Data Analytics to get a more

complete view of its customers’ creditworthiness? Assume the bank has access

to a customer’s loan history, credit card transactions, deposit history, and direct

deposit registration. How could it assess whether a loan might be repaid?

HOW DATA ANALYTICS AFFECTS BUSINESS


LO 1-2
Understand why Data Analytics matters to business.

There is little question that the impact of data and Data Analytics on
business is overwhelming. In fact, in PwC’s 18th Annual Global CEO
Survey, 86 percent of chief executive officers (CEOs) say they find it
important to champion digital technologies and emphasize a clear vision of
using technology for a competitive advantage, while 85 percent say they put
a high value on Data Analytics. In fact, per PwC’s 6th Annual
Digital IQ survey of more than 1,400 leaders from digital
businesses, the area of investment that tops CEOs’ list of priorities is
business analytics.5
A recent study from McKinsey Global Institute estimates that Data
Analytics and technology could generate up to $2 trillion in value per year
in just a subset of the total possible industries affected.6 Data Analytics
could very much transform the manner in which companies run their
businesses in the near future because the real value of data comes from Data
Analytics. With a wealth of data on their hands, companies use Data
Analytics to discover the various buying patterns of their customers,
investigate anomalies that were not anticipated, forecast future possibilities,
and so on. For example, with insight provided through Data Analytics,
companies could execute more directed marketing campaigns based on
patterns observed in their data, giving them a competitive advantage over
companies that do not use this information to improve their marketing
strategies. By pairing structured data with unstructured data, patterns could
be discovered that create new meaning, creating value and competitive
advantage. In addition to producing more value externally, studies show
that Data Analytics affects internal processes, improving productivity,
utilization, and growth.7
And increasingly, data analytic tools are available as self-service
analytics allowing users the capabilities to analyze data by aggregating,
filtering, analyzing, enriching, sorting, visualizing, and dashboarding for
data-driven decision making on demand.
PwC notes that while data has always been important, executives are
more frequently being asked to make data-driven decisions in high-stress
and high-change environments, making the reliance on Data Analytics even
greater these days!8

PROGRESS CHECK
3. Let’s assume a brand manager at Procter and Gamble identifies that an older

demographic might be concerned with the use of Tide Pods to do their laundry.

How might Procter and Gamble use Data Analytics to assess if this is a

problem?

4. How might Data Analytics assess the decision to either grant overtime to

current employees or hire additional employees? Specifically, consider how

Data Analytics might be helpful in reducing a company’s overtime direct labor

costs in a manufacturing setting.

HOW DATA ANALYTICS AFFECTS ACCOUNTING
LO 1-3
Explain why Data Analytics matters to accountants.

Data Analytics is expected to have dramatic effects on auditing and financial reporting as well as tax and managerial accounting. We detail how we think this might happen in each of the following sections.


Auditing
Data Analytics plays an increasingly critical role in the future of audit. In a
recent Forbes Insights/KPMG report, “Audit 2020: A Focus on Change,”
the vast majority of survey respondents believe both that:

1. Audits must better embrace technology.
2. Technology will enhance the quality, transparency, and accuracy of the audit.

Indeed, “As the business landscape for most organizations becomes increasingly complex and fast-paced, there is a movement toward leveraging advanced business analytic techniques to refine the focus on risk and derive deeper insights into an organization.”9 Many auditors believe
that audit data analytics will, in fact, lead to deeper insights that will
enhance audit quality. This sentiment of the impact of Data Analytics on the
audit has been growing for several years now and has given many public
accounting firms incentives to invest in technology and personnel to
capture, organize, and analyze financial statement data to provide enhanced
audits, expanded services, and added value to their clients. As a result, Data
Analytics is the next innovation in the evolution of the audit and
professional accounting industry.
Given the fact that operational data abound and are easier to collect and
manage, combined with CEOs’ desires to utilize these data, the accounting
firms may now approach their engagements with a different mindset. No
longer will they be simply checking for errors, material misstatements,
fraud, and risk in financial statements or merely be reporting their findings
at the end of the engagement. Instead, audit professionals will now be
collecting and analyzing the company’s data similar to the way a business
analyst would to help management make better business decisions. This
means that, in many cases, external auditors will stay engaged with clients
beyond the audit. This is a significant paradigm shift. The audit process is
changing from a traditional process toward a more automated one, which
will allow audit professionals to focus more on the logic and rationale
behind data queries and less on the gathering of the actual data.10 As a
result, audits will not only yield important findings from a financial
perspective, but also information that can help companies refine processes,
improve efficiency, and anticipate future problems.

“It’s a massive leap to go from traditional audit approaches to one that fully integrates big data and analytics in a seamless manner.”11

Data Analytics also expands auditors’ capabilities in services like testing for fraudulent transactions and automating compliance-monitoring activities
(like filing financial reports to the U.S. Securities and Exchange
Commission [SEC] or to the Internal Revenue Service [IRS]). This is
possible because Data Analytics enables auditors to analyze the complete
dataset, rather than the sampling of the financial data done in a traditional
audit. Data Analytics enables auditors to improve their risk assessment in both substantive and detailed testing.
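
To illustrate what full-population testing can look like in practice, here is a minimal Python (pandas) sketch that scans every payment record, rather than a sample, for possible duplicate payments and for checks at or above a $100,000 authorization threshold. The file name, column names, and rules are hypothetical illustrations, not taken from any particular audit system.

```python
import pandas as pd

# Hypothetical disbursements file covering the full population of payments.
payments = pd.read_csv("payments.csv")

# Test 1: possible duplicate payments (same vendor, invoice number, and amount).
duplicates = payments[
    payments.duplicated(subset=["vendor_id", "invoice_no", "amount"], keep=False)
]

# Test 2: checks at or above a $100,000 authorization threshold,
# flagged so the auditor can verify who authorized them.
large_checks = payments[payments["amount"] >= 100_000]

print(len(duplicates), "possible duplicate payments")
print(len(large_checks), "checks at or above $100,000")
```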

We address auditing questions and Data Analytics in Chapters 5 and 6.

Lab Connection
Lab 1-3 has you explore questions auditors would answer with
Data Analytics.

Management Accounting
Of all the fields of accounting, it would seem that the aims of Data
Analytics are most akin to management accounting. Management
accountants (1) are asked questions by management, (2) find data to address
those questions, (3) analyze the data, and (4) report the results to
management to aid in their decision making. The description of the
management accountant’s task and that of the data analyst appear to be
quite similar, if not identical in many respects.
Whether it be understanding costs via job order costing, understanding
the activity-based costing drivers, forecasting future sales on which to base
budgets, or determining whether to sell or process further or make or
outsource its production processes, analyzing data is critical to management
accountants.
As information providers for the firm, it is imperative for management
accountants to understand the capabilities of data and Data Analytics to
address management questions.
We address management accounting questions and Data Analytics in
Chapter 7.

Lab Connection

Lab 1-2 and Lab 1-4 have you explore questions managers
would answer with Data Analytics.

Financial Reporting and Financial Statement Analysis
Data Analytics also potentially has an impact on financial reporting. With
the use of so many estimates and valuations in financial accounting, some
believe that employing Data Analytics may substantially improve the
quality of the estimates and valuations. Data from within an enterprise
system and external to the company and system might be used to address
many of the questions that face financial reporting. Many financial
statement accounts are just estimates, and so accountants often ask
themselves questions like this to evaluate those estimates:
1. How much of the accounts receivable balance will ultimately be
collected? What should the allowance for loan losses look like?
2. Is any of our inventory obsolete? Should our inventory be valued at
market or cost (applying the lower-of-cost-or-market rule)? When will it
be out of date? Do we need to offer a discount on it now to get it sold?
3. Has our goodwill been impaired due to the reduction in profitability
from a recent merger? Will it regain value in the near future?
4. How should we value contingent liabilities like warranty claims or
litigation? Do we have the right amount?

Data Analytics may also allow an accountant or auditor to assess the probability of a goodwill write-down, warranty claims, or the collectability
of bad debts based on what customers, investors, and other stakeholders are
saying about the company in blogs and in social media (like Facebook and
Twitter). This information might help the firm determine both its
optimal response to the situation and appropriate adjustment to
its financial reporting.
It may be possible to use Data Analytics to scan the environment—that
is, scan Google searches and social media (such as Instagram and
Facebook) to identify potential risks to and opportunities for the firm. For
example, in a data analytic sense, it may allow a firm to monitor its
competitors and its customers to better understand opportunities and threats
around it. For example, are its competitors, customers, or suppliers facing
financial difficulty that might affect the company’s interactions with them
and/or open up new opportunities that otherwise it wouldn’t have
considered?
We address financial reporting and financial statement analysis
questions and Data Analytics in Chapter 8.

Lab Connection
Lab 1-1 has you explore questions financial accountants would answer with Data Analytics.

Tax
Traditionally, tax work dealt with compliance issues based on data from
transactions that have already taken place. Now, however, tax executives
must develop sophisticated tax planning capabilities that assist the company
with minimizing its taxes in such a way to avoid or prepare for a potential
audit. This shift in focus makes tax data analytics valuable for its ability to
help tax staffs predict what will happen rather than react to what just did
happen. Arguably, one of the things that Data Analytics does best is
predictive analytics—predicting the future! An example of how tax data
analytics might be used is predicting the potential tax consequences of an international transaction, R&D investment, or proposed merger or acquisition as part of one of the tax function’s most value-adding tasks: tax planning.
One of the issues of performing predictive Data Analytics is the efficient
organization and use of data stored across multiple systems on varying
platforms that were not originally designed for use in the tax department.
Organizing tax data into a data warehouse to be able to consistently model
and query the data is an important step toward developing the capability to
perform tax data analytics. This issue is exemplified by the 29 percent of tax departments that report their biggest challenge in executing an analytics strategy is integrating that strategy with the IT department and gaining access to available technology tools.12
We address tax questions and Data Analytics in Chapter 9.

PROGRESS CHECK
5. Why are management accounting and Data Analytics considered similar in

many respects?

6. How specifically will Data Analytics change the way a tax staff does its taxes?


THE DATA ANALYTICS PROCESS USING THE IMPACT CYCLE
LO 1-4
Describe the Data Analytics Process using the IMPACT cycle.

Data Analytics is a process to identify business questions and problems that can be addressed with data. We start to describe our Data Analytics Process
by using an established Data Analytics model called the IMPACT cycle by
Isson and Harriott (as shown in Exhibit 1-1).

EXHIBIT 1-1
The IMPACT Cycle

Source: Isson, J. P., and J. S. Harriott. Win with Advanced Business Analytics: Creating Business
Value from Your Data. Hoboken, NJ: Wiley, 2013.
We explain the full IMPACT cycle briefly here, but in more detail later
in Chapters 2, 3, and 4. We use its approach for thinking about the steps
included in Data Analytics throughout this textbook, all the way from
carefully identifying the question to accessing and analyzing the data to
communicating insights and tracking outcomes.13
Step 1: Identify the Questions (Chapter 1)
It all begins with understanding a business problem that needs addressing.
Questions can arise from many sources, including how to better attract
customers, how to price a product, how to reduce costs, or how to find
errors or fraud. Having a concrete, specific question that is potentially
answerable by Data Analytics is an important first step.
Indeed, accountants often possess a unique skillset to improve an
organization’s Data Analytics by their ability to ask the right questions,
especially since they often understand a company’s financial data. In other
words, “Your Data Won’t Speak Unless You Ask It the Right Data Analysis
Questions.”14 We could ask any question in the world, but if we don’t
ultimately have the right data to address the question, there really isn’t
much use for Data Analytics for those questions.

Additional attributes to consider might include the following:

Audience: Who is the audience that will use the results of the analysis
(internal auditor, CFO, financial analyst, tax professional, etc.)?
Scope: Is the question too narrow or too broad?
Use: How will the results be used? Is it to identify risks? Is it to make
data-driven business decisions?

Here are examples of potential questions accountants might address using Data Analytics:

Are employees circumventing internal controls over payments?
What are appropriate cost drivers for activity-based costing purposes?
To minimize taxes, should we have our company headquarters in
Dublin, Ireland, or in Chicago?
Are our customers paying us in a timely manner? Are we paying our
suppliers in a timely manner?
How can we more accurately predict the allowance for loan losses for
our bank loans?
How can we find transactions that are risky in terms of accounting
issues?
Who authorizes checks above $100,000?
How can errors made in journal entries be identified?
Should we outsource our products to Indonesia, or produce them
ourselves?
Step 2: Master the Data (Chapter 2)
Mastering the data requires one to know what data are available and
whether those data might be able to help address the business problem. We
need to know everything about the data, including how to access them, their availability and reliability (whether there are errors or missing data), the frequency of updates, and what time periods are covered to make sure the data coincide with the timing of our business problem.
In addition, to give us some idea of the data questions, we may want to
consider the following:

Review data availability in a firm’s internal systems (including those in the financial reporting system or enterprise systems that might occur in its accounting processes—financial, procure-to-pay, production, order-to-cash, human resources).
Review data availability in a firm’s external network, including those
that might already be housed in an existing data warehouse.
Examine data dictionaries and other contextual data—to provide details
about the data.
Evaluate and perform the ETL (extraction, transformation, and loading)
processes and assess the time required to complete.
Assess data validation and completeness—to provide a sense of the
reliability of the data.
Evaluate and perform data normalization—to reduce data redundancy
and improve data integrity.
Evaluate and perform data preparation and scrubbing—Data Analytics
professionals estimate that they spend between 50 and 90 percent of
their time cleaning data so the data can be analyzed.15
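
To make the ETL and scrubbing steps above more concrete, here is a minimal Python (pandas) sketch of an extract-transform-load pass. The file name, column names, and cleaning rules are illustrative assumptions rather than part of any particular accounting system.

```python
import pandas as pd

# Extract: read raw data exported from an enterprise system (hypothetical file).
raw = pd.read_csv("raw_invoices.csv")

# Transform: basic scrubbing steps such as those described above.
clean = (
    raw.drop_duplicates()                        # remove exact duplicate records
       .dropna(subset=["invoice_id", "amount"])  # drop rows missing key fields
       .assign(
           amount=lambda df: pd.to_numeric(df["amount"], errors="coerce"),
           invoice_date=lambda df: pd.to_datetime(df["invoice_date"], errors="coerce"),
       )
)

# Validate: a simple check that each invoice appears only once after scrubbing.
assert clean["invoice_id"].is_unique, "Duplicate invoice IDs remain after scrubbing"

# Load: write the prepared data where the analysis (or a data warehouse) can use it.
clean.to_csv("invoices_prepared.csv", index=False)
```
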
Step 3: Perform Test Plan (Chapter 3)
After mastering the data and after the data are ready (in step 2), we are
prepared for analysis. With the data ready for analysis, we need to think of
the right approach to the data to be able to answer the question.
In Data Analytics, we work to extract knowledge from the data to
address questions and problems. Using all available data, we see if we can
identify a relationship between the response (or dependent) variables and
those items that affect the response (also called predictor,
explanatory, or independent variables). To do so, we’ll
generally make a model, or a simplified representation of reality, to address
this purpose.
An example might be helpful here. Let’s say we are trying to predict
each of your classmates’ performance on their next intermediate accounting
exam. The response or dependent variable will be the score on the next
exam. What helps predict the performance of each exam will be our
predictor, explanatory, or independent variables. Variables such as study
time, score on last exam, IQ, and standardized test scores (ACT, SAT, etc.),
as well as student enjoyment of accounting, might all be considered.
Perhaps given your experience, you can name other predictor variables to
include in our model predicting exam performance.
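
As a minimal sketch of such a model, and using made-up data purely for illustration, the following Python code fits a simple linear regression of next-exam score on two hypothetical predictors (study time and score on the last exam).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: hours studied and last exam score (predictor variables),
# and the next exam score (response variable).
X = np.array([[5, 70], [10, 80], [3, 65], [8, 75], [12, 90], [6, 72]])
y = np.array([68, 82, 60, 77, 93, 71])

model = LinearRegression().fit(X, y)

# Predict the next exam score for a student who studied 9 hours
# and scored 78 on the last exam.
print(model.predict([[9, 78]]))
```

The fitted coefficients indicate how much the predicted score changes as each predictor changes, which is exactly the response/predictor relationship described above.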
The research question, the model, the data availability, and the expected
statistical inference may all suggest the use of different data approaches.
Provost and Fawcett16 detail eight different approaches to Data Analytics
depending on the question. We will discuss the most applicable ones to
accounting more formally in Chapter 3 and highlight accounting questions
that they might address. The eight different approaches include the
following:
Classification—An attempt to assign each unit (or individual) in a
population into a few categories. An example of classification might be,
of all the loans this bank has offered, which are most likely to default?
Or which loan applications are expected to be approved? Or which
transactions would a credit card company flag as potentially being
fraudulent and deny payment? Which companies are most likely to go
bankrupt in the next two years?
Regression—A data approach used to predict a specific dependent
variable value based on independent variable inputs using a statistical
model. Regression analysis might be used to assess the relationship
between an investment in R&D and subsequent operating income.
Another example would be the use of regression to identify an
appropriate cost driver to allocate overhead as part of activity-based
costing.
Similarity matching—An attempt to identify similar individuals based
on data known about them. A company may use similarity matching to
find new customers that may closely resemble their best customers (in
hopes that they find additional profitable customers).
Clustering—An attempt to divide individuals (like customers) into
groups (or clusters) in a useful or meaningful way. In other words,
identifying groups of similar data elements and the underlying drivers of
those groups. For example, clustering might be used to segment loyalty
card customers into groups based on buying behavior related to
shopping frequency or purchasing volume, for additional analysis and
marketing activities.
Co-occurrence grouping—An attempt to discover associations
between individuals based on transactions involving them. Amazon
might use this to sell another item to you by knowing what items are
“frequently bought together” or “Customers who bought this item also
bought….” Exhibit 1-2 shows how an Amazon search for the Yamaha MG10XU stereo mixer provides several related item suggestions to the
customer.
EXHIBIT 1-2
Example of Co-occurrence Grouping on Amazon.com

Amazon Inc.

Profiling—An attempt to characterize the “typical” behavior of an individual, group, or population by generating summary statistics about the data (including mean, median, minimum, maximum, and standard deviation). By understanding the typical behavior, we’ll be able to more easily identify abnormal behavior. When behavior departs from that typical behavior—which we’ll call an anomaly—then further investigation is warranted. Profiling might be used in accounting to identify fraud or just those transactions that might warrant some additional investigation (e.g., travel expenses that are three standard deviations above the norm).


Link prediction—An attempt to predict connections between two data items. This might be used in social media. For example, because an
individual might have 22 mutual Facebook friends with me and we both
attended Brigham Young University in the same year, is there a chance
we would like to be Facebook friends as well? Exhibit 1-3 provides an
example of this used in Facebook. Link prediction in an accounting
setting might work to use social media to look for relationships between
related parties that are not otherwise disclosed.

EXHIBIT 1-3
Example of Link Prediction on Facebook

Michael DeLeon/Getty Images; Sam Edwards/Glow Images; Daniel Ernst/Getty Images; Exactostock/SuperStock; McGraw Hill

Data reduction—A data approach that attempts to reduce the amount of information that needs to be considered to focus on the most critical items (e.g., highest cost, highest risk, largest impact, etc.). It does this by taking a large set of data (perhaps the population) and reducing it with a smaller set that has the vast majority of the critical information of the larger set. An example might include the potential to use these techniques in auditing. While auditing has employed various random and stratified sampling over the years, Data Analytics suggests new ways to highlight which transactions do not need the same level of additional vetting (such as substantive testing) as other transactions.
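
To make one of these approaches concrete, here is a minimal profiling sketch in Python (pandas) in the spirit of the travel-expense example above: it summarizes the “typical” expense amount and flags any amount more than three standard deviations above the mean. The data are randomly generated for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical travel-expense amounts: mostly typical values plus one outlier.
rng = np.random.default_rng(0)
amounts = np.append(rng.normal(loc=430, scale=40, size=200).round(2), 3900.00)
expenses = pd.DataFrame({"amount": amounts})

# Profile the "typical" behavior with summary statistics.
mean, std = expenses["amount"].mean(), expenses["amount"].std()

# Flag anomalies: amounts more than three standard deviations above the mean.
anomalies = expenses[expenses["amount"] > mean + 3 * std]
print(anomalies)
```
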
Step 4: Address and Refine Results (Chapter 3)
After the data have been analyzed (in step 3 of the IMPACT cycle), the
fourth step is to address and refine results. Data analysis is iterative. We
slice, dice, and manipulate the data; find correlations; test hypotheses; ask
ourselves further, hopefully better questions; ask colleagues what they
think; and revise and rerun the analysis potentially multiple times. But once
that is complete, we have the results ready to communicate to interested
stakeholders that hopefully directly address their questions.
Steps 5 and 6: Communicate Insights and
Track Outcomes (Chapter 4 and each chapter
thereafter)
Once the results have been determined (in step 4 of the IMPACT cycle),
insights are formed by decision makers and are communicated (the “C” in
the IMPACT cycle) and some outcomes will be continuously tracked (the
“T” in the IMPACT cycle).
Chapter 4 discusses ways to communicate results, including the use of
executive summaries, static reports, digital dashboards, and data
visualizations. Data Analytics is especially interested in reporting results
that help decision makers see the data in an all-new way to develop insights
that help answer business questions, recognizing that different users
consume deliverables in a potentially different way. Increasingly, digital
dashboards and data visualizations are particularly helpful in
communicating insights and tracking outcomes.

Back to Step 1
Since the IMPACT cycle is iterative, once insights are gained and outcomes
are tracked, new more refined questions emerge that may use the same or
different data sources with potentially different analyses and thus, the
IMPACT cycle begins anew.

PROGRESS CHECK
7. Let’s say we are trying to predict how much money college students spend on

fast food each week. What would be the response, or dependent, variable?

What would be examples of independent variables?

8. How might a data reduction approach be used in auditing to allow the auditor to

spend more time and effort on the most important (e.g., most risky, largest

dollar volume, etc.) items?


DATA ANALYTIC SKILLS AND TOOLS
NEEDED BY ANALYTIC-MINDED
ACCOUNTANTS
LO 1-5
Describe the skills and tools needed by accountants.

While we don’t believe that accountants need to become data scientists—they may never need to build a database from scratch or perform the real, hardcore Data Analytics—they must know how to do the following:

Clearly articulate the business problem the company is facing.
Communicate with the data scientists about specific data needs and understand the underlying quality of the data.


Draw appropriate conclusions to the business problem based on the data and make recommendations on a timely basis.
Present their results to individual members of management (CEOs, audit managers, etc.) in an accessible manner to each member.

Consistent with that, in this text we emphasize skills that analytic-minded accountants should have in the following seven areas:

1. Developed analytics mindset—know when and how Data Analytics can address business questions.
2. Data scrubbing and data preparation—comprehend the process needed to
clean and prepare the data before analysis.
3. Data quality—recognize what is meant by data quality, be it
completeness, reliability, or validity.
4. Descriptive data analysis—perform basic analysis to understand the
quality of the underlying data and its ability to address the business
question.
5. Data analysis through data manipulation—demonstrate ability to sort,
rearrange, merge, and reconfigure data in a manner that allows enhanced
analysis. This may include diagnostic, predictive, or prescriptive
analytics to appropriately analyze the data.
6. Statistical data analysis competency—identify and implement an
approach that will use statistical data analysis to draw conclusions and
make recommendations on a timely basis.
7. Data visualization and data reporting—report results of analysis in an
accessible way to each varied decision maker and his or her specific
needs.

We address these seven skills throughout the first four chapters in the
text in hopes that the analytic-minded accountant will develop and practice
these skills to be ready to address business questions. We then demonstrate
these skills in the labs and hands-on analysis throughout the rest of the
book.

Data Analytics at Work

What Does a Data Analyst Do at a Big Four Accounting Firm?

Data Sources: We extract financial data from a number of different ERP systems including SAP, Abacus, Sage, and Microsoft Navision (among others).

Data Scrubbing and Data Preparation: A huge part of our time goes into data cleaning and data transformation.

Tools Used: Excel, Unix commands, SQL, and Python are used to automate large chunks of our work.

Knowledge Needed: Basic Excel, programming skills (SQL, Python), and audit knowledge such as understanding journal entries and trial balances are needed.
Source: “Data Analyst at a Big 4—What Is It Like? My Opinion Working as a Data
Analyst at a Big Four,” https://cryptobulls.info/data-analyst-at-a-big-4-what-is-it-like-
pros-cons-ernst-young-deloitte-pwc-kpmg, posted February 29, 2020 (accessed
January 2, 2021).

Choose the Right Data Analytics Tools


In addition to developing the right skills, it is also important to be familiar
with the right Data Analytics tools for each task. There are many tools
available for Data Analytics preparation, modeling, and
visualization. Gartner annually assesses a collection of these tools
and creates the “magic quadrant” for business intelligence, depicted in
Exhibit 1-4. The magic quadrant can provide insight into which tools you
should consider using.

EXHIBIT 1-4
Gartner Magic Quadrant for Business Intelligence and Analytics
Platforms

Source: Sallam, R. L., C. Howson, C. J. Idoine, T. W. Oestreich, J. L. Richardson, and J. Tapadinhas, “Magic Quadrant for Business Intelligence and Analytics Platforms,” Gartner RAS Core Research Notes, Gartner, Stamford, CT (2020).
Based on Gartner’s magic quadrant, it is easy to see that Tableau and
Microsoft provide innovative solutions. While there are other tools that are
popular in different industries, such as Qlik and TIBCO, Tableau and
Microsoft tools are the ones you will most likely encounter because of their
position as leaders in the Data Analytics space. For this reason, each of the
labs throughout this textbook will give you or your instructor the option to
choose either a Microsoft Track or a Tableau Track to help you become
proficient in those tools. The skills you learn as you work through the labs
are transferrable to other tools as well.

The Microsoft Track


Microsoft’s offerings for Data Analytics and business intelligence (BI)
include Excel, Power Query, Power BI, and Power Automate. It is likely
that you already have some familiarity with Excel as it is used for
everything from recording transactions to running calculations and
preparing financial reports. These tools are summarized in Exhibit 1-5.
EXHIBIT 1-5
Microsoft Data Analytics Tools

Excel is the most ubiquitous spreadsheet software and most commonly used for basic data analysis. It allows for the creation of tables, advanced
formulas to do quick or complex calculations, and the ability to create
PivotTables and basic charts and graphs. One major issue with using Excel
for analysis of large datasets is the 1,048,576 row limit due to memory
constraints. It is available on Windows and Mac as well as through the
Microsoft 365 online service for simple collaboration and
sharing, although the most complete set of features and
compatibility with advanced plug-ins is available only on Windows.
Power Query is a tool built into Excel and Power BI Desktop on
Windows that lets Excel connect to a variety of different data sources, such
as tables in Excel, databases hosted on popular platforms like SQL Server,
or through open database connectivity (ODBC) connections. Power Query
makes it possible to connect, manipulate, clean, and join data so you can
pull them into your Excel sheet or use them in Power BI to create summary
reports and advanced visualizations. Additionally, it tracks each step you
perform so you can apply the same transformations to new data without
recreating the work from scratch.
Power BI is an analytic platform that enables generation of simple or
advanced Data Analytics models and visualizations that can be compiled
into dashboards for easy sharing with relevant stakeholders. It builds on
data from Excel or other databases and can leverage models created with
Power Query to quickly summarize key data findings. Microsoft provides
Power BI Desktop for free only on Windows or through a web-based app,
though the online version does not have all of the features of the desktop
version and is primarily used for sharing.
Power Automate is a tool that leverages robotics process automation
(RPA) to automate routine tasks and workflows, such as scraping and
collecting data from nonstructured sources, including emails and other
online services. These can pull data from relevant sources based on events,
such as when an invoice is generated. Power Automate is a web-based
subscription service with a tool that works only on Windows to automate
keystrokes and mouse clicks.

The Tableau Track


In previous years, Tableau was ranked slightly higher than Microsoft on its
ability to execute and it continues to be a popular choice for analytics
professionals. Tableau’s primary offerings include Tableau Prep for data
preparation and Tableau Desktop for data visualization and storytelling.
Tableau has an advantage over Microsoft in that its tools are available for
both Windows and Mac computers. Additionally, Tableau offers online
services through Tableau Server and Tableau Online with the same,
complete feature set as their apps. These are summarized in Exhibit 1-6.

EXHIBIT 1-6
Tableau Data Analytics Tools

Tableau Prep is primarily used for data combination, cleaning, manipulation, and insights. It enables users to interact with data and quickly
identify data quality issues with a clear map of steps performed
so others can review the cleaning process. It is available on
Windows, Mac, and Tableau Online.
Tableau Desktop can be used to generate basic to advanced Data
Analytics models and visualizations with an easy-to-use drag-and-drop
interface.
Tableau Public is a free limited edition of Tableau Desktop that is
specifically tailored to sharing and analyzing public datasets. It has some
significant limitations for broader analysis.

PROGRESS CHECK
9. Given the “magic quadrant” in Exhibit 1-4, why are the software tools

represented by the Microsoft and Tableau tracks considered innovative?

10. Why is having the Tableau software tools fully available on both Windows and

Mac computers an advantage for Tableau over Microsoft?

HANDS-ON EXAMPLE OF THE IMPACT MODEL
LO 1-6
Explain how the IMPACT model may be used to address a specific business question.

Here we provide a complete, hands-on example of the IMPACT model to show how it could be implemented for a specific situation.
Let’s suppose I am trying to get a loan to pay off some credit card debt
and my friend has told me about a new source of funds that doesn’t involve
a bank. In recent years, facilitated by the Internet, peer-to-peer lenders
allow individuals to both borrow and lend money to each other. While there
are other peer-to-peer lenders, in this case, we will specifically consider the
LendingClub.
My question is whether I will be able to get a loan given my prior loan
history (poor), credit score, and the like. According to our approaches
mentioned, this would be an example of a classification approach because
we are attempting to predict whether a person applying for a loan will be
approved and funded or whether she will be denied a loan.
Identify the Questions
Stated specifically, our question is, “What are some characteristics of
rejected loans?”

Master the Data


LendingClub is a U.S.-based, peer-to-peer lending company,
headquartered in San Francisco, California. LendingClub facilitates both
borrowing and lending by providing a platform for unsecured personal
loans between $1,000 and $35,000. The loan period is for either 3 or 5
years. There is information available that allows potential investors to
search and browse the loan listings on the LendingClub website and select
loans in which they would like to invest. The available information includes
information supplied about the borrower, amount of the loan, loan grade
(and related loan interest rate), and loan purpose. Investors invest in the
loans and make money from interest. LendingClub makes money by
charging borrowers an origination fee and investors a service fee. Since
2007, hundreds of thousands of borrowers have obtained more than $60
billion in loans via LendingClub.17
Some basic lending statistics are included on the LendingClub Statistics
website (Exhibit 1-7). Each bar represents the volume of loans each quarter
during its respective year.


EXHIBIT 1-7
LendingClub Statistics

Source: Accessed December 2020, https://www.lendingclub.com/info/statistics.action


Borrowers borrow money for a variety of reasons, including refinancing
other debt and paying off credit cards, as well as borrowing for other
purposes (Exhibit 1-8).

EXHIBIT 1-8
LendingClub Statistics by Reported Loan Purpose
42.33% of LendingClub borrowers report using their loans to refinance
existing loans as of September 30, 2020.
Source: Accessed December 2020, https://www.lendingclub.com/info/statistics.action
LendingClub provides datasets on the loans it approved and funded as
well as data for the loans that were declined. To address the question posed,
“What are some characteristics of rejected loans?,” we’ll use the dataset of
rejected loans.
The rejected loan datasets and related data dictionary are available from
your instructor or from Connect (in Additional Student Resources).

As we learn about the data, it is important to know what is available to
us. To that end, there is a data dictionary that provides descriptions for all
of the data attributes of the dataset. A cut-out of the data dictionary for the
rejected stats file (i.e., the statistics about those loans rejected) is shown in
Exhibit 1-9.

EXHIBIT 1-9
2007–2012 LendingClub Data Dictionary for Declined Loan Data
Source: LendingClub

Field (RejectStats file): Description
Amount Requested: Total requested loan amount
Application Date: Date of borrower application
Loan Title: Loan title
Risk_Score: Borrower risk (FICO) score
Debt-To-Income Ratio: Ratio of borrower total monthly debt payments divided by monthly income
Zip Code: The first 3 numbers of the borrower zip code provided from the loan application
State: Two-digit state abbreviation provided from the loan application
Employment Length: Employment length in years, where 0 is less than 1 and 10 is greater than 10
Policy Code: policy_code=1 if publicly available; policy_code=2 if not publicly available

We could also take a look at the data files available for the funded loan
data. However, for our analysis in the rest of this chapter, we use the Excel
file “DAA Chapter 1-1 Data” that has rejected loan statistics from
LendingClub for the time period of 2007 to 2012. It is a cleaned-up,
transformed file ready for analysis. We’ll learn more about data scrubbing
and preparation of the data in Chapter 2.
Exhibit 1-10 provides a cut-out of the 2007–2012 “Declined Loan”
dataset provided.

EXHIBIT 1-10
2007–2012 Declined Loan Applications (DAA Chapter 1-1 Data)
Dataset

Microsoft Excel, 2016



Perform Test Plan


Considering our question, “What are the characteristics of rejected loans (at
LendingClub)?,” and the available data, we will do three analyses to assess
what is considered in rejecting/accepting a loan, including:

1. The debt-to-income ratios and number of rejected loans.
2. The length of employment and number of rejected loans.
3. The credit (or risk) score and number of rejected loans.
Because LendingClub collects these three loan characteristics, we believe they provide LendingClub with the data needed to assess whether the potential borrower will be able to pay back the loan, and they give us an idea of whether our loan will be approved or rejected.
The first analysis we perform considers the debt-to-income ratio of the
potential borrower. That is, before adding this potential loan, how big is the
potential borrower’s debt compared to the size of the potential borrower’s
annual income?
To consider the debt-to-income ratio in our analysis, three buckets
(labeled DTI bucket) are constructed for each grouping of the debt-to-
income ratio. These three buckets include the following:

1. High (debt is greater than 20 percent of income).
2. Medium (“Mid”) (debt is between 10 and 20 percent of income).
3. Low (debt is less than 10 percent of income).
Once those buckets are constructed, we are ready to analyze the
breakdown of rejected loan applications by the debt-to-income ratio.
The Excel PivotTable is an easy way to make comparisons between the
different levels of DTI. When we run a PivotTable analysis, we count the number of rejected loan applications and group them by DTI bucket (see Exhibit 1-11). The PivotTable counts the number of loan applications in each of the three DTI buckets: high, medium (mid), and low.
This suggests that because the high DTI bucket has the highest number of
loan applications, perhaps the applicant asked for a loan that was too big
given his or her income. LendingClub might have seen that as too big of a
risk and chosen to not extend the loan to the borrower using the debt-to-
income ratio as an indicator.

EXHIBIT 1-11
LendingClub Declined Loan Applications by DTI (Debt-to-Income)
DTI bucket includes high (debt > 20 percent of income), medium (“mid”)
(debt between 10 and 20 percent of income), and low (debt < 10 percent of
income).
Microsoft Excel, 2016
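
For readers who prefer a scripting tool, the same bucketing and counting can be sketched in Python (pandas). The file name and the exact debt-to-income column header are assumptions based on the dataset and data dictionary described above; the bucket boundaries follow the text.

```python
import pandas as pd

# Load the declined-loan data (file and column names are assumptions).
rejected = pd.read_csv("DAA Chapter 1-1 Data.csv")

# The DTI field may be stored as text such as "18.5%"; coerce it to a number.
dti = pd.to_numeric(
    rejected["Debt-To-Income Ratio"].astype(str).str.rstrip("%"), errors="coerce"
)

# Build the three buckets described in the text: low (<10%), mid (10-20%), high (>20%).
rejected["DTI bucket"] = pd.cut(
    dti, bins=[-float("inf"), 10, 20, float("inf")], labels=["Low", "Mid", "High"]
)

# Count rejected applications per bucket (the pandas equivalent of the PivotTable).
print(rejected["DTI bucket"].value_counts())
```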
The second analysis is on the length of employment and its relationship
with rejected loans (see Exhibit 1-12). Arguably, the longer the
employment, the more stable of a job and income stream you will have to
ultimately repay the loan. LendingClub reports the number of years of
employment for each of the rejected applications. The PivotTable analysis
lists the number of loans by the length of employment. Almost 77 percent
(495,109 out of 645,414) of the total rejected loan applicants had worked at a job for less than 1 year, suggesting potentially an important reason for rejecting the requested loan. Perhaps some had worked a week, or just a month, and still wanted a big loan?

EXHIBIT 1-12
LendingClub Declined Loan Applications by Employment Length
(Years of Experience)

Microsoft Excel, 2016


The third analysis we perform is to consider the credit or risk score of
the applicant. As noted in Exhibit 1-13, risk scores are typically classified
in this way with those in the excellent and very good category receiving the
lowest possible interest rates and best terms with a credit score above 750.
On the other end of the spectrum are those with very bad credit (with a
credit score less than 600).

EXHIBIT 1-13
Breakdown of Customer Credit Scores (or Risk Scores)

Source: Cafecredit.com
We will classify the sample into excellent, very good, good, fair, poor, and very bad credit according to the credit score breakdown noted in Exhibit 1-13.
As part of the analysis of credit score and rejected loans, we again perform
PivotTable analysis (as seen in Exhibit 1-14) by counting the number of
rejected loan applications by credit (risk) score. We’ll note in the rejected
loans that nearly 82 percent [(167,379 + 151,716 + 207,234)/645,414] of
the applicants have either very bad, poor, or fair credit ratings, suggesting
this might be a good reason for a loan rejection. We also note that only 0.3
percent (2,494/645,414) of those rejected loan applications had excellent
credit.


EXHIBIT 1-14
The Count of LendingClub Rejected Loan Applications by Credit or
Risk Score Classification Using PivotTable Analysis
(PivotTable shown here required manually sorting rows to get in proper
order.)
Microsoft Excel, 2016


Address and Refine Results


Now that we have completed the basic analysis, we can refine our analysis
for greater insights. An example of this more refined analysis might be a
further investigation of the rejected loans. For example, if these are the
applications that were all rejected, the question is how many of these that
might apply for a loan not only had excellent credit, but also had worked
more than 10 years and had asked for a loan that was less than 10 percent of
their income (in the low DTI bucket)? Use of a PivotTable (as shown in
Exhibit 1-15) allows us to consider this three-way interaction and provides
an answer of 365 out of 645,414 (0.057 percent of the total). This might
suggest that the use of these three metrics is reasonable at predicting loan
rejection because the number who have excellent credit, worked more than
10 years, and requested a loan that was less than 10 percent of their income
was such a small percentage of the total.

EXHIBIT 1-15
The Count of LendingClub Declined Loan Applications by Credit (or
Risk Score), Debt-to-Income (DTI Bucket), and Employment Length
Using PivotTable Analysis (Highlighting Added)

Microsoft Excel, 2016
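
Continuing the pandas sketch from the DTI analysis, a three-way count similar to Exhibit 1-15 can be produced with a single pivot_table call. The file and column names are assumptions, and the credit-score cut points other than the 600 and 750 boundaries mentioned in the text are purely illustrative.

```python
import pandas as pd

rejected = pd.read_csv("DAA Chapter 1-1 Data.csv")  # file name is an assumption

# Derive the DTI bucket (as in the earlier sketch) and an illustrative credit-rating band.
rejected["DTI bucket"] = pd.cut(
    pd.to_numeric(rejected["Debt-To-Income Ratio"].astype(str).str.rstrip("%"), errors="coerce"),
    bins=[-float("inf"), 10, 20, float("inf")],
    labels=["Low", "Mid", "High"],
)
rejected["Credit rating"] = pd.cut(
    rejected["Risk_Score"],
    bins=[0, 600, 650, 700, 750, 850],  # only 600 and 750 are given in the text
    labels=["Very bad/Poor", "Fair", "Good", "Very good", "Excellent"],
)

# Count applications by credit rating, employment length, and DTI bucket.
three_way = pd.pivot_table(
    rejected,
    index=["Credit rating", "Employment Length", "DTI bucket"],
    values="Amount Requested",
    aggfunc="count",
)
print(three_way)
```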


Perhaps those with excellent credit just asked for too big of a loan given
their existing debt and that is why they were rejected. Exhibit 1-16 shows
the PivotTable analysis. The analysis shows those with excellent credit
asked for a larger loan (16.2 percent of income) given the debt
they already had as compared to any of the others, suggesting a
reason even those potential borrowers with excellent credit were rejected.

EXHIBIT 1-16
The Average Debt-to-Income Ratio (Shown as a Percentage) by Credit
(Risk) Score for LendingClub Declined Loan Applications Using
PivotTable Analysis

Microsoft Excel, 2016

Communicate Insights
Certainly further and more sophisticated analysis could be performed, but at
this point we have a pretty good idea of what LendingClub uses to decide
whether to extend or reject a loan to a potential borrower. We can
communicate these insights either by showing the PivotTables or simply
stating what three of the determinants are. What is the most effective
communication? Just showing the PivotTables themselves, showing a graph
of the results, or simply sharing the names of these three determinants with the
decision makers? Knowing the decision makers and how they like to
receive this information will help the analyst determine how to
communicate insights.

Track Outcomes
There are a wide variety of outcomes that could be tracked. But in this case,
it might be best to see if we could predict future outcomes. For example, the
data we analyzed were from 2007 to 2012. We could make our predictions
for subsequent years based on what we had found in the past and then test
to see how accurate we are with those predictions. We could also change
our prediction model when we learn new insights and additional data
become available.
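
One hedged way to operationalize this tracking is sketched below: train a simple classifier on pre-2013 applications and measure its accuracy on later years as they arrive. It assumes a hypothetical combined file of funded and rejected applications with an approval label and an application year; all file and column names are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Hypothetical combined dataset of funded and rejected applications.
loans = pd.read_csv("lendingclub_combined.csv")
features = ["Risk_Score", "Debt-To-Income Ratio", "Employment Length"]

train = loans[loans["year"] <= 2012]   # what we learned from 2007-2012
test = loans[loans["year"] > 2012]     # subsequent years, used to track outcomes

model = RandomForestClassifier(random_state=0)
model.fit(train[features], train["approved"])

# Track the outcome: how accurate are our past-based predictions on new data?
print(accuracy_score(test["approved"], model.predict(test[features])))
```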


PROGRESS CHECK
11. Lenders often use the data item of whether a potential borrower rents or owns

their house. Beyond the three characteristics of rejected loans analyzed in this

section, do you believe this data item would be an important determinant of

rejected loans? Defend your answer.

12. Performing your own analysis, download the rejected loans dataset titled “DAA

Chapter 1-1 Data” and perform an Excel PivotTable analysis by state (including

the District of Columbia) and figure out the number of rejected applications for
the state of California. That is, count the loans by state and see what

percentage of the rejected loans came from California. How close is that to the

relative proportion of the population of California as compared to that of the

United States?

13. Performing your own analysis, download the rejected loans dataset titled “DAA

Chapter 1-1 Data” and run an Excel PivotTable by risk (or credit) score

classification and DTI bucket to determine the number of (or percentage of)

rejected loans requested by those rated as having an excellent credit score.

Summary
In this chapter, we discussed how businesses and accountants derive
value from Data Analytics. We gave some specific examples of how
Data Analytics is used in business, auditing, managerial accounting,
financial accounting, and tax accounting.
We introduced the IMPACT model and explained how it is used to
address accounting questions. And then we talked specifically about the
importance of identifying the question. We walked through the first few
steps of the IMPACT model and introduced eight data approaches that
might be used to address different accounting questions. We also
discussed the data analytic skills needed by analytic-minded accountants.
We followed this up using a hands-on example of the IMPACT
model, namely what are the characteristics of rejected loans at
LendingClub. We performed this analysis using various filtering and
PivotTable tasks.

■ With data all around us, businesses and accountants are looking at Data Analytics to extract the value that the data might possess. (LO 1-1, 1-2, 1-3)
■ Data Analytics is changing the audit and the way that
accountants look for risk. Now, auditors can consider 100
percent of the transactions in their audit testing. It is also
helpful in finding anomalous or unusual transactions. Data
Analytics is also changing the way financial accounting,
managerial accounting, and taxes are done at a company. (LO
1-3)
■ The IMPACT cycle is a means of performing Data Analytics
that goes all the way from identifying the question, to
mastering the data, to performing data analyses and
communicating and tracking results. It is recursive in nature,
suggesting that as questions are addressed, new, more refined
questions may emerge that can be addressed in a similar way.
(LO 1-4)
■ Eight data approaches address different ways of testing the
data: classification, regression, similarity matching,
clustering, co-occurrence grouping, profiling, link prediction,
and data reduction. These are explained in more detail in
Chapter 3. (LO 1-4)
■ Data analytic skills needed by analytic-minded accountants
are specified and are consistent with the IMPACT cycle,
including the following: (LO 1-5)
◦ Developed analytics mindset.
◦ Data scrubbing and data preparation.
◦ Data quality.
◦ Descriptive data analysis.
◦ Data analysis through data manipulation.
◦ Statistical data analysis competency.
◦ Data visualization and data reporting.
■ We showed an example of the IMPACT cycle using
LendingClub data regarding rejected loans to illustrate the
steps of the IMPACT cycle. (LO 1-6)

Key Words
Big Data (4) Datasets that are too large and complex for businesses’ existing systems to handle
utilizing their traditional capabilities to capture, store, manage, and analyze these datasets.
classification (11) A data approach that attempts to assign each unit in a population into a few
categories potentially to help with predictions.
clustering (11) A data approach that attempts to divide individuals (like customers) into groups
(or clusters) in a useful or meaningful way.
co-occurrence grouping (11) A data approach that attempts to discover associations between
individuals based on transactions involving them.
Data Analytics (4) The process of evaluating data with the purpose of drawing conclusions to
address business questions. Indeed, effective Data Analytics provides a way to search through
large structured and unstructured data to identify unknown patterns or relationships.
data dictionary (19) Centralized repository of descriptions for all of the data attributes of the
dataset.
data reduction (12) A data approach that attempts to reduce the amount of information that
needs to be considered to focus on the most critical items (i.e., highest cost, highest risk, largest
impact, etc.).
link prediction (12) A data approach that attempts to predict a relationship between two data
items.
predictor (or independent or explanatory) variable (11) A variable that predicts or
explains the value of another (response) variable.
profiling (11) A data approach that attempts to characterize the “typical” behavior of an
individual, group, or population by generating summary statistics about the data (including mean,
standard deviations, etc.).
regression (11) A data approach that attempts to estimate or predict, for each unit, the
numerical value of some variable using some type of statistical model.
response (or dependent) variable (10) A variable that responds to, or is dependent on,
another.
similarity matching (11) A data approach that attempts to identify similar individuals based on
data known about them.
structured data (4) Data that are organized and reside in a fixed field with a record or a file.
Such data are generally contained in a relational database or spreadsheet and are readily
searchable by search algorithms.
unstructured data (4) Data that do not adhere to a predefined data model in a tabular format.
ANSWERS TO PROGRESS CHECKS
1. The plethora of data alone does not necessarily translate into value. However, if we carefully analyze the data to help address critical business problems and questions, the data have the potential to create value.


2. Banks frequently use credit scores from outside sources like Experian, TransUnion, and Equifax to evaluate the creditworthiness of their customers. However, if banks have access to all of their customers’ banking information, Data Analytics would allow them to evaluate their customers’ creditworthiness directly. Banks would know how much money customers have and how they spend it. Banks would know if customers had prior loans and if those loans were paid in a timely manner. Banks would know where customers work and the size and stability of their monthly income via direct deposits. All of these combined, in addition to a credit score, might be used to assess creditworthiness if customers desire a loan. It might also give banks needed information for a marketing campaign to target potential creditworthy customers.

3. The brand manager at Procter and Gamble might use Data Analytics to see what is being said about Procter and Gamble’s Tide Pods product on social media websites (e.g., Snapchat, Twitter, Instagram, and Facebook), particularly those that attract an older demographic. This will help the manager assess if there is a problem with the perceptions of its laundry detergent products.

4. Data Analytics might be used to collect information on the amount of overtime. Who worked overtime? What were they working on? Do we actually need more full-time employees to reduce the level of overtime (and its related costs to the company and to the employees)? Would it be cost-effective to just hire full-time employees instead of paying overtime? How much will costs increase just to pay for fringe benefits (health care, retirement, etc.) for new employees versus just paying existing employees for their overtime? All of these questions could be addressed by analyzing recent records explaining the use of overtime.

5. Management accounting and Data Analytics both (1) address questions asked by management, (2) find data to address those questions, (3) analyze the data, and (4) report the results to management. In all material respects, management accounting and Data Analytics are similar, if not identical.

6. The tax staff would become much more adept at efficiently organizing data from multiple systems across an organization and performing Data Analytics to help with tax planning to structure transactions in a way that might minimize taxes.

7. The dependent variable could be the amount of money spent on fast food. Independent variables could be proximity of the fast food, ability to cook own food, discretionary income, socioeconomic status, and so on.

8. The data reduction approach might help auditors spend more time and effort on the most risky transactions or on those that might be anomalous in nature. This will help them more efficiently spend their time on items that may well be of highest importance.

9. According to the “magic quadrant,” the software tools represented by the Microsoft and Tableau Tracks are considered leaders because they lead the market on both the “ability to execute” and “completeness of vision” dimensions.

10. Having Tableau software tools available on both Mac and Windows computers gives the analyst needed flexibility that is not available with the Microsoft Track, whose tools are fully available only on Windows computers.

11. The data item indicating whether a potential borrower owns or rents a home would be expected to complement the risk score, debt levels (DTI bucket), and length of employment, since it gives a potential lender additional data on the financial position and financial obligations (mortgage or rent payments) of the borrower.
12. An analysis of the rejected loans suggests that 85,793 of the total 645,414 rejected loans were from the state of California. That represents 13.29 percent of the total rejected loans. This is greater than the relative population of California to the United States as of the 2010 census, of 12.1 percent (37,253,956/308,745,538).

13. A PivotTable analysis of the rejected loans suggests that more than 30.6 percent (762/2,494) of those in the excellent risk credit score range asked for a loan with a debt-to-income ratio of more than 20 percent.
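For readers who would rather verify these figures with a script than with a PivotTable, the following is a minimal pandas sketch of the state analysis described in answer 12. The file name and the State column name are assumptions about how the “DAA Chapter 1-1 Data” file was exported; adjust them to match your own extract.

```python
import pandas as pd

# Hypothetical CSV export of the chapter's rejected-loans file.
loans = pd.read_csv("DAA_Chapter_1-1_Data.csv")

# Count rejected loans by state (the PivotTable equivalent) and compute California's share.
by_state = loans["State"].value_counts()
ca_share = by_state["CA"] / by_state.sum()
print(f"California: {by_state['CA']:,} of {by_state.sum():,} rejected loans ({ca_share:.2%})")

# Compare with California's share of the 2010 U.S. population (per answer 12).
print(f"Population share: {37_253_956 / 308_745_538:.2%}")
```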


Multiple Choice Questions
1. (LO 1-1) Big Data is often described by the four Vs, or

a. volume, velocity, veracity, and variability.

b. volume, velocity, veracity, and variety.

c. volume, volatility, veracity, and variability.

d. variability, velocity, veracity, and variety.

2. (LO 1-4) Which data approach attempts to assign each unit in a population into a

small set of classes (or groups) where the unit best fits?

a. Regression

b. Similarity matching

c. Co-occurrence grouping

d. Classification

3. (LO 1-4) Which data approach attempts to identify similar individuals based on data

known about them?

a. Classification

b. Regression

c. Similarity matching

d. Data reduction

4. (LO 1-4) Which data approach attempts to predict connections between two data

items?

a. Profiling
b. Classification

c. Link prediction

d. Regression

5. (LO 1-6) Which of these terms is defined as being a central repository of

descriptions for all of the data attributes of the dataset?

a. Big Data

b. Data warehouse

c. Data dictionary

d. Data Analytics

6. (LO 1-5) Which of the following was not emphasized as a skill that analytic-minded accountants should have?

a. Developed an analytics mindset

b. Data scrubbing and data preparation

c. Classification of test approaches

d. Statistical data analysis competency

7. (LO 1-5) In which areas were skills not emphasized for analytic-minded

accountants?

a. Data quality

b. Descriptive data analysis

c. Data visualization and data reporting

d. Data and systems analysis and design

8. (LO 1-4) The IMPACT cycle includes all except the following steps:

a. perform test plan.


b. visualize the data.

c. master the data.

d. track outcomes.

9. (LO 1-4) The IMPACT cycle specifically includes all except the following steps:

a. data preparation.

b. communicate insights.

c. address and refine results.

d. perform test plan.

10. (LO 1-1) By the year 2024, the volume of data created, captured, copied, and

consumed worldwide will be 149 _____.

a. zettabytes

b. petabytes

c. exabytes

d. yottabytes

page 30

Discussion and Analysis


1. (LO 1-1) The opening article “Accountants to Rely More on Big Data in 2020” suggested that accountants would increasingly be implementing Big Data in their business processes. Why is that? How can Data Analytics help accountants do their jobs?

2. (LO 1-1) Define Data Analytics and explain how a university might use its

techniques to recruit and attract potential students.


3. (LO 1-2) Give a specific example of how Data Analytics creates value for

businesses.

4. (LO 1-3) Give a specific example of how Data Analytics creates value for auditing.

5. (LO 1-3) How might Data Analytics be used in financial reporting? And how might it

be used in doing tax planning?

6. (LO 1-3) How is the role of management accounting similar to the role of the data

analyst?

7. (LO 1-4) Describe the IMPACT cycle. Why does its order of the processes and its

recursive nature make sense?

8. (LO 1-4) Why is identifying the question such a critical first step in the IMPACT

process cycle?

9. (LO 1-4) What is included in mastering the data as part of the IMPACT cycle

described in the chapter?

10. (LO 1-4) What data approach mentioned in the chapter might be used by Facebook

to find friends?

11. (LO 1-4) Auditors will frequently use the data reduction approach when considering

potentially risky transactions. Provide an example of why focusing on a portion of

the total number of transactions might be important for auditors to assess risk.

12. (LO 1-4) Which data approach might be used to assess the appropriate level of the

allowance for doubtful accounts?

13. (LO 1-6) Why might the debt-to-income attribute included in the declined loans

dataset considered in the chapter be a predictor of declined loans? How about the

credit (risk) score?

14. (LO 1-6) To address the question “Will I receive a loan from LendingClub?” we had

available data to assess the relationship among (1) the debt-to-income ratios and

number of rejected loans, (2) the length of employment and number of rejected

loans, and (3) the credit (or risk) score and number of rejected loans. What
additional data would you recommend to further assess whether a loan would be

offered? Why would they be helpful?

Problems
1. (LO 1-4) Match each specific Data Analytics test to a specific test approach, as part

of performing a test plan:

Classification

Regression

Similarity Matching

Clustering

Co-occurrence Grouping

Profiling

Link Prediction

Data Reduction

page 31

Specific Data Analytics Test | Test Approach
1. Predict which firms will go bankrupt and which firms will not go bankrupt.
2. Use stratified sampling to focus audit effort on transactions with greatest risk.
3. Work to understand normal behavior, to then be able to identify abnormal behavior (such as fraud).
4. Look for relationships between related parties that are not otherwise disclosed.
5. Predict which new customers resemble the company’s best customers.
6. Predict the relationship between an investment in advertising expenditures and subsequent operating income.
7. Segment all of the company’s customers into groups that will allow further specific analysis.
8. The customers who buy product X will be most likely to be also interested in product Y.

2. (LO 1-4) Match each of the specific Data Analytics tasks to the stage of the

IMPACT cycle:

Identify the Questions

Master the Data

Perform Test Plan

Address and Refine Results

Communicate Insights

Track Outcomes
Specific Data Analytics Test | Stage of IMPACT Cycle
1. Should we use company-specific data or macro-economic data to address the accounting question?
2. What are appropriate cost drivers for activity-based costing purposes?
3. Should we consider using regression analysis or clustering analysis to evaluate the data?
4. Should we use tables or graphs to show management what we’ve found?
5. Now that we’ve evaluated the data one way, should we perform another analysis to gain additional insights?
6. What type of dashboard should we use to get the latest, up-to-date results?

3. (LO 1-5) Match the specific analysis need/characteristic to the appropriate

Microsoft Track software tool:

Excel

Power Query

Power BI

Power Automate

page 32
Specific Analysis Need/Characteristic | Microsoft Track Tool
1. Basic visualization
2. Robotics process automation
3. Data joining
4. Advanced visualization
5. Works on Windows/Mac/Online platforms
6. Dashboards
7. Collect data from multiple sources
8. Data cleaning

4. (LO 1-5) Match the specific analysis need/characteristic to the appropriate Tableau

Track software tool:

Tableau Prep Builder

Tableau Desktop

Tableau Public

Specific Analysis Need/Characteristic | Tableau Track Tool
1. Advanced visualization
2. Analyze and share public datasets
3. Data joining
4. Presentations
5. Data transformation
6. Dashboards
7. Data cleaning

5. (LO 1-6) Navigate to the Connect Additional Student Resources page. Under

Chapter 1 Data Files, download and consider the LendingClub data dictionary file

“LCDataDictionary,” specifically the LoanStats tab. This represents the data

dictionary for the loans that were funded. Choose among these attributes in the

data dictionary and indicate which are likely to be predictive that loans will go

delinquent, or that loans will ultimately be fully repaid and which are not predictive.

Predictive Attributes | Predictive? (Yes/No)
1. date (Date when the borrower accepted the offer)
2. desc (Loan description provided by borrower)
3. dti (A ratio of debt owed to income earned)
4. grade (LC assigned loan grade)
5. home_ownership (Values include Rent, Own, Mortgage, Other)
6. loanAmnt (Amount of the loan)
7. next_pymnt_d (Next scheduled payment date)
8. term (Number of payments on the loan)
9. tot_cur_bal (Total current balance of all accounts)

6. (LO 1-6) Navigate to the Connect Additional Student Resources page. Under

Chapter 1 Data Files, download and consider the rejected loans dataset of

LendingClub data titled “DAA Chapter 1-1 Data.” Choose among these attributes

in the data dictionary, and indicate which are likely to be predictive of loan rejection,

and which are not.

page 33

Predictive Attributes | Predictive? (Yes/No)

1. Amount Requested
2. Zip Code
3. Loan Title
4. Debt-To-Income Ratio
5. Application Date
6. Risk_Score
7. Employment Length
7. (LO 1-6) Navigate to the Connect Additional Student Resources page. Under

Chapter 1 Data Files, download and consider the rejected loans dataset of

LendingClub data titled “DAA Chapter 1-1 Data” from the Connect website and

perform an Excel PivotTable by state; then figure out the number of rejected

applications for the state of Arkansas. That is, count the loans by state and

compute the percentage of the total rejected loans in the United States that came

from Arkansas. How close is that to the relative proportion of the population of

Arkansas as compared to the overall U.S. population (per 2010 census)? Use your

browser to find the population of Arkansas and the United States and calculate the

relative percentage and answer the following questions.

7A. Multiple Choice: What is the percentage of total loans rejected in the United
States that came from Arkansas?

a. Less than 1%.

b. Between 1% and 2%.

c. More than 2%.

7B. Multiple Choice: Is this loan rejection percentage greater than the percentage
of the U.S. population that lives in Arkansas (per 2010 census)?

a. Loan rejection percentage is greater than the population.

b. Loan rejection percentage is less than the population.

8. (LO 1-6) Download the rejected loans dataset of LendingClub data titled “DAA

Chapter 1-1 Data” from Connect Additional Student Resources and do an Excel

PivotTable by state; then figure out the number of rejected applications for each

state.

8A. Put the following states in order of their loan rejection percentage based on the
count of rejected loans (from high [1] to low [11]) of the total rejected loans.
Does each state’s loan rejection percentage roughly correspond to its relative
proportion of the U.S. population?

State Rank 1 (High) to 11 (Low)

1. Arkansas (AR)
2. Hawaii (HI)
3. Kansas (KS)
4. New Hampshire (NH)
5. New Mexico (NM)
6. Nevada (NV)
7. Oklahoma (OK)
8. Oregon (OR)
9. Rhode Island (RI)
10. Utah (UT)
11. West Virginia (WV)

page 34

8B. What is the state with the highest percentage of rejected loans?
8C. What is the state with the lowest percentage of rejected loans?
8D. Analysis: Does each state’s loan rejection percentage roughly correspond to its
relative proportion of the U.S. population (by 2010 U.S. census at
https://en.wikipedia.org/wiki/2010_United_States_census)?

For Problems 9, 10, and 11, we will be cleaning a data file in preparation for subsequent analysis. The analysis performed on LendingClub data in the chapter (“DAA Chapter 1-1 Data”) was for the years 2007–2012. For this and subsequent problems, please download the rejected loans table for 2013 from Connect Additional Student Resources, titled “DAA Chapter 1-2 Data.” (An illustrative Python sketch of the bucketing logic used in these problems appears after Problem 11.)

9. (LO 1-6) Consider the 2013 rejected loan data from LendingClub titled “DAA

Chapter 1-2 Data” from Connect Additional Student Resources. Browse the file in

Excel to ensure there are no missing data. Because our analysis requires risk

scores, debt-to-income data, and employment length, we need to make sure each

of them has valid data. There should be 669,993 observations.

a. Assign each risk score to a risk score bucket similar to the chapter. That is,
classify the sample according to this breakdown into excellent, very good, good,
fair, poor, and very bad credit according to their credit score noted in Exhibit 1-
13. Classify those with a score greater than 850 as “Excellent.” Consider using
nested if–then statements to complete this. Or sort by risk score and manually
input into appropriate risk score buckets.

b. Run a PivotTable analysis that shows the number of loans in each risk score
bucket.

Which risk score bucket had the most rejected loans (most observations)? Which

risk score bucket had the least rejected loans (least observations)? Is it similar to

Exhibit 1-14 performed on years 2007–2012?

10. (LO 1-6) Consider the 2013 rejected loan data from LendingClub titled “DAA

Chapter 1-2 Data.” Browse the file in Excel to ensure there are no missing data.

Because our analysis requires risk scores, debt-to-income data, and employment

length, we need to make sure each of them has valid data. There should be

669,993 observations.
a. Assign each valid debt-to-income ratio into three buckets (labeled DTI bucket) by
classifying each debt-to-income ratio into high (>20.0 percent), medium (10.0–
20.0 percent), and low (<10.0 percent) buckets. Consider using nested if–then
statements to complete this. Or sort the row and manually input.

b. Run a PivotTable analysis that shows the number of loans in each DTI bucket.

Which DTI bucket had the highest and lowest grouping for this rejected Loans

dataset? Any interpretation of why these loans were rejected based on debt-to-

income ratios?

11. (LO 1-6) Consider the 2013 rejected loan data from LendingClub titled “DAA

Chapter 1-2 Data.” Browse the file in Excel to ensure there are no missing data.

Because our analysis requires risk scores, debt-to-income data, and employment

length, we need to make sure each of them has valid data. There should be

669,993 observations.

a. Assign each risk score to a risk score bucket similar to the chapter. That is,
classify the sample according to this breakdown into excellent, very good, good,
fair, poor, and very bad credit according to their credit score noted in Chapter 1.
Classify those with a score greater than 850 as “Excellent.” Consider using
nested if-then statements to complete this. Or sort by risk score and manually
input into appropriate risk score buckets (similar to Problem 9).

b. Assign each debt-to-income ratio into three buckets (labeled DTI bucket) by classifying each debt-to-income ratio into high (>20.0 percent), medium (10.0–20.0 percent), and low (<10.0 percent) buckets. Consider using nested if-then statements to complete this. Or sort the row and manually classify into the appropriate bucket.

c. Run a PivotTable analysis to show the number of excellent risk scores but high
DTI bucket loans in each employment year bucket.
Which employment length group had the most observations to go along with

excellent risk scores but high debt-to-income? Which employment year group had

the least observations to go along with excellent risk scores but high debt-to-

income? Analysis: Any interpretation of why these loans were rejected?
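As promised above, here is an illustrative Python/pandas sketch of the bucketing and PivotTable logic described in Problems 9 through 11. It is not the Excel workflow the problems ask for, and the file name, column names, and all cut points other than the “greater than 850 is Excellent” rule are assumptions; replace them with the values from Exhibit 1-13 and your own export of the 2013 data.

```python
import pandas as pd

# Hypothetical CSV export of the 2013 rejected-loans file ("DAA Chapter 1-2 Data").
loans = pd.read_csv("DAA_Chapter_1-2_Data.csv")

# Problem 9a: assign risk scores to buckets. Only ">850 = Excellent" comes from the
# problem text; the other boundaries are placeholders to be replaced from Exhibit 1-13.
score_bins = [0, 500, 600, 660, 780, 850, float("inf")]
score_labels = ["Very Bad", "Poor", "Fair", "Good", "Very Good", "Excellent"]
loans["Risk Bucket"] = pd.cut(loans["Risk_Score"], bins=score_bins, labels=score_labels)

# Problem 10a: assign debt-to-income ratios to low/medium/high buckets
# (strip a trailing % sign if the export stores the ratio as text).
dti = loans["Debt-To-Income Ratio"].astype(str).str.rstrip("%").astype(float)
loans["DTI Bucket"] = pd.cut(dti, bins=[-1, 10, 20, float("inf")],
                             labels=["Low", "Medium", "High"])

# Problems 9b and 10b: counts by bucket (the PivotTable equivalents).
print(loans["Risk Bucket"].value_counts())
print(loans["DTI Bucket"].value_counts())

# Problem 11c: excellent scores with high DTI, counted by employment length.
mask = (loans["Risk Bucket"] == "Excellent") & (loans["DTI Bucket"] == "High")
print(loans.loc[mask, "Employment Length"].value_counts())
```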

page 36

LABS

Lab 1-0 How to Complete Labs

The labs in this book will provide valuable hands-on experience in


generating and analyzing accounting problems. Each lab will provide a
company summary with relevant facts, techniques that you will use to
complete your analysis, software that you’ll need, and an overview of the
lab steps.
When you’ve completed your lab, your instructor may ask you to submit
a screenshot lab document showing screenshots of work you have
completed at various points in the lab. This lab will demonstrate how to
create a lab document for submission.
Lab 1-0 Part 1 Explore Different Tool Tracks
When completing labs throughout this textbook, you may be given the
option to complete one or more tracks. Depending on the software your
instructor chooses to emphasize, you may see instructions for one or more
of the following tracks:
Microsoft Track: Lab instructions for Microsoft tools, including Excel,
Power Query, and Power BI, will appear in a green box like this:

Microsoft | Excel

1. Open Excel and create a new blank workbook.


2. …

Tableau Track: Lab instructions for Tableau tools, including Tableau


Prep and Tableau Desktop, will appear in a blue box like this:

Tableau | Desktop

1. Open Tableau Desktop and create a new workbook.


2. …

Throughout the lab you will be asked to answer questions about the
process and the results. Add your screenshots to your screenshot lab
document. All objective and analysis questions should be answered in
Connect or included in your screenshot lab document, depending on your
instructor’s preferences.

Lab 1-0 Part 1 Objective Questions (LO 1-1, 1-5)
OQ 1. According to your instructor, which track(s) will you
be completing this semester? (Answer this in Connect
or write your response in your lab document.)
OQ 2. Where should you answer objective lab questions? (Answer this in Connect or write your response in your lab document.)

Lab 1-0 Part 1 Analysis Questions (LO 1-1, 1-5)
AQ 1. What is the purpose of taking screenshots of your progress through the labs? (Answer this in Connect or write your response in your lab document.)
Lab 1-0 Part 2 Take Screenshots of Your Tools
This part will make sure that you are able to locate and open the software
needed for future labs and take screenshots of your progress through the
labs. Before you begin the lab, you should create a new blank Word
document where you will record your screenshots and responses and save it
as 1-0 Lab Document [Your name] [Your email].docx. Note that anytime
you see the camera icon you should capture the current state of your
own work on your computer screen. If you don’t know how to capture
screenshots, see the instructions included in the boxes below. Once you
have completed the lab and collected your screenshots and answers, you
may be asked to submit your screenshot lab document to your instructor for
grading.

Microsoft | Excel + Power Query, Power BI Desktop

1. If you haven’t already, download and install the latest version of


Excel and Power BI Desktop on your Windows computer or log
on to the remote desktop.

a. To install Excel, if your university provides Microsoft Office, go


to portal.office.com and click Install Office.
b. To install Power BI Desktop, search for Power BI Desktop in the
Microsoft Store and click Install.
c. To access both Excel and Power BI Desktop on the remote
desktop, go to waltonlab.uark.edu and log in with the username
and password provided by your instructor.

2. Open Excel and create a new blank workbook.


3. From the ribbon, click Data > Get Data > Launch Power Query Editor. A blank window will appear.
4. Take a screenshot (label it 1-0MA) of the Power Query Editor
window and paste it into your lab document.

a. To take a screenshot in Windows:


1. Open the Start menu and search for “Snipping Tool” or “Snip
& Sketch”.
2. Click New (Rectangular Snip) and draw a rectangle across
your screen that includes your entire window. A preview
window with your screenshot will appear.
3. Press Ctrl + C to copy your screenshot.
4. Go to your lab document and press Ctrl + V to paste the
screenshot into your document.

b. To take a screenshot on a Mac:


1. Press Cmd + Shift + 4 and draw a rectangle across your screen that includes your entire window. Your screenshot will be saved in your Desktop folder.
2. Navigate to your Desktop folder and drag the screenshot file into your lab document.

5. Close the Power Query Editor and close your Excel workbook.
6. Open Power BI Desktop and close the welcome screen.
7. Take a screenshot (label it 1-0MB) of the Power BI Desktop
workspace and paste it into your lab document.
8. Close Power BI Desktop.
Tableau | Prep, Desktop

1. If you haven’t already, download and install the latest version of


Tableau Prep and Tableau Desktop on your computer or log on to
the remote desktop.

a. To install Tableau Prep and Tableau Desktop, go to


tableau.com/academic/students and click Get Tableau for Free.
Complete the form, then download and run the installers for both
applications. Be sure to register using your school email address
(ending in .edu)—this will help ensure that your application for a
student license will be approved.
b. To access both Tableau Prep and Tableau Desktop on a remote
desktop, go to waltonlab.uark.edu and log in with the username
and password provided by your instructor.

2. Open Tableau Prep and open a sample flow.


3. Take a screenshot (label it 1-0TA) of the blank Tableau Prep
window and paste it into your lab document.

a. To take a screenshot in Windows:


1. Open the Start menu and search for “Snipping Tool” or “Snip
& Sketch”.
2. Click New (Rectangular Snip) and draw a rectangle across
your screen that includes your entire window. A preview
window with your screenshot will appear.
3. Press Ctrl + C to copy your screenshot.
4. Go to your lab document and press Ctrl + V to paste the
screenshot into your document.
b. To take a screenshot on a Mac:
1. Press Cmd + Shift + 4 and draw a rectangle across your
screen that includes your entire window. Your screenshot will
be saved in your Desktop folder.
2. Navigate to your Desktop folder and drag the screenshot file
into your lab document.

4. Close Tableau Prep.


5. Open Tableau Desktop and create a new workbook.
6. Choose a sample workbook from the selection screen or press the
Esc key.
7. Take a screenshot (label it 1-0TB) of the blank Tableau
Desktop workbook and paste it into your lab document.
8. Close Tableau Desktop.

page 39

Lab 1-0 Part 2 Objective Questions (LO 1-1, 1-5)
OQ 1. Where did you go to complete this lab activity?
(Answer this in Connect or write your response in your
lab document.)
OQ 2. What type of computer operating system do you
normally use? (Answer this in Connect or write your
response in your lab document.)
Lab 1-0 Part 2 Analysis Questions (LO 1-1, 1-5)
AQ 1. Compare and Contrast: If you completed both tracks
in this lab, which tool are you most interested in
learning and why? (This question does not appear in
Connect. Write your response in your lab document.)

Lab 1-0 Submit Your Screenshot Lab Document
Verify that you have captured all of your required screenshots and have
answered any questions your instructor has assigned, then upload your
screenshot lab document to Connect or the location indicated by your
instructor.

Lab 1-1 Data Analytics Questions in Financial Accounting

Case Summary: Let’s see how we might perform some simple Data
Analytics. The purpose of this lab is to help you identify relevant questions
that may be answered using Data Analytics.
You were just hired as an analyst for a credit rating agency that evaluates
publicly listed companies in the United States. The agency already has
some Data Analytics tools that it uses to evaluate financial statements and
determine which companies have higher risk and which companies are
growing quickly. The agency uses these analytics to provide ratings that
will allow lenders to set interest rates and determine whether to lend money
in the first place. As a new analyst, you’re determined to make a good first
impression.
Lab 1-1 Part 1 Identify the Questions
Think about ways that you might analyze data from a financial statement.
You could use a horizontal analysis to view trends over time, a vertical
analysis to show account proportions, or ratios to analyze relationships.
Before you begin the lab, you should create a new blank Word document
where you will record your screenshot and save it as Lab 1-1 [Your name]
[Your email address].docx.

Lab 1-1 Part 1 Analysis Questions


AQ 1. Use what you know about financial statement analysis
(or search the web if you need a refresher) to generate
three different metrics for evaluating financial
performance. For example, if you wanted to evaluate a
company’s profit margin from one year to the next your
question might be, “Has Apple Inc’s gross margin
increased in the last 3 years?”
AQ 2. Next to each question generate a hypothetical answer to
the question to help you identify what your expected
output would be. You may use some insight or intuition
or search for industry averages to inform your
hypothesis. For example: “Hypothesis: Apple Inc’s
gross margin has increased slightly in the past 3 years.”
AQ 3. Evaluate each question from Part 1. There are specific data attributes that will help you find the answer you’re looking for. For example, if your question was “Has [Company X’s] gross margin increased in the last 3 years?” and the expected answer is “Apple Inc’s gross margin has increased slightly in the past 3 years,” this tells you what attributes (or fields) to look for: company name, gross margin (sales revenues – cost of goods sold), year.
Lab 1-1 Part 2 Master the Data
To answer your questions, you’ll need to evaluate specific account values or
financial statement paragraphs. As an analyst, you have access to the
Securities and Exchange Commission’s (SEC’s) EDGAR database of
XBRL financial statements as well as a list of XBRL tags from the
Financial Accounting Standards Board (FASB). XBRL stands for
eXtensible Business Reporting Language and is used to make the data in
financial statements machine-readable. Public companies have been
preparing XBRL reports since 2008. While there are some issues with
XBRL data, such data have become a useful means for comparing and
analyzing financial statements. Every value, date, and paragraph is “tagged”
with a label that identifies what each specific value represents, similar to
assigning attributes in a database. Because companies tag their financial
statements with XBRL tags, you can use those tags to identify specific data
that you need to answer your questions.
For example:
Company name = EntitySectorIndustryClassificationPrimary
Gross margin = GrossProfit
Sales revenues = SalesRevenueNet
Cost of goods sold = CostOfGoodsAndServicesSold
Year = DocumentPeriodEndDate
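As an illustrative aside (not one of the lab steps), the snippet below shows how tagged values could be used in a calculation once the tags are known. The facts dictionary holds made-up numbers standing in for values that would, in practice, be parsed from an XBRL filing retrieved from EDGAR.

```python
# Hypothetical tagged values (in thousands) for one filing period; real values
# would be parsed from an XBRL instance document retrieved from EDGAR.
facts = {
    "SalesRevenueNet": 400_000,
    "CostOfGoodsAndServicesSold": 240_000,
    "DocumentPeriodEndDate": "2024-12-31",
}

# Recompute gross profit and gross margin from the tagged amounts.
gross_profit = facts["SalesRevenueNet"] - facts["CostOfGoodsAndServicesSold"]
gross_margin = gross_profit / facts["SalesRevenueNet"]
print(f"{facts['DocumentPeriodEndDate']}: gross margin = {gross_margin:.1%}")
```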
Identify XBRL tags from the FASB’s taxonomy:

1. Open a web browser, and go to xbrlview.fasb.org.


2. Click the + next to US GAAP (2021-01-31).
3. Click the ALL (Main/Entire) option, and then click Open to load the
taxonomy.
4. Navigate through the financial statements to determine which accounts
you need to answer your questions from Part 1. The name of the XBRL
tag is found in the properties pane next to “Name.” For example, the tag
for Total Assets can be found by expanding 104000 Statement of
Financial Position, Classified > Statement of Financial Position
[Abstract] > Statement [Table] > Statement [Line Items] > Assets
[Abstract] > Assets, Total. You may also use the search function.
Note: Be careful when you use the search function. The tag you see in
the results may appear in the wrong statement. Double-click the tag to
expand the tree and show where the account appears.
5. Click the Assets, Total element to load the data in the Details tab and
scroll to the bottom to locate the tag Name in the Properties panel.
6. Take a screenshot (label it 1-1A) of the Total Assets tag information
in the XBRL taxonomy.

Lab 1-1 Part 2 Analysis Questions


AQ 1. For each of your questions, identify the account or data
attribute you need to answer your question. Then use
FASB’s XBRL taxonomy to identify the specific
XBRL tags that represent those accounts.

Lab 1-1 Submit Your Screenshot Lab Document
Verify that you have answered any questions your instructor has assigned,
then upload your screenshot lab document to Connect or to the location
indicated by your instructor.

page 41

Lab 1-2 Data Analytics Questions in Managerial Accounting
Case Summary: Each day as you work in your company’s credit
department, you must evaluate the credit worthiness of new and existing
customers. As you observe the credit application process, you wonder if
there might be an opportunity to look at data from consumer lending to see
if you can help improve your company’s process. You are asked to evaluate
LendingClub, a U.S.-based, peer-to-peer lending company, headquartered
in San Francisco, California. LendingClub facilitates both borrowing and
lending by providing a platform for unsecured personal loans between
$1,000 and $35,000. The loan period is for either 3 or 5 years. You should
begin by identifying appropriate questions and developing a hypothesis for
each question. Then, using publicly available data, you should identify data
fields and values that could help answer your questions.
Lab 1-2 Part 1 Identify the Questions
Your company currently collects information about your customers when
they apply for credit and evaluates the credit limit and utilization, or how
much credit is currently being used, to determine whether a new customer
should be extended credit. You think there might be a way to provide better
input for that credit decision.
When you evaluate the criteria LendingClub lenders use to evaluate
loan applications, you notice that the company uses credit application data
to assign a risk score to all loan applicants. This risk score is used to help
lenders determine (1) whether a loan is likely to be repaid and (2) what
interest rate approved loans will receive. The risk score is calculated using a
number of inputs, such as employment and payment history. You have been
asked to consider if there may be better inputs to evaluate this given that the
number of written-off accounts has increased in the past 2 years. Using
available data, you would like to propose a model that would help create a
risk score that you could apply to your own company’s customers.

Lab 1-2 Part 1 Analysis Questions (LO 1-3, 1-4)
AQ 1. Use what you know about loan risk (or search the web
if you need a refresher) to identify three different
questions that might influence risk. For example, if you
suspect risky customers live in a certain location, your
question might be “Where do the customers with
highest risk live?”
AQ 2. For each question you identified in AQ1, generate a
hypothetical answer to each question to help you
identify what your expected output would be. You may
use some insight or intuition or search the Internet for
ideas on how to inform your hypothesis. For example:
“Hypothesis: High-risk customers likely live in coastal
towns.”
AQ 3. Finally, identify the data that you would need to answer
each of your questions. For example, to determine
customer location, you might need the city, state, and
zip code. Additionally, if you hypothesize a specific
region, you’d need to know which cities, states, and/or
zip codes belong to that region.
Lab 1-2 Part 2 Master the Data
Now that you have an idea of what questions would help influence your risk
model and the types of data that you need to collect, it is time to evaluate
the specific data that LendingClub collects using a listing of data attributes
that it collects in Table 1-2A. Look through the list of attributes and review
the description of each, thinking about how these might influence a risk
score.

page 42

LAB TABLE 1-2A
Names and Descriptions of Selected Data Attributes Collected by LendingClub

Attribute Description
id Loan identification number
member_id Membership identification number

loan_amnt Requested loan amount


emp_length Employment length

issue_d Date of loan issue

loan_status Fully paid or charged off


pymnt_plan Payment plan: yes or no

purpose Loan purpose: e.g., wedding, medical, debt_consolidation, car


zip_code Zip code

addr_state State

dti Debt-to-income ratio


delinq_2y Late payments within the past 2 years

earliest_cr_line Oldest credit account


inq_last_6mnths Credit inquiries in the past 6 months

open_acc Number of open credit accounts


revol_bal Total balance of all credit accounts

revol_util Percentage of available credit in use


total_acc Total number of credit accounts

application_type Individual or joint application
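To make the link between these attributes and a risk score concrete, here is a toy scoring function, not LendingClub’s actual model, that combines a few of the fields above. Every weight and cutoff is invented purely for illustration.

```python
def toy_risk_score(dti, delinq_2y, revol_util, emp_length_years):
    """Return a 0-100 score where higher means lower credit risk.
    All weights and cutoffs are invented for illustration only."""
    score = 100.0
    score -= min(dti, 40) * 0.8               # heavier debt load lowers the score
    score -= delinq_2y * 10                   # recent late payments are penalized
    score -= max(revol_util - 30, 0) * 0.3    # high credit utilization lowers the score
    score += min(emp_length_years, 10) * 1.0  # longer employment modestly raises it
    return max(0.0, min(100.0, score))

# Example applicant: 22% DTI, one late payment, 55% utilization, 4 years employed.
print(toy_risk_score(dti=22, delinq_2y=1, revol_util=55, emp_length_years=4))
```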

Lab 1-2 Part 2 Analysis Questions (LO 1-3, 1-4)
AQ 1. Evaluate each of your questions from Part 1. Do the
data you identified in your questions exist in the table
provided? If so, write the applicable fields next to each
question in your document.
AQ 2. Are there data values you identified in Part 1 that don’t
exist in the table? Explain how you might collect the
missing data or where you might locate it.

Lab 1-2 Submit Your Screenshot Lab Document
No screenshots are required for this lab.

Lab 1-3 Data Analytics Questions in Auditing

Case Summary: ABC Company is a large retailer that collects its order-to-
cash data in a large ERP system that was recently updated to comply with
the AICPA’s audit data standards. ABC Company currently collects all
relevant data in the ERP system and digitizes any contracts, orders, or
receipts that are completed on paper. The credit department reviews
customers who request credit. Sales orders are approved by managers
before being sent to the warehouse for preparation and shipment. Cash
receipts are collected by a cashier and applied to a customer’s outstanding
balance by an accounts receivable clerk.
You have been assigned to the audit team that will perform the internal
controls audit of ABC Company. In this lab, you should identify appropriate
questions and develop a hypothesis for each question. Then you should
translate questions into target fields and value in a database and perform a
simple analysis.

page 43
Lab 1-3 Part 1 Identify the Questions
Your audit team has been tasked with identifying potential internal control
weaknesses within the order-to-cash process. You have been asked to
consider what the risk of internal control weakness might look like and how
the data might help identify it.
Before you begin the lab, you should create a new blank Word document
where you will record your screenshot and save it as Lab 1-3 [Your name]
[Your email address].docx.

Lab 1-3 Part 1 Analysis Questions (LO 1-3, 1-4)
AQ 1. Use what you know about internal controls over the
order-to-cash process (or search the web if you need a
refresher) to identify three different questions that
might indicate internal control weakness. For example,
if you suspect that a manager may be delaying approval
of shipments sent to customers, your question might be
“Are any shipping managers approving shipments more
than 2 days after they are received?”
AQ 2. Next to each question generate a hypothetical answer to
help you identify what your expected output would be.
You may use some insight or intuition or search the
Internet for ideas on how to inform your hypothesis.
For example: “Hypothesis: Only one or two shipping
managers are approving shipments more than 2 days
after they are received.”
AQ 3. Finally, identify the data that you would need to answer
each of your questions. For example, to determine the
timing of approval and who is involved, you might
need the approver ID, the order date, and the approval
date.
Lab 1-3 Part 2 Master the Data
To answer your questions, you’ll need to evaluate the data that are
available. As a starting point, you should look at attributes listed in the
AICPA’s audit data standards. The AICPA set these standards to map
common data elements that should be accessible in any modern enterprise
system and make it possible for auditors to create a common set of analytic
models and tools. To access the audit data standards, complete the
following steps:

1. Open your web browser and search for “Audit data standards order to
cash.” Follow the link to the “Audit Data Standards Library—AICPA,”
then look for the “Audit Data Standard—Order to Cash Subledger
Standard” PDF document.
2. Quickly scroll through the document and evaluate the tables (e.g.,
Sales_Orders_YYYYMMDD_YYYYMMDD), field names (e.g.,
Sales_Order_ID), and descriptions (e.g., “Unique identifier for each
sales order.”).
3. Take a screenshot (label it 1-3A) of the page showing 2.1
Sales_Orders_YYYYMMDD_YYYYMMDD.
4. As you skim the tables, make note of any data elements you identified in
Part 1 that don’t appear in the list of fields in the audit data standard.
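As an illustrative sketch of how the standardized fields could support the example question about late shipment approvals, the pandas snippet below flags shipments approved more than 2 days after they were entered. The CSV file name is a hypothetical export; confirm the exact table and field names against the audit data standard document you just downloaded.

```python
import pandas as pd

# Hypothetical export of the shipments table defined in the audit data standard.
shipments = pd.read_csv("Shipments_Made_20250101_20251231.csv",
                        parse_dates=["Entered_Date", "Approved_Date"])

# Flag approvals that happened more than 2 days after the shipment was entered.
shipments["Days_To_Approve"] = (shipments["Approved_Date"] - shipments["Entered_Date"]).dt.days
late = shipments[shipments["Days_To_Approve"] > 2]

# Count late approvals by approver to see whether the issue is isolated to a few managers.
print(late.groupby("Approved_By").size().sort_values(ascending=False))
```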

Lab 1-3 Part 2 Analysis Questions (LO 1-3, 1-4)
AQ 1. List some of the tables and fields from the audit data
standard that relate to each question you identified in
Part 1. For example, if you’re looking for the shipment
timing and approval data, you would need the
Shipments_Made_YYYYMMDD_YYYYMMDD
table and Approved_By, Entered_Date, and
Approved_Date fields.
AQ 2. Are there data values you identified in Part 1 that don’t
exist in the tables? Explain how you might collect the
missing data or where you might locate them.

page 44

Lab 1-3 Submit Your Screenshot Lab Document
Verify that you have captured your required screenshot and have answered
any questions your instructor has assigned, then upload your screenshot lab
document to Connect or the location indicated by your instructor.

Lab 1-4 Comprehensive Case: Questions about Dillard’s Store Data

Case Summary: Dillard’s is a department store with approximately 330


stores in 29 states in the United States. Its headquarters is located in Little
Rock, Arkansas. You can learn more about Dillard’s by looking at
finance.yahoo.com (ticker symbol = DDS) and the Wikipedia site for DDS.
You’ll quickly note that William T. Dillard II is an accounting grad of the
University of Arkansas and the Walton College of Business, which may be
why he shared transaction data with us to make available for this lab and
labs throughout this text. In this lab, you will identify appropriate questions
for a retailer. Then, translate questions into target tables, fields, and values
in the Dillard’s database.
Lab 1-4 Part 1 Identify the Questions
From the Walton College website, we note the following:

The Dillard’s Department Store Database contains retail sales


information gathered from store sales transactions. The sale process
begins when a customer brings items intended for purchase (clothing,
jewelry, home décor, etc.) to any store register. A Dillard’s sales
associate scans the individual items to be purchased with a barcode
reader. This populates the transaction table (TRANSACT), which will
later be used to generate a sales receipt listing the item, department,
and cost information (related price, sale price, etc.) for the customer.
When the customer provides payment for the items, payment details
are recorded in the transaction table, the receipt is printed, and the
transaction is complete. Other tables are used to store information
about stores, products, and departments.
Source: http://walton.uark.edu/enterprise/dillardshome.php (accessed July 15, 2021).

This is a gifted dataset that is based on real operational data. Like any
real database, integrity problems may be noted. This can provide a unique
opportunity not only to be exposed to real data, but also to illustrate the
effects of data integrity problems.
For this lab, you should rely on your creativity and prior business
knowledge to answer the following analysis questions. Answer these
questions in your lab doc or in Connect and then continue to the next part of
this lab.

Lab 1-4 Part 1 Analysis Questions (LO 1-1, 1-3, 1-4)
AQ 1. Assume that Dillard’s management is interested in
improving profitability. Write three questions that
could be asked to assess current profitability levels for
each product and how profitability could be improved
in the near future.
AQ 2. Assume that Dillard’s management wishes to improve
its online sales and profitability on those sales. What
three questions could be asked to see where Dillard’s
stands on its online sales?
Lab 1-4 Part 2 Master the Data
An analysis of data related to Dillard’s retail sales can provide some of the
answers to your questions from Part 1. Consider the attributes that are given
in Lab Exhibit 1-4.

page 45

LAB EXHIBIT 1-4


Dillard’s Sales Transaction Tables and Attributes
CUSTOMER table:

Attribute | Description | Sample values
CUST_ID | Unique identifier representing a customer instance | 219948527, 219930818
CITY | City where the customer lives. | HOUSTON, COOS BAY
STATE | State where the customer lives. | FL, TX
ZIP_CODE | Customer’s 5-digit zip code. | 72701, 84770
ZIP_SECSEG | Customer’s geographic segment code | 5052, 6474
DISTANCE_TO_NEAREST_STORE | Miles from the customer’s house to the closest Dillard’s store. | 0.687, 6.149
PREFERRED_STORE | Dillard’s store number the customer prefers to shop at regardless of distance to the customer’s home address. | 910, 774

DEPARTMENT table:
Attribute | Description | Sample values
DEPT | The Dillard’s unique identifier for a collection of merchandise within a store format | 0471, 0029
DEPT_DESC | The name for a department collection (lowest level of the category hierarchy) of merchandise within a store format. | “Christian Dior”, “REBA”
DEPTDEC | The first three digits of a department code, a way to classify departments at a higher level. | 047X, 002X
DEPTDEC_DESC | Descriptive name representing the decade (middle level of the category hierarchy) to which a department belongs. | ‘BASICS’, ‘TREATMENT’
DEPTCENT | The first two digits of a department code, a way to classify departments at a higher level. | 04XX, 00XX
DEPTCENT_DESC | The descriptive name of the century (top level of the category hierarchy). | CHILDRENS, COSMETICS

SKU table:

Attribute | Description | Sample values
SKU | Unique identifier for an item, identifies the item by size within a color and style for a particular vendor. | 0557578, 6383039
DEPT | The Dillard’s unique identifier for a collection of merchandise within a store format. | 0134, 0343
SKU_CLASS | Three-character alpha/numeric classification code used to define the merchandise. Class requirements vary by department. | K51, 220
SKU_STYLE | The Dillard’s numeric identifier for a style of merchandise. | 091923690, LBF41728
UPC | A number provided by vendors to identify their product to the size level. | 889448437421, 44212146767
COLOR | Color of an item. | BLACK, PINEBARK
SKU_SIZE | Size of an item. Product sizes are not standardized and issued by vendor | 6, 085M
BRAND_NAME | The item’s brand. | Stride Rite, UNKNOWN
CLASSIFICATION | Category used to sort products into logical groups. | Dress Shoe
PACKSIZE | Number that describes how many of the product come in a package | 001, 002

page 46
SKU_STORE table:

Attribute | Description | Sample values
STORE | The numerical identifier for a Dillard’s store. | 915, 701
SKU | Unique identifier for an item, identifies the item by size within a color and style for a particular vendor. | 4305296, 6137609
RETAIL | The price of an item. | 11.90, 45.15
COST | The price charged by a vendor for an item. | 8.51, 44.84

STORE table:

Attribute | Description | Sample values
STORE | The numerical identifier for any type of Dillard’s location. | 767, 460
DIVISION | The division to which a location is assigned for operational purposes. | 07, 04
CITY | The city where the store is located. | IRVING, MOBILE
STATE | The state abbreviation where the store is located. | MO, AL
ZIP_CODE | The 5-digit zip code of a store’s address. | 70601, 35801
ZIP_SECSEG | The 4-digit code of a neighborhood within a specific zip code. | 5052, 6474

TRANSACT table:

Attribute | Description | Sample values
TRANSACTION_ID | Unique numerical identifier for each scan of an item at a register. | 40333797, 15129264
TRAN_DATE | Calendar date the transaction occurred in a store. | 1/1/2015, 5/19/2014
STORE | The numerical identifier for any type of Dillard’s location. | 716, 205
REGISTER | The numerical identifier for the register where the item was scanned. | 91, 55, 12
TRAN_NUM | Sequential number of transactions scanned on a register. | 184, 14
TRAN_TIME | Time of day the transaction occurred. | 1839, 1536
CUST_ID | Unique identifier representing the instance of a customer. | 118458688, 115935775
TRAN_LINE_NUM | Sequential number of each scan or element in a transaction. | 3, 2
MIC | Manufacturer Identification Code used to uniquely identify a vendor or brand within a department. | 154, 128, 217
TRAN_TYPE | An identifier for a purchase or return type of transaction or line item | P, R
ORIG_PRICE | The original unit price of an item before discounts. | 20.00, 6.00
SALE_PRICE | The discounted unit price of an item. | 15.00, 2.64, 6.00
TRAN_AMT | The total pre-tax dollar amount the customer paid in a transaction. | 15.00, 2.64
TENDER_TYPE | The type of payment a customer used to complete the transaction. | BANK, DLRD, DAMX
SKU | Unique identifier for an item, identifies the item by size within a color and style for a particular vendor. | 6107653, 9999999950

Source: http://walton.uark.edu/enterprise/dillardshome.php (accessed January 15, 2021).
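As an illustrative sketch of how these tables can be combined (and relevant to the objective questions that follow), the snippet below joins a TRANSACT extract to SKU_STORE to compute a per-SKU margin (sale price less vendor cost). The CSV file names are hypothetical exports; in the labs the same join is performed in Power Query or Tableau Prep against the WCOB_DILLARDS database.

```python
import pandas as pd

# Hypothetical CSV extracts of the Dillard's tables; in the labs these come
# from the WCOB_DILLARDS SQL Server database instead.
transact = pd.read_csv("TRANSACT.csv")     # includes STORE, SKU, SALE_PRICE
sku_store = pd.read_csv("SKU_STORE.csv")   # includes STORE, SKU, RETAIL, COST

# Join each transaction line to the item's cost at that store.
lines = transact.merge(sku_store, on=["STORE", "SKU"], how="left")

# Margin per transaction line, then averaged by SKU.
lines["MARGIN"] = lines["SALE_PRICE"] - lines["COST"]
print(lines.groupby("SKU")["MARGIN"].mean().sort_values(ascending=False).head(10))
```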

Lab 1-4 Part 2 Objective Questions (LO 1-1, 1-3, 1-4)
OQ 1. What tables and fields could address the question of the
profit margin (sales price less cost) on each product
(SKU) available for sale?
OQ 2. If you’re interested in learning which product is sold
most often at each store, which tables and fields would
you consider?
page 47

Lab 1-4 Part 2 Analysis Questions (LO 1-1, 1-3, 1-4)
AQ 1. You’re trying to learn about where Dillard’s stores are
located to identify locations for the next additional
store. Consider the STORE table. What questions could
be asked about store location given data availability?
AQ 2. What questions would you have regarding data fields in
the SKU table that could be used to help address the
cost of shipping? What additional information would
be helpful to address this question?

Lab 1-4 Submit Your Screenshot Lab Document
No screenshots are required for this lab.

Lab 1-5 Comprehensive Case: Connect to Dillard’s Store Data

Lab Note: The tools presented in this lab periodically change. Updated
instructions, if applicable, can be found in the eBook and lab walkthrough
videos in Connect.
Case Summary: Dillard’s is a department store with approximately 330
stores in 29 states in the United States. Its headquarters is located in Little
Rock, Arkansas. You can learn more about Dillard’s by looking at
finance.yahoo.com (ticker symbol = DDS) and the Wikipedia site for DDS.
You’ll quickly note that William T. Dillard II is an accounting grad of the
University of Arkansas and the Walton College of Business, which may be
why he shared transaction data with us to make available for this lab and
labs throughout this text. In this lab, you will learn how to load Dillard’s
data into the tools used for data analysis.
Data: Dillard’s sales data are available only on the University of
Arkansas Remote Desktop (waltonlab.uark.edu). See your instructor for
login credentials.
Lab 1-5 Part 1 Load the Dillard’s Data in Excel + Power Query and Tableau Prep
Before you begin the lab, you should create a new blank Word document
where you will record your screenshots and save it as Lab 1-5 [Your name]
[Your email address].docx.
From the Walton College website, we note the following:

The Dillard’s Department Store Database contains retail sales


information gathered from store sales transactions. The sale process
begins when a customer brings items intended for purchase (clothing,
jewelry, home décor, etc.) to any store register. A Dillard’s sales
associate scans the individual items to be purchased with a barcode
reader. This populates the transaction table (TRANSACT), which will
later be used to generate a sales receipt listing the item, department,
and cost information (related price, sale price, etc.) for the customer.
When the customer provides payment for the items, payment details
are recorded in the transaction table, the receipt is printed, and the
transaction is complete. Other tables are used to store information
about stores, products, and departments.
Source: http://walton.uark.edu/enterprise/dillardshome.php (accessed July 15, 2021).

This is a gifted dataset that is based on real operational data. Like any
real database, integrity problems may be noted. This can provide a unique
opportunity not only to be exposed to real data, but also to illustrate the
effects of data integrity problems. The TRANSACT table itself contains
107,572,906 records. Analyzing the entire population would take a
significant amount of computational time, especially if multiple users are
querying it at the same time.

page 48
In Part 1 of this lab, you will learn how to load the Dillard’s data into
either Excel + Power Query or Tableau Prep so that you can extract,
transform, and load the data for later assignments. You will also filter the
data to a more manageable size. In Part 2, you will learn how to load the
Dillard’s data into either Power BI Desktop or Tableau Desktop to prepare
your data for visualization and Data Analytics models.

Microsoft | Excel + Power Query Editor

1. Open Excel and create a new blank workbook.
2. In the Data ribbon, click Get Data > From Database > From SQL Server Database and click Connect.
3. Enter the following and click OK:
a. Server: essql1.walton.uark.edu
b. Database: WCOB_DILLARDS
c. Data connectivity mode: Direct Query

4. If prompted for credentials, click Use my current credentials and


click Connect.
5. If prompted with a warning about an insecure connection, click
OK.
6. Check the box Select multiple items.
7. Check the following tables and click Transform Data or Edit:
a. TRANSACT, STORE Note: Future labs may ask you to load
different tables.

8. Take a screenshot (label it 1-5MA).


9. In Power Query Editor:
a. Click the TRANSACT query from the list on the left side of the
screen.
b. Click the drop-down menu to the right of the TRAN_DATE
attribute to show filter options.
c. Choose Date Filters > Between….
d. Enter the date range is after or equal to 1/1/2014 and before or
equal to 1/7/2014 and click OK. Note: Future labs may ask you
to load different date ranges.
e. Click Close & Load and wait for a moment while the data load
into Excel.
f. If you see a warning that not all data can be displayed, click OK.

10. Take a screenshot (label it 1-5MB).


11. When you are finished answering the lab questions you may close
Excel. Save your file as Lab 1-5 Dillard’s Filter.xlsx.

Tableau | Prep

1. Open Tableau Prep Builder.


2. Click Connect to Data, and choose Microsoft SQL Server from the list.
3. Enter the following and click Sign In:
a. Server: essql1.walton.uark.edu
b. Database: WCOB_DILLARDS
c. Authentication: Windows Authentication
4. Double-click the TRANSACT and STORE tables. Note: Future
labs may ask you to load different tables.
5. In the flow sheet, drag STORE onto your new TRANSACT and
choose JOIN.
6. Click the + next to Join 1 and choose + Clean Step.
7. Take a screenshot (label it 1-5TA).
8. Locate TRAN_DATE in the bottom preview pane and click … > Filter > Range of Dates.
9. Enter 1/1/2014 to 1/7/2014 or drag the sliders to limit the dates to
this range and click Done. Note: Because Tableau Prep samples
data in this design view, you may not see any results on this step.
Don’t panic; you will see them in the next step. Future labs may
ask you to load different date ranges.
10. Right-click Clean 1 from the flow and rename the step “Date
Filter”.
11. Click + next to your new Date Filter task and choose Output.
12. Click the Browse button in the Output pane, choose a folder to
save the file, name your file Lab 1-5 Dillard’s Filter.hyper, and
click Accept.
13. Click Run Flow.
14. Take a screenshot (label it 1-5TB).
15. When you are finished answering the lab questions you may close
Tableau Prep. Save your file as Lab 1-5 Dillard’s Filter.tfl.

Lab 1-5 Part 1 Analysis Questions (LO 1-3, 1-4)
AQ 1. Why would you want to filter the date field before
loading data into your model for analysis?
AQ 2. What are some limitations introduced into your
analysis by filtering on such a small date range?
Lab 1-5 Part 2 Tableau Desktop and Power BI
Desktop
Now that you have had some experience preparing data in Excel or Tableau
Prep, it is time to learn how to load the Dillard’s data into either Power BI
Desktop or Tableau Desktop so that you can extract, transform, and load the
data for data models and visualizations. You will also use filters to extract a
more manageable dataset.

Microsoft | Power BI Desktop

1. Create a new project in Power BI Desktop.


2. In the Home ribbon, click Get Data > SQL Server database and click Connect.
3. Enter the following and click OK:
a. Server: essql1.walton.uark.edu
b. Database: WCOB_DILLARDS
c. Data Connectivity mode: DirectQuery

4. If prompted for credentials, click Use my current credentials and


click Connect.
5. If prompted with a warning about an insecure connection, click
OK.
6. Check the following tables and click Transform Data:
a. TRANSACT, STORE Note: Future labs may ask you to load
different tables.

7. In the Power Query Editor window that appears, complete the


following:
a. Click the TRANSACT query.
b. Click the drop-down menu to the right of the TRAN_DATE
attribute to show filter options.
c. Choose Date Filters > Between….
d. Enter a date range that is after or equal to 1/1/2014 and is before or equal to 1/7/2014, then click OK. Note: Future labs may ask you to load different date ranges.
e. Take a screenshot (label it 1-5MC).
f. Click Close & Apply and wait for a moment while the data load
into Power BI Desktop.

8. Now that you have loaded your data into Power BI Desktop,
continue to explore the data:

a. Click the Model pane on the left side of the screen.


b. In the TRANSACT table, click the STORE attribute.
c. In the Properties pane on the right, change the data type to Text.
d. In the STORE table, click the STORE attribute.
e. In the Properties pane on the right, change the data type to Text.

9. Take a screenshot (label it 1-5MD).


10. When you are finished answering the lab questions you may close
Power BI Desktop. Save your file as Lab 1-5 Dillard’s
Filter.pbix.

Tableau | Desktop
1. Create a new workbook in Tableau.
2. Go to Connect > To a Server > Microsoft SQL Server.
3. Enter the following and click Sign In:
a. Server: essql1.walton.uark.edu
b. Database: WCOB_DILLARDS

4. Double-click the tables you need.


a. TRANSACT and STORE Note: Future labs may ask you to


load different tables.
b. Verify the relationship includes the Store attribute from both
tables and close the Relationships window.

5. Click the TRANSACT table and click Update Now to preview


the data.
6. Take a screenshot (label it 1-5TC).
7. In the top-right corner of the Data Source screen, click Add below
Filters.

a. Click Add….
b. Choose Tran Date and click OK.
c. Choose Range of Dates and click Next.
d. Drag the sliders to limit the data from 1/1/2014 to 1/7/2014 and
click OK. Note: Future labs may ask you to load different date
ranges.
e. Take a screenshot (label it 1-5TD).
f. Click OK to return to the Data Source screen.
8. Click the TRANSACT table and then click Update Now to
preview the data.
9. When you are finished answering the lab questions you may close
Tableau. Save your file as Lab 1-5 Dillard’s Filter.twb.

Note: Tableau will try to query the server after each change you make and will take up to a minute. After each change, click Cancel to stop the query until you're ready to prepare the final report.

Lab 1-5 Part 2 Analysis Questions (LO 1-3, 1-4)
AQ 1. Compare the tools you used in Part 2 with the tools you
used in Part 1. What are some of the differences
between these visualization tools (Power BI Desktop or
Tableau Desktop) and those data prep tools (Power
Query or Tableau Prep)?

Lab 1-5 Submit Your Screenshot Lab Document
Verify that you have answered any questions your instructor has assigned,
then upload your screenshot lab document to Connect or to the location
indicated by your instructor.

1Statista, https://www.statista.com/statistics/871513/worldwide-data-created/ (accessed December 2020).
2Bernard Marr, “Big Data: 20 Mind-Boggling Facts Everyone Must Read,” Forbes,
September 30, 2015, at http://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-
mind-boggling-facts-everyone-must-read/#2a3289006c1d (accessed March 2019).
3Roger S. Debreceny and Glen L. Gray, “IT Governance and Process Maturity: A
Multinational Field Study,” Journal of Information Systems 27, no. 1 (Spring 2013), pp.
157–88.
4H. Chen, R. H. L. Chiang, and V. C. Storey, “Business Intelligence Research,” MIS
Quarterly 34, no. 1 (2010), pp. 201–3.
5“Data Driven: What Students Need to Succeed in a Rapidly Changing Business World,”
PwC, https://www.pwc.com/us/en/faculty-resource/assets/pwc-data-driven-paper-
feb2015.pdf, February 2015 (accessed March 20, 2019).
6“The Trillion-Dollar Opportunity for the Industrial Sector: How to Extract Full Value from
Technology,” McKinsey Global Institute, https://www.mckinsey.com/business-
functions/mckinsey-digital/our-insights/the-trillion-dollar-opportunity-for-the-industrial-
sector#, November 2018 (accessed December 2018).
7Joseph Kennedy, “Big Data’s Economic Impact,” https://www.ced.org/blog/entry/big-datas-
economic-impact, December 3, 2014 (accessed January 9, 2016).
8“What’s Next for Tech for Finance? Data-Driven Decision Making,” PwC,
https://www.pwc.com/us/en/cfodirect/accounting-podcast/data-driven-decision-
making.html, October 2020 (accessed December 2020).
9Deloitte, “Adding Insight to Audit: Transforming Internal Audit through Data Analytics,”
http://www2.deloitte.com/content/dam/Deloitte/ca/Documents/audit/ca-en-audit-adding-
insight-to-audit.pdf (accessed January 10, 2016).
10PwC, “Data Driven: What Students Need to Succeed in a Rapidly Changing Business
World,” http://www.pwc.com/us/en/faculty-resource/assets/PwC-Data-driven-paper-
Feb2015.pdf, February 2015 (accessed January 9, 2016).
11EY, “How Big Data and Analytics Are Transforming the Audit,” https://eyo-iis-
pd.ey.com/ARC/documents/EY-reporting-ssue-9.pdf, posted April 2015. (accessed January
27, 2016).
12Deloitte, “The Power of Tax Data Analytics,”
http://www2.deloitte.com/us/en/pages/tax/articles/top-ten-things-about-tax-data-
analytics.html (accessed October 12, 2016).
13We also note our use of the terms IMPACT cycle and IMPACT model interchangeably
throughout the book.
14M. Lebied, “Your Data Won’t Speak Unless You Ask It the Right Data Analysis
Questions,” Datapine, June 21, 2017, https://www.datapine.com/blog/data-analysis-
questions/ (accessed December 2020).
15“One-Third of BI Pros Spend up to 90% of Time Cleaning Data,”
http://www.eweek.com/database/one-third-of-bi-pros-spend-up-to-90-of-time-cleaning-
data.html, posted June 2015 (accessed March 15, 2016).
16Foster Provost and Tom Fawcett, Data Science for Business: What You Need to Know
about Data Mining and Data-Analytic Thinking (Sebastopol, CA: O’Reilly Media, 2013).
17https://www.lendingclub.com/ (accessed September 29, 2016).
page 52

Chapter 2
Mastering the Data

A Look at This Chapter


This chapter provides an overview of the types of data that are used in the accounting
cycle and common data that are stored in a relational database. The second step of the
IMPACT cycle is “mastering the data,” which is sometimes called ETL for extracting,
transforming, and loading the data. We will describe how data are requested and extracted
to answer business questions and how to transform data for use via data preparation,
validation, and cleaning. We conclude with an explanation of how to load data into the
appropriate tool in preparation for analyzing data to make decisions.

A Look Back
Chapter 1 defined Data Analytics and explained that the value of Data Analytics is in the
insights it provides. We described the Data Analytics Process using the IMPACT cycle
model and explained how this process is used to address both business and accounting
questions. We specifically emphasized the importance of identifying appropriate questions
that Data Analytics might be able to address.

A Look Ahead
Chapter 3 describes how to go from defining business problems to analyzing data,
answering questions, and addressing business problems. We identify four types of Data
Analytics (descriptive, diagnostic, predictive, and prescriptive analytics) and describe
various approaches and techniques that are most relevant to analyzing accounting data.
page 53
We are lucky to live in a world in which data are abundant. However, even with rich sources
of data, when it comes to being able to analyze data and turn them into useful information
and insights, very rarely can an analyst hop right into a dataset and begin analyzing.
Datasets almost always need to be cleaned and validated before they can be used. Not
knowing how to clean and validate data can, at best, lead to frustration and poor insights
and, at worst, lead to horrible security violations. While this text takes advantage of open
source datasets, these datasets have all been scrubbed not only for accuracy, but also to
protect the security and privacy of any individual or company whose details were in the
original dataset.

Wichy/Shutterstock

In 2015, a pair of researchers named Emil Kirkegaard and Julius Daugbejerg Bjerrekaer scraped data from OkCupid, a free dating website, and posted the data to the “Open Science Framework,” a platform researchers use to obtain and share raw data. While the
aim of the Open Science Framework is to increase transparency, the researchers in this
instance took that a step too far—and a step into illegal territory. Kirkegaard and Bjerrekaer
did not obtain permission from OkCupid or from the 70,000 OkCupid users whose
identities, ages, genders, religions, personality traits, and other personal details maintained
by the dating site were provided to the public without any work being done to anonymize or
sanitize the data. If the researchers had taken the time to not just validate that the data were
complete, but also to sanitize them to protect the individuals’ identities, this would not have
been a threat or a news story. On May 13, 2015, the Open Science Framework removed the
OkCupid data from the platform, but the damage of the privacy breach had already been
done.1
A 2020 report suggested that “Any consumer with an average number of apps on their
phone—anywhere between 40 and 80 apps—will have their data shared with hundreds or
perhaps thousands of actors online,” said Finn Myrstad, the digital policy director for the
Norwegian Consumer Council, commenting specifically about dating apps.2
All told, data privacy and ethics will continue to be an issue for data providers and data
users. In this chapter, we look at the ethical considerations of data collection and data use
as part of mastering the data.
OBJECTIVES
After reading this chapter, you should be able to:

LO 2-1 Understand available internal and external data sources and how data
are organized in an accounting information system.

LO 2-2 Understand how data are stored in a relational database.

LO 2-3 Explain and apply extraction, transformation, and loading (ETL) techniques to prepare the data for analysis.

LO 2-4 Describe the ethical considerations of data collection and data use.

page 54
As you learned in Chapter 1, Data Analytics is a process, and we follow an established
Data Analytics model called the IMPACT cycle.3 The IMPACT cycle begins with
identifying business questions and problems that can be, at least partially, addressed with
data (the “I” in the IMPACT model). Once the opportunity or problem has been identified,
the next step is mastering the data (the “M” in the IMPACT model), which requires you
to identify and obtain the data needed for solving the problem. Mastering the data requires
a firm understanding of what data are available to you and where they are stored, as well
as being skilled in the process of extracting, transforming, and loading (ETL) the data in
preparation for data analysis. While the extraction piece of the ETL process may often be
completed by the information systems team or a database administrator, it is also possible
that you will have access to raw data that you will need to extract out of the source
database. Both methods of requesting data for extraction and of extracting data yourself
are covered in this chapter. The mastering the data step can be described via the ETL
process. The ETL process is made up of the following five steps:
Step 1 Determine the purpose and scope of the data request (extract).
Step 2 Obtain the data (extract).
Step 3 Validate the data for completeness and integrity (transform).
Step 4 Clean the data (transform).
Step 5 Load the data in preparation for data analysis (load).
This chapter will provide details for each of these five steps.

HOW DATA ARE USED AND STORED IN THE ACCOUNTING CYCLE
LO 2-1
Understand available internal and external data sources and how data are organized in an accounting
information system.

Before you can identify and obtain the data, you must have a comfortable grasp on what
data are available to you and where such data are stored.

Internal and External Data Sources


Data may come from a number of different sources, either internal or external to the
organization. Internal data sources include an accounting information system, supply chain
management system, customer relationship management system, and human resource
management system. Enterprise Resource Planning (ERP) (also known as Enterprise
Systems) is a category of business management software that integrates applications from
throughout the business (such as manufacturing, accounting, finance, human resources,
etc.) into one system.
An accounting information system is a system that records, processes, reports, and
communicates the results of business transactions to provide financial and nonfinancial
information for decision-making purposes. A supply chain management (SCM) system
includes information on active vendors (their contact info, where payment should be
made, how much should be paid), the orders made to date (how much, when the orders are
made), or demand schedules for what component of the final product is needed when. The
customer relationship management (CRM) system is an information system for
overseeing all interactions with current and potential customers with the goal of improving
relationships. CRM systems contain every detail about the customer. Companies also have
a set of data about what is arguably their most valuable asset: their employees. A human
resource management (HRM) system is an information system for managing all
interactions with current and potential employees.

page 55
Exhibit 2-1 provides an example of different categories of external data sources
including economic, financial, governmental, and other sources. Each of these may be
useful in addressing accounting and business questions.

EXHIBIT 2-1
Potential External Data Sources Available to Address Business and Accounting
Questions
Category | Dataset Description | Website
Economics | BRICS World Bank Indicators (Brazil, Russia, India, China and South Africa) | https://www.kaggle.com/docstein/brics-world-bank-indicators
Economics | Bureau of Economic Analysis data | https://www.bls.gov/data/
Financial | Financial statement data | https://www.calcbench.com/
Financial | Financial statement data, EDGAR, Securities and Exchange Commission | https://www.sec.gov/edgar.shtml
Financial | Analyst forecasts | Yahoo! Finance (finance.yahoo.com), Analysis Tab
Financial | Stock market dataset | https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs
Financial | Credit card fraud detection | https://www.kaggle.com/mlg-ulb/creditcardfraud
Financial | Daily News/Stock Market Prediction | https://www.kaggle.com/aaron7sun/stocknews
Financial | Retail Data Analytics | https://www.kaggle.com/manjeetsingh/retaildataset
Financial | Peer-to-peer lending data of approved and rejected loans | lendingclub.com (requires login)
Financial | Daily stock prices (and weekly and monthly) | Yahoo! Finance (finance.yahoo.com), Historical Data Tab
Financial | Financial and economic summaries by industry | https://pages.stern.nyu.edu/~adamodar/New_Home_Page/datacurrent.html
General | data.world | https://data.world/
General | kaggle.com | https://www.kaggle.com/datasets
Government | State of Ohio financial data (Data Ohio) | https://data.ohio.gov/wps/portal/gov/data/
Government | City of Chicago financial data | https://data.cityofchicago.org
Government | City of New York financial data | https://www.checkbooknyc.com/spending_landing/yeartype/B/year/119
Marketing | Amazon product reviews | https://data.world/datafiniti/consumer-reviews-of-amazon-products
Other | Restaurant safety | https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j
Other | Citywide payroll data | https://data.cityofnewyork.us/City-Government/Citywide-Payroll-Data-Fiscal-Year-/k397-673e
Other | Property valuation/assessment | https://data.cityofnewyork.us/City-Government/Property-Valuation-and-Assessment-Data/yjxr-fw8i
Other | USA facts—our country in numbers | https://www.irs.gov/uac/tax-stats
Other | Interesting fun datasets—14 data science projects with data | https://towardsdatascience.com/14-data-science-projects-to-do-during-your-14-day-quarantine-8bd60d1e55e1
Other | Links to Big Data Sets—Amazon Web Services | https://aws.amazon.com/public-datasets/
Real Estate | New York Airbnb data explanation | https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
Real Estate | U.S. Airbnb data | https://www.kaggle.com/kritikseth/us-airbnb-open-data/tasks?taskId=2542
Real Estate | TripAdvisor hotel reviews | https://www.kaggle.com/andrewmvd/trip-advisor-hotel-reviews
Retail | Retail sales forecasting | https://www.kaggle.com/tevecsystems/retail-sales-forecasting

page 56

Accounting Data and Accounting Information Systems
A basic understanding of accounting processes and their associated data, how those data
are organized, and why the data were captured, can help you request the right data and
facilitate that request so that you know exactly where each piece of data is held.
Even with the focus on raw data and where they are stored, there is variety in how data
can be stored. Most commonly, data are stored in either flat files or a database. For many
of our examples and hands-on activities in this text, we will transform our data that are
stored in a database into a flat file. The most common example of a flat file that you are
likely used to is a range of data in an Excel spreadsheet. Put simply, a flat file is a means
of maintaining all of the data you need in one place. We can do a lot of incredible data
analysis and number crunching in flat files in Excel, but as far as storing our data, it is
generally inefficient to store all of the data that you need for a given business process in
one place. Instead, a relational database is frequently used for data storage because it is
more capable of ensuring data integrity and maintaining “one version of the truth” across
multiple processes. There are a variety of applications that support relational databases
(these are referred to as relational database management systems or RDBMS). In this
textbook we interact with data stored on a Microsoft SQL Server.
Microsoft SQL Server can support enterprise-level data in ways that smaller RDBMS
programs, such as Access and SQLite, cannot. While both Microsoft Access and SQLite
can be (and are) used in professional settings, the usage of SQL Server throughout the
textbook is meant to provide an experience that replicates working with much larger and
more complex datasets that you will likely find in the professional world.
There are many other examples of relational database management systems, including
Teradata, MySQL, Oracle RDBMS, IBM DB2, Amazon RDS, and PostgreSQL. Regardless
of the DBMS, relational databases have principles that guide how they are modeled.
Exhibit 2-2, a simplified version of a Unified Modeling Language (UML) class
diagram, is an illustration or a drawing of the tables and their relationships to each other
(i.e., a database schema). Relational databases are discussed in greater depth in Learning
Objective 2-2.

EXHIBIT 2-2
Procure-to-Pay Database Schema (Simplified)

DATA AND RELATIONSHIPS IN A RELATIONAL DATABASE
LO 2-2
Understand how data are stored in a relational database.

In this text, we will work with data in a variety of forms, but regardless of the tool we use
to analyze data, structured data should be stored in a normalized relational database.
There are occasions for working with data directly in the relational database, but many
times when we work with data analysis, we’ll prefer to export the data from the relational
database and view it in a more user-friendly form. The benefit of storing data in a
normalized database outweighs the downside of having to export, validate, and sanitize the data every time you need to analyze the information.
Storing data in a normalized, relational database instead of a flat file ensures that data
are complete, not redundant, and that business rules and internal controls are enforced; it
also aids communication and integration across business processes. Each one of these
benefits is detailed here:

Completeness. Ensures that all data required for a business process are included in the
dataset.
No redundancy. Storing redundant data is to be avoided for several reasons: It takes up
unnecessary space (which is expensive), it takes up unnecessary processing to run
reports to ensure that there aren’t multiple versions of the truth, and it increases the
risk of data-entry errors. Storing data in flat files yields a great deal of redundancy, but
normalized relational databases require there to be one version of the truth and for each
element of data to be stored in only one place.
Business rules enforcement. As will become increasingly evident as we progress
through the material in this text, relational databases can be designed to aid in the
placement and enforcement of internal controls and business rules in ways that flat
files cannot.
Communication and integration of business processes. Relational databases should be
designed to support business processes across the organization, which results in
improved communication across functional areas and more integrated business
processes.4

It is valuable to spend some time basking in the benefits of storing data in a relational
database because it is not necessarily easier to do so when it comes to building the data
model or understanding the structure. It is arguably more complex to normalize your data
than it is to throw redundant data without business rules or internal controls into a
spreadsheet.
Columns in a Table: Primary Keys, Foreign Keys, and
Descriptive Attributes
When requesting data, it is critical to understand how the tables in a relational database are
related. This is a brief overview of the different types of attributes in a table and how these
attributes support the relationships between tables. It is certainly not a comprehensive take
on relational data modeling, but it should be adequate in preparing you for creating data
requests.
Every column in a table must be both unique and relevant to the purpose of the table.
There are three types of columns: primary keys, foreign keys, and descriptive attributes.
Each table must have a primary key. The primary key is typically made up of one
column. The purpose of the primary key is to ensure that each row in the table is unique,
so it is often referred to as a “unique identifier.” It is rarely truly descriptive; instead, a
collection of letters or simply sequential numbers are often used. As a student, you are
probably already very familiar with your unique identifier—your student ID number at the
university is the way you as a student are stored as a unique record in the university’s data
model! Other examples of unique identifiers that you are familiar with would be Amazon
order numbers, invoice numbers, account numbers, Social Security numbers, and driver’s
license numbers.
One of the biggest differences between a flat file and a relational database is simply
how many tables there are—when you request your data into a flat file, you’ll receive one
big table with a lot of redundancy. While this is often ideal for analyzing data,
when the data are stored in the database, each group of information is stored in
a separate table. Then, the tables that are related to one another are identified (e.g.,
Supplier and Purchase Order are related; it’s important to know which Supplier the
Purchase Order is from). The relationship is created by placing a foreign key in one of the
two tables that are related. The foreign key is another type of attribute, and its function is
to create the relationship between two tables. Whenever two tables are related, one of
those tables must contain a foreign key to create the relationship.
The other columns in a table are descriptive attributes. For example, Supplier Name
is a critical piece of data when it comes to understanding the business process, but it is not
necessary to build the data model. Primary and foreign keys facilitate the structure of a
relational database, and the descriptive attributes provide actual business information.
Refer to Exhibit 2-2, the database schema for a typical procure-to-pay process. Each
table has an attribute with the letters “PK” next to them—these are the primary keys for
each table. The primary key for the Materials Table is “Item_Number,” the primary key
for the Purchase Order Table is “PO_Number,” and so on. Several of the tables also have
attributes with the letters “FK” next to them—these are the foreign keys that create the
relationship between pairs of tables. For example, look at the relationship between the
Supplier Table and the Purchase Order Table. The primary key in the Supplier Table is
“Supplier ID.” The line between the two tables links the primary key to a foreign key in
the Purchase Order Table, also named “Supplier ID.”
The Line Items Table in Exhibit 2-3 has so much detail in it that it requires two
attributes to combine as a primary key. This is a special case of a primary key often
referred to as a composite primary key, in which the two foreign keys from the tables
that it is linking combine to make up a unique identifier. The theory and details that
support the necessity of this linking table are beyond the scope of this text—if you can
identify the primary and foreign keys, you’ll be able to identify the data that you need to
request. Exhibit 2-4 shows a subset of the data that are represented by the Purchase Order
table. You can see that each of the attributes listed in the class diagram appears as a
column, and the data for each purchase order are accounted for in the rows.

EXHIBIT 2-3
Line Items Table: Purchase Order Detail Table

Purchase Order Detail

PO_Number | Item_Number | Quantity Purchased
1787 | 10 | 50
1787 | 25 | 50
1789 | 5 | 30
1790 | 5 | 100
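
The way primary keys, foreign keys, and a composite primary key work together can also be expressed as SQL table definitions. The following is a minimal sketch based on Exhibits 2-2 and 2-3; the key names follow the exhibits, but the data types and descriptive attributes (such as Supplier_Name) are simplified assumptions added only for illustration.

-- Minimal sketch of the procure-to-pay keys from Exhibits 2-2 and 2-3
CREATE TABLE Supplier (
    Supplier_ID   INT PRIMARY KEY,             -- PK: unique identifier for each supplier
    Supplier_Name VARCHAR(100)                 -- descriptive attribute
);

CREATE TABLE Materials (
    Item_Number   INT PRIMARY KEY              -- PK: unique identifier for each item
);

CREATE TABLE Purchase_Order (
    PO_Number     INT PRIMARY KEY,             -- PK: unique identifier for each order
    Supplier_ID   INT NOT NULL,                -- FK: creates the relationship to Supplier
    FOREIGN KEY (Supplier_ID) REFERENCES Supplier (Supplier_ID)
);

CREATE TABLE Purchase_Order_Detail (
    PO_Number          INT,
    Item_Number        INT,
    Quantity_Purchased INT,
    PRIMARY KEY (PO_Number, Item_Number),      -- composite primary key: two foreign keys combined
    FOREIGN KEY (PO_Number)   REFERENCES Purchase_Order (PO_Number),
    FOREIGN KEY (Item_Number) REFERENCES Materials (Item_Number)
);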

EXHIBIT 2-4
Purchase Order Table

page 59

PROGRESS CHECK
1. Referring to Exhibit 2-2, locate the relationship between the Supplier and Purchase Order tables. What is the unique identifier of each table? (The unique identifier attribute is called the primary key—more on how it’s determined in the next learning objective.) Which table contains the attribute that creates the relationship? (This attribute is called the foreign key—more on how it’s determined in the next learning objective.)

2. Referring to Exhibit 2-2, review the attributes in the Purchase Order table. There are two foreign keys listed in this table that do not relate to any of the tables in the diagram. Which tables do you think they are? What type of data would be stored in those two tables?

3. Refer to the two tables that you identified in Progress Check 2 that would relate to the Purchase Order table, but are not pictured in this diagram. Draw a sketch of what the UML Class Diagram would look like if those tables were included. Draw the two classes to represent the two tables (i.e., rectangles), the relationships that should exist, and identify the primary keys for the two new tables.

DATA DICTIONARIES
In the previous section, you learned about how data are stored by focusing on the procure-
to-pay database schema. Viewing schemas and processes in isolation clarifies each
individual process, but it can also distort reality—these schemas typically do not represent
their own separate databases. Rather, each process-specific database schema is a piece of a
greater whole, all combining to form one integrated database.
As you can imagine, once these processes come together to be supported in one
database, the amount of data can be massive. Understanding the processes and the basics
of how data are stored is critical, but even with a sound foundation, it would be nearly
impossible for an individual to remember where each piece of data is stored, or what each
piece of data represents.
Creating and using a data dictionary is paramount in helping database administrators
maintain databases and analysts identify the data they need to use. In Chapter 1, you were
introduced to the data dictionary for the LendingClub data for rejected loans (DAA
Chapter 1-1 Data). The same cut-out of the LendingClub data dictionary is provided in
Exhibit 2-5 as a reminder.

EXHIBIT 2-5
LendingClub Data Dictionary for Rejected Loan Data (DAA Chapter 1-1 Data)
Source: LendingClub Data

RejectStats File | Description
Amount Requested | Total requested loan amount
Application Date | Date of borrower application
Loan Title | Loan title
Risk_Score | Borrower risk (FICO) score
Debt-To-Income Ratio | Ratio of borrower total monthly debt payments divided by monthly income.
Zip Code | The first 3 numbers of the borrower zip code provided from loan application.
State | Two digit State Abbreviation provided from loan application.
Employment Length | Employment length in years, where 0 is less than 1 and 10 is greater than 10.
Policy Code | policy_code=1 if publicly available; policy_code=2 if not publicly available

Because the LendingClub data are provided in a flat file, the only information
necessary to describe the data are the attribute name (e.g., Amount Requested) and a
description of that attribute. The description ensures that the data in each attribute are used
and analyzed in the appropriate way—it’s always important to remember that technology
will do exactly what you tell it to, so you must be smarter than the computer! If you run
analysis on an attribute thinking it means one thing, when it actually means another, you
could make some big mistakes and bad decisions even when you are working with data
validated for completeness and integrity. It’s critical to get to know the data through
database schemas and data dictionaries thoroughly before attempting to do any data
analysis.
When you are working with data stored in a relational database, you will have more
attributes to keep track of in the data dictionary. Exhibit 2-6 provides an example of a data
dictionary for a generic Supplier table:

page 60

EXHIBIT 2-6
Supplier Data Dictionary

PROGRESS CHECK
4. What is the purpose of the primary key? A foreign key? A nonkey (descriptive) attribute?

5. How do data dictionaries help you understand the data from a database or flat file?

EXTRACT, TRANSFORM, AND LOAD (ETL) THE DATA


LO 2-3
Explain and apply extraction, transformation, and loading (ETL) techniques to prepare the data for
analysis.

Once you have familiarized yourself with the data via data dictionaries and schemas, you
are prepared to request the data from the database manager or extract the data yourself.
The ETL process begins with identifying which data you need and is complete when the
clean data are loaded in the appropriate format into the tool to be used for analysis.
This process involves:
1. Determining the purpose and scope of the data request.
2. Obtaining the data.
3. Validating the data for completeness and integrity.
4. Cleaning the data.
5. Loading the data for analysis.

page 61

Extract
Determine exactly what data you need in order to answer your business questions.
Requesting data is often an iterative process, but the more prepared you are when
requesting data, the more time you will save for yourself and the database team in the long
run.
Requesting the data involves the first two steps of the ETL process. Each step has
questions associated with it that you should try to answer.

Step 1: Determine the Purpose and Scope of the Data Request


What is the purpose of the data request? What do you need the data to solve? What
business problem will they address?
What risk exists in data integrity (e.g., reliability, usefulness)? What is the mitigation
plan?
What other information will impact the nature, timing, and extent of the data analysis?

Once the purpose of the data request is determined and scoped, as well as any risks and
assumptions documented, the next step is to determine whom to ask and specifically what
is needed, what format is needed (Excel, PDF, database), and by what deadline.

Step 2: Obtain the Data


How will data be requested and/or obtained? Do you have access to the data yourself,
or do you need to request a database administrator or the information systems
department to provide the data for you?
If you need to request the data, is there a standard data request form that you should
use? From whom do you request the data?
Where are the data located in the financial or other related systems?
What specific data are needed (tables and fields)?
What tools will be used to perform data analytic tests or procedures and why?

Obtaining the Data via a Data Request


Determining not only what data are needed, but also which tool will be used to test and
process the data will aid the database administrator in providing the data to you in the
most accessible format.
It is also necessary to specify the format in which you would like to receive the data; it
is often preferred to receive data in a flat file (i.e., if the data you requested reside in
multiple tables or different databases, they should be combined into one file without any
hierarchy or relationships built in), with the first row containing column headings (names
of the fields requested), and each subsequent row containing data that correspond with the
column headings. Subtotals, breaks, and subheadings complicate data cleaning and should
not be included.5 When you receive the data, make sure that you understand the data in
each column; the data dictionary should prove extremely helpful for this. If a data
dictionary is unavailable, then you should plan to meet with database users to get a clear
understanding of the data in each column.

Lab Connection

Lab 2-1 has you work through the process of requesting data from IT.
page 62
In a later chapter, you will be provided a deep dive into the audit data standards (ADS)
developed by the American Institute of Certified Public Accountants (AICPA).6 The aim
of the ADS is to alleviate the headaches associated with data requests by serving as a
guide to standardize these requests and specify the format an auditor desires from the
company being audited. These include the following:
1. Order-to-Cash subledger standards
2. Procure-to-Pay subledger standards
3. Inventory subledger standards
4. General Ledger standards
While the ADS provide an opportunity for standardization, they are voluntary.
Regardless of whether your request for data will conform to the standards, a data request
form template (as shown in Exhibit 2-7) can make communication easier between data
requester and provider.

EXHIBIT 2-7
Example Standard Data Request Form

Requester Name:
Requester Contact Number:
Requester Email Address:
Please provide a description of the information needed (indicate which tables and which fields you require):

What will the information be used for?

Frequency (circle one): One-Off | Annually | Termly | Other: _____

Format you wish the data to be delivered in (circle one): Spreadsheet | Word Document | Text File | Other: _____
Request Date:
Required Date:

Intended Audience:
Customer (if not requester):

Once the data are received, you can move on to the transformation phase of the ETL
process. The next step is to ensure the completeness and integrity of the extracted data.

Obtaining the Data Yourself


At times, you will have direct access to a database or information system that holds all or
some of the data you need. In this case, you may not need to go through a formal data
request process, and you can simply extract the data yourself.

page 63
After identifying the goal of the data analysis project in the first step of the IMPACT
cycle, you can follow a similar process to how you would request the data if you were
going to extract it yourself:

1. Identify the tables that contain the information you need. You can do this by looking
through the data dictionary or the relationship model.
2. Identify which attributes, specifically, hold the information you need in each table.
3. Identify how those tables are related to each other.
Once you have identified the data you need, you can start gathering the information.
There are a variety of methods that you could take to retrieve the data. Two will be
explained briefly here—SQL and Excel—and there is a deep dive into SQL in Appendices
D and E, as well as a deep dive into Excel’s VLookup and Index/Match in Appendix B.
SQL: “Structured Query Language” (SQL, often pronounced sequel) is a computer
language to interact with data (tables, records, and attributes) in a database by creating,
updating, deleting, and extracting. For Data Analytics we only need to focus on extracting
data that match the criteria of our analysis goals. Using SQL, we can combine data from
one or more tables and organize the data in a way that is more intuitive and useful for data
analysis than the way the data are stored in the relational database. A firm understanding
of the data—the tables, how they are related, and their respective primary and foreign keys
—is integral to extracting the data.
Typically, data should be stored in the database and analyzed in another tool such as
Excel, IDEA, or Tableau. However, you can choose to extract only the portion of the data
that you wish to analyze via SQL instead of extracting full tables and transforming the
data in Excel, IDEA, or Tableau. This is especially preferable when the raw data stored in
the database are large enough to overwhelm Excel. Excel 2016 can hold only 1,048,576
rows on one spreadsheet. When you attempt to bring in full tables that exceed that amount,
even when you use Excel’s powerful Power BI tools, it will slow down your analysis if the
full table isn’t necessary.
As you will explore in labs throughout this textbook, SQL isn't used only directly within the
database. When you plan to perform your analysis in Excel, Power BI, or Tableau, each
tool has an SQL option for you to directly connect to the database and pull in a subset of
the data.
There is more description about writing queries and a chance to practice creating joins
in Appendix E.
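
As a brief illustration, a query joining purchase orders to their suppliers in the procure-to-pay schema from Exhibit 2-2 might look like the following sketch; the key names come from the exhibit, while Supplier_Name is an assumed descriptive attribute.

-- Sketch: extract only the attributes needed, joined on the primary key/foreign key pair
SELECT PO.PO_Number,
       PO.Supplier_ID,
       S.Supplier_Name                          -- descriptive attribute from the related table
FROM Purchase_Order AS PO
     INNER JOIN Supplier AS S
         ON PO.Supplier_ID = S.Supplier_ID;     -- Supplier_ID is the PK in Supplier and the FK in Purchase_Order

The result is a single, flat view of the related tables that can then be loaded into Excel or another tool for further analysis.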
Microsoft Excel or Power BI: When data are not stored in a relational database, or are
not too large for Excel, the entire table can be analyzed directly in a spreadsheet. The
advantage is that further analysis can be done in Excel or Power BI and it is beneficial to
have all the data to drill down into more detail once the initial question is answered. This
approach is often simpler for doing exploratory analysis (more on this in a later chapter).
Understanding the primary key and foreign key relationships is also integral to working
with the data directly in Excel.
When your data are stored directly in Excel, you can also use Excel functions and
formulas to combine data from multiple Excel tables into one table, similar to how you
can join data with SQL in Access or another relational database. Two of Excel’s most
useful techniques for looking up data from two separate tables and matching them based
on a matching primary key/foreign key relationship are the VLookup and Index/Match
functions. There are a variety of ways that the VLookup or Index/Match function can be
used, but for extracting and transforming data it is best used to add a column to a table.
More information about using VLookup and Index/Match functions in Excel is
provided in Appendix B.
The question of whether to use SQL or Excel’s tools (such as VLookup) is primarily
answered by where the data are stored. Because data are most frequently stored in a
relational database (as discussed earlier in this chapter, due to the efficiency
and data integrity benefits relational databases provide), SQL will often be the
best option for retrieving data, after which those data can be loaded into Excel or another
tool for further analysis. Another benefit of SQL queries is that they can be saved and
reproduced at will or at regular intervals. Having a saved SQL query can make it much
easier and more efficient to re-create data requests. However, if the data are already stored
in a flat file in Excel, there is little reason to use SQL. Sometimes when you are
performing exploratory analysis, even if the data are stored in a relational database, it can
be beneficial to load entire tables into Excel and bypass the SQL step. This should be
considered carefully before doing so, though, because relational databases handle large
amounts of data much better than Excel can. Writing SQL queries can also make it easier
to load only the data you need to analyze into Excel so that you do not overwhelm Excel’s
resources.

Data Analytics at Work


Jump Start Your Accounting Career with Data Analytics Knowledge
A Robert Half survey finds that 86 percent of CFOs say that Data Analytics skills are mandatory for at
least some accounting and finance positions, with 37 percent of CFOs saying these skills are
mandatory for all accounting and finance positions, and another 49 percent reporting they are
mandatory for some roles. Indeed, these CFOs reported that technology experience and aptitude are
the most difficult to find in accounting and finance candidates.
So, jump right in and be ready for the accounting careers of the future, starting with extract,
transfer, and load (ETL) skills!

Source: Robert Half Associates, “Survey: Finance Leaders Report Technology Skills Most Difficult to
Find When Hiring,” August 22, 2019, http://rh-us.mediaroom.com/2019-08-22-Survey-Finance-
Leaders-Report-Technology-Skills-Most-Difficult-To-Find-When-Hiring (accessed January 22, 2021).

Transform
Step 3: Validating the Data for Completeness and Integrity
Anytime data are moved from one location to another, it is possible that some of the data
could have been lost during the extraction. It is critical to ensure that the extracted data are
complete (that the data you wish to analyze were extracted fully) and that the integrity of
the data remains (that none of the data have been manipulated, tampered with, or
duplicated during the extraction). Being able to validate the data successfully requires you
to not only have the technical skills to perform the task, but also to know your data well. If
you know what to reasonably expect from the data in the extraction then you have a higher
likelihood of identifying errors or issues from the extraction. Examples of data validation
questions are: “How many records should have been extracted?” “What checksums or
control totals can be performed to ensure data extraction is accurate?”
The following four steps should be completed to validate the data after extraction:

1. Compare the number of records that were extracted to the number of records in the
source database: This will give you a quick snapshot into whether any data were
skipped or didn’t extract properly due to an error or data type mismatch. This is a
critical first step, but it will not provide information about the data themselves other
than ensuring that the record counts match.

page 65

2. Compare descriptive statistics for numeric fields: Calculating the minimums,


maximums, averages, and medians will help ensure that the numeric data were
extracted completely.
3. Validate Date/Time fields in the same way as numeric fields by converting the data type
to numeric and running descriptive statistic comparisons.
4. Compare string limits for text fields: Text fields are unlikely to cause an issue if you
extracted your data into Excel because Excel allows a generous maximum character
number (for example, Excel 2016 allows 32,767 characters per cell). However, if you
extracted your data into a tool that does limit the number of characters in a string, you
will want to compare these limits to the source database’s limits per field to ensure that
you haven’t cut off any characters.

If an error is found, depending on the size of the dataset, you may be able to easily find
the missing or erroneous data by scanning the information with your eyes. However, if the
dataset is large, or if the error is difficult to find, it may be easiest to go back to the
extraction and examine how the data were extracted, fix any errors in the SQL code, and
re-run the extraction.
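
As a simple illustration, the first two validation steps can often be performed with short queries against the source table and the results compared with the same figures computed on the extracted file. The sketch below uses the procure-to-pay table names from this chapter; the specific numeric column is an assumption.

-- Step 1 sketch: record count in the source table, to compare with the extract
SELECT COUNT(*) AS record_count
FROM Purchase_Order_Detail;

-- Step 2 sketch: descriptive statistics for a numeric field, to compare with the extract
SELECT MIN(Quantity_Purchased)       AS minimum_quantity,
       MAX(Quantity_Purchased)       AS maximum_quantity,
       AVG(1.0 * Quantity_Purchased) AS average_quantity   -- multiply by 1.0 so the average is not truncated to an integer
FROM Purchase_Order_Detail;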

Lab Connection

Lab 2-5, Lab 2-6, Lab 2-7, and Lab 2-8 explore the process of loading and
validating data.

Step 4: Cleaning the Data


After validating the data, you should pay close attention to the state of the data and clean
them as necessary to improve the quality of the data and subsequent analysis. The
following four items are some of the more common ways that data will need to be cleaned after extraction and validation (a short SQL sketch of a few of these steps follows the list):

1. Remove headings or subtotals: Depending on the extraction technique used and the file
type of the extraction, it is possible that your data could contain headings or subtotals
that are not useful for analysis. Of course, these issues could be overcome in the
extraction steps of the ETL process if you are careful to request the data in the correct
format or to only extract exactly the data you need.
2. Clean leading zeroes and nonprintable characters: Sometimes data will contain
leading zeroes or “phantom” (nonprintable) characters. This will happen particularly
when numbers or dates were stored as text in the source database but need to be
analyzed as numbers. Nonprintable characters can be white spaces, page breaks, line
breaks, tabs, and so on, and can be summarized as characters that our human eyes can’t
see, but that the computer interprets as a part of the string. These can cause trouble
when joining data because, while two strings may look identical to our eyes, the
computer will read the nonprintable characters and will not find a match.
3. Format negative numbers: If there are negative numbers in your dataset, ensure that the
formatting will work for your analysis. For example, if your data contain negative
numbers formatted in parentheses and you would prefer this formatting to be as a
negative sign, this needs to be corrected and consistent.
4. Correct inconsistencies across data, in general: If the source database did not enforce
certain rules around data entry, it is possible that there are inconsistencies across the
data—for example, if there is a state field, Arkansas could be formatted as “AR,”
“Ark,” “Ar.,” and so on. These will need to be replaced with a common value before
you begin your analysis if you are interested in grouping data geographically.
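
A sketch of how a couple of these cleaning steps might look in SQL appears below; the table and column names (a staging table, Invoice_Number, and Supplier_State) are hypothetical and used only for illustration.

-- Sketch: trim leading/trailing spaces and drop leading zeroes from a number stored as text
SELECT CAST(LTRIM(RTRIM(Invoice_Number)) AS INT) AS invoice_number_clean
FROM Staging_Purchase_Order;

-- Sketch: standardize inconsistent state values before grouping data geographically
SELECT CASE
           WHEN UPPER(Supplier_State) IN ('AR', 'ARK', 'AR.', 'ARKANSAS') THEN 'AR'
           ELSE UPPER(Supplier_State)
       END AS supplier_state_clean
FROM Staging_Supplier;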

page 66

Lab Connection

Lab 2-2 and Lab 2-3 walk through how to prepare data for analysis and
resolve common data quality issues.

A Note about Data Quality


As you prepare your data for analysis, you should pay close attention to the quality of the
underlying data. Incorrect or invalid data can skew your results and lead to inaccurate
conclusions. Low-quality data will often contain numerous errors, obsolete or incorrect
data, or invalid data.
To evaluate a dataset and its underlying data quality, here are five main data quality
issues to consider when you evaluate data for the first time (a short SQL sketch addressing the first two issues follows the list):

1. Dates: The most common problems revolve around the date format because there are
so many different ways a date can be presented. For example, look at the different ways
you can show July 6, 2024: 6-Jul-2024; 6.7.2024; 45479 (in Excel); 07/06/2024 (in the
United States); 06/07/2024 (in Europe); and the list goes on. You need to format the
date to match the acceptable format for your tool. The ISO 8601 standard indicates you
should format dates in the year-month-day format (2024-07-06), and most professional
query tools accept this format. If you use Excel to transform dates to this format,
highlight your dates and go to Home > Number > Format Cells and choose Custom.
Then type in YYYY-MM-DD and click OK.

Format Cells Window in Excel


Microsoft Excel

page 67

2. Numbers: Numbers can be misinterpreted, particularly if they are manually entered.


For example, 1 or I; 0 or O; 3 or E; 7 or seven. Watch for invalid number formats when
you start sorting and analyzing your data, and then go back and correct them.
Additionally, accounting artifacts such as dollar signs, commas, and parentheses are
pervasive in spreadsheet data (e.g., $12,345.22 or (1,422.53)). As you clean the data,
remove any extra accounting characters so numbers appear in their raw form (e.g.,
12345.22 or -1422.53).
3. International characters and encoding: When you work with data that span multiple
countries, it is likely that you will come across special characters, such as accent marks
(á or À), umlauts (Ü), invisible computer characters (TAB, RETURN, linebreak, null),
or special characters that are used in query and scripting languages (*, #, “, ’). In many
cases, these can be corrected with a find and replace or contained in quote marks so
they are ignored by the query language. Additionally, while most modern computer
programs use UNICODE as the text encoding language, older databases will generate
data in the ASCII format. If your tool fails to populate your dataset accurately, having
international characters and symbols is likely to be a cause.
4. Languages and measures: Similar to international characters, data elements may
contain a variety of words or measures that have the same meaning. For example,
cheese or fromage; ketchup or catsup; pounds or lbs; $ or €; Arkansas or AR. In order
to properly analyze the comparable data, you’ll need to translate them into a common
format by choosing one word as the standard and replacing the equivalent words. Also
make sure the measure doesn’t change the meaning. The total value in U.S. dollars is
not the same thing as the total value in euros. Make sure you’re comparing apples to
apples or euros to euros.
5. Human error: Whenever there is manual input into the data, there is a high probability
that data will be bad simply because they were mistyped or entered into the wrong
place. There’s no hard and fast rule for dealing with input errors other than being
vigilant and making corrections (e.g., find and replace) when they occur.
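
The first two issues can often be corrected during the transformation step. The sketch below shows one way to standardize a U.S.-formatted text date to the ISO 8601 DATE type and to strip accounting characters from a text amount in SQL Server; the staging table and column names are hypothetical.

-- Sketch: standardize dates and numbers stored as text
SELECT TRY_CONVERT(DATE, Order_Date_Text, 101) AS order_date_iso,   -- style 101 reads mm/dd/yyyy text
       TRY_CAST(
           REPLACE(REPLACE(REPLACE(REPLACE(Amount_Text, '$', ''), ',', ''), '(', '-'), ')', '')
           AS DECIMAL(12, 2)
       ) AS amount_clean                                             -- (1,422.53) becomes -1422.53
FROM Staging_Purchase_Order;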

Load
Step 5: Loading the Data for Data Analysis
If the extraction and transformation steps have been done well by the time you reach this
step, the loading part of the ETL process should be the simplest step. It is so simple, in
fact, that if your goal is to do your analysis in Excel and you have already transformed and
cleaned your data in Excel, you are finished. There should be no additional loading
necessary.
However, it is possible that Excel is not the last step for analysis. The data analysis
technique you plan to implement, the subject matter of the business questions you intend
to answer, and the way in which you wish to communicate results will all drive the choice
of which tool you use to perform your analysis.
Throughout the text, you will be introduced to a variety of different tools to use for
analyzing data, including Excel, Power BI, Tableau Prep, and Tableau Desktop. As
these tools are introduced to you, you will learn how to load data into them.

ETL or ELT?
If loading the data into Excel is indeed the last step, are you actually “extracting,
transforming, and loading,” or is it “extracting, loading, and transforming”?
The term ETL has been in popular use since the 1970s, and even though methods for
extracting and transforming data have gotten easier to use, more accessible, as well as
more robust, the term has stuck. Increasingly, however, the procedure is shifting toward
ELT. Particularly with tools such as Microsoft’s Power BI suite, all of the loading and
transforming can be done within Excel, with data directly loaded into Excel
from the database, and then transformed (also within Excel). The most
common method for mastering the data that we use throughout this textbook is more in
line with ELT than ETL; however, even when the order changes from ETL to ELT, it is
still more common to refer to the procedure as ETL.

PROGRESS CHECK
6. Describe two different methods for obtaining data for analysis.

7. What are four common data quality issues that must be fixed before analysis can take place?

ETHICAL CONSIDERATIONS OF DATA COLLECTION AND USE
LO 2-4
Describe the ethical considerations of data collection and data use.

Mastering the data goes beyond just ETL processes. Mastering the data also includes
having some assurance that the data collection is not only secure, but also that the ethics of
data collection and data use have been considered.
In the past, the scope for digital risk was limited to cybersecurity threats to make sure
the data were secure; however, increasingly the concern is the risk of lacking ethical data
practices. Indeed, the concerns regarding data gleaned from traditional and nontraditional
sources are that they are used in an ethical manner and for their intended purpose.
Potential ethical issues include an individual’s right to privacy and whether assurance
is offered that certain data are not misused. For example, is the individual about whom
data has been collected able to limit who has access to her personal information, and how
those data are used or shared? If an individual’s credit card is submitted for an e-
commerce transaction, does the customer have assurance that the credit card number will
not be misused?
To address these and other concerns, the Institute of Business Ethics suggests that
companies consider the following six questions to allow a business to create value from
data use and analysis, and still protect the privacy of stakeholders 7:
1. How does the company use data, and to what extent are they integrated into firm
strategy? What is the purpose of the data? Are they accurate or reliable? Will they
benefit the customer or the employee?
2. Does the company send a privacy notice to individuals when their personal data
are collected? Is the request to use the data clear to the user? Do they agree to the
terms and conditions of use of their personal data?
3. Does the company assess the risks linked to the specific type of data the company
uses? Have the risks of data use or data breach of potentially sensitive data been
considered?
4. Does the company have safeguards in place to mitigate the risks of data misuse?
Are preventive controls on data access in place and are they effective? Are penalties
established and enforced for data misuse?
5. Does the company have the appropriate tools to manage the risks of data misuse?
Is the feedback from these tools evaluated and measured? Does internal audit regularly
evaluate these tools?
6. Does our company conduct appropriate due diligence when sharing with or
acquiring data from third parties? Do third-party data providers follow similar
ethical standards in the acquisition and transmission of the data?

page 69
The user of the data must continue to recognize the potential risks associated with data
collection and data use, and work to mitigate those risks in a responsible way.

PROGRESS CHECK
8. A firm purchases data from a third party about customer preferences for laundry detergent. How

would you recommend that this firm conduct appropriate due diligence about whether the third-

party data provider follows ethical data practices? An audit? A questionnaire? What questions

should be asked?

Summary
■ The first step in the IMPACT cycle is to identify the questions that you
intend to answer through your data analysis project. Once a data analysis
problem or question has been identified, the next step in the IMPACT
cycle is mastering the data, which includes obtaining the data needed and
preparing it for analysis. We often call the processes associated with
mastering the data ETL, which stands for extract, transform, and load.
(LO 2-2, 2-3)
■ In order to obtain the right data, it is important to have a firm grasp of
what data are available to you and how that information is stored. (LO 2-
2)
◦ Data are often stored in a relational database, which helps to ensure that an
organization’s data are complete and to avoid redundancy. Relational
databases are made up of tables with rows of data that represent records.
Each record is uniquely identified with a primary key. Tables are related to
other tables by using the primary key from one table as a foreign key in
another table.
■ Extract: To obtain the data, you will either have access to extract the data
yourself or you will need to request the data from a database administrator
or the information systems team. If the latter is the case, you will complete
a data request form, indicating exactly which data you need and why. (LO
2-3)
■ Transform: Once you have the data, they will need to be validated for
completeness and integrity—that is, you will need to ensure that all of the
data you need were extracted and that all data are correct. Sometimes,
when data are extracted, formatting or even entire records
can get lost, resulting in inaccuracies. Correcting the errors and cleaning
the data is an integral step in mastering the data. (LO 2-3)
■ Load: Finally, after the data have been cleaned, there may be one last step
of mastering the data, which is to load them into the tool that will be used
for analysis. Often, the cleaning and correcting of data occur in Excel, and
the analysis will also be done in Excel. In this case, there is no need to
load the data elsewhere. However, if you intend to do more rigorous
statistical analysis than Excel provides, or if you intend to do more robust
data visualization than can be done in Excel, it may be necessary to load
the data into another tool following the transformation process. (LO 2-3)
■ Mastering the data goes beyond just the ETL processes. Those who collect
and use data also have the responsibility of being good stewards,
providing some assurance that the data collection is not only secure, but
also that the ethics of data collection and data use have been considered.
(LO 2-4)

page 70

Key Words
accounting information system (54) A system that records, processes, reports, and communicates the results of
business transactions to provide financial and nonfinancial information for decision-making purposes.
composite primary key (58) A special case of a primary key that exists in linking tables. The composite primary
key is made up of the two primary keys in the table that it is linking.
customer relationship management (CRM) system (54) An information system for managing all interactions
between the company and its current and potential customers.
data dictionary (59) Centralized repository of descriptions for all of the data attributes of the dataset.
data request form (62) A method for obtaining data if you do not have access to obtain the data directly yourself.
descriptive attributes (58) Attributes that exist in relational databases that are neither primary nor foreign keys.
These attributes provide business information, but are not required to build a database. An example would be
“Company Name” or “Employee Address.”
Enterprise Resource Planning (ERP) (54) Also known as Enterprise Systems, a category of business
management software that integrates applications from throughout the business (such as manufacturing, accounting,
finance, human resources, etc.) into one system.
ETL (60) The extract, transform, and load process that is integral to mastering the data.
flat file (57) A means of storing data in one place, such as in an Excel spreadsheet, as opposed to storing the data in
multiple tables, such as in a relational database.
foreign key (58) An attribute that exists in relational databases in order to carry out the relationship between two
tables. This does not serve as the “unique identifier” for each record in a table. These must be identified when
mastering the data from a relational database in order to extract the data correctly from more than one table.
human resource management (HRM) system (54) An information system for managing all interactions
between the company and its current and potential employees.
mastering the data (54) The second step in the IMPACT cycle; it involves identifying and obtaining the data
needed for solving the data analysis problem, as well as cleaning and preparing the data for analysis.
primary key (57) An attribute that is required to exist in each table of a relational database and serves as the
“unique identifier” for each record in a table.
relational database (56) A means of storing data in order to ensure that the data are complete, not redundant, and
to help enforce business rules. Relational databases also aid in communication and integration of business processes
across an organization.
supply chain management (SCM) system (54) An information system that helps manage all the company’s
interactions with suppliers.

ANSWERS TO PROGRESS CHECKS


1. The unique identifier of the Supplier table is [Supplier ID], and the unique identifier of the Purchase

Order table is [PO Number]. The Purchase Order table contains the foreign key.

2. The foreign key attributes in the Purchase Order table that do not relate to any tables in the view are

EmployeeID and CashDisbursementID. These attributes probably relate to the Employee table (so

that we can tell which employee was responsible for each Purchase Order) and the Cash

Disbursement table (so that we can tell if the Purchase Orders have been paid for yet, and if so, on

which check). The Employee table would be a complete listing of each employee, as well as

containing the details about each employee (for example, phone number, address, etc.). The Cash

Disbursement table would be a listing of the payments the company has made.

page 71

3.

4. The purpose of the primary key is to uniquely identify each record in a table. The purpose of a foreign

key is to create a relationship between two tables. The purpose of a descriptive attribute is to provide

meaningful information about each record in a table. Descriptive attributes aren’t required for a

database to run, but they are necessary for people to gain business information about the data stored

in their databases.

5. Data dictionaries provide descriptions of the function (e.g., primary key or foreign key when

applicable), data type, and field names associated with each column (attribute) of a database. Data

dictionaries are especially important when databases contain several different tables and many
different attributes in order to help analysts identify the information they need to perform their

analysis.

6. Depending on the level of security afforded to a business analyst, she can either obtain data directly

from the database herself or she can request the data. When obtaining data herself, the analyst must

have access to the raw data in the database and a firm knowledge of SQL and data extraction

techniques. When requesting the data, the analyst doesn’t need the same level of extraction skills,

but she still needs to be familiar with the data enough in order to identify which tables and attributes

contain the information she requires.

7. Four common issues that must be fixed are removing headings or subtotals, cleaning leading zeroes

or nonprintable characters, formatting negative numbers, and correcting inconsistencies across the

data.

8. Firms can ask to see the terms and conditions of their third-party data supplier, and ask questions to

come to an understanding regarding if and how privacy practices are maintained. They also can

evaluate what preventive controls on data access are in place and assess whether they are followed.

Generally, an audit does not need to be performed, but requesting a questionnaire be filled out would

be appropriate.

Multiple Choice Questions


1. (LO 2-3) Mastering the data can also be described via the ETL process. The ETL process stands for:

a. extract, total, and load data.

b. enter, transform, and load data.

c. extract, transform, and load data.

d. enter, total, and load data.

page 72

2. (LO 2-3) Which of the following describes part of the goal of the ETL process?

a. Identify which approach to Data Analytics should be used.

b. Load the data into a relational database for storage.

c. Communicate the results and insights found through the analysis.


d. Identify and obtain the data needed for solving the problem.

3. (LO 2-2) The advantages of storing data in a relational database include which of the following?

a. Help in enforcing business rules

b. Increased information redundancy

c. Integrating business processes

d. All of the answers are correct

e. a and b

f. b and c

g. a and c

4. (LO 2-3) The purpose of transforming data is:

a. to validate the data for completeness and integrity.

b. to load the data into the appropriate tool for analysis.

c. to obtain the data from the appropriate source.

d. to identify which data are necessary to complete the analysis.

5. (LO 2-2) Which attribute is required to exist in each table of a relational database and serves as the

“unique identifier” for each record in a table?

a. Foreign key

b. Unique identifier

c. Primary key

d. Key attribute

6. (LO 2-2) The metadata that describe each attribute in a database are which of the following?

a. Composite primary key

b. Data dictionary

c. Descriptive attributes

d. Flat file

7. (LO 2-3) As mentioned in the chapter, which of the following is not a common way that data will need

to be cleaned after extraction and validation?

a. Remove headings and subtotals.


b. Format negative numbers.

c. Clean up trailing zeroes.

d. Correct inconsistencies across data.

8. (LO 2-2) Why is Supplier ID considered to be a primary key for a Supplier table?

a. It contains a unique identifier for each supplier.

b. It is a 10-digit number.

c. It can either be for a vendor or miscellaneous provider.

d. It is used to identify different supplier categories.

9. (LO 2-2) What are attributes that exist in a relational database that are neither primary nor foreign

keys?

a. Nondescript attributes

b. Descriptive attributes

c. Composite keys

d. Relational table attributes

page 73

10. (LO 2-4) Which of the following questions are not suggested by the Institute of Business Ethics to

allow a business to create value from data use and analysis, and still protect the privacy of

stakeholders?

a. How does the company use data, and to what extent are they integrated into firm strategy?

b. Does the company send a privacy notice to individuals when their personal data are collected?

c. Does the data used by the company include personally identifiable information?

d. Does the company have the appropriate tools to mitigate the risks of data misuse?

Discussion and Analysis


1. (LO 2-2) The advantages of a relational database include limiting the amount of redundant data that

are stored in a database. Why is this an important advantage? What can go wrong when redundant

data are stored?


2. (LO 2-2) The advantages of a relational database include integrating business processes. Why is it

preferable to integrate business processes in one information system, rather than store different

business process data in separate, isolated databases?

3. (LO 2-2) Even though it is preferable to store data in a relational database, storing data across

separate tables can make data analysis cumbersome. Describe three reasons it is worth the trouble

to store data in a relational database.

4. (LO 2-2) Among the advantages of using a relational database is enforcing business rules. Based on

your understanding of how the structure of a relational database helps prevent data redundancy and

other advantages, how does the primary key/foreign key relationship structure help enforce a

business rule that indicates that a company shouldn’t process any purchase orders from suppliers

who don’t exist in the database?

5. (LO 2-2) What is the purpose of a data dictionary? Identify four different attributes that could be

stored in a data dictionary, and describe the purpose of each.

6. (LO 2-3) In the ETL process, the first step is extracting the data. When you are obtaining the data

yourself, what are the steps to identifying the data that you need to extract?

7. (LO 2-3) In the ETL process, if the analyst does not have the security permissions to access the data

directly, then he or she will need to fill out a data request form. While this doesn’t necessarily require

the analyst to know extraction techniques, why does the analyst still need to understand the raw data

very well in order to complete the data request?

8. (LO 2-3) In the ETL process, when an analyst is completing the data request form, there are a

number of fields that the analyst is required to complete. Why do you think it is important for the

analyst to indicate the frequency of the report? How do you think that would affect what the database

administrator does in the extraction?

9. (LO 2-3) Regarding the data request form, why do you think it is important to the database

administrator to know the purpose of the request? What would be the importance of the “To be used

in” and “Intended audience” fields?

10. (LO 2-3) In the ETL process, one important step to process when transforming the data is to work

with null, n/a, and zero values in the dataset. If you have a field of quantitative data (e.g., number of

years each individual in the table has held a full-time job), what would be the effect of the following?

a. Transforming null and n/a values into blanks

b. Transforming null and n/a values into zeroes

c. Deleting records that have null and n/a values from your dataset
(Hint: Think about the impact on different aggregate functions, such as COUNT and AVERAGE.)

page 74

11. (LO 2-4) What is the theme of each of the six questions proposed by the Institute of Business Ethics?

Which one addresses the purpose of the data? Which one addresses how the risks associated with

data use and collection are mitigated? How could these two specific objectives be achieved at the

same time?

Problems
1. (LO 2-2) Match the relational database function to the appropriate relational database term:

Composite primary key
Descriptive attribute
Foreign key
Primary key
Relational database

Relational Database Function | Relational Database Term

1. Serves as a unique identifier in a database table.
2. Creates a relationship between two tables.
3. Two foreign keys from the tables that it is linking combine to make up a unique identifier.
4. Describes each record with actual business information.
5. A means of storing data to ensure data are complete, not redundant, and to help enforce business rules.

2. (LO 2-3) Identify the order sequence in the ETL process as part of mastering the data (i.e., 1 is first; 5 is last).

Steps of the ETL Process | Sequence Order (1 to 5)

1. Validate the data for completeness and integrity.
2. Sanitize the data.
3. Obtain the data.
4. Load the data in preparation for data analysis.
5. Determine the purpose and scope of the data request.

3. (LO 2-3) Identify which ETL tasks would be considered “Validating” the data, and which would be considered “Cleaning” the data.

ETL Task | Validating or Cleaning

1. Compare the number of records that were extracted to the number of records in the source database.
2. Remove headings or subtotals.
3. Remove leading zeroes and nonprintable characters.
4. Compare descriptive statistics for numeric fields.
5. Format negative numbers.
6. Compare string limits for text fields.
7. Correct inconsistencies across data, in general.

page 75

4. (LO 2-3) Match each ETL task to the stage of the ETL process:

Determine purpose
Obtain
Validate
Clean
Load

ETL Task | Stage of ETL Process

1. Use SQL to extract data from the source database.
2. Remove headings or subtotals.
3. Choose which database and specific data will be needed to address the accounting question.
4. Compare the number of records extracted to the number of records in the source database.
5. Make sure all data formats start with two capital letters. Fix inconsistencies.
6. Input the data into the analysis tool.

5. (LO 2-4) For each of the six questions suggested by the Institute of Business Ethics to evaluate data privacy, categorize each question into one of these three types:

A. Evaluate the company’s purpose of the data
B. Evaluate the company’s use or misuse of the data
C. Evaluate the due diligence of the company’s data vendors in preventing misuse of the data

Institute of Business Ethics Questions regarding Data Use and Privacy | Category A, B, or C?

1. Does the company assess the risks linked to the specific type of data the company uses?
2. Does the company send a privacy notice to individuals when their personal data are collected?
3. How does the company use data, and to what extent are they integrated into firm strategy?
4. Does our company conduct appropriate due diligence when sharing with or acquiring data from third parties?
5. Does the company have the appropriate tools to manage the risks of misuse?
6. Does the company have safeguards in place to mitigate these risks of misuse?

6. (LO 2-2) Which of the following are useful, established characteristics of using a relational database?

Characteristic | Useful, Established Characteristic of Relational Databases? Yes/No

1. Completeness
2. Reliable
3. No redundancy
4. Communication and integration of business processes
5. Less costly to purchase
6. Less effort to maintain
7. Business rules are enforced

page 76

7. (LO 2-3) As part of mastering the data, analysts must make certain trade-offs when they consider which

data to use. Consider these three different scenarios:


a. Analysis: What are the trade-offs of using data that are highly relevant to the question, but have a
lot of missing data?

b. Analysis: What are the trade-offs an analyst should consider between data that are very
expensive to acquire and analyze, but will most directly address the question at hand? How would
you assess whether they are worth the extra cost?

c. Analysis: What are the trade-offs between extracting needed data by yourself, or asking a data
scientist to get access to the data?

8. (LO 2-4) The Institute of Business Ethics proposes that a company protect the privacy of

stakeholders by considering these questions of its third-party data providers:

Does our company conduct appropriate due diligence when sharing with or acquiring data from
third parties?

Do third-party data providers follow similar ethical standards in the acquisition and transmission of
the data?

a. Analysis: What type of due diligence with regard to a third party sharing and acquiring data would
be appropriate for the company (or company accountant or data scientist) to perform? An audit? A
questionnaire? Standards written in to a contract?

b. Analysis: How would you assess whether the third-party data provider follows ethical standards in
the acquisition and transmission of the data?

page 77

LABS

Lab 2-1 Request Data from IT—Sláinte

Case Summary: Sláinte is a fictional brewery that has recently gone through big changes.
Sláinte sells six different products. The brewery has only recently expanded its business to
distributing from one state to nine states, and now its business has begun stabilizing after
the expansion. With that stability comes a need for better analysis. You have been hired by
Sláinte to help management better understand the company’s sales data and provide input
for its strategic decisions. In this lab, you will identify appropriate questions and develop a
hypothesis for each question, generate a data request, and evaluate the data you receive.
Data: Lab 2-1 Data Request Form.zip - 10KB Zip / 13KB Word
Lab 2-1 Part 1 Identify the Questions and Generate a
Data Request
Before you begin the lab, you should create a new blank Word document where you will
record your screenshots and save it as Lab 2-1 [Your name] [Your email address].docx.
One of the biggest challenges you face with data analysis is getting the right data. You
may have the best questions in the world, but if there are no data available to support your
hypothesis, you will have difficulty providing value. Additionally, there are instances in
which the IT workers may be reluctant to share data with you. They may send incomplete
data, the wrong data, or completely ignore your request. Be persistent, and you may have
to look for creative ways to find insight with an incomplete picture.
One of Sláinte’s first priorities is to identify its areas of success as well as areas of
potential improvement. Your manager has asked you to focus specifically on sales data at
this point. This includes data related to sales orders, products, and customers.
Answer the Lab 2-1 Part 1 Analysis Questions and then complete a data request form
for those data you have identified for your analysis.
1. Open the Data Request Form.
2. Enter your contact information.
3. In the description field, identify the tables and fields that you’d like to analyze, along
with the time periods (e.g., past month, past year, etc.).
4. Indicate what the information will be used for in the appropriate box (internal
analysis).
5. Select a frequency. In this case, this is a “One-off request.”
6. Choose a format (spreadsheet).
7. Enter a request date (today) and a required date (one week from today).
8. Take a screenshot (label it 2-1A) of your completed form.

Lab 2-1 Part 1 Analysis Questions (LO 2-2)


AQ 1. Given that you are new and trying to get a grasp on Sláinte’s
operations, list three questions related to sales that would help you
begin your analysis. For example, how many products were sold in
each state?
AQ 2. Now hypothesize the answers to each of the questions. Remember,
your answers don’t have to be correct at this point. They will help
you understand what type of data you are looking for. For example:
500 in Missouri, 6,000 in Pennsylvania, 4,000 in New York, and so
on.
AQ 3. Finally, for each question, identify the specific tables and fields that
are needed to answer your questions. Use the data dictionary and ER
Diagram provided in Appendix J for guidance on what
tables and attributes are available. For example, to
answer the question about state sales, you would need the
Customer_State attribute that is located in the Customer master table
as well as the Sales_Order_Quantity_Sold attribute in the Sales
table. If you had access to store or distribution center location data,
you may also look for a State field there, as well.
Lab 2-1 Part 2 Evaluate the Data Extract
After a few days, Rachel, an IT worker, responds to your request. She gives you the
following tables and attributes, shown in Lab Exhibit 2-1A:

EXHIBIT 2-1A

Sales_Orders Table

Attribute Description of Attribute


Sales_Order_ID (PK) Unique identifier for each sales order
Sales_Order_Date The date of the sales order, regardless of the date the order is entered
Shipping_Cost Shipping cost for the order

Sales_Order_Lines Table

Attribute Description of Attribute


Sales_Order_ID (FK) Unique identifier for each sales order
Sales_Order_Quantity_Sold Sales order line quantity
Product_Sale_Price Sales order line price per unit

Finished_Goods_Products Table

Attribute Description of Attribute


Product_Code (PK) Unique identifier for each product
Product_Description Product description (plain English) to indicate the name or other identifying characteristics
of the product
Product_Sale_Price Price per unit of the associated product

You may notice that while there are a few attributes that may be useful in your sales
analysis, the list may be incomplete and be missing several values. This is normal with
data requests.

Lab 2-1 Part 2 Objective Questions (LO 2-2)


OQ 1. Which tables and attributes are missing from the data extract that
would be necessary to answer the question “How many products
were sold in each state?”
OQ 2. What new question can you answer using the data extract?

Lab 2-1 Part 2 Analysis Questions (LO 2-2)


AQ 1. Evaluate your original questions and responses from Part 1. Can you
still answer the original questions that you identified in step 1 with
the data provided?
AQ 2. What additional tables and attributes would you need to answer your
questions?

Lab 2-1 Submit Your Screenshot Lab Document


Verify that you have answered any questions your instructor has assigned, then upload
your screenshot lab document to Connect or to the location indicated by your instructor.

page 79

Lab 2-2 Prepare Data for Analysis—Sláinte

Lab Note: The tools presented in this lab periodically change. Updated instructions, if
applicable, can be found in the eBook and lab walkthrough videos in Connect.
Case Summary: Sláinte is a fictional brewery that has recently gone through big
changes. Sláinte sells six different products. The brewery has only recently expanded its
business to distributing from one state to nine states, and now its business has begun
stabilizing after the expansion. Sláinte has brought you in to help determine potential areas
for sales growth in the next year. Additionally, management has noticed that the
company’s margins aren’t as high as they had budgeted and would like you to help
identify some areas where they could improve their pricing, marketing, or strategy.
Specifically, they would like to know how many of each product were sold, the product’s
actual name (not just the product code), and the months in which different products were
sold.
Data: Lab 2-2 Slainte Dataset.zip - 83KB Zip / 90KB Excel

Lab 2-2 Example Output


By the end of this lab, you will create a PivotTable that will let you explore sales data.
While your results will include different data values, your work should look similar to
this:
Lab 2-2 Part 1 Prepare a Data Model
Before you begin the lab, you should create a new blank Word document where you will
record your screenshots and save it as Lab 2-2 [Your name] [Your email address].docx.
Efficient relational databases contain normalized data. That is, each table contains only
data that are relevant to the object, and tables’ relationships are defined with primary
key/foreign key pairs as mentioned earlier in the chapter.
With Data Analytics, we often need to pull data from where they are stored into a
separate tool, such as Microsoft Power BI, Excel, or Tableau. Each of these tools provides
the opportunity to either connect directly to tables in the source data or to “denormalize”
the data.

Microsoft | Excel + Power Query

Microsoft Excel

LAB 2-2M Example of PivotTable in Microsoft Excel for November and


December

page 80

Tableau | Desktop
Tableau Software, Inc. All rights reserved.

LAB 2-2T Example of PivotTable in Tableau Desktop for November and


December

In this lab, you will learn how to connect to data in Microsoft Power BI or Excel using
the Internal Data Model and how to connect to data and build relationships among tables
in Tableau. This will prepare you for future labs that require you to transform data, as well
as aid your understanding of primary and foreign key relationships.
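If it helps to see the same idea outside of Power Query or Tableau, the minimal Python/pandas sketch below shows what it means to relate and then “denormalize” tables: three small tables linked by primary key/foreign key pairs are joined into one flat table for analysis. The table and column names mimic the Sláinte structure, but the rows are made up, and pandas is not part of this lab’s required tools.

import pandas as pd

# Simplified stand-ins for three Sláinte tables (values are invented)
products = pd.DataFrame({
    "Product_Code": ["P1", "P2"],                       # primary key
    "Product_Description": ["Imperial Stout", "IPA"],
})
sales_orders = pd.DataFrame({
    "Sales_Order_ID": [1001, 1002],                     # primary key
    "Sales_Order_Date": pd.to_datetime(["2020-01-15", "2020-02-03"]),
})
sales_order_lines = pd.DataFrame({
    "Sales_Order_ID": [1001, 1001, 1002],               # foreign key to sales_orders
    "Product_Code": ["P1", "P2", "P1"],                 # foreign key to products
    "Sales_Order_Quantity_Sold": [10, 5, 8],
})

# "Denormalize": join each order line to its order header and product description
flat = (sales_order_lines
        .merge(sales_orders, on="Sales_Order_ID", how="left")
        .merge(products, on="Product_Code", how="left"))

print(flat)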

Microsoft | Excel + Power Query

1. Create a new blank spreadsheet in Excel.


2. From the Data tab on the ribbon, click Get Data > From File > From
Workbook. Note: In older versions of Excel, click the New Query button.
3. Locate the Lab 2-2 Slainte Dataset.xlsx file on your computer, and click
Import.
4. In the Navigator, check Select multiple items, then check the following tables to
import:

a. Finished_Goods_Products
b. Sales_Order
c. Sales_Order_Lines

5. Click Edit or Transform Data to open Power Query Editor.


6. Click through the table queries and attributes and correct the following issues:
a. Finished_Goods_Products:
1. Change the data type for Product_Sale_Price to Currency (click the column
header, then click Transform > Data Type > Currency). If prompted,
choose Replace current conversion step.

page 81

b. Sales_Order_Lines: Change the data type for Product_Sale_Price to Currency.


c. Sales_Order: Change the data type for Invoice_Order_Total and Shipping_Cost
to Currency.

7. Take a screenshot (label it 2-2MA) of the Power Query Editor window with
your changes.
8. At this point, we are ready to connect the data to our Excel sheet. We will only
create a connection so we can pull it in for specific analyses. Click the Home tab
and choose the arrow below Close & Load > Close & Load To…
9. Choose Only Create Connection and Add this data to the Data Model and
click OK. The three queries will appear in a tab on the right side of your sheet.
10. Save your workbook as Lab 2-2 Slainte Model.xlsx, and continue to Part 2.

Tableau | Desktop

1. Open Tableau Desktop.


2. Choose Connect > To a File > Microsoft Excel.
3. Locate the Lab 2-2 Slainte Dataset.xlsx and click Open.
4. Double-click or drag the following tables to the data pane on the top-right.
a. Finished_Goods_Products
b. Sales_Order_Lines: If prompted to Edit Relationship, match Product Code on
the left with Product Code on the right and close the window.
c. Sales_Order: If prompted to Edit Relationship, match Sales Order ID on the left
with Sales Order ID on the right and close the window.

5. Take a screenshot (label it 2-2TA).


6. Save your workbook as Lab 2-2 Slainte Model.twb, and continue to Part 2.

Lab 2-2 Part 1 Objective Questions (LO 2-3)


OQ 1. How many tables did you just load?
OQ 2. How many rows were loaded for the Sales_Order query?
OQ 3. How many rows were loaded for the Finished_Goods_Products
query?

Lab 2-2 Part 1 Analysis Questions (LO 2-3)


AQ 1. Have you used the Microsoft or Tableau tools before this class?
AQ 2. Compare and Contrast: If you completed this part with multiple
tools, which tool options do you think will be most useful for
preparing future data for analysis?

page 82
Lab 2-2 Part 2 Validate the Data
Now that the data have been prepared and organized, you’re ready for some basic analysis.
Given the sales data, management has asked you to prepare a report showing the total
number of each item sold each month between January and April 2020. This means that
we should create a PivotTable with a column for each month, a row for each product, and
the sum of the quantity sold where the two intersect.
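For reference only, the same cross-tabulation can be expressed as a short Python/pandas sketch; the lab itself uses Excel or Tableau, and the small table below is made up rather than drawn from the Sláinte dataset.

import pandas as pd

# A tiny flattened sales table: one row per sales order line (values are invented)
flat = pd.DataFrame({
    "Sales_Order_Date": pd.to_datetime(["2020-01-15", "2020-01-20", "2020-02-03"]),
    "Product_Description": ["Imperial Stout", "IPA", "Imperial Stout"],
    "Sales_Order_Quantity_Sold": [10, 5, 8],
})

# Rows = products, columns = months, values = total quantity sold where they intersect
flat["Month"] = flat["Sales_Order_Date"].dt.to_period("M")
sales_by_month = pd.pivot_table(flat,
                                index="Product_Description",
                                columns="Month",
                                values="Sales_Order_Quantity_Sold",
                                aggfunc="sum",
                                fill_value=0)
print(sales_by_month)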

Microsoft | Excel + Power Query

1. Open the Lab 2-2 Slainte Model.xlsx you created in Part 1.


2. Click the Insert tab on the ribbon and choose PivotTable.
3. In the Create PivotTable window, click Use this workbook’s Data Model and
click OK. A PivotTable Fields pane appears on the right of your worksheet.

Note: If at any point while working with your PivotTable, your PivotTable
Fields list disappears, you can make it reappear by ensuring that your active
cell is within the PivotTable itself. If the Field List still doesn’t reappear,
navigate to the Analyze tab in the Ribbon, and select Field List.

4. Click the > next to each table to show the available fields. If you don’t see your
three tables, click the All option directly below the PivotTable Fields pane title.
5. Drag Sales_Order.Sales_Order_Date to the Columns pane. Note: When you
add a date, Excel will automatically try to group the data by Year, Quarter, and so
on.

a. Remove Sales_Order_Date (Quarter) from the Columns pane.

6. Drag Finished_Goods_Products.Product_Description to the Rows pane.


7. Drag Sales_Order_Lines.Sales_Order_Quantity_Sold to the Values pane.
Note: At this point, a warning will appear asking you to create relationships.
8. Click Auto-Detect… to automatically create relationships in the data model.
a. Click Manage Relationships… to verify that the primary key–foreign key
pairs are correct:
1. Sales_Order_Lines (Product_Code) = Finished_Goods_Products (Product_Code)

2. Sales_Order_Lines (Sales_Order_ID) = Sales_Order (Sales_Order_ID)

b. Take a screenshot (label it 2-2MB) of your Manage Relationships window.


c. Click Close.

9. In the PivotTable, drill down to show the monthly data:


a. Click the + next to 2020.
b. If you see individual sales dates, right-click Jan and choose Expand/Collapse
> Collapse Entire Field.

10. Clean up your PivotTable. Rename labels and the title of the report to something
more useful, like “Sales by Month”.
11. Take a screenshot (label it 2-2MC).
12. When you are finished answering the lab questions, you may close Excel. Save
your file as Lab 2-2 Slainte Pivot.xlsx.

page 83

Tableau | Desktop

1. Open the Lab 2-2 Slainte Model.twb you created in Part 1.


2. Click on Sheet 1.
3. Drag Sales_Order.Sales Order Date to the Columns shelf. Note: When you add
a date, Tableau will automatically try to group the data by Year, Quarter, and so
on.

a. In the Columns pane, drill down on the date to show the quarters and months
[click the + next to YEAR(Sales Order Date) to show the Quarters, etc.].
b. Click QUARTER(Sales Order Date) and choose Remove.

4. Drag Finished_Goods_Products.Product Description to the Rows shelf.


5. Drag Sales_Order_Lines.Sales Order Quantity Sold to the Text button in the
Marks shelf.
6. To show the totals, click the Analytics tab next to the Data pane and double-click
Totals.
7. Clean up your sheet. Right-click the Sheet 1 tab at the bottom of the screen and
choose Rename. Name the tab something more useful, like “Sales by Month”
and press Enter.
8. Take a screenshot (label it 2-2TB).
9. When you are finished answering the lab questions you may close Tableau
Desktop. Save your file as Lab 2-2 Slainte Pivot.twb.

Lab 2-2 Part 2 Objective Questions (LO 2-3)


OQ 1. What was the total sales volume for Imperial Stout in January 2020?
OQ 2. What was the total sales volume for all products in January 2020?
OQ 3. Which product is experiencing the worst sales performance in
January 2020?

Lab 2-2 Part 2 Analysis Questions (LO 2-3)


Now that you’ve completed a basic analysis to answer management’s question, take a
moment to think about how you could improve the report and anticipate questions your
manager might have.

AQ 1. If the owner of Sláinte wishes to identify which product sold the


most, how would you make this report more useful?
AQ 2. If you wanted to provide more detail, what other attributes would be
useful to add as additional rows or columns to your report, or what
other reports would you create?
AQ 3. Write a brief paragraph about how you would interpret the results of
your analysis in plain English. For example, which data points stand
out?
AQ 4. In Chapter 4, we’ll discuss some visualization techniques. Describe
a way you could present these data as a chart or graph.
Lab 2-2 Submit Your Screenshot Lab Document
Verify that you have answered any questions your instructor has assigned, then upload
your screenshot lab document to Connect or to the location indicated by your instructor.

page 84

Lab 2-3 Resolve Common Data Problems—


LendingClub

Lab Note: The tools presented in this lab periodically change. Updated instructions, if
applicable, can be found in the eBook and lab walkthrough videos in Connect.
Case Summary: LendingClub is a peer-to-peer marketplace where borrowers and
investors are matched together. The goal of LendingClub is to reduce the costs associated
with these banking transactions and make borrowing less expensive and investment more
engaging. LendingClub provides data on loans that have been approved and rejected
since 2007, including the assigned interest rate and type of loan. This provides several
opportunities for data analysis. There are several issues with this dataset that you have
been asked to resolve before you can process the data. This will require you to perform
some cleaning, reformatting, and other transformation techniques.
Data: Lab 2-3 Lending Club Approve Stats.zip - 120MB Zip / 120MB Excel

Lab 2-3 Example Output


By the end of this lab, you will clean data to prepare them for analysis. While your results
will include different data values, your work should look similar to this:

Microsoft | Excel + Power Query


Microsoft Excel

LAB 2-3M Example of Cleaned Data in Microsoft Excel

page 85

Tableau | Prep
Tableau Software, Inc. All rights reserved.

LAB 2-3T Example of Cleaned Data in Tableau Prep


Lab 2-3 Part 1 Identify Relevant Attributes
Before you begin the lab, you should create a new blank Word document where you will
record your screenshots and save it as Lab 2-3 [Your name] [Your email address].docx.
You’ve already identified some analysis questions for LendingClub in Chapter 1.
Here, you’ll focus on data quality. Think about some of the common issues with data you
receive from other people. For example, is the date field in the proper format? Do number
fields contain text or vice versa?
The LendingClub collects different sets of data, including LoanStats for approved
loans and RejectStats for rejected loans. There are significantly more data available for
LoanStats. There are 145 different attributes. To save some time, we’ve identified 20 of
the most interesting in Lab Exhibit 2-3A.

LAB EXHIBIT 2-3A


Source: LendingClub

Attribute Description
loan_amnt Requested loan amount
term Length of the loan in months
int_rate Interest rate of the loan
grade Quality of the loan: e.g. A, B, C

emp_length Employment length

page 86

home_ownership Whether the borrower rents or owns a home


annual_inc Annual income
issue_d Date of loan issue
loan_status Fully paid or charged off
title Loan purpose
zip_code The first three digits of the applicant’s zip code

addr_state State
dti Debt-to-income ratio
delinq_2y Late payments within the past 2 years
earliest_cr_line Oldest credit account
open_acc Number of open credit accounts
revol_bal Total balance of all credit accounts
revol_util Percentage of available credit in use
total_acc Total number of credit accounts

application_type Individual or joint application

Lab 2-3 Part 1 Objective Questions (LO 2-3)


OQ 1. Which attributes would you expect to contain date values?
OQ 2. Which attributes would you expect to contain text values?
OQ 3. Which attributes would you expect to contain numerical values?
OQ 4. Which attribute most directly impacts a borrower’s cost of capital?

Lab 2-3 Part 1 Analysis Questions (LO 2-3)


AQ 1. What do you expect will be major data quality issues with
LendingClub’s data?
AQ 2. Given this list of attributes, what types of questions do you think
you could answer regarding approved loans? (If you worked through
Lab 1-2, what concerns do you have with the data’s ability to predict
answers to the questions you identified in Chapter 1)?
Lab 2-3 Part 2 Transform and Clean the Data
Let’s identify some issues with the LendingClub data that will make analysis difficult:
There are many attributes without any data, and that may not be necessary.
The int_rate values may be recorded as percentages (##.##%), but analysis will require
decimals (#.####).
The term values include the word months, which should be removed for numerical
analysis.
The emp_length values include “n/a”, “<”, “+”, “year”, and “years”—all of which
should be removed for numerical analysis.
Dates cause issues in general because different systems use different date formats (e.g.,
1/9/2009, Jan-2009, 9/1/2009 for European dates, etc.), so typically some conversion is
necessary.

Note: When you use Power Query or a Tableau Prep flow, you create a set of steps that
will be used to transform the data. When you receive new data, you can run those through
those same steps (or flows) without having to recreate them each time.
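To make the intended result of these transformations concrete, here is a minimal Python/pandas sketch of the same cleaning logic. It is not part of the lab steps that follow, and the sample rows are invented to mimic the formatting issues listed above (the real LendingClub columns may differ slightly).

import pandas as pd

# Invented rows that mimic the formatting issues in the LendingClub extract
loans = pd.DataFrame({
    "term":       ["36 months", "60 months"],
    "int_rate":   ["13.25%", "9.80%"],
    "emp_length": ["10+ years", "< 1 year"],
    "issue_d":    ["Jan-2014", "Feb-2014"],
})

# term: remove the word "months" and keep the number
loans["term"] = loans["term"].str.replace(" months", "", regex=False).astype(int)

# int_rate: drop the percent sign and convert ##.##% to a decimal (#.####)
loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype(float) / 100

# emp_length: map "n/a" and "< 1 year" to 0, then strip "+", "year(s)", and spaces
loans["emp_length"] = (loans["emp_length"]
                       .replace({"n/a": "0", "< 1 year": "0"})
                       .str.replace(r"[+a-zA-Z\s]", "", regex=True)
                       .astype(int))

# issue_d: convert the text date into a true date type
loans["issue_d"] = pd.to_datetime(loans["issue_d"], format="%b-%Y")

print(loans)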

page 87

Microsoft | Excel + Power Query

1. Open a new blank workbook in Excel.


2. In the Data ribbon, click Get Data > From File > From Workbook.
3. Locate the Lab 2-3 Lending Club Approve Stats.xlsx file on your computer
and click Import (this is a large file, so it may take a few minutes to load).
4. Choose LoanStats3c and click Transform Data or Edit. Notice that all of the
column headers are incorrect.
First we have to fix the column headers and remove unwanted data.
5. In the Transform tab, click Use First Row as Headers to assign the correct
column titles.
6. Right-click the headers of any attribute that is not in the following list, and click
Remove. Hint: Once you get to initial_list_status, click that column header then
scroll to the right until you reach the end of the columns and Shift + Click the
last column (settlement_term). Then right-click and remove columns.

a. loan_amnt
b. term
c. int_rate
d. grade
e. emp_length
f. home_ownership
g. annual_inc
h. issue_d
i. loan_status
j. title
k. zip_code
l. addr_state
m. dti
n. delinq_2y
o. earliest_cr_line
p. open_acc
q. revol_bal
r. revol_util
s. total_acc
7. Take a screenshot (label it 2-3MA) of your reduced columns.
Next, remove text values from numerical values and replace values so we can do
calculations and summarize the data. These extraneous text values include
months, <1, n/a, +, and years:

page 88

8. Select the term column.


a. In the Transform tab, click Replace Values.
1. In the Value to Find box, type “months” with a space as the first character
(do not include the quotation marks).
2. Leave the Replace With box blank.
3. Click OK.

9. Select the emp_length column.


a. In the Transform tab, click Replace Values.
1. In the Value to Find box, type “years” with a space as the first character.
2. Leave the Replace With box blank.
3. Click OK.

b. In the Transform tab, click Replace Values.


1. In the Value to Find box, type “year” with a space as the first character.
2. Leave the Replace With box blank.
3. Click OK.

c. In the Transform tab, click Replace Values.


1. In the Value to Find box, type “<1” with a space between the two characters.
2. In the Replace With box, type “0”.
3. Click OK.

d. In the Transform tab, click Replace Values.


1. In the Value to Find box, type “n/a”.
2. In the Replace With box, type “0”.
3. Click OK.
e. In the Transform tab, click Extract > Text Before Delimiter.
1. In the Value to Find box, type “+”.
2. Click OK.

f. In the Transform tab, click Extract > Text Before Delimiter.


1. In the Value to Find box, type “ ” (a single space).
2. Click OK.

g. In the Transform tab, click Data Type > Whole Number.


10. Take a screenshot (label it 2-3MB) of your cleaned data file, showing the
term and emp_length columns.
11. Click the Home tab in the ribbon and then click Close & Load. It will take a
minute to clean the entire data file.
12. When you are finished answering the lab questions, you may close Excel. Save
your file as Lab 2-3 Lending Club Transform.xlsx.

page 89

Tableau | Prep
Lab Note: Tableau Prep takes extra time to process large datasets.
1. Open Tableau Prep Builder.
2. Click Connect to Data > To a File > Microsoft Excel.
3. Locate the Lab 2-3 Lending Club Approve Stats.xlsx file on your computer
and click Open.
4. Drag LoanStats3c to your flow. Notice that all of the Field Names are incorrect.
First we have to fix the column headers and remove unwanted data.
5. Check Use Data Interpreter in the pane on the left to automatically fix the Field
Names.
6. Uncheck the box next to any attribute that is NOT in the following list to remove
it from our analysis. Hint: Once you get to initial_list_status, all of the remaining
fields can be removed.

a. loan_amnt
b. term
c. int_rate
d. grade
e. emp_length
f. home_ownership
g. annual_inc
h. issue_d
i. loan_status
j. title
k. zip_code
l. addr_state
m. dti
n. delinq_2y
o. earliest_cr_line
p. open_acc
q. revol_bal
r. revol_util
s. total_acc

7. Take a screenshot (label it 2-3TA) of your corrected and reduced list of Field
Names.
Next, remove text values from numerical values and replace values so we can do
calculations and summarize the data. These extraneous text values include
months, <1, n/a, +, and years:

page 90

8. Click the + next to LoanStats3c in the flow and choose Add Clean Step. It may
take a minute or two to load.
9. An Input step will appear in the top half of the workspace, and the details of that
step are in the bottom of the workspace in the Input Pane. Every flow requires at
least one Input step at the beginning of the flow.
10. In the Input Pane, you can further limit which fields you bring into Tableau Prep,
as well as see details about each field, including:

a. Type: this indicates the data type of each field (for example, numeric, date, or
short text).
b. Linked Keys: this indicates whether or not the field is a primary or a foreign
key.
c. Sample Values: provides a few example values from that field so you can see
how the data are formatted.
11. In the term pane:
a. Right-click the header or click the three dots and choose Clean > Remove
Letters.
b. Click the Data Type (Abc) button in the top-left corner and change the data
type to Number (Whole).

12. In the emp_length pane:


a. Right-click the header or click the three dots and choose Group Values >
Manual Selection.

1. Double-click <1 year in the list and type “0” to replace those values with 0.
2. Double-click n/a in the list and type “0” to replace those values with 0.
3. While you are in the Group Values window, you could quickly replace all of
the year values with single numbers (e.g., 10+ years becomes “10”) or you
can move to the next step to remove extra characters.
4. Click Done.

b. If you didn’t remove the “years” text in the previous step, right-click the
emp_length header or click the three dots and choose Clean > Remove
Letters and then Clean > Remove All Spaces.
c. Finally, click the Data Type (Abc) button in the top-left corner and change the
data type to Number (Whole).

13. In the flow pane, right-click Clean 1 and choose Rename and name the step
“Remove text”.
14. Take a screenshot (label it 2-3TB) of your cleaned data file, showing the
term and emp_length columns.
15. Click the + next to your Remove text task and choose Output.
16. In the Output pane, click Browse:
a. Navigate to your preferred location to save the file.
b. Name your file Lab 2-3 Lending Club Transform.hyper.
c. Click Accept.

page 91
17. Click Run Flow. When it is finished processing, click Done.
18. When you are finished answering the lab questions you may close Tableau Prep.
Save your file as Lab 2-3 Lending Club Transform.tfl.

Lab 2-3 Part 2 Objective Questions (LO 2-3)


OQ 1. How many records or rows appear in your cleaned dataset?
OQ 2. How many attributes or columns appear in your cleaned dataset?

Lab 2-3 Part 2 Analysis Questions (LO 2-3)


AQ 1. Why do you think it is important to remove text values from your
data before you conduct your analysis?
AQ 2. What do you think would happen in your analysis if you didn’t
remove the text values?
AQ 3. Did you run into any major issues when you attempted to clean the
data? How did you resolve those?
AQ 4. What are some steps you could take to clean the data and resolve the
difficulties you identified?

Lab 2-3 Submit Your Screenshot Lab Document


Verify that you have answered any questions your instructor has assigned, then upload
your screenshot lab document to Connect or to the location indicated by your instructor.

Lab 2-4 Generate Summary Statistics—


LendingClub

Lab Note: The tools presented in this lab periodically change. Updated instructions, if
applicable, can be found in the eBook and lab walkthrough videos in Connect.
Case Summary: When you’re working with a new or unknown set of data, validating
the data is very important. When you make a data request, the IT manager who fills the
request should also provide some summary statistics that include the total number of
records and mathematical sums to ensure nothing has been lost in the transmission. This
lab will help you calculate summary statistics in Power BI and Tableau Desktop.
Data: Lab 2-4 Lending Club Transform.zip - 29MB Zip / 26MB Excel / 6MB Tableau

Lab 2-4 Example Output


By the end of this lab, you will explore summary statistics. While your results will include
different data values, your work should look similar to this:

page 92

Microsoft | Power BI Desktop

Microsoft Excel

LAB 2-4M Example of Data Distributions in Microsoft Power Query

Tableau | Desktop
Tableau Software, Inc. All rights reserved.

LAB 2-4T Example of Data Distributions in Tableau Desktop

Lab 2-4 Calculate Summary Statistics


Before you begin the lab, you should create a new blank Word document where you will
record your screenshots and save it as Lab 2-4 [Your name] [Your email address].docx.

page 93
In this part we are interested in understanding more about the loan amounts, interest
rates, and annual income by looking at their summary statistics. This process can be used
for data validation and later for outlier detection.
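As a point of comparison only, the same summary statistics could be produced with a few lines of Python/pandas. The file and sheet names below follow this lab’s naming and are assumptions; the lab itself uses Power BI or Tableau Desktop.

import pandas as pd

# Load the transformed LendingClub data (file and sheet names assumed from the lab)
loans = pd.read_excel("Lab 2-4 Lending Club Transform.xlsx", sheet_name="LoanStats3c")

# Keep only Pennsylvania borrowers, as the lab instructions do
pa_loans = loans[loans["addr_state"] == "PA"]

# Count, sum, mean, min, max, and standard deviation for the three columns of interest
stats = pa_loans[["loan_amnt", "int_rate", "annual_inc"]].agg(
    ["count", "sum", "mean", "min", "max", "std"])
print(stats)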

Microsoft | Power BI Desktop


Lab Note: These instructions can also be performed in the Power Query included with Excel 365.

1. Open a new workbook in Power BI Desktop.


2. Click the Home tab in the ribbon and choose Get Data > Excel.
3. Navigate to your Lab 2-4 Lending Club Transform.xlsx file and click Open.
4. Check LoanStats3c and click Transform Data or Edit.
5. Click View in the ribbon, then check Column Distribution. A small frequency
distribution graph will appear at the top of each column.
6. Click the loan_amnt column.
7. In the View tab, check Column Profile. You now see summary stats and a
frequency distribution for the selected column.
8. Note: To show profile for the entire data set instead of the top 1,000 values, go to
the bottom of the Power Query Editor window and click the title Column
profiling based on top 1000 rows and change it to Column profiling based on
entire data set.
9. Take a screenshot (label it 2-4MA) of the column statistics and value
distribution.
10. Click the int_rate column and the annual_inc column, noting the count, min,
max, average, and standard deviation of each.
11. Click the drop-down next to the addr_state column and uncheck (Select All).
12. Check PA and click OK to filter the loans.
13. In Power BI, click Home > Close & Apply. Note: You can always return to
Power Query by clicking the Transform button in the Home tab.
14. To show summary statistics in Power BI, go to the visualizations pane and click
Multi-row card.

a. Drag loan_amnt to the Fields box. Click the drop-down menu next to
loan_amnt and choose Sum.
b. Drag loan_amnt to the Fields box below the existing field. Click the drop-
down menu next to the new loan_amnt and choose Average.
c. Drag loan_amnt to the Fields box below the existing field. Click the drop-
down menu next to the new loan_amnt and choose Count.
d. Drag loan_amnt to the Fields box below the existing field. Click the drop-
down menu next to the new loan_amnt and choose Max.

15. Add two new Multi-row cards showing the same values (Sum, Average, Count,
Max) for int_rate and annual_inc.

page 94

16. Take a screenshot (label it 2-4MB) of the column statistics and value
distribution.
17. When you are finished answering the lab questions, you may close Power BI
Desktop. Save your file as Lab 2-4 Lending Club Summary.pbix.
Tableau | Desktop

1. Open Tableau Desktop.


2. Choose Connect > To a File > More…
3. Locate the Lab 2-4 Lending Club Transform.hyper and click Open.
4. Click the Sheet 1 tab.
5. Drag Addr State to the Filters shelf.
6. Click None, then check the box next to PA and click OK.
7. Click the drop-down on the Addr State filter and choose Apply to Worksheets
> All Using This Data Source.
8. Drag Loan Amnt to the Rows shelf.
9. To show each unique loan, you have to disable aggregate measures. From the
menu bar, click Analysis > Aggregate Measures to remove the check mark.
10. To show the summary statistics, go to the menu bar and click Worksheet > Show
Summary. A Summary card appears on the right side of the screen with the
Count, Sum, Average, Minimum, Maximum, and Median values.
11. Take a screenshot (label it 2-4TA).
12. Create two new sheets and repeat steps 8–10 for Int Rate and Annual Inc, noting
the count, sum, average, minimum, maximum, and median of each.
13. When you are finished answering the lab questions, you may close Tableau
Desktop. Save your file as Lab 2-4 Lending Club Summary.twb.

Lab 2-4 Objective Questions (LO 2-3)


OQ 1. What is the maximum loan amount that was approved for borrowers
from PA?
OQ 2. What is the average interest rate assigned to a loan to an approved
borrower from PA?
OQ 3. What is the average annual income of an approved borrower from
PA?
Lab 2-4 Analysis Questions (LO 2-3)
AQ 1. Compare the loan amounts to the validation given by LendingClub
for borrowers from PA: Funded loans: $123,262.53; Number of
approved loans: 8,427. Do the numbers in your analysis match the
numbers provided by LendingClub? What explains the discrepancy,
if any?
AQ 2. Does the Numerical Count provide a more useful/accurate value for
validating your data? Why or why not?
AQ 3. Compare and contrast: Why do Power Query and Tableau
Desktop return different values for their summary statistics?
page 95
AQ 4. Compare and contrast: What are some of the summary
statistics measures that are unique to Power Query? To Tableau
Desktop?

Lab 2-4 Submit Your Screenshot Lab Document


Verify that you have answered any questions your instructor has assigned, then upload
your screenshot lab document to Connect or to the location indicated by your instructor.

Lab 2-5 Validate and Transform Data—College


Scorecard

Lab Note: The tools presented in this lab periodically change. Updated instructions, if
applicable, can be found in the eBook and lab walkthrough videos in Connect.
Case Summary: Your college admissions department is interested in determining the
likelihood that a new student will complete their 4-year program. They have tasked you
with analyzing data from the U.S. Department of Education to identify some variables that
may be predictive of the completion rate. The data used in this lab are a subset of the
College Scorecard dataset that is provided by the U.S. Department of Education. These
data provide federal financial aid and earnings information, insights into the performance
of schools eligible to receive federal financial aid, and the outcomes of students at those
schools.
Data: Lab 2-5 College Scorecard Dataset.zip - 0.5MB Zip / 1.4MB Txt

Lab 2-5 Example Output


By the end of this lab, you will have validated and transformed the College Scorecard
data. While your results will include different data values, your work should look similar
to this:

Microsoft | Excel + Power Query

Microsoft Excel

LAB 2-5M Example of Cleaned College Scorecard Data in Microsoft Excel


Tableau | Prep + Desktop


Tableau Software, Inc. All rights reserved.

LAB 2-5T Example of Cleaned College Scorecard Data in Tableau Prep

Lab 2-5 Load and Clean Data


Before you begin the lab, you should create a new blank Word document where you will
record your screenshots and save it as Lab 2-5 [Your name] [Your email address].docx.
Working with raw data can present interesting challenges, especially when it comes to
identifying attributes and data types. In this lab you will learn how to transform the raw
data into data models that are ready for analysis.

Microsoft | Excel + Power Query

1. Open a new blank spreadsheet in Excel.


2. From the Data tab in the ribbon, click Get Data > From File > From Text/CSV.
3. Navigate to your Lab 2-5 College Scorecard Dataset.txt file and click Open.
4. Verify that the data loaded correctly into tables and rows and then click
Transform Data or Edit.
5. Click through each of the 30 columns and from the Transform tab in the ribbon,
click Data Type > Whole Number or Data Type > Decimal Number where
appropriate. If prompted, click Replace Current. Because the original text file
replaced empty values with “NULL”, Power Query erroneously detected many of
the columns as Text. Hint: Hold the Ctrl key and click to select multiple columns.


6. Take a screenshot (label it 2-5MA) of your columns with the proper data
types.
7. From the Home tab, click Close & Load.
8. To ensure that you captured all of the data through the extraction from the txt file,
we need to validate them:

a. In the Queries & Connections pane, verify that there are 7,703 rows loaded.
b. Compare the attribute names (column headers) to the attributes listed in the data
dictionary (found in Appendix K of the textbook). There should be 30 columns
(the last column in Excel should be AD).
c. Click Column H for the SAT_AVG attribute. In the summary statistics at the
bottom of your worksheet, the overall average SAT score should be 1,059.07.

9. Take a screenshot (label it 2-5MB) of your data table in Excel.


10. When you are finished answering the lab questions, you may close Excel. Save
your file as Lab 2-5 College Scorecard Transform.xlsx. Your data are now
ready for the test plan. This lab will continue in Lab 3-3.

Tableau | Prep + Desktop

1. Open a new flow in Tableau Prep Builder.


2. Click Connect to Data > To a File > Text file.
3. Navigate to your Lab 2-5 College Scorecard Dataset.txt file and click Open.
4. Verify that the data type for each of the 30 attributes is detected as a Number,
with the exception of INSTNM, CITY, and STABBR.
5. Take a screenshot (label it 2-5TA).
6. In the flow, click the + next to your Lab 2-5 College Scorecard Dataset and
choose Add > Clean Step.
7. Review the data and click the lightbulb icon in the CITY and STABBR
attributes to change the data roles to City and State/Province, respectively.
8. Click the + next to your Clean 1 task and choose Output.
9. In the Output pane, click Browse:
a. Navigate to your preferred location to save the file.
b. Name your file Lab 2-5 College Scorecard Transform.hyper.
c. Click Accept.

10. Click Run Flow. When it is finished processing, click Done.


11. Close Tableau Prep Builder. Save your file as Lab 2-5 College Scorecard
Transform.tfl.
12. Open Tableau Desktop.
13. Choose Connect > To a File > More…
14. Locate the Lab 2-5 College Scorecard Transform.hyper and click Open.
15. Click the Sheet 1 tab.


16. From the menu bar, click Analysis > Aggregate Measures to remove the check
mark. To show each unique entry, you have to disable aggregate measures.
17. To show the summary statistics, go to the menu bar and click Worksheet > Show
Summary. A Summary card appears on the right side of the screen with the
Count, Sum, Average, Minimum, Maximum, and Median values.
18. Drag Unitid to the Rows shelf and note the summary statistics.
19. Take a screenshot (label it 2-5TB) of the Unitid stats in your worksheet.
20. Create two new sheets and repeat steps 16–18 for Sat Avg and C150 4, noting
the count, sum, average, minimum, maximum, and median of each.
21. When you are finished answering the lab questions, you may close Tableau
Desktop. Save your file as Lab 2-5 College Scorecard Transform.twb. Your
data are now ready for the test plan. This lab will continue in Lab 3-3.

Lab 2-5 Objective Questions (LO 2-3)


OQ 1. How many schools report average SAT scores?
OQ 2. What is the average completion rate (C150 4) of all the schools?
OQ 3. How many schools report data to the U.S. Department of
Education?

Lab 2-5 Analysis Questions (LO 2-3)


AQ 1. In the checksums, you validated that the average SAT score for all
of the records is 1,059.07. When we work with the data more
rigorously, several tests will require us to transform NULL or blank
values. If you were to transform the NULL SAT values into 0, what
would happen to the average (would it stay the same, decrease, or
increase)?
AQ 2. How would that change to the average affect the way you would
interpret the data?
AQ 3. What would happen if we excluded all schools that don’t report an
average SAT score?

Lab 2-5 Submit Your Screenshot Lab Document


Verify that you have answered any questions your instructor has assigned, then upload
your screenshot lab document to Connect or to the location indicated by your instructor.

Lab 2-6 Comprehensive Case: Build Relationships


among Database Tables—Dillard’s

Lab Note: The tools presented in this lab periodically change. Updated instructions, if
applicable, can be found in the eBook and lab walkthrough videos in Connect.
Case Summary: You are a brand-new analyst and you just got assigned to work on the
Dillard’s account. You were provided an ER Diagram (available in Appendix J), but you
still aren’t sure what all of the different tables and fields represent. Before diving into
problem solving or even transforming the data to prepare them for analysis, it is important
to gain an understanding of what data are available to you. One of the steps in doing so is
connecting to the database and analyzing the way the tables relate.
Data: Dillard’s sales data are available only on the University of Arkansas Remote
Desktop (waltonlab.uark.edu). See your instructor for login credentials.

Lab 2-6 Example Output


By the end of this lab, you will explore how to define relationships between tables from
Dillard’s sales data. While your results will include different data values, your work
should look similar to this:

Microsoft | Power BI Desktop

Microsoft Excel

LAB 2-6M Example of Dillard’s Data Model in Microsoft Power BI

Tableau | Desktop
Tableau Software, Inc. All rights reserved.

LAB 2-6T Example of Dillard’s Data Model in Tableau Desktop


Lab 2-6 Build Relationships between Tables


Before you begin the lab, you should create a new blank Word document where you will
record your screenshots and save it as Lab 2-6 [Your name] [Your email address].docx.
Before you can analyze the data, you must first define the relationships that show how
the different tables are connected. Most tools will automatically detect primary key–
foreign key relationships, but you should always double-check to make sure your data
model is accurate.
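If it helps to see these links expressed in query form, the short SQL sketch below joins the Dillard's TRANSACT and STORE tables on the same key pair that Power BI and Tableau detect automatically. It is an optional illustration rather than a lab step; the table and column names are taken from the Dillard's labs in this chapter and the ER diagram in Appendix J, and the specific columns selected are only an example.

SELECT TOP 10
    T.TRANSACTION_ID,      -- primary key of the TRANSACT table
    T.CUST_ID,             -- foreign key pointing to CUSTOMER
    T.SKU,                 -- foreign key pointing to SKU
    S.STORE,               -- primary key of the STORE table
    S.STATE
FROM TRANSACT AS T
INNER JOIN STORE AS S
    ON T.STORE = S.STORE;  -- the key pair the tools highlight in the model view

If a tool draws a relationship on any other pair of columns, that is exactly the kind of mismatch you should correct before analyzing the data.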

Microsoft | Power BI Desktop

1. Open Power BI Desktop.


2. In the Home ribbon, click Get Data > SQL Server.
3. Enter the following and click OK (keep in mind that SQL Server is not just one
database; it is a collection of databases, so it is critical to indicate both the server path
and the specific database):

a. Server: essql1.walton.uark.edu
b. Database: WCOB_Dillards
c. Data Connectivity: DirectQuery
4. If prompted to enter credentials, you can keep the default to “Use my current
credentials” and click Connect.
5. If prompted with an Encryption Support warning, click OK to move past it.
6. Take a screenshot (label it 2-6MA) of the navigator window.
Learn about Power BI!
There are two ways to connect to data, either Import or DirectQuery. There are
pros and cons for each, and it will always depend on a few factors, including the
size of the dataset and the type of analysis you intend to do.
Import: Will pull in all data at once. This can take a long time, but once they are
imported, your analysis can be more efficient if you know that you plan to use
each piece of data that you import. This is also beneficial for some of the
analyses you will learn about in future chapters, such as clustering.
DirectQuery: Only creates a connection to the data. This is more efficient if you
are exploring all of the tables in a large database and are comfortable working
with only a sample of data. Note: Unless directed otherwise, you should always
use DirectQuery with Dillard’s data to prevent the remote desktop from running
out of storage space.
7. Place a check mark next to each of the following tables and click Load:
a. Customer, Department, SKU, SKU_Store, Store, Transact
8. Click the Model button (the icon with three connected boxes) in the toolbar on
the left to view the tables and relationships and note the following:

a. All the tables that you selected should appear in the Modeling tab with table
names, attributes, and relationships.


b. When you hover over any of the relationships, the keys that are common
between the two tables highlight.

1. Something important to consider is that in the raw data, the primary key is
typically the first attribute listed. In this Power BI modeling window, the
attributes have been re-ordered to appear in alphabetical order. For example,
SKU is the primary key of the SKU table, and it exists in the Transact table
as a foreign key.
9. Take a screenshot (label it 2-6MB) of the All tables sheet.
10. When you are finished answering the lab questions, you may close Power BI
Desktop. Save your file as Lab 2-6 Dillard’s Diagram.pbix.
Note: While it may seem easier and faster to rely on the automatically created
data model in Power BI, you should review the table relationships to make sure
the appropriate keys match.

Tableau | Desktop

1. Open Tableau Desktop.


2. Go to Connect > To a Server > Microsoft SQL Server.
3. Enter the following and click Sign In:
a. Server: essql1.walton.uark.edu
b. Database: WCOB_DILLARDS

4. Take a screenshot (label it 2-6TA) of the blank Data Source tab.


In Tableau, you connect to individual tables and build relationships one at a time.
We will build each relationship, but we need to start with one table.
5. Double-click the TRANSACT table from the list on the left to add it to the top
pane.

a. The field names will appear in the data grid section on the bottom of the screen,
but the data themselves will not automatically load. If you click Update Now,
you can get a preview of the data held in the Transact table. You can do some
light data transformation at this point, but if your data transformation needs are
heavy, it would be better to perform that transformation in Tableau Prep before
bringing the data into Tableau Desktop.

6. Double-click the CUSTOMER table to add it to the data model in the top pane.
a. In the Edit Relationship window that pops up, confirm that the appropriate keys
are identified (Cust ID and Cust ID) and close the window.
7. Double-click each of the remaining tables that relate directly to the Transact table
from the list on the left:

a. SKU, STORE, DEPARTMENT


b. Note that DEPARTMENT will join with the SKU table, not TRANSACT.


8. Finally, double-click the SKU_STORE table from the list on the left.
a. The SKU_Store table is related to both the SKU and the Store tables, but
Tableau will likely default to connecting it to the Transact table, resulting in a
broken relationship.
b. To fix the relationship,
1. Close the Edit Relationships window without making changes.
2. Right-click SKU_STORE in the top pane and choose Move to > SKU.
3. Verify the related keys and close the Edit Relationships window.
4. Note: It is not necessary to also relate the SKU_Store table to the Store table
in Tableau; that is only a database requirement.

9. Take a screenshot (label it 2-6TB) of the Data Source tab.


10. When you are finished answering the lab questions, you may close Tableau
Desktop. Save your file as Lab 2-6 Dillard’s Diagram.twb.

Lab 2-6 Objective Questions (LO 2-2)


OQ 1. How many tables relate directly to the TRANSACT table?
OQ 2. Which table does the DEPARTMENT table relate to?
OQ 3. What is the name of the key that relates the TRANSACT and
CUSTOMER tables?

Lab 2-6 Analysis Questions (LO 2-2)


AQ 1. How would a view of the entire database or certain tables out of that
database allow us to get a feel for the data?
AQ 2. What types of data would you guess that Dillard’s, a retail store,
gathers that might be useful beyond the scope of the sales data
available on the remote desktop? How could Dillard’s suppliers use
these data to predict future purchases?
AQ 3. Compare and Contrast: Compare the methods for connecting to
data in Tableau versus Power BI. Which is more intuitive?
AQ 4. Compare and Contrast: Compare the methods for viewing (and
creating) relationships in Tableau versus Power BI. Which is easier
to work with? Which provides more insight and flexibility?

Lab 2-6 Submit Your Screenshot Lab Document


Verify that you have answered any questions your instructor has assigned, then upload
your screenshot lab document to Connect or to the location indicated by your instructor.


Lab 2-7 Comprehensive Case: Preview Data from


Tables—Dillard’s

Lab Note: The tools presented in this lab periodically change. Updated instructions, if
applicable, can be found in the eBook and lab walkthrough videos in Connect.
Case Summary: You are a brand-new analyst and you just got assigned to work on the
Dillard’s account. After analyzing the ER Diagram to gain a bird’s-eye view of all the
different tables and fields in the database, you are ready to further explore the data in each
table and how the fields are formatted. In particular, you will connect to the Dillard’s
database using Tableau Prep or Microsoft Power BI, and you will explore the data types, the
primary and foreign keys, and previews of individual tables.
In Lab 2-6, the Tableau Track had you focus on Tableau Desktop. In this lab, you will
connect to Tableau Prep instead. Tableau Desktop showcases the table relationships
more quickly, but Tableau Prep makes it easier to preview and clean the data prior to analysis.
Data: Dillard’s sales data are available only on the University of Arkansas Remote
Desktop (waltonlab.uark.edu). See your instructor for login credentials.

Lab 2-7 Example Output


By the end of this lab, you will explore Dillard’s data and generate summary statistics.
While your results will include different data values, your work should look similar to
this:

Microsoft | Power BI Desktop

Microsoft Excel

LAB 2-7M Example of Summary Statistics in Microsoft Power BI


Tableau | Prep
Tableau Software, Inc. All rights reserved.

LAB 2-7T Example of Summary Statistics in Tableau Prep


Lab 2-7 Part 1 Preview Dillard’s Attributes and Data
Types
Before you begin the lab, you should create a new blank Word document where you will
record your screenshots and save it as Lab 2-7 [Your name] [Your email address].docx.
In this part of the lab you will load the data and explore the available attributes and
verify that the data types are correctly assigned.
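The same attribute names and data types that the tools display can also be read directly from the database's catalog views. The query below is an optional illustration, not a lab step; it assumes you are connected to the WCOB_DILLARDS database on the remote desktop.

SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'TRANSACT'
ORDER BY ORDINAL_POSITION;   -- lists the attributes in the order they appear in the table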

Microsoft | Power BI Desktop

1. Create a new project in Power BI Desktop.


2. In the Home ribbon, click Get Data > SQL Server database.
3. Enter the following and click OK (keep in mind that SQL Server is not just one
database; it is a collection of databases, so it is critical to indicate both the server path
and the specific database):

a. Server: essql1.walton.uark.edu
b. Database: WCOB_DILLARDS
c. Data Connectivity: DirectQuery
4. If prompted to enter credentials, keep the default of “Use my current credentials”
and click Connect.
5. If prompted with an Encryption Support warning, click OK to move past it.


6. In the Navigator window, click the TRANSACT table.


a. It may take a few moments for the data to load, but once they do, you will see a
preview of the data to the right. Scroll through the data to get a feel for the data
stored in the Transact table.
b. Unlike in Tableau Prep, we cannot limit which attributes we pull in at this point in
the process. However, the preview pane to the right shows the first 27 records
in the table that you have selected. This gives you an idea of the different data
types and examples of the data stored in the table.
c. Scroll to the right in the preview pane to see several fields that do not have
preview data; instead each record is marked with the term Value. These fields
are indicating the tables that are related to the Transact table. In this instance,
they are CUSTOMER, SKU(SKU), and STORE(STORE).

7. Click the CUSTOMER table to preview that table’s data.


8. Take a screenshot (label it 2-7MA).
9. Answer the lab questions and continue to Part 2.

Tableau | Prep

1. Create a new flow in Tableau Prep.


2. Click Connect to Data.
3. Choose Microsoft SQL Server in the Connect list.
4. Enter the following and click Sign In:
a. Server: essql1.walton.uark.edu
b. Database: WCOB_DILLARDS

5. Double-click TRANSACT to add the table to your flow.


a. An Input step will appear in the top half of the workspace, and the details of
that step are in the bottom of the workspace in the Input Pane. Every flow
requires at least one Input step at the beginning of the flow.
b. In the Input Pane, you can further limit which fields you bring into Tableau
Prep, as well as seeing details about each field including:

1. Type: this indicates the data type of each field (for example, numeric, date,
or short text).
2. Linked Keys: this indicates whether or not the field is a primary or a foreign
key. In the Transact table, we can see that the Transaction_ID is the primary
key, and that there are three foreign keys in this table: Store, Cust_ID, and
SKU.
3. Sample Values: provides a few example values from that field so you can see
how the data are formatted.

6. Double-click the CUSTOMER table to add a new Input step to your flow.
7. Take a screenshot (label it 2-7TA).
8. Answer the lab questions and continue to Part 2.


Lab 2-7 Part 1 Objective Questions (LO 2-2, 2-3)


OQ 1. What is the primary key for the CUSTOMER table?
OQ 2. What is the primary key for the SKU table?
OQ 3. Which tables are related to the Customer table? (Hint: Do not forget
the foreign keys that you discovered in the Transact table.)
Lab 2-7 Part 2 Explore Dillard’s Data More In-Depth
In this part of the lab, you will explore the summary statistics to understand the properties
of the different attributes, such as the count and average.
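The profiling panes in both tools summarize only a sample of rows. As a point of comparison, a query along the lines of the sketch below (an illustration only, not a required step) would compute the same kind of summary statistics for TRAN_AMT over the full TRANSACT table, grouped by transaction type.

SELECT
    TRAN_TYPE,
    COUNT(*)      AS NUM_TRANSACTIONS,
    AVG(TRAN_AMT) AS AVG_AMOUNT,
    MIN(TRAN_AMT) AS MIN_AMOUNT,
    MAX(TRAN_AMT) AS MAX_AMOUNT
FROM TRANSACT
GROUP BY TRAN_TYPE;   -- one summary row per transaction type (for example, P and R)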

Microsoft | Power BI Desktop

1. Place a check mark in the TRANSACT and CUSTOMER tables in the Navigator
window.
2. Click Transform Data.
a. This will open a new window for the Power Query Editor (this is the same
interface that you will encounter in Excel’s Get & Transform).
b. On the left side of the Power Query Editor, you can click through the different
queries to see previews of each table’s data. Similar to Tableau Prep, you are
provided only a sample of the dataset.
c. Click the Transact query to preview the data from the Transact table.
d. Scroll the main view to the right to see more of the attributes.

3. Power Query does not default to providing data profiling information the way
Tableau Prep’s Clean step does, but we can activate those options.
4. Click the View tab and place check marks in the Column Distribution and
Column Profile boxes.

a. Column distribution: Provides thumbnails of each column's distribution above the first row of data. However, it is limited to only the thumbnail—you cannot
hover over bars in the distribution charts to gain additional details or filter the
data.
b. Column profile: When you select a column, it will provide a more detailed
glimpse into the distribution of that particular column. You can click into a bar
in the distribution to filter the records based on that criteria. This will also cause
the Column distribution thumbnails to adjust.
c. Again—caution! The distributions and profiling are based on the top 1,000
rows from the table you have connected to.
5. Some of the attributes are straightforward in what they represent, but others
aren’t as clear. For instance, you may be curious about what TRAN_TYPE
represents.
6. Filter the purchases and returns:
a. Click the drop-down button next to Tran_Type and filter for just the records
with P values, or click the bar associated with the P values in the
Column profile. Scroll over and look at the results in the Tran_Amt
field and note whether they are positive or negative.
b. Now adjust the filter so that you see only R Tran_Types. Note the values in the
Tran_Amt field again.

7. Take a screenshot (label it 2-7MB).


8. When you are finished answering the lab questions, you may close the Power
Query Editor and Power BI Desktop. Save your file as Lab 2-7 Dillard’s
Data.pbix.

Tableau | Prep

1. Add a new Clean step extending from the TRANSACT table (click the + icon
next to TRANSACT and choose Clean Step from the menu). A phantom step for
View and Clean may already exist. If so, just click that step to add it:

a. The Clean step provides many different options for preparing your data, which
we will get to in future labs. In this lab, you will use it as a means for
familiarizing yourself with the dataset.
b. Beneath the Flow Pane, you can see two new panes: the Profile Pane and the
Data Grid.

1. The Data Grid provides a more robust sample of data values than you were
able to see in the Input Pane from the Input step.
2. The Profile Pane provides summary visualizations of each attribute in the
table. Note: When datasets are large, these summary values are calculated
only from the first several thousand records in the original table, so be
cautious about using these visualizations to drive insights! In this instance,
we can see a good example of this being merely a sample by looking at the
TRAN_DATE visual summary. It shows only dates from 12/30/2013 to
01/27/2014, but we know the dataset has transactions through 2016.

c. Some of the attributes are straightforward in what they represent, but others
aren’t as clear. For instance, you may be curious about what TRAN_TYPE
represents. Look at the data visualization provided for TRAN_TYPE in the
Profile Pane and click P. This will filter the results in the Data Grid.

1. Look at the results in the TRAN_AMT field and note whether they are
positive or negative (you can do so by looking at the data grid or by looking
at the filtered visualization for TRAN_AMT).
2. Adjust the filter so that you see only R transaction types. Note the values in
the Tran_Amt field again.

2. Take a screenshot (label it 2-7TB).


3. When you are finished answering the lab questions, you may close Tableau Prep.
Save your file as Lab 2-7 Dillard’s Data.tfl.


Lab 2-7 Part 2 Objective Questions (LO 2-2, 2-3)


OQ 1. What do you notice about the TRAN_AMT for transactions with
TRAN_TYPE “P”?
OQ 2. What do you notice about the TRAN_AMT for transactions with
TRAN_TYPE “R”?
OQ 3. What do “P” type transactions and “R” type transactions represent?

Lab 2-7 Part 2 Analysis Questions (LO 2-2, 2-3)


AQ 1. Compare and Contrast: Compare the methods for previewing data
in the tables in Tableau Prep versus Microsoft Power BI. Which
method is easier to interact with?
AQ 2. Compare and Contrast: Compare the methods for identifying data
types in each table in Tableau Prep versus Microsoft Power BI.
Which method is easier to interact with?
AQ 3. Compare and Contrast: Compare viewing the data distribution and
filtering data in Tableau Prep’s Clean step to Microsoft Power BI’s
Data Profiling options. Which method is easier to interact with?

Lab 2-7 Submit Your Screenshot Lab Document


Verify that you have answered any questions your instructor has assigned, then upload
your screenshot lab document to Connect or to the location indicated by your instructor.

Lab 2-8 Comprehensive Case: Preview a Subset of


Data in Excel, Tableau Using a SQL Query
—Dillard’s

Lab Note: The tools presented in this lab periodically change. Updated instructions, if
applicable, can be found in the eBook and lab walkthrough videos in Connect.
Case Summary: You are a brand-new analyst and you just got assigned to work on the
Dillard’s account. So far you have analyzed the ER Diagram to gain a bird’s-eye view of
all the different tables and fields in the database, and you have explored the data in each
table to gain a glimpse of sample values from each field and how they are all formatted.
You also gained a little insight into the distribution of sample values across each field, but
at this point you are ready to dig into the data a bit more. In the previous comprehensive
labs, you connected to full tables in Tableau or Power BI to explore the data. In this lab,
instead of connecting to full tables, we will write a SQL query to pull only a subset of data
into Tableau or Excel. This tactic is more effective when the database is very large and
you can derive insights from a sample of the data. We will analyze 5 days’ worth of
transaction data from September 2016. In this lab we will look at the distribution of
transactions across different states in order to get to know our data a little better.
Data: Dillard’s sales data are available only on the University of Arkansas Remote
Desktop (waltonlab.uark.edu). See your instructor for login credentials.

Lab 2-8 Example Output


By the end of this lab, you will use a SQL query to pull five days of Dillard's transaction
data and explore the distribution of transaction amounts across states. While your results will include
different data values, your work should look similar to this:

Microsoft | Excel + Power Query

Microsoft Excel

LAB 2-8M Example of a PivotTable and PivotChart in Microsoft Excel

Tableau | Desktop

Tableau Software, Inc. All rights reserved.


LAB 2-8T Example of Summary Chart in Tableau Desktop
Lab 2-8 Part 1 Connect to the Data with a SQL Query
Before you begin the lab, you should create a new blank Word document where you will
record your screenshots and save it as Lab 2-8 [Your name] [Your email address].docx.


Microsoft | Excel + Power Query

1. From Microsoft Excel, click the Data tab on the ribbon.


2. Click Get Data > From Database > From SQL Server Database.
a. Server: essql1.walton.uark.edu
b. Database: WCOB_Dillards
c. Expand Advanced Options and input the following query:
SELECT TRANSACT.*, STATE
FROM TRANSACT
INNER JOIN STORE
ON TRANSACT.STORE = STORE.STORE
WHERE TRAN_DATE BETWEEN '20160901' AND '20160905'
d. Click Connect. Click OK if prompted about encryption.
e. Click Edit.

3. Take a screenshot (label it 2-8MA).


4. Click Close & Load > Close & Load To.
5. Choose Only Create Connection and check the box next to Add this data to
the Data Model, then click OK.

Tableau | Desktop

1. Open Tableau Desktop and click Connect to Data > To a Server > Microsoft
SQL Server.
2. Enter the following:
a. Server: essql1.walton.uark.edu
b. Database: WCOB_Dillards
c. All other fields can be left as is, click Sign In.
d. Instead of connecting to a table, you will create a New Custom SQL query.
Double-click New Custom SQL and input the following query:
SELECT TRANSACT.*, STATE
FROM TRANSACT
INNER JOIN STORE
ON TRANSACT.STORE = STORE.STORE
WHERE TRAN_DATE BETWEEN '20160901' AND '20160905'
e. Click Preview Results… to test your query on a sample data set.
f. If everything looks good, close the preview and click OK.

3. Take a screenshot (label it 2-8TA).


4. Click Sheet 1.

Lab 2-8 Part 2 View the Distribution of Transaction
Amounts across States
In addition to data from the Transact table, our query also pulled in the attribute State from
the Store table. We can use this attribute to identify the sum of transaction amounts across
states.
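The steps below perform this aggregation in a PivotTable or a Tableau worksheet. If you wanted to push the same work into the query itself, a statement along these lines (an illustration only, not part of the lab) would return the average transaction amount by state for the same five days:

SELECT
    STORE.STATE,
    COUNT(*)               AS NUM_TRANSACTIONS,
    AVG(TRANSACT.TRAN_AMT) AS AVG_TRAN_AMT
FROM TRANSACT
INNER JOIN STORE
    ON TRANSACT.STORE = STORE.STORE
WHERE TRAN_DATE BETWEEN '20160901' AND '20160905'
GROUP BY STORE.STATE
ORDER BY AVG_TRAN_AMT DESC;   -- highest average first, matching the sort used in the steps below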

Microsoft | Excel + Power Query

1. We will perform this analysis using a PivotTable. Return to the worksheet in your
Excel workbook titled Sheet1.
2. From the Insert tab in the ribbon, click PivotTable.
3. Check Use this workbook’s Data Model and click OK.
4. Expand the Query1 and place check marks next to TRAN_AMT and STATE.
5. The TRAN_AMT default aggregation will likely be SUM. Change it by right-
clicking one of the TRAN_AMT values in the PivotTable, selecting Summarize
Values By > Average.
6. To make this output easier to interpret, you can sort the data so that you see the
states that have the highest average transaction amount first. To do so, have your
active cell anywhere in the Average of TRAN_AMT column, right-click the cell,
select Sort, then select Sort Largest to Smallest.
7. To view a visualization of these results, click the PivotTable Analyze tab in the
ribbon and click PivotChart.
8. The default will be a column chart, which is great for visualizing these data.
Click OK.
9. Take a screenshot (label it 2-8MB) of your PivotTable and PivotChart.
10. When you are finished answering the lab questions, you may close Excel. Save
your file as Lab 2-8 Dillard’s Stats.xlsx.
Tableau | Desktop

1. Within the same Tableau workbook open a new sheet.


2. Add the TRAN_AMT field to the worksheet by double-clicking it.
3. Add the STATE field to the worksheet by double-clicking it.
4. This result shows you the sum of the transaction amounts across each different
state, but it will be more meaningful to compare average transaction amount.
Click the drop-down arrow on the SUM(TRAN_AMT) pill in the rows shelf and
change Measure (Sum) to Average.
5. It will be easier to analyze the different averages quickly if you sort the data.
Click the icon for Sort Descending.
6. Take a screenshot (label it 2-8TB).
7. When you are finished answering the lab questions, you may close Tableau
Desktop. Save your file as Lab 2-8 Dillard’s Stats.twb.


Lab 2-8 Part 2 Objective Questions


OQ 1. Which state has the highest average transaction amount?
OQ 2. What is the average transaction amount for North Carolina? Round your
answer to the nearest dollar.

Lab 2-8 Part 2 Analysis Questions


AQ 1. How does creating a query to connect to the data allow quicker and
more efficient access and analysis of the data than connecting to
entire tables?
AQ 2. Is 5 days of data sufficient to capture the statistical relationship
among and between different variables? What will Excel do if you
have more than 1 million rows? How might a query help?
If you have completed BOTH tracks,
AQ 3. Compare and Contrast: Compare the methods for analyzing
transactions across states in Excel versus Tableau. Which tool was
more intuitive for you to work with? Which provides more
interesting results?

Lab 2-8 Submit Your Screenshot Lab Document


Verify that you have answered any questions your instructor has assigned, then upload
your screenshot lab document to Connect or to the location indicated by your instructor.

