Database Schema Types Explained

This document provides an overview of database schemas, including star, snowflake, flat, and semi-structured schemas, highlighting their structures and use cases. It also compares various database technologies such as OLAP and OLTP, row-based and columnar databases, and discusses business intelligence tools and ETL-specific tools for data processing. Additionally, it introduces Google Dataflow for creating data pipelines and offers resources for learning Python as a useful programming language in business intelligence.


Design efficient database systems with schemas
You have been learning about how business intelligence professionals use data models and schemas to organize and optimize databases. As a refresher, a schema describes how something is organized; think of a data schema as the blueprint of how a database is constructed. This is very useful when exploring a new dataset or designing a relational database. A database schema represents any kind of structure defined around the data. At the most basic level, it indicates which tables or relations make up the database, as well as the fields included in each table.

This reading will explain common schema types you might encounter on the job.

Types of schemas
Star and snowflake
You’ve already learned about the relational models of star and snowflake schemas. Star and snowflake
schemas share some things in common, but they also have a few differences. For instance, although
they both share dimension tables, in snowflake schemas, the dimension tables are normalized. This
splits data into additional tables, which makes the schemas a bit more complex.

A star schema is a schema consisting of one or more fact tables referencing any number of
dimension tables. As its name suggests, this schema is shaped like a star. This type of schema is ideal
for high-scale information delivery and makes read output more efficient. It also classifies attributes
into facts and descriptive dimension attributes (product ID, customer name, sale date).

Here’s an example of a star schema:

In this example, this company uses a star schema to keep track of sales information within their tables.
This includes:
 Customer information
 Product information
 The time the sale is made
 Employee information
All the dimension tables link back to the sales_fact table at the center, which confirms this is a star
schema.
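A minimal SQL sketch of a star schema like this one (the table and column names below are illustrative, not taken from the example) could look like the following:

CREATE TABLE sales_fact (
  sale_id      INT,
  customer_id  INT,   -- links to customer_dim
  product_id   INT,   -- links to product_dim
  date_id      INT,   -- links to date_dim
  employee_id  INT,   -- links to employee_dim
  sale_amount  DECIMAL(10, 2)
);

CREATE TABLE customer_dim (
  customer_id    INT,
  customer_name  VARCHAR(100),
  customer_city  VARCHAR(100)
);

-- product_dim, date_dim, and employee_dim follow the same pattern,
-- each one linking back to sales_fact through its ID column.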

A snowflake schema is an extension of a star schema with additional dimensions and, often,
subdimensions. These dimensions and subdimensions create a snowflake pattern. Like snowflakes in
nature, a snowflake schema—and the relationships within it—can be complex. Snowflake schemas are
an organization type designed for lightning-fast data processing.

Below is an example of a snowflake schema:

Perhaps a data professional wants to design a snowflake schema that contains sports player/club
information. Start at the center with the fact table, which contains:

 PLAYER_ID
 LEAGUE_ID
 MATCH_TYPE
 CLUB_ID
This fact table branches out to multiple dimension tables and even subdimensions. The dimension
tables break out multiple details, such as player international and player club stats, transfer history,
and more.
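To make the normalization concrete, here is a brief SQL sketch (with illustrative names) of how one dimension in a snowflake schema is split into a subdimension:

CREATE TABLE player_dim (
  player_id    INT,
  player_name  VARCHAR(100),
  country_id   INT    -- normalized out into its own subdimension
);

CREATE TABLE country_dim (
  country_id    INT,
  country_name  VARCHAR(100)
);

-- In a star schema, country_name would simply be a column on player_dim;
-- splitting it into country_dim is what creates the snowflake shape.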
Flat model
Flattened schemas are extremely simple database systems with a single table in which each record is
represented by a single row of data. The rows are separated by a delimiter, like a comma, to indicate
the separations between records. Flat models are not relational; they can’t capture relationships
between tables or data items. Because of this, flat models are more often used as a potential source
within a data system to capture less complex data that doesn’t need to be updated.

Here is a flat table of runners and times for a 100-meter race:

This data isn’t going to change because the race has already occurred. And it’s so simple that it’s not
really worth the effort of integrating it into a complex relational database when a simple flat model suffices.

As a BI professional, you may encounter flat models in data sources that you want to integrate into
your own systems. Recognizing that these aren’t already relational models is useful when considering
how best to incorporate the data into your target tables.

Semi-structured schemas
In addition to traditional, relational schemas, there are also semi-structured database schemas which
have much more flexible rules, but still maintain some organization. Because these databases have less
rigid organizational rules, they are extremely flexible and are designed to quickly access data.

There are four common semi-structured schemas:

Document schemas store data as documents, similar to JSON files. These documents store pairs
of fields and values of different data types.

Key-value schemas pair a string with some relationship to the data, like a filename or a URL,
which is then used as a key. This key is connected to the data, which is stored in a single collection.
Users directly request data by using the key to retrieve it.

Wide-column schemas use flexible, scalable tables. Each row contains a key and related
columns stored in a wide format.
Graph schemas store data items in collections called nodes. These nodes are connected by
edges, which store information about how the nodes are related. However, unlike relational databases,
these relationships change as new data is introduced into the nodes.
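For example, a document-style record stored as a JSON string can still be queried with SQL in some systems. Here is a minimal sketch using BigQuery's JSON functions (the table and field names are hypothetical):

SELECT
  JSON_VALUE(order_doc, '$.customer.name') AS customer_name,
  JSON_VALUE(order_doc, '$.total') AS order_total
FROM
  my_dataset.orders;  -- order_doc holds one JSON document per row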

Conclusion
As a BI professional, you will often work with data that has been organized and stored in different ways.
Different database models and schemas are useful for different things, and knowing that will help you
design an efficient database system!

--------

Database comparison checklist


In this lesson, you have been learning about the different aspects of databases and how
they influence the way a business intelligence system functions. The database framework
—including how platforms are organized and how data is stored and processed—affects
how data is used. Therefore, understanding different technologies helps you make more
informed decisions about the BI tools and processes you create. This reading provides a
breakdown of databases including OLAP, OLTP, row-based, columnar, distributed, single-
homed, separated storage and compute, and combined.

OLAP versus OLTP


OLAP
Description: Online Analytical Processing (OLAP) systems are databases that have been primarily optimized for analysis.
Uses:
 Provide user access to data from a variety of source systems
 Used by BI and other data professionals to support decision-making processes
 Analyze data from multiple databases
 Draw actionable insights from data delivered to reporting tables

OLTP
Description: Online Transaction Processing (OLTP) systems are databases that have been optimized for data processing instead of analysis.
Uses:
 Store transaction data
 Used by customer-facing employees or customer self-service applications
 Read, write, and update single rows of data
 Act as source systems that data pipelines can be pulled from for analysis
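To make the contrast concrete, here is a hedged SQL sketch (the table names are illustrative): an OLTP workload touches individual transaction rows, while an OLAP workload aggregates many rows to support a decision.

-- OLTP-style operation: read, write, or update a single row
UPDATE orders
SET status = 'shipped'
WHERE order_id = 1001;

-- OLAP-style query: aggregate many rows delivered to a reporting table
SELECT
  product_id,
  SUM(sale_amount) AS total_revenue
FROM sales_fact
GROUP BY product_id;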

Row-based versus columnar


Row-based
Description: Row-based databases are organized by rows.
Uses:
 Traditional, easy-to-write database organization, typically used in OLTP systems
 Writes data very quickly
 Stores all of a row’s values together
 Easily optimized with indexing

Columnar
Description: Columnar databases are organized by columns instead of rows.
Uses:
 Newer form of database organization, typically used to support OLAP systems
 Reads data more quickly and only pulls the necessary data for analysis
 Stores multiple rows’ columns together

Distributed versus single-homed


Distributed
Description: Distributed databases are collections of data systems distributed across multiple physical locations.
Uses:
 Easily expanded to address increasing or larger scale business needs
 Accessed from different networks
 Easier to secure than a single-homed database system

Single-homed
Description: Single-homed databases are databases where all of the data is stored in the same physical location.
Uses:
 Data stored in a single location is easier to access and coordinate cross-team
 Cuts down on data redundancy
 Cheaper to maintain than larger, more complex systems

Separated storage and compute versus combined


Separated storage and compute
Description: Separated storage and computing systems are databases where less relevant data is stored remotely, and relevant data is stored locally for analysis.
Uses:
 Run analytical queries more efficiently because the system only needs to process the most relevant data
 Scale computation resources and storage systems separately based on your organization’s custom needs

Combined storage and compute
Description: Combined systems are database systems that store and analyze data in the same place.
Uses:
 Traditional setup that allows users to access all possible data at once
 Storage and computation resources are linked, so resource management is straightforward

Business intelligence tools and their applications
As you advance in your business intelligence career, you will encounter many different
tools. One of the great things about the skills you have been learning in these courses is
that they’re transferable between different solutions. No matter which tools you end up
using, the overall logic and processes will be similar! This reading provides an overview of
many of these business intelligence solutions.

Tool: Azure Analysis Service (AAS)
Uses:
 Connect to a variety of data sources
 Build in data security protocols
 Grant access and assign roles cross-team
 Automate basic processes

Tool: CloudSQL
Uses:
 Connect to existing MySQL, PostgreSQL, or SQL Server databases
 Automate basic processes
 Integrate with existing apps and Google Cloud services, including BigQuery
 Observe database processes and make changes

Tool: Looker Studio
Uses:
 Visualize data with customizable charts and tables
 Connect to a variety of data sources
 Share insights internally with stakeholders and online
 Collaborate cross-team to generate reports
 Use report templates to speed up your reporting

Tool: Microsoft PowerBI
Uses:
 Connect to multiple data sources and develop detailed models
 Create personalized reports
 Use AI to get fast answers using conversational languages
 Collaborate cross-team to generate and share insights on Microsoft applications

Tool: Pentaho
Uses:
 Develop pipelines with a codeless interface
 Connect to live data sources for updated reports
 Establish connections to an expanded library
 Access an integrated data science toolkit

Tool: SSAS and SSRS SQL Server tools
Uses:
 Access and analyze data across multiple online databases
 Integrate with existing Microsoft services, including BI and data warehousing
 Use built-in reporting tools

Tool: Tableau
Uses:
 Connect and visualize data quickly
 Analyze data without technical programming languages
 Connect to a variety of data sources, including spreadsheets, databases, and cloud sources
 Combine multiple views of the data in intuitive dashboards
 Build in live connections with updating data sources

ETL-specific tools and their applications


In a previous reading, you were given a list of common business intelligence tools and some of their
uses. Many of them have built-in pipeline functionality, but there are a few ETL-specific tools you may
encounter. Creating pipeline systems—including ETL pipelines that move and transform data between
different data sources to the target database—is a large part of a BI professional's job, so having an
idea of what tools are out there can be really useful. This reading provides an overview.

Tool: Apache Nifi
Uses:
 Connect a variety of data sources
 Access a web-based user interface
 Configure and change pipeline systems as needed
 Modify data movement through the system at any time

Tool: Google DataFlow
Uses:
 Synchronize or replicate data across a variety of data sources
 Identify pipeline issues with smart diagnostic features
 Use SQL to develop pipelines from the BigQuery UI
 Schedule resources to reduce batch processing costs
 Use pipeline templates to kickstart the pipeline creation process for systems across your organization

Tool: IBM InfoSphere Information Server
Uses:
 Integrate data across multiple systems
 Govern and explore available data
 Improve business alignment and processes
 Analyze and monitor data from multiple data sources

Tool: Microsoft SQL Server Integration Services (SSIS)
Uses:
 Connect data from a variety of sources
 Use built-in transformation tools
 Access graphical tools to create solutions without coding
 Generate custom packages to address specific business needs

Tool: Oracle Data Integrator
Uses:
 Connect data from a variety of sources
 Track changes and monitor system performance with built-in features
 Access system monitoring and drill-down capabilities
 Reduce monitoring costs with access to built-in Oracle services

Tool: Pentaho Data Integrator
Uses:
 Connect data from a variety of sources
 Create codeless pipelines with a drag-and-drop interface
 Access dataflow templates for easy use
 Analyze data with integrated tools

Tool: Talend
Uses:
 Connect data from a variety of sources
 Design, implement, and reuse pipelines from a cloud server
 Access and search for data using integrated Talend services
 Clean and prepare data with built-in tools

Guide to Dataflow
As you have been learning, Dataflow is a serverless data-processing service that reads data from the
source, transforms it, and writes it in the destination location. Dataflow creates pipelines with open
source libraries, with which you can interact using different languages, including Python and SQL. This
reading provides information about accessing Dataflow and its functionality.

Navigate the homepage


If you completed the optional Create a Google Cloud account activity, you can follow along with the
steps of this reading in your Dataflow console. Go to the Dataflow Google Cloud homepage and sign in
to your account to access Dataflow. Then click the Go to Console button or the Console button.
Here, you will be able to create new jobs and access Dataflow tools.
Jobs
When you first open the console, you will find the Jobs page. The Jobs page lists the current jobs in your project space. There are also options to CREATE JOB FROM TEMPLATE or CREATE MANAGED DATA PIPELINE from this page, so that you can get started on a new project in your Dataflow console. This is where you will go anytime you want to start something new.

Pipelines
Open the menu pane to navigate through the console and find the other pages in Dataflow. The
Pipelines menu contains a list of all the pipelines you have created. If this is your first time using
Dataflow, it will also display the processes you need to enable before you can start building pipelines. If
you haven’t already enabled the APIs, click Fix All to enable the API features and set your location.
Workbench
The Workbench section is where you can create and save shareable Jupyter notebooks with live
code. This is helpful for first-time ETL tool users to check out examples and visualize the
transformations.

Snapshots
Snapshots save the current state of a pipeline so that you can create new versions without losing that
state. This is useful when you are testing or updating current pipelines so that you aren’t disrupting the
system. This feature also allows you to back up and recover old project versions. You may need to
enable APIs to view the Snapshots page; you will learn more about APIs in an upcoming activity.
SQL Workspace
Finally, the SQL Workspace is where you interact with your Dataflow jobs, connect to BigQuery
functionality, and write necessary SQL queries for your pipelines.

Dataflow also gives you the option to interact with your databases using other coding languages, but
you will primarily be using SQL for these courses.

Dataflow is a valuable way to start building pipelines and exercise some of the skills you have been
learning in this course. Coming up, you will have more opportunities to work with Dataflow, so now is a
great time to get familiar with the interface!

Python applications and resources


In this course, you will primarily be using BigQuery and SQL when interacting with databases in Google
DataFlow. However, DataFlow does have the option for you to work with Python, which is a widely used
general-purpose programming language. Python can be a great tool for business intelligence
professionals, so this reading provides resources and information for adding Python to your toolbox!
Elements of Python
There are a few key elements about Python that are important to understand:

 Python is open source and freely available to the public.


 It is an interpreted programming language, which means it uses another program to read and
execute coded instructions.
 Data is stored in data frames, similar to R.
 In BI, Python can be used to connect to a database system to work with files.
 It is primarily object-oriented.
 Formulas, functions, and multiple libraries are readily available.
 A community of developers exists for online code support.
 Python uses simple syntax for straightforward coding.
 It integrates with cloud platforms including Google Cloud, Amazon Web Services, and Azure.

Resources
If you’re interested in learning Python, there are many resources available to help. Here are just a few:

 The Python Software Foundation (PSF): a website with guides to help you get
started as a beginner
 Python Tutorial: a Python 3 tutorial from the PSF site
 Coding Club Python Tutorials: a collection of coding tutorials for Python

General tips for learning programming languages
As you have been discovering, there are often transferable skills you can apply to a lot of different tools
—and that includes programming languages! Here are a few tips:

 Define a practice project and use the language to help you complete it. This makes the learning
process more practical and engaging.
 Keep in mind previous concepts and coding principles. After you have learned one language,
learning another tends to be much easier.
 Take good notes or make cheat sheets in whatever format (handwritten or typed) that works
best for you.
 Create an online filing system for information that you can easily access while you work in
various programming environments.

Merge data from multiple sources with BigQuery
Previously, you started exploring Google Dataflow, a Google Cloud Platform (GCP) tool
that reads data from the source, transforms it, and writes it in the destination location. In
this lesson, you will begin working with another GCP data-processing tool: BigQuery. As
you may recall from the Google Data Analytics Certificate, BigQuery is a data warehouse
used to query and filter large datasets, aggregate results, and perform complex
operations.

As a business intelligence (BI) professional, you will need to gather and organize data from
stakeholders across multiple teams. BigQuery allows you to merge data from multiple
sources into a target table. The target table can then be turned into a dashboard, which
makes the data easier for stakeholders to understand and analyze. In this reading, you will
review a scenario in which a BI professional uses BigQuery to merge data from multiple
stakeholders in order to answer important business questions.

The problem
Consider a scenario in which a BI professional, Aviva, is working for a fictitious coffee shop
chain. Each year, the cafes offer a variety of seasonal menu items. Company leaders are
interested in identifying the most popular and profitable items on their seasonal menus so
that they can make more confident decisions about pricing; strategic promotion; and
retaining, expanding, or discontinuing menu items.

The solution
Data extraction
In order to obtain the information the stakeholders are interested in, Aviva begins
extracting the data. The data extraction process includes locating and identifying relevant
data, then preparing it to be transformed and loaded. To identify the necessary data,
Aviva implements the following strategies:

Meet with key stakeholders


Aviva leads a workshop with stakeholders to identify their objectives. During this
workshop, she asks stakeholders questions to learn about their needs:

 What information needs to be obtained from the data (for instance, performance of
different menu items at different restaurant locations)?
 What specific metrics should be measured (sales metrics, marketing metrics,
product performance metrics)?
 What sources of data should be used (sales numbers, customer feedback, point of
sales)?
 Who needs access to this data (management, market analysts)?
 How will key stakeholders use this data (for example, to determine which items to
include on upcoming menus, make pricing decisions)?
Observe teams in action
Aviva also spends time observing the stakeholders at work and asking them questions
about what they’re doing and why. This helps her connect the goals of the project with the
organization’s larger initiatives. During these observations, she asks questions about why
certain information and activities are important for the organization.

Organize data in BigQuery


Once Aviva has completed the data extraction process, she transforms the data she’s
gathered from different stakeholders and loads it into BigQuery. Then she uses BigQuery
to design a target table to organize the data. The target table helps Aviva unify the data.
She then uses the target table to develop a final dashboard for stakeholders to review.
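A hedged sketch of what such a target table query might look like in BigQuery (the dataset, table, and column names below are hypothetical, not Aviva's actual schema):

CREATE OR REPLACE TABLE cafe_bi.seasonal_menu_performance AS
SELECT
  m.item_name,
  m.season,
  SUM(s.quantity * s.unit_price) AS total_revenue,
  COUNT(DISTINCT s.order_id) AS number_of_orders
FROM cafe_bi.sales AS s
JOIN cafe_bi.menu_items AS m
  ON s.item_id = m.item_id
GROUP BY
  m.item_name,
  m.season;

A consolidated table like this is what the final dashboard reads from.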

The results
When stakeholders review the dashboard, they are able to identify several key findings
about the popularity and profitability of items on their seasonal menus. For example, the
data indicates that many peppermint-based products on their menus have decreased in
popularity over the past few years, while cinnamon-based products have increased in
popularity. This finding leads stakeholders to decide to retire three of their peppermint-
based drinks and bakery items. They also decide to add a selection of new cinnamon-
based offerings and launch a campaign to promote these items.

Key findings
Organizing data from multiple sources in a tool like BigQuery allows BI professionals to
find answers to business questions. Consolidating the data in a target table also makes it
easier to develop a dashboard for stakeholders to review. When stakeholders can access
and understand the data, they can make more informed decisions about how to improve
services or products and take advantage of new opportunities.

Activity Exemplar: Create a target table in BigQuery
In this activity, you used BigQuery to create a target table to store data you pulled from a dataset of
street tree information from San Francisco, California. In your BI role, you’ll need to use programs such
as BigQuery and Dataflow to move and analyze data with SQL. Now, you’ve practiced a key part of the
Extraction stage of the BI pipeline: pulling data from a source and placing it into its own table.

The exemplar you’re about to review will help you evaluate whether you completed the activity
correctly. Because this activity involves copying, pasting, and executing a complete SQL query, you will
just need to check that your result matches this exemplar.
If you find that the result you received is different from the exemplar provided, double check the
formatting of the query you copied. Review the explanation of the SQL query in this activity to learn
more about how the SQL query works and how to write your own in your projects.

Access the exemplar


To explore the query result exemplar, download the following attachment:

SanFranciscoTrees (CSV file)

Assessment of exemplar

In this activity, you ran the following SQL query to create a target table:

SELECT
address,
COUNT(address) AS number_of_trees
FROM
`bigquery-public-data.san_francisco_trees.street_trees`
WHERE
address != "null"
GROUP BY address
ORDER BY number_of_trees DESC
LIMIT 10;
 The SELECT clause selects the address of each tree. By using the COUNT function, you count
the number of trees at each address and return a single row of data per address, instead of per
tree. This data is saved as a new column.

 The FROM clause is straightforward as it specifies the street_trees table within the San
Francisco Street Trees dataset.

 The WHERE clause is necessary to ensure that your target table only includes rows that have a
value in the address column.

 The GROUP BY clause specifies that you’re grouping data by the address, and the ORDER BY
clause sorts the data in descending order by number_of_trees column.

 The LIMIT clause limits the query to return only the top ten rows of data. When working with
large datasets, including a limit will decrease the processing time required to return the data.
If you need a refresher on SQL code, review some resources from the Google Data Analytics Certificate:
Review Google Data Analytics Certificate content about SQL and Review Google Data Analytics
Certificate content about SQL best practices.

The result of this query is a target table with two columns. It features the address column, as well as the
total number of trees planted at the address you calculated in the SELECT clause. If properly executed,
the first value in the address column is 100x Cargo Way. Next to it, the number_of_trees is 135. If you
didn’t receive this result, please review the code and run it again.

Furthermore, the target table shows the 10 addresses with the most trees planted by the Department
of Public Works in the city of San Francisco:

 100x Cargo Way

 700 Junipero Serra Blvd

 1000 San Jose Ave

 1200 Sunset Blvd

 1600 Sunset Blvd

 2301 Sunset Blvd

 1501 Sunset Blvd

 2401 Sunset Blvd

 100 STAIRWAY5

 2601 Sunset blvd

And the number of trees for each address is as follows:

 100x Cargo Way: 135

 700 Junipero Serra Blvd: 125

 1000 San Jose Ave: 113

 1200 Sunset Blvd: 110

 1600 Sunset Blvd: 102

 2301 Sunset Blvd: 94

 1501 Sunset Blvd: 93

 2401 Sunset Blvd: 92

 100 STAIRWAY5: 87

 2601 Sunset Blvd: 84


Key takeaways
Target tables are the destination for data during the Extraction stage of a pipeline. You’ll use them in
your role as a BI professional to store data after pulling it from its sources. Once the data is in a target
table, you can transform it with BigQuery or Dataflow and load it into reporting tables. You’ll
learn about the Transform and Load stages of data pipelines later in this course.

Case study: Wayfair - Working with stakeholders to create a pipeline
Working with stakeholders while designing and iterating on a pipeline system is an
important strategy for ensuring that the BI systems you put in place answer their business
needs. In this case study, you’ll discover how the BI team at e-commerce home retailer
Wayfair, headquartered in Boston, Massachusetts, works with their stakeholders
throughout a project to create a pipeline system that works for them.

Company background
Longtime friends Niraj Shah and Steve Conine started the online-only company in 2002
after deciding they wanted to offer a larger selection of choices to customers—more than
could fit in a brick-and-mortar space. They started the company as a collection of more
than 200 e-commerce stores, each selling separate categories of products. In 2011, the
company combined these sites to establish wayfair.com.
Wayfair is now one of the world’s largest home retailers. The company’s goal is to help
everyone, anywhere, create their feeling of home. It empowers customers to create spaces
that reflect who they are, what they need, and what they value.

The challenge
The Wayfair pricing ecosystem includes thousands of different inputs and outputs across a
full catalog of products, which change multiple times a day. All of these inputs and
outputs are being generated in different ways from different sources. Because of this, the
BI team and other data professionals who needed to access pricing data were having
trouble locating, querying, and interpreting the complete dataset. This led to incomplete
and often inaccurate insights that weren’t useful for decision makers.

To address this, the BI team decided to design and implement a new pipeline system to
consolidate all the data stakeholders needed. They also needed to consider a few
additional challenges with their pipeline system:

 Monitoring and reporting around these processes would need to be included in the
design to track and manage errors.
 Data would need to be clean before it could be shared with downstream users.
 Due to the variety of data types being joined, the BI team also needed to better
understand the data relationships so they could accurately consolidate the data.
 Training sessions would be required to help educate users on how to best access
and use the new datasets.
These unique challenges meant that it was especially important for the BI team to work
closely with stakeholders while developing their new system to address their needs and
create something that worked across multiple teams.

The approach
Given the massive amount of data within the system, it was important for the BI team to
step back and work with stakeholders to really understand how they were using the data
currently. That included understanding the business problems they were trying to solve,
the data they were already using and how they were accessing it, and the data they
wanted to use but couldn’t access yet.

Once they had communicated with stakeholders, the team was able to design a pipeline
that achieved three key goals:

 All the required data could be made available and easy to understand and use
 The system was more efficient and could make data available without delays
 The system was designed to scale as the dataset expanded vertically and
horizontally to support future growth
After this initial design was completed, the system was presented to stakeholders for
review to ensure they understood the system and that it met all of their needs. This
project required collaboration across a variety of stakeholders and teams:

 Software engineers: The software engineering team members were the primary owners
and generators of data, so they were key to understanding the current state of the
data and helped make it accessible for the BI team to work with.
 Data architects: The BI team consulted with data architects to ensure that the
pipeline design was all-encompassing, efficient, and scalable so the BI team could
handle the amount of data being ingested by the system and ensure that
downstream users would have access to the data as the system was being scaled.
 Data professionals: As the core users, these teams provided the use cases
and requirements for the system so that the BI team could ensure that the pipeline
addressed their needs. Because each of their respective teams’ needs were
different, it was important to ensure that the system design and the data it included
were wide enough to account for all of those needs.
 Business stakeholders: As the end users of the insights generated by the
entire pipeline, the business stakeholders ensured all development work and use
cases were rooted in clear business problems, so that what the BI team built
could be immediately applied to their work.
Communicating with all of the stakeholders throughout the design process ensured that
the Wayfair BI team created something useful and long-lasting for their organization.

The results
The final pipeline that the BI team implemented achieved a variety of key goals for the
entire organization:

 It enabled software engineering teams to publish data in real-time for the BI team
to use.
 It consolidated the different data components into one unified dataset for ease of
access and use.
 It allowed the BI team to store different data components in their own individual
staging layers.
 It included additional processes to monitor and report on the system’s
performance to inform users where failures were occurring and enable quick fixes.
 It created a unified dataset that users could leverage to build metrics and report on
data.
The greatest benefit of this pipeline solution was that Wayfair now had the ability to
provide accurate information in one place for users, eliminating the need to join different
sources themselves. This meant that the team could promote more accurate insights for
stakeholders and get rid of costly ad-hoc processes.

The response cross-team was very positive. The director of analytics at Wayfair said that
this was revolutionary for their team’s daily work because they had information on retail
price, cost inputs, and product status in the same place for the first time. This was a huge
benefit for their processes and helped them handle their data in a more intelligent way.

Conclusion
A significant benefit that business intelligence provides an organization is that it makes
the systems and processes more efficient and effective for users across the organization;
basically, BI makes everyone’s jobs a little easier. Ensuring that the BI team is tightly
aligned with the business stakeholders and other teams is critical to their success. Without
great partnership, problems can’t be solved correctly.
Glossary terms from week 1
Attribute: In a dimensional model, a characteristic or quality used to describe a
dimension

Columnar database: A database organized by columns instead of rows

Combined systems: Database systems that store and analyze data in the same
place

Compiled programming language: A programming language that compiles coded instructions that are executed directly by the target machine

Data lake: A database system that stores large amounts of raw data in its original format until it’s needed

Data mart: A subject-oriented database that can be a subset of a larger data warehouse

Data warehouse: A specific type of database that consolidates data from multiple
source systems for data consistency, accuracy, and efficient access

Database migration: Moving data from one source platform to another target
database

Dimension (data modeling): A piece of information that provides more detail and context regarding a fact

Dimension table: The table where the attributes of the dimensions of a fact are stored

Design pattern: A solution that uses relevant measures and facts to create a model in support of business needs

Dimensional model: A type of relational model that has been optimized to quickly retrieve data from a data warehouse

Distributed database: A collection of data systems distributed across multiple physical locations

Fact: In a dimensional model, a measurement or metric

Fact table: A table that contains measurements or metrics related to a particular event

Foreign key: A field within a database table that is a primary key in another table
(Refer to primary key)

Functional programming language: A programming language modeled around functions

Google DataFlow: A serverless data-processing service that reads data from the source, transforms it, and writes it in the destination location

Interpreted programming language: A programming language that uses an interpreter, typically another program, to read and execute coded instructions

Logical data modeling: Representing different tables in the physical data model

Object-oriented programming language: A programming language modeled around data objects

OLAP (Online Analytical Processing) system: A tool that has been optimized for analysis in addition to processing and can analyze data from multiple databases

OLTP (Online Transaction Processing) database: A type of database that has been optimized for data processing instead of analysis

Primary key: An identifier in a database that references a column or a group of columns in which each row uniquely identifies each record in the table (Refer to foreign key)

Python: A general purpose programming language

Response time: The time it takes for a database to complete a user request

Row-based database: A database that is organized by rows

Separated storage and computing systems: Databases where less relevant data is stored remotely, and relevant data is stored locally for analysis

Single-homed database: Database where all of the data is stored in the same physical location

Snowflake schema: An extension of a star schema with additional dimensions and, often, subdimensions

Star schema: A schema consisting of one fact table that references any number of dimension tables

Target table: The predetermined location where pipeline data is sent in order to be
acted on

Terms and definitions from previous weeks


A
Application programming interface (API): A set of functions and procedures
that integrate computer programs, forming a connection that enables them to
communicate

Applications software developer: A person who designs computer or mobile applications, generally for consumers

B
Business intelligence (BI): Automating processes and information channels in order to transform relevant data into actionable insights that are easily available to decision-makers

Business intelligence governance: A process for defining and implementing business intelligence systems and frameworks within an organization

Business intelligence monitoring: Building and using hardware and software tools to easily and rapidly analyze data and enable stakeholders to make impactful business decisions

Business intelligence stages: The sequence of stages that determine both BI business value and organizational data maturity, which are capture, analyze, and monitor

Business intelligence strategy: The management of the people, processes, and tools used in the business intelligence process

D
Data analysts: People who collect, transform, and organize data

Data availability: The degree or extent to which timely and relevant information is
readily accessible and able to be put to use

Data governance professionals: People who are responsible for the formal
management of an organization’s data assets

Data integrity: The accuracy, completeness, consistency, and trustworthiness of


data throughout its life cycle

Data maturity: The extent to which an organization is able to effectively use its data
in order to extract actionable insights

Data model: A tool for organizing data elements and how they relate to one another

Data pipeline: A series of processes that transports data from different sources to
their final destination for storage and analysis
Data visibility: The degree or extent to which information can be identified,
monitored, and integrated from disparate internal and external sources

Data warehousing specialists: People who develop processes and procedures to effectively store and organize data

Deliverable: Any product, service, or result that must be achieved in order to complete a project

Developer: A person who uses programming languages to create, execute, test, and
troubleshoot software applications

E
ETL (extract, transform, and load): A type of data pipeline that enables data
to be gathered from source systems, converted into a useful format, and brought into a
data warehouse or other unified destination system

Experiential learning: Understanding through doing

I
Information technology professionals: People who test, install, repair,
upgrade, and maintain hardware and software solutions

Iteration: Repeating a procedure over and over again in order to keep getting closer to
the desired result

K
Key performance indicator (KPI): A quantifiable value, closely linked to
business strategy, which is used to track progress toward a goal

M
Metric: A single, quantifiable data point that is used to evaluate performance

P
Portfolio: A collection of materials that can be shared with potential employers

Project manager: A person who handles a project’s day-to-day steps, scope, schedule, budget, and resources

Project sponsor: A person who has overall accountability for a project and establishes the criteria for its success

S
Strategy: A plan for achieving a goal or arriving at a desired future state

Systems analyst: A person who identifies ways to design, implement, and advance information systems in order to ensure that they help make it possible to achieve business goals

Systems software developer: A person who develops applications and programs for the backend processing systems used in organizations

T
Tactic: A method used to enable an accomplishment

Transferable skill: A capability or proficiency that can be applied from one job to
another

V
Vanity metric: Data points that are intended to impress others, but are not indicative
of actual performance and, therefore, cannot reveal any meaningful business insights

[Optional] Review Google Data Analytics Certificate content about SQL best practices
You can save this reading for future reference. Feel free to download a PDF version of this
reading below:

DAC3 In-depth guide_ SQL best practices.pdf (PDF file)

These best practices include guidelines for writing SQL queries, developing
documentation, and examples that demonstrate these practices. This is a great resource
to have handy when you are using SQL yourself; you can just go straight to the relevant
section to review these practices. Think of it like a SQL field guide!

Capitalization and case sensitivity


With SQL, capitalization usually doesn’t matter. You could write SELECT or select or
SeLeCT. They all work! But if you use capitalization as part of a consistent style, your
queries will look more professional.
To write SQL queries like a pro, it is always a good idea to use all caps for clause starters
(e.g., SELECT, FROM, WHERE, etc.). Functions should also be in all caps (e.g., SUM()).
Column names should be all lowercase (refer to the section on snake_case later in this
guide). Table names should be in CamelCase (refer to the section on CamelCase later in
this guide). This helps keep your queries consistent and easier to read while not impacting
the data that will be pulled when you run them. The only time that capitalization does
matter is when it is inside quotes (more on quotes below).
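For instance, a query that follows these conventions might look like this sketch (TicketsByOccasion and its columns are illustrative names):

SELECT
  event_name,
  SUM(tickets_sold) AS total_tickets
FROM
  TicketsByOccasion
WHERE
  event_year = 2020
GROUP BY
  event_name;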

Vendors of SQL databases may use slightly different variations of SQL. These variations
are called SQL dialects. Some SQL dialects are case sensitive when comparing strings; BigQuery
is one of them, and Vertica is another. Others, like MySQL and SQL Server, aren't case
sensitive by default. This means that if you searched for country_code = 'us', they would return all
entries that have 'us', 'uS', 'Us', and 'US'. This isn't the case with BigQuery. BigQuery is case sensitive,
so that same search would only return entries where the country_code is exactly 'us'. If
the country_code is 'US', BigQuery wouldn't return those entries as part of your result.
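If you need a case-insensitive match in a case-sensitive dialect like BigQuery, one common approach (sketched here with a hypothetical customers table) is to normalize the case before comparing:

SELECT
  *
FROM
  customers
WHERE
  LOWER(country_code) = 'us'  -- matches 'us', 'US', 'Us', and 'uS'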

Single or double quotes: '' or " "


For the most part, it also doesn’t matter if you use single quotes ' ' or double quotes " "
when referring to strings. For example, SELECT is a clause starter. If you put SELECT in
quotes like 'SELECT' or "SELECT", then SQL will treat it as a text string. Your query will
return an error because your query needs a SELECT clause.

But there are two situations where it does matter what kind of quotes you use:

1. When you want strings to be identifiable in any SQL dialect
2. When your string contains an apostrophe or quotation marks
Within each SQL dialect there are rules for what is accepted and what isn’t. But a general
rule across almost all SQL dialects is to use single quotes for strings. This helps get rid of a
lot of confusion. So if we want to reference the country US in a WHERE clause (e.g.,
country_code = 'US'), then use single quotes around the string 'US'.

The second situation is when your string has quotes inside it. Suppose you have a column
of favorite foods in a table called FavoriteFoods and the other column corresponds to
each friend.

Friend             Favorite_food
Rachel DeSantos    Shepherd’s pie
Sujin Lee          Tacos
Najil Okoro        Spanish paella
You might notice how Rachel’s favorite food contains an apostrophe. If you were to use
single quotes in a WHERE clause to find the friend who has this favorite food, it would look
like this:
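A sketch of that broken query, using the FavoriteFoods table above:

SELECT
  Friend
FROM
  FavoriteFoods
WHERE
  Favorite_food = 'Shepherd's pie'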
This won’t work. If you run this query, you will get an error in return. This is because
SQL recognizes a text string as something that starts with a quote ' and ends with another
quote '. So in the bad query above, SQL thinks that the Favorite_food you are looking for
is 'Shepherd'. Just 'Shepherd' because the apostrophe in Shepherd's ends the string.

Generally speaking, this should be the only time you would use double quotes instead of
single quotes. So your query would look like this instead:
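A sketch of the corrected query:

SELECT
  Friend
FROM
  FavoriteFoods
WHERE
  Favorite_food = "Shepherd's pie"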

SQL understands text strings as starting with either a single quote ' or a double quote ".
Since this string starts with a double quote, SQL will expect another double quote to signal
the end of the string. This keeps the apostrophe safe, so it will return "Shepherd's pie" and
not 'Shepherd'.

Comments as reminders
As you get more comfortable with SQL, you will be able to read and understand queries at
a glance. But it never hurts to have comments in the query to remind yourself of what you
are trying to do. And if you share your query, it also helps others understand it.

For example:
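Here is a sketch of a commented query (the table name is a placeholder):

-- Goal: count every row to check that the table loaded fully
SELECT
  COUNT(*) AS number_of_rows
FROM
  table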
You can use # in place of the two dashes, --, in the above query but keep in mind that #
isn’t recognized in all SQL dialects (MySQL doesn’t recognize #). So it is best to use -- and
be consistent with it. When you add a comment to a query using --, the database query
engine will ignore everything in the same line after --. It will continue to process the query
starting on the next line.

Snake_case names for columns


It is important to always make sure that the output of your query has easy-to-understand
names. If you create a new column (say from a calculation or from concatenating new
fields), the new column will receive a generic default name (e.g., f0). For example:
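Here is a sketch of a query that produces both default and explicitly named columns (the purchases table and tickets column are placeholder names):

SELECT
  SUM(tickets),
  COUNT(tickets),
  SUM(tickets) AS total_tickets,
  COUNT(tickets) AS number_of_purchases
FROM
  purchases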

The results of this query are:

f0    f1    total_tickets    number_of_purchases
8     4     8                4
The first two columns are named f0 and f1 because they weren’t named in the above
query. SQL defaults to f0, f1, f2, f3, and so on. We named the last two columns
total_tickets and number_of_purchases so these column names show up in the query
results. This is why it is always good to give your columns useful names, especially when
using functions. After running your query, you want to be able to quickly understand your
results, like the last two columns we described in the example.

On top of that, you might notice how the column names have an underscore between the
words. Names should never have spaces in them. If 'total_tickets' had a space and looked
like 'total tickets' then SQL would rename SUM(tickets) as just 'total'. Because of the
space, SQL will use 'total' as the name and won’t understand what you mean by 'tickets'.
So, spaces are bad in SQL names. Never use spaces.

The best practice is to use snake_case. This means that 'total tickets', which has a space
between the two words, should be written as 'total_tickets' with an underscore instead of
a space.

CamelCase names for tables


You can also use CamelCase capitalization when naming your table. CamelCase
capitalization means that you capitalize the start of each word, like a two-humped
(Bactrian) camel. So the table TicketsByOccasion uses CamelCase capitalization. Please
note that the capitalization of the first word in CamelCase is optional; camelCase is also
used. Some people differentiate between the two styles by calling CamelCase "PascalCase"
and reserving camelCase for when the first word isn't capitalized, like a one-humped
(Dromedary) camel; for example, ticketsByOccasion.

At the end of the day, CamelCase is a style choice. There are other ways you can name
your tables, including:

 All lower or upper case, like ticketsbyoccasion or TICKETSBYOCCASION


 With snake_case, like tickets_by_occasion
Keep in mind, the option with all lowercase or uppercase letters can make it difficult to
read your table name, so it isn’t recommended for professional use.

The second option, snake_case, is technically okay. With words separated by underscores,
your table name is easy to read, but it can get very long because you are adding the
underscores. It also takes more time to write. If you use this table a lot, it can become a
chore.

In summary, it is up to you to use snake_case or CamelCase when creating table names.
Just make sure your table name is easy to read and consistent. Also be sure to find out if
your company has a preferred way of naming their tables. If they do, always go with their
naming convention for consistency.

Indentation
As a general rule, you want to keep the length of each line in a query <= 100 characters.
This makes your queries easy to read. For example, check out this query (written here against a
hypothetical film_list table) with a line longer than 100 characters:

SELECT CASE WHEN genre = 'horror' THEN 'Will not watch' WHEN genre = 'documentary' THEN 'Will watch alone' ELSE 'Watch with others' END AS Watch_category, COUNT(*) AS title_count FROM film_list GROUP BY Watch_category
This query is hard to read and just as hard to troubleshoot or edit. Now, here is a query
where we stick to the <= 100 character rule:
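A sketch of the same query, broken up so no line exceeds 100 characters:

SELECT
  CASE
    WHEN genre = 'horror' THEN 'Will not watch'
    WHEN genre = 'documentary' THEN 'Will watch alone'
    ELSE 'Watch with others'
  END AS Watch_category,
  COUNT(*) AS title_count
FROM
  film_list
GROUP BY
  Watch_category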

Now it is much easier to understand what you are trying to do in the SELECT clause. Sure,
both queries will run without a problem because indentation doesn’t matter in SQL. But
proper indentation is still important to keep lines short. And it will be valued by anyone
reading your query, including yourself!

Multi-line comments
If you make comments that take up multiple lines, you can use -- for each line. Or, if you
have more than two lines of comments, it might be cleaner and easier to use /* to start
the comment and */ to close the comment. For example, you can use the -- method like
below:

-- Date: September 15, 2020
-- Analyst: Jazmin Cisneros
-- Goal: Count the number of rows in the table
SELECT
  COUNT(*) AS number_of_rows -- the * stands for all so count all
FROM
  table
Or, you can use the /* */ method like below:

/*
Date: September 15, 2020
Analyst: Jazmin Cisneros
Goal: Count the number of rows in the table
*/
SELECT
  COUNT(*) AS number_of_rows -- the * stands for all so count all
FROM
  table
In SQL, it doesn’t matter which method you use. SQL ignores comments regardless of
what you use: #, --, or /* and */. So it is up to you and your personal preference. The /* and
*/ method for multi-line comments usually looks cleaner and helps separate the
comments from the query. But there isn’t one right or wrong method.

SQL text editors


When you join a company, you can expect each company to use their own SQL platform
and SQL dialect. The SQL platform they use (e.g., BigQuery, MySQL, or SQL Server) is
where you will write and run your SQL queries. But keep in mind that not all SQL platforms
provide native script editors to write SQL code. SQL text editors give you an interface
where you can write your SQL queries in an easier and color-coded way. In fact, all of the
code we have been working with so far was written with an SQL text editor!

Examples with Sublime Text


If your SQL platform doesn’t have color coding, you might want to think about using a text
editor like Sublime Text or Atom. This section shows how SQL is displayed in Sublime
Text. Here is a query in Sublime Text:
With Sublime Text, you can also do advanced editing like deleting indents across multiple
lines at the same time. For example, suppose your query somehow had indents in the
wrong places and looked like this:

This is really hard to read, so you will want to eliminate those indents and start over. In a
regular SQL platform, you would have to go into each line and press BACKSPACE to delete
each indent per line. But in Sublime, you can get rid of all the indents at the same time by
selecting all lines and pressing Command (or CTRL in Windows) + [. This eliminates
indents from every line. Then you can select the lines that you want to indent (i.e., lines 2,
4, and 6) by pressing the Command key (or the CTRL key in Windows) and selecting those
lines. Then while still holding down the Command key (or the CTRL key in Windows),
press ] to indent lines 2, 4, and 6 at the same time. This will clean up your query and make
it look like this instead:
Sublime Text also supports regular expressions. Regular expressions (or regex)
can be used to search for and replace string patterns in queries. We won’t cover regular
expressions here, but you might want to learn more about them on your own because
they are a very powerful tool.

You can begin with these resources:

 Search and replace in Sublime Text


 Regex tutorial (if you don’t know what regular expressions are)
 Regex cheat sheet

ETL versus ELT


So far in this course, you have learned about ETL pipelines that extract, transform, and load data
between database storage systems. You have also started learning about newer pipeline systems like
ELT pipelines that extract, load, and then transform data. In this reading, you are going to learn more
about the differences between these two systems and the ways different types of database storage fit
into those systems. Understanding these differences will help you make key decisions that promote
performance and optimization to ensure that your organization’s systems are efficient and effective.

The primary difference between these two pipeline systems is the order in which they transform and
load data. There are also some other key differences in how they are constructed and used:

The order of extraction, transformation, and loading data
ETL: Data is extracted, transformed in a staging area, and loaded into the target system.
ELT: Data is extracted, loaded into the target system, and transformed as needed for analysis.

Location of transformations
ETL: Data is moved to a staging area where it is transformed before delivery.
ELT: Data is transformed in the destination system, so no staging area is required.

Age of the technology
ETL: ETL has been used for over 20 years, and many tools have been developed to support ETL pipeline systems.
ELT: ELT is a newer technology with support tools built in to existing technology.

Access to data within the system
ETL: ETL systems only transform and load the data designated when the warehouse and pipeline are constructed.
ELT: ELT systems load all of the data, allowing users to choose which data to analyze at any time.

Calculations
ETL: Calculations executed in an ETL system replace or revise existing columns in order to push the results to the target table.
ELT: Calculations are added directly to the existing dataset.

Compatible storage systems
ETL: ETL systems are typically integrated with structured, relational data warehouses.
ELT: ELT systems can ingest unstructured data from sources like data lakes.

Security and compliance
ETL: Sensitive information can be redacted or anonymized before loading it into the data warehouse, which protects data.
ELT: Data has to be uploaded before it can be anonymized, making it more vulnerable.

Data size
ETL: ETL is great for dealing with smaller datasets that need to undergo complex transformations.
ELT: ELT is well-suited to systems using large amounts of both structured and unstructured data.

Wait times
ETL: ETL systems have longer load times, but analysis is faster because data has already been transformed when users access it.
ELT: Data loading is very fast in ELT systems because data can be ingested without waiting for transformations to complete, but analysis is slower.

Data storage systems


Because ETL and ELT systems deal with data in slightly different ways, they are optimized to work with
different data storage systems. Specifically, you might encounter data warehouses and data lakes. As a
refresher, a data warehouse is a type of database that consolidates data from multiple source systems
for data consistency, accuracy, and efficient access. And a data lake is a database system that stores
large amounts of raw data in its original format until it’s needed. While these two systems perform the
same basic function, there are some key differences:

Data warehouse Data lake


Data is raw and unprocessed until it is needed for analy
Data has already been processed and
additionally, it can have a copy of the entire OLTP or rel
stored in a relational system
database
The data’s purpose has already been
The data’s purpose has not been determined yet
assigned, and the data is currently in use
Making changes to the system can be Systems are highly accessible and easy to update
Data warehouse Data lake
complicated and require a lot of work
There is also a specific type of data warehouse you might use as a data source: data marts. Data marts
are very similar to data warehouses in how they are designed, except that they are much smaller.
Usually, a data mart is a single subset of a data warehouse that covers data about a single subject.

Key takeaways
Currently, ETL systems that extract, transform and load data, and ELT systems that extract, load, and
then transform data are common ways that pipeline systems are constructed to move data where it
needs to go. Understanding the differences between these systems can help you recognize when you
might want to implement one or the other. And, as business and technology change, there will be a lot
of opportunities to engineer new solutions using these data systems to solve business problems.

A guide to the five factors of database


performance
Database performance is an important consideration for BI professionals. As you have been learning,
database performance is a measure of the workload that can be processed by the database, as well as
associated costs. Optimization involves maximizing the speed and efficiency that data is retrieved in
order to ensure high levels of database performance. This means that your stakeholders have the
fastest access to the data they need to make quick and intelligent decisions. You have also been
learning that there are five factors of database performance: workload, throughput, resources,
optimization, and contention.

The five factors


In this reading, you will be given a quick overview of the five factors that you can reference at any time
and an example to help outline these concepts. In the example, you are a BI professional working with
the sales team to gain insights about customer purchasing habits and monitor the success of current
marketing campaigns.

Factor Definition Example


The combination of transactions, On a daily basis, your database needs to p
queries, data warehousing analysis, sales reports, perform revenue calculation
Workload and system commands being respond to real-time requests from stakeh
processed by the database system at of these needs represent the workload the
any given time. needs to be able to handle.
Throughput The overall capability of the The system’s throughput is the combinatio
database’s hardware and software to and output speed, the CPU speed, the mac
process requests. ability to run parallel processes, the datab
management system, and the operating sy
Factor Definition Example
system software.
The hardware and software tools The database system is primarily cloud-ba
Resources available for use in a database means it depends on online resources and
system. to maintain functionality.
Maximizing the speed and efficiency
Continually checking that the database is
with which data is retrieved in order
Optimization optimally is part of your job as the team's
to ensure high levels of database
professional.
performance.
Because this system automatically genera
When two or more components and responds to user-requests, there are t
Contention attempt to use a single resource in a it may be trying to run the queries on the s
conflicting way. datasets at the same time, causing slowdo
users.

Indexes, partitions, and other ways to


optimize
Optimization for data reading
One of the continual tasks of a database is reading data. Reading is the process of interpreting and
processing data to make it available and useful to users. As you have been learning, database
optimization is key to maximizing the speed and efficiency with which data is retrieved in order to
ensure high levels of database performance. Optimizing reading is one of the primary ways you can
improve database performance for users. Next, you will learn more about different ways you can
optimize your database to read data, including indexing and partitioning, queries, and caching.

Indexes
Sometimes, when you are reading a book with a lot of information, it will include an index at the back
of the book where that information is organized by topic with page numbers listed for each reference.
This saves you time if you know what you want to find– instead of flipping through the entire book, you
can go straight to the index, which will direct you to the information you need.

Indexes in databases are basically the same– they use the keys from the database tables to very quickly
search through specific locations in the database instead of the entire thing. This is why they’re so
important for database optimization– when users run a search in a fully indexed database, it can return
the information so much faster. For example, a table with columns ID, Name, and Department could
use an index with the corresponding names and IDs.
Now the database can easily locate the names in the larger table quickly for searches using those IDs
from the index.

Partitions
Data partitioning is another way to speed up database retrieval. There are two types of partitioning:
vertical and horizontal. Horizontal partitioning is the most common, and involves designing the
database so that rows are organized by logical groupings instead of stored in columns. The different
rows are stored in different tables– this reduces the index size and makes it easier to write and retrieve
data from the database.

Instead of creating an index table to help the database search through the data faster, partitions split
larger, unwieldy tables into much more manageable, smaller tables.

In this example, the larger sales table is broken down into smaller tables– these smaller tables are
easier to query because the database doesn’t need to search through as much data at one time.

Other optimization methods


In addition to making your database easier to search through with indexes and partitions, you can also
optimize your actual searches for readability or use your system’s cached memory to save time
retrieving frequently used data.

Queries
Queries are requests for data or information from a database. In many cases, you might have a
collection of queries that you run regularly; these might be automated queries that generate reports, or
regular searches made by users.

If these queries are not optimized, they can take a long time to return results to users and take up
database resources in general. There a few things you can do to optimize queries:

1. Consider the business requirements: Understanding the business requirements


can help you determine what information you really need to pull from the database and avoid
putting unnecessary strain on the system by asking for data you don’t actually need.
2. Avoid using SELECT* and SELECT DISTINCT: Using SELECT* and SELECT
DISTINCT causes the database to have to parse through a lot of unnecessary data. Instead, you
can optimize queries by selecting specific fields whenever possible.
3. Use INNER JOIN instead of subqueries: Using subqueries causes the database to
parse through a large number of results and then filter them, which can take more time than
simply JOINing tables in the first place.
Additionally, you can use pre-aggregated queries to increase database read functionality. Basically,
pre-aggregating data means assembling the data needed to measure certain metrics in tables so that
the data doesn’t need to be re-captured every time you run a query on it.

If you’re interested in learning more about optimizing queries, you can check out Devart’s article on
SQL Query Optimization.

Caching
Finally, the cache can be a useful way to optimize your database for readability. Essentially, the cache
is a layer of short-term memory where tables and queries can be stored. By querying the cache instead
of the database system itself, you can actually save on resources. You can just take what you need from
the memory.

For example, if you often access the database for annual sales reports, you can save those reports in
the cache and pull them directly from memory instead of asking the database to generate them over
and over again.

Key takeaways
This course has focused a lot on database optimization and how you, as a BI professional, can ensure
that the systems and solutions you build for your team continue to function as efficiently as possible.
Using these methods can be a key way for you to promote database speed and availability as team
members access the database system. And coming up, you’re going to have opportunities to work with
these concepts yourself!

Activity Exemplar: Partition data and create


indexes in BigQuery
Here is a completed exemplar along with an explanation of how the exemplar fulfills the expectations
for the activity.
Assessment of Exemplar

Compare the exemplar to your completed activity. Review your work using each of the criteria in the
exemplar. What did you do well? Where can you improve? Use your answers to these questions to guide
you as you continue to progress through the course.

In the previous activity, you ran SQL code that created tables with partitions and indexes. Partitions
and indexes help you create shortcuts to specific rows and divide large datasets into smaller, more
manageable tables. By creating partitions and indexes, you can build faster and more efficient
databases, making it easier to pull data when you need to analyze or visualize it.

After creating the tables, you ran queries on those tables to compare their performance and
demonstrate how useful partitions and indexes can be.

At each step, you took screenshots of the Details or Execution Details pane to compare to the
following exemplar images. This will help you ensure that you completed the activity properly. It will
also explain the context of why the tables you created and the queries you ran differ from each other.
By the end of this reading, you will understand how this activity demonstrates that partitions and
clusters speed up queries and optimize database performance.

Note that the answers for these queries might differ depending on whether you’re using the sandbox or
free trial/full version of BigQuery. The sandbox version might not read the full dataset, so the table size
you receive might not match the results you would get from the query in the full version. This reading
explains the results for both the sandbox and the full version so you can check your work regardless of
how you’re using BigQuery.

Explore the exemplar


Table details
This is the Details pane for the table you created without partitions or indexes. It simply describes
the table size (4.37MB of logical and active bytes) and the number of rows (41,025).
This is the Details pane for the table you created with a partition. It has the same details as the first
details pane, but it also includes details about table type (partitioned), as well as the field on which the
partition was created (year).

The sandbox limitations mean that this table won’t have a size, but it will still be created with the
query. The table size is 0B and there is a section that includes “Table Type: Partitioned,” “Partitioned
by: Integer Range,” “Partitioned on field: year,” and “Partition filter: Not required.” The partition range
start (2015), end (2022) and interval is also shown. The full version has a table size of 4.37MB, but it has
the same additional partition section.
This is the Details pane for the table you created with a partition and clusters. It has the same details
as the two previous images, but also includes that you clustered the data by the type column. The
sandbox version of the table might not have a table size, but the full version has 4.37MB total logical
and active bytes.
Execution details
Then, the Execution Details panes compare the query performance for each table. In the
sandbox, these details won’t appear for queries on the partitioned and partitioned and clustered
tables. If you’re using the sandbox, take note of the screenshots in the section.

Note: The Working timing section on your screen might vary in color or duration. Your SQL query
might take longer or shorter to run depending on differing BigQuery engine server speeds. Your screen
might not match the following screenshot, but the records read and records written should match with
the Rows section.

This is the Execution Details pane for the query on the table you created without partitions or
clusters. The number of rows read is the total number of the rows on the table. You’ll find this in the
S00:Input section, where Records read: 41,025 and Records written: 3.

This is the Execution Details pane for the query on the table you created partitioned by an
integer range. You’ll notice that the number of records read is less. Now, Records read: 16,953 and
Records written: 3. In this query, the database processes only the records from the partitions filtered by
the where clause (type). When choosing a column to partition on, it is most effective to choose one that
would frequently be used in the where clause.
This is the Execution Details pane for the query on the table you created that is clustered by the
type column. Records read: 16,953 and Records written: 3. Typically, a query on the clustered table
would process fewer records than the partitioned one. However, the dataset you are using in this
activity is too small to properly demonstrate that difference. In other projects, you might find that
clustering a table leads to a significant decrease in records read.

Key takeaways
This activity demonstrates the impact of using partitions and indexes (known as clusters in BigQuery)
in database tables. You can use them to optimize query performance and minimize processing costs. In
this exercise, applying partitions and clustering means that BigQuery can break all 41,025 records into
smaller, more manageable tables to read. The benefits of partitioning will be even more evident with
larger datasets. Use this technique to optimize database performance in your future projects.

--
Case study: Deloitte - Optimizing outdated
database systems
In this part of the course, you have been learning about the importance of database optimization. This
basically means maximizing the speed and efficiency with which data is retrieved in order to ensure
high levels of database performance. Part of a BI professional’s job is examining resource-use and
identifying better data sources and structures. In this case study, you will have an opportunity to
explore an example of how the BI team at Deloitte handled optimization when they discovered their
current database was difficult for users to query.

Company background
Deloitte collaborates with independent firms to provide audit and assurance, consulting, risk and
financial advisory, risk management, tax, and related services to select clients. Deloitte’s brand vision
is to be a standard of excellence within the field and to uphold their brand values as they develop
leading strategies and cutting edge tools for clients to facilitate their business. These values include
integrity, providing outstanding value to clients, commitment to community, and strength from
cultural diversity.

The challenge
Because of the size of the company and their ever-evolving data needs, the database grew and
changed to match current problems without time to consider long-term performance. Because of this,
the database eventually grew into a collection of unmanaged tables without clear joins or consistent
formatting. This made it difficult to query the data and transform it into information that could
effectively guide decision making.

The need for optimization appeared gradually as the team had to continually track data and had to
repeatedly test and prove the validity of the data. With a newly optimized database, the data could
more easily be understood, trusted, and used to make effective business decisions.

Primarily, this database contained marketing and financial data that would ideally be used to connect
marketing campaigns and sales leads to evaluate which campaigns were successful. But because of the
current state of the database, there was no clear way to tie successes back to specific marketing
campaigns and evaluate their financial performance. The biggest challenge to this initiative was
programming the external data sources to feed data directly into the new database, rather than into
the previous tables that were scheduled to be deprecated. Additionally, the database design needed
to account for tables that represented the lifecycle of the data and designed with joins that could easily
and logically support different data inquiries and requests.

The approach
Because of the scale of the project and the specific needs of the organization, the BI team decided to
design their own database system that they could implement across the entire organization. That way,
the architecture of the database would really capture their data needs and connect tables thoughtfully
so they were easier to query and use.

For example, the team wanted to be able to easily connect the initial estimate of a marketing
campaign’s financial success with its ending value and how well internal processes were able to predict
the success of a campaign. Increases from the initial estimate were good, but if estimates were
frequently much higher than actual outcomes, it could indicate an issue with the tooling used to
develop those estimates. But in the database’s current state, there were dozens of tables across
accounting groups that were creating access issues that were preventing these insights from being
made. Also, the different accounting groups had a lot of overlap that the team hoped to more
thoughtfully structure for more long-term use.

To achieve these goals, the team strategized the architecture, developed checkpoints to determine if
the required functionality could be developed and eventually scaled up, and created an iterative
system wherein regular updates to the database system could be made to continue refining it moving
forward.

In order to consider the database optimization project a success, the BI team wanted to address the
following issues:

 Were the necessary tables and columns consolidated in a more useful way?
 Did the new schema and keys address the needs of analyst teams?
 Which tables were being queried repeatedly and were they accessible and logical?
 What sample queries could promote confidence in the system for users?
A variety of partners and stakeholders had to be involved in the optimization project because so many
users across the organization would be affected. The database administrators and engineers working
with the BI team were particularly key for this project because they led the design and creation of the
database, mapped the life cycle of the data as it matured and changed over time and used that as a
framework to construct a logical data-flow design.

These engineers then conducted interviews with various stakeholders to understand the business
requirements for teams across the entire organization, trained a team of analysts on the new system,
and deprecated the old tables that weren’t working.

The results
Deloitte’s BI team recognized that, while the database had been continually updated to address
evolving business needs, it had grown harder to manage over time. In order to promote greater
database performance and ensure their database could meet their needs, the BI team collaborated
with database engineers and administrators to design a custom database architecture that
thoughtfully addressed the business needs of the organization. For example, the new database
structure helped build connections between tables tracking marketing campaigns over time and their
successes, including revenue data and regional locations.

This database optimization effort had a lot of benefits. The greatest benefit was the organization’s
ability to trust their data–the analyst team didn’t have to spend as much time validating the data
before use because the tables were now organized and joined in more logical ways. The new
architecture also promoted simpler queries. Instead of having to write hundreds of lines of
complicated code to return simple answers, the new database was optimized for simpler, shorter
queries that took less time to run.

This provided benefits for teams across the organization:

 The marketing team was able to get better feedback on the value created by specific
campaigns.
 The sales team could access specific information about their regions and territories, giving
them insights about possible weaknesses and opportunities for expansion.
 The strategy team was able to bridge the gap between the marketing and sales teams, and
ultimately create actionable OKRs (Objectives and Key Results) for the future.
However, as you have been learning, database optimization is an iterative process. The BI team’s work
didn’t stop once they implemented the new database design. They also designated a team to oversee
data governance to ensure the quality of the data and prevent the same problems from happening
again. This way, the data remains organized and also this team can continue refining the developed
databases based on evolving business needs.

Conclusion
The databases where your organization stores their data are a key part of the BI processes and tools
you create–if the database isn’t performing well, it will affect your entire organization and make it more
difficult to provide stakeholders with the data they need to make intelligent business decisions.
Optimizing your database promotes high performance and a better user experience for everyone on
your team.

Determine the most efficient query


So far, you have learned about the factors that affect database performance and database
query optimization. This is an important part of a BI professional’s work because it allows
them to ensure that the tools and systems their team is using are as efficient as possible.
Now that you’re more familiar with these concepts, you’ll review an example.

The scenario
Francisco’s Electronics recently launched its home office product line on its e-commerce
site. After receiving a positive response from customers, company decision-makers chose
to add the rest of their products to the site. Since launch, they have received more than
10,000,000 sales. While this is great for the business, such a massive catalog of sales
records has affected the speed of their database. The sales manager, Ed, wanted to run a
query for the number of sales created after November 1, 2021, in the “electronics”
category but was unable to because the database was too slow. He asked Xavier, a BI
analyst, to work with the database and optimize a query to speed up the sales report
generation.

To begin, Xavier examined the sales_warehouse database schema shown below. The
schema contains different symbols and connectors that represent two important pieces of
information: the major tables within the system and the relationships among these tables.
The sales_warehouse database schema contains five tables—Sales, Products, Users,
Locations, and Orders—which are connected via keys. The tables contain five to eight
columns (or attributes) ranging in data type. The data types include varchar or char (or
character), integer, decimal, date, text (or string), timestamp, and bit.

The foreign keys in the Sales table link to each of the other tables. The “product_id”
foreign key links to the Products table, the “user_id” foreign key links to the Users table,
the “order_id” foreign key links to the Orders table, and the “shipping_address_id” and
“billing_address_id” foreign keys link to the Locations table.

Examining the SQL query


After considering the details, Xavier found that the following request needed optimization:
Optimizing the query
To make this query more efficient, Xavier started by checking if the "date" and "category"
fields were indexed. He did this by running the following queries:
Without indexes in the columns used for query restrictions, the engine did a full table scan
that processed all several million records and checked which ones had date >= “2021-11-
01” and category = “electronics.”

Then, he indexed the “date” field in the Sales table and the “category” field in the
Products table using the following SQL code:

Unfortunately, the query was still slow, even after adding the indices. Assuming that there
were only a few thousand sales created after “2021-11-01,” the query still created a very
large virtual table (joining Sales and Products). It had millions of records before filtering
out sales with a date after “2021-11-01” and sales with products in “electronics”
categories. This resulted in an inefficient and slow query.

To make the query faster and more efficient, Xavier modified it to first filter out the sales
with dates after “2021-11-01.” The query “(SELECT product_id FROM Sales WHERE
date > '2021-11-01') AS oi” returned only a few thousand records, rather than millions of
records. He then joined these records with the Products table.
Xavier’s final optimized query was:

Key takeaways
Optimizing queries will make your pipeline operations faster and more efficient. In your
role as a BI professional, you might work on projects with extremely large datasets. For
these projects, it’s important to write SQL queries that are as fast and efficient as possible.
Otherwise, your data pipelines might be slow and difficult to work with.

Glossary terms from week 2


Contention: When two or more components attempt to use a single resource in a
conflicting way

Data partitioning: The process of dividing a database into distinct, logical parts in
order to improve query processing and increase manageability
Database performance: A measure of the workload that can be processed by a
database, as well as associated costs

ELT (extract, load, and transform): A type of data pipeline that enables data
to be gathered from data lakes, loaded into a unified destination system, and transformed
into a useful format

Fragmented data: Data that is broken up into many pieces that are not stored
together, often as a result of using the data frequently or creating, deleting, or modifying
files

Index: An organizational tag used to quickly locate data within a database system

Optimization: Maximizing the speed and efficiency with which data is retrieved in
order to ensure high levels of database performance

Query plan: A description of the steps a database system takes in order to execute a
query

Resources: The hardware and software tools available for use in a database system

Subject-oriented: Associated with specific areas or departments of a business

Throughput: The overall capability of the database’s hardware and software to


process requests

Workload: The combination of transactions, queries, data warehousing analysis, and


system commands being processed by the database system at any given time

Terms and definitions from previous weeks


A
Application programming interface (API): A set of functions and procedures
that integrate computer programs, forming a connection that enables them to
communicate

Applications software developer: A person who designs computer or mobile


applications, generally for consumers

Attribute: In a dimensional model, a characteristic or quality used to describe a


dimension

B
Business intelligence (BI): Automating processes and information channels in
order to transform relevant data into actionable insights that are easily available to
decision-makers

Business intelligence governance: A process for defining and implementing


business intelligence systems and frameworks within an organization

Business intelligence monitoring: Building and using hardware and software


tools to easily and rapidly analyze data and enable stakeholders to make impactful
business decisions

Business intelligence stages: The sequence of stages that determine both BI


business value and organizational data maturity, which are capture, analyze, and monitor

Business intelligence strategy: The management of the people, processes, and


tools used in the business intelligence process

C
Columnar database: A database organized by columns instead of rows

Combined systems: Database systems that store and analyze data in the same
place

Compiled programming language: A programming language that compiles


coded instructions that are executed directly by the target machine

D
Data analysts: People who collect, transform, and organize data

Data availability: The degree or extent to which timely and relevant information is
readily accessible and able to be put to use

Data governance professionals: People who are responsible for the formal
management of an organization’s data assets

Data integrity: The accuracy, completeness, consistency, and trustworthiness of


data throughout its life cycle

Data lake: A database system that stores large amounts of raw data in its original
format until it’s needed

Data mart: A subject-oriented database that can be a subset of a larger data


warehouse

Data maturity: The extent to which an organization is able to effectively use its data
in order to extract actionable insights
Data model: A tool for organizing data elements and how they relate to one another

Data pipeline: A series of processes that transports data from different sources to
their final destination for storage and analysis

Data visibility: The degree or extent to which information can be identified,


monitored, and integrated from disparate internal and external sources

Data warehouse: A specific type of database that consolidates data from multiple
source systems for data consistency, accuracy, and efficient access

Data warehousing specialists: People who develop processes and procedures


to effectively store and organize data

Database migration: Moving data from one source platform to another target
database

Deliverable: Any product, service, or result that must be achieved in order to


complete a project

Developer: A person who uses programming languages to create, execute, test, and
troubleshoot software applications

Dimension (data modeling): A piece of information that provides more detail


and context regarding a fact

Dimension table: The table where the attributes of the dimensions of a fact are
stored

Design pattern: A solution that uses relevant measures and facts to create a model
in support of business needs

Dimensional model: A type of relational model that has been optimized to quickly
retrieve data from a data warehouse

Distributed database: A collection of data systems distributed across multiple


physical locations

E
ETL (extract, transform, and load): A type of data pipeline that enables data
to be gathered from source systems, converted into a useful format, and brought into a
data warehouse or other unified destination system

Experiential learning: Understanding through doing

F
Fact: In a dimensional model, a measurement or metric

Fact table: A table that contains measurements or metrics related to a particular event

Foreign key: A field within a database table that is a primary key in another table
(Refer to primary key)

Functional programming language: A programming language modeled


around functions

G
Google DataFlow: A serverless data-processing service that reads data from the
source, transforms it, and writes it in the destination location

I
Information technology professionals: People who test, install, repair,
upgrade, and maintain hardware and software solutions

Interpreted programming language: A programming language that uses an


interpreter, typically another program, to read and execute coded instructions

Iteration: Repeating a procedure over and over again in order to keep getting closer to
the desired result

K
Key performance indicator (KPI): A quantifiable value, closely linked to
business strategy, which is used to track progress toward a goal

L
Logical data modeling: Representing different tables in the physical data model

M
Metric: A single, quantifiable data point that is used to evaluate performance

O
Object-oriented programming language: A programming language modeled
around data objects
OLAP (Online Analytical Processing) system: A tool that has been
optimized for analysis in addition to processing and can analyze data from multiple
databases

OLTP (Online Transaction Processing) database: A type of database that


has been optimized for data processing instead of analysis

P
Portfolio: A collection of materials that can be shared with potential employers

Primary key: An identifier in a database that references a column or a group of


columns in which each row uniquely identifies each record in the table (Refer to foreign
key)

Project manager: A person who handles a project’s day-to-day steps, scope,


schedule, budget, and resources

Project sponsor: A person who has overall accountability for a project and
establishes the criteria for its success

Python: A general purpose programming language

R
Response time: The time it takes for a database to complete a user request

Row-based database: A database that is organized by rows

S
Separated storage and computing systems: Databases where data is
stored remotely, and relevant data is stored locally for analysis

Single-homed database: Database where all of the data is stored in the same
physical location

Snowflake schema: An extension of a star schema with additional dimensions and,


often, subdimensions

Star schema: A schema consisting of one fact table that references any number of
dimension tables

Strategy: A plan for achieving a goal or arriving at a desired future state


Systems analyst: A person who identifies ways to design, implement, and advance
information systems in order to ensure that they help make it possible to achieve business
goals

Systems software developer: A person who develops applications and


programs for the backend processing systems used in organizations

T
Tactic: A method used to enable an accomplishment

Target table: The predetermined location where pipeline data is sent in order to be
acted on

Transferable skill: A capability or proficiency that can be applied from one job to
another

V
Vanity metric: Data points that are intended to impress others, but are not indicative
of actual performance and, therefore, cannot reveal any meaningful business insights

Seven elements of quality testing


In this part of the course, you have been learning about the importance of quality testing in your
ETL system. This is the process of checking data for defects in order to prevent system failures.
Ideally, your pipeline should have checkpoints built-in that identify any defects before they arrive
in the target database system. These checkpoints ensure that the data coming in is already clean
and useful! In this reading, you will be given a checklist for what your ETL quality testing should
be taking into account.

When considering what checks you need to ensure the quality of your data as it moves through
the pipeline, there are seven elements you should consider:

 Completeness: Does the data contain all of the desired components or measures?
 Consistency: Is the data compatible and in agreement across all systems?
 Conformity: Does the data fit the required destination format?
 Accuracy: Does the data conform to the actual entity being measured or described?
 Redundancy: Is only the necessary data being moved, transformed, and stored for
use?
 Timeliness: Is the data current?
 Integrity: Is the data accurate, complete, consistent, and trustworthy? (Integrity is
influenced by the previously mentioned qualities.)
Common issues
There are also some common issues you can protect against within your system to ensure the
incoming data doesn’t cause errors or other large-scale problems in your database system:

 Check data mapping: Does the data from the source match the data in the target
database?
 Check for inconsistencies: Are there inconsistencies between the source system
and the target system?
 Check for inaccurate data: Is the data correct and does it reflect the actual entity
being measured?
 Check for duplicate data: Does this data already exist within the target system?
To address these issues and ensure your data meets all seven elements of quality testing, you
can build intermediate steps into your pipeline that check the loaded data against known
parameters. For example, to ensure the timeliness of the data, you can add a checkpoint that
determines if that data matches the current date; if the incoming data fails this check, there’s an
issue upstream that needs to be flagged. Considering these checks in your design process will
ensure your pipeline delivers quality data and needs less maintenance over time.

Key takeaways
One of the great things about BI is that it gives us the tools to automate certain processes that
help save time and resources during data analysis– building quality checks into your ETL
pipeline system is one of the ways you can do this! Making sure you are already considering the
completeness, consistency, conformity, accuracy, redundancy, integrity, and timeliness of the
data as it moves from one system to another means you and your team don’t have to check the
data manually later on.

Monitor data quality with SQL


As you’ve learned, it is important to monitor data quality. By monitoring your data, you
become aware of any problems that may occur within the ETL pipeline and data warehouse
design. This can help you address problems as early as possible and avoid future problems.

In this reading, you’ll follow a fictional scenario where a BI engineer performs quality testing
on their pipeline and suggests SQL queries that one could use for each step of testing.

The scenario
At Francisco’s Electronics, an electronics manufacturing company, a BI engineer named
Sage designed a data warehouse for analytics and reporting. After the ETL process design,
Sage created a diagram of the schema.
The diagram of the schema of the sales_warehouse database contains different
symbols and connectors that represent two important pieces of information: the major tables
within the system and the relationships among these tables.

The sales_warehouse database schema contains five tables:

 Sales
 Products
 Users
 Locations
 Orders

These tables are connected via keys. The tables contain five to eight columns (or attributes)
ranging in data type. The data types include varchar or char (or character), integer, decimal,
date, text (or string), timestamp, and bit.

The foreign keys in the Sales table link to each of the other tables:

 The “product_id” foreign key links to the Products table


 The “user_id” foreign key links to the Users table
 The “order_id” foreign key links to the Orders table
 The “shipping_address_id” and “billing_address_id” foreign keys link to the
Locations table

After Sage made the sales_warehouse database, the development team made changes
to the sales site. As a result, the original OLTP database changed. Now, Sage needs to ensure
the ETL pipeline works properly and that the warehouse data matches the original OLTP
database.

Sage used the original OLTP schema from the store database to design the warehouse.

The store database schema also contains five tables—Sales, Products, Users, Locations,
and Orders—which are connected via keys. The tables contain four to eight columns ranging
in data type. The data types include varchar or char, integer, decimal, date, text, timestamp,
bit, tinyint, and datetime.

Every table in the store database has an id field as a primary key. The database contains
the following tables:

 The Sales table has price, quantity, and date columns. It references a user who made
a sale (UserId), purchased a product (ProductId), and a related order
(OrderId). Also, it references the Locations table for shipping and billing
addresses (ShippingAddressId and BillingAddressId, respectively).
 The Users table has FirstName, LastName, Email, Password, and
other user-related columns.
 The Locations table contains address information (Address1, Address2,
City, State, and Postcode).
 The Products table has Name, Price, InventoryNumber, and
Category of products.
 The Orders table has OrderNumber and purchase information (Subtotal,
ShippingFee, Tax, Total, and Status).

Using SQL to find problems


Sage compared the sales_warehouse database to the original store database to check
for completeness, consistency, conformity, accuracy, redundancy, integrity, and timeliness.
Sage ran SQL queries to examine the data and identify quality problems. Then Sage prepared
the following table of lists, which include the types of quality issues found, the quality
strategies that were violated, the SQL codes used to find the issues, and specific descriptions
of the issues.
Quality testing sales_warehouse

Tested Quality
SQL query Sage’s observation
quality strategy
Is the data
accurate,
In the sales_warehouse
complete, SELECT * FROM
Integrity database, the order with ID 7 has
consistent, Orders
the incorrect total value.
and
trustworthy?
Does the The Locations table of the
data contain sales_warehouse database
all of the has an extra address. In the store
SELECT COUNT(*)
Completeness desired database there are 60 records,
FROM Locations
components whereas the
or sales_warehouse database
measures? table has 61.
Is the data
compatible Several users within the
and in SELECT Phone FROM sales_warehouse database
Consistency
agreement Users have phones without the "+"
across all prefix.
systems?
The location ZIP code for the
record with ID 6 in the
Does the sales_warehouse database
SELECT id, postcode
data fit the is 722434213, which is wrong.
FROM
Conformity required The United States postal code
sales_warehouse.Loc
destination contains either five digits or five
ations
format? digits followed by a hyphen (dash)
and another four digits (e.g.,
12345-1234).

Quality testing store

Quality SQL
Feature Sage’s Observation
Strategy query
Users.Status from the store database and
Is the data Users.is_active from the
accurate, DESCRI sales_warehouse database seem to be related
Integrity complete, BE fields. However, it is not obvious how the Status
consistent, and Users column is transformed into the is_active boolean
trustworthy? column. Is it possible that with a new status value,
the ETL pipeline will fail?
Is the data Products.Inventory from the store
DESCRI
compatible and database has the varchar type instead of the
BE
Consistency in agreement int(10) in the sales_warehouse database
Product
across all Products.inventory field. This can be a
s
systems? problem if there is a value with characters.
Quality SQL
Feature Sage’s Observation
Strategy query
Does the data The data type of Sales.Date in the store
conform to the DESCRI database is different from its data type in
Accuracy actual entity BE sales_warehouse (date vs datetime). It
being measured Sales might not be a problem if time is not important for
or described? the sales_warehouse database fact table.
Is only the
necessary data The table Sales from the sales_warehouse
DESCRI
being moved, database has a unique index constraint on
Redundancy BE
transformed, OrderId, ProductId, UserId columns. It can
Sales
and stored for be added to the warehouse schema.
use?

Key takeaways
Testing data quality is an essential skill of a BI professional that ensures good analytics and
reporting. Just as Sage does in this example, you can use SQL commands to examine BI
databases and find potential problems. The sooner you know the problems in your system, the
sooner you can fix them and improve your data quality.

Sample data dictionary and data


lineage
As you have been learning in this course, business intelligence professionals have three primary
tools to help them ensure conformity from source to destination: schema validation, data
dictionaries, and data lineages. In this reading, you’re going to explore some examples of data
dictionaries and lineages to get a better understanding of how these items work.

Data dictionaries
A data dictionary is a collection of information that describes the content, format, and structure of
data objects within a database, as well as their relationships. This can also be referred to as a
metadata repository because data dictionaries use metadata to define the use and origin of other
pieces of data. Here’s an example of a product table that exists within a sales database:

Product Table

Item_I Departmen Number_of_Sal Number_in_Stoc Season


Price
D t es k al
47257 $33.00 Gardening 744 598 Yes
39496 $82.00 Home Decor 383 729 Yes
73302 $56.00 Furniture 874 193 No
16507 $100.00 Home Office 310 559 Yes
1232 $125.00 Party Supplies 351 517 No
3412 $45.00 Gardening 901 942 No
54228 $60.00 Party Supplies 139 520 No
Item_I Departmen Number_of_Sal Number_in_Stoc Season
Price
D t es k al
66415 $38.00 Home Decor 615 433 Yes
78736 $12.00 Grocery 739 648 No
34369 $28.00 Gardening 555 389 Yes
This table is actually the final target table for data gathered from multiple sources. It's important
to ensure consistency from the sources to the destination because this data is coming from
different places within the system. This is where the data dictionary comes in:

Data dictionary

Data
Name Definition
Type
Item_ID ID number assigned to all product items in-store Integer
Price Current price of product item Integer
Department Which department the product item belongs to Character
Number_of_Sales The current number of product items sold Integer
Number_in_Stock The current number of product items in stock Integer
Seasonal Whether or not the product item is only seasonally available Boolean
You can use the properties outlined in the dictionary to compare incoming data to the destination
table. If any data objects don’t match the entries in the dictionary, then the data validation will flag
the error before the incorrect data is ingested.

For example, if incoming data that is being delivered to the Department column contains
numerical data, you can quickly identify that there has been an error before it gets delivered
because the data dictionary states Department data should be character-type.

Data lineages
A data lineage describes the process of identifying the origin of data, where it has moved
throughout the system, and how it has transformed over time. This can be really helpful for BI
professionals, because when they do encounter an error, they can actually track it to the source
using the lineage. Then, they can implement checks to prevent the same issue from occuring
again.

For example, imagine your system flagged an error with some incoming data about the number
of sales for a particular item. It can be hard to find where this error occurred if you don’t know the
lineage of that particular piece of data– but by following that data’s path through your system,
you can figure out where to build a check.
By tracking the sales data through its life cycle in the system, you find that there was an issue
with the original database it came from and that data needs to be transformed before it’s
ingested into later tables.

Key takeaways
Tools such as data dictionaries and data lineages are useful for preventing inconsistencies as
data is moved from source systems to its final destination. It is important that users accessing
and using that data can be confident that it is correct and consistent. This depends on the data
being delivered into the target systems has already been validated. This is key for building
trustworthy reports and dashboards as a BI professional!

Schema-validation checklist
In this course, you have been learning about the tools business intelligence professionals use to
ensure conformity from source to destination: schema validation, data dictionaries, and data
lineages. In another reading, you already had the opportunity to explore data dictionaries and
lineages. In this reading, you are going to get a schema validation checklist you can use to guide
your own validation process.

Schema validation is a process used to ensure that the source system data schema matches the
target database data schema. This is important because if the schemas don’t align, it can cause
system failures that are hard to fix. Building schema validation into your workflow is important to
prevent these issues.

Common issues for schema validation


 The keys are still valid: Primary and foreign keys build relationships between
tables in relational databases. These keys should continue to function after you have
moved data from one system into another.
 The table relationships have been preserved: The keys help preserve the
relationships used to connect the tables so that keys can still be used to connect tables.
It’s important to make sure that these relationships are preserved or that they are
transformed to match the target schema.
 The conventions are consistent: The conventions for incoming data must be
consistent with the target database’s schema. Data from outside sources might use
different conventions for naming columns in tables– it’s important to align these before
they’re added to the target system.

Using data dictionaries and lineages


You’ve already learned quite a bit about data dictionaries and lineages. As a refresher, a data
dictionary is a collection of information that describes the content, format, and structure of data
objects within a database, as well as their relationships. And a data lineage is the process of
identifying the origin of data, where it has moved throughout the system, and how it has
transformed over time. These tools are useful because they can help you identify what standards
incoming data should adhere to and track down any errors to the source.

The data dictionaries and lineages reading provided some additional information if more review is
needed.

Key takeaways
Schema validation is a useful check for ensuring that the data moving from source systems to
your target database is consistent and won’t cause any errors. Building in checks to make sure
that the keys are still valid, the table relationships have been preserved, and the conventions are
consistent before data is delivered will save you time and energy trying to fix these errors later
on.

Activity Exemplar: Evaluate a schema


using a validation checklist
Here is a completed exemplar of an ideal version of the schema you evaluated in this activity, as
well as an explanation of why it is ideal.

Completed Exemplar

To review the exemplar for this course item, click the following link and select Use Template.

Link to exemplar: Database schema exemplar

OR

If you don’t have a Google account, you can download the exemplar directly from the following
attachment.

Activity Exemplar_ Database schema


PPTX File

Assessment of Exemplar
Compare the exemplar to your completed activity. Review your work using each of the criteria in
the exemplar. What did you do well? Where can you improve? Use your answers to these
questions to guide you as you continue to progress through the course.

In the schema you evaluated in this activity, the Sales Fact table is a central table that contains
key figures from the transactions. It also contains an internal key to the dimension it’s linked to.
This is a common schema structure in BI data warehouse systems.

The original schema contains eight tables: Sales Fact, Shipments, Billing, Order Items, Product,
Product Price, Order Details, and Customer, which are connected via keys.

The central table is Sales Fact. The foreign keys in the Sales Fact table link to the other tables as
follows:

 “order_sid” key links to the Order Items,Order Details, Shipments, and Billing tables
 “customer_sid” links to Order Details; “order_item_sid” links to Order Items, Shipments,
and Billing
 “shipment_sid” links to Shipments; and “billing_sid” links to Billing
 “product_id” from the Product table links to Order Items and Product Price
The Customer table currently doesn’t have any links to other tables. It contains the following
columns: “customer_sid,” “customer_name,” and “customer_type.”

This schema chart includes the following problems:

 The Customer table is not linked to any other tables. It should be linked to Sales Facts
and Order Details tables. This violates the “Keys are still valid” and Table relationships
have been preserved” checks.
 The Shipments table should be connected to the Order Items table through the
“order_sid” dimension. This violates the “Keys are still valid” and Table relationships have
been preserved” checks.
The exemplar in this reading is an example of the schema you evaluated, but with its errors fixed.
It links the Customer table to the Sales Facts and Order Details tables through the customer_sid
dimension. It has a connection between the Shipment and Order Items tables. It also has
consistent naming conventions for “product_sid.” Consistent naming for column titles is not
mandatory, but it is a best practice to keep titles as consistent as possible.

The important dimensions that are connections in this schema are order_sid, order_item_sid,
customer_sid, product_sid, shipment_sid, and billing_sid.

 Order_sid is present in the Sales Facts, Order Items, Shipments, Billing, and Order
Details tables.
 Order_item_sid is present in the Sales Facts, Order Items, Shipments, and Billing tables.
 Customer_sid is present in the Sales Facts, Order Details, and Customer tables.
 Product_sid is present in the Order Items, Product, and Product Price tables.
 Shipment_sid is present in the Sales Facts and Shipment tables.
 Billing_sid is present in the Sales Facts and Billing tables.

Database performance testing in an ETL


context
In previous lessons, you learned about database optimization as part of the database building
process. But it’s also an important consideration when it comes to ensuring your ETL and
pipeline processes are functioning properly. In this reading, you are going to return to
database performance testing in a new context: ETL processes.

How database performance affects your pipeline


Database performance is the rate that a database system is able to provide information to
users. Optimizing how quickly the database can perform tasks for users helps your team get
what they need from the system and draw insights from the data that much faster.

Your database systems are a key part of your ETL pipeline– these include where the data in
your pipeline comes from and where it goes. The ETL or pipeline is a user itself, making
requests of the database that it has to fulfill while managing the load of other users and
transactions. So database performance is not just key to making sure the database itself can
manage your organization’s needs– it’s also important for the automated BI tools you set up
to interact with the database.

Key factors in performance testing


Earlier, you learned about some database performance considerations you can check for when
a database starts slowing down. Here is a quick checklist of those considerations:

 Queries need to be optimized


 The database needs to be fully indexed
 Data should be defragmented
 There must be enough CPU and memory for the system to process requests
You also learned about the five factors of database performance: workload, throughput,
resources, optimization, and contention. These factors all influence how well a database is
performing, and it can be part of a BI professional’s job to monitor these factors and make
improvements to the system as needed.

These general performance tests are really important– that’s how you know your database
can handle data requests for your organization without any problems! But when it comes to
database performance testing while considering your ETL process, there is another important
check you should make: testing the table, column, row counts, and Query Execution Plan.

Testing the table, column, and row counts allows you to make sure that the data counts match between the source and target databases. Any mismatch could indicate a bug in the ETL system. A bug in the system could cause crashes or errors in the data, so checking the number of tables, columns, and rows in the destination database against the source data is a useful way to prevent that.
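
As a rough illustration of the count check, the sketch below compares row counts for a handful of tables between a source and a target database. The SQLite connections and table names are placeholders; the point is simply that any mismatch gets flagged for investigation before it causes downstream errors.

import sqlite3

# Placeholder connections; in practice these would be your source and target systems.
source = sqlite3.connect("source.db")
target = sqlite3.connect("target.db")

tables = ["customer", "order_items", "sales_facts"]  # tables moved by the ETL (assumed names)

def row_count(conn, table):
    """Return the number of rows in a table."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

for table in tables:
    src_count = row_count(source, table)
    tgt_count = row_count(target, table)
    if src_count != tgt_count:
        # A mismatch may point to a bug somewhere in the ETL process.
        print(f"Mismatch in {table}: source={src_count}, target={tgt_count}")
    else:
        print(f"{table}: counts match ({src_count} rows)")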

Key takeaways
As a BI professional, you need to know that your database can meet your organization’s needs, and performance testing is a key part of that process. It is useful during database building itself, and it’s just as important for ensuring that your pipelines are working properly. Remembering to include performance testing as a way to check your pipelines will help you maintain the automated processes that make data accessible to users!

Business rules
As you have been learning, a business rule is a statement that creates a restriction on specific parts of a database. These rules are developed according to the way an organization uses data. They create efficiencies, allow for important checks and balances, and sometimes exemplify the values of a business in action. For instance, if a company values cross-functional collaboration, there may be a rule requiring representatives from at least two teams to sign off on the completion of a dataset. Business rules affect what data is collected and stored, how relationships are defined, what kind of information the database provides, and the security of the data. In this reading, you will learn more about the development of business rules and see an example of business rules being implemented in a database system.

Imposing business rules


Business rules are highly dependent on the organization and their data needs. This means
business rules are different for every organization. This is one of the reasons why verifying
business rules is so important; these checks help ensure that the database is actually doing the
job you need it to do. But before you can verify business rules, you have to implement them.

For example, let’s say the company you work for has a database that manages purchase order requests entered by employees. Purchase orders over $1,000 need manager approval. To automate this process, you can impose a ruleset on the database that automatically delivers requests over $1,000 to a reporting table pending manager approval. Other business rules that may apply in this example are that prices must be numeric values (the data type should be integer) and that a reason is mandatory for a request to exist (the table field may not be null).

In order to fulfill this business requirement, there are three rules at play in this system:

1. Order requests under $1,000 are automatically delivered to the approved product order
requests table
2. Requests over $1,000 are automatically delivered to the requests pending approval table
3. Approved requests are automatically delivered to the approved product order requests
table
These rules inherently shape the database system to cater to the needs of this particular organization. A rough sketch of how this routing could be automated follows.
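
Here is a minimal sketch, in Python, of how this routing could be automated. The threshold and table roles follow the example above, but the record structure, function name, and use of in-memory lists instead of real database tables are assumptions made for illustration.

# Assumed record structure: {"amount": number, "reason": text, "approved": True/False}
APPROVAL_THRESHOLD = 1000  # orders at or above this amount need manager approval

def route_request(request, approved_orders, pending_approval):
    """Apply the example business rules to a single purchase order request."""
    if not request.get("reason"):
        # For a request to exist, a reason is mandatory (field may not be null).
        raise ValueError("A reason is mandatory for every request.")

    if request["amount"] < APPROVAL_THRESHOLD or request.get("approved"):
        approved_orders.append(request)   # rules 1 and 3: goes to the approved requests table
    else:
        pending_approval.append(request)  # rule 2: waits in the pending approval table

approved, pending = [], []
route_request({"amount": 250.00, "reason": "Office chairs"}, approved, pending)
route_request({"amount": 4800.00, "reason": "New laptops"}, approved, pending)
print(len(approved), "approved,", len(pending), "pending approval")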

Verifying business rules


Once the business rules have been implemented, it’s important to continue to verify that they are
functioning correctly and that data being imported into the target systems follows these rules.
These checks are important because they test that the system is doing the job it needs to, which
in this case is delivering product order requests that need approval to the right stakeholders.

Key takeaways
Business rules determine what data is collected and stored, how relationships are defined, what
kind of information the database provides, and the security of the data. These rules heavily
influence how a database is designed and how it functions after it has been set up.
Understanding business rules and why they are important is useful as a BI professional because
this can help you understand how existing database systems are functioning, design new
systems according to business needs, and maintain them to be useful in the future.

Defend against known issues


In this reading, you’ll learn about defensive checks applied to a data pipeline. Defensive checks help you prevent problems in your data pipeline. They are similar to performance checks but focus on other kinds of problems. The following scenario provides an example of how you can implement different kinds of defensive checks on a data pipeline.

Scenario
Arsha, a Business Intelligence Analyst at a telecommunications company, built a data
pipeline that merges data from six sources into a single database. While building her pipeline,
she incorporated several defensive checks that ensured that the data was moved and
transformed properly.

Her data pipeline used the following source systems:

1. Customer details
2. Mobile contracts
3. Internet and cable contracts
4. Device tracking and enablement
5. Billing
6. Accounting

All of these datasets had to be harmonized and merged into one target system for business
intelligence analytics. This process required several layers of data harmonization, validation,
reconciliation, and error handling.
Pipeline layers
Pipelines can have many different stages of processing. These stages, or layers, help ensure
that the data is collected, aggregated, transformed, and staged in the most effective and
efficient way. For example, it’s important to make sure you have all the data you need in one
place before you start cleaning it to ensure that you don’t miss anything. There are usually
four layers to this process: staging, harmonization, validation, and reconciliation. After these
four layers, the data is brought into its target database and an error handling report
summarizes each step of the process.

Staging layer

First, the original data is brought from the source systems and stored in the staging
layer. In this layer, Arsha ran the following defensive checks:

 Compared the number of records received and stored
 Compared rows to identify if extra records were created or records were lost
 Checked important fields, such as amounts, dates, and IDs

Arsha moved the mismatched records to the error handling report. She included each
unconverted source record, the date and time of its first processing, its last retry date and
time, the layer where the error happened, and a message describing the error. By collecting
these records, Arsha was able to find and fix the origin of the problems. She marked all of the
records that moved to the next layer as “processed.”
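
A simplified version of a staging-layer check like this might look like the following Python sketch. The record fields (id, amount) and the shape of the error-handling entries are assumptions; the idea is that mismatched records are captured with enough context to trace the problem later, mirroring the fields Arsha recorded.

from datetime import datetime

def stage_records(received, stored, error_log):
    """Compare received vs. stored records and log anything that doesn't match."""
    if len(received) != len(stored):
        print(f"Count mismatch: received {len(received)}, stored {len(stored)}")

    stored_ids = {record["id"] for record in stored}
    for record in received:
        # Check important fields such as amounts, dates, and IDs.
        if record["id"] not in stored_ids or record.get("amount") is None:
            error_log.append({
                "source_record": record,
                "first_processed": datetime.now(),
                "last_retry": None,
                "layer": "staging",
                "message": "Record missing from staging or missing key fields",
            })
        else:
            record["status"] = "processed"  # ready for the harmonization layer

received = [{"id": 1, "amount": 19.99}, {"id": 2, "amount": None}]
stored = [{"id": 1, "amount": 19.99}]
errors = []
stage_records(received, stored, errors)
print(len(errors), "record(s) sent to error handling")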

Harmonization layer

The harmonization layer is where data normalization routines and record enrichment
are performed. This ensures that data formatting is consistent across all the sources. To
harmonize the data, Arsha ran the following defensive checks:

 Standardized the date format
 Standardized the currency
 Standardized uppercase and lowercase stylization
 Formatted IDs with leading zeros
 Split date values to store the year, month, and day in separate columns
 Applied conversion and priority rules from the source systems
When a record couldn’t be harmonized, she moved it to Error Handling. She marked all of
the records that moved to the next layer as “processed.”
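
The sketch below shows a few of these normalization routines applied to one staged record. The field names and the incoming date format are assumptions for illustration; a real harmonization layer would also handle currency conversion and the source systems’ conversion and priority rules.

from datetime import datetime

def harmonize(record):
    """Apply simple normalization routines to one staged record."""
    # Standardize the date format and split it into year, month, and day columns.
    parsed = datetime.strptime(record["order_date"], "%m/%d/%Y")
    record["order_date"] = parsed.strftime("%Y-%m-%d")
    record["year"], record["month"], record["day"] = parsed.year, parsed.month, parsed.day

    # Standardize uppercase/lowercase stylization and format IDs with leading zeros.
    record["customer_name"] = record["customer_name"].title()
    record["customer_sid"] = str(record["customer_sid"]).zfill(8)
    return record

print(harmonize({
    "order_date": "7/04/2023",
    "customer_name": "ACME TELECOM",
    "customer_sid": 4521,
}))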

Validations layer

The validations layer is where business rules are validated. As a reminder, a business rule is a statement that creates a restriction on specific parts of a database. These rules are developed according to the way an organization uses data. Arsha ran the following defensive checks:

 Ensured that values in the “department” column were not null, since “department” is a
crucial dimension
 Ensured that values in the “service type” column were within the authorized values to
be processed
 Ensured that each billing record corresponded to a valid processed contract

Again, when a record couldn’t be validated, she moved it to error handling. She marked all the records that moved to the next layer as “processed.”
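
A simplified version of these validation rules might look like the sketch below. The column names follow the checks described above; the set of authorized service types and the list of processed contract IDs are placeholders.

AUTHORIZED_SERVICE_TYPES = {"mobile", "internet", "cable"}  # placeholder values

def validate(record, processed_contract_ids):
    """Return a list of business-rule violations for one harmonized record."""
    violations = []
    if record.get("department") is None:
        violations.append("'department' is a crucial dimension and may not be null")
    if record.get("service_type") not in AUTHORIZED_SERVICE_TYPES:
        violations.append(f"unauthorized service type: {record.get('service_type')}")
    if record.get("contract_id") not in processed_contract_ids:
        violations.append("billing record does not correspond to a valid processed contract")
    return violations

result = validate(
    {"department": "Sales", "service_type": "mobile", "contract_id": "C-0042"},
    processed_contract_ids={"C-0042", "C-0043"},
)
print(result or "record passed validation")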

Reconciliation layer

The reconciliation layer is where duplicate or illegitimate records are found. Here,
Arsha ran defensive checks to find the following types of records:

 Slow-changing dimensions
 Historic records
 Aggregations

As with the previous layers, Arsha moved the records that didn't pass the reconciliation rules
to Error Handling. After this round of defensive checks, she brought the processed records
into the BI and Analytics database (OLAP).
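
As one example of a reconciliation check, the sketch below flags duplicate records based on a business key. Handling slow-changing dimensions, historic records, and aggregations takes more logic than this, so treat it only as a starting point; the key fields and sample records are assumptions.

def find_duplicates(records, key_fields=("order_sid", "order_item_sid")):
    """Flag records whose business key has already been seen."""
    seen, duplicates = set(), []
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if key in seen:
            duplicates.append(record)  # candidate for the error handling report
        else:
            seen.add(key)
    return duplicates

records = [
    {"order_sid": "O-1", "order_item_sid": "I-1", "amount": 30.0},
    {"order_sid": "O-1", "order_item_sid": "I-1", "amount": 30.0},  # duplicate
    {"order_sid": "O-2", "order_item_sid": "I-7", "amount": 12.5},
]
print(len(find_duplicates(records)), "duplicate record(s) found")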

Error handling reporting and analysis


After completing the pipeline and running the defensive checks, Arsha made an error
handling report to summarize the process. The report listed the number of records from the
source systems, as well as how many records were marked as errors or ignored in each layer.
The end of the report listed the final number of processed records.
Key takeaways
Defensive checks ensure that a data pipeline handles its data properly, and they are an essential part of preserving data integrity. Once the staging, harmonization, validation, and reconciliation layers have been checked, the data brought into the target database is ready to be used in a visualization.

Case study: FeatureBase database systems
This case study with FeatureBase will focus on the Analyze stage of the BI process, where you
examine relationships in the data, draw conclusions, make predictions, and drive informed
decision-making. This follows an earlier case study, where you explored the Capture stage of FeatureBase’s project. In a follow-up case study, you’ll learn how FeatureBase addressed the Monitor stage of this project to solve their business problem. Like the previous FeatureBase
case study, you’ll consider the problem, process, and solution for this stage of the project.

In a previous reading, you were introduced to FeatureBase, the OLAP database company. For a
quick refresher, you can review part one of this case study. Their core technology, FeatureBase,
is the first OLAP database built entirely on bitmaps that power real-time analytics and machine
learning applications by simultaneously executing low latency, high throughput, and highly
concurrent workloads. Last time, you learned about a business problem the FeatureBase team
was facing: they realized that customers were falling off during the sales cycle, but that their data
collection didn’t have the necessary measurements to investigate when and why this was
happening. The first step to addressing this issue was collaborating across sales, marketing, and
leadership teams to determine what data they needed to understand when customers were
falling off. Then, they could use that insight to investigate and address those issues for future
sales. In this reading, you’re going to focus on the database tools FeatureBase uses to collect
data for monitoring and reporting.
Feature-oriented databases
Throughout this program, you have been learning about different database technologies that use
pipeline systems to ingest, transform, and deliver data to target databases. This setup is fairly
common across many different organizations, and you will often work with pipelines as a BI
professional. However, there are also other types of database systems that use different
technology to make data accessible and useful to users.

FeatureBase is one example of an alternative solution to traditional databases: it is built on top of bitmaps, a format that stores data as raw features. To build predictive models, the AI depends on the feature-oriented database (and not the other way around) to find patterns that can be used to guide decision-making. These models are fed measurable data points, or features, that help them learn how to parse the data more effectively over time. Feature-oriented databases provide an alternate approach to data preparation by automating feature extraction as the first step. This approach enables real-time analytics and AI initiatives because the data, or “features,” is already in a model-ready format that is instantly accessible and reusable across the organization, without the need to re-copy or re-process it.

Here’s an example of this system: imagine you were trying to predict a type of animal based on traits. Humans would have a list of traits they might think about: wings, snouts, or the number of legs, for example. But these traits aren’t directly actionable for a model. If they’re converted into features like “has_wings” or “has_4_legs” with yes or no values coded as 1 or 0, they can be fed into a model and processed more quickly. This is why FeatureBase is built on bitmaps; the arrays of 1s and 0s are easily actionable by machines and models.
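
Here is a small Python sketch of that idea: descriptive traits are converted into 1/0 features that a model (or a bitmap-backed store) can act on directly. The trait names and animals come from the example above; everything else is illustrative.

# Convert descriptive traits into binary features, as in the animal example above.
animals = {
    "eagle": {"wings": True, "legs": 2},
    "dog": {"wings": False, "legs": 4},
}

def to_features(traits):
    """Encode traits as 1/0 features suitable for a model or bitmap index."""
    return {
        "has_wings": 1 if traits["wings"] else 0,
        "has_4_legs": 1 if traits["legs"] == 4 else 0,
    }

for name, traits in animals.items():
    print(name, to_features(traits))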

Fine-tuning features
In the last reading, you followed along as FeatureBase leadership considered how to approach
their ultimate question: Why are customers dropping off during the sales cycle?

At that point in this project’s cycle, the team didn’t have metrics built into their system to find out
when customers weren’t completing the sales process, which was key to investigating why
customers were dropping off. Having determined what their system was lacking, the team encoded these new features into the collection process: they recreated their original sales funnel with new attributes about customers at every stage of the sales cycle. These new features were fed into their database model, which began training to identify patterns, allowing them to immediately draw insights from their pool of data.
One of the attributes they added to the data collection process tracked exactly when customers
were dropping off. They discovered that most customers who didn’t complete the sale dropped
off during the technical validation stage. This is the point at which the FeatureBase team would set up FeatureBase within the customer’s system so the customer could try the product for themselves. The
team realized that this was the critical stage they needed to investigate more. They theorized that
customers weren’t leaving the technical validation stage confident about their ability to adopt this
new technology. They also wondered if customers were having concerns about the reliability and
stability of the product since FeatureBase would replace their current, still-functioning database
system. To understand this better, they would need to explore the customer data gathered at this
stage in their dashboard.

The next step


As a BI professional, you might find yourself working with a variety of database technologies connected with pipeline systems, from more traditional row-based technologies to newer alternatives such as FeatureBase. Understanding your tools and how they operate will help you focus on what’s most important: empowering your team with access to the answers they need.
As they continued investigating their problem, the FeatureBase team found that most customers
who did not complete the sales process were falling off in the technical validation stage.

This is the point at which FeatureBase was being implemented in the customer’s data
environment to determine if it was actually functional for them. This is how the FeatureBase team
can showcase FeatureBase’s utility and provide proof that it is a workable solution for a
customer’s needs.

The team theorized that, while many customers loved the capabilities of the service, they had
some anxieties about how adoptable it would be for their organization and how reliable it would
be. But before the team could confirm this theory or make any decisions based on this potential answer, they would need to explore dashboard reports and investigate their metrics in depth, which is what you will focus on when you return to this case study next time!

As a BI professional, your job doesn't end once you've built the database systems and
pipeline tools for your organization.
To address those ongoing needs, it's also important that you ensure they continue
to work as intended and handle potential errors before they become problems.
You've been learning a lot.
First, you explored the importance of quality testing in an ETL system.
This involved checking incoming data for completeness, consistency,
conformity, accuracy, redundancy, integrity, and timeliness.
You also investigated schema governance and
how schema validation can prevent incoming data from causing errors in the system by
making sure it conforms to the schema properties of the destination database.
After that, you discovered why verifying business rules is an important step in
optimization because it ensures that the data coming in meets the business
needs of the organization using it.
Maintaining the storage systems that users interact with is an important part
of ensuring that your system is meeting the business's needs.
This is why database optimization is so important, but
it's just as important to ensure that the systems that move data from place to
place are as efficient as possible.
And that's where optimizing pipelines and ETL systems comes in.
Coming up, you have another assessment.
I know you can do this. Just as a reminder,
you can review any of the material, as well as the latest glossary, as you get ready.
So feel free to revisit any videos or
readings to get a refresher before the assessment.
After that, you'll have the chance to put everything you've been learning into
practice by developing BI tools and processes yourself.
You're making excellent progress toward a career in BI.

Glossary terms from week 3


Accuracy: An element of quality testing used to confirm that data conforms to the actual entity
being measured or described

Business rule: A statement that creates a restriction on specific parts of a database

Completeness: An element of quality testing used to confirm that data contains all desired
components or measures

Conformity: An element of quality testing used to confirm that data fits the required
destination format

Consistency: An element of quality testing used to confirm that data is compatible and in
agreement across all systems

Data dictionary: A collection of information that describes the content, format, and structure
of data objects within a database, as well as their relationships

Data lineage: The process of identifying the origin of data, where it has moved throughout
the system, and how it has transformed over time

Data mapping: The process of matching fields from one data source to another

Integrity: An element of quality testing used to confirm that data is accurate, complete,
consistent, and trustworthy throughout its life cycle

Quality testing: The process of checking data for defects in order to prevent system failures;
it involves the seven validation elements of completeness, consistency, conformity, accuracy,
redundancy, integrity, and timeliness

Redundancy: An element of quality testing used to confirm that no more data than necessary
is moved, transformed, or stored

Schema validation: A process to ensure that the source system data schema matches the
target database data schema

Timeliness: An element of quality testing used to confirm that data is current


Terms and definitions from previous weeks
A
Application programming interface (API): A set of functions and procedures that
integrate computer programs, forming a connection that enables them to communicate

Applications software developer: A person who designs computer or mobile applications, generally for consumers

Attribute: In a dimensional model, a characteristic or quality used to describe a dimension

B
Business intelligence (BI): Automating processes and information channels in order to
transform relevant data into actionable insights that are easily available to decision-makers

Business intelligence governance: A process for defining and implementing business intelligence systems and frameworks within an organization

Business intelligence monitoring: Building and using hardware and software tools to
easily and rapidly analyze data and enable stakeholders to make impactful business decisions

Business intelligence stages: The sequence of stages that determine both BI business
value and organizational data maturity, which are capture, analyze, and monitor

Business intelligence strategy: The management of the people, processes, and tools
used in the business intelligence process

C
Columnar database: A database organized by columns instead of rows

Combined systems: Database systems that store and analyze data in the same place

Compiled programming language: A programming language that compiles coded instructions that are executed directly by the target machine

Contention: When two or more components attempt to use a single resource in a conflicting
way

D
Data analysts: People who collect, transform, and organize data

Data availability: The degree or extent to which timely and relevant information is readily
accessible and able to be put to use
Data governance professionals: People who are responsible for the formal management
of an organization’s data assets

Data integrity: The accuracy, completeness, consistency, and trustworthiness of data throughout its life cycle

Data lake: A database system that stores large amounts of raw data in its original format until
it’s needed

Data mart: A subject-oriented database that can be a subset of a larger data warehouse

Data maturity: The extent to which an organization is able to effectively use its data in order
to extract actionable insights

Data model: A tool for organizing data elements and how they relate to one another

Data partitioning: The process of dividing a database into distinct, logical parts in order to
improve query processing and increase manageability

Data pipeline: A series of processes that transports data from different sources to their final
destination for storage and analysis

Data visibility: The degree or extent to which information can be identified, monitored, and
integrated from disparate internal and external sources

Data warehouse: A specific type of database that consolidates data from multiple source
systems for data consistency, accuracy, and efficient access

Data warehousing specialists: People who develop processes and procedures to effectively store and organize data

Database migration: Moving data from one source platform to another target database

Database performance: A measure of the workload that can be processed by a database, as well as associated costs

Deliverable: Any product, service, or result that must be achieved in order to complete a
project

Developer: A person who uses programming languages to create, execute, test, and
troubleshoot software applications

Dimension (data modeling): A piece of information that provides more detail and context
regarding a fact

Dimension table: The table where the attributes of the dimensions of a fact are stored

Design pattern: A solution that uses relevant measures and facts to create a model in
support of business needs

Dimensional model: A type of relational model that has been optimized to quickly retrieve
data from a data warehouse
Distributed database: A collection of data systems distributed across multiple physical
locations

E
ELT (extract, load, and transform): A type of data pipeline that enables data to be
gathered from data lakes, loaded into a unified destination system, and transformed into a useful
format

ETL (extract, transform, and load): A type of data pipeline that enables data to be
gathered from source systems, converted into a useful format, and brought into a data
warehouse or other unified destination system

Experiential learning: Understanding through doing

F
Fact: In a dimensional model, a measurement or metric

Fact table: A table that contains measurements or metrics related to a particular event

Foreign key: A field within a database table that is a primary key in another table (Refer to
primary key)

Fragmented data: Data that is broken up into many pieces that are not stored together,
often as a result of using the data frequently or creating, deleting, or modifying files

Functional programming language: A programming language modeled around functions

G
Google DataFlow: A serverless data-processing service that reads data from the source,
transforms it, and writes it in the destination location

I
Index: An organizational tag used to quickly locate data within a database system

Information technology professionals: People who test, install, repair, upgrade, and
maintain hardware and software solutions

Interpreted programming language: A programming language that uses an interpreter, typically another program, to read and execute coded instructions

Iteration: Repeating a procedure over and over again in order to keep getting closer to the
desired result

K
Key performance indicator (KPI): A quantifiable value, closely linked to business
strategy, which is used to track progress toward a goal
L
Logical data modeling: Representing different tables in the physical data model

M
Metric: A single, quantifiable data point that is used to evaluate performance

O
Object-oriented programming language: A programming language modeled around
data objects

OLAP (Online Analytical Processing) system: A tool that has been optimized for
analysis in addition to processing and can analyze data from multiple databases

OLTP (Online Transaction Processing) database: A type of database that has been
optimized for data processing instead of analysis

Optimization: Maximizing the speed and efficiency with which data is retrieved in order to
ensure high levels of database performance

P
Portfolio: A collection of materials that can be shared with potential employers

Primary key: An identifier in a database that references a column or a group of columns in which each row uniquely identifies each record in the table (Refer to foreign key)

Project manager: A person who handles a project’s day-to-day steps, scope, schedule,
budget, and resources

Project sponsor: A person who has overall accountability for a project and establishes the
criteria for its success

Python: A general purpose programming language

Q
Query plan: A description of the steps a database system takes in order to execute a query

R
Resources: The hardware and software tools available for use in a database system

Response time: The time it takes for a database to complete a user request

Row-based database: A database that is organized by rows

S
Separated storage and computing systems: Databases where data is stored
remotely, and relevant data is stored locally for analysis

Single-homed database: Database where all of the data is stored in the same physical
location

Snowflake schema: An extension of a star schema with additional dimensions and, often,
subdimensions

Star schema: A schema consisting of one fact table that references any number of dimension
tables

Strategy: A plan for achieving a goal or arriving at a desired future state

Subject-oriented: Associated with specific areas or departments of a business

Systems analyst: A person who identifies ways to design, implement, and advance
information systems in order to ensure that they help make it possible to achieve business goals

Systems software developer: A person who develops applications and programs for the
backend processing systems used in organizations

T
Tactic: A method used to enable an accomplishment

Target table: The predetermined location where pipeline data is sent in order to be acted on

Throughput: The overall capability of the database’s hardware and software to process
requests

Transferable skill: A capability or proficiency that can be applied from one job to another

V
Vanity metric: Data points that are intended to impress others, but are not indicative of
actual performance and, therefore, cannot reveal any meaningful business insights

W
Workload: The combination of transactions, queries, data warehousing analysis, and system
commands being processed by the database system at any given time

Explore Course 2 end-of-course project scenarios
Overview
When you approach a project using structured thinking, you will often find that there are specific
steps you need to complete in a specific order. The end-of-course projects in the Google
Business Intelligence certificate were designed with this in mind. The challenges presented in
each course represent a single milestone within an entire project, based on the skills and
concepts learned in that course.

The certificate program allows you to choose from different workplace scenarios to complete the
end-of-course projects: the Cyclistic bike share company or Google Fiber. Each scenario offers
you an opportunity to refine your skills and create artifacts to share on the job market in an online
portfolio.

You will be practicing similar skills regardless of which scenario you choose, but you must
complete at least one end-of-course project for each course to earn your Google
Business Intelligence certificate. To have a cohesive experience, it is recommended that you
choose the same scenario for each end-of-course project. For example, if you chose the Cyclistic
scenario to complete in Course 1, we recommend completing this same scenario in Courses 2 and 3 as well. However, if you are interested in more than one workplace scenario or would like
more of a challenge, you are welcome to do more than one end-of-course project. Completing
multiple projects offers you additional practice and examples you can share with prospective
employers.

Course 2 end-of-course project scenarios


Cyclistic bike-share

Background:

In this fictitious workplace scenario, the imaginary company Cyclistic has partnered with the city
of New York to provide shared bikes. Currently, there are bike stations located throughout
Manhattan and neighboring boroughs. Customers are able to rent bikes for easy travel among
stations at these locations.

Scenario:

You are a newly hired BI professional at Cyclistic. The company’s Customer Growth Team is
creating a business plan for next year. They want to understand how their customers are using
their bikes; their top priority is identifying customer demand at different station locations.
Previously, you gathered information from your meeting notes and completed important project
planning documents. Now you are ready for the next part of your project!

Course 2 challenge:

 Use project planning documents to identify key metrics and dashboard requirements
 Observe stakeholders in action to better understand how they use data
 Gather and combine necessary data
 Design reporting tables that can be uploaded to Tableau to create the final dashboard

Note: The story, as well as all names, characters, and incidents portrayed, are fictitious. No
identification with actual people (living or deceased) is intended or should be inferred. The data
shared in this project has been created for pedagogical purposes.
Google Fiber

Background:

Google Fiber provides people and businesses with fiber optic internet. Currently, the customer
service team working in their call centers answers calls from customers in their established
service areas. In this fictional scenario, the team is interested in exploring trends in repeat calls
to reduce the number of times customers have to call in order for an issue to be resolved.

Scenario:

You are currently interviewing for a BI position on the Google Fiber call center team. As part of
the interview process, they ask you to develop a dashboard tool that allows them to explore
trends in repeat calls. The team needs to understand how often customers call customer support
after their first inquiry. This will help leadership understand how effectively the team can answer
customer questions the first time. Previously, you gathered information from your meeting notes
and completed important project planning documents. Now you’re ready for the next part of your
project!

Course 2 challenge:

 Use project planning documents to identify key metrics and dashboard requirements
 Consider best tools to execute your project
 Gather and combine necessary data
 Design reporting tables that can be uploaded to Tableau to create the final dashboard

Key takeaways
In Course 2, The Path to Insights: Data Models and Pipelines, you focused on understanding
how data is stored, transformed, and delivered in a BI environment.

Course 2 skills:
 Combine and transform data
 Identify key metrics
 Create target tables
 Practice working with BI tools

Course 2 end-of-course project deliverables:


 The necessary target tables

Now that you have completed this step of your project and developed the target tables, you are
ready to work on your final dashboard in the next course!

Decisions, Decisions: Dashboards and Reports
Take a tour of Tableau
You have started exploring Tableau as a data visualization tool for business intelligence dashboards that convey insights to stakeholders. Throughout this program, you will continue to use and access Tableau, eventually using it to create your own dashboards. This reading will help you familiarize yourself with Tableau's interface and functionality.

Create a profile on Tableau Public


With Tableau Public, you can create and share visualizations. If you don’t already have an
account, make one on the Tableau Public site. Note that trying to make an account from the
main page will sign you up for a Tableau Free Trial rather than a Tableau Public account.

The difference between these two options is that a Free Trial lasts for 14 days, whereas Tableau
Public gives you long-term access through the web version of the program. It has some
limitations compared to the other versions of Tableau, but it is free to use and will enable you to
complete the upcoming activities. You can also use your Tableau credentials to access Tableau
Public if you already have an account! You are welcome to try the free trial or purchase Tableau,
but it is not required for this program.

Complete the information in the signup form. When you click the Create My Profile button, you’ll
be transferred to your profile page. This is where your Tableau Public visualizations can be made
public to share with your peers. In the tabs on this page, you can access lists of visualizations
you’ve made, visualizations you’ve favorited, authors you are following, and authors who are
following you. By clicking Edit Profile, you can add additional information like your bio, title,
organization, and links to social media accounts. This is also where you can enable Tableau Public’s Hire Me button, which indicates to potential hiring managers that your Tableau skills are available for hire.

Optional: Download the desktop version


With the desktop application, you can use features from Tableau Public without connecting to the
internet. It is free to use, just like Tableau Public’s online version. Keep in mind that this
application cannot be used on the Chromebook operating system and is not required for this
course. If you are using Windows or Mac OS, this desktop application will enable you to
complete upcoming activities that use Tableau Public. To download Tableau Public Desktop
Edition (this is optional), log into your account and review the system requirements for your
operating system.
Loading and linking data
Tableau enables you to load in your own data and link it to other datasets directly in the platform.
When you log in, choose to Create a Viz. This will open a new worksheet where you can upload
data or connect to online sources, such as your Google Drive.

Once you upload data to your worksheet, it will populate the Connections pane.

You can add more connections to other data sources in order to build visualizations that compare
different datasets. Simply drag and drop tables from the Sheets section in order to join tables and
generate those connections:
Dimensions and measures
Tableau uses dimensions and measures to generate customized charts. For example, check out
this chart focusing on CO2 emissions per country. The Country Name dimension can be used to
show a map of the countries on the planet with dots indicating which countries are represented in
the data.

The dots are all the same size because, with no measure selected, Tableau defaults to scaling each country equally. If you want to scale by CO2 emissions, you need to include a specific measure. Here is the same chart with a measure for CO2 kilotons (kt). This changes the size of the dots to be proportional to the amount of CO2 emitted:
Tableau has a wide variety of options for depicting the measure for a given dimension. Most of
these options are contained near the main display and the column with dimensions and
measures.

Tableau allows you to customize measures with options such as Color, Size, and Label, which
change those aspects of the measure’s visualization on the chart. As you customize measures in
Tableau, you will want to consider accessibility for your audience. As a refresher, you can check
out this video on accessible visualizations from the Google Data Analytics Certificate program.

Types of visualizations in Tableau


In addition to more traditional charts, Tableau also offers some more specific visualizations that
you can use in your dashboard design:

 Highlight tables appear like tables with conditional formatting. Review the steps to
build a highlight table.
 Heat maps show intensity or concentrations in the data. Review the steps to build a
heat map.
 Density maps illustrate concentrations (such as a population density map). Refer to
instructions to create a heat map for density.
 Gantt charts demonstrate the duration of events or activities on a timeline. Review the
steps to build a Gantt chart.
 Symbol maps display a mark over a given longitude and latitude. Learn more from
this example of a symbol map.
 Filled maps are maps with areas colored based on a measurement or dimension.
Explore an example of a filled map.
 Circle views show comparative strength in data. Learn more from this example of a
circle view.
 Box plots, also known as box and whisker charts, illustrate the distribution of
values along a chart axis. Refer to the steps to build a box plot.
 Bullet graphs compare a primary measure with another and can be used instead of
dial gauge charts. Review the steps to build a bullet graph.
 Packed bubble charts display data in clustered circles. Review the steps to build a
packed bubble chart.

Tableau resources
As you continue to explore Tableau and prepare to make your own dynamic dashboards, here
are a few useful links within Tableau Public:

 Tableau Public Channels: Explore data visualizations created by others across a variety of different topics.
 Viz of the Day: Tableau Public features a new data viz every day; check back for new
visualizations daily or subscribe to receive updates directly to your inbox.
 Google Career Certificates page on Tableau Public: This gallery contains all
the visualizations created in the video lessons so you can explore these examples more
in-depth.
 Tableau Public resources page: This links to the resources page, including some
how-to videos and sample data.
 Tableau Accessibility FAQ: Access resources about accessibility in Tableau
visualizations using the FAQ; it includes links to blog posts, community forums, and tips
for new users.
 Tableau community forum: Search for answers and connect with other users in
the community on the forum page.
 Data Literacy Course: Build your data literacy skills in order to interpret, explore,
and communicate effectively with data.

Glossary terms from week 1


Audience problem: A dashboard issue caused by failing to adequately consider the needs
of the user

Data problem: A dashboard issue caused by the data being used

Low-fidelity mockup: A simple draft of a visualization that is used for planning a dashboard
and evaluating its progress

Tool problem: A dashboard issue involving the hardware or software being used

Terms and definitions from previous weeks
A
Accuracy: An element of quality testing used to confirm that data conforms to the actual entity
being measured or described

Application programming interface (API): A set of functions and procedures that integrate computer programs, forming a connection that enables them to communicate

Applications software developer: A person who designs computer or mobile applications, generally for consumers

Attribute: In a dimensional model, a characteristic or quality used to describe a dimension

B
Business intelligence (BI): Automating processes and information channels in order to
transform relevant data into actionable insights that are easily available to decision-makers

Business intelligence governance: A process for defining and implementing business intelligence systems and frameworks within an organization

Business intelligence monitoring: Building and using hardware and software tools to
easily and rapidly analyze data and enable stakeholders to make impactful business decisions

Business intelligence stages: The sequence of stages that determine both BI business
value and organizational data maturity, which are capture, analyze, and monitor

Business intelligence strategy: The management of the people, processes, and tools
used in the business intelligence process

Business rule: A statement that creates a restriction on specific parts of a database

C
Columnar database: A database organized by columns instead of rows

Combined systems: Database systems that store and analyze data in the same place
Compiled programming language: A programming language that compiles coded
instructions that are executed directly by the target machine

Completeness: An element of quality testing used to confirm that data contains all desired
components or measures

Conformity: An element of quality testing used to confirm that data fits the required
destination format

Contention: When two or more components attempt to use a single resource in a conflicting
way

Consistency: An element of quality testing used to confirm that data is compatible and in
agreement across all systems

D
Data analysts: People who collect, transform, and organize data

Data availability: The degree or extent to which timely and relevant information is readily
accessible and able to be put to use

Data dictionary: A collection of information that describes the content, format, and structure
of data objects within a database, as well as their relationships

Data governance professionals: People who are responsible for the formal management
of an organization’s data assets

Data integrity: The accuracy, completeness, consistency, and trustworthiness of data throughout its life cycle

Data lake: A database system that stores large amounts of raw data in its original format until
it’s needed

Data lineage: The process of identifying the origin of data, where it has moved throughout
the system, and how it has transformed over time

Data mapping: The process of matching fields from one data source to another

Data mart: A subject-oriented database that can be a subset of a larger data warehouse

Data maturity: The extent to which an organization is able to effectively use its data in order
to extract actionable insights

Data model: A tool for organizing data elements and how they relate to one another

Data partitioning: The process of dividing a database into distinct, logical parts in order to
improve query processing and increase manageability

Data pipeline: A series of processes that transports data from different sources to their final
destination for storage and analysis

Data visibility: The degree or extent to which information can be identified, monitored, and
integrated from disparate internal and external sources
Data warehouse: A specific type of database that consolidates data from multiple source
systems for data consistency, accuracy, and efficient access

Data warehousing specialists: People who develop processes and procedures to effectively store and organize data

Database migration: Moving data from one source platform to another target database

Database performance: A measure of the workload that can be processed by a database, as well as associated costs

Deliverable: Any product, service, or result that must be achieved in order to complete a
project

Developer: A person who uses programming languages to create, execute, test, and
troubleshoot software applications

Dimension (data modeling): A piece of information that provides more detail and context
regarding a fact

Dimension table: The table where the attributes of the dimensions of a fact are stored

Design pattern: A solution that uses relevant measures and facts to create a model in
support of business needs

Dimensional model: A type of relational model that has been optimized to quickly retrieve
data from a data warehouse

Distributed database: A collection of data systems distributed across multiple physical locations

E
ELT (extract, load, and transform): A type of data pipeline that enables data to be
gathered from data lakes, loaded into a unified destination system, and transformed into a useful
format

ETL (extract, transform, and load): A type of data pipeline that enables data to be
gathered from source systems, converted into a useful format, and brought into a data
warehouse or other unified destination system

Experiential learning: Understanding through doing

F
Fact: In a dimensional model, a measurement or metric

Fact table: A table that contains measurements or metrics related to a particular event

Foreign key: A field within a database table that is a primary key in another table (Refer to
primary key)
Fragmented data: Data that is broken up into many pieces that are not stored together,
often as a result of using the data frequently or creating, deleting, or modifying files

Functional programming language: A programming language modeled around functions

G
Google DataFlow: A serverless data-processing service that reads data from the source,
transforms it, and writes it in the destination location

I
Index: An organizational tag used to quickly locate data within a database system

Information technology professionals: People who test, install, repair, upgrade, and
maintain hardware and software solutions

Integrity: An element of quality testing used to confirm that data is accurate, complete,
consistent, and trustworthy throughout its life cycle

Interpreted programming language: A programming language that uses an interpreter, typically another program, to read and execute coded instructions

Iteration: Repeating a procedure over and over again in order to keep getting closer to the
desired result

K
Key performance indicator (KPI): A quantifiable value, closely linked to business
strategy, which is used to track progress toward a goal

L
Logical data modeling: Representing different tables in the physical data model

M
Metric: A single, quantifiable data point that is used to evaluate performance

O
Object-oriented programming language: A programming language modeled around
data objects

OLAP (Online Analytical Processing) system: A tool that has been optimized for
analysis in addition to processing and can analyze data from multiple databases

OLTP (Online Transaction Processing) database: A type of database that has been
optimized for data processing instead of analysis
Optimization: Maximizing the speed and efficiency with which data is retrieved in order to
ensure high levels of database performance

P
Portfolio: A collection of materials that can be shared with potential employers

Primary key: An identifier in a database that references a column or a group of columns in which each row uniquely identifies each record in the table (Refer to foreign key)

Project manager: A person who handles a project’s day-to-day steps, scope, schedule,
budget, and resources

Project sponsor: A person who has overall accountability for a project and establishes the
criteria for its success

Python: A general purpose programming language

Q
Quality testing: The process of checking data for defects in order to prevent system failures;
it involves the seven validation elements of completeness, consistency, conformity, accuracy,
redundancy, integrity, and timeliness

Query plan: A description of the steps a database system takes in order to execute a query

R
Redundancy: An element of quality testing used to confirm that no more data than necessary
is moved, transformed, or stored

Resources: The hardware and software tools available for use in a database system

Response time: The time it takes for a database to complete a user request

Row-based database: A database that is organized by rows

S
Schema validation: A process to ensure that the source system data schema matches the
target database data schema

Separated storage and computing systems: Databases where data is stored remotely, and relevant data is stored locally for analysis

Single-homed database: Database where all of the data is stored in the same physical
location

Snowflake schema: An extension of a star schema with additional dimensions and, often,
subdimensions

Star schema: A schema consisting of one fact table that references any number of dimension
tables
Strategy: A plan for achieving a goal or arriving at a desired future state

Subject-oriented: Associated with specific areas or departments of a business

Systems analyst: A person who identifies ways to design, implement, and advance
information systems in order to ensure that they help make it possible to achieve business goals

Systems software developer: A person who develops applications and programs for the
backend processing systems used in organizations

T
Tactic: A method used to enable an accomplishment

Target table: The predetermined location where pipeline data is sent in order to be acted on

Throughput: The overall capability of the database’s hardware and software to process
requests

Timeliness: An element of quality testing used to confirm that data is current

Transferable skill: A capability or proficiency that can be applied from one job to another

V
Vanity metric: Data points that are intended to impress others, but are not indicative of
actual performance and, therefore, cannot reveal any meaningful business insights

W
Workload: The combination of transactions, queries, data warehousing analysis, and system
commands being processed by the database system at any given time
