Geospatial Analysis With SQL: A Hands-On Guide To Performing Geospatial Analysis
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means, without the prior written permission of the publisher,
except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information
presented. However, the information contained in this book is sold without warranty, either express or
implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable
for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and
products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot
guarantee the accuracy of this information.
Grosvenor House
11 St Paul's Square
Birmingham
B3 1RB, UK.
ISBN: 978-1-83508-314-7
www.packtpub.com
To my geospatial colleagues from around the world: thank you for the warm welcome into this
vibrant community—always learning, always curious.
To my husband Steve, and boys Harrison and Ryland—your ongoing support means the oblate
spheroid to me!
Contributors
Bonny is a popular conference keynote speaker and workshop leader. Her professional goals include
exploring large datasets and curating empathetic answers to larger questions, making a big world
seem smaller.
Thanks to Lily, my puppy. Without her, the review process would’ve been done three times faster.
Emmanuel Jolaiya is a software engineer with a background in remote sensing and GIS from the
Federal University of Technology, Akure, Nigeria. He has consulted for several leading world
organizations, including the World Bank, and currently, he consults for Integration Environment and
Energy, a German-based organization, on the Nigeria Energy Support Programme (NESP), where
he uses geospatial technology to support electrification planning. He is a 2020 YouthMappers
Research Fellow and Esri Young Scholar. As a young innovator, he is currently building Spatial node,
a platform where geospatial professionals can showcase their works and discover opportunities. His
current interests include Docker, spatial SQL, DevOps, and mobile and web GIS.
Table of Contents
Preface
Section 1: Getting Started with Geospatial Analytics
At its core, geospatial technology provides an opportunity to explore location intelligence and how
it informs the data we collect. First, the reader will see the relevance of geographic information
systems (GIS) and geospatial analytics, revealing the foundations and capabilities of spatial SQL.
Information and instruction will be formatted as case studies that highlight open source data and
analysis across big ideas that you can distill into workable solutions. These vignettes will correlate
with an examination of publicly available datasets selected to demonstrate the power of storytelling
with geospatial insights.
Open source GIS, combined with PostgreSQL database access and plugins that expand QGIS functionality, has made QGIS integration an important tool for analyzing spatial information.
Geospatial analysis serves a range of learners across a wide spectrum of skill sets and industries.
This book is mindful of expert data scientists who are just being introduced to geospatial skills, as well as geospatial experts discovering SQL and analysis for the first time.
Chapter 2, Conceptual Framework for SQL Spatial Data Science – Geometry Versus Geography,
explains how to create a spatial database to enable you to import datasets and begin analysis. You
will also learn the fundamentals of writing query-based syntax.
Chapter 3, Analyzing and Understanding Spatial Algorithms, shows you how to connect databases
created in pgAdmin to QGIS, where you will learn how to join tables and visualize the output by
selecting layers and viewing them on the QGIS canvas.
Chapter 4, An Overview of Spatial Statistics, covers working with spatial vectors and running SQL
queries while introducing you to US Census datasets.
Chapter 5, Using SQL Functions – Spatial and Non-Spatial, demonstrates how to use spatial
statistics in PostGIS to explore land use characteristics and population data. You will learn how to
write a user-defined function and run the query.
Chapter 6, Building SQL Queries Visually in a Graphical Query Builder, contains examples of how
to access a graphical query builder and begin building more complex frameworks by taking a
deeper dive and bringing together skills learned earlier.
Chapter 7, Exploring PostGIS for Geographic Analysis, looks at pgAdmin more closely to
customize workflows by running SQL queries in pgAdmin and visualizing the output within the
geometry viewer.
Chapter 8, Integrating SQL with QGIS, moves spatial queries back to QGIS as you work with DB
Manager and are introduced to raster data and functions.
pgAdmin
PostgreSQL
If you are using the digital version of this book, we advise you to type the code yourself or
access the code from the book’s GitHub repository (a link is available in the next section).
Doing so will help you avoid any potential errors related to the copying and pasting of code.
Code in text: Indicates code words in the text, folder names, filenames, file extensions, pathnames,
dummy URLs, user input, and Twitter handles. Here is an example: “To explore the mining areas in
Brazil, we can use the ST_Area function.”
When we wish to draw your attention to a particular part of a code block, the relevant lines or items
are set in bold: “The synopsis of ST_Within includes the following:”
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words
in menus or dialog boxes appear in bold. Here is an example: “In pgAdmin, we can observe these
results in Geometry Viewer.”
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at
[email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do
happen. If you have found a mistake in this book, we would be grateful if you would report this to
us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would
be grateful if you would provide us with the location address or website name. Please contact us at
[email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you
are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Your review is important to us and the tech community and will help us make sure we’re delivering
excellent quality content.
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite
technical books directly into your application.
The perks don’t stop there—you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
https://packt.link/free-ebook/9781835083147
1
Introducing the Fundamentals of Geospatial Analytics
The process of detecting and quantifying patterns in datasets requires data exploration, visualization,
data engineering, and the application of analysis and spatial techniques. At its core, geospatial
technology provides an opportunity to explore location intelligence and how it informs the data we
collect.
Geospatial information is location data that allows the assessment of geographically linked activities
and locations on the Earth’s surface. Often called the science of “where,” geospatial analysis can provide insight into events occurring within geographic areas—flooding, for example—as well as patterns in the spread of wildfires and urban heat islands (increased temperatures in cities compared to rural areas).
Collecting and analyzing geographic information may also include non-spatial data, providing opportunities to alter the appearance of a map based on non-spatial attributes associated with a
location. Attributes in geographic information systems (GIS) refer to data values that describe
spatial entities. For example, perhaps an attribute table stores information not only on the building
polygon footprint but also indicates that the building is a school or residence.
On your map, you may be interested in viewing wastewater treatment plants or buildings complying
with green energy requirements within a city or neighborhood boundary. Although this information is
non-spatial, you can view and label information associated with specific locations. You are able to
direct map applications to render a map layer based on a wide variety of features.
As you learn how to query your data with Structured Query Language (SQL), the advantages will
include flexibility in accessing and analyzing your data. This chapter will introduce SQL and expand
upon concepts throughout the remaining chapters.
Before we explore the syntax of SQL, let’s introduce or review a few concepts unique to geospatial
data.
In this chapter, we will cover the installation of pgAdmin, PostgreSQL, and QGIS, along with the following topics:
Spatial databases
Exploring SQL
First, let’s become familiar with a few characteristics of geospatial data. You will set up working
environments later in the chapter.
Technical requirements
The following are the technical requirements for this chapter:
PostgreSQL installation for your operating system (macOS, Windows, or Linux)
PostGIS extension
pgAdmin
QGIS
Spatial databases
Geospatial data is vast and often resides in databases or data warehouses. Data stored in a filesystem
is only accessible through your local computer and is limited to the speed and efficiency of your
workspace. Collecting and analyzing geographic information within storage systems of relationally
structured data increases efficiency.
Spatial databases are optimized for storing and retrieving your data. Data stored in databases is accessed through client software. The client connects to a host—the computer, identified by a hostname, where the database is located.
Next, the server listens for requests on a port. There are a variety of ports, but with PostgreSQL, the
default is 5432—but more on that later. The final pieces of information are about security and
accessibility. You will select a username and password to access the database. Now, the client has all
it needs to access the database. The data within a database is formatted as tables and resides with the
host either locally or in an external server linked to a port that listens for requests. This book will
focus on local instances with a mention or two about other options in later chapters.
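Putting these pieces together, connecting to a local database named nyc from the command line might look like the following (psql prompts for the password):

psql -h localhost -p 5432 -U postgres -d nyc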
SQL is a standardization for interacting with databases that abides by American National Standards
Institute (ANSI) data standards. Additionally, SQL has also been adopted as the international
standard by the International Organization for Standardization (ISO), the International
Electrotechnical Commission (IEC), and the Federal Information Processing Standard (FIPS).
SRIDs
Each spatial instance or location has a spatial reference identifier (SRID) because the earth is not a standard ellipsoid. We talk about the surface of the earth, but what does this mean exactly? Do we plunge to the depths of the ocean floor or climb the highest mountains? The surface of the earth varies depending on where you are.
Spatial reference systems (SRS) allow you to compare different maps: if two maps are superimposable, they share an identical SRS. The European Petroleum Survey Group (EPSG) registry is the most popular catalog of these systems.
Figure 1.1 visually demonstrates the discrepancy in mapping distances on the surface of the earth.
The earth's rotation flattens it at the poles and creates a bulge at the equator. This phenomenon
complicates the accurate calculation of the distance between two points located at different latitudes
and longitudes on the Earth’s surface. The coordinate system helps to standardize spatial instances:
Figure 1.1 – Ellipsoid shape of the earth
When working with location data, the use of reference ellipsoids helps to specify point coordinates
such as latitude and longitude, for example. The Global Positioning System (GPS) is based on the
World Geodetic System (WGS 84). You will notice different IDs based on the types of datasets you
are working with—EPSG 4326 is for WGS 84. WGS 84 is the most used worldwide. A spatial column
within a dataset can contain objects with different SRIDs; however, only spatial instances with the same SRID can be used together in spatial operations. Luckily, PostGIS will raise an exception when an incompatibility is detected.
SRIDs are stored in a Postgres table visible in the public schema named spatial_ref_sys. You can
discover all of the SRIDs available within PostGIS (the spatial extension for PostgreSQL) with the
following SQL query:
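-- every SRID registered in this PostGIS database
SELECT srid, auth_name, srtext
FROM spatial_ref_sys;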
Figure 1.3 shows the SRID for data in the NYC database. These are two distinct datasets, but you can
see how variation in SRID might be a problem if you would like to combine data from one table with
data in another. When working with a graphical user interface (GUI) such as QGIS, for example,
SRIDs are often automatically recognized.
You can check the geometry and SRID of the tables in your database by executing the command in
the query that follows. Don’t worry if this is your first time writing SQL queries—these early
statements are simply for context. The dataset included here is a standard practice dataset, but
imagine if you wanted to combine this data with another dataset in a real-world project:
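SELECT f_table_schema, f_table_name, f_geometry_column, srid, type
FROM geometry_columns;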
Next to the srid column in Figure 1.3, you can see geometry types. The next section will introduce
you to the different geometries.
Vector models
Collectively, we call these different geometries vector data. Think of a specific geometry as x- and y-coordinates in two-dimensional space. Geography represents data on a round-earth
coordinate system as latitude and longitude. I tend to use geometry as the convention across the board
but will clarify when relevant. You will see actual data in this chapter, but it is simply illustrative. In
later chapters, you will use a data resource, but until then, if you would like to explore, the data in
this chapter can be found at NYC Open Data: https://opendata.cityofnewyork.us/.
Figure 1.4 displays the output of a query written to explore the geometry of our data. In spatial
queries, geometry is simply another data type in addition to the types we might be familiar with—
integer, string, or float. This becomes important when exploring non-spatial data such as
demographics. When performing analyses, the data type is relevant as only integer and floating
numbers can be used with mathematical functions. Not to worry—there are options to transform
numerical properties into integers when needed. You are also able to identify where your table is
located (schema) and the name of your geometry column (you will need this information when
referring to the geometry in a table). Additional information visible includes the dimensions (two
dimensions), the SRID values, and the data type.
Figure 1.4 – Discovering the geometric types in your data
In the type column in Figure 1.4, we see three unique types: MULTIPOLYGON, POINT, and
MULTILINESTRING. Geospatial data is most often represented by two primary categories: vector data
and raster data.
Roadways are visible in Figure 1.5 as lines or multiline strings. These are a collection of LineString
geometries:
Figure 1.5 – NYC boroughs with shapes and lines
Figure 1.6 displays the five NYC boroughs with neighborhoods shown as polygons. The points
represent subway stations. These are a few examples of how vector geometries are displayed. Most of
these features are customizable, as we will see when exploring layer options and symbology in QGIS.
Figure 1.6 – Map of NYC visualizing points, lines, and polygons
Another data type we will be exploring later is raster data. Although PostGIS has raster extensions
and functions available, you will need to visualize the output in QGIS. For now, let’s understand the
characteristics of raster data.
Raster models
Raster data is organized as a grid, and each cell in the grid contains a value. Bands store boolean or numerical values per pixel. Briefly, pixels, or picture elements—the basic units of programmable information in an image—are organized as multidimensional arrays. What is actually measured is the intensity of each pixel within a band; the data is in the bands. The intensity varies depending on
the relative reflected light energy for that element of the image. Raster data is often considered
continuous data. Unlike the points, lines, and polygons of vector data, raster data represents the slope
of a surface, elevation, precipitation, or temperature, for example. Analyzing raster data requires
advanced analytics that will be introduced in Chapter 8, Integrating SQL with QGIS. Figure 1.7 displays Cape Town, South Africa as a digital elevation model (DEM). The darker areas represent lower elevations:
Figure 1.7 – Raster image of Shuttle Radar Topography Mission (SRTM) elevation and digital topography over Cape
Town, South Africa
The complex composition in Figure 1.7 is representative of raster images and the detailed gradations
visible in land use, scanned maps or photographs, and satellite imagery. Raster data is often used as a
basemap for feature layers, as you will see in QGIS.
Let's use a few simple SQL queries to understand geographical data. PostgreSQL is the open source
relational database management system (RDBMS) you will explore in addition to PostGIS—the
spatial extension—and QGIS, an open source GIS application. You may be familiar with commercial
offerings such as Oracle, SQL Server, or SQLite. Open source PostgreSQL has the PostGIS spatial
extension, which easily transforms data for different projections and works with a wide variety of
geometries.
Exploring SQL
Let’s get a feel for the capabilities and advantages of working with datasets in a spatial database.
SQL is described as a declarative programming language. If you have experience in a coding
environment, this might be a fairly big departure.
In a declarative programming language, you write what you want to do, and the program figures it
out. SQL is a query language that sends queries to manipulate and retrieve data from the database. In
an object-oriented programming (OOP) language such as Python, for example, you need to specify
system calls incorporating libraries and various packages. Although Python’s syntax is readable, familiarity with built-in data structures, powerful libraries, and frameworks—and a commitment to scripting expertise—is required. PostgreSQL includes extensions to write procedures in additional
programming languages such as Python and R, for example, but the focus here is on SQL and spatial
SQL.
In Figure 1.8, you will see a simple SQL statement. Statements have clauses (SELECT, FROM, or WHERE), expressions, and predicates. Although capitalizing keywords is a common practice to assist readability, SQL keywords are not case-sensitive.
First, let’s examine the statement in more detail. The data is from the Department of Environmental
Protection (DEP) publicly available data (https://data.cityofnewyork.us/City-Government/DEP-s-
Citywide-Parcel-Based-Impervious-Area-GIS-St/uex9-rfq8). The objective is to examine NYC-wide
parcel-based impervious surfaces. In geospatial analysis, the structure and characteristics of land use,
land cover, and surfaces are often explored. Impervious surfaces describe hard surfaces—primarily
man-made—that prevent rainwater from reaching groundwater or flowing into waterways and
contribute to increased surface temperatures in urban cities. Asphalt parking lots, concrete, and
compacted soil in urbanized development are familiar examples.
SELECT statements
Entering SELECT * FROM public."table name" in the query tool displays all of the columns in the table. The asterisk saves time, as you will not need to ask for each column by name—they will all be returned with the query. Looking at the query in the pgAdmin console in Figure 1.8, FROM MN_pIAPP after the SELECT keyword refers to the table of percent impervious surfaces in the borough of Manhattan in NYC.
Often, when referring to your table in a SQL query, double quotes will be needed around identifiers, while single quotes refer to the values in a column. When you use lowercase as the default, there is an option to omit the double quotes—for example, by renaming the table mn_piapp.
WHERE statements
The WHERE clause introduces an expression to return values less than 47.32. The statement will return rows for surfaces that are less than 47.32% impervious. If the predicate evaluates to True, the row is returned in the result. A predicate determines which rows are returned by a specific query. Figure 1.8
illustrates the impervious surfaces in the geometry viewer in pgAdmin:
Figure 1.8 – Viewing Manhattan impervious surfaces in Geometry Viewer in pgAdmin
The impervious surfaces are highlighted in the graphic viewer. The output is surfaces less than
47.32% impervious. Change the percentage, and the output will update once you run the following
code and view the previous graphical image:
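-- a sketch of the Figure 1.8 query; the column name is illustrative, so match
-- it to your imported table
SELECT *
FROM public."MN_pIAPP"
WHERE percent_impervious < 47.32;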
The polygons are visible, but what about our basemap? In our GUI, pgAdmin (introduced later in the chapter), we are able to see a basemap if we transform the coordinates to the 4326 SRS or SRID. This is because the viewer relies on OpenStreetMap (https://www.openstreetmap.org/#map=5/38.007/-95.844), projected in 3857, which will also yield a basemap if this is the SRID of your data. QGIS is a far more powerful visualization tool, and you will explore the PostGIS integration in later chapters. This may be updated to include other projections by the publication of the book, so be on the lookout.
ST_Transform is a PostGIS function that reprojects a geometry to a different SRID. Here, we transform the geometry to 4326:
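-- reproject to SRID 4326 so the basemap renders; column names are illustrative
SELECT gid, ST_Transform(geom, 4326) AS geom
FROM public."MN_pIAPP"
WHERE percent_impervious < 47.32;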
We are able to view the basemap in Geometry Viewer in pgAdmin, shown in Figure 1.9:
Figure 1.9 – Viewing Manhattan impervious surfaces in Geometry Viewer in pgAdmin with a basemap
The rapidly expanding field of geospatial analytics is evolving across a wide variety of disciplines,
specialties, and even programming languages. The potential insights and deeper queries will likely be
at the forefront of skill acquisition; however, it is critical that you first explore fundamental
geospatial techniques regardless of your preferred computer language or even if you currently aren’t
working within a particular coding practice.
This book will assume no prior knowledge in either programming or geospatial technology, but at the
same time offers professionals anywhere the ability to dive deeper into detecting and quantifying
patterns visible through data exploration, visualization, data engineering, and the application of
analysis and spatial techniques.
At its core, geospatial technology provides an opportunity to explore location intelligence and how it
informs the data we collect. To get started, you will first need to download a few tools.
Installation of PostgreSQL
PostgreSQL is an open source relational database server with multiple ways to connect. PostgreSQL
is also known as Postgres, and I will be referring to the database server as Postgres. Remember
—the “client” is the computer or software making requests of the server. You can access Postgres from the command line in Terminal on macOS or the Command Prompt on Windows, through programming languages such as
Python or R, or by using a GUI. You will be running Postgres locally and will need both the client
and the server (Postgres).
Installation instructions here are specific for macOS but are also pretty straightforward for other
operating systems. Download from https://www.postgresql.org/ to select system-specific options, as
shown in Figure 1.10:
Selecting your operating system of choice will yield a series of options. The Postgres.app installation
for macOS also includes PostGIS, which is the spatial data extension. The examples that follow do
not utilize the interactive installer. As I am working on a late model operating system, the potential
incompatibilities down the road pointed toward using Postgres.app. There are advantages either way,
and you can make a choice that works best for your workflow.
When I first began working with PostgreSQL, I noticed many resources for Windows-compatible
installations—they are everywhere—but what was lacking, it seemed, was instruction for those of us
working with macOS. The focus here will be primarily on macOS because there is not a compatible
import function directly for macOS at the time of this printing.
See the following tip box for the EnterpriseDB (EDB) PostgreSQL installer information and other
downloads selected for the installer option.
TIP
The installer option will download a setup wizard to walk through the steps of the installation. EDB includes additional
extensions to simplify the use of PostgreSQL but for now, download the software onto your local computer.
Windows users and even macOS operating systems can opt for the installation wizard. Follow these
steps to set up the installation:
1. Navigate to the option shown in Figure 1.11 if you are selecting this option:
2. Stack Builder simplifies downloading datasets and other features we will explore. You can see it being launched in Figure 1.12:
Figure 1.12 – Launching Stack Builder located inside the PostgreSQL application folder
Depending on your setup options or work environment, select the components you would like to
download, as given in Figure 1.13.
3. Select the four default components, as shown in Figure 1.13:
Figure 1.13 – Default components in the installation wizard
Windows has an extensive presence for guiding the installation and a separate installer that includes
pgAdmin for managing and developing PostgreSQL, and Stack Builder. Stack Builder is also
available in the interactive installer for macOS, but the utility may depend on which version of
macOS you have installed.
Unless you are connecting to an existing server or host, localhost should be entered, as seen in
Figure 1.14. Regardless of your selection of operating system, the default port is 5432. This setting
can be changed, but I would suggest keeping the default unless there is a compelling reason to create
a different port—for example, if it is already in use. The superuser is postgres. This identity is a
superuser because you will have access to all of the databases created. You would customize this if
different databases had different authorization and access criteria. Select your password carefully for
the database superuser. It is a master password that you will need again.
Figure 1.14 – Creating a server instance in Postgres
Now that you have a database server installed locally (downloaded to your computer), you will need
a GUI management tool to interact with Postgres. Only a single instance of Postgres is able to run on
each port, but you can have more than one open connection to the same database.
Installation of pgAdmin
pgAdmin is the GUI selected as a graphical interface or command-line interface (CLI) for writing
and executing SQL commands. This is by no means the only option, but pgAdmin works across
operating systems and is available for download here: https://www.pgAdmin.org/.
1. Figure 1.15 shows the dashboard you will see once you download the software:
2. Next, you will need a name for your server, as shown in Figure 1.16. You have the ability to add servers as needed. The default settings are fine, and the default is Postgres:
3. Right-click on Databases, and you will see the option to create a new database. Figure 1.17 shows the steps for adding a database. You are able to customize, but for now, this is introductory for familiarizing you with options. When selecting a longer name, you will need to use underscores between words—for example, long_database_name:
By default, the new database defaults to a public schema, as shown in Figure 1.18. Schemas can be
customized, as you will see. The tables and functions folders are also empty. The PostGIS extension
has not been added yet and will need to be integrated with each database to introduce spatial
functionality:
Figure 1.18 – Customizing databases in pgAdmin
The PostGIS extension will need to be added to each database you create.
CREATE statements
Let’s go through the following set of instructions:
1. Select the Extensions option, right-click, and then write CREATE EXTENSION postgis into the query tool, as shown in
Figure 1.19.
2. Click the arrow and run the cell. You will see the extension added on the left once you click Extensions and select Refresh.
3. Also, add CREATE EXTENSION postgis_raster as we will need this extension when we work with raster models. Be
certain to refresh after running the code:
4. Returning to the Functions option, we now see that 1,171 functions have been added. Figure 1.20 shows a few of the new spatial functions updated once the PostGIS spatial extension has been added to the database. You will also observe that a spatial_ref_sys table has been added. Explore the columns and note the available SRIDs within the database:
You are now ready to begin working with spatial SQL in the Postgres database.
QGIS
QGIS is an open source GIS. Although you will be able to view maps on the pgAdmin dashboard, we will explore QGIS's advanced map-viewing capabilities, ease of data capture, and ability to integrate with Postgres. Download QGIS to your desktop to complete the installation of tools by
using the following link: https://qgis.org/en/site/index.html. Figure 1.21 is the same map created
earlier in the graphical interface in pgAdmin rendered in QGIS.
Figure 1.22 demonstrates the additional viewing capabilities when QGIS renders the output of the
same query and the same data.
Figure 1.22 – QGIS map showing the location of where in Manhattan the percentage of impervious soil < 47.32
Summary
In this chapter, you were introduced to geospatial fundamentals to help you understand the building blocks of spatial data science. You discovered a few basic SQL queries and set up your workflow for
the chapters that follow. Important concepts about the properties and characteristics of vector and
raster data were also introduced.
In Chapter 2, Conceptual Framework for SQL Spatial Data Science – Geometry Versus Geography,
you will begin learning about SQL by working with the database you created in this chapter.
The fundamentals of a query-based syntax will become clearer as you discover spatial functions for
discovering patterns and testing hypotheses in the datasets.
2
Conceptual Framework for SQL Spatial Data Science – Geometry Versus Geography
This is the same challenge when bringing multiple datasets together. We need to have confidence in
our source of “truth.”
In this chapter, we will explore the challenges of bringing datasets together, learn about a few psql
meta-commands in terminal, begin writing SQL functions, and begin to think spatially. Specifically,
we will be covering the following topics:
Creating a spatial database
Importing data
Technical requirements
I invite you to find your own data if you are comfortable or access the data recommended in this
book’s GitHub repository at: https://github.com/PacktPublishing/Geospatial-Analysis-with-SQL.
The following datasets are being used in the examples in this chapter:
NYC Open Data: https://opendata.cityofnewyork.us
2. Navigate to your Browser panel and right-click on Databases. You can see the databases listed on the left side of Figure 2.2.
Here, you will name the database and select Save. You can create as many as you would like, but for our purposes, one will
suffice.
Now, you are able to create a database. Perhaps you’ve already accessed pre-existing databases. If
this is the case, you will be including the server address as the host when logging in. Remember, you
must enable PostGIS in any database requiring spatial functions. This is achieved with the following
command:
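CREATE EXTENSION postgis;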
You write this code into Query Tool, available in each database you create to access spatial
functions.
3. Run the code within the database you are creating. Installing the extension into a schema will help to keep the functions listed in
their own schema (or container), and there will be less information you need to scroll through when working with tables and
different databases.
4. Name your database for the book or create your own hierarchy of files. I am creating a schema for the data being uploaded for
each chapter. You may not require this operational level, but it works when creating files and creating folder access. One distinct
advantage of creating unique schemas instead of simply relying on the public schema is accessibility. You won’t need to scroll
through your public schema for all of the table instances. Instead, select a specific schema and go directly to your data.
5. Right-click on the public schema within the database you are working in. Select Create | Schema and add the new schema. In
Figure 2.2, the schema has been added, and we will upload data and run queries inside the schema. The advantage for me is that
each schema will hold the data for each chapter in the book.
Before we import data into our new database schema, let’s review a few important details about
spatial functions. What is the difference between geometry and geography? Let’s find out in the next
section.
Geometry versus geography
PostGIS distinguishes between geometry and geography—geometry being Cartesian for flat surfaces,
and geography adding additional calculations for the curvature of the earth. Think back to working
with x- and y-coordinates on a plane. If considering magnitude and direction, we now have a vector.
We can also talk about vectors in three-dimensional space by including a z-axis. In general, if you’re
dealing with small areas such as a city or building, you don’t need to add in the extra computing
overhead for geography, but if you’re trying to calculate something larger where measurements
would be influenced by the earth’s curved surface, such as shipping routes, for example, you need to
think about geography. It would not be accurate to only consider a planar Cartesian geometry.
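As a quick illustration (the coordinates here are arbitrary, not from our datasets), measuring the same two points as geometry returns planar degrees, while geography returns meters on the spheroid:

SELECT
  ST_Distance(
    'SRID=4326;POINT(-73.99 40.73)'::geometry,
    'SRID=4326;POINT(-74.01 40.75)'::geometry) AS planar_degrees,
  ST_Distance(
    'SRID=4326;POINT(-73.99 40.73)'::geography,
    'SRID=4326;POINT(-74.01 40.75)'::geography) AS spheroid_meters;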
This is where spatial PostGIS functions can help us. The data stored in a geometry column is often a
string of alphanumeric characters, known as extended well-known binary (EWKB) notation and
shown in the geom column in Figure 2.3:
Figure 2.3 – EWKB notation for geometry
Between these two representations—geometry (all the data lies on a plane, such as a map) and geography (all the data lies as points on the surface of the Earth, reported as latitude and longitude)—the difference becomes relevant depending on our data questions. The ST_AsText(geom) function converts the binary representation into human-readable well-known text (WKT).
If you want to see the geographic coordinate system (GCS)—actual latitude longitude information
—you’ll need to execute SELECT gid, boroname, name, ST_AsText(geom) FROM nyc_neighborhoods
to see the actual latitude/longitude data that’s being rendered on your screen, as shown in Figure 2.4.
The new column now shows each MULTIPOLYGON as readable text:
Figure 2.4 – Looking at the geometries of a single table
You will need to know the types of geometries listed in your database as well as their projections.
Every database includes a spatial_ref_sys table and will define the SRS known for each database.
This will matter a little later. There is also a geometry_columns view that lists every geometry column in your database, along with its schema (the f_table_schema column), table name, SRID, and type. You can inspect it with the following query:
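SELECT * FROM geometry_columns;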
Certain functions need to be in a particular format. Mathematical calculations, for instance, require
integer or floating-number formats. SQL CASE statements are useful in addressing mixed-use columns
in SQL tables. This is the basic format for CASE summaries:
SELECT
CASE
WHEN GeometryType(geom) = 'POLYGON' THEN ST_Area(geom)
WHEN GeometryType(geom) = 'LINESTRING' THEN ST_Length(geom)
ELSE NULL
END AS measure
FROM sometable;
When joining different tables, this will matter. If the tables are not consistent, you will get an error.
You might be curious about the template0 and template1 databases. These are used by the CREATE
DATABASE command. The postgres default database is also listed in Figure 2.6:
If nothing is returned, you aren’t actually connected to a database, or the database (likely the
postgres default) does not have any tables.
\c is a shortcut for \connect, and it allows you to switch to a different database, as illustrated in the
following code snippet. Switching to a different database lists the tables and schema for each file in a
format similar to what is shown in Figure 2.7, depending on the number of tables:
\c nyc
The output is shown, along with the database where the tables are located:
Figure 2.7 – Listing the files in your database
You are also able to see the schemas connected to your database by writing the following command:
\dn
I tend to rely on terminal for troubleshooting, so these are the most useful commands for my
workflow. For example, I once had two versions of PostgreSQL in pgAdmin. I had forgotten that this is absolutely possible, but that they would need different ports. I thought my databases were connected; checking in terminal, I realized it was a port issue and was able to update it.
NOTE
If you are interested in more advanced queries, the documentation for PostgreSQL includes a complete list of these
options and commands:
https://www.postgresql.org/docs/current/app-psql.html
There are many ways to import data into the database. The next section will present a popular way if
you do not have the Windows environment on your computer.
Importing data
The simplest way to import data, if you are on a Mac especially, is by using QGIS.
You should now have your database created in pgAdmin. Remember you will need to have
PostgreSQL open to connect to the server before launching pgAdmin.
Now, open QGIS. The server port and the QGIS server port should both be 5432 (or whichever port
you have assigned). Scroll to the Browser panel. If it isn’t already docked on your canvas, select
View >> Panels and select Browser and Layer to add them to your workspace.
Figure 2.8 shows the Browser panel. On the left of the screenshot, you can scroll to your Home
folder to see your folder hierarchy from your local computer. I download datasets to my downloads
folder. Scroll to your downloads folder:
Right-click on the folder and open it to view a shapefile or other geometry. Choose it and select Add Layer to Project. Next, I always set the CRS by right-clicking on the data layer and selecting Layer CRS.
The data is loaded in Figure 2.8 but we have not filtered or queried it—we are simply adding the data
layer to the map. In DB Manager, you can change the name of any of the files. I suggest shorter but informative names. You will understand why when we begin applying functions to
the tables and the column names they contain. When you add the data layer to the project, you can
navigate to DB Manager and import the shapefile to your database. There is a Database option on
the menu across the top of the console and also an icon if you have these functions included in your
toolbar setup.
We will repeat this with different file types in later chapters, but for now, we are working with shapefiles. The layers in your layers panel are accessible here, and you can select a schema (public, or
another you created) and the name of your table (self-populates with layers listed in the window)—
see Figure 2.9. QGIS is going to populate your Source SRID and Target SRID values but I always
check Create spatial index. The spatial index allows queries to run faster and becomes important
when working with larger datasets in PostGIS databases:
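If you ever need to create the index yourself, the checkbox corresponds to a GiST index statement along these lines (the table name here is hypothetical):

CREATE INDEX nyc_subway_stations_geom_idx
ON nyc_subway_stations
USING GIST (geom);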
When you are first creating a connection to PostgreSQL, you will notice the connection option in the
Browser panel when you scroll. Right-click on the icon and create a new PostGIS connection. You
can name the connection anything, but it will be listed in the Browser panel, so choose something
relevant.
The Host value is localhost or the IP address of a server. These credentials should be the same as
your pgAdmin login information unless you are working with multiple servers. Each server has a
unique port, so make sure you have selected port 5432.
Figure 2.10 demonstrates the PostGIS connection. Database is the name of the database you are
connecting with and where your tables will be imported. You can select whichever name you would
like for this connection. The only information that is shared with pgAdmin is the database and the data
you are uploading. The other defaults are fine (we will explore more of these options in later
chapters), although I suggest the Configurations authentication option. This keeps your credentials encrypted, unlike Basic—which also stores your credentials, but leaves them discoverable. Perhaps this is okay if it is your personal computer, but I encourage you to habitually protect your credentials:
Figure 2.10 – Creating a PostGIS connection in QGIS
Now that we have some data to explore, go to pgAdmin and refresh the databases. The datasets you
upload in QGIS will be visible in the schema under the database you accessed.
Next, let’s explore the SQL syntax and how to write queries.
This will make sense when we begin thinking about data questions and what information to consider.
The data in Figure 2.11 is from the Census Reporter database (https://censusreporter.org/2020/). The 2020 census, although completed, does not have all of its data tables available for download at the time of publication. Here, you can select data from the redistricting release. We are going to download H1
or occupancy status data for New York, NY Community Districts. Occupancy status is an important
measure of the economies of communities:
Figure 2.11 – Census Reporter 2020 census data
The folder that you downloaded contains a metadata JSON file that explains the data you are downloading.
Repeat the following code for each column name you would like to change:
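-- table and column names are illustrative; check the metadata JSON for yours
ALTER TABLE ch2.occupancy_status
RENAME COLUMN "H1_001N" TO total_units_2020;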
The metadata provides the information necessary for us to understand the data within each column.
We now know what the identifiers are referring to in Figure 2.12:
Notice the quotations included in the table call. There are reasons for using and not using quotations.
In the early coding examples, I have included a variety of habits to bring them to your attention.
There is nothing wrong with quotations, but you really only need them if you want to preserve the
capitalization of the table. Postgres will convert names into lowercase unless you wrap them in
quotes. More importantly, if you are using reserved words, you will need to use quotes. It is better to
simply avoid using them. You will also see that AS is required when aliasing with some of them. A few of the most common are USER, WHEN, TO, TABLE, CREATE, and SELECT; a more exhaustive list can be found here:
https://www.postgresql.org/docs/current/sql-keywords-appendix.html.
The following code snippet shows how we are renaming the columns to better reflect what they are—occupied status and vacancy status in 2010 and 2020:
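-- a sketch with illustrative column names; repeat for each column of interest
ALTER TABLE ch2.occupancy_status RENAME COLUMN "H1_002N" TO occupied_2020;
ALTER TABLE ch2.occupancy_status RENAME COLUMN "H1_003N" TO vacant_2020;
-- repeat for the 2010 columns, for example occupied_2010 and vacant_2010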
This chapter is focused on introducing the foundational SQL syntax. Figure 2.13 shows the results of
renaming columns. This is an important skill for enhancing readability and is especially helpful when
you build out more lines of SQL syntax:
Figure 2.13 – Renamed occupancy status table
Organizing the table names and columns, especially when working with census data, will help your
continued exploration of the dataset. Not all columns are relevant to each query, and in the next
section, you will learn how to select only columns of interest.
Identifying tables
When you have a large dataset with numerous columns, you will want to select only those columns of
interest. The basic structure of a query to select columns is shown here:
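SELECT column_one, column_two
FROM schema_name.table_name;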
Using the rubric for selecting columns and populating with our columns of interest, we can see the
visual in Figure 2.14 by selecting Geometry Viewer:
Figure 2.14 – Occupancy status
Hovering and clicking on a block assignment will also reveal data when entering the following code:
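-- include the geometry column so Geometry Viewer can render each block;
-- column names are illustrative and follow the renaming above
SELECT geoid, occupied_2020, vacant_2020, geom
FROM ch2.occupancy_status;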
Simple statistical questions such as the average of a particular value are easy to ask in the query
builder. The answer is provided in the output:
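-- names are illustrative
SELECT AVG(vacant_2020) AS average_vacant
FROM ch2.occupancy_status;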
Let’s get curious. Figure 2.15 shows us where vacancy counts are between 5,520 and 9,108. To
continue exploring, include the percentage change columns and rename them. It might be more
accurate to know the percent change between the years instead of simply the raw number.
For now, let’s see the vacancy distribution by running the following code:
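-- names are illustrative; the range matches Figure 2.15
SELECT geoid, vacant_2020, geom
FROM ch2.occupancy_status
WHERE vacant_2020 BETWEEN 5520 AND 9108;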
We can now see the block assignments with vacancy counts within our selected range:
Figure 2.15 – Filtering occupancy vacancy rates by total count
SELECT statements are a powerful tool and an important piece of syntax to locate and retrieve data.
These basic queries are helpful to master as they provide the framework for spatial functions.
Experiment with a few to see how your skills are progressing:
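-- a few patterns worth practicing (names follow the examples above)
SELECT COUNT(*) FROM ch2.occupancy_status;

SELECT geoid, vacant_2020
FROM ch2.occupancy_status
ORDER BY vacant_2020 DESC
LIMIT 10;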
The ability to translate the sample syntax and apply it to a dataset is a worthwhile endeavor. Look for
datasets that include integers, and even if the output isn’t a valid question, simply learning how the
syntax is executed can be helpful when applied outside of a practice dataset.
How can we find out the type and frequency of these complaints and where they are located? Are there areas where the complaints are handled more efficiently?
Let’s learn how to join tables and analyze data based on additional columns and data types. Run the
following code to use INNER JOIN on the two tables:
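-- a sketch of the join; table and column names are illustrative of the NYC
-- datasets used in this chapter
SELECT complaints.complaint_type, complaints.complaint_status, hoods.boroname
FROM ch2.indoor_complaints AS complaints
INNER JOIN ch2.nyc_neighborhoods AS hoods
  ON complaints.borough = hoods.boroname;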
Although tables are not visually informative in the same way as maps, they can be instrumental in
formulating questions and bringing datasets together. Figure 2.16 shows a join of tables in pgAdmin:
Figure 2.16 – Join of tables in pgAdmin
You can call INNER JOIN simply JOIN. You can find more information about JOIN available in the
documentation, at https://www.postgresql.org/docs/current/tutorial-join.html.
Opening QGIS and adding the panel for QGIS SQL queries, we can view the data in Figure 2.17.
Then, locate the datasets within QGIS and drag them onto the canvas:
The utility of SQL queries to ask specific questions that filter data down to address the impacted
communities is clearly observed:
Historically, Brownsville was identified as the most dangerous neighborhood in Brooklyn. Additional data questions might include the current rate of violence, sources of problematic indoor air quality throughout the identified neighborhood, types of housing, and proximity to roadways or sources of pollution. A final JOIN example unites the table geometries, returning a geometry that is the union of the two input tables. Figure 2.19 displays the information in Geometry Viewer in pgAdmin:
Let’s continue to explore SQL expressions within the console in QGIS. I use pgAdmin as a powerful query editor to familiarize myself with data, but to ask bigger questions, I move over to QGIS.
NOTE
Chapter 3, Analyzing and Understanding Spatial Algorithms, will highlight the advantages of using a GUI.
We can union the geometry so that we can locate the complaint of Mold and see where complaints are
still open in the system. Locate the blue dot in Figure 2.19 after running the following SQL query:
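-- a sketch of the union; table and column names are illustrative
SELECT ST_Union(complaints.geom, hoods.geom) AS geom
FROM ch2.indoor_complaints AS complaints
JOIN ch2.nyc_neighborhoods AS hoods
  ON ST_Intersects(complaints.geom, hoods.geom)
WHERE complaints.complaint_type = 'Mold'
  AND complaints.complaint_status = 'OPEN';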
You now know how to create a union between two different tables in PostGIS. The aforementioned
mold complaint is still open in the system. The ability to explore non-spatial information alongside location data can provide additional clues. Are there neighborhood characteristics that might influence how quickly complaints are managed? We could also look for
patterns in the data to see what we might learn about the outcomes of complaint_status.
Viewing tables in pgAdmin is not the same as being able to locate a precise location on a map. Often,
visualizing information brings more questions to mind for further analysis. Geospatial analysis is the
tool for digging deeper.
Summary
In this chapter, we were introduced to additional spatial considerations and how to differentiate between a planar Cartesian coordinate system and a geographic, round-earth coordinate system. Creating and exploring
databases is fundamental to building SQL skills, and both were introduced as foundational to
working with datasets in the chapters that follow. We learned how to import data into QGIS that
synchronizes with the database(s) we created in pgAdmin. And with this, the application of SQL
fundamentals to query datasets will prepare us for problem-solving in the next chapter, using
geometric data and integrating GIS functions into a modern SQL environment.
3
Analyzing and Understanding Spatial Algorithms
The next steps include interfacing with QGIS and running geospatial queries in a GUI that will allow
advanced filtering and layer styling, across databases.
This chapter will cover importing a variety of file types into databases along with connecting to
databases and executing SQL queries in both pgAdmin and QGIS. We will also cover spatial joins and
how to visualize the output.
Technical requirements
I invite you to find your own data if you are comfortable or access the data recommended in this
book’s GitHub repository at: https://github.com/PacktPublishing/Geospatial-Analysis-with-SQL.
The following datasets are being used in the examples in this chapter:
Affordable Housing Production by Building (https://data.cityofnewyork.us/Housing-Development/Affordable-Housing-
Production-by-Building/hg8x-zxpr)
Let’s begin by understanding spatial indexing. Unique to spatial databases, spatial indexes are necessary to expedite searches and improve the speed of our queries.
But wait – you might be thinking – how do you index geometries? Isn’t that what makes a spatial
database different? The simple answer is that a spatial index operates on bounding boxes—not on the exact shapes traced by a geometry's edges. Bounding boxes are rectangular polygons containing an object
or an area of interest. Identified by xmax, xmin, ymax, and ymin, a bounding box is slightly different
from an extent, as it can contain an extent but does not have to be the same. The xmin and ymin
coordinates describe the bottom-left corner of the bounding box or rectangle, while xmax and ymax are the coordinates of the top-right corner of the bounding box or rectangle.
When we write functions about place and location, it matters how efficient (and quick) the data processes are at generating their results, especially when analyzing large datasets. The following
figure is a simplified graphic showing how intersecting lines generate bounding boxes and how this
expedites processing.
This evolution is aligned with the ISO standard SQL-MM defining the spatial type and associated
routines. In this case, MM stands for multimedia.
This is all you need to know about it because all prior formats have been deprecated (and are
tolerated but not supported). Not all of the functions will automatically use the spatial index, but fortunately for us, the most popular ones do. The && operator in Figure 3.3 selects bounding boxes that overlap or touch. The operator performs an index-only comparison.
Now run the following code and notice how many rows are returned. The output for Figure 3.3 yields
2,790 rows that meet the criteria of having extremely low-income units.
SELECT *
FROM ch3."Affordable_Housing_Production_by_Building" AS borough
JOIN ch3."DOHMH_Indoor_Environmental_Complaints" AS incident_address_borough
  ON borough.geom && incident_address_borough.geom
-- the filter must reference the column as a double-quoted identifier, not a
-- single-quoted string literal; adjust the name to match your import
WHERE borough."extremely_low_income_units" NOT LIKE '%0%';
Now, what happens when we add the ST_Intersects function? The query in Figure 3.4 applies the ST_Intersects function, so instead of selecting each bounding box that intersects another bounding box, the more accurate function returns only the rows whose geometries actually intersect the borough where the complaints are being generated:
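-- the same join as before, with the bounding-box operator replaced by
-- ST_Intersects; the income column name is illustrative
SELECT *
FROM ch3."Affordable_Housing_Production_by_Building" AS borough
JOIN ch3."DOHMH_Indoor_Environmental_Complaints" AS incident_address_borough
  ON ST_Intersects(borough.geom, incident_address_borough.geom)
WHERE borough."extremely_low_income_units" NOT LIKE '%0%';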
When you run the function, you will notice in Figure 3.4 that only 2,769 rows are returned. This may
not be evident visually when working with a smaller set of data but this matters when we begin
working with more complex data questions.
The ANALYZE and VACUUM commands can also be run individually or together against a database (I don’t recommend it, as it takes forever), a table, or even a column. The use case for me is when I have been manipulating a table by deleting and inserting data.
PostgreSQL doesn’t actually remove a deleted row from the table but simply masks the row so
queries don’t return that row. The VACUUM function marks these spaces as reusable.
The ANALYZE command collects statistics about the values in table columns. How are the values
distributed? You can run the operation and find out.
Both of these functions need to be used cautiously, with additional information and cautions for
VACUUM (https://www.postgresql.org/docs/current/sql-vacuum.html) and ANALYZE
(https://www.postgresql.org/docs/15/sql-analyze.html) in the user documentation for PostgreSQL.
It is good practice to refresh the planner statistics by running the following code snippet:
ANALYZE ch3."Affordable_Housing_Production_by_Building";
What happens when we apply the VACUUM command and reclaim the free space? The query is returned even faster:
VACUUM ch3."Affordable_Housing_Production_by_Building";
I think it is easier to appreciate the SQL functions by working with real data. We are going to start
building stories. But first, we need to understand the dataset.
The dataset introduced earlier is from NYC OpenData. You explored the DOHMH data in Chapter 2, Conceptual Framework for SQL Spatial Data Science – Geometry Versus Geography. The Affordable Housing Production data is building-level data counted toward Housing our Neighbors: A Blueprint for Housing and Homelessness,
https://www1.nyc.gov/assets/home/downloads/pdf/office-of-the-mayor/2022/Housing-Blueprint.pdf.
Estimates show that one-third of New York City residents spend 50 percent of their income on rent,
and the number of children sleeping in shelters is increasing. Bold changes are being enacted and we
have data to explore how this impacts the community infrastructure and how people are living.
The first step for data analysis is not only exploring the dataset but also thinking about the types of
questions to explore. Interesting variables to explore include the types of units being built and how
they are contributing to improving the housing and affordability crisis. For one, the housing inventory
is not increasing at the same rate as the rapid population and job growth in NYC over the last few
decades. The data in Figure 3.5, available at the provided link, allows you to explore the variables.
Exploring the characteristics of a dataset is instrumental for generating SQL queries and formulating
questions for deeper insights.
extremely_low_income_units are units with rents that are affordable to households earning 0 to 30%
of the area median income (AMI). There is a column heading, borough, shared with other NYC datasets, so there are options to explore.
Knowing the data type of a column provides information about the types of functions we can apply
to the data.
The data is imported as a vector layer by Data Source Manager in QGIS. Once you complete the fields in Figure 3.6, you will have a data source with a geometry column. Often, QGIS will detect the geometry column, but you may have to add the column headings (exactly as they are listed in the data) in
X_POSSIBLE_NAMES and Y_POSSIBLE_NAMES:
Figure 3.6 – Importing a delimited file into QGIS-defining coordinate systems in SQL
QGIS will not render your layer correctly if it is unable to confirm the projection. You can scroll
through the options listed in Figure 3.7 when you right-click on the layer:
Once you add the layer to your project, it is easily imported into DB Manager. Once you import and
refresh, the table you just imported will be available in pgAdmin as well. In Figure 3.8, the input will
include the data layers you have added to your project. I typically add the layers to the project when
uploading to simplify this step.
You select the Schema option, and Table will display the database you selected when connecting to
PostGIS. Although it isn't necessary, I select Source SRID and Target SRID. This is something that
QGIS already detects, but for me, the act of checking the checkboxes is a reminder to pay attention in
case an error occurs when running code.
Another habit I recommend is checking the Convert field names to lowercase and Create spatial
index checkboxes. Up to this point, I simply uploaded the tables as they were named in the source datasets.
This isn't a deal breaker, but you may have noticed the need for double quotes when referring to the
table names when writing a query. Converting field names to lowercase lets pgAdmin operate in its
preferred environment: you will only need quotes when a name contains an SQL reserved word,
uppercase letters, or a leading number. Wrapping it in double quotes tells pgAdmin it is a field name and
not a function or keyword.
Now that we have our data and a few best practices to move forward, let’s look at specific data
questions in the next section.
Exploring pattern detection tools
There is a column in our data that needs a bit of history. It is labeled Prevailing Wage Status and it
is derived from the Davis-Bacon Act, requiring building contractors to pay unskilled laborers the
prevailing wage, which reduces opportunities for unskilled or low-skilled workers. Prevailing wages
reflect the compensation paid to the majority of workers within a certain area and are often described
as union wages.
Smaller firms owned by minorities are typically non-unionized and unable to pay these higher wages.
The history of the act is tinged: it was passed in 1931 in part to prevent non-unionized black and
immigrant workers from working on federally funded construction projects. The modern implications
of the act in NYC can be observed in Figure 3.9:
Figure 3.9 – Prevailing wages (red) and non-prevailing wages (white) in NYC
When we analyze the data, we ask questions and notice patterns. After reviewing the data
in pgAdmin, there is a CONFIDENTIAL response in the project name column. Since these rows will not be
rendered in our map, we can remove them. At the same time, we can isolate the buildings that actually
contain extremely low-income units by excluding values that begin with 0. The NOT LIKE '0%' pattern
rejects values starting with 0, since we only want the locations with these units. Using NOT LIKE '%0%'
instead would remove every value containing a zero anywhere – a value of 20 would be excluded, which is
not our desired outcome – so the 0% pattern only filters zeros in the first position. Figure 3.10 renders
a map when you select the geometry column, and if you hover over a point, additional attribute data is provided.
You will also learn how to simplify the syntax when working with more complex SQL strings. I want
you to be familiar with what happens if we bring in our data without accommodating capitalization or
if reserved words are used in column names. However, for now, by placing table and column
names in double quotes and string values in single quotes, our code is executable:
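A sketch of the query being described; the exact column names are assumptions about this dataset:

SELECT "Project Name", "Extremely Low Income Units", geom  -- hypothetical column names
FROM ch3."Affordable_Housing_Production_by_Building"
WHERE "Project Name" NOT LIKE 'CONFIDENTIAL'
  AND "Extremely Low Income Units"::text NOT LIKE '0%';  -- cast in case the units were imported as numbers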
Additional exploration in areas that reflect patterns of Extremely Low Income Units could be the next
step in your analysis.
The next chapter will introduce you to working with census data. The queries will be a little more
complex, so let’s get acquainted with a few more.
Earlier in the chapter, we talked about indexing. You can index data in a table or column. CREATE
INDEX is a PostgreSQL extension and not part of the SQL standard. Indexes have a default limit of 32
columns, which can be expanded when PostgreSQL is built. Indexes can also be problematic, and I advise you to
explore whether you indeed need one. The default index method is B-tree, but you can request any of the
available methods: btree, hash, gist, spgist, gin, or brin.
In a spatial database, spatial queries often span multiple tables, and the database has to compare and
order the geometries appropriately. This can take a lot of time, and a spatial index is one way of addressing this
complexity.
Run the following code if you would like to try an index. It can also be removed using DROP INDEX:
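A minimal sketch follows; the index name is my own choice, and GIST is the method typically used for geometry columns:

CREATE INDEX affordable_housing_geom_idx
ON ch3."Affordable_Housing_Production_by_Building"
USING GIST (geom);

-- Remove the index if you decide you don't need it
DROP INDEX ch3.affordable_housing_geom_idx;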
Now, you will be introduced to the distance operator, <->. It is basically an indexed version of
ST_Distance and is used with the ORDER BY clause. The distance operator returns the two-dimensional
distance between two geometries; queries built on it are called nearest-neighbor calculations.
In the following code, aliases for table names are introduced. Once the table is identified, the
alias is attached, and all functions can refer to the table by its alias ('ah', in this example).
The location included in ST_MakePoint belongs to the NYC Mayor's office. ST_MakePoint creates a
point geometry from specified longitude and latitude values. How close is his office to the
nearest New Construction projects? Not the most nail-biting question, but we can see the results
following the code cell. They may not be immediately useful since they are rendered in the units of the CRS – in
this case, degrees of longitude/latitude. You can transform the geometries into a projected CRS to get distances in
meters or other linear units.
Try different SRID values in ST_SetSRID to see the different options, recalling that the
spatial_ref_sys table in the public schema of a spatial database is listed in the pgAdmin browser:
Figure 3.11 reprojects the output reported in degrees using Albers projection (EPSG:3005) to output
the distances in meters.
ST_Transform allows you to change the projection as well, but might have a bigger impact than you
intend:
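Here is a sketch of the nearest-neighbor pattern being described. The coordinates approximate NYC City Hall, and the column names and filter value are assumptions rather than values confirmed from this dataset:

SELECT ah."Project Name",
       ST_Transform(ah.geom, 3005) <->
         ST_Transform(ST_SetSRID(ST_MakePoint(-74.0060, 40.7128), 4326), 3005) AS meters
FROM ch3."Affordable_Housing_Production_by_Building" ah
WHERE ah."Reporting Construction Type" = 'New Construction'  -- hypothetical column and value
ORDER BY meters
LIMIT 5;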
The following code describes a CROSS JOIN. During a typical JOIN, you can bring data together from
different tables and schemas, but you can also use subqueries. Remember that SQL treats a
SELECT statement's output as a table!
A CROSS JOIN is not a one-to-one link on a shared variable. It pairs every row of one input with every
row of the other – a many-to-many join – and here a spatial measure is computed for each pair: how far the buildings
paying prevailing wages are from buildings not paying prevailing wages. The Null values are
CONFIDENTIAL records that are not displayed in the data. In Figure 3.12, you can observe the addition
of the new ST_Distance column:
SELECT ah1.*,
ST_Distance(ah1.geom, ah2.geom)
FROM (SELECT *
FROM ch3."Affordable_Housing_Production_by_Building"
WHERE "prevailing wage status" = 'Prevailing Wage') ah1
CROSS JOIN (SELECT *
FROM ch3."Affordable_Housing_Production_by_Building"
WHERE "prevailing wage status" = 'Non Prevailing Wage') ah2
Figure 3.12 – Cross joins achieve many-to-many joins and add an additional column
In the DOHMH data, we are able to view the indoor complaint records from 311 service requests.
The data is updated daily, so it will differ from what you see here depending on when you download it.
Let’s explore the street pavement rating data from the NYC Department of Transportation. Ongoing
assessments of its streets are recorded on a 1 to 10 scale as follows:
Poor: a rating of 1 to 3
The final example, in Figure 3.13, features a lateral join that you can do within a single query. The
subquery is re-evaluated for every row in the streetrating table. The distance between each street rated POOR
and its nearest street rated FAIR is returned. LIMIT 1 restricts the subquery to one geom per match, creating a faster
query. Observations can then be made about whether certain neighborhoods or boroughs are likely to
have the most streets with POOR or FAIR ratings:
SELECT sr1.*,
ST_Distance(sr1.geom, sr2.geom)
FROM (SELECT *
FROM ch2."streetrating"
WHERE "rating_wor" = 'POOR') sr1
CROSS JOIN LATERAL ( SELECT geom
FROM ch2."streetrating"
WHERE "rating_wor" = 'FAIR'
ORDER BY sr1.geom <-> geom
LIMIT 1) sr2
Figure 3.13 – Street ratings of POOR and FAIR in NYC using lateral joins
When you hover over the streets in pgAdmin, there is information on the length of the street, whether
trucks or buses are allowed on the street, and the year of the rating. A data question might include a
comparison of the different ratings over a range of years to see how different boroughs manage their
street upgrades.
The dashboard will open and you will be able to enter your query. We are looking for results
retrieved from the Bronx.
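A minimal sketch of such a query; the lowercase borough column follows the import habit described earlier, and the exact casing of the value is an assumption:

SELECT *
FROM ch3."Affordable_Housing_Production_by_Building"
WHERE borough = 'Bronx';  -- the value may be stored differently (e.g., uppercase)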
Execute the query and it will add the layer to your project. You will want to rename it, or the default
query layer label will identify the new data. The query builder in QGIS is quite advanced. In the
upper-left Layers panel, you will see QueryLayer_Bronx. Right-click on the query layer and you will
see the Update SQL layer option. The next step is to zoom in to the layer and view the changes. The lower-
left browser panel in Figure 3.15 brings you directly to your PostgreSQL tables. Don't forget to
refresh when making changes so that the database updates as well.
The Identify Results panel docked on the right shows the data for the location I have clicked (visible
as a red dot) on the console after I select the Identify Features icon in the top menu ribbon. If you
recall, the Panels are selected by scrolling down in the View menu at the top of the console. A quick
assessment yields a building completed in 2021 where prevailing wages were not paid:
In later chapters, we will stray away from open NYC data but for now, these datasets are great
resources to demonstrate SQL spatial functions.
Summary
In this chapter, additional SQL expressions were presented with real-world, publicly available
datasets. You learned how to import CSV files into QGIS for use across your spatial databases.
Exploring different ways to join data on the fly and continue querying inside QGIS reinforced the
advantages of writing queries within the graphic interface.
In the next chapter, you will be introduced to 2020 census data and continue exploring how to join
tables and ask more complex questions with SQL.
4
Census data is notoriously complex but it is worth understanding a few straightforward tasks to
render the available files, as this is informative and critical to understanding the impacts of
urbanization, changing population demographics, and resource scarcity (to name just a few).
In the last chapter, you were introduced to a few spatial SQL methods such as ST_Distance and
ST_Intersects. In this chapter, we will build toward discovering patterns in our data. Traditional
statistical methods do not account for spatial relationships such as patterns and clusters, so we will
explore the data and extend our analysis spatially.
Relying on Census Reporter Data from the Los Angeles 2020 package, let’s see how to prepare our
data and address questions as we continue learning about SQL syntax.
Technical requirements
I invite you to find your own data if you are comfortable or access the data recommended in this
book’s GitHub repository at: https://github.com/PacktPublishing/Geospatial-Analysis-with-SQL.
NOTE
The files in this archive have been created from OpenStreetMap (OSM) data and are licensed under the Open Database
License (ODbL) 1.0.
Aggregate Household Income in the Past 12 Months (in 2020 inflation-adjusted Dollars) by tenure and mortgage status:
https://censusreporter.org/data/table/?
table=B25120&geo_ids=05000US06037,150|05000US06037&primary_geo_id=05000US06037
Below Poverty (census tract) County of Los Angeles Open Data (Below_Poverty_(census_tract).geojson):
https://github.com/PacktPublishing/Geospatial-Analysis-with-SQL
The best workflows also introduce you to troubleshooting guidelines and show you where to focus if
things don't go as planned:
1. Figure 4.1 shows the data when we select View/Edit Data while right-clicking on the table name in pgAdmin.
2. When we head over to QGIS to import the data (as a geojson file), we are greeted by this error message:
Error 6
Feature write errors:
Creation error for features from #-9223372036854775808 to #-9223372036854775808.
Provider errors was:
PostGIS error while adding features: ERROR: duplicate key value violates unique
constraint "tiger2020LA_block_pkey"
DETAIL: Key (id)=(athens) already exists.
Stopping after 100451 error(s)
Only 0 of 100451 features written.
When exploring the file uploaded from the browser (Figure 4.2) you may notice that there is no
unique identifier. The id column contains the non-unique entry of 'athens'. Not to worry – we can add
one pretty easily. I thought I would highlight this error because it is often the reason for upload
failure.
3. You are going to add the unique identifier that was originally missing from the dataset. I think it is important to share with you
how to fix this error, as the solutions floating around on the internet are far more complicated than they need to be.
Depending on how you have set up your menu, toolbars, and panels, the icon for the Field calculator
tool resembles a little abacus. For those of you who do not know what an abacus looks like or don’t
have it on your menu ribbon, access the Processing Toolbox window from the Panels menu. The
Field calculator tool is now visible in the drop-down menu as shown in Figure 4.3.
Figure 4.3 – The Field calculator tool in the Processing Toolbox window
4. When the window in Figure 4.4 opens, enter the name of the field – here I entered an abbreviation for unique ID (uid) under
Output field name, double-clicked on the row_number option in the center panel, and hit OK. This is also the window you can
access to update an existing field. Notice the option in the upper-right corner of the Rename field panel.
5. Once you select the new unique identifier in the Import vector layer window in Figure 4.5, your table will now have a unique
identifier column and you will be able to import the data. Again, it isn’t necessary to check Source SRID and Target SRID, but I
do so to remind myself that knowing the SRIDs can avoid problems down the road.
Figure 4.5 – Uploading data into QGIS and now available in pgAdmin
6. Switch over to pgAdmin and refresh the server and you will now have the unique identifier as a column heading, as in Figure 4.6.
Census tables have column names that are not particularly intuitive. There are a variety of options to change column headings and
change data types in QGIS but let’s look at how this is accomplished by an SQL query.
7. Our downloaded table depicts Hispanic and Latino populations, using Hispanic or Latino and NOT Hispanic or
Latino, by race. In the downloaded folder, there is a metadata.json file that resembles what you can see in Figure 4.7. This
is a key for what the column headings indicate. I paste this into the Scratch Pad inside pgAdmin, located on the right-hand side of
the Dashboard.
Adding a panel to the SQL query window (or clicking on the circular arrow in the upper-left corner)
creates a Scratch Pad in which you can save code snippets, or you can use it as a reference. I create a
hard return after each column name description to create a list as shown in Figure 4.8.
8. Using the Scratch Pad as a reference, insert the existing column name and then the name you would like in your table (see the sketch after this section). A few
things to note with this code example are as follows:
The schema is included since I have several different schemas for building the book chapters. If you have a single
schema or are using public, this isn't needed.
Although the metadata.json file lists the column titles with an uppercase P, quickly checking the Properties tab
when right-clicking on the name of the table will show you how it is actually rendered in pgAdmin: an
ALTER TABLE ... RENAME COLUMN statement will make the requested changes, and you can customize as many columns as you would like.
One of the most challenging aspects of working with census data is solved by the ability to create
column headings that are clear and descriptive.
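A sketch of such a rename; the original census column code used here is hypothetical:

ALTER TABLE ch4.hisplat_la
RENAME COLUMN "p0020002" TO hisplat_2020;  -- "p0020002" is a hypothetical original code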
The documentation for the decennial release can be found here: https://www2.census.gov/programs-
surveys/decennial/2020/technical-documentation/complete-tech-docs/summary-
file/2020Census_PL94_171Redistricting_NationalTechDoc.pdf.
The files accessed in this chapter (in bold) include P4: Hispanic or Latino, and not Hispanic or
Latino by Race for the Population 18 Years and Over; P5: Group Quarters Population by Major
Group Quarters Type; and H1: Occupancy Status:
P1: Race
P4: Hispanic or Latino, and not Hispanic or Latino by Race for the Population 18 Years and Over
P5: Group Quarters Population by Major Group Quarters Type (2020 only)
H1: Occupancy Status
The types of questions that might be of interest include demographic information such as ethnicity or
race and group quarters, allowing us to look at correctional facilities or access to residential treatment
centers and skilled nursing facilities. Occupancy status is also an important metric when assessing the
nature of neighborhoods. Are the residential properties owner-occupied, rentals, or vacant?
QGIS has other options for renaming the column headings. Two that are straightforward include
opening the Processing Toolbox window in Panels and opening Vector table. When you scroll
down, you will see Rename field as shown in Figure 4.9.
Layer Options also has the option to rename the field names, or you can run the SQL query directly
in QGIS.
The pgAdmin interface is useful for exploring the datasets but if you are using macOS, you will need
to upload the data into QGIS first or use terminal as described earlier. One note on file types – if the
option is available, GeoJSON or Open Geospatial Consortium (OGC) GeoPackage formats
(http://www.geopackage.org) are often the preferred formats.
Yes, shapefiles are still the leading format thanks to the dominance of ESRI products in the
marketplace and the wide support in existing software products, but they come with a host of
limitations. To name a few, limited support for attribute types such as arrays and images, character
limits for attribute names, and the frequent lack of a map projection definition are the ones that disrupt my
workflow most frequently.
Figure 4.10 narrows the query by replacing * with specific columns. First, we select the
renamed Vacant (% change) column from the table, which was itself renamed occupancy_la. It would be
difficult to detect a pattern without filtering to locations where the vacancy rate in 2020 was higher
than in 2010. You can select different metrics, but you will need to be certain that fields requiring
mathematical computation are integer or numeric types.
The output shows the areas with higher vacancy rates when compared to the 2010 census. Run it
without the WHERE clause. What do you notice?
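A sketch of the query; the renamed column names are assumptions:

SELECT geoid, vacant_2010, vacant_2020, geom  -- hypothetical renamed columns
FROM ch4.occupancy_la
WHERE vacant_2020 > vacant_2010;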
Figure 4.10 – The SQL query to select vacant residences in Los Angeles County
These scenarios are specifically for gaining familiarity with basic SQL before expanding the queries
to address multi-step data questions. In the example in Figure 4.11, we want to see census blocks
where there was a greater percentage change in Hispanic or Latino populations than white
populations in the same area. We can see areas where this was not the result. More data and a deeper
analysis are needed before we can think about this statistically.
Census data can be used to explore patterns of demographic shifts by examining natural changes, such
as births and deaths, or domestic migration that occurs due to employment opportunities and other causes
unique to the pandemic that are still being explored.
Enter the following query in the query editor. This code segment compares the population growth
from 2010 to 2020 of primarily Hispanic or Latino households to white alone.
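A sketch of the comparison; the 2010 and 2020 count columns are hypothetical names, and NULLIF guards against division by zero:

SELECT *
FROM ch4.hisplat_la
WHERE (hisplat_2020 - hisplat_2010)::numeric / NULLIF(hisplat_2010, 0)
    > (white_2020 - white_2010)::numeric / NULLIF(white_2010, 0);  -- hypothetical columns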
Figure 4.11 – Hispanic or Latino population percentage change from 2010 to 2020 (greater than the growth in the white population)
Better SQL queries begin with better questions. Group quarters data includes correctional facilities.
You might want to know where these facilities are located and perhaps where there are more than 500
inmates. These examples are snippets of bigger questions but demonstrate how the displayed data
reflect the filtering and specificity of your queries. Figure 4.12 does not demonstrate clear patterns
yet. What are a few additional datasets that you might consider including here?
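A sketch of such a filter, assuming a group quarters table and column along these lines:

SELECT *
FROM ch4.groupquarters_la  -- hypothetical table name
WHERE correctional_2020 > 500;  -- hypothetical renamed column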
Figure 4.12 – Location of correctional facilities with more than 500 inmates in the group quarters census data
When working with survey data and census data specifically, it is important to understand how the
data is generated and what the variables are expressing in your analyses. User guides such as the
2020 Survey of Income and Program Participation (https://www2.census.gov/programs-
surveys/sipp/tech-documentation/methodology/2020_SIPP_Users_Guide_OCT21.pdf) are important
to understand the data definitions and how the data was weighted.
Briefly, weights are calculated to determine the number of people that each surveyed participant
represents. Different members of a population may be sampled with different probabilities and
response rates, and this is adjusted for by weights. For example, if the upper quartile weight is 7,000, a
respondent with that weight represents 7,000 people in the population.
Let’s go back to the Field calculator tool in QGIS. The Field calculator tool is great for computing
a new vector layer (leaving the underlying data untouched). The dataset lists populations as totals, so
a simple equation was able to generate a new column showing the populations as percentages based
on the underlying total population of the tract, as in Figure 4.13.
Figure 4.13 – The Field calculator tool creating a new column in QGIS
When you scroll down to Fields and Values, you will be able to select the fields and add the
calculation in the Expression window. The new field will show up as a data layer in the canvas,
labeled Calculated.
Run the query and you will be able to view your input parameters and verify the successful execution
of the calculation:
"hisplat_2020" / "total_2020" * 100
QGIS version: 3.26.1-Buenos Aires
QGIS code revision: b609df9ed4
Qt version: 5.15.2
Python version: 3.9.5
GDAL version: 3.3.2
GEOS version: 3.9.1-CAPI-1.14.2
PROJ version: Rel. 8.1.1, September 1st, 2021
PDAL version: 2.3.0 (git-version: Release)
Algorithm started at: 2022-09-23T08:10:41
Algorithm 'Field calculator' starting…
Input parameters:
{ 'FIELD_LENGTH' : 3, 'FIELD_NAME' : '% Hispanic Latino', 'FIELD_PRECISION' : 3,
'FIELD_TYPE' : 0, 'FORMULA' : ' "hisplat_2020" / "total_2020" * 100', 'INPUT' :
'memory://MultiPolygon?
crs=EPSG:4326&field=id:string(-1,0)&field=name:string(-1,0)&field=original_id:string(-1,
0)&field=total_2020:integer(-1,0)&field=hisplat_2020:integer(-1,0)&field=nohisplat%28202
0%29:integer(-1,0)&field=Population%20of%20one%20race%20%28…
Execution completed in 0.16 seconds
Results:
{'OUTPUT': 'Calculated_0aea2a82_ba41_4c86_b7ad_8809977fd736'}
Loading resulting layers
Algorithm 'Field calculator' finished
We have already examined percentage increases in population totals and Figure 4.14 now shows us
the actual percentages in 2020 as reported by the census.
Figure 4.14 – Percentage of Hispanic and Latino populations in the 2020 census
Understanding that the work isn’t done once the map is loaded onto the canvas is crucial to
encouraging deeper inquiry and the application of spatial information. Descriptive statistics might ask
where people live in Los Angeles County, California – but inferential statistics will be able to
examine spatial relationships. We can make estimates on the probability that value x is more frequent
in a location than value y. Working with census data, we can ask how variables are related in data
over time in different locations. What might be influencing these findings?
Figure 4.15 looks at the census blocks in 2020. They include the total population (and housing unit
count) for each block. Let’s begin to ask questions to see how that influences the map. For example,
let’s select a population greater than 200. The query is a SELECT statement and should look familiar.
What happens if you change the values?
SELECT * FROM ch4."tiger20LA_block" WHERE "block_pop" > 200
You can run the query inside DB Manager in QGIS. Look for the second icon from the left, just to the
right of the refresh symbol, in the top menu of DB Manager, visible in Figure 4.15.
Figure 4.15 – Census blocks in Los Angeles County that contain more than 200 people
The census population data surveying how Hispanic and Latino populations have increased since the
2010 census can be viewed with a simple query. In QGIS, loading hisplat_la onto the canvas and
viewing the Layer Styling Panel allows you to create intervals to capture blocks with a larger
population of Hispanic and Latino communities as shown in Figure 4.14.
Running the following code in the SQL window in DB Manager, you can add the query layer alongside
the Fire_Hazard_Severity_Zones data. Select the Load as new layer option and click Load. Your new
labeled layer (you can choose what to call it) is available in the Layers panel.
Are there patterns to observe where populations are located in relation to fire hazard severity? Using
the Layers panel to classify the data by category, you can select the colors to display on the legend.
Fire_Hazard_Severity_Zones represent areas where mitigation strategies are in effect to reduce the
risk of fires due to fuel, terrain, weather, and other causes. Adding data layers that allow you to
compare risks across a geographical area and the nature of populations impacted by the risks and
mitigation strategies can add an important dimensionality to measure equity and opportunity, as well
as persistent barriers that might arise from the infrastructure built within and between communities.
First, select the layer you would like to style. Choosing the Categorized option and the column or
value to highlight is the next step. The Classify button will provide you with options, as you now can
view the number of categories in the dataset. In Figure 4.16, the symbol color was adjusted by
highlighting the legend value and scrolling through the color profile. Any categories you do not want
to highlight can be deleted by using the minus (-) symbol shown in red.
Figure 4.16 – Layer styling panel to create a color scheme for a Categorized feature class
Figure 4.17 shows the visualization, but the population layer blends in with the Moderate category of
Fire_Hazard_Severity_Zones. The ability to customize the features in the data layer is easily
addressed in QGIS. When combining layers, you are able to adjust the color selection of symbols
and select different symbols (especially when working with point geometries).
SELECT hisplat_la.*
FROM ch4.hisplat_la, ch5."Fire_Hazard_Severity_Zones"
WHERE ST_Intersects(hisplat_la.geom, ch5."Fire_Hazard_Severity_Zones".geom)
AND hisplat_la.hisplat_2020 > hisplat_la."nohisplat(2020)"
Figure 4.17 – Fire_Hazard_Severity_Zones and the location of populations where Hispanic and Latino totals are greater than non-Hispanic
The ST_Intersects function indicates where Hispanic and Latino populations (with increased totals
since the 2010 census) are located along fire zones. As shown in Figure 4.18, selecting a contrasting
color is more impactful. Intersections between the polygons are queried and visualized in QGIS.
Although they seem similar, several different methods provide quite different results in how the data
is queried.
Figure 4.18 – Where increasing populations of Hispanic and Latino people are living in relation to the fire hazard risk
Explore a few of the different topological relationship options, such as the following:
ST_Touches (the geometries share a boundary, but their interiors do not intersect)
ST_Overlaps (the geometries share some, but not all, interior space)
ST_Contains (one geometry lies completely within the other)
Another spatial method is ST_Centroid. The geometries of the geom column in your table are collected
and the geometric center of the mass is computed. When the geometry is a point or collection of
points, the arithmetic mean is calculated. LineString centroids use the weighted length of each
segment, and polygons compute centroids in terms of area.
Figure 4.19 has a red star indicating the centroid. We will explore additional measures of centrality in
the next chapter. Centroids are also used when calculating distance and zone attributes, and often
represent the polygon as a whole in advanced calculations:
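A sketch of collecting the geometries and computing their centroid; the land use table name is an assumption:

SELECT ST_AsText(ST_Centroid(ST_Collect(geom))) AS centroid
FROM ch4.landuse;  -- hypothetical OSM land use table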
Figure 4.19 – Calculating the centroid in the OSM land use table
Figure 4.19 – Calculating the centroid in the OSM land use table
Typical spatial statistical concepts are easiest to understand when applied to answer spatial queries.
You will notice patterns and clusters of location data. Can we predict the unknown from the known
information? Descriptive statistics can tell us where the fire zones are located in California. Do fires
occur more frequently in certain locations? As we begin to explore more advanced SQL queries, it is
important to determine whether we are analyzing one feature alone or whether the location is
influenced by other attribute values.
So far, we have been working with vector data, mostly discrete in nature, but spatially continuous
data (raster) is also common. In this chapter, you also worked with spatially continuous categorical
data. The land cover types are assigned to categories as observed in the legends of the generated
maps.
A good practice, to begin with, is to assume the null hypothesis – no pattern exists. Applying
statistical principles, we should then have a high level of confidence that what we are observing is
not due to chance. This is important because thematic maps can be manipulated. Examining the
center, dispersion, and directional trends in our data is an important element of spatial analysis.
In later chapters, the mean center (geographic distributions), the median center (relationships
between features), and the central feature will also be used to measure relationships between geographic
distributions, measures of central tendency, and localized events. We can also measure clusters of,
say, census tracts based on attribute values using hot spot analysis. While heat maps measure the
frequencies of events, hot spot analysis looks for statistically significant clustering within fixed distance bands.
To confirm the SRID of a table's geometry column, or to list every geometry column registered in the database, run the following:
SELECT Find_SRID('ch4','hisplat_la','geom');
SELECT * FROM geometry_columns;
Understanding how to query datasets with SQL is the first step in introducing prediction models to
the database. The execution of these properties will be expanded on as our journey continues.
Exploring the below_poverty_censustract data for Los Angeles County, I want to be able to isolate a
tract and explore its neighboring tracts. Location and distance might hold clues when exploring
marginalized communities or populations living below the poverty line.
I am looking for tracts with a high percentage of the population living below the poverty level.
Ordering the data by descending values will help me find the higher percentages:
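A sketch of the ordering query; the table and column names are assumptions about how this layer was imported:

SELECT *
FROM ch4.below_poverty_censustract
ORDER BY percent_below_poverty DESC;  -- hypothetical percentage column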
I identified the tract of interest by tract number and queried any tracts that intersect it. ST_Intersects
compares two geometries and returns true if they have any point in common. The following code
was written in the query editor in QGIS. Execute the code and name the query layer in the box in
Figure 4.21:
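A sketch of the self-join that returns every tract intersecting the tract of interest; the tract number and column name are hypothetical:

SELECT b.*
FROM ch4.below_poverty_censustract a
JOIN ch4.below_poverty_censustract b
  ON ST_Intersects(a.geom, b.geom)
WHERE a.tract = '980033';  -- hypothetical tract number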
Recall the single tract visible in Geometry Viewer in pgAdmin before opening QGIS. The
nature of the neighborhood is visible, and we can see that it is near the Long Beach Terminal. It is the
blue polygon in Figure 4.22.
How can we identify areas adjacent to or near our tract of interest? Learning about a few of the
spatial methods will be helpful in later chapters where we will be able to predict output values by
examining a wide variety of features of the locations surrounding our outcomes of interest. Visible in
Figure 4.23, we can see proximity to water, income level, and other features that we will examine
using regression experiments or time-series predictions in later chapters. For now, begin asking
questions. What type of data would you like to see?
Figure 4.23 – Intersecting polygons organized by Los Angeles County Board of Supervisor districts and high percentages of poverty
Many of the data layers we can explore are included in the census we began looking at earlier in the
chapter. In Figure 4.23, we can observe different tracts and explore how their poverty rates compare
to the initial tract we selected. Different county supervisors can be identified categorically in the
Layers panel, and revisiting the census tables created in QGIS and imported to pgAdmin would likely
be a useful next step.
Areas such as Los Angeles County are prone to fires, landslides, and a host of other natural disaster
risks, as shown in Figure 4.24.
Figure 4.24 – Landslide zones in Los Angeles County
The complexity of looking at the topography and how it influences the demographics of people and
infrastructure, especially in an area such as Los Angeles, California, becomes an opportunity to
engage with geospatial tools and ask bigger questions.
Figure 4.25 is a zoomed-in map of the landslide zones in and surrounding the urban area of Los
Angeles. Although we will be leaving the California area in the next chapter, the objective here has
been to introduce you to a few data portals where publicly available datasets are available for
exploration. The data is from the county of Los Angeles –
https://data.lacounty.gov/datasets/lacounty::landslide-zones/about – and depicts areas prone to
landslides. If you are like me, perhaps coastal communities mostly came to mind, and you didn’t
imagine the risk existing in urban areas as well.
Let’s revisit the fire zones as we close out the chapter. Before we can build prediction models, it is
important to gain a sense of the type of questions we might formulate.
In Figure 4.26, the building data from OSM is added as a layer to the canvas. It isn't fully
loaded, as the density of buildings would obscure the detail I am trying to highlight at this zoom
setting.
My question to you is do you think there is a pattern to where certain buildings are located in these
higher-risk zones? Can we examine demographic information from the census and hypothesize and
test these hypotheses?
I hope you have an appreciation of the powerful integration capabilities of working with SQL in both
the pgAdmin interface and with QGIS. Figure 4.26 highlights the utility of the visualization
interface, styling capabilities, and the ability to run SQL queries right in the canvas. By generating
query layers, you are able to visualize the results in real time. Become curious about your basemap
selection, the transparency of different layers, and the other customizable features.
Summary
In this chapter, you learned how to write a few spatial topological queries and add multiple layers of
data to your SQL queries. You were introduced to an important resource for understanding
demographic and economic data attributes associated with geographical locations – the 2020 US
Census. You successfully ran queries on data and generated your own maps for exploration.
In Chapter 5, Using SQL Functions – Spatial and Non-Spatial, you will continue to be introduced to
SQL spatial queries and begin to work with more advanced concepts, such as measuring distance and
evaluating buffers, in addition to building a simple prediction model.
Section 2: SQL for Spatial Analytics
In Part 2, readers will learn about spatial data types and how to abstract and encapsulate spatial
structures such as boundaries and dimensions. Finally, large datasets are prepared for local
analyses. Open source GIS, combined with plugins that expand QGIS functionality, have made QGIS
an important tool for analyzing spatial information.
I would argue that this is particularly relevant when learning how to write efficient SQL queries.
There is a natural cadence and syntax that differs from writing a code snippet in, say, Python or R.
The datasets we are exploring are from the Amazon rainforest in Brazil. Mining activities in the
Amazon rainforest are a known driver of deforestation and of the associated impact of toxic
pollution on surrounding communities. Mining also requires dense road networks for
transportation and infrastructure to support industrial operations, which impacts the growth of
surrounding vegetation, pollution run-off, and habitats. The areas of Brazil with the densest
vegetation and greatest risk of deforestation are often inhabited by indigenous populations, who are often
the people with the least influence to restrict access and defend protected lands.
By the end of this chapter, you will understand how to write SQL queries and explore both spatial
and non-spatial data.
Technical requirements
I invite you to find your own data if you are comfortable or access the data recommended in this
book’s GitHub repository at https://github.com/PacktPublishing/Geospatial-Analysis-with-SQL.
We will access the following data to explore the questions in this chapter:
Buildings in Brazil (places) and points of interest (POIs)
Roads
Preparing data
So far, you have learned how to download data directly from OpenStreetMap and begin asking data
questions. When you prefer to work with a smaller dataset, instead of downloading larger files
directly to your computer, you can also create datasets directly from the map canvas and write SQL
queries entirely in QGIS.
QGIS has a training dataset but it is not too complicated to substitute the default set with something
of more interest. We can select a region of interest from the canvas and explore different layers by
using a QGIS plugin known as QuickOSM.
There is also a plugin for uploading data into the QGIS canvas for exploring our region of interest.
2. Next, open a new project in QGIS and navigate to DB Manager. As a reminder, you can access it by selecting Layer from the top
menu bar. Scroll down in the Browser window and select OpenStreetMap. Alternatively, you can open the base map directly in
the Browser window:
4. The lower-left corner of the QGIS console has a search function, shown in Figure 5.2. Upon clicking inside the search box, you will
see Nominatim Geocoder, where you can enter locations of interest directly instead of searching for and zooming in on locations.
Select the Nominatim Geocoder and enter the location in the search box. If you type Palmas, the map will provide options and
then take you to the location you entered.
We are exploring Brazil and the Amazon rainforest. There are many areas of interest. For this
example, we are zooming in on Palmas, in the state of Tocantins, visible in Figure 5.2 as a small
red dot. I selected this area because there is a lot of activity, but feel free to look around. You will
most likely need to zoom in, as the plugin will time out over larger regions (you can reset this from
the default), and a smaller region keeps this introduction to the tool simple.
You should notice the little icon (a magnifying glass on a green square) in your toolbar, as visible in
Figure 5.1:
Figure 5.3 – Locating an area of interest in OpenStreetMap in QGIS
5. When you execute the QuickOSM tool from the toolbar or the Vector menu, you will see a Quick query tab with a
number of presets that you can select. Scroll through the options in Figure 5.4 by clicking the Preset dropdown.
Because we are specifically looking for roads and buildings, these are the queries I selected in Figure
5.4 and Figure 5.5. Select Canvas Extent so that the data is only downloaded from your specific area
of interest. Run the query; the data layer will be visible in the canvas. Be sure to also indicate the
geometry you are exploring.
6. Go to the Advanced section and check the appropriate boxes on the right. Multipolygons should also be selected. Update these
settings when your data is points or lines. You can also reset the Timeout option to accommodate larger files. In Figure 5.4, it
defaults to 25 but I routinely set it to 2000. Run the query:
Figure 5.4 – Creating the building data layers in QuickOSM
7. Next, we will create an additional data layer for roads, as shown in Figure 5.5, but remember to select Multilinestrings and Lines
instead of Multipolygons. Repeat this process for layers in individual projects that interest you! Here is a resource for
understanding OSM features: https://wiki.openstreetmap.org/wiki/Map_features. Expand the Advanced section to see the rest of
the available options:
Figure 5.5 – QuickOSM – selecting a highway key in QGIS
8. The layers you add using this method are temporary. The little rectangular indicator icon visible next to the place layer in
Figure 5.6 indicates a scratch layer:
Figure 5.6 – Scratch layer with an indicator icon
9. Clicking on this icon will bring you to the window shown in Figure 5.7. There are many options for saving the layer. I lean toward
GeoJSON, but ESRI Shapefile is popular as well. GeoPackage is a standards-based data format (https://www.geopackage.org):
Figure 5.7 – Saving a scratch layer in QGIS
10. Upon saving a scratch layer, you will need to enter a few details when prompted, including Layer name and other options; refer
to Figure 5.8. Add different layers until you have a decent variety. You can use the QGIS sample data (on GitHub) or build your
own while following the vector file types as a guide:
Figure 5.8 – Saving a scratch layer in QGIS
To save the data, we have created a folder called practice_data in Downloads and added all the Layer
instances into that folder, being careful to save them in the correct formats.
This was a brief introduction to how you can customize your data. I think it is important to have a
variety of ways to achieve something such as finding and bringing your data into QGIS or your SQL
database. This is the data we will use for the rest of this chapter.
Spatial concepts
Understanding how spatial statements are constructed and how to customize them for your queries is
an important task in learning new ways to interact with spatial datasets.
ST_Within
The synopsis of ST_Within includes the following:
ST_Within(A, B) returns true if geometry A is entirely within geometry B, and false otherwise.
You will need to be comfortable with SELECT statements. Consider the following code. I am interested
in buildings located within protected areas. I need to SELECT the buildings and would like to identify
them by their name and fid. GeoPackages often use fid as their unique identifier, although it isn't
strictly necessary here since OSM has its own identifiers for its features. I grabbed it as an example;
any unique identifier will work. I ran the following query in pgAdmin.
The name, fid, and geometry (point) are displayed in the output window once you run the code.
ST_AsText is returning the Well-Known Text (WKT) format of the geometry data to standardize the
format returned:
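A sketch of the query; the table names follow the layers saved earlier in this chapter, though yours may differ:

SELECT b.name, b.fid, ST_AsText(b.geom)
FROM ch5.places AS b, ch5.boundary_protected_area AS p
WHERE ST_Within(b.geom, p.geom);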
The tables I am accessing are listed in the FROM statement. Next, I added the condition to include only
the rows where the buildings are located within the protected areas of Figure 5.9. The geometry of
the building is only returned if it is entirely within the geometry of the protected area. Anything
spanning the geometry will not be included.
CREATE VIEW
We can also use CREATE VIEW. I often use this when writing queries in pgAdmin while planning to visualize
them in QGIS. A view is a stored query: you can locate it in the schema
where it was created or in your public schema folder. The format requires you to indicate a view
name, followed by the reserved SQL keyword AS and then the query itself, which indicates
the columns to include. Run the following code:
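A sketch of the view, reusing the ST_Within query from the previous section; the view name is my own choice:

CREATE VIEW ch5.buildings_in_protected AS
SELECT b.fid, b.name, b.geom
FROM ch5.places AS b, ch5.boundary_protected_area AS p
WHERE ST_Within(b.geom, p.geom);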
The output is shown in Figure 5.10. Select the downward arrow and download the output to your
computer. The output is formatted as a CSV and you can upload it into QGIS by using the Data
Source Manager area or selecting Add Layer from the Layer menu. Simply drag the layer onto
your canvas and it will be added to the map:
Figure 5.10 – CREATE VIEW displaying buildings located within protected areas
The Geometry Viewer area in pgAdmin is convenient for looking at maps but only if you change the
SRID value. In Figure 5.11, I have selected Geometry Viewer just to locate the data, but let’s head
over to QGIS for a better view without having to write more code to change the projection:
Returning to our canvas with the layers from our example, be sure to arrange the layers so that the
points and roads, for example, are on the top layers and not buried under polygons.
We can’t gather any insights with all the layers present, as shown in Figure 5.13. So, now, it is time
to consider a few data questions. After adding the VIEW property we created previously to the layers
canvas, right-click and select Zoom:
Figure 5.13 – All layers applied to the canvas
NOTE
Increasing the size of the point in Layer Styling increases visibility.
After zooming into the layer, as shown in Figure 5.14, you will now see the buildings within the
protected areas, which are represented by red circles:
Figure 5.14 – Zooming into the layer to display CREATE VIEW
In Figure 5.15, you can see the boundary of the protected areas (beige) and the buildings located
within that boundary:
Figure 5.15 – Protected areas (beige) are shown with the buildings located within the boundaries
The water layer is styled with polygons colored blue for reference. Areas that appear brown are not
necessarily the result of deforestation in a traditional sense; forests are also used for agriculture and
cattle. This is not without issue or consequence, however, as you will see when we explore land use
and road infrastructure.
Now that you can select the tables and data to display on the canvas, let’s learn how to create a buffer.
Spatial-based SQL queries for items such as containment, intersection, distances, and buffers are the
foundation of spatial analysis.
ST_Buffer
We can create buffers around specific areas on our map. The buffer radius is reported in degrees or meters,
depending on your SRID. The buffer is shown highlighted in white in Figure 5.16.
Now, we must select the unique ID and then the parameters required by the function:
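A sketch of the buffer; with SRID 4326, the radius is in degrees, and the column name and value follow the protection discussion below:

SELECT fid, ST_Buffer(geom, 0.1) AS geom  -- 0.1 degrees is an illustrative radius
FROM ch5.boundary_protected_area
WHERE protection = 'reserva biológica';  -- assuming the column is named protection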
When asking data questions about specific areas on our map, we often need to select specific values
in a column. Here, I have specifically requested the protection column (Figure 5.16) but
need to know the identity of the values of interest – in this case, reserva biológica.
In addition to clicking Layer Properties, it is possible to access information about our data tables,
as seen in Figure 5.17. Highlight your table in pgAdmin and select SQL at the top of the console.
Now you can view a vertical list of column headings with information that is often helpful if you
aren't seeking the full detail of an attribute table.
Asking data questions about datasets of increasing complexity often requires
user-defined functions. Now, we will write our own!
3. Insert the RETURNS TABLE clause, followed by the column names and data types. We pass a text parameter, as the actual value of the variable
is supplied when the function is run.
4. Now, set the LANGUAGE property to sql; PostgreSQL is not limited to a single procedural language.
5. The actual query is enclosed inside $$ query $$. These delimiters are called dollar-quoted string constants ($$).
We need to create the function (as shown in the following code) and then pass to the function what
we want it to do.
First, we want to select the boundary_protected_area rows where highways intersect. We are passing
text (x text), which counts as one variable. The text is entered when we run the function we
created; in our example, the text is replaced with highway:
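The following is a sketch under assumptions: the roads table, its fclass column, and the join condition are my own guesses at the structure being described, not the book's exact function body:

CREATE OR REPLACE FUNCTION ch5.getprotecteda(x text)
RETURNS TABLE (osm_id text, geom geometry)
LANGUAGE sql
AS $$
  SELECT r.osm_id::text, r.geom
  FROM ch5.roads AS r
  JOIN ch5.boundary_protected_area AS p
    ON ST_Intersects(r.geom, p.geom)  -- keep only roads crossing protected areas
  WHERE r.fclass = x;  -- hypothetical column matched against the passed text
$$;

-- Usage: pass the value as text when running the function
SELECT * FROM ch5.getprotecteda('highway');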
We can now view the region of interest and observe the dense network of highways running through
the protected area, as shown in Figure 5.19:
Figure 5.19 – Protected region and proximity to highway networks
Returning to pgAdmin, you can also select a particular boundary and explore its specific relationships.
We will randomly select osm_id '10160475'. You can see the boundaries in the Geometry Viewer
area in Figure 5.20:
Running the same code in QGIS provides the map scene shown in Figure 5.21. To update a layer
that has already been loaded onto the canvas, right-click on the query layer and select
Update SQL Layer. You have a wide range of base maps to select from in QGIS:
Figure 5.21 – Viewing the boundaries in the getprotecteda function
The query runs under the hood in QGIS when adding the updated layer to the canvas, as shown in the
following SQL statement generated by QGIS:
Now, we can see the updated layer for the selected OSM-ID in the region with the associated
intersection of the network of the highway (Figure 5.22):
Figure 5.22 – Updated SQL layer showing the region of interest with intersecting highways
To explore the mining areas in Brazil, we can use the ST_Area function. In pgAdmin, we can observe
these results in Geometry Viewer:
SELECT
geom,
ST_Area(geom)/10000 AS hectares
FROM ch5.mining_polygons
WHERE ch5.mining_polygons.country_name = 'Brazil'
Becoming familiar with this function allows you to explore the area occupied by indigenous
populations shown in Figure 5.24 by running the following code:
SELECT
nm_uf,
ST_Area(geom)/10000 AS hectares
FROM ch5."br_demog_indig_2019"
ORDER BY hectares DESC
LIMIT 25;
Figure 5.24 – Areas occupied by indigenous populations
The distance is provided in the units defined by the spatial reference system of the geometries:
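As a sketch, a distance query along these lines would produce output in the units of the geometries' SRID; pairing these two tables is my assumption:

SELECT i.nm_uf,
       ST_Distance(i.geom, m.geom) AS distance
FROM ch5."br_demog_indig_2019" AS i
CROSS JOIN ch5.mining_polygons AS m
WHERE m.country_name = 'Brazil'
LIMIT 10;  -- a minimal number of rows to expedite the runtime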
I tend to run queries in pgAdmin and save them there. In Figure 5.25, we can see a partial rendering of
the data. I only requested a minimum number of rows to expedite the runtime since I will head over
to QGIS to see the visualization:
Figure 5.25 – Exploring the data in Geometry Viewer in pgAdmin
When observing the data, I noticed that there were several classifications of roads. If you right-click
on the road layer, you will see the option to filter. We can also do this with SQL code, but it is only a
few clicks away within the QGIS console, as shown in Figure 5.26:
1. Select the Field property you would like to filter. In this example, this will be status.
2. If you are not familiar with the variable, double-click it; options will appear in the Values window on the right.
5. Select Test to see if your expression is valid. You can see the results of my query and that it was successful, returning 613 rows.
6. Select OK; you will see the filter icon next to the query in your Layer panel:
Figure 5.26 – Filtering the roads field so that it only includes official roads
The roads we filtered are now shown in pink in Figure 5.27. We can now add the layers to the canvas
and observe the results of the query. You can remove the fill in Layer Styling and retain the outline of
the polygon. We are observing the proximity of roads to the indigenous territories in Figure 5.27, but
the addition of the deforested areas is where we can begin to see the impact of the roads on the
surrounding areas, as observed by the shaded area in Figure 5.28:
Figure 5.27 – Roads and their relationship with indigenous territories in green polygons
Upon observing the impact of road infrastructure on deforestation in Figure 5.28, we can
see that it is related to a decrease in vegetation. We will explore other ways to measure the impact in
the next chapter as we expand our skills and ask bigger questions:
The ability to bring different layers into our queries highlights the power of combining QGIS and
SQL into our emerging data questions.
Summary
In this chapter, we prepared data so that it could be imported and analyzed. You were introduced to
new spatial functions so that you can estimate areas of polygons and define relationships between and
within polygons. You even learned how to create your own functions. These are handy to know
especially when you might return to datasets that are frequently updated.
In the next chapter, we will expand on building SQL queries in the QGIS graphical query builder that
you were introduced to when filtering the roads data. SQL is indeed a powerful tool for creating
efficient workflows and optimizing the power of working with GIS tools such as QGIS.
6
In this chapter, we will continue exploring the tools and concepts available for learning about SQL
and applying this knowledge to geospatial analysis. Writing SQL queries into QGIS allows you to
visualize output directly on a map. pgAdmin also allows a quick look at a snapshot of your data using
Geometry Viewer but without a lot of customization.
Discovering quick and efficient tools for integrating database queries with the SQL query builder
builds on the query language you are already scripting.
Technical requirements
The data for the exercises in this chapter can be found on GitHub at
https://github.com/PacktPublishing/Geospatial-Analysis-with-SQL
Census_Designated_places_2020
As we learned about geometric and geographic spatial queries in defining coordinates and locations,
we investigated points, lines, and areas of polygons. When referring to a topological query, we are
mainly interested in where objects are located and where they might intersect, in addition to specific
attributes.
Topological SQL queries investigate spatial relations between tables or entities within a database.
Topological criteria focus on the arrangement of entities in space. These arrangements are preserved
regardless of projection. We have already explored spatial relationships such as distance,
intersections, and geometric relationships, for example. We can calculate characteristics and
relationships as well as create new objects.
The characteristics of topological queries are qualitative and examined over two-dimensional space,
essentially exploring region boundaries to define a relation or dependency. In practice, you can think
of these queries as asking questions about adjacency, connectivity, inclusion, and intersection. You
may have heard of non-topological data being referred to as spaghetti data. In fact, ESRI shapefiles
are representative of this type of data. Shapefiles store vector data and build connections on top of the
feature class. Topological data, by contrast, stores relationships – nodes, shared boundaries, and
adjacency between points, lines, and polygons – to avoid overlapping polygons or lines that lack
connectivity to other lines.
Conditional queries
We evaluate these relations with boolean test queries (true or false), numerical distance queries, or
action queries by testing relationships based on geometries. These are all conditional queries. For
example, any of the functions listed can be inserted into this sample query structure. We are looking
at the relationship between two different geometries in a single table. We will expand this to multiple
tables, but as an introduction, this is the framework:
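A minimal sketch of the framework follows; the table and column names are placeholders, and ST_Intersects stands in for any of the predicates listed next:
SELECT a.id, b.id
FROM sample_table AS a, sample_table AS b
WHERE ST_Intersects(a.geom, b.geom)
AND a.id <> b.id;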
The following boolean functions evaluate a spatial relationship and return true or false based on the
geometries you query:
FUNCTIONS:
Equals()
Disjoint()
Touches()
Within()
Overlaps()
Crosses()
Intersects()
Contains()
Relate()
You will create a QGIS workflow for exploring datasets, but first, let’s get familiar with the interface.
2. When you click on the Layer Properties option, the window opens, as shown in Figure 6.2. Select Source and the page will
display the following options:
Settings: Displays the layer name and data source encoding. The default is UTF-8.
Assigned Coordinate Reference System (CRS): This operates as a fix in case the wrong projection is displayed. If
you need to change the projection beyond this instance of the project, you will need to go to the Processing Toolbox
option and select the Reproject layer.
Geometry: These options are typically defined already but if not, you can modify them here.
3. Next, select the Provider Feature Filter option, and the Query Builder button will appear in the lower-right corner of the
window. Click on Query Builder and you will see a window displayed, as in Figure 6.2. The fields that populate will be from the
layer you selected in the Source window:
The Airport_Influence_Area data layer is from the Open Data portal for Los Angeles County. The
polygons denote an airport influence area within 2 miles of an airport, representing noise, vibration,
odors, or other inconveniences unique to the proximity of an airport. According to a state's Division
of Aeronautics, the airport influence area is determined by an airport land use commission and varies
for each airport.
In this quick introductory example, you can choose AIRPORT_NAME and either see a sample of the
values in the right window or select All to see all values. Alternatively, if you are interested in a
specific airport, you can search for it in the Search window. In addition, simply selecting Fields and
hitting Test will show you how many rows are in your data. I often start here, write the query, and
then look to see whether the number of rows provided after the query makes sense.
Figure 6.3 includes a simple query where we are selecting AIRPORT_NAME and the value as Burbank.
By clicking on Fields and adding an operator and a value, the canvas will update with your filter. The
Test button allows you to run the query to see whether it is valid:
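In the query builder, the filter expression looks like the following sketch (the field name comes from the layer):
"AIRPORT_NAME" = 'Burbank'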
We expect a single row since we are requesting a single airport. Other fields from different layers will
have multiple value options and will return the valid rows for your query. For example, let’s look at
Income_per_Capita_(Census_tract) in Los Angeles County. In Figure 6.4, we can see the styling
layer by graduated equal count intervals:
Figure 6.4 – Income per capita in census tracts in Los Angeles County (QGIS)
Returning to the query builder, we need to request an actual value. We can request values using
operators.
How about census tracts where income per capita is less than $31,602?
The operators are = (equal), < (less than), > (greater than), <= (less than or equal to), >= (greater
than or equal to), != (not equal), and % (wildcard).
The remaining reserved words are self-explanatory, with the exception of ILIKE, which indicates a
case-insensitive match. The wildcard options are useful if you are looking for values that begin or
end with a letter: returning all values that begin with b would use b%, while the wildcard moves to
the first position, %b, to match values that end with b.
Using the query builder, we select the AIRPORT field and all values that begin with b, regardless of
case (ILIKE). The clause returns two rows, as displayed in Figure 6.5:
Notice that in Figure 6.6, without the ILIKE operator, the clause returns 0 rows because the stored
values are capitalized and we requested values with a lowercase b:
Figure 6.6 – Selecting all airports beginning with the letter b using the LIKE operator in QGIS
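As a sketch, the two clauses differ only in the operator; the field name is assumed from the layer:
"AIRPORT" ILIKE 'b%' -- case-insensitive match returns the two rows
"AIRPORT" LIKE 'b%' -- case-sensitive match returns 0 rows here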
These simple queries are a nice introduction to how queries are structured. Conditional queries have a
filter expression that evaluates a relationship and returns the output. These examples are looking at a
single table and the data values that are included. In the next section, we are going to begin working
across tables.
Aggregation and sorting queries
These quick queries are useful, but often, you will need to manage more complex queries or access
more than one table in your database.
We will start with something more interesting than locating an airport with b in its name. Let's use
the DB Manager SQL window to write a simple SQL query that selects all the rows from our table.
You can indicate an alias if you want to simplify the code, SELECT * FROM
ch6."Census_Designated_Places_2020" c, and then, any time you refer to the table, it can simply be
c.geom or c.name, for example. I will adopt this practice as we move to complex queries, but I find it
can be confusing in teaching environments, so I will continue to write out the names of the tables;
you can certainly explore using the alias of your choosing.
The table will load with the results of the query, as shown in Figure 6.7:
Selecting all columns in your dataset is an easy way to look over the data and its types. Let's add a
WHERE clause and find the El Segundo location. I grabbed the city randomly, and you can do the same
or pick a different location. El Segundo is the epicenter of Los Angeles sports teams, a future
location for the 2028 Olympics, and home of the Los Angeles Times, to name a few bits of history:
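Here is a sketch of the query; the place-name column is an assumption based on the layer's attribute table:
SELECT *
FROM ch6."Census_Designated_Places_2020"
WHERE name = 'El Segundo';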
Figure 6.8 shows the city and area of El Segundo. Now, let’s add it to our map:
Scroll down the page and check the Load as new layer checkbox. You also have the opportunity to
label the new layer in the Layer name (prefix) panel, shown in Figure 6.9. Load the data and wait
for QueryLayer to become visible in the Layer panel. Make sure you move it to the top of the layers
so that it won’t be hidden. If you don’t choose a name Query layer will be the default:
Figure 6.9 – Loading the new layer onto the canvas in QGIS
To see the updated canvas, right-click on Query Layer and select Update SQL Layer…, as shown in
Figure 6.10:
Figure 6.11 shows the updated SQL layer, creating an additional column that you can customize or
leave with the default _uid_ value, an abbreviation for unique identifier:
Figure 6.11 – Updating the layer with the updated SQL query
The green polygon in Figure 6.12 is the El Segundo census place, part of unincorporated Los
Angeles County. The income layer shows the census tracts with the lowest income per capita in light
green, surrounded by the tracts with the highest income levels in a darker red color:
Before we build a more complex story, let’s become familiar with the SQL query builder. It’s similar
to the query builder we have been using, but now we have the flexibility of writing SQL queries and
customizing our data questions.
I recommend adding DB Manager to your toolbar if you have not done so already. You can find it in
the View dropdown in the menu bar. Scroll down to Toolbars and click on Database Toolbar—DB
Manager will open.
Figure 6.13 shows us the different areas where we can select data we are interested in from the
relevant tables, write a WHERE query, group our results by whichever variable we choose, and load the
results as a new layer in our layer panel.
There are a host of other options on the lower-right panel that we will explore in more detail in
Chapter 7, Exploring PostGIS for Geographic Analysis:
Next, we need to select a table. To keep it simple for now, pick the same table from the last example.
Select Census_Designated_Places_2020 from the list of tables. One advantage of working in the SQL
Query Builder is that we are not limited to data from a single table or schema. Figure 6.14 is a
snippet of the tables you will see when you scroll down (depending on how many tables you have
created in your PostgreSQL database):
Figure 6.14 – Clicking on the Tables drop-down menu to view the data
The best part of working in the SQL Query Builder is the ability to select only columns you are
interested in. This will be a lifesaver for census tables and some of the more complex datasets in the
final chapters. When you select Columns and add the WHERE clause, the SQL window will update
with only the requested columns, as shown in the following screenshot:
Notice that the query results now display the three columns you specified, as shown in Figure 6.16.
This renders output identical to what we saw in Figure 6.15:
Figure 6.16 – The filtered table from the SQL query builder: QGIS
Now, let’s filter by a numeric value and only see census tracts that meet the criteria we select. A
simple one-variable, one-table query can be written in the query builder. You can choose the entire
set of values or a sample, but to see the full range of values, I decided to scroll through and pick a
random value. By testing the query, you can see that there will be enough rows returned to make it
interesting.
SELECT ch6."Census_Designated_Places_2020".*
FROM ch6."Census_Designated_Places_2020", ch6."Below_Poverty__census_tract_"
WHERE ST_Within(ch6."Census_Designated_Places_2020".geom, ch6."Below_Poverty__census_tract_".geom);
Or we can use the query builder, as shown in Figure 6.17. Using the SQL query builder, you can
easily filter the data to only include income below a certain value:
Figure 6.17 – Filtering data where census tracts have an income per capita below $31,602
Now, when we load the layer, we can update and view the census tracts with income per capita below
$31,602, as shown in Figure 6.18:
Figure 6.18 – Querying income levels by census tract in Los Angeles County
Census data is notably complicated to locate and label properly. SQL queries are well suited to
making the process run more smoothly.
The area highlighted in Figure 6.19 is the percent change in population from the 2010 census to the
2020 census. In addition, I captured the total population and the total Hispanic or Latino population
for 2020.
Renaming the columns in census tables is quite a task. The codes identify data products available
from the census. For example, B in the column heading in Figure 6.20 indicates a base or detailed
table of detailed estimates. The next two characters indicate a table topic, commuting and traveling to
work (in this case, 08), and the next three characters are unique identifiers for a topic within a topic,
such as poverty status. When working on local projects, it is possible to create aliases in the attribute
table window.
Downloading the ACS 2021 1-year data from Census Reporter (available in this book's GitHub
repository at https://github.com/PacktPublishing/Geospatial-Analysis-with-SQL), you can rename
columns, as shown in Figure 6.20. The metadata.json file within the folder provides information about the table
names. To keep the dataset manageable, I selected the column reporting below 100 percent poverty.
NOTE
This data was taken from the U.S. Census Bureau (2021). Means of Transportation to Work by Poverty Status in the Past
12 Months, American Community Survey 1-Year Estimates. Retrieved from https://censusreporter.org.
Briefly, the federal poverty level is determined each year by the Department of Health and Human
Services (DHHS) to establish a minimum standard income amount needed to provide for a family.
When a family earns less than this amount, the family becomes eligible for certain benefits to assist
in meeting basic needs. Higher thresholds are provided for larger households. In the data provided by
the census, you may notice higher percentages such as 150 and 200. These levels are calculated by
dividing income by the poverty guideline for the year and multiplying by 100; although households at
these levels are above the minimum thresholds, they may still be eligible for additional resources or subsidies.
Scroll down to the next button in the vertical menu, and you will be able to update the alias on any
field. I use this during the exploratory phase so that I can keep track of the different column headings,
especially when they are coded and can be difficult to decipher.
Figure 6.21 shows the window for adding an alias in the General settings. Aliases will not be
transferred to the source data, but they are also not restricted by character-length limits, so they can
be useful within a project or project layer:
Figure 6.21 – The attributes form for adding aliases in the Layer Properties window
Now, when you access the attribute table, you have the updated columns with the aliases, as shown in
Figure 6.22. I recommend first dropping any fields not relevant to your analyses. In the Processing
Toolbox (View | Panels) setting, select Vector Table and Drop field(s):
I noticed that when I imported the table into the schema, the aliases were not retained. Figure 6.23
has the source data columns. Let’s explore how to make these names permanent:
Whenever you want to edit field names, or drop or add any field(s) to a table, return to the
Processing Toolbox Panel and search for Vector table.
Figure 6.24 displays the window for renaming fields in your table. Enter a folder name under
Renamed if you want a permanent layer. Notice the default, [Create temporary layer]. In the
absence of a saved layer, you will notice the icon to the right of your layers in the panel in Figure
6.24. The scratch layers will not be saved with your project. You will be prompted to save them if
you close the project without making them permanent:
Figure 6.24 – Renaming table fields and saving them as permanent files
Let’s end the chapter with a few simple queries. Run the following code to practice using an alias to
simplify the code. Notice that after the FROM clause, we can use p to represent the table we are
referencing. The WHERE clause is now able to use this to indicate a column in the table:
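A minimal sketch follows; the filter column is an assumption:
SELECT p.*
FROM ch6."Below_Poverty__census_tract_" AS p
WHERE p.geoid IS NOT NULL;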
We are selecting all the columns from the below-poverty census tract table, which contains all of the
tracts with income below the federal poverty level from the ACS 2021. This is 1-year data reporting
the means of transportation to work by poverty status in the past 12 months.
SQL is a powerful tool for exploring granularity in larger datasets. I notice that often when looking at
demographic data, the questions are limited to population sizes, race, and income, passing over
factors such as built infrastructure, environmental issues, and community barriers.
The map shown in Figure 6.25 displays census tracts with over 48% of households below 200
percent of the poverty threshold:
The following code also includes aliases for the Hispanic and Latino population (h) and poverty level
(v). The WHERE clause can identify census tracts based on multiple calculations. We can also run this in
pgAdmin to observe the data in Figure 6.26:
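Here is a sketch of the pattern; the table names, join key, and threshold are assumptions for illustration:
SELECT h.geom, h.name, v.below_poverty
FROM ch6."Hispanic_Latino_2020" AS h
JOIN ch6."Below_Poverty__census_tract_" AS v
ON h.geoid = v.geoid
WHERE v.below_poverty > 48
AND h.total_population > 1000;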
Try updating the field names in this table as well. When hovering over a census tract, information
about the population can readily be observed:
Figure 6.26 – Selecting census tracts under the federal poverty level
There are many characteristics of communities that can influence health outcomes and highlight
inequities. In the final screenshot of the chapter, let’s think about access to hospitals, with hospitals
identified by a red plus sign inside a circle in Figure 6.27.
What can we learn about distances traveled or time to the nearest facilities?
Figure 6.27 – The number of hospitals in a community can impact patient outcomes
There are different graphical query builders, and you are now familiar with two popular open source
alternatives. You are ready to build your own customized workflows. For example, I tend to import
datasets into QGIS and link them to pgAdmin since I work in macOS. Although I rely on the
autocompletion and ease of writing SQL queries in QGIS, during the exploration of a new dataset I
write and run most of my queries in pgAdmin.
There is no right way or wrong way, only the way that makes you feel most efficient and productive.
Summary
In this chapter, we continued to learn about SQL queries as a way to filter datasets and generate
focused responses to questions. The goal for writing SQL syntax is to feel conversational—building
on the ability to ask specific questions. Understanding location and being able to dig a little deeper to
see how “where” things are happening can often shed insight on “why” as well.
Writing conditional and aggregation data queries integrated nicely into specific workflows that you
can expand as we dig deeper into the platforms that host the SQL consoles and windows.
In the next chapter, we will dig a little deeper into PostgreSQL by exploring pgAdmin in more detail
and how to integrate PostGIS with QGIS.
7
As a brief refresher, when referring to PostGIS, we mean the open source Postgres spatial extension
for working with spatial databases. The main reason I rely on PostgreSQL is its natural integration
with spatial information at no additional cost. This is the environment where we can execute location
queries in SQL. As you will see in the next chapter, we can integrate expanded geoprocessing
functions with QGIS as a sophisticated graphical interface.
In this chapter, let's dig a little deeper into PostGIS and augment your earlier introduction to creating
a schema that organizes data into logical buckets. When you want to share data across projects, use
the public schema. Your tables are within the schemas with the spatial information stored in the geom
column.
In this chapter, we will cover the following topics:
Importing data into QGIS Database Manager for exploration and analysis in pgAdmin
Continuing to develop SQL queries and visualize them in the pgAdmin Geometry Viewer
Technical requirements
The data for the exercises in this chapter can be found on GitHub at
https://github.com/PacktPublishing/Geospatial-Analysis-with-SQL.
puerto-rico-190101-free (2019)
puerto-rico-180101-free (2018)
Configuring and authenticating PostGIS in QGIS
Although the focus of this chapter is PostGIS and the pgAdmin interface, as a macOS user, I rely on
the integration of both to upload data to PostgreSQL. When initially downloading PostGIS, you are
required to set a password. This is how you can link databases to QGIS and write spatial queries
across platforms. Thankfully, if you do happen to forget or lose it, all is not lost.
In QGIS, head over to Settings and scroll down to Options. On the left-hand side vertical menu, go
to Authentication and select Configurations, as shown in Figure 7.1. Here, click on Utilities, erase
the authentication database, and recreate your configurations:
Now that PostGIS has been enabled across platforms, you can work with spatially enabled datasets.
To create the dataset where we will work on structured queries with PostGIS, let's return to the
geofabrik website, https://download.geofabrik.de/central-america.html. If you want to go directly to
Puerto Rico without navigating the site, you can go to
http://download.geofabrik.de/north-america/us/puerto-rico.html, but exploring the site will help you
become familiar with it for future projects. The dataset for Puerto Rico is located at the Central
America link in the paragraph preceding the list of Central American countries. I am a big fan of
geofabrik because not only do they provide a choice of formats for working with OSM files, but the
data is updated daily.
OSM data is natively in XML format and the software you select will export the data into a format of
your choosing. The shapefiles will work but require defined table structures.
First, download the protocol buffer binary format (.pbf) due to its compatibility with OSM and its
smaller download size. Puerto Rico has a separate listing, and I like this format because features are
assigned to files by their geometry types as they are extracted from the main OSM database. I have
renamed the files by location, year, and data type.
The Browser panel in QGIS, as shown in Figure 7.3, will display the downloaded files. Notice the
osm.pbf file and the file folders that contain shapefiles. Depending on the data you want to use, open
the files; you will see the option to load them onto the canvas in QGIS. My current workflow is to
simply drag the files onto the QGIS canvas. Once they are in the Layers Panel area, I can import
them into PostgreSQL using Database Manager:
By clicking on the osm.pbf file in Figure 7.4, you will see that files are arranged by geometric type. I
think this is cleaner and easier to organize but feel free to use the shapefile format shown in Figure
7.5. Because each feature is listed separately, it makes the SQL queries a bit busier, but your mileage
may vary:
Figure 7.4 – The osm.pbf files listed by geometric type in the QGIS Browser panel
By contrast, the shapefiles are listed by category, not geometry, as displayed in Figure 7.5, although
a_free is polygon data and free is point data. The difference between them is whether a building is
located by a point or whether you can visualize its perimeter boundaries as a polygon. In OSM,
point-type features are nodes. When you connect two or more nodes, you create a way, which is the
line connecting the nodes. Groups of nodes and ways can be organized together, representing, for
example, a multipolygon. In OSM, these groupings are referred to as relations:
Figure 7.5 – The format of shapefiles within the OSM database files
The osm.pbf files also contain a Fields column. In Figure 7.6, you can see the dropdown for the
column headings in the attribute table:
Now that the data has been downloaded, let’s connect QGIS to PostGIS so that we can explore
PostGIS in pgAdmin and then also in QGIS for Chapter 8, Integrating SQL with QGIS. By scrolling
down the Browser panel in QGIS, you can right-click and create a connection that will be visible in
pgAdmin.
Now, let’s look at how to get data into PostGIS.
Selecting the Test Connection button will confirm that the connection is now valid:
Feel free to rename your tables by right-clicking and entering the preferred name. Now that your data
has been imported into the connection you created, which is SQL in my case, simply right-clicking,
as seen in Figure 7.8, reveals the options for Refresh (to update your data) and executing SQL
queries within QGIS. We will return to the query editor in Chapter 8, Integrating SQL with QGIS:
In the next section, we will return to pgAdmin to interact with Postgres databases and specifically
PostGIS and spatial functions. Although I won’t be covering the topic of database administration, it is
managed within pgAdmin and there are additional resources available in the documentation at
https://www.pgAdmin.org/docs/pgAdmin/latest/getting_started.html.
Now that you have accessed DB Manager to import datasets and connect to pgAdmin, you are ready
to explore and analyze your data in pgAdmin.
I suggest that you explore the options across the menu bar, depending on your role and interests.
Figure 7.10 highlights the Preferences options and the customization options for Graph Visualizer.
Row Limit, as described, is a choice between the speed of the execution of your query and the
display of a graphic, chart, or map:
With that, you have imported and uploaded your data to pgAdmin. We are now ready to execute SQL
queries and begin building a data story.
For our query, we will select all of the columns in the table with the multipolygons by including * (see
the following SQL statement). Creating an alias, mp, to refer to the table, simplifies the code for the
following queries. We only want the landuse column, where the variable is equal to residential.
Following major weather events, the loss of residential properties is typically profound and Puerto
Rico is no exception. I invite you to also compare the differences in commercial and retail, as well
as industrial. You are now able to build a richer story of the struggles in rebuilding infrastructure,
many of which are unique to Puerto Rico as an island and a territory of the US. Although Puerto
Ricans follow United States Federal laws, at the time of writing, they are not permitted to vote in
presidential elections and do not have voting representation by Congress.
Recall that single quotes are used for variables listed in columns of a table. Because 'residential' is
a variable within a column, notice the single quotes. The alias, mp, is defined in the FROM statement to
simplify the query:
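Here is a sketch of the query; the table name follows the renaming convention used later in this chapter, and the SRID passed to ST_Transform is an assumption:
SELECT ST_Transform(mp.geom, 4326) AS geom, mp.landuse
FROM pr_multipolygons_2019 AS mp
WHERE mp.landuse = 'residential';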
ST_Transform will transform coordinates in your table into the specified spatial reference system. The
spatial_ref_sys table listed with your tables in pgAdmin contains a list of SRIDs.
The output of your SQL query will be like what’s shown in Figure 7.11. Geometry Viewer is a
convenient tool for observing the results of the query visually. From here, you can zoom into areas of
interest and even select the polygons for additional information:
Puerto Rico has experienced several major weather events since 2017, including hurricanes Irma and
Maria, as well as seismic activity. Hurricane Fiona moved through the area in 2022, and the impact
on infrastructure can be observed by comparing recent data with archived historical OSM data on
geofabrik. You can also explore the neighboring island, Isla de Vieques, where a hospital that was
lost in 2017 is still not fully rebuilt. This small island is also where the last Spanish fort in the
Americas was built.
Visually, even by observing the data output, it is difficult to distinguish differences such as those in
the figure provided. We will need to execute a series of SQL queries and see if we can quantify any
of the results. This will be the focus of Chapter 8, Integrating SQL with QGIS, when we will explore
raster functions.
Figure 7.12 displays 2018 data from geofabrik that was downloaded using the available shapefiles:
The column headings are slightly different but the overall structure of the query remains the same:
To highlight and work with SQL syntax, it is important to understand the datasets you are querying.
The Puerto Rico dataset, although only a snapshot of potential queries, has a wide variety of data
layers, including boundaries or polygons, such as buildings, areas of interest, and amenities that
include facilities for use by residents or visitors. Point data includes places, while lines and
multilines include routes, waterways, and highways.
Importing additional data – power plants
For example, if we want to look at the location of Power_Plants in Puerto Rico, we can upload the
data and select the data from Puerto Rico, as seen in Figure 7.13:
Click on the layers icon at the top right-hand corner of the canvas and explore the options for
basemaps that are available in pgAdmin:
There are also times when the basemap should fade into the background to highlight another feature,
such as when we look at the change in waterways after weather events, as shown in Figure 7.14
(2019) and Figure 7.15 (2022):
There are many features to explore in the datasets. Often, it is a simple query that leads to interesting
insights. Notice the number of rows generated, a simple way of quantifying those insights:
This book isn’t able to provide inexhaustible instructions on any technical tool or platform. My goal
here is to provide a template for deeper exploration both in writing SQL queries and geospatial
analysis.
You should bring a variety of data layers into any data question. This includes areas, boundaries, and
building polygons combined with point data that includes named places, as well as lines and
multilines for demonstrating roads and routes through and around a geographical area.
SQL uses an IS NOT NULL condition that simply returns TRUE when it encounters a non-NULL value;
otherwise, it will return FALSE. This is applicable to SELECT, INSERT, UPDATE, or DELETE statements.
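A minimal sketch of the condition, using the 2019 multipolygon table (the column names are assumptions based on the figures):
SELECT name, land_area, geom
FROM pr_multipolygons_2019
WHERE land_area IS NOT NULL;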
In Figure 7.16, we are querying land_area, which returns the name of the land_area specified. We
can see that three values or polygons are returned. Let’s take a look at the Geometry Viewer area in
Figure 7.17:
When we visualize our data in the Geometry Viewer area, we will see the three municipalities
represented. Curious as to why we only have these areas visualized, we can return to the data output
and observe the rows with missing data in the name column.
Figure 7.17 – Geometry Viewer of the IS NOT Null statement for land use in OSM
The data shows that not all of the municipalities have a name entered. Let’s rerun the query and see
what the output looks like if we include null values. Figure 7.18 returns the land_area in the Puerto
Rico multipolygon shapefile dataset we anticipated:
Explore this using IS NOT NULL and compare the 2022 data with the 2019 data:
The multiline data shows routes through a geographical area, so I imagined they would slowly
increase in number after a natural disaster. It might be revealing to explore the rate of rebuilding and
whether certain municipalities seem to have priority.
What do you think the 2022 data will reveal? Simply change the table year and see what you notice.
Does this change if you remove IS NOT NULL? Why or why not?
Upon exploring IS NOT NULL one more time in Figure 7.19 and comparing it to the output shown in
Figure 7.20 where we remove the expression, there are a few more routes visible:
Because OSM is a living document, one of the disadvantages includes potentially incomplete
datasets. Always do a little research to also examine what you should expect to see in a dataset. For
example, knowing the number of municipalities in a region or background knowledge about a
transportation network or drivers of change in deforestation or military presence will help inform
your analysis.
Each query follows the same basic template: SELECT attribute (which columns are you interested
in?), FROM name_table (what table are you selecting this attribute from?), and WHERE condition
(which rows should be returned?).
Try to recognize this template in queries you observe or create. There should be a story embedded in
each query.
Spatial queries
How does PostGIS extract information from a database? Recall that we are querying coordinates,
reference systems, and other dimensions from a wide variety of datasets of different geometries that
include the length of a line, the area of a polygon, or even a point location. These are collected based
on specific properties of geometries.
OpenStreetMap data is often stored in a Postgres database with the PostGIS extension. This type of
database provides fast access to the data, and the raw OSM data can be readily imported from the
database into QGIS.
When uploading, you will notice the Overpass API running in the background to extract the data
from the main OpenStreetMap database. The API is faster than simply accessing the main OSM
database server. Using SQL query language allows you to customize the subset of the data you are
interested in exploring.
That brief reminder will be helpful when we approach multi-table queries again.
I was taught this in a course I took years ago. Think of a predicate as an expression that determines
whether your initial query evaluates to true or false. The point of joining a table on a true or false
predicate is to assert something about the statement.
This predicate is based on field values replaced by spatial relationships between features in your
dataset. Now, I am saying that if the expressions I select are true, I would like to join this table to a
second table based on another set of assertions. You can keep adding joins like so, so long as you are
considering a statement that asserts something:
SELECT expressions
FROM table_1
JOIN table_2 ON predicate
JOIN table_3 ON predicate
...
Let’s use ST_Contains to add a JOIN and explore a multi-table spatial query. ST_Contains accepts two
geometries (a and b). If geometry b or the second geometry is located inside geometry a or the first
geometry, then you have a true value. If the second geometry is not located in the first, it will return
false and you won’t have a value returned. Because this function returns a binary or true/false value,
it can be used for a JOIN. Using these functions in this way is referred to as a spatial join.
We are using aliases for the multipolygon table, mp, and the points table, c. You need to select the
geometries if you want to be able to visualize the output in the Geometry Viewer area.
The question I am formulating is about determining how many hospitals are in the cities located in
the place column in the pr_points_2019 dataset. The count() function only counts non-null values
but I am interested in returning all of the rows. This is the count(*) function in the query.
So, the story is to select hospitals inside the cities and tally them up! But only do so if the building is
inside the city and also a hospital. Have a look at the following query:
SELECT mp.geom, mp.building,
count(*) AS count
FROM pr_multipolygons_2019 mp
JOIN pr_points_2019 c
ON ST_Contains(mp.geom, c.geom)
WHERE building = 'hospital'
GROUP BY mp.building, mp.geom, c.name;
Now, we have a count of the hospitals in 2019. Our dataset reports 15 hospitals. Now, let’s add the
name of the hospital to the query and rerun it with our 2022 dataset, as shown in Figure 7.22:
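The modified query is a sketch along these lines; the 2022 table names are assumptions following the 2019 naming convention:
SELECT mp.geom, mp.building, c.name,
count(*) AS count
FROM pr_multipolygons_2022 mp
JOIN pr_points_2022 c
ON ST_Contains(mp.geom, c.geom)
WHERE building = 'hospital'
GROUP BY mp.building, mp.geom, c.name;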
Remember that the data is only as complete as the updating and uploading to OSM, but we can see
that there are more hospitals back online in 2022, and overall, that is a good thing. We can rerun our
2019 data and add the names, remembering that our count(*) will also count rows with null names, as shown in
Figure 7.23:
Figure 7.23 – Identifying the names of hospitals located in the cities included in the dataset
In Figure 7.24, we can zoom into one of the hospitals. It can be challenging to locate them on the
maps since there are so few of them and they are scattered throughout the island. The majority are
located near San Juan and the larger cities. Again, when querying our datasets, we need to think
about the limitations of the data we are analyzing:
Figure 7.24 – Geometry Viewer is zoomed into an observable polygon representing a hospital in 2019
How can we capture more of the medical facilities that may have been brought online following an
earthquake, power outage, or weather disaster such as a hurricane? These are some examples of the
type of deeper dives we will take in the next chapter, where we will apply spatial statistics and
explore raster data.
Summary
In this chapter, we dove a little deeper into a few of the tools. You learned how to load a large dataset
into QGIS for access within the pgAdmin graphical interface. For me, this is a good way to appreciate
both the utility and the limitations of using pgAdmin as a management tool. I typically use it to
explore recently uploaded data via QGIS once I refresh and make sure the data tables are accessible
in pgAdmin. Functions can also be run in the query editor and saved to a file to be uploaded into QGIS
or your GIS of preference.
We will return to QGIS in the next chapter. Thanks for coming along.
8
In this chapter, you will learn to explore SQL queries within DB Manager in QGIS and you will also
learn how to import and configure raster datasets along with calculating raster functions.
QGIS plugins
Technical requirements
I invite you to find your own data if you are comfortable, or access the data recommended in this
book's GitHub repository at https://github.com/PacktPublishing/Geospatial-Analysis-with-SQL.
Figure 8.1 – The SQL query builder inside the SQL window in QGIS
2. The Where clause is populated when you select values from the Columns list in the right-hand vertical section in Figure 8.2. The
Values panel will populate and display the variables listed in the column of interest. In the example, values for landuse equal to
residential will be available. Also, don’t forget to add the = operator.
Figure 8.2 – The SQL query builder in QGIS: adding data
3. The SQL query builder is also a useful tool for improving your skills. When you hit OK, the query populates as shown in Figure
8.3. From here, you can complete the process of selecting the unique identifier in the dataset and the field that contains your
geometry (geom). I suggest naming the layer, especially if you will be querying the dataset numerous times. Otherwise, your
layers will be labeled QueryLayer, QueryLayer_2, and so on. Include geom and id as unique identifiers for loading the
data onto the canvas.
Figure 8.3 – Executing a SQL query in DB Manager, QGIS
After loading the new layer, right-click on the new layer in the QGIS panel and select Update SQL.
Figure 8.4 displays the pane and the option to update the layer on the canvas. Once you zoom in to
the layer, the canvas updates with the visual output of the SQL query you entered.
Figure 8.4 – Updating the SQL layer in QGIS
The ability to create layers and add or remove them from the canvas yields dynamic interactive
queries generated live as you explore datasets from multiple layers. Figure 8.5 shows QueryLayer
updated in the canvas.
Figure 8.5 – Updated SQL layer on the QGIS canvas
This may be a single step in a several-step workflow or an attempt to visualize the residential
properties in the recent 2022 multipolygon dataset. The ability to visualize queries using a robust
graphical interface helps to generate deeper insights. For example, how does the number of
residential properties compare to the number after the 2017 hurricane, isolated earthquakes, or any
other events that have preceded your query?
Changes to our data layers are often visually distinct but let’s see whether a SQL query can help us
make a quantitative comparison between two different datasets.
Here are the functions we will access to compare polygons between our datasets:
ST_Multi(geometry geom): https://postgis.net/docs/ST_Multi.html
ST_Intersects(geometry geomA, geometry geomB), returning boolean: https://postgis.net/docs/ST_Intersects.html
First, we are using the ST_Multi function to work with this collection of geometries.
The COALESCE function is operating like a filter to return non-null arguments from left to right, as it
looks for ST_Difference using the geometry of the landuse_2019 and landuse_2022 datasets.
You can see the query in the DB Manager SQL window in Figure 8.6. Execute and select Load as
new layer. Select landuse_2019 as the geometry column and load the layer to the canvas.
The polygons are layered in some instances and can be easily masked. To explore SQL queries within
the scope of this chapter (and this book), we will look at the differences between the polygons for
now and follow it up with a visual comparison in Figure 8.7.
Figure 8.7 – Updated query layer in the QGIS canvas
Zooming into the layers for landuse_2019 and landuse_2022, you can see the layer styling depicting
residential areas as pink, green spaces such as forests, farms, and so on as green, and industrial or
military zones as black in Figure 8.8.
Figure 8.8 – Multipolygon land use from 2019 (left) and 2022 (right)
The biggest areas of change appear to be the return of commercial and residential neighborhoods.
Once you detect a change, it is important to focus on a specific area and dig a little deeper. Explore
the built infrastructure, and add road networks or other layers to the canvas. See what you notice.
In the following query, which generated Figure 8.9, we are exploring both land use datasets and using
linear boundaries as a "blade" to split the data. You can also see CROSS JOIN LATERAL here, which is
basically a subquery that can reference earlier entries in the FROM clause, joining each row with the
result of the subquery that applies to that row (here, the blade geometry):
SELECT ST_Multi(COALESCE(
    ST_Difference(a.geom, blade.geom),
    a.geom
)) AS geom
FROM public.landuse_2019 AS a
CROSS JOIN LATERAL (
    SELECT ST_Union(b.geom) AS geom
    FROM public.landuse_2022 AS b
    WHERE ST_Intersects(a.geom, b.geom)
) AS blade;
The multipolygon landuse_2019 dataset is using linear boundaries as a blade to split the data.
ST_Intersects compares two geometries and returns true if they intersect by having, at minimum,
one point in common. ST_Union combines the geometry without overlaps, often as a geometry
collection.
Figure 8.9 – Comparing polygons as a cross join across two datasets using blade geometry
The red areas indicate differences between the datasets. The green polygons are primarily forests and
the black color in the map represents industrial or military zones. The polygon in the upper-right
corner indicates a change in forest cover in 2022. Notice the polygon is not present in Figure 8.10 on
the right.
Figure 8.10 – Output of polygon differences between landuse_2017 (left) and landuse_2022 (right)
So far, we have been working with vector geometries. Raster data is a geographic data type stored as
a grid of regularly sized pixels. You may have noticed satellite basemaps in GIS interfaces such as
QGIS, for example. Each of the cells or pixels in a grid contains values that represent information.
Rasters that represent thematic data can be derived from analyzing other data. A common application
of raster data is the classification of a satellite image by looking at land use categories.
In the next section, you will be introduced to raster data and learn about a few functions to explore
and interact with some of the features in QGIS.
Raster data
Learning how to work with raster data could fill an entire book. My goal is to present an introduction
to where to find datasets, how to upload them to your database, and how to begin interacting with
them in QGIS using a few built-in tools and the SQL query builder.
Navigating to the Plugins menu, you should have the DB Manager plugin installed. Let's add the
Google Earth Engine and Google Earth Engine Data Catalog plugins. You will not need these to
follow along with the examples provided in the chapter, but they are powerful tools for locating
raster data, and my goal is for you to continue exploring SQL and geospatial analysis beyond the
pages of this book.
SRTM_downloader is an additional resource for locating raster data. Return to the Plugins menu and
search for SRTM_downloader. Once it's downloaded, enter your credentials on the Earthdata website.
On the canvas, select Set canvas extent. Click on the icon that was loaded in your toolbar when you
downloaded the plugin. SRTM_downloader will appear as shown in Figure 8.12. Set the canvas extent
to render the raster layer and save the file locally.
Figure 8.12 – The SRTM-Downloader window
The images will load into the window and directly into your Layers panel. The raster loads onto the
canvas at the extent you set in the console in Figure 8.13.
Installing plugins
We installed plugins in earlier chapters but a brief reminder of the steps is necessary for a successful
installation. The Google Earth Engine (GEE) plugin allows you to generate maps directly from
terminal or Python console within QGIS. Because access to GEE will need to be authenticated, you
will need to link your GEE access to QGIS. You will need an active GEE account
(https://earthengine.google.com/) and the gcloud CLI (https://cloud.google.com/sdk/docs/install) to
authenticate within terminal or Python console in QGIS.
Figure 8.11 displays the Plugins option in the top menu bar in QGIS.
The biggest challenges and sources of error with installing plugins or necessary dependencies result
from inaccurate paths. Later in the chapter, we will be importing our data using terminal. An
important resource is the System setting in QGIS, shown in Figure 8.15.
Figure 8.15 – Exploring system paths in QGIS from the menu bar
Let’s go through what’s happening in the previous figure in the following set of steps:
1. Click on the QGIS version label in your menu and select Preferences.
2. If you generate an error, it is important to note the path and the nature of the error. Often, the installation is not in the right
location and you need to modify this either manually or by redirecting the path in terminal.
3. To find terminal window, you can search for it on your computer if it isn’t displayed.
4. Navigate to terminal to install earthengine-api. You can print the path to your working directory by entering pwd, or change to a
different directory by entering cd followed by the path.
5. Returning to QGIS, open your Python console, visible as the Python logo in your menu or plugins (or located by searching in the
lower left-hand window) as shown in Figure 8.13, and enter the following code into the editor. You will need to access the
documentation for installing on other systems or if you used the OSGeo4W installer for QGIS.
import ee

ee.Authenticate()  # opens a browser window to authorize access to your GEE account
ee.Initialize()    # starts the API session with your stored credentials
print(ee.Image("NASA/NASADEM_HGT/001").get("title").getInfo())  # confirms access by printing an asset title
There are a variety of system settings and if you run into any difficulties, here are a couple of good
resources. Again, the biggest reason for incompatibility is simply that your packages are not in the
same path as QGIS:
Python installation: https://developers.google.com/earth-engine/guides/python_install
One of the biggest advantages of QGIS and GEE integration is direct access to the catalog of scenes
and maps. The GEE data catalog also has a plugin to provide direct access to images and collections
of satellite data.
I have noticed that with QGIS LTR, you may need to reinstall the plugin when logging in again.
Clicking on Database Toolbar and selecting the icon for the plugin (seen during plugin download)
under Toolbars as shown in Figure 8.17 allows you to visualize the available datasets for your region
of interest.
Figure 8.17 – Viewing the toolbars in QGIS
Figure 8.18 demonstrates that depending on the region you select on your canvas, you can indicate
the dataset you want to explore, the bands, the dates for which to view satellite data, and the amount
of cloud coverage applicable to your data. This resource is useful for exploring specific date ranges
as you select satellite images based on dates. These settings require interaction to find appropriate
images. For example, perhaps there were clouds during your date intervals of interest. Allowing more
cloud coverage or extending the date of interest will allow more layers for exploration. In addition,
each image may not necessarily cover the geographic area you prefer.
The layer is added to the canvas (be sure to check the checkbox in Figure 8.18) and although there is
cloud cover, I can see areas in Figure 8.19 into which I can zoom and retrieve useful information.
Naturally, if we were only interested in a specific area, we would need to scrutinize the data to locate
usable data. Here, we are focused on the land use data from Puerto Rico that we downloaded from
Geofabrik in Chapter 7, Exploring PostGIS for Geographic Analysis:
Your work with plugins is limited only by your skills, and they are an important tool and resource
available within QGIS. There is also a cloud masking plug-in that works with Landsat data for you to
explore. I hope this introduction piqued your interest and you continue to explore. Next, I will share a
few more out-of-the-box options before introducing you to the actual resource I am using for the
examples within this chapter.
Other data resources for raster data
The Cornell University Geospatial Information Repository (CUGIR)
(https://cugir.library.cornell.edu/) is a great resource for exploring raster data. The site hosts a wide
variety of categories to download data for maps with easy downloads in different formats, as seen in
Figure 8.20.
Figure 8.20 – CUGIR
For me, CUGIR is a perfect resource for instruction but when I want timely data for a specific
question or data exploration, I head over to the United States Geological Survey (USGS)
(https://earthexplorer.usgs.gov/). Create your free account and you will have access to raster data.
USGS EarthExplorer
Use the map to select the region of interest or scroll down on the left side to find a feature of interest
in the US or the world. In this case, I am interested in Puerto Rico – not a particular area but seeking
a sample with minimal cloud cover and compelling topography for our exercise in raster exploration.
Scrolling down through the options, you can select a date range, the level of cloud cover, and your
datasets to consider. Figure 8.21 displays the datasets currently available. I am looking for Landsat
images.
Figure 8.21 – Available datasets in the USGS EarthExplorer hub
Additional search criteria can help you identify the appropriate download. In Figure 8.22, the red
Polygon option that is selected is visible and I opted to select Use Map for the reference since I had
zoomed in on the location of interest.
Figure 8.22 – Data search in EarthExplorer
The results are displayed by the date acquired, listed in the left-hand vertical panel. The footprint
icon (highlighted in blue in Figure 8.23) shows you where the satellite image is located in relation to
your highlighted polygon. When you click on the icon next in line, you can actually see what the
image will look like. You can see the actual cloud cover of the area of interest, for example. Click on
the other images and see how they compare to the other displayed results. Observe the metadata and
options for downloading the images to your local computer.
Figure 8.23 – Exploring available images in EarthExplorer
As you browse through images, what you see in Figure 8.24 displays once you click on the image
boundary on the canvas. Scroll down for information about the Landsat product you are observing.
The Browse icon will provide metadata for you to explore. This data is handy if you need to know
about how the pixel data is rendered or the projection of the dataset, for example.
I hope you will browse around the examples of different datasets to explore. Many are offered with a
variety of download formats, such as Digital Elevation Models (DEMs), GeoTiff, or even XML
(QGIS can convert to GeoTiff).
For importing and exploring raster data, we can learn about a few SQL functions, but first,
we need to locate the data and download it to our local computer. Download the years of interest
available by scrolling in the lower-left panel. I selected 2017 and 2020 to explore how land use might
have evolved due to a few hurricanes and natural events.
raster2pgsql
Importing rasters into the database on macOS, Windows, or Linux relies on the raster loader
executable, raster2pgsql. It uses the Geospatial Data Abstraction Library (GDAL), which supports
many raster and vector data formats, to load rasters into a PostGIS raster table.
Open your terminal. You can follow the installation instructions for postgres.app here:
https://postgresapp.com/. When you want to check the installation, simply enter psql into the
command line; running SELECT postgis_version(); should return output like the following:
postgis_version
---------------------------------------
3.2 USE_GEOS=1 USE_PROJ=1 USE_STATS=1
(1 row)
Now that you can be confident that you are up and running, let's import our downloaded raster data.
Enter raster2pgsql -G to see the GDAL-supported raster formats.
The command parameters for importing data are the same, with a few programmable options. The
syntax is relatively straightforward once you are familiar with it.
raster2pgsql reads an input file and loads it into the database as 100 x 100 tiles. The -I option creates a spatial
index on the raster column once the table has been generated. A spatial index will speed up your
queries.
The following code statement is explained here. Raster constraints are generated by -C (pixel size
and extent, for example), and -s assigns the SRID. Following the 100 x 100 tiles, you will enter the path to the
downloaded file highlighted in the following code. I am using the public schema and have already
created a database table (public.landuse_2017) and database (bonnymcclain) in pgAdmin:
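A sketch of the full command follows; the input file path is a placeholder, and the SRID is an assumption for the downloaded scene:
raster2pgsql -C -I -s 4326 -t 100x100 /path/to/your_raster.tif public.landuse_2017 | psql -d bonnymcclain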
Here is an example from my terminal to demonstrate the output and index creation (a snapshot):
Remember that if you are having trouble, figure out the correct path, then enter cd and the name of
the folder to bring you to the right location (depending on where your files are located):
cd Applications
If you don’t indicate a directory, you will simply go to your home folder (or select cd ~). Here are a
few more examples:
Entering ls (list) shows you the contents of the folder you selected, and cd / returns you to the root
level of your startup disk.
You might have to look up the shell commands for your specific operating system.
ST_Metadata
When you want to explore your raster dataset, you can either run a query in pgAdmin or QGIS. I often
have both running so I can do simple queries where I am not going to need a graphical interface
directly in pgAdmin. The SRID and number of bands are quick pieces of information I like to have
access to when writing queries or visualizing maps, as in Figure 8.26:
SELECT
rid,
(ST_Metadata(rast)).*
FROM
public.landuse_2017
LIMIT
100
Figure 8.26 – ST_Metadata SQL query
In addition to ST_Metadata running in either pgAdmin or QGIS, the database format in pgAdmin lists
raster information in the tree running below the browser, as shown in Figure 8.27:
The SQL code that renders similar information is commented out (indicated by --) so that it won't
run. If you enter SELECT * FROM raster_columns and scroll from left to right, you will have
information on all of your raster tables.
Polygonizing a raster
ST_DumpAsPolygons returns geomval rows, created by a specific geometry (geom) and the pixel value
(val). The polygon displayed is from the union of pixels with the same pixel value (val) from the
selected band:
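Here is a sketch of the query; band 1 and the 2017 table are assumptions based on this chapter's import:
SELECT (dp).geom, (dp).val
FROM (
SELECT ST_DumpAsPolygons(rast, 1) AS dp
FROM public.landuse_2017
) AS sub;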
Load it as a new layer and now you can view it on the canvas in QGIS, as shown in Figure 8.28.
Select the QueryLayer or landuse_2022 polygon to see the rasterized polygon. To increase visibility,
select Singleband pseudocolor in Layer Properties, selecting the defaults in Figure 8.29.
QueryLayer will provide raster information and landuse_2022 contains information about features
such as the military base highlighted in red in Figure 8.30.
Figure 8.29 – Layer properties in QGIS for raster band rendering
Although I don’t recommend going crazy with colors, I have shown these options to illustrate the
scope of customization available.
Figure 8.30 – Single-band pseudocolor and land use multipolygons
Figure 8.31 is a close-up of the information that pops up when you select a polygon on the canvas:
Figure 8.31 – The landuse_2022 feature layer, along with our ST_DumpPolygon and raster layer
You should now have the polygonized raster on the canvas and be able to explore the relationships
displayed.
Summary
In this chapter, you saw the value of spatial queries when comparing different datasets consisting of
polygons or multipolygons. You uploaded vector and raster data into the database and were
introduced to additional QGIS plugins. Accessing data resources and importing them from the terminal allowed you to work with raster functions and explore the process of creating polygons out of rasters.
In this book, you have been introduced to SQL and geospatial analysis using PostGIS and QGIS as
helpful tools in processing both vector geometries and raster data. I recommend exploring additional
datasets and adding different functions to your query language skills.
Thank you for your time and attention along the way; I wish you luck on your continued spatial journey.
Index
As this ebook edition doesn't have fixed pagination, the page numbers below are hyperlinked for
reference only, based on the printed edition of this book.
A
American National Standards Institute (ANSI) 5
Anaconda 191
ANALYZE function 55
anomalies
  detecting 47-49
B
bounding boxes 52
Brownsville 48
C
census-designated places 136-139
census table
  codes 78-82
  columns, renaming 78
  URL 194
CREATE statement 22
D
data
  importing 36-39
  preparing 102
databases
  connecting to 53-58
dataset
E
Earthdata
  URL 187
F
Federal Information Processing Standard (FIPS) 5
functions, SQL
  creating 118-121
fundamentals
  data types 7
  spatial databases 4
  SQL 11
  SRIDs 5-7
G
gcloud
  URL 189
geofabrik 155
geospatial analytics
  fundamentals 3
  raster data 10
  vector data 8
H
hypotheses
  testing 47-49
I
International Electrotechnical Commission (IEC) 5
using 167-170
L
Landscape Change Monitoring System (LCMS) 197, 198
  URL 197
M
Miniconda 191
multi-table joins
  exploring 173-177
N
nodes 156
O
object-oriented programming (OOP) language 12
URL 80
OSM data
OSM features
  reference link 105
P
Palmas 104
exploring 61-67
patterns
  detecting 47-49
pgAdmin 19
  download link 19
  installation 19-21
polygons
PostGIS
PostGIS documentation 52
PostgreSQL 11, 15
  download link 15
  installation 15-19
prediction models
  building 91-96
psql
Q
QGIS 23
  download link 23
  installing 23
queries
  fundamentals 39
  sorting 136
QuickOSM 102
R
raster
  analyzing 10, 11
  polygonizing 202
relations 156
S
SELECT statement 12
spatial concepts
  ST_Within 109
spatial databases 4
  creating 28-30
spatial relationships
  analyzing 45, 46
spatial vectors
SQL
  CREATE statement 22
  exploring 11
  SELECT statement 12
SQL queries
  executing 53-58
  in QGIS 180-183
SRIDs 5, 6
  PostGIS SRID 6, 7
SRTM downloader
ST_Area function
ST_Buffer 116
ST_Contains
  exploring 173
ST_DumpAsPolygons 202-204
ST_DWITHIN function
ST_Intersects function 54
ST_Within 110
T
table columns
  renaming 78
U
unique identifier 138
URL 194
datasets 194
images 196
V
VACUUM function 55
W
Well-Known Text (WKT) 110, 202
Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Why subscribe?
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at packt.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of
free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
ISBN: 978-1-80324-166-1
Automate map production to make and edit maps at scale, cutting down on repetitive tasks
Automate data updates using the ArcPy Data Access module and cursors
Query, edit, and append to feature layers and create symbology with renderers and colorizers
Learn new tricks to manage data for entire cities or large companies
Learning ArcGIS Pro
ISBN: 978-1-78528-449-6
Install ArcGIS Pro and assign Licenses to users in your organization
Navigate and use the ArcGIS Pro ribbon interface to create maps and perform analysis
Author map layouts using cartographic tools and best practices to show off the results of your analysis and maps
Import existing map documents, scenes, and globes into your new ArcGIS Pro projects quickly
Your review is important to us and the tech community and will help us make sure we’re delivering
excellent quality content.
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical
books directly into your application.
The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
https://packt.link/free-ebook/9781835083147
That’s it! We’ll send your free PDF and other benefits to your email directly.