Post Gis Intro
Release 1.0
1 Welcome
2 Introduction
3 Installation
7 Simple SQL
9 Geometries
10 Geometry Exercises
11 Spatial Relationships
13 Spatial Joins
15 Spatial Indexing
16 Projecting Data
17 Projection Exercises
18 Geography
21 Geometry Constructing Exercises
23 Validity
24 Equality
28 3-D
Index
CHAPTER 1
Welcome
These sections conform to a number of conventions to make it easier to follow the conversation. This
section gives a brief overview of what to expect in the way of typographic conventions, as well as a short
overview of the structure of each workbook.
1.1.1 Directions
Directions for you, the workshop attendee, will be noted by bold font.
For example:
Click Next to continue.
1.1.2 Code
SELECT postgis_full_version();
These examples can be entered into the query window or command line interface.
Introduction to PostGIS, Release 1.0
1.1.3 Notes
Notes are used to provide information that is useful but not critical to the overall understanding of the
topic.
Note: If you haven’t eaten an apple today, the doctor may be on the way.
1.1.4 Functions
Where function names are defined in the text, they will be rendered in a bold font.
For example:
ST_Touches(geometry A, geometry B) returns TRUE if either of the geometries’
boundaries intersect
File names, paths, table names and column names will be shown in fixed-width font.
For example:
Select the name column in the nyc_streets table.
Menus/submenus and form elements such as fields or check boxes and other on-screen artifacts are
displayed in italics.
For example:
Click on the File > New menu. Check the box that says Confirm.
1.1.7 Workflow
Sections are designed to be progressive. Each section starts with the assumption that you have
completed and understood the previous section in the series, and builds on that knowledge. A single
section progresses through a handful of ideas and provides working examples wherever possible. At
the end of a section, where appropriate, we have included a handful of exercises to allow you to try out
the ideas we’ve presented. In some cases the section will include “Things To Try”. These tasks
contain more complex problems than the exercises and are designed to challenge participants with
advanced knowledge.
CHAPTER 2
Introduction
PostGIS is a spatial database. Oracle Spatial and SQL Server (2008 and later) are also spatial databases.
But what does that mean; what is it that makes an ordinary database a spatial database?
The short answer is...
Spatial databases store and manipulate spatial objects like any other object in the database.
The following briefly covers the evolution of spatial databases, and then reviews three aspects that
associate spatial data with a database: data types, indexes, and functions.
1. Spatial data types refer to shapes such as point, line, and polygon;
2. Multi-dimensional spatial indexing is used for efficient processing of spatial operations;
3. Spatial functions, posed in SQL, are for querying of spatial properties and relationships.
Combined, spatial data types, indexes, and functions provide a flexible structure for optimized
performance and analysis.
In legacy first-generation GIS implementations, all spatial data is stored in flat files and special GIS
software is required to interpret and manipulate the data. These first-generation management systems are
designed to meet the needs of users where all required data is within the user’s organizational domain.
They are proprietary, self-contained systems specifically built for handling spatial data.
Second-generation spatial systems store some data in relational databases (usually the “attribute” or
non-spatial parts) but still lack the flexibility afforded with direct integration.
True spatial databases were born when people started to treat spatial features as first-class
database objects.
Spatial databases fully integrate spatial data with a relational database. The system orientation changes
from GIS-centric to database-centric.
Note: A spatial database management system may be used in applications besides the geographic
world. Spatial databases are used to manage data related to the anatomy of the human body,
large-scale integrated circuits, molecular structures, and electro-magnetic fields, among others.
An ordinary database has strings, numbers, and dates. A spatial database adds additional (spatial) types
for representing geographic features. These spatial data types abstract and encapsulate spatial structures
such as boundary and dimension. In many respects, spatial data types can be understood simply as
shapes.
Spatial data types are organized in a type hierarchy. Each sub-type inherits the structure (attributes) and
the behavior (methods or functions) of its super-type.
An ordinary database provides indexes to allow for fast and random access to subsets of data. Indexing
for standard types (numbers, strings, dates) is usually done with B-tree indexes.
A B-tree partitions the data using the natural sort order to put the data into a hierarchical tree. The natural
sort order of numbers, strings, and dates is simple to determine – every value is less than, greater than
or equal to every other value.
But because polygons can overlap, can be contained in one another, and are arrayed in a two-dimensional
(or more) space, a B-tree cannot be used to efficiently index them. Real spatial databases provide a
“spatial index” that instead answers the question “which objects are within this particular bounding
box?”.
A bounding box is the smallest rectangle – parallel to the coordinate axes – capable of containing a
given feature.
Bounding boxes are used because answering the question “is A inside B?” is very computationally
intensive for polygons but very fast in the case of rectangles. Even the most complex polygons and
linestrings can be represented by a simple bounding box.
Indexes have to perform quickly in order to be useful. So instead of providing exact results, as B-trees
do, spatial indexes provide approximate results. The question “what lines are inside this polygon?” will
be instead interpreted by a spatial index as “what lines have bounding boxes that are contained inside
this polygon’s bounding box?”
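The gap between the index's approximate answer and the exact answer can be sketched with two literal geometries. This is an illustrative example with made-up coordinates, not part of the workshop data. The && operator compares bounding boxes only, while ST_Intersects tests the actual shapes.

```sql
-- A diagonal line and a small square that sits in a corner of the line's
-- bounding box without touching the line itself (illustrative coordinates).
WITH shapes AS (
  SELECT
    'LINESTRING(0 0, 10 10)'::geometry               AS line,
    'POLYGON((8 0, 10 0, 10 2, 8 2, 8 0))'::geometry AS square
)
SELECT
  line && square              AS bboxes_overlap,    -- bounding-box test only
  ST_Intersects(line, square) AS shapes_intersect   -- exact geometric test
FROM shapes;
```

Here the bounding boxes overlap but the shapes do not intersect, which is why a spatial query typically uses the index as a fast first filter and then applies the exact test to the surviving candidates.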
The actual spatial indexes implemented by various databases vary widely. The most common
implementations are the R-Tree and Quadtree (used in PostGIS), but there are also grid-based indexes
and GeoHash indexes implemented in other spatial databases.
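As a minimal sketch of how the two kinds of index are declared in PostgreSQL/PostGIS, the statements below create a B-tree index on a text column and a GiST-based spatial index on a geometry column. The table demo_parcels and the index names are hypothetical and used only for illustration.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE demo_parcels (
  id   serial PRIMARY KEY,
  name text,
  geom geometry(Polygon, 26918)
);

-- B-tree (the default index type) on an ordinary, sortable column.
CREATE INDEX demo_parcels_name_idx ON demo_parcels (name);

-- Spatial index on the geometry column, built on PostgreSQL's GiST
-- framework and organized around the bounding boxes of the geometries.
CREATE INDEX demo_parcels_geom_idx ON demo_parcels USING GIST (geom);
```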
For manipulating data during a query, an ordinary database provides functions such as concatenating
strings, performing hash operations on strings, doing mathematics on numbers, and extracting
information from dates.
A spatial database provides a complete set of functions for analyzing geometric components,
determining spatial relationships, and manipulating geometries. These spatial functions serve as the
building blocks for any spatial project.
The majority of all spatial functions can be grouped into one of the following five categories:
1. Conversion: Functions that convert between geometries and external data formats.
2. Management: Functions that manage information about spatial tables and PostGIS administration.
3. Retrieval: Functions that retrieve properties and measurements of a Geometry.
4. Comparison: Functions that compare two geometries with respect to their spatial relation.
5. Generation: Functions that generate new geometries from others.
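As a rough illustration of the list above, the query below calls one function from each of four categories on literal geometries. The coordinates are made up, and treating postgis_full_version() as a management function is an assumption made here for the sake of the example.

```sql
SELECT
  -- Conversion: text to geometry and back to text
  ST_AsText(ST_GeomFromText('POINT(1 2)'))                 AS conversion,
  -- Management: report on the PostGIS installation
  postgis_full_version()                                   AS management,
  -- Retrieval: measure a property of a geometry
  ST_Area('POLYGON((0 0, 0 2, 2 2, 2 0, 0 0))'::geometry)  AS retrieval,
  -- Comparison: test the spatial relation between two geometries
  ST_Intersects(
    'POINT(1 1)'::geometry,
    'POLYGON((0 0, 0 2, 2 2, 2 0, 0 0))'::geometry
  )                                                        AS comparison;
```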
PostGIS turns the PostgreSQL Database Management System into a spatial database by adding support
for these three features: spatial types, spatial indexes, and spatial functions. Because it is built on
PostgreSQL, PostGIS automatically inherits important “enterprise” features as well as open standards
for implementation.
A common question from people familiar with open source databases is, “Why wasn’t PostGIS built on
MySQL?”
PostgreSQL has:
• Proven reliability and transactional integrity by default (ACID)
• Careful support for SQL standards (full SQL92)
• Pluggable type extension and function extension
• Community-oriented development model
• No limit on column sizes (“TOAST”able tuples) to support big GIS objects
• Generic index structure (GiST) to allow R-Tree index
• Easy to add custom functions
Combined, these features give PostgreSQL a very easy development path for adding new spatial types.
In the proprietary world, only Illustra (now Informix Universal Server) allowed such easy extension.
This is no coincidence; Illustra is a proprietary re-working of the original PostgreSQL code base from
the 1980s.
Because the development path for adding types to PostgreSQL was so straightforward, it made sense to
start there. When MySQL released basic spatial types in version 4.1, the PostGIS team took a look at
their code, and the exercise reinforced the original decision to use PostgreSQL.
Because MySQL spatial objects had to be hacked on top of the string type as a special case, the MySQL
code was spread over the entire code base. Development of PostGIS 0.1 took under a month. Doing a
“MyGIS” 0.1 would have taken a lot longer, and as such, might never have seen the light of day.
Shapefiles (and other formats such as the Esri File Geodatabase and the GeoPackage) have been a
standard way of storing and interacting with spatial data since GIS software was first written. However,
these “flat” files have the following disadvantages:
• Files require special software to read and write. SQL is an abstraction for random data access
and analysis. Without that abstraction, you will need to write all the access and analysis code
yourself.
• Concurrent users can cause corruption and slowdowns. While it’s possible to write extra code
to ensure that multiple writes to the same file do not corrupt the data, by the time you have solved
the problem and also solved the associated performance problem, you will have written the better
part of a database system. Why not just use a standard database?
• Complicated questions require complicated software to answer. Complicated and interesting
questions (spatial joins, aggregations, etc) that are expressible in one line of SQL in the database
take hundreds of lines of specialized code to answer when programming against files.
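To make the last point concrete, here is a sketch of a spatial join with aggregation, written against the workshop tables that are loaded later (nyc_neighborhoods and nyc_subway_stations). It assumes both tables store their shapes in a column named geom, which is the convention used for the other workshop tables.

```sql
-- Count the subway stations that fall inside each Brooklyn neighborhood.
-- Assumes both tables have a geometry column named geom.
SELECT n.name, count(*) AS station_count
FROM nyc_neighborhoods AS n
JOIN nyc_subway_stations AS s
  ON ST_Contains(n.geom, s.geom)
WHERE n.boroname = 'Brooklyn'
GROUP BY n.name
ORDER BY station_count DESC;
```

Answering the same question against files would mean reading both files, testing every station against every polygon, and tallying the results by hand.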
Most users of PostGIS are setting up systems where multiple applications will be expected to access the
data, so having a standard SQL access method simplifies deployment and development. Some users are
working with large data sets; with files, they might be segmented into multiple files, but in a database
they can be stored as a single large table.
In summary, the combination of support for multiple users, complex ad hoc queries, and performance
on large data sets is what sets spatial databases apart from file-based systems.
In May 2001, Refractions Research released the first version of PostGIS. PostGIS 0.1 had objects,
indexes and a handful of functions. The result was a database suitable for storage and retrieval, but not
analysis.
As the number of functions increased, the need for an organizing principle became clear. The “Simple
Features for SQL” (SFSQL) specification from the Open Geospatial Consortium provided such structure
with guidelines for function naming and requirements.
With PostGIS support for simple analysis and spatial joins, Mapserver became the first external
application to provide visualization of data in the database.
Over the next several years the number of PostGIS functions grew, but its power remained limited. Many
of the most interesting functions (e.g., ST_Intersects(), ST_Buffer(), ST_Union()) were very difficult to
code. Writing them from scratch promised years of work.
Fortunately a second project, the “Geometry Engine, Open Source” or GEOS, came along. The GEOS
library provides the necessary algorithms for implementing the SFSQL specification. By linking in GEOS,
PostGIS provided complete support for SFSQL by version 0.8.
As PostGIS data capacity grew, another issue surfaced: the representation used to store geometry proved
relatively inefficient. For small objects like points and short lines, the metadata in the representation had
as much as a 300% overhead. For performance reasons, it was necessary to put the representation on a
diet. By shrinking the metadata header and required dimensions, overhead was greatly reduced. In
PostGIS 1.0, this new, faster, lightweight representation became the default.
Recent releases of PostGIS continue to add features and performance improvements, as well as support
for new features in the PostgreSQL core system.
For a complete list of case studies, see the PostGIS case studies page.
IGN is the national mapping agency of France, and uses PostGIS to store the high resolution topographic
map of the country, “BDUni”. BDUni has more than 100 million features, and is maintained by more
than 100 field staff who verify observations and add new mapping to the database daily. The IGN
installation uses the database transactional system to ensure consistency during update processes, and a
warm standby system to maintain uptime in the event of a system failure.
RedFin
RedFin is a real estate agency with a web-based service for exploring properties and estimating values.
Their system was originally built on MySQL, but they found that moving to PostgreSQL and PostGIS
provided huge benefits in performance and reliability.
PostGIS has become a widely used spatial database, and the number of third-party programs that support
storing and retrieving data using it has increased as well. The programs that support PostGIS include
both open source and proprietary software on both server and desktop systems.
The following table shows a list of some of the software that leverages PostGIS:
Open/Free
• Loading/Extracting
– Shp2Pgsql
– ogr2ogr
– Dxf2PostGIS
• Web-Based
– Mapserver
– GeoServer (Java-based WFS / WMS server)
– SharpMap SDK for ASP.NET 2.0
– MapGuide Open Source (using FDO)
• Desktop
– uDig
– QGIS
– mezoGIS
– OpenJUMP
– OpenEV
– SharpMap SDK for Microsoft.NET 2.0
– ZigGIS for ArcGIS/ArcObjects.NET
– GvSIG
– GRASS
Closed/Proprietary
• Loading/Extracting
– Safe FME Desktop Translator/Converter
• Web-Based
– Ionic Red Spider (now ERDAS)
– Cadcorp GeognoSIS
– Iwan Mapserver
– MapDotNet Server
– MapGuide Enterprise (using FDO)
– ESRI ArcGIS Server
• Desktop
– Cadcorp SIS
– Microimages TNTmips GIS
– ESRI ArcGIS
– Manifold
– GeoConcept
– MapInfo (v10)
– AutoCAD Map 3D (using FDO)
CHAPTER 3
Installation
To explore the PostgreSQL/PostGIS database, and learn about writing spatial queries in SQL, we will
need some software, either installed locally or available remotely in the cloud.
• There are instructions below on how to access PostgreSQL for installation on Windows or MacOS.
PostgreSQL for Windows and MacOS either include PostGIS or have an easy way to add it on.
• There are instructions below on how to install pgAdmin. pgAdmin is a graphical database explorer
and SQL editor which provides a “user facing” interface to the database engine that does all the
work.
For always up-to-date directions on installing PostgreSQL, go to the PostgreSQL download page and
select the operating system you are using.
3. In the Applications folder, double-click the Postgres icon to start the server.
CHAPTER 4
4.1 PgAdmin
PostgreSQL has a number of administrative front-ends. The primary one is psql, a command-line tool
for entering SQL queries. Another popular PostgreSQL front-end is the free and open source graphical
tool pgAdmin. All queries done in pgAdmin can also be done on the command line with psql.
1. Find pgAdmin and start it up.
2. If this is the first time you have run pgAdmin, you probably don’t have any servers configured.
Right click the Servers item in the Browser panel.
We’ll name our server PostGIS. In the Connection tab, enter the Host name/address. If
you’re working with a local PostgreSQL install, you’ll be able to use localhost. If you’re
using a cloud service, you should be able to retrieve the host name from your account.
Leave Port set at 5432, and both Maintenance database and Username as postgres. The
Password should be what you specified with a local install or with your cloud service.
1. Open the Databases tree item and have a look at the available databases. The postgres database
is the user database for the default postgres user and is not too interesting to us.
2. Right-click on the Databases item and select New Database.
3. Fill in the Create Database form as shown below and click OK.
Name nyc
Owner postgres
4. Select the new nyc database and open it up to display the tree of objects. You’ll see the public
schema.
5. Click on the SQL query button indicated below (or go to Tools > Query Tool).
6. Enter the following query into the query text field to load the PostGIS spatial extension:
CREATE EXTENSION postgis;
7. Click the Play button in the toolbar (or press F5) to execute the query.
8. Now confirm that PostGIS is installed by running a PostGIS function:
SELECT postgis_full_version();
Supported by a wide variety of libraries and applications, PostGIS provides many options for loading
data.
We will first load our working data from a database backup file, then review some standard ways of
loading different GIS data formats using common tools.
1. In the pgAdmin browser, right-click on the nyc database icon, and then select the Restore...
option.
2. Browse to the location of your workshop data directory, and select the nyc_data.backup
file.
3. Click on the Restore options tab, scroll down to the Do not save section and toggle Owner to
Yes.
4. Click the Restore button. The database restore should run to completion without errors.
5. After the load is complete, right click the nyc database, and select the Refresh option to update
the client information about what tables exist in the database.
ogr2ogr is a command-line utility for converting data between GIS data formats, including common
file formats and common spatial databases.
Windows:
• Builds of ogr2ogr can be downloaded from GIS Internals.
• ogr2ogr is included as part of the QGIS install and is accessible via the OSGeo4W Shell.
• Builds of ogr2ogr can be downloaded from MS4W.
MacOS:
• If you installed Postgres.app, then you will find ogr2ogr in the
/Applications/Postgres.app/Contents/Versions/*/bin directory.
• Alternately, you can download an independent build of GDAL from KyngChaos and install
that.
• Finally, if you have installed HomeBrew you can install the gdal package to get access to
ogr2ogr.
Linux:
• If you installed Linux from packages, ogr2ogr should be installed and on your PATH
already as part of the gdal or libgdal* packages.
The PostGIS workshop data directory includes a 2000/ sub-directory, which contains shapefiles from
the 2000 census that were made obsolete by data from the 2010 census. We can practice data loading
using those files, to avoid creating name collisions with the data we already loaded using the backup
file. Make sure your shell is in the 2000/ sub-directory when running these commands:
export PGPASSWORD=mydatabasepassword
Rather than passing the password in the connection string, we put it in the environment, so it won’t be
visible in the process list while the command runs.
Note that on Windows, you will need to use set instead of export.
ogr2ogr \
-nln nyc_census_blocks_2000 \
-nlt PROMOTE_TO_MULTI \
-lco GEOMETRY_NAME=geom \
-lco FID=gid \
-lco PRECISION=NO \
Pg:'dbname=nyc host=localhost user=pramsey port=5432' \
nyc_census_blocks_2000.shp
For more visual clarity, these lines are displayed with \, but they should be written in one line on your
shell.
ogr2ogr has a huge number of options, and we’re only using a handful of them here. Here is a
line-by-line explanation of the command.
ogr2ogr \
The executable name! You may need to ensure the executable location is in your PATH or use the full
path to the executable, depending on your setup.
-nln nyc_census_blocks_2000 \
The nln option stands for “new layer name”, and sets the table name that will be created in the target
database.
-nlt PROMOTE_TO_MULTI \
The nlt option stands for “new layer type”. For shape file input in particular, the new layer type is often
a “multi-part geometry”, so the system needs to be told in advance to use “MultiPolygon” instead of
“Polygon” for the geometry type.
-lco GEOMETRY_NAME=geom \
-lco FID=gid \
-lco PRECISION=NO \
The lco option stands for “layer create option”. Different drivers have different create options, and we
are using three options for the PostgreSQL driver here.
• GEOMETRY_NAME sets the column name for the geometry column. We prefer “geom” over
the default, so that our tables match the standard column names in the workshop.
• FID sets the primary key column name. Again we prefer “gid” which is the standard used in the
workshop.
• PRECISION controls how numeric fields are represented in the database. The default when
loading a shape file is to use the database “numeric” type, which is more precise but sometimes
harder to work with than simple number types like “integer” and “double precision”. We use “NO”
to turn off the “numeric” type.
Pg:'dbname=nyc host=localhost user=pramsey port=5432' \
The order of arguments in ogr2ogr is, roughly: executable, then options, then destination location,
then source location. So this is the destination, the connection string for our PostgreSQL database. The
“Pg:” portion is the driver name, and then the connection string is contained in quotation marks (because
it might have embedded spaces).
nyc_census_blocks_2000.shp
The source data set in this case is the shape file we are reading. It is possible to read multiple layers in
one invocation by putting the connection string here, and then following it with a list of layer names, but
in this case we have just the one shape file to load.
You may be asking yourself, “What’s this shapefile thing?” A “shapefile” commonly refers to a
collection of files with .shp, .shx, .dbf, and other extensions on a common prefix name (e.g.,
nyc_census_blocks). The actual shapefile relates specifically to files with the .shp extension. However,
the .shp file alone is incomplete for distribution without the required supporting files.
Mandatory files:
• .shp—shape format; the feature geometry itself
• .shx—shape index format; a positional index of the feature geometry
• .dbf—attribute format; columnar attributes for each shape, in dBase III
Optional files include:
• .prj—projection format; the coordinate system and projection information, a plain text file
describing the projection using well-known text format
The shp2pgsql utility makes shape data usable in PostGIS by converting it from binary data into a
series of SQL commands that are then run in the database to load the data.
shp2pgsql is part of the PostGIS code base and ships with PostGIS packages. If you installed
PostgreSQL locally on your computer, you may find that shp2pgsql has been installed along with it
and is available in the executable directory of your installation.
Unlike ogr2ogr, shp2pgsql does not connect directly to the destination database; it just emits the
SQL equivalent of the input shape file. It is up to the user to pass the SQL to the database, either with a
“pipe” or by saving the SQL to a file and then loading it.
Here is an example invocation, loading the same data as before:
export PGPASSWORD=mydatabasepassword
shp2pgsql \
-D \
-I \
-s 26918 \
nyc_census_blocks_2000.shp \
nyc_census_blocks_2000 \
| psql "dbname=nyc user=postgres host=localhost"
shp2pgsql \
The executable program! It reads the source data file, and emits SQL which can be directed to a file or
piped to psql to load directly into the database.
-D \
The D flag tells the program to generate “dump format” which is much faster to load than the default
“insert format”.
-I \
The I flag tells the program to create a spatial index on the table after loading is complete.
-s 26918 \
The s flag tells the program what the “spatial reference identifier (SRID)” of the data is. The source data
for this workshop is all in “UTM 18”, for which the SRID is 26918 (see below).
nyc_census_blocks_2000.shp \
nyc_census_blocks_2000 \
The utility program is generating a stream of SQL. The “|” operator takes that stream and uses it as input
to the psql database terminal program. The arguments to psql are just the connection string for the
destination database.
Most of the import process is self-explanatory, but even experienced GIS professionals can trip over an
SRID.
SRID stands for “Spatial Reference IDentifier”. It defines all the parameters of our data’s geographic
coordinate system and projection. An SRID is convenient because it packs all the information about
a map projection (which can be quite complex) into a single number.
You can see the definition of our workshop map projection by looking it up either in an online database,
• https://epsg.io/26918
or directly inside PostGIS with a query to the spatial_ref_sys table.
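A query along the following lines performs that lookup inside the database. It is a sketch that relies on the standard spatial_ref_sys columns (srid, auth_name, auth_srid, srtext, proj4text). The srtext column holds the well-known text definition.

```sql
-- Look up the definition of the workshop projection by its SRID.
SELECT srid, auth_name, auth_srid, srtext, proj4text
FROM spatial_ref_sys
WHERE srid = 26918;
```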
Note: The PostGIS spatial_ref_sys table is an OGC-standard table that defines all the
spatial reference systems known to the database. The data shipped with PostGIS lists over 3000 known
spatial reference systems and the details needed to transform/re-project between them.
In both cases, you see a textual representation of the 26918 spatial reference system (pretty-printed here
for clarity):
If you open up the nyc_neighborhoods.prj file from the data directory, you’ll see the same
projection definition.
Data you receive from local agencies, such as New York City, will usually be in a local projection
noted by “state plane” or “UTM”. Our projection is “Universal Transverse Mercator (UTM) Zone 18
North” or EPSG:26918.
QGIS is a desktop GIS viewer/editor for quickly looking at data. You can view a number of data formats
including flat shapefiles and a PostGIS database. Its graphical interface allows for easy exploration of
your data, as well as simple testing and fast styling.
Try using this software to connect to your PostGIS database. The application can be downloaded from
http://qgis.org
The data for this workshop is four shapefiles for New York City, and one attribute table of
sociodemographic variables. We’ve loaded our shapefiles as PostGIS tables and will add
sociodemographic data later in the workshop.
The following describes the number of records and table attributes for each of our datasets. These
attribute values and relationships are fundamental to our future analysis.
To explore the nature of your tables in pgAdmin, right-click a highlighted table and select Properties.
You will find a summary of table properties, including a list of table attributes within the Columns tab.
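The same information can also be retrieved with SQL. As a sketch, the first query below lists a table's columns from the standard PostgreSQL information_schema.columns view, and the second counts its rows, which should match the record count given for each table in the sections that follow.

```sql
-- List the columns of a table and their data types.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'nyc_census_blocks'
ORDER BY ordinal_position;

-- Count the records in the table.
SELECT count(*) FROM nyc_census_blocks;
```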
6.1 nyc_census_blocks
A census block is the smallest geography for which census data is reported. All higher level census
geographies (block groups, tracts, metro areas, counties, etc) can be built from unions of census blocks.
We have attached some demographic data to our collection of blocks.
Number of records: 36592
blkid       A 15-digit code that uniquely identifies every census block, e.g. 360050001009000
popn_total  Total number of people in the census block
popn_white  Number of people self-identifying as “White” in the block
popn_black  Number of people self-identifying as “Black” in the block
popn_nativ  Number of people self-identifying as “Native American” in the block
popn_asian  Number of people self-identifying as “Asian” in the block
popn_other  Number of people self-identifying with other categories in the block
boroname    Name of the New York borough: Manhattan, The Bronx, Brooklyn, Staten Island, Queens
geom        Polygon boundary of the block
Note: To get census data into GIS, you need to join two pieces of information: the actual data
(text), and the boundary files (spatial). There are many options for getting the data, including
downloading data and boundaries from the Census Bureau’s American FactFinder.
6.2 nyc_neighborhoods
New York has a rich history of neighborhood names and extent. Neighborhoods are social constructs that
do not follow lines laid down by the government. For example, the Brooklyn neighborhoods of Carroll
Gardens, Red Hook, and Cobble Hill were once collectively known as “South Brooklyn.” And now,
depending on which real estate agent you talk to, the same four blocks in
the-neighborhood-formerly-known-as-Red-Hook can be referred to as Columbia Heights, Carroll Gardens
West, or Red Hook!
Number of records: 129
6.3 nyc_streets
The street centerlines form the transportation network of the city. These streets have been flagged with
types in order to distinguish between such thoroughfares as back alleys, arterial streets, freeways, and
smaller streets. Desirable areas to live might be on residential streets rather than next to a freeway.
Number of records: 19091
6.4 nyc_subway_stations
The subway stations link the upper world where people live to the invisible network of subways beneath.
As portals to the public transportation system, station locations help determine how easy it is for different
people to enter the subway system.
Number of records: 491
Fig. 3: The streets of New York City. Major roads are in red.
6.5 nyc_census_sociodata
There is a rich collection of socioeconomic data collected during the census process, but only at the
larger geography level of census tract. Census blocks combine to form census tracts (and block groups).
We have collected some socioeconomic data at the census tract level to answer some of the more
interesting questions about New York City.
tractid                  An 11-digit code that uniquely identifies every census tract (“36005000100”)
transit_total            Number of workers in the tract
transit_private          Number of workers in the tract who use private automobiles / motorcycles
transit_public           Number of workers in the tract who take public transit
transit_walk             Number of workers in the tract who walk
transit_other            Number of workers in the tract who use other forms like walking / biking
transit_none             Number of workers in the tract who work from home
transit_time_mins        Total number of minutes spent in transit by all workers in the tract (minutes)
family_count             Number of families in the tract
family_income_median     Median family income in the tract (dollars)
family_income_mean       Average family income in the tract (dollars)
family_income_aggregate  Total income of all families in the tract (dollars)
edu_total                Number of people with educational history
edu_no_highschool_dipl   Number of people with no high school diploma
edu_highschool_dipl      Number of people with high school diploma and no further education
edu_college_dipl         Number of people with college diploma and no further education
edu_graduate_dipl        Number of people with graduate school diploma
Simple SQL
SQL, or “Structured Query Language”, is a means of asking questions of, and updating data in, relational
databases. You have already seen SQL when we created our first database. Recall:
SELECT postgis_full_version();
But that was a question about the database. Now that we’ve loaded data into our database, let’s use SQL
to ask questions of the data! For example,
"What are the names of all the neighborhoods in New York City?"
Open up the SQL query window in pgAdmin by clicking the "Query Tool" button.
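The query for this question appears to be missing from this copy of the text. A minimal sketch, assuming the neighborhood names live in the name column of nyc_neighborhoods (the same column the Brooklyn example later in this section reads), would be:

```sql
-- List the name of every neighborhood in the table (no filter applied)
SELECT name
FROM nyc_neighborhoods;
```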
The query will run for a few (milli)seconds and return the 129 results.
But what exactly happened here? To understand, let’s begin with the four "verbs" of SQL:
• SELECT, returns rows in response to a query
• INSERT, adds new rows to a table
• UPDATE, alters existing rows in a table
• DELETE, removes rows from a table
We will be working almost exclusively with SELECT in order to ask questions of tables using spatial
functions.
Note: For a synopsis of all SELECT parameters, see the PostgreSQL documentation.
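The placeholder names used in the next paragraph (some_columns, some_data_source, some_condition) describe the general shape of a SELECT query. The sketch below shows that shape as a comment, followed by one concrete instance against the workshop data. The choice of columns and the 'Manhattan' filter are only illustrative.

```sql
-- General shape (placeholders, not real objects):
--   SELECT some_columns FROM some_data_source WHERE some_condition;

-- One concrete instance:
SELECT name, boroname          -- some_columns
FROM nyc_neighborhoods         -- some_data_source
WHERE boroname = 'Manhattan';  -- some_condition
```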
The some_columns are either column names or functions of column values. The
some_data_source is either a single table, or a composite table created by joining two tables
on a key or condition. The some_condition is a filter that restricts the number of rows to be
returned.
"What are the names of all the neighborhoods in Brooklyn?"
We return to our nyc_neighborhoods table with a filter in hand. The table contains all the
neighborhoods in New York, but we only want the ones in Brooklyn.
SELECT name
FROM nyc_neighborhoods
WHERE boroname = 'Brooklyn';
The query will run for even fewer (milli)seconds and return the 23 results.
Sometimes we will need to apply a function to the results of our query. For example,
"What is the number of letters in the names of all the neighborhoods in Brooklyn?"
Fortunately, PostgreSQL has a string length function, char_length(string).
SELECT char_length(name)
FROM nyc_neighborhoods
WHERE boroname = 'Brooklyn';
Often, we are less interested in the individual rows than in a statistic that applies to all of them. So
knowing the lengths of the neighborhood names might be less interesting than knowing the average
length of the names. Functions that take in multiple rows and return a single result are called "aggregate"
functions.
PostgreSQL has a series of built-in aggregate functions, including the general purpose avg() for
average values and stddev() for standard deviations.
"What is the average number of letters and standard deviation of number of letters in the
names of all the neighborhoods in Brooklyn?"
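The query that produces the avg and stddev columns shown next is not present in this copy. A sketch that combines the two aggregates named above with the char_length() call from the previous example might look like this:

```sql
-- Both aggregates run over the same filtered set of Brooklyn rows
SELECT avg(char_length(name)), stddev(char_length(name))
FROM nyc_neighborhoods
WHERE boroname = 'Brooklyn';
```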
avg | stddev
---------------------+--------------------
11.7391304347826087 | 3.9105613559407395
The aggregate functions in our last example were applied to every row in the result set. What if we
want the summaries to be carried out over smaller groups within the overall result set? For that we
add a GROUP BY clause. Aggregate functions often need an added GROUP BY statement to group the
result-set by one or more columns.
"What is the average number of letters in the names of all the neighborhoods in New York
City, reported by borough?"
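As a sketch of the grouped form of the previous query (the original statement is not shown in this copy), the WHERE filter is dropped and the rows are grouped by borough instead:

```sql
-- One output row per borough; boroname is allowed in the output
-- because it is the grouping column
SELECT boroname, avg(char_length(name))
FROM nyc_neighborhoods
GROUP BY boroname;
```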
We include the boroname column in the output result so we can determine which statistic applies to
which borough. In an aggregate query, you can only output columns that are either (a) members of the
grouping clause or (b) aggregate functions.
avg(expression): PostgreSQL aggregate function that returns the average value of a numeric column.
char_length(string): PostgreSQL string function that returns the number of characters in a string.
stddev(expression): PostgreSQL aggregate function that returns the standard deviation of input values.
Using the nyc_census_blocks table, answer the following questions (don’t peek at the answers!).
Here is some helpful information to get started. Recall from the About Our Data section our
nyc_census_blocks table definition.
And, here are some common SQL aggregation functions you might find useful:
• avg() - the average (mean) of the values in a set of records
• sum() - the sum of the values in a set of records
• count() - the number of records in a set of records
Now the questions:
• How many records are in the nyc_streets table?
SELECT Count(*)
FROM nyc_streets;
19091
• How many streets in New York have names that start with 'B'?
SELECT Count(*)
FROM nyc_streets
WHERE name LIKE 'B%';
1282
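The next answer, together with the note on AS that follows it, points to a query that totals a population column under an alias. A sketch, assuming the question asked for the total population of the city and using the popn_total column that appears in the percentage query later in these exercises:

```sql
-- Sum the block populations and label the output column "population"
SELECT Sum(popn_total) AS population
FROM nyc_census_blocks;
```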
8175032
Note: What is this AS? You can give a table or a column another name by using an alias.
Aliases can make queries easier to both write and read. So instead of leaving our output column
named sum, we write it AS the more readable population.
1385108
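The table that follows lists one count per borough, and its counts add up to the 129 neighborhoods returned earlier in this chapter. That suggests it came from a grouped count over nyc_neighborhoods, which could be sketched as:

```sql
-- Count the neighborhoods in each borough
SELECT boroname, count(*)
FROM nyc_neighborhoods
GROUP BY boroname;
```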
boroname | count
---------------+-------
Queens | 30
Brooklyn | 23
Staten Island | 24
The Bronx | 24
Manhattan | 28
• For each borough, what percentage of the population is white?
SELECT
boroname,
100.0 * Sum(popn_white)/Sum(popn_total) AS white_pct
FROM nyc_census_blocks
GROUP BY boroname;
boroname | white_pct
---------------+------------------
Brooklyn | 42.8011737932687
Manhattan | 57.4493039480463
The Bronx | 27.9037446899448
Queens | 39.722077394591
Staten Island | 72.8942034860154
avg(expression): PostgreSQL aggregate function that returns the average value of a numeric column.
count(expression): PostgreSQL aggregate function that returns the number of records in a set of records.
sum(expression): PostgreSQL aggregate function that returns the sum of the values in a set of records.
Geometries
9.1 Introduction
In the previous section, we loaded a variety of data. Before we start playing with our data, let’s have a
look at some simpler examples. In pgAdmin, once again select the nyc database and open the SQL query
tool. Paste this example SQL code into the pgAdmin SQL Editor window (removing any text that may
be there by default) and then execute.
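The example to paste is missing from this copy. The sketch below rebuilds it from the rest of the chapter: the table name, the first four row names and their WKT strings all appear in later queries and outputs. The column types, the name of the fifth row, and the contents of that collection are assumptions.

```sql
-- Column types are an assumption; later queries only need "name" and "geom"
CREATE TABLE geometries (name varchar, geom geometry);

INSERT INTO geometries VALUES
  ('Point', 'POINT(0 0)'),
  ('Linestring', 'LINESTRING(0 0, 1 1, 2 1, 2 2)'),
  ('Polygon', 'POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))'),
  ('PolygonWithHole', 'POLYGON((0 0, 10 0, 10 10, 0 10, 0 0),(1 1, 1 2, 2 2, 2 1, 1 1))'),
  -- Assumed row: any small heterogeneous collection would serve here
  ('Collection', 'GEOMETRYCOLLECTION(POINT(2 0),POLYGON((0 0, 1 0, 1 1, 0 1, 0 0)))');

-- Show the inserted rows in readable form
SELECT name, ST_AsText(geom) FROM geometries;
```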
The above example CREATEs a table (geometries) then INSERTs five geometries: a point, a line, a
polygon, a polygon with a hole, and a collection. Finally, the inserted rows are SELECTed and displayed
in the Output pane.
In conformance with the Simple Features for SQL (SFSQL) specification, PostGIS provides two tables
to track and report on the geometry types available in a given database.
• The first table, spatial_ref_sys, defines all the spatial reference systems known to the
database and will be described in greater detail later.
• The second table (actually, a view), geometry_columns, provides a listing of all "features"
(defined as an object with geometric attributes), and the basic details of those features.
Let’s have a look at the geometry_columns table in our database. Paste this command in the Query
Tool as before:
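The command itself is not shown in this copy. Since geometry_columns is an ordinary view, a plain select of all its columns is enough to see the fields described below:

```sql
-- One row per geometry column registered in the database
SELECT * FROM geometry_columns;
```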
• f_geometry_column is the name of the column that contains the geometry. For feature
tables with multiple geometry columns, there will be one record for each.
• coord_dimension and srid define the dimension of the geometry (2-, 3- or 4-dimensional)
and the spatial reference system identifier that refers to the spatial_ref_sys
table, respectively.
• The type column defines the type of geometry as described below; we’ve seen Point and
Linestring types so far.
By querying this table, GIS clients and libraries can determine what to expect when retrieving data and
can perform any necessary projection, processing or rendering without needing to inspect each geometry.
Note: Do some or all of your nyc tables not have an srid of 26918? It’s easy to fix by updating
the table.
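A sketch of such a fix for one table follows. It assumes the geometry column is called geom (as in the other queries in this workshop) and that the table holds MultiPolygon geometries; adjust the type for tables of points or lines. ST_SetSRID only relabels the coordinates, it does not reproject them, so this is appropriate only when the data is already in SRID 26918.

```sql
-- Declare the column type with an explicit SRID and stamp
-- every existing row with that SRID in the same statement
ALTER TABLE nyc_neighborhoods
  ALTER COLUMN geom
  TYPE Geometry(MultiPolygon, 26918)
  USING ST_SetSRID(geom, 26918);
```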
The Simple Features for SQL (SFSQL) specification, the original guiding standard for PostGIS
development, defines how a real world object is represented. By taking a continuous shape and digitizing it at a
fixed resolution we achieve a passable representation of the object. SFSQL only handled 2-dimensional
representations. PostGIS has extended that to include 3- and 4-dimensional representations; more
recently the SQL-Multimedia Part 3 (SQL/MM) specification has officially defined its own representation.
Our example table contains a mixture of different geometry types. We can collect general information
about each object using functions that read the geometry metadata.
• ST_GeometryType(geometry) returns the type of the geometry
• ST_NDims(geometry) returns the number of dimensions of the geometry
• ST_SRID(geometry) returns the spatial reference identifier number of the geometry
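Applied to the example table, the three metadata functions listed above can be called side by side. This is a sketch that assumes the geometries table from the start of the chapter exists:

```sql
-- Type, dimensionality and SRID for every row of the example table
SELECT name, ST_GeometryType(geom), ST_NDims(geom), ST_SRID(geom)
FROM geometries;
```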
9.3.1 Points
A spatial point represents a single location on the Earth. This point is represented by a single coordinate
(including either 2-, 3- or 4-dimensions). Points are used to represent objects when the exact details,
such as shape and size, are not important at the target scale. For example, cities on a map of the world
can be described as points, while a map of a single state might represent cities as polygons.
SELECT ST_AsText(geom)
FROM geometries
WHERE name = 'Point';
POINT(0 0)
Some of the specific spatial functions for working with points are:
• ST_X(geometry) returns the X ordinate
• ST_Y(geometry) returns the Y ordinate
So, we can read the ordinates from a point like this:
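The query is not shown in this copy; a sketch using the two functions just listed on the 'Point' row of the example table:

```sql
-- For POINT(0 0) both ordinates are 0
SELECT ST_X(geom), ST_Y(geom)
FROM geometries
WHERE name = 'Point';
```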
The New York City subway stations (nyc_subway_stations) table is a data set represented as
points. The following SQL query will return the geometry associated with one point (in the ST_AsText
column).
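The statement is missing here. One way to fetch a single station, assuming the table has the name and geom columns used by the other queries in this workshop:

```sql
-- LIMIT 1 returns an arbitrary single row; no ordering is implied
SELECT name, ST_AsText(geom)
FROM nyc_subway_stations
LIMIT 1;
```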
9.3.2 Linestrings
A linestring is a path between locations. It takes the form of an ordered series of two or more points.
Roads and rivers are typically represented as linestrings. A linestring is said to be closed if it starts and
ends on the same point. It is said to be simple if it does not cross or touch itself (except at its endpoints
if it is closed). A linestring can be both closed and simple.
The street network for New York (nyc_streets) was loaded earlier in the workshop. This dataset
contains details such as name, and type. A single real world street may consist of many linestrings, each
representing a segment of road with different attributes.
The following SQL query will return the geometry associated with one linestring (in the ST_AsText
column).
SELECT ST_AsText(geom)
FROM geometries
WHERE name = 'Linestring';
LINESTRING(0 0, 1 1, 2 1, 2 2)
Some of the specific spatial functions for working with linestrings are:
• ST_Length(geometry) returns the length of the linestring
• ST_StartPoint(geometry) returns the first coordinate as a point
• ST_EndPoint(geometry) returns the last coordinate as a point
• ST_NPoints(geometry) returns the number of coordinates in the linestring
So, the length of our linestring is:
SELECT ST_Length(geom)
FROM geometries
WHERE name = 'Linestring';
3.41421356237309
9.3.3 Polygons
A polygon is a representation of an area. The outer boundary of the polygon is represented by a ring.
This ring is a linestring that is both closed and simple as defined above. Holes within the polygon are
also represented by rings.
Polygons are used to represent objects whose size and shape are important. City limits, parks, building
footprints or bodies of water are all commonly represented as polygons when the scale is sufficiently
high to see their area. Roads and rivers can sometimes be represented as polygons.
The following SQL query will return the geometry associated with one polygon (in the ST_AsText
column).
SELECT ST_AsText(geom)
FROM geometries
WHERE name LIKE 'Polygon%';
Note: Rather than using an = sign in our WHERE clause, we are using the LIKE operator to
carry out a string matching operation. You may be used to the * symbol as a "glob" for pattern
matching, but in SQL the % symbol is used, along with the LIKE operator, to tell the system to
do globbing.
POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))
POLYGON((0 0, 10 0, 10 10, 0 10, 0 0),(1 1, 1 2, 2 2, 2 1, 1 1))
The first polygon has only one ring. The second one has an interior "hole". Most graphics systems
include the concept of a "polygon", but GIS systems are relatively unusual in allowing polygons to
explicitly have holes.
Some of the specific spatial functions for working with polygons are:
• ST_Area(geometry) returns the area of the polygons
• ST_NRings(geometry) returns the number of rings (usually 1, more if there are holes)
• ST_ExteriorRing(geometry) returns the outer ring as a linestring
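The two-row result that follows (Polygon, PolygonWithHole) matches what an area query over the example table would return. A sketch of that query, reusing the LIKE filter from above:

```sql
-- Area of each polygon row; holes are subtracted from the shell
SELECT name, ST_Area(geom)
FROM geometries
WHERE name LIKE 'Polygon%';
```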
Polygon 1
PolygonWithHole 99
Note that the polygon with a hole has an area that is the area of the outer shell (a 10x10 square) minus
the area of the hole (a 1x1 square).
9.3.4 Collections
There are four collection types, which group multiple simple geometries into sets.
• MultiPoint, a collection of points
• MultiLineString, a collection of linestrings
• MultiPolygon, a collection of polygons
• GeometryCollection, a heterogeneous collection of any geometry (including other collections)
Collections are another concept that shows up in GIS software more than in generic graphics software.
They are useful for directly modeling real world objects as spatial objects. For example, how to model a
lot that is split by a right-of-way? As a MultiPolygon, with a part on either side of the right-of-way.
Some of the specific spatial functions for working with collections are:
• ST_NumGeometries(geometry) returns the number of parts in the collection
• ST_GeometryN(geometry,n) returns the specified part
• ST_Area(geometry) returns the total area of all polygonal parts
• ST_Length(geometry) returns the total length of all linear parts
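To see these functions at work without relying on any table, the sketch below builds a small collection from a WKT literal. The collection itself is an illustrative assumption (one point and one unit square).

```sql
WITH sample AS (
  SELECT 'GEOMETRYCOLLECTION(POINT(2 0), POLYGON((0 0, 1 0, 1 1, 0 1, 0 0)))'::geometry AS geom
)
SELECT
  ST_NumGeometries(geom)           AS parts,       -- number of members
  ST_AsText(ST_GeometryN(geom, 1)) AS first_part,  -- members are numbered from 1
  ST_Area(geom)                    AS total_area   -- only polygonal members contribute
FROM sample;
```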
Within the database, geometries are stored on disk in a format only used by the PostGIS program. In
order for external programs to insert and retrieve useful geometries, they need to be converted into a
format that other applications can understand. Fortunately, PostGIS supports emitting and consuming
geometries in a large number of formats:
• Well-known text (WKT)
– ST_GeomFromText(text, srid) returns geometry
– ST_AsText(geometry) returns text
– ST_AsEWKT(geometry) returns text
• Well-known binary (WKB)
– ST_GeomFromWKB(bytea) returns geometry
– ST_AsBinary(geometry) returns bytea
– ST_AsEWKB(geometry) returns bytea
• Geographic Mark-up Language (GML)
– ST_GeomFromGML(text) returns geometry
– ST_AsGML(geometry) returns text
• Keyhole Mark-up Language (KML)
– ST_GeomFromKML(text) returns geometry
– ST_AsKML(geometry) returns text
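• GeoJSON
– ST_AsGeoJSON(geometry) returns text
• Scalable Vector Graphics (SVG)
– ST_AsSVG(geometry) returns text
The following query shows a geometry in WKB form. The call to encode() converts the binary
output into hex text for printing: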
SELECT encode(
ST_AsBinary(ST_GeometryFromText('LINESTRING(0 0,1 0)')),
'hex');
01020000000200000000000000000000000000000000000000000000000000f03f0000000000000000
For the purposes of this workshop we will continue to use WKT to ensure you can read and understand
the geometries we’re viewing. However, for most actual processes, such as viewing data in a GIS application,
transferring data to a web service, or processing data remotely, WKB is the format of choice.
Since WKT and WKB were defined in the SFSQL specification, they do not handle 3- or 4-dimensional
geometries. For these cases PostGIS has defined the Extended Well Known Text (EWKT) and Extended
Well Known Binary (EWKB) formats. These provide the same formatting capabilities of WKT and
WKB with the added dimensionality.
Here is an example of a 3D linestring in WKT:
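The example is missing from this copy. A sketch with an arbitrary three-vertex line, where each vertex carries a third (Z) ordinate:

```sql
-- The input omits any dimension marker; compare it with the text
-- that ST_AsText returns
SELECT ST_AsText(ST_GeometryFromText('LINESTRING(0 0 0,1 0 0,1 1 2)'));
```

On PostGIS 2.0 or later the returned text should carry an explicit Z marker after the type name, which is the change the next paragraph refers to.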
Note that the text representation changes! This is because the text input routine for PostGIS is liberal in
what it consumes. It will consume
• hex-encoded EWKB,
• extended well-known text, and
• ISO standard well-known text.
On the output side, the ST_AsText function is conservative, and only emits ISO standard well-known
text.
In addition to the ST_GeometryFromText function, there are many other ways to create geometries
from well-known text or similar formatted inputs:
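The list of alternatives is not present in this copy. The statements below sketch several equivalent ways to build the same point. The coordinates and SRID 4326 are illustrative choices, and the :: casting syntax used in the last two is explained further down.

```sql
-- ST_GeomFromText with the SRID parameter
SELECT ST_GeomFromText('POINT(2 2)', 4326);

-- ST_GeomFromText without the SRID parameter, SRID attached afterwards
SELECT ST_SetSRID(ST_GeomFromText('POINT(2 2)'), 4326);

-- A constructor function that takes the ordinates directly
SELECT ST_SetSRID(ST_MakePoint(2, 2), 4326);

-- PostgreSQL casting syntax and ISO WKT
SELECT ST_SetSRID('POINT(2 2)'::geometry, 4326);

-- PostgreSQL casting syntax and extended WKT
SELECT 'SRID=4326;POINT(2 2)'::geometry;
```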
In addition to emitters for the various forms (WKT, WKB, GML, KML, JSON, SVG), PostGIS also
has consumers for four (WKT, WKB, GML, KML). Most applications use the WKT or WKB geometry
creation functions, but the others work too. Here’s an example that consumes GML and outputs JSON:
SELECT ST_AsGeoJSON(ST_GeomFromGML('<gml:Point><gml:coordinates>1,1</gml:coordinates></gml:Point>'));
The WKT strings we’ve seen so far have been of type 'text' and we have been converting them to type
'geometry' using PostGIS functions like ST_GeomFromText().
PostgreSQL includes a short form syntax that allows data to be converted from one type to another, the
casting syntax, olddata::newtype. So for example, this SQL converts a double into a text string.
SELECT 0.9::text;
One thing to note about using casting to create geometries: unless you specify the SRID, you will get
a geometry with an unknown SRID. You can specify the SRID using the „extended“ well-known text
form, which includes an SRID block at the front:
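A sketch of that form follows; SRID 4326 and the point are illustrative. The second statement reads the SRID back so the effect of the prefix can be checked.

```sql
-- The "SRID=...;" block precedes the ordinary WKT
SELECT 'SRID=4326;POINT(0 0)'::geometry;

-- Confirm the SRID that the cast attached (4326)
SELECT ST_SRID('SRID=4326;POINT(0 0)'::geometry);
```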
It’s very common to use the casting notation when working with WKT, as well as geometry and
geography columns (see Geography).
ST_Area: Returns the area of the surface if it is a polygon or multi-polygon. For „geometry“ type area
is in SRID units. For „geography“ area is in square meters.
ST_AsText: Returns the Well-Known Text (WKT) representation of the geometry/geography without
SRID metadata.
ST_AsBinary: Returns the Well-Known Binary (WKB) representation of the geometry/geography
without SRID meta data.
ST_EndPoint: Returns the last point of a LINESTRING geometry as a POINT.
ST_AsEWKB: Returns the Well-Known Binary (WKB) representation of the geometry with SRID meta
data.
ST_AsEWKT: Returns the Well-Known Text (WKT) representation of the geometry with SRID meta
data.
ST_AsGeoJSON: Returns the geometry as a GeoJSON element.
ST_AsGML: Returns the geometry as a GML version 2 or 3 element.
ST_AsKML: Returns the geometry as a KML element. Several variants. Default version=2, default
precision=15.
ST_AsSVG: Returns a Geometry in SVG path data given a geometry or geography object.
ST_ExteriorRing: Returns a line string representing the exterior ring of the POLYGON geometry. Return
NULL if the geometry is not a polygon. Will not work with MULTIPOLYGON
ST_GeometryN: Returns the 1-based Nth geometry if the geometry is a GEOMETRYCOLLECTION,
MULTIPOINT, MULTILINESTRING, MULTICURVE or MULTIPOLYGON. Otherwise, return
NULL.
ST_GeomFromGML: Takes as input GML representation of geometry and outputs a PostGIS geometry
object.
ST_GeomFromKML: Takes as input KML representation of geometry and outputs a PostGIS geometry
object
ST_GeomFromText: Returns a specified ST_Geometry value from Well-Known Text representation
(WKT).
ST_GeomFromWKB: Creates a geometry instance from a Well-Known Binary geometry representation
(WKB) and optional SRID.
ST_GeometryType: Returns the geometry type of the ST_Geometry value.
ST_InteriorRingN: Returns the Nth interior linestring ring of the polygon geometry. Return NULL if the
geometry is not a polygon or the given N is out of range.
ST_Length: Returns the 2d length of the geometry if it is a linestring or multilinestring. For geometry the
result is in units of the spatial reference; for geography it is in meters (default spheroid).
ST_NDims: Returns coordinate dimension of the geometry as a small int. Values are: 2,3 or 4.
ST_NPoints: Returns the number of points (vertexes) in a geometry.
ST_NRings: If the geometry is a polygon or multi-polygon returns the number of rings.
ST_NumGeometries: If geometry is a GEOMETRYCOLLECTION (or MULTI*) returns the number of
geometries, otherwise return NULL.
ST_Perimeter: Returns the length measurement of the boundary of an ST_Surface or ST_MultiSurface
value. (Polygon, Multipolygon)
ST_SRID: Returns the spatial reference identifier for the ST_Geometry as defined in spatial_ref_sys
table.
ST_StartPoint: Returns the first point of a LINESTRING geometry as a POINT.
ST_X: Returns the X coordinate of the point, or NULL if not available. Input must be a point.
ST_Y: Returns the Y coordinate of the point, or NULL if not available. Input must be a point.
CHAPTER 10
Geometry Exercises
Here’s a reminder of all the functions we have seen so far. They should be useful for the exercises!
• sum(expression) aggregate to return a sum for a set of records
• count(expression) aggregate to return the size of a set of records
• ST_GeometryType(geometry) returns the type of the geometry
• ST_NDims(geometry) returns the number of dimensions of the geometry
• ST_SRID(geometry) returns the spatial reference identifier number of the geometry
• ST_X(point) returns the X ordinate
• ST_Y(point) returns the Y ordinate
• ST_Length(linestring) returns the length of the linestring
• ST_StartPoint(geometry) returns the first coordinate as a point
• ST_EndPoint(geometry) returns the last coordinate as a point
• ST_NPoints(geometry) returns the number of coordinates in the linestring
• ST_Area(geometry) returns the area of the polygons
• ST_NRings(geometry) returns the number of rings (usually 1, more if there are holes)
• ST_ExteriorRing(polygon) returns the outer ring as a linestring
• ST_InteriorRingN(polygon, integer) returns a specified interior ring as a linestring
• ST_Perimeter(geometry) returns the length of all the rings
• ST_NumGeometries(multi/geomcollection) returns the number of parts in the collection
• ST_GeometryN(geometry, integer) returns the specified part of the collection
• ST_GeomFromText(text) returns geometry
• ST_AsText(geometry) returns WKT text
10.1 Exercises
• What is the area of the 'West Village' neighborhood?
SELECT ST_Area(geom)
FROM nyc_neighborhoods
WHERE name = 'West Village';
1044614.5296486
Note: The area is given in square meters. To get an area in hectares, divide by 10000. To
get an area in acres, divide by 4047.
• What is the geometry type of 'Pelham St'? What is its length?
SELECT
ST_GeometryType(geom),
ST_Length(geom)
FROM nyc_streets
WHERE name = 'Pelham St';
ST_MultiLineString
50.323
• What is the GeoJSON representation of the 'Broad St' subway station?
SELECT
ST_AsGeoJSON(geom)
FROM nyc_subway_stations
WHERE name = 'Broad St';
{"type":"Point",
"crs":{"type":"name","properties":{"name":"EPSG:26918"}},
"coordinates":[583571.905921312,4506714.341192182]}
• What is the total length of streets (in kilometers) in New York City? (Hint: The units of
measurement of the spatial data are meters, there are 1000 meters in a kilometer.)
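The answer query is not shown in this copy. Following the hint, a sketch that sums the length of every street and converts meters to kilometers:

```sql
-- ST_Length returns meters for this data; divide by 1000 for kilometers
SELECT Sum(ST_Length(geom)) / 1000 AS length_km
FROM nyc_streets;
```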
10418.9047172
13965.3201224118
or...
14601.3987215548
Tottenville
• What is the length of 'Columbus Cir'?
SELECT ST_Length(geom)
FROM nyc_streets
WHERE name = 'Columbus Cir';
308.34199
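The table that follows has a type column and a length column sorted from longest to shortest, and the closing note mentions ORDER BY length DESC. The statement that produced it is missing from this copy. A query of the following shape would give that layout, assuming the street type is stored in the type column of nyc_streets:

```sql
-- Total street length per type, longest first
SELECT type, Sum(ST_Length(geom)) AS length
FROM nyc_streets
GROUP BY type
ORDER BY length DESC;
```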
type | length
--------------------------------------------------+------------------
residential | 8629870.33786606
motorway | 403622.478126363
tertiary | 360394.879051303
motorway_link | 294261.419479668
secondary | 276264.303897926
unclassified | 166936.371604458
primary | 135034.233017947
footway | 71798.4878378096
service | 28337.635038596
trunk | 20353.5819826076
cycleway | 8863.75144825929
pedestrian | 4867.05032825026
construction | 4803.08162103562
residential; motorway_link | 3661.57506293745
trunk_link | 3202.18981240201
primary_link | 2492.57457083536
living_street | 1894.63905457332
primary; residential; motorway_link; residential | 1367.76576941335
undefined | 380.53861910346
steps | 282.745221342127
motorway_link; residential | 215.07778911517
Note: The ORDER BY length DESC clause sorts the result by length in descending
order. The result is that the most prevalent types are first in the list.
Spatial Relationships
So far we have only used spatial functions that measure (ST_Area, ST_Length), serialize
(ST_AsGML) or deserialize (ST_GeomFromText) geometries. What these functions have in common
is that they only work on one geometry at a time.
Spatial databases are powerful because they not only store geometry, they also have the ability to
compare relationships between geometries.
Questions like "Which are the closest bike racks to a park?" or "Where are the intersections of subway
lines and streets?" can only be answered by comparing geometries representing the bike racks, streets,
and subway lines.
The OGC standard defines the following set of methods to compare geometries.
11.1 ST_Equals
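ST_Equals(geometry A, geometry B) tests whether two geometries represent the same shape. To try it, we first need the stored representation of a geometry to compare against. The statement that produced the row below is missing from this copy; a sketch:

```sql
-- Selecting geom without ST_AsText returns its hex-encoded binary form
SELECT name, geom
FROM nyc_subway_stations
WHERE name = 'Broad St';
```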
name | geom
----------+---------------------------------------------------
Broad St | 0101000020266900000EEBD4CF27CF2141BC17D69516315141
SELECT name
FROM nyc_subway_stations
WHERE ST_Equals(
geom,
'0101000020266900000EEBD4CF27CF2141BC17D69516315141');
Broad St
Note: The representation of the point was not very human readable
(0101000020266900000EEBD4CF27CF2141BC17D69516315141) but it was an exact
representation of the coordinate values. For a test like equality, using the exact coordinates is necessary.
ST_Intersects, ST_Crosses, and ST_Overlaps test whether the interiors of the geometries
intersect.
ST_Intersects(geometry A, geometry B) returns t (TRUE) if the two shapes have any
space in common, i.e., if their boundaries or interiors intersect.
The opposite of ST_Intersects is ST_Disjoint(geometry A , geometry B). If two geometries
are disjoint, they do not intersect, and vice-versa. In fact, it is often more efficient to test "not
intersects" than to test "disjoint" because the intersects tests can be spatially indexed, while the disjoint
test cannot.
For multipoint/polygon, multipoint/linestring, linestring/linestring, linestring/polygon, and
linestring/multipolygon comparisons, ST_Crosses(geometry A, geometry B) returns t (TRUE)
if the intersection results in a geometry whose dimension is one less than the maximum dimension of
the two source geometries and the intersection set is interior to both source geometries.
ST_Overlaps(geometry A, geometry B) compares two geometries of the same dimension
and returns TRUE if their intersection set results in a geometry different from both but of the same
dimension.
Let’s take our Broad Street subway station and determine its neighborhood using the ST_Intersects
function:
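The two statements behind the outputs below are missing from this copy. A sketch of the two steps, reusing the point coordinates printed in the first output and SRID 26918, which this workshop's data uses:

```sql
-- Step 1: read the station location as WKT
SELECT name, ST_AsText(geom)
FROM nyc_subway_stations
WHERE name = 'Broad St';

-- Step 2: find the neighborhood whose polygon intersects that point
SELECT name, boroname
FROM nyc_neighborhoods
WHERE ST_Intersects(geom, ST_GeomFromText('POINT(583571 4506714)', 26918));
```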
POINT(583571 4506714)
name | boroname
--------------------+-----------
Financial District | Manhattan
11.3 ST_Touches
ST_Touches tests whether two geometries touch at their boundaries, but do not intersect in their
interiors.
ST_Touches(geometry A, geometry B) returns TRUE if either of the geometries’ boundaries
intersect or if only one of the geometry’s interiors intersects the other’s boundary.
ST_Within and ST_Contains test whether one geometry is fully within the other.
ST_Within(geometry A , geometry B) returns TRUE if the first geometry is completely
within the second geometry. ST_Within is the converse of ST_Contains: ST_Within(A, B)
returns the same result as ST_Contains(B, A).
ST_Contains(geometry A, geometry B) returns TRUE if the second geometry is completely
contained by the first geometry.
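The relationship between the two functions can be checked with literal geometries. The point and square below are illustrative assumptions and need no table:

```sql
WITH shapes AS (
  SELECT
    'POINT(1 1)'::geometry                         AS pt,
    'POLYGON((0 0, 2 0, 2 2, 0 2, 0 0))'::geometry AS poly
)
SELECT
  ST_Within(pt, poly)   AS pt_within_poly,    -- true: the point lies inside the square
  ST_Contains(poly, pt) AS poly_contains_pt,  -- true: same test, arguments swapped
  ST_Within(poly, pt)   AS poly_within_pt     -- false: argument order matters
FROM shapes;
```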
An extremely common GIS question is "find all the stuff within distance X of this other stuff".
The ST_Distance(geometry A, geometry B) calculates the shortest distance between two
geometries and returns it as a float. This is useful for actually reporting back the distance between
objects.
SELECT ST_Distance(
ST_GeometryFromText('POINT(0 5)'),
ST_GeometryFromText('LINESTRING(-2 2, 2 2)'));
For testing whether two objects are within a distance of one another, the ST_DWithin function
provides an index-accelerated true/false test. This is useful for questions like "how many trees are within
a 500 meter buffer of the road?". You don’t have to calculate an actual buffer, you just have to test the
distance relationship.
Using our Broad Street subway station again, we can find the streets nearby (within 10 meters of) the
subway stop:
SELECT name
FROM nyc_streets
WHERE ST_DWithin(geom, ST_GeomFromText('POINT(583571 4506714)', 26918), 10);
name
--------------
Wall St
Broad St
Nassau St
And we can verify the answer on a map. The Broad St station is actually at the intersection of Wall,
Broad and Nassau Streets.
ST_Contains(geometry A, geometry B): Returns true if and only if no points of B lie in the exterior of
A, and at least one point of the interior of B lies in the interior of A.
ST_Crosses(geometry A, geometry B): Returns TRUE if the supplied geometries have some, but not all,
interior points in common.
ST_Disjoint(geometry A , geometry B): Returns TRUE if the Geometries do not "spatially intersect" -
if they do not share any space together.
ST_Distance(geometry A, geometry B): Returns the 2-dimensional cartesian minimum distance (based
on spatial ref) between two geometries in projected units.
ST_DWithin(geometry A, geometry B, radius): Returns true if the geometries are within the specified
distance (radius) of one another.
ST_Equals(geometry A, geometry B): Returns true if the given geometries represent the same geometry.
Directionality is ignored.
ST_Intersects(geometry A, geometry B): Returns TRUE if the Geometries/Geography "spatially
intersect" - (share any portion of space) and FALSE if they don’t (they are Disjoint).
ST_Overlaps(geometry A, geometry B): Returns TRUE if the Geometries share space, are of the same
dimension, but are not completely contained by each other.
ST_Touches(geometry A, geometry B): Returns TRUE if the geometries have at least one point in
common, but their interiors do not intersect.
ST_Within(geometry A , geometry B): Returns true if the geometry A is completely inside geometry B
Here’s a reminder of the functions we saw in the last section. They should be useful for the exercises!
• sum(expression) aggregate to return a sum for a set of records
• count(expression) aggregate to return the size of a set of records
• ST_Contains(geometry A, geometry B) returns true if geometry A contains geometry B
• ST_Crosses(geometry A, geometry B) returns true if geometry A crosses geometry B
• ST_Disjoint(geometry A , geometry B) returns true if the geometries do not "spatially intersect"
• ST_Distance(geometry A, geometry B) returns the minimum distance between geometry A and geometry B
• ST_DWithin(geometry A, geometry B, radius) returns true if geometry A is radius distance or less from geometry B
• ST_Equals(geometry A, geometry B) returns true if geometry A is the same as geometry B
• ST_Intersects(geometry A, geometry B) returns true if geometry A intersects geometry B
• ST_Overlaps(geometry A, geometry B) returns true if geometry A and geometry B share space, but are not completely contained by each other.
• ST_Touches(geometry A, geometry B) returns true if the boundary of geometry A touches geometry B
• ST_Within(geometry A, geometry B) returns true if geometry A is within geometry B
Also remember the tables we have available:
• nyc_census_blocks
12.1 Exercises
• What is the geometry value for the street named 'Atlantic Commons'?
SELECT ST_AsText(geom)
FROM nyc_streets
WHERE name = 'Atlantic Commons';
MULTILINESTRING((586781.701577724 4504202.15314339,586863.51964484 4504215.9881701))
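The next result lists a neighborhood and a borough, which suggests the missing question asked where Atlantic Commons is located. A sketch of such a query, reusing the rounded linestring that the later exercises in this section pass to ST_GeomFromText:

```sql
-- Find the neighborhood polygon that the street segment intersects
SELECT name, boroname
FROM nyc_neighborhoods
WHERE ST_Intersects(
  geom,
  ST_GeomFromText('LINESTRING(586782 4504202,586864 4504216)', 26918)
);
```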
name | boroname
------------+----------
Fort Green | Brooklyn
SELECT name
FROM nyc_streets
WHERE ST_DWithin(
geom,
ST_GeomFromText('LINESTRING(586782 4504202,586864 4504216)', 26918),
name
------------------
Cumberland St
Atlantic Commons
• Approximately how many people live on (within 50 meters of) Atlantic Commons?
SELECT Sum(popn_total)
FROM nyc_census_blocks
WHERE ST_DWithin(
geom,
ST_GeomFromText('LINESTRING(586782 4504202,586864 4504216)', 26918),
50
);
1438
Spatial Joins
Spatial joins are the bread-and-butter of spatial databases. They allow you to combine information from
different tables by using spatial relationships as the join key. Much of what we think of as "standard GIS
analysis" can be expressed as spatial joins.
In the previous section, we explored spatial relationships using a two-step process: first we extracted
a subway station point for 'Broad St'; then, we used that point to ask further questions such as "what
neighborhood is the 'Broad St' station in?"
Using a spatial join, we can answer the question in one step, retrieving information about the subway
station and the neighborhood that contains it:
SELECT
subways.name AS subway_name,
neighborhoods.name AS neighborhood_name,
neighborhoods.boroname AS borough
FROM nyc_neighborhoods AS neighborhoods
JOIN nyc_subway_stations AS subways
ON ST_Contains(neighborhoods.geom, subways.geom)
WHERE subways.name = 'Broad St';
We could have joined every subway station to its containing neighborhood, but in this case we wanted
information about just one. Any function that provides a true/false relationship between two tables can be
used to drive a spatial join, but the most commonly used ones are: ST_Intersects, ST_Contains,
and ST_DWithin.
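To illustrate the first point, dropping the WHERE clause from the query above would join every subway station to its containing neighborhood. This is a minimal sketch that assumes only the tables and columns already used in this section:

```sql
-- Join every subway station to the neighborhood polygon that contains it.
SELECT
  subways.name AS subway_name,
  neighborhoods.name AS neighborhood_name,
  neighborhoods.boroname AS borough
FROM nyc_neighborhoods AS neighborhoods
JOIN nyc_subway_stations AS subways
  ON ST_Contains(neighborhoods.geom, subways.geom)
ORDER BY neighborhood_name, subway_name;
```

Because this is an inner join, any station that falls outside every neighborhood polygon simply drops out of the result.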
The combination of a JOIN with a GROUP BY provides the kind of analysis that is usually done in a
GIS system.
For example: „What is the population and racial make-up of the neighborhoods of Manhattan?“
Here we have a question that combines information about population from the census with the
boundaries of neighborhoods, restricted to just one borough: Manhattan.
SELECT
neighborhoods.name AS neighborhood_name,
Sum(census.popn_total) AS population,
100.0 * Sum(census.popn_white) / Sum(census.popn_total) AS white_pct,
100.0 * Sum(census.popn_black) / Sum(census.popn_total) AS black_pct
FROM nyc_neighborhoods AS neighborhoods
JOIN nyc_census_blocks AS census
ON ST_Intersects(neighborhoods.geom, census.geom)
WHERE neighborhoods.boroname = 'Manhattan'
GROUP BY neighborhoods.name
ORDER BY white_pct DESC;
What’s going on here? Notionally (the actual evaluation order is optimized under the covers by the
database) this is what happens:
1. The JOIN clause creates a virtual table that includes columns from both the neighborhoods and
census tables.
2. The WHERE clause filters our virtual table to just rows in Manhattan.
3. The remaining rows are grouped by the neighborhood name and fed through the aggregation
function to Sum() the population values.
4. After a little arithmetic and sorting (ORDER BY) on the final numbers, our query returns the
percentages.
Note: The JOIN clause combines two FROM items. By default, we are using an INNER JOIN,
but there are four other types of joins. For further information see the join_type definition in the
PostgreSQL documentation.
We can also use distance tests as a join key, to create summarized „all items within a radius“ queries.
Let’s explore the racial geography of New York using distance queries.
First, let’s get the baseline racial make-up of the city.
SELECT
100.0 * Sum(popn_white) / Sum(popn_total) AS white_pct,
100.0 * Sum(popn_black) / Sum(popn_total) AS black_pct,
Sum(popn_total) AS popn_total
FROM nyc_census_blocks;
So, of the 8M people in New York, about 44% are recorded as „white“ and 26% are recorded as „black“.
Duke Ellington once sang that „You / must take the A-train / To / go to Sugar Hill way up in Harlem.“
As we saw earlier, Harlem has far and away the highest African-American population in Manhattan
(80.5%). Is the same true of Duke’s A-train?
First, note that the routes field of the nyc_subway_stations table is what we are interested
in to find the A-train. The values in there are a little complex.
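The query that produced the listing below is not reproduced in this copy. A minimal sketch that would list the distinct values, assuming only the routes column named above:

```sql
-- One row per distinct combination of routes serving a station.
SELECT DISTINCT routes
FROM nyc_subway_stations;
```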
A,C,G
4,5
D,F,N,Q
5
E,F
E,J,Z
R,W
Note: The DISTINCT keyword eliminates duplicate rows from the result. Without the
DISTINCT keyword, the query above identifies 491 results instead of 73.
So to find the A-train, we will want any row in routes that has an ‚A‘ in it. We can do this a number of
ways, but today we will use the fact that strpos(routes,'A') will return a non-zero number only
if ‚A‘ is in the routes field.
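As a sketch of that filter (the original listing is missing here, so the exact form is an assumption), the strpos test can be placed directly in the WHERE clause:

```sql
-- strpos() returns 0 when the substring is absent, so "> 0" keeps only
-- route lists that mention the A-train.
SELECT DISTINCT routes
FROM nyc_subway_stations AS subways
WHERE strpos(subways.routes, 'A') > 0;
```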
A,B,C
A,C
A
A,C,G
A,C,E,L
A,S
A,C,F
A,B,C,D
A,C,E
Let’s summarize the racial make-up of the population within 200 meters of the A-train line.
SELECT
100.0 * Sum(popn_white) / Sum(popn_total) AS white_pct,
100.0 * Sum(popn_black) / Sum(popn_total) AS black_pct,
Sum(popn_total) AS popn_total
FROM nyc_census_blocks AS census
JOIN nyc_subway_stations AS subways
ON ST_DWithin(census.geom, subways.geom, 200)
WHERE strpos(subways.routes,'A') > 0;
So the racial make-up along the A-train isn’t radically different from the make-up of New York City as
a whole.
In the last section we saw that the A-train didn’t serve a population that differed much from the racial
make-up of the rest of the city. Are there any trains that have a non-average racial make-up?
To answer that question, we’ll add another join to our query, so that we can simultaneously calculate the
make-up of many subway lines at once. To do that, we’ll need to create a new table that enumerates all
the lines we want to summarize.
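The table definition is not shown in this copy. One possible sketch follows; the table name subway_lines and the selection of routes are assumptions for illustration, while the column name route matches the lines.route reference in the query that follows:

```sql
-- Hypothetical lookup table: one row per subway line to summarize.
CREATE TABLE subway_lines (
  route text
);

-- A handful of single-character route codes that appear in the routes column.
INSERT INTO subway_lines (route) VALUES
  ('A'), ('C'), ('E'), ('F'), ('G'),
  ('J'), ('R'), ('W'), ('Z'),
  ('4'), ('5'), ('6');
```

Single-character codes keep the later strpos() match unambiguous, since no code is a substring of another.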
Now we can join the table of subway lines onto our original query.
SELECT
lines.route,
100.0 * Sum(popn_white) / Sum(popn_total) AS white_pct,
As before, the joins create a virtual table of all the possible combinations available within the constraints
of the JOIN ON restrictions, and those rows are then fed into a GROUP summary. The spatial magic is
in the ST_DWithin function, that ensures only census blocks close to the appropriate subway stations
are included in the calculation.
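Since the listing above is cut off, here is a sketch of how the complete two-join query could look. It assumes a lookup table named subway_lines with a route column (a hypothetical name) and reuses the 200 meter radius from the A-train query:

```sql
SELECT
  lines.route,
  100.0 * Sum(popn_white) / Sum(popn_total) AS white_pct,
  100.0 * Sum(popn_black) / Sum(popn_total) AS black_pct,
  Sum(popn_total) AS popn_total
FROM nyc_census_blocks AS census
JOIN nyc_subway_stations AS subways
  ON ST_DWithin(census.geom, subways.geom, 200)   -- spatial join: blocks near stations
JOIN subway_lines AS lines                        -- hypothetical lookup table
  ON strpos(subways.routes, lines.route) > 0      -- attribute join: station serves line
GROUP BY lines.route
ORDER BY black_pct DESC;
```

A census block near two stations of the same line is counted once per station, so the totals describe station catchments rather than unique residents.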
ST_Contains(geometry A, geometry B): Returns true if and only if no points of B lie in the exterior of
A, and at least one point of the interior of B lies in the interior of A.
ST_DWithin(geometry A, geometry B, radius): Returns true if the geometries are within the specified
distance of one another.
ST_Intersects(geometry A, geometry B): Returns TRUE if the Geometries/Geography „spatially intersect“
- (share any portion of space) and FALSE if they don’t (they are Disjoint).
round(v numeric, s integer): PostgreSQL math function that rounds to s decimal places
strpos(string, substring): PostgreSQL string function that returns the integer position of a specified
substring, or zero if the substring is not found.
sum(expression): PostgreSQL aggregate function that returns the sum of the values in a set of records.
Here’s a reminder of some of the functions we have seen. Hint: they should be useful for the exercises!
• sum(expression): aggregate to return a sum for a set of records
• count(expression): aggregate to return the size of a set of records
• ST_Area(geometry) returns the area of the polygons
• ST_AsText(geometry) returns WKT text
• ST_Contains(geometry A, geometry B) returns true if geometry A contains geometry B
• ST_Distance(geometry A, geometry B) returns the minimum distance between geometry A and geometry B
• ST_DWithin(geometry A, geometry B, radius) returns true if geometry A is radius distance or less from geometry B
• ST_GeomFromText(text) returns geometry
• ST_Intersects(geometry A, geometry B) returns true if geometry A intersects geometry B
• ST_Length(linestring) returns the length of the linestring
• ST_Touches(geometry A, geometry B) returns true if the boundary of geometry A touches geometry B
• ST_Within(geometry A, geometry B) returns true if geometry A is within geometry B
Also remember the tables we have available:
• nyc_census_blocks
– name, popn_total, boroname, geom
• nyc_streets
14.1 Exercises
name | routes
-----------+--------
Spring St | 6
• What are all the neighborhoods served by the 6-train? (Hint: The routes column in the
nyc_subway_stations table has values like ‚B,D,6,V‘ and ‚C,6‘)
name | boroname
--------------------+-----------
Midtown | Manhattan
Hunts Point | The Bronx
Gramercy | Manhattan
Little Italy | Manhattan
Financial District | Manhattan
South Bronx | The Bronx
Yorkville | Manhattan
Murray Hill | Manhattan
Mott Haven | The Bronx
Upper East Side | Manhattan
Chinatown | Manhattan
East Harlem | Manhattan
Greenwich Village | Manhattan
Parkchester | The Bronx
Soundview | The Bronx
Note: We used the DISTINCT keyword to remove duplicate values from our result set
where there was more than one subway station in a neighborhood.
• After 9/11, the ‚Battery Park‘ neighborhood was off limits for several days. How many people
had to be evacuated?
SELECT Sum(popn_total)
FROM nyc_neighborhoods AS n
JOIN nyc_census_blocks AS c
ON ST_Intersects(n.geom, c.geom)
WHERE n.name = 'Battery Park';
17153
SELECT
n.name,
Sum(c.popn_total) / (ST_Area(n.geom) / 1000000.0) AS popn_per_sqkm
FROM nyc_census_blocks AS c
JOIN nyc_neighborhoods AS n
ON ST_Intersects(c.geom, n.geom)
GROUP BY n.name, n.geom
ORDER BY 2 DESC;
name | popn_per_sqkm
-----------------+------------------
Upper East Side | 48524.4877489857
Upper West Side | 40152.4896080024
Spatial Indexing
Recall that spatial indexing is one of the three key features of a spatial database. Indexes make using a
spatial database for large data sets possible. Without indexing, any search for a feature would require a
„sequential scan“ of every record in the database. Indexing speeds up searching by organizing the data
into a search tree which can be quickly traversed to find a particular record.
Spatial indices are one of the greatest assets of PostGIS. In the previous examples, building spatial joins
required comparing whole tables with each other. This can get very costly: joining two tables of 10,000
records each without indexes would require 100,000,000 comparisons; with indexes the cost could be as
low as 20,000 comparisons.
Our data load file already included spatial indexes for all the tables, so in order to demonstrate the
efficacy of indexes we will have to first remove them.
Let’s run a query on nyc_census_blocks without our spatial index.
Our first step is to remove the index.
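The statement itself is missing from this copy. A sketch of the step follows; the index name nyc_census_blocks_geom_idx is an assumption, so the first query looks up the real name in the standard pg_indexes view:

```sql
-- List the indexes currently defined on the table.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'nyc_census_blocks';

-- Drop the spatial index. The name below is an assumption:
-- substitute whichever GIST index the query above reports.
DROP INDEX nyc_census_blocks_geom_idx;
```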
Note: The DROP INDEX statement drops an existing index from the database system. For more
information, see the PostgreSQL documentation.
Now, watch the „Timing“ meter at the lower right-hand corner of the pgAdmin query window and run
the following. Our query searches through every single census block in order to identify blocks that
contain subway stops whose names start with „B“.
SELECT count(blocks.blkid)
FROM nyc_census_blocks blocks
JOIN nyc_subway_stations subways
ON ST_Contains(blocks.geom, subways.geom)
WHERE subways.name LIKE 'B%';
count
---------------
46
The nyc_census_blocks table is very small (only a few thousand records) so even without an index,
the query only takes 300 ms on my test computer.
Now add the spatial index back in and run the query again.
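The statement is missing from this copy. A sketch of recreating the index, assuming the geometry column is called geom (as in the queries above) and using an illustrative index name:

```sql
-- Rebuild the spatial index on the geometry column.
-- The index name is an assumption; any unused name works.
CREATE INDEX nyc_census_blocks_geom_idx
  ON nyc_census_blocks
  USING GIST (geom);
```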
Note: The USING GIST clause tells PostgreSQL to use the Generalized Search Tree (GiST) index
structure when building the index. If you receive an error that looks like ERROR: index row requires
11340 bytes, maximum size is 8191 when creating your index, you have likely neglected
to add the USING GIST clause.
On my test computer the time drops to 50 ms. The larger your table, the larger the relative speed
improvement of an indexed query will be.
Standard database indexes create a hierarchical tree based on the values of the column being indexed.
Spatial indexes are a little different – they are unable to index the geometric features themselves and
instead index the bounding boxes of the features.
In the figure above, the number of lines that intersect the yellow star is one, the red line. But the number
of features whose bounding boxes intersect the yellow box is two, the red and blue ones.
The way the database efficiently answers the question „what lines intersect the yellow star“ is to first
answer the question „what boxes intersect the yellow box“ using the index (which is very fast) and then
do an exact calculation of „what lines intersect the yellow star“ only for those features returned by the
first test.
For a large table, this „two pass“ system of evaluating the approximate index first, then carrying out an
exact test can radically reduce the number of calculations necessary to answer a query.
Both PostGIS and Oracle Spatial share the same „R-Tree“ [1] spatial index structure. R-Trees break up
data into rectangles, and sub-rectangles, and sub-sub rectangles, etc. It is a self-tuning index structure
that automatically handles variable data density, differing amounts of object overlap, and object size.
[1] http://postgis.net/docs/support/rtree.pdf
Only a subset of functions will automatically make use of a spatial index, if one is available.
• ST_Intersects
• ST_Contains
• ST_Within
• ST_DWithin
• ST_ContainsProperly
• ST_CoveredBy
• ST_Covers
• ST_Overlaps
• ST_Crosses
• ST_DFullyWithin
• ST_3DIntersects
• ST_3DDWithin
• ST_3DDFullyWithin
• ST_LineCrossingDirection
• ST_OrderingEquals
• ST_Equals
The first four are the ones most commonly used in queries, and ST_DWithin is very important for doing
„within a distance“ or „within a radius“ style queries while still getting a performance boost from the
index.
In order to add index acceleration to other functions that are not in this list (most commonly, ST_Relate)
add an index-only clause as described below.
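As a sketch of that idea, the bounding-box operator && can be placed in front of a function that does not use the index on its own, so the index narrows the candidates before the exact test runs. The DE-9IM pattern 'T********' (interiors intersect) is only an illustrative choice:

```sql
SELECT count(*)
FROM nyc_neighborhoods AS n
JOIN nyc_census_blocks AS b
  ON n.geom && b.geom                         -- index-assisted bounding-box filter
 AND ST_Relate(n.geom, b.geom, 'T********')   -- exact test on the remaining candidates
WHERE n.name = 'West Village';
```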
SELECT Sum(popn_total)
FROM nyc_neighborhoods neighborhoods
JOIN nyc_census_blocks blocks
49821
Now let’s do the same query using the more exact ST_Intersects function.
SELECT Sum(popn_total)
FROM nyc_neighborhoods neighborhoods
JOIN nyc_census_blocks blocks
ON ST_Intersects(neighborhoods.geom, blocks.geom)
WHERE neighborhoods.name = 'West Village';
26718
A much lower answer! The first query summed up every block whose bounding box intersects the
neighborhood’s bounding box; the second query only summed up those blocks that intersect the neighborhood
itself.
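The first of the two queries is cut off in this copy. A sketch of what a bounding-box-only join looks like, assuming the && operator listed in the function summary for this section:

```sql
-- Bounding-box-only join: fast, but counts every block whose box
-- overlaps the neighborhood's box, not just true intersections.
SELECT Sum(popn_total)
FROM nyc_neighborhoods neighborhoods
JOIN nyc_census_blocks blocks
  ON neighborhoods.geom && blocks.geom
WHERE neighborhoods.name = 'West Village';
```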
15.4 Analyzing
The PostgreSQL query planner intelligently chooses when to use or not to use indexes to evaluate a
query. Counter-intuitively, it is not always faster to do an index search: if the search is going to return
every record in the table, traversing the index tree to get each record will actually be slower than just
sequentially reading the whole table from the start.
Knowing the size of the query rectangle is not enough to pin down whether a query will return a large
number or small number of records. Below, the red square is small, but will return many more records
than the blue square.
In order to figure out what situation it is dealing with (reading a small part of the table versus reading
a large portion of the table), PostgreSQL keeps statistics about the distribution of data in each indexed
table column. By default, PostgreSQL gathers statistics on a regular basis. However, if you dramatically
change the contents of your table within a short period of time, the statistics will not be up-to-date.
To ensure the statistics match your table contents, it is wise to run the ANALYZE command after bulk
data loads and deletes in your tables. This forces the statistics system to gather data for all your indexed
columns.
The ANALYZE command asks PostgreSQL to traverse the table and update its internal statistics used for
query plan estimation (query plan analysis will be discussed later).
ANALYZE nyc_census_blocks;
15.5 Vacuuming
It’s worth stressing that just creating an index is not enough to allow PostgreSQL to use it effectively.
VACUUMing must be performed whenever a large number of UPDATEs, INSERTs or DELETEs are
issued against a table. The VACUUM command asks PostgreSQL to reclaim any unused space in the table
pages left by updates or deletes to records.
Vacuuming is so critical for the efficient running of the database that PostgreSQL provides an
„autovacuum“ facility by default.
Autovacuum both vacuums (recovers space) and analyzes (updates statistics) your tables at sensible
intervals determined by the level of activity. While this is essential for highly transactional databases,
it is not advisable to wait for an autovacuum run after adding indices or bulk-loading data. Whenever a
large batch update is performed, you should manually run VACUUM.
Vacuuming and analyzing the database can be performed separately as needed. Issuing a VACUUM
command will not update the database statistics; likewise issuing an ANALYZE command will not recover
unused table rows. Both commands can be run against the entire database, a single table, or a single
column.
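As a short sketch of the options described above, using the census blocks table from this section:

```sql
VACUUM nyc_census_blocks;           -- reclaim unused space only
ANALYZE nyc_census_blocks;          -- update planner statistics only
VACUUM ANALYZE nyc_census_blocks;   -- do both in one command
```

VACUUM cannot run inside a transaction block, so issue it as a standalone statement.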
geometry_a && geometry_b: Returns TRUE if A’s bounding box overlaps B’s.
geometry_a = geometry_b: Returns TRUE if A’s bounding box is the same as B’s.
ST_Intersects(geometry_a, geometry_b): Returns TRUE if the Geometries/Geography „spatially intersect“
- (share any portion of space) and FALSE if they don’t (they are Disjoint).
Projecting Data
The earth is not flat, and there is no simple way of putting it down on a flat paper map (or computer
screen), so people have come up with all sorts of ingenious solutions, each with pros and cons. Some
projections preserve area, so all objects keep their correct size relative to each other; other projections preserve
angles (conformal) like the Mercator projection; some projections try to find a good intermediate mix
with only little distortion on several parameters. Common to all projections is that they transform the
(spherical) world onto a flat Cartesian coordinate system, and which projection to choose depends on
how you will be using the data.
We’ve already encountered projections when we loaded our nyc data. (Recall that pesky SRID 26918).
Sometimes, however, you need to transform and re-project between spatial reference systems. PostGIS
includes built-in support for changing the projection of data, using the ST_Transform(geometry,
srid) function. For managing the spatial reference identifiers on geometries, PostGIS provides the
ST_SRID(geometry) and ST_SetSRID(geometry, srid) functions.
We can confirm the SRID of our data with the ST_SRID function:
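The query is missing from this copy. A minimal sketch, assuming any of the nyc_ tables will do since they were loaded with the same SRID:

```sql
-- Read the SRID stored on one geometry of the table.
SELECT ST_SRID(geom)
FROM nyc_streets
LIMIT 1;
```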
26918
And what is the definition of „26918“? As we saw in the „loading data“ section, the definition is contained in the
spatial_ref_sys table. In fact, two definitions are there. The „well-known text“ (WKT) definition
is in the srtext column, and there is a second definition in „proj.4“ format in the proj4text column.
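To see both definitions side by side, a sketch using only the column names mentioned above:

```sql
SELECT srtext, proj4text
FROM spatial_ref_sys
WHERE srid = 26918;
```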
The PostGIS reprojection engine will attempt to find the best projection from the spatial_ref_sys
table:
• auth_name / auth_srid If proj can find a valid „authority name“ and „authority srid“ in its internal
catalogue, it will use that to generate a projection definition.
• srtext If proj can parse and form a definition object from the srtext it will use that.
• proj4text Finally, proj will attempt to process the proj4text.
All this redundancy means that all you need to create a new projection in PostGIS is either a valid
srtext string or proj4text string. All the common authority name/code pairs are already loaded in
the table by default.
If you have a choice when creating a custom projection, fill out the srtext column, since that column
is also used by external programs like GeoServer, QGIS, FME, and others.
Taken together, a coordinate and an SRID define a location on the globe. Without an SRID, a coordinate
is just an abstract notion. A „Cartesian“ coordinate plane is defined as a „flat“ coordinate system placed
on the surface of Earth. Because PostGIS functions work on such a plane, comparison operations require
that both geometries be represented in the same SRID.
If you feed in geometries with differing SRIDs you will just get an error:
SELECT ST_Equals(
ST_GeomFromText('POINT(0 0)', 4326),
ST_GeomFromText('POINT(0 0)', 26918)
);
Note: Be careful of getting too happy with using ST_Transform for on-the-fly conversion.
Spatial indexes are built using the SRID of the stored geometries. If comparisons are done in a different
SRID, spatial indexes are (often) not used. It is best practice to choose one SRID for all the tables in
your database. Only use the transformation function when you are reading or writing data to external
applications.
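One way to follow that advice, sketched here with an arbitrary illustrative longitude/latitude point in lower Manhattan, is to reproject the search value into the table's SRID rather than reprojecting the stored column:

```sql
-- Index-friendly: the stored column is untouched; only the single
-- search point is transformed from 4326 into the table's SRID (26918).
SELECT name
FROM nyc_subway_stations
WHERE ST_DWithin(
  geom,
  ST_Transform('SRID=4326;POINT(-74.0107 40.7071)'::geometry, 26918),
  100   -- meters, because SRID 26918 uses meters
);
```

Wrapping the column instead, as in ST_Transform(geom, 4326), would force every stored geometry to be transformed and would typically bypass the index on geom.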
If we return to our proj4 definition for SRID 26918, we can see that our working projection is UTM
(Universal Transverse Mercator) of zone 18, with meters as the unit of measurement.
Let’s convert some data from our working projection to geographic coordinates – also known as
„longitude/latitude“.
To convert data from one SRID to another, you must first verify that your geometry has a valid SRID.
Since we have already confirmed a valid SRID, we next need the SRID of the projection to transform
into. In other words, what is the SRID of geographic coordinates?
The most common SRID for geographic coordinates is 4326, which corresponds to „longitude/latitude
on the WGS84 spheroid“. You can see the definition here:
https://epsg.io/4326
You can also pull the definitions from the spatial_ref_sys table:
SELECT srtext FROM spatial_ref_sys WHERE srid = 4326;
GEOGCS["WGS 84",
DATUM["WGS_1984",
SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],
AUTHORITY["EPSG","6326"]],
PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],
UNIT["degree",0.01745329251994328,AUTHORITY["EPSG","9122"]],
AUTHORITY["EPSG","4326"]]
Let’s convert the coordinates of the ‚Broad St‘ subway station into geographics:
SELECT ST_AsText(ST_Transform(geom,4326))
FROM nyc_subway_stations
WHERE name = 'Broad St';
POINT(-74.01067146887341 40.70710481558761)
If you load data or create a new geometry without specifying an SRID, the SRID value will be 0. Recall
from the Geometries section that when we created our geometries table we didn’t specify an SRID. If we query
our database, we should expect all the nyc_ tables to have an SRID of 26918, while the geometries
table defaulted to an SRID of 0.
To view a table’s SRID assignment, query the database’s geometry_columns table.
SELECT f_table_name AS name, srid
FROM geometry_columns;
name | srid
---------------------+-------
nyc_census_blocks | 26918
nyc_homicides | 26918
nyc_neighborhoods | 26918
nyc_streets | 26918
However, if you know what the SRID of the coordinates is supposed to be, you can set it post-facto, using
ST_SetSRID on the geometry. Then you will be able to transform the geometry into other systems.
SELECT ST_AsText(
ST_Transform(
ST_SetSRID(geom,26918),
4326)
)
FROM geometries;
ST_AsText: Returns the Well-Known Text (WKT) representation of the geometry/geography without
SRID metadata.
ST_SetSRID(geometry, srid): Sets the SRID on a geometry to a particular integer value.
ST_SRID(geometry): Returns the spatial reference identifier for the ST_Geometry as defined in the
spatial_ref_sys table.
ST_Transform(geometry, srid): Returns a new geometry with its coordinates transformed to the SRID
referenced by the integer parameter.
Projection Exercises
Here’s a reminder of some of the functions we have seen. Hint: they should be useful for the exercises!
• sum(expression) aggregate to return a sum for a set of records
• ST_Length(linestring) returns the length of the linestring
• ST_SRID(geometry) returns the SRID of the geometry
• ST_Transform(geometry, srid) converts geometries into different spatial reference systems
• ST_GeomFromText(text) returns geometry
• ST_AsText(geometry) returns WKT text
• ST_AsGML(geometry) returns GML text
Remember the online resources that are available to you:
• https://epsg.io/
Also remember the tables we have available:
• nyc_census_blocks
– name, popn_total, boroname, geom
• nyc_streets
– name, type, geom
• nyc_subway_stations
– name, geom
• nyc_neighborhoods
– name, boroname, geom
17.1 Exercises
• What is the length of all streets in New York, as measured in UTM 18?
SELECT Sum(ST_Length(geom))
FROM nyc_streets;
10418904.7172
• What is the length of all streets in New York, as measured in SRID 2831?
SELECT Sum(ST_Length(ST_Transform(geom,2831)))
FROM nyc_streets;
10421993.706374
Note: The difference between the UTM 18 and the State Plane Long Island measurements
is (10421993 - 10418904)/10418904, or about 0.03%. Calculated on the spheroid using Geography the
total street length is 10421999, which is closer to the State Plane value. This is not surprising,
since the State Plane Long Island projection is precisely calibrated for a very small area (New
York City) while UTM 18 has to provide reasonable results for a large regional area.
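The spheroidal figure quoted in the note can be computed along these lines. This is a sketch: the original query is not shown, so the exact form is an assumption.

```sql
-- Reproject to longitude/latitude, cast to geography, and let
-- ST_Length measure each street on the spheroid (result in meters).
SELECT Sum(ST_Length(ST_Transform(geom, 4326)::geography))
FROM nyc_streets;
```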
SELECT Count(*)
FROM nyc_streets
WHERE ST_Intersects(
ST_Transform(geom, 4326),
'SRID=4326;LINESTRING(-74 20, -74 60)'
);
223
The „74th meridian“ is a fancy way of saying „a vertical line in geographics where the X value is
-74“. We can construct such a line and then compare it to the streets, projected into geographics
also. Projecting the line into UTM and comparing it there will return a slightly different answer.
To get the same answer, you need to „segmentize“ it, so it has more points, before transforming.
SELECT Count(*)
FROM nyc_streets
WHERE ST_Intersects(
geom,
ST_Transform(ST_Segmentize('SRID=4326;LINESTRING(-74 20, -74 60)'::geometry,0.001), 26918)
);
Geography
It is very common to have data in which the coordinates are „geographics“ or „latitude/longitude“.
Unlike coordinates in Mercator, UTM, or Stateplane, geographic coordinates are not Cartesian
coordinates. Geographic coordinates do not represent a linear distance from an origin as plotted on a plane.
Rather, these spherical coordinates describe angular coordinates on a globe. In spherical coordinates a
point is specified by the angle of rotation from a reference meridian (longitude), and the angle from the
equator (latitude).
You can treat geographic coordinates as approximate Cartesian coordinates and continue to do spatial
calculations. However, measurements of distance, length and area will be nonsensical. Since spherical
coordinates measure angular distance, the units are in „degrees.“ Further, the approximate results from
indexes and true/false tests like intersects and contains can become terribly wrong. The errors
get larger as problem areas like the poles or the international dateline are approached.
For example, here are the coordinates of Los Angeles and Paris.
SELECT ST_Distance(
'SRID=4326;POINT(-118.4079 33.9434)'::geometry, -- Los Angeles (LAX)
'SRID=4326;POINT(2.5559 49.0083)'::geometry -- Paris (CDG)
);
121.898285970107
Note: Different spatial databases have different approaches for „handling geographics“:
• Oracle attempts to paper over the differences by transparently doing geographic calculations when
the SRID is geographic.
• SQL Server uses two spatial types, „STGeometry“ for Cartesian data and „STGeography“ for
geographics.
• Informix Spatial is a pure Cartesian extension to Informix, while Informix Geodetic is a pure
geographic extension.
• Similar to SQL Server, PostGIS uses two types, „geometry“ and „geography“.
Using the geography instead of geometry type, let’s try again to measure the distance between Los
Angeles and Paris.
SELECT ST_Distance(
'SRID=4326;POINT(-118.4079 33.9434)'::geography, -- Los Angeles (LAX)
'SRID=4326;POINT(2.5559 49.0083)'::geography -- Paris (CDG)
);
9124665.27317673
A big number! All return values from geography calculations are in meters, so our answer is 9125km.
Older versions of PostGIS supported very basic calculations over the sphere using the
ST_Distance_Spheroid(point, point, measurement) function. However,
Working with geographic coordinates on a Cartesian plane (the purple line) yields a very wrong answer
indeed! Using great circle routes (the red lines) gives the right answer. If we convert our LAX-CDG
flight into a line string and calculate the distance to a point in Iceland using geography we’ll get the
right answer (recall, in meters).
SELECT ST_Distance(
ST_GeographyFromText('LINESTRING(-118.4079 33.9434, 2.5559 49.0083)'), -- LAX-CDG
ST_GeographyFromText('POINT(-22.6056 63.9850)') -- Iceland (KEF)
);
502454.906643729
So the closest approach to Iceland (as measured from its international airport) on the LAX-CDG route
is a relatively small 502km.
The Cartesian approach to handling geographic coordinates breaks down entirely for features that cross
the international dateline. The shortest great-circle route from Los Angeles to Tokyo crosses the Pacific
Ocean. The shortest Cartesian route crosses the Atlantic and Indian Oceans.
SELECT ST_Distance(
ST_GeometryFromText('Point(-118.4079 33.9434)'), -- LAX
ST_GeometryFromText('Point(139.733 35.567)')) -- NRT (Tokyo/Narita)
AS geometry_distance,
ST_Distance(
ST_GeographyFromText('Point(-118.4079 33.9434)'), -- LAX
ST_GeographyFromText('Point(139.733 35.567)')) -- NRT (Tokyo/Narita)
AS geography_distance;
geometry_distance | geography_distance
-------------------+--------------------
258.146005837336 | 8833954.76996256
In order to load geometry data into a geography table, the geometry first needs to be
projected into EPSG:4326 (longitude/latitude), then it needs to be changed into geography.
The ST_Transform(geometry,srid) function converts coordinates to geographics and the
Geography(geometry) function or the ::geography suffix „casts“ to geography.
CREATE TABLE nyc_subway_stations_geog AS
SELECT
ST_Transform(geom,4326)::geography AS geog,
name,
routes
FROM nyc_subway_stations;
Building a spatial index on a geography table is exactly the same as for geometry:
CREATE INDEX nyc_subway_stations_geog_gix
ON nyc_subway_stations_geog USING GIST (geog);
The difference is under the covers: the geography index will correctly handle queries that cover the poles
or the international date-line, while the geometry one will not.
Here’s a query to find all the subway stations within 500 meters of the Empire State Building.
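The query is missing from this copy. A sketch of how it could be written against the nyc_subway_stations_geog table created above; the building's coordinates are approximate and are an assumption:

```sql
WITH empire_state_building AS (
  -- Approximate longitude/latitude of the building (illustrative value).
  SELECT 'POINT(-73.9857 40.7484)'::geography AS geog
)
SELECT
  ss.name,
  ST_Distance(esb.geog, ss.geog) AS distance_m   -- meters on the spheroid
FROM nyc_subway_stations_geog AS ss
JOIN empire_state_building AS esb
  ON ST_DWithin(ss.geog, esb.geog, 500)          -- radius in meters
ORDER BY distance_m;
```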
There are only a small number of native functions for the geography type:
• ST_AsText(geography) returns text
• ST_GeographyFromText(text) returns geography
• ST_AsBinary(geography) returns bytea
• ST_GeogFromWKB(bytea) returns geography
• ST_AsSVG(geography) returns text
• ST_AsGML(geography) returns text
• ST_AsKML(geography) returns text
• ST_AsGeoJson(geography) returns text
• ST_Distance(geography, geography) returns double
• ST_DWithin(geography, geography, float8) returns boolean
• ST_Area(geography) returns double
• ST_Length(geography) returns double
• ST_Covers(geography, geography) returns boolean
• ST_CoveredBy(geography, geography) returns boolean
• ST_Intersects(geography, geography) returns boolean
• ST_Buffer(geography, float8) returns geography [1]
• ST_Intersection(geography, geography) returns geography [1]
[1] The buffer and intersection functions are actually wrappers on top of a cast to geometry, and are not carried out natively
in spherical coordinates. As a result, they may fail to return correct results for objects with very large extents that cannot be
cleanly converted to a planar representation.
For example, the ST_Buffer(geography,distance) function transforms the geography object into a „best“
projection, buffers it, and then transforms it back to geographics. If there is no „best“ projection (the object is too large), the operation
can fail or return a malformed buffer.
The SQL for creating a new table with a geography column is much like that for creating a geometry
table. However, geography includes the ability to specify the object type directly at the time of table
creation. For example:
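The table definition is missing from this copy. The sketch below is inferred from the surrounding text and the result table later in this section: the table name airports and the code column are inferred, the LAX and CDG coordinates come from the earlier distance queries, the KEF longitude comes from the result table, and the KEF latitude is an assumption.

```sql
-- A geography column typed as Point at table-creation time.
CREATE TABLE airports (
  code VARCHAR(3),
  geog GEOGRAPHY(Point)
);

INSERT INTO airports VALUES ('LAX', 'POINT(-118.4079 33.9434)');
INSERT INTO airports VALUES ('CDG', 'POINT(2.5559 49.0083)');
INSERT INTO airports VALUES ('KEF', 'POINT(-21.8628 64.1286)');  -- latitude assumed
```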
In the table definition, the GEOGRAPHY(Point) specifies our airport data type as points. The new
geography fields don’t get registered in the geometry_columns view. Instead, they are registered in
a view called geography_columns.
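A quick way to check the registration, sketched with the standard columns of the geography_columns view:

```sql
SELECT f_table_name, f_geography_column, srid, type
FROM geography_columns;
```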
While the basic functions for geography types can handle many use cases, there are times when you
might need access to other functions only supported by the geometry type. Fortunately, you can convert
objects back and forth from geography to geometry.
The PostgreSQL syntax convention for casting is to append ::typename to the end of the value
you wish to cast. So, 2::text will convert a numeric two to a text string ‚2‘. And 'POINT(0
0)'::geometry will convert the text representation of a point into a geometry point.
The ST_X(point) function only supports the geometry type. How can we read the X coordinate from
our geographies?
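The query is missing from this copy. A sketch using the cast syntax just described, assuming a table named airports with code and geog columns (names inferred from the result below):

```sql
-- Cast geography to geometry so that ST_X can read the coordinate.
SELECT code, ST_X(geog::geometry) AS longitude
FROM airports;
```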
code | longitude
------+-----------
LAX | -118.4079
CDG | 2.5559
KEF | -21.8628
By appending ::geometry to our geography value, we convert the object to a geometry with an SRID
of 4326. From there we can use as many geometry functions as strike our fancy. But, remember – now
that our object is a geometry, the coordinates will be interpreted as Cartesian coordinates, not spherical
ones.
Geographics are universally accepted coordinates – everyone understands what latitude/longitude mean,
but very few people understand what UTM coordinates mean. Why not use geography all the time?
• First, as noted earlier, there are far fewer functions available (right now) that directly support the
geography type. You may spend a lot of time working around geography type limitations.
• Second, the calculations on a sphere are computationally far more expensive than Cartesian cal-
culations. For example, the Cartesian formula for distance (Pythagoras) involves one call to sqrt().
The spherical formula for distance (Haversine) involves two sqrt() calls, an arctan() call, four
sin() calls and two cos() calls. Trigonometric functions are very costly, and spherical calculations
involve a lot of them.
The conclusion?
If your data is geographically compact (contained within a state, county or city), use the geometry
type with a Cartesian projection that makes sense with your data. See the http://epsg.io site and type
in the name of your region for a selection of possible reference systems.
If you need to measure distance with a dataset that is geographically dispersed (covering much of
the world), use the geography type. The application complexity you save by working in geography
will offset any performance issues. And casting to geometry can offset most functionality limitations.
ST_Distance(geometry, geometry): For the geometry type, returns the 2-dimensional Cartesian minimum
distance (based on spatial ref) between two geometries in projected units. For the geography type, defaults
to returning the spheroidal minimum distance between two geographies in meters.
ST_GeographyFromText(text): Returns a geography value from a Well-Known Text (WKT) or extended
WKT (EWKT) representation.
ST_Transform(geometry, srid): Returns a new geometry with its coordinates transformed to the SRID
referenced by the integer parameter.
ST_X(point): Returns the X coordinate of the point, or NULL if not available. Input must be a point.
ST_Azimuth(geography_A, geography_B): Returns the direction from A to B in radians.
ST_DWithin(geography_A, geography_B, R): Returns true if A is within R meters of B.
Geography Exercises
Here’s a reminder of all the functions we have seen so far. They should be useful for the exercises!
• Sum(number) adds up all the numbers in the result set
• ST_GeogFromText(text) returns a geography
• ST_Distance(geography, geography) returns the distance between geographies
• ST_Transform(geometry, srid) returns geometry, in the new projection
• ST_Length(geography) returns the length of the line
• ST_Intersects(geometry, geometry) returns true if the objects are not disjoint in pla-
nar space
• ST_Intersects(geography, geography) returns true if the objects are not disjoint in
spheroidal space
Also remember the tables we have available:
• nyc_streets
– name, type, geom
• nyc_neighborhoods
– name, boroname, geom
19.1 Exercises
• How far is New York from Seattle? What are the units of the answer?
SELECT ST_Distance(
'POINT(-74.0064 40.7142)'::geography,
'POINT(-122.3331 47.6097)'::geography
);
3875538.57141352
• What is the total length of all streets in New York, calculated on the spheroid?
SELECT Sum(
ST_Length(Geography(
ST_Transform(geom,4326)
)))
FROM nyc_streets;
10421999.666
Note: The length calculated in the planar "UTM Zone 18" projection is 10418904.717,
about 0.03% different. UTM is good at preserving area and distance, within the zone boundaries.
SELECT ST_Intersects(
'POINT(1 2.0001)'::geography,
'POLYGON((0 0,0 2,2 2,2 0,0 0))'::geography
);
SELECT ST_Intersects(
'POINT(1 2.0001)'::geometry,
'POLYGON((0 0,0 2,2 2,2 0,0 0))'::geometry
);
Note: The upper edge of the square is a straight line in geometry, and passes below the
point, so the square does not contain the point. The upper edge of the square is a great circle in
geography, and passes above the point, so the square does contain the point.
All the functions we have seen so far work with geometries "as they are" and return
• analyses of the objects (ST_Length(geometry), ST_Area(geometry)),
• serializations of the objects (ST_AsText(geometry), ST_AsGML(geometry)),
• parts of the object (ST_RingN(geometry,n)) or
• true/false tests (ST_Contains(geometry,geometry), ST_Intersects(geometry,
geometry)).
„Geometry constructing functions“ take geometries as inputs and output new shapes.
A common need when composing a spatial query is to replace a polygon feature with a point representa-
tion of the feature. This is useful for spatial joins (as discussed in Polygon/Polygon Joins) because using
ST_Intersects(geometry,geometry) on two polygon layers often results in double-counting:
a polygon on a boundary will intersect an object on both sides; replacing it with a point forces it to be
on one side or the other, not both.
• ST_Centroid(geometry) returns a point that is approximately on the center of mass of the
input argument. This simple calculation is very fast, but sometimes not desirable, because the
returned point is not necessarily in the feature itself. If the input feature has a concavity (imagine
the letter 'C') the returned centroid might not be in the interior of the feature.
• ST_PointOnSurface(geometry) returns a point that is guaranteed to be inside the input
argument. This makes it more useful for computing „proxy points“ for spatial joins.
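The difference is easiest to see on a concave shape. The following is a minimal sketch using a hand-drawn C-shaped polygon (the coordinates are illustrative, not workshop data) whose center of mass falls in the notch of the C:

```sql
-- A C-shaped polygon: the centroid lands in the open notch,
-- while ST_PointOnSurface always picks a point inside the shape.
SELECT
  ST_Intersects(geom, ST_Centroid(geom))       AS centroid_inside,
  ST_Intersects(geom, ST_PointOnSurface(geom)) AS pos_inside
FROM (VALUES
  ('POLYGON((0 0, 30 0, 30 10, 10 10, 10 40, 30 40, 30 50, 0 50, 0 0))'::geometry)
) AS t(geom);
```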
centroid_inside | pos_inside
-----------------+------------
f | t
20.2 ST_Buffer
The buffering operation is common in GIS workflows, and is also available in PostGIS.
ST_Buffer(geometry,distance) takes in a geometry and a buffer distance and outputs a
polygon with a boundary the buffer distance away from the input geometry.
For example, if the US Park Service wanted to enforce a marine traffic zone around Liberty Island, they
might build a 500 meter buffer polygon around the island. Liberty Island is a single census block in our
nyc_census_blocks table, so we can easily extract and buffer it.
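A sketch of how that extraction and buffering might look follows. The output table name is hypothetical, and the blkid value is an assumption about which block is Liberty Island; confirm it against your own copy of the data.

```sql
-- Assumptions: "liberty_island_zone" is an illustrative table name, and
-- the blkid below is taken to be the Liberty Island census block.
-- The distance is in meters because the data is in UTM 18 (EPSG:26918).
CREATE TABLE liberty_island_zone AS
SELECT ST_Buffer(geom, 500) AS geom
FROM nyc_census_blocks
WHERE blkid = '360610001001001';
```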
The ST_Buffer function also accepts negative distances and builds inscribed polygons within poly-
gonal inputs. For lines and points you will just get an empty return.
20.3 ST_Intersection
Another classic GIS operation – the „overlay“ – creates a new coverage by calculating the intersection
of two superimposed polygons. The resultant has the property that any polygon in either of the parents
can be built by merging polygons in the resultant.
The ST_Intersection(geometry A, geometry B) function returns the spatial area (or line,
or point) that both arguments have in common. If the arguments are disjoint, the function returns an
empty geometry.
SELECT ST_AsText(ST_Intersection(
ST_Buffer('POINT(0 0)', 2),
ST_Buffer('POINT(3 0)', 2)
));
20.4 ST_Union
In the previous example we intersected geometries, creating a new geometry that had lines from both
the inputs. The ST_Union function does the reverse; it takes inputs and removes common lines. There
are two forms of the ST_Union function:
• ST_Union(geometry, geometry): A two-argument version that takes in two geometries
and returns the merged union. For example, our two-circle example from the previous section
looks like this when you replace the intersection with a union.
SELECT ST_AsText(ST_Union(
ST_Buffer('POINT(0 0)', 2),
ST_Buffer('POINT(3 0)', 2)
));
• ST_Union([geometry]): An aggregate version that takes in a set of geometries and returns
the merged geometry for the entire group. Like other aggregates, it can be combined with
GROUP BY to merge subsets of rows that share a key.
So, we can create a county map by merging all geometries that share the same first 5 digits of their
blkid. Be patient; this is computationally expensive and can take a minute or two.
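A minimal sketch of that merge is shown here. The output table name nyc_census_counties and the column name countyid are illustrative choices; the grouping key is simply the first five characters of blkid.

```sql
-- Aggregate union of all blocks sharing the same 5-character prefix.
-- Table and column names for the output are assumptions.
CREATE TABLE nyc_census_counties AS
SELECT
  ST_Union(geom)      AS geom,
  SubStr(blkid, 1, 5) AS countyid
FROM nyc_census_blocks
GROUP BY countyid;
```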
An area test can confirm that our union operation did not lose any geometry. First, we calculate the area
of each individual census block, and sum those areas grouping by census county id.
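The per-block summary could be written along these lines, grouping on the same five-character blkid prefix used to define a county:

```sql
-- Sum of individual block areas per county prefix.
SELECT
  SubStr(blkid, 1, 5) AS countyid,
  Sum(ST_Area(geom))  AS area
FROM nyc_census_blocks
GROUP BY countyid
ORDER BY countyid;
```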
countyid | area
----------+------------------
36005 | 110196022.906506
36047 | 181927497.678368
36061 | 59091860.6261323
36081 | 283194473.613692
36085 | 150758328.111199
Then we calculate the area of each of our new county polygons from the county table:
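Assuming the merged county polygons were stored in a table named nyc_census_counties with a countyid column (hypothetical names; use whatever you chose when building the table), the matching query is a plain area lookup:

```sql
-- Area of each merged county polygon.
-- "nyc_census_counties" and "countyid" are assumed names.
SELECT countyid, ST_Area(geom) AS area
FROM nyc_census_counties
ORDER BY countyid;
```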
countyid | area
----------+------------------
36005 | 110196022.906507
36047 | 181927497.678367
36061 | 59091860.6261324
36081 | 283194473.593646
36085 | 150758328.111199
The same answer, to within numerical noise (the largest difference, for county 36081, is about 0.02 square meters). We have successfully built an NYC county table from our census blocks data.
ST_Centroid(geometry): Returns a point geometry that represents the center of mass of the input geo-
metry.
ST_PointOnSurface(geometry): Returns a point geometry that is guaranteed to be in the interior of the
input geometry.
ST_Buffer(geometry, distance): For geometry: Returns a geometry that represents all points whose di-
stance from this Geometry is less than or equal to distance. Calculations are in the Spatial Reference
System of this Geometry. For geography: Uses a planar transform wrapper.
ST_Intersection(geometry A, geometry B): Returns a geometry that represents the shared portion of
geomA and geomB. The geography implementation does a transform to geometry to do the intersection
and then transform back to WGS84.
ST_Union(): Returns a geometry that represents the point set union of the Geometries.
ST_AsText(geometry): Returns the Well-Known Text (WKT) representation of the geometry/geography
without SRID metadata.
substring(string [from int] [for int]): PostgreSQL string function to extract a substring by start
position and length.
sum(expression): PostgreSQL aggregate function that returns the sum of the expression across a set of records.
Here’s a reminder of some of the functions we have seen. Hint: they should be useful for the exercises!
• sum(expression) aggregate to return a sum for a set of records
• ST_Area(geometry) returns the area of the geometry
• ST_Centroid(geometry) returns the geometry centroid
• ST_Transform(geometry, srid) converts geometries into different spatial reference
systems
• ST_Buffer(geometry, radius) returns an expanded geometry shape
• ST_Contains(geometry1, geometry2) returns true if geometry1 contains geometry2
• ST_Union(geometry[]) returns the aggregate union of all geometries in the group
• ST_GeometryType(geometry) returns the type of the geometry
• ST_NumGeometries(geometry) returns the number of geometries in a collection or 1 for
simple geometries
• ST_Intersection(geometry, geometry) returns the area that the two input geometries
share in common
Remember the tables we have available:
• nyc_census_blocks
– name, popn_total, boroname, geom
• nyc_streets
– name, type, geom
• nyc_subway_stations
– name, geom
• nyc_neighborhoods
21.1 Exercises
SELECT Count(*)
FROM nyc_census_blocks
WHERE NOT
ST_Contains(
geom,
ST_Centroid(geom)
);
481
• Union all the census blocks into a single output. What kind of geometry is it? How many
parts does it have?
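The queries that follow read from a table called nyc_census_blocks_merge. A sketch of how such a table could be built, using the aggregate form of ST_Union over every block:

```sql
-- Merge every census block into one geometry.
-- This touches the whole table, so expect it to take a while.
CREATE TABLE nyc_census_blocks_merge AS
SELECT ST_Union(geom) AS geom
FROM nyc_census_blocks;
```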
SELECT ST_GeometryType(geom)
FROM nyc_census_blocks_merge;
ST_MultiPolygon
SELECT ST_NumGeometries(geom)
FROM nyc_census_blocks_merge;
63
• What is the area of a one unit buffer around the origin? How different is it from what you
would expect? Why?
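One way to compute this is to buffer a point at the origin by one unit and measure the result. The sketch relies on the default buffer settings (a fixed number of segments per quarter circle):

```sql
-- Area of a one-unit buffer around the origin, default segment count.
SELECT ST_Area(ST_Buffer('POINT(0 0)'::geometry, 1));
```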
3.121445152258052
Note: A unit circle (circle with radius of one) should have an area of pi, 3.1415926... The
difference is due to the linear stroking of the edges of the buffer. The buffer has a finite number
of edges. Increasing the number of edges in the buffer will get the value closer to pi, but it will
always be smaller due to the linearization.
• The Brooklyn neighborhoods of ‘Park Slope’ and ‘Carroll Gardens’ are going to war! Con-
struct a polygon delineating a 100 meter wide DMZ on the border between the neighbor-
hoods. What is the area of the DMZ?
Note: It is easy to buffer both the neighborhoods of interest, but to get the intersection
requires a self-join of the table, creating one relation (ps) with just the "Park Slope" record and
another (cg) with just the "Carroll Gardens" record. Note that the area of the intersection is in
square meters because we are still working in UTM 18 (EPSG:26918).
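A sketch of the self-join described in the note: each neighborhood is buffered by 50 meters, so the overlap of the two buffers is a strip 100 meters wide along the shared border. The table name brooklyn_dmz is an illustrative choice.

```sql
-- Self-join nyc_neighborhoods as ps (Park Slope) and cg (Carroll Gardens).
-- "brooklyn_dmz" is a hypothetical output table name.
CREATE TABLE brooklyn_dmz AS
SELECT
  ST_Intersection(
    ST_Buffer(ps.geom, 50),
    ST_Buffer(cg.geom, 50)
  ) AS geom
FROM nyc_neighborhoods ps,
     nyc_neighborhoods cg
WHERE ps.name = 'Park Slope'
  AND cg.name = 'Carroll Gardens';

SELECT ST_Area(geom) FROM brooklyn_dmz;
```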
180990.964207547
The workshop \data\ directory contains a file with attribute data but no geometry:
nyc_census_sociodata.sql. The table includes interesting socioeconomic data about New
York: commute times, incomes, and educational attainment. There is just one problem. The data are
summarized by "census tract" and we have no census tract spatial data!
In this section we will
• Load the nyc_census_sociodata.sql table
• Create a spatial table for census tracts
• Join the attribute data to the spatial data
• Carry out some analysis using our new data
As we saw in the previous section, we can build up higher level geometries from the census block
by summarizing on substrings of the blkid key. In order to get census tracts, we need to summarize
grouping on the first 11 characters of the blkid.
Join the table of tract geometries to the table of tract attributes with a standard attribute join
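The two steps above might be sketched as follows. The intermediate table name nyc_census_tract_geoms, the index names, and the assumption that the loaded attribute table is called nyc_census_sociodata with a tractid key column are all illustrative; only nyc_census_tracts is the name used by the later queries.

```sql
-- Step 1: build tract geometries from the first 11 characters of blkid.
CREATE TABLE nyc_census_tract_geoms AS
SELECT
  ST_Union(geom)       AS geom,
  SubStr(blkid, 1, 11) AS tractid
FROM nyc_census_blocks
GROUP BY tractid;

CREATE INDEX nyc_census_tract_geoms_tractid_idx
  ON nyc_census_tract_geoms (tractid);

-- Step 2: attribute join (assumes the sociodata table has a tractid column).
CREATE TABLE nyc_census_tracts AS
SELECT g.geom, a.*
FROM nyc_census_tract_geoms g
JOIN nyc_census_sociodata a
  ON g.tractid = a.tractid;

-- A spatial index keeps the neighborhood joins that follow fast.
CREATE INDEX nyc_census_tracts_geom_idx
  ON nyc_census_tracts USING GIST (geom);
```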
Answer an interesting question! „List top 10 New York neighborhoods ordered by the proportion of
people who have graduate degrees.“
SELECT
100.0 * Sum(t.edu_graduate_dipl) / Sum(t.edu_total) AS graduate_pct,
n.name, n.boroname
FROM nyc_neighborhoods n
JOIN nyc_census_tracts t
ON ST_Intersects(n.geom, t.geom)
WHERE t.edu_total > 0
GROUP BY n.name, n.boroname
ORDER BY graduate_pct DESC
LIMIT 10;
We sum up the statistics we are interested in, then divide them together at the end. In order to avoid
divide-by-zero errors, we don’t bother bringing in tracts that have a population count of zero.
Note: New York geographers will be wondering at the presence of "Flatbush" in this list of
over-educated neighborhoods. The answer is discussed in the next section.
SELECT
100.0 * Sum(t.edu_graduate_dipl) / Sum(t.edu_total) AS graduate_pct,
n.name, n.boroname
FROM nyc_neighborhoods n
JOIN nyc_census_tracts t
ON ST_Contains(n.geom, ST_Centroid(t.geom))
WHERE t.edu_total > 0
GROUP BY n.name, n.boroname
ORDER BY graduate_pct DESC
LIMIT 10;
Note that the query takes longer to run now, because the ST_Centroid function has to be run on every
census tract.
In particular, the Flatbush neighborhood has dropped off the list. The reason why can be seen by looking
more closely at the map of the Flatbush neighborhood in our table.
As defined by our data source, Flatbush is not really a neighborhood in the conventional sense, since
it just covers the area of Prospect Park. The census tract for that area records, naturally, zero residents.
However, the neighborhood boundary does scrape one of the expensive census tracts bordering the north
side of the park (in the gentrified Park Slope neighborhood). When using polygon/polygon tests, this
single tract was added to the otherwise empty Flatbush, resulting in the very high score for that query.
A query that is fun to ask is „How do the commute times of people near (within 500 meters) subway
stations differ from those of people far away from subway stations?“
However, the question runs into some problems of double counting: many people will be within 500
meters of multiple subway stations. Compare the population of New York:
SELECT Sum(popn_total)
FROM nyc_census_blocks;
8175032
With the population of the people in New York within 500 meters of a subway station:
SELECT Sum(popn_total)
FROM nyc_census_blocks census
JOIN nyc_subway_stations subway
ON ST_DWithin(census.geom, subway.geom, 500);
10855873
There are more people close to the subway than there are people! Clearly, our simple SQL is making a big
double-counting error. You can see the problem looking at the picture of the buffered subways.
The solution is to ensure that we have only distinct census blocks before passing them into the summa-
rization portion of the query. We can do that by breaking our query up into a subquery that finds the
distinct blocks, wrapped in a summarization query that returns our answer:
WITH distinct_blocks AS (
SELECT DISTINCT ON (blkid) popn_total
FROM nyc_census_blocks census
JOIN nyc_subway_stations subway
ON ST_DWithin(census.geom, subway.geom, 500)
)
SELECT Sum(popn_total)
FROM distinct_blocks;
5005743
That’s better! So a bit over half the population of New York is within 500m (about a 5-7 minute walk)
of the subway.
Validity
In 90% of the cases the answer to the question, "why is my query giving me a 'TopologyException'
error" is "one or more of the inputs are invalid". Which raises the question: what does it mean to be
invalid, and why should we care?
Validity is most important for polygons, which define bounded areas and require a good deal of structure.
Lines are very simple and cannot be invalid, nor can points.
Some of the rules of polygon validity feel obvious, and others feel arbitrary (and in fact, are arbitrary).
• Polygon rings must close.
• Rings that define holes should be inside rings that define exterior boundaries.
• Rings may not self-intersect (they may neither touch nor cross themselves).
• Rings may not touch other rings, except at a point.
• Elements of multi-polygons may not overlap, and may touch each other only at points.
The last three rules are in the arbitrary category. There are other ways to define polygons that are equally
self-consistent but the rules above are the ones used by the OGC SFSQL standard that PostGIS conforms
to.
The reason the rules are important is because algorithms for geometry calculations depend on consistent
structure in the inputs. It is possible to build algorithms that have no structural assumptions, but those
routines tend to be very slow, because the first step in any structure-free routine is to analyze the inputs
and build structure into them.
Here’s an example of why structure matters. This polygon is invalid:
POLYGON((0 0, 0 1, 2 1, 2 2, 1 2, 1 0, 0 0));
You can see the invalidity a little more clearly in this diagram:
The outer ring is actually a figure-eight, with a self-intersection in the middle. Note that the graphic
routines successfully render the polygon fill, so that visually it appears to be an "area": two one-unit
squares, so a total area of two units of area.
Let’s see what the database thinks the area of our polygon is:
SELECT ST_Area(ST_GeometryFromText(
'POLYGON((0 0, 0 1, 1 1, 2 1, 2 2, 1 2, 1 1, 1 0, 0 0))'
));
st_area
---------
0
What’s going on here? The algorithm that calculates area assumes that rings do not self-intersect. A
well-behaved ring will always have the area that is bounded (the interior) on one side of the bounding
line (it doesn’t matter which side, just that it is on one side). However, in our (poorly behaved) figure-
eight, the bounded area is to the right of the line for one lobe and to the left for the other. This causes the
areas calculated for each lobe to cancel out (one comes out as 1, the other as -1) hence the „zero area“
result.
In the previous example we had one polygon that we knew was invalid. How do we detect invalidity in
a table with millions of geometries? With the ST_IsValid(geometry) function. Used against our
figure-eight, we get a quick answer:
SELECT ST_IsValid(ST_GeometryFromText(
'POLYGON((0 0, 0 1, 1 1, 2 1, 2 2, 1 2, 1 1, 1 0, 0 0))'
));
Now we know that the feature is invalid, but we don’t know why. We can use the
ST_IsValidReason(geometry) function to find out the source of the invalidity:
SELECT ST_IsValidReason(ST_GeometryFromText(
'POLYGON((0 0, 0 1, 1 1, 2 1, 2 2, 1 2, 1 1, 1 0, 0 0))'
));
Self-intersection[1 1]
Note that in addition to the reason (self-intersection) the location of the invalidity (coordinate (1 1)) is
also returned.
We can use the ST_IsValid(geometry) function to test our tables too:
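For instance, a check over the neighborhoods table could look like the sketch below; the same pattern applies to any table with a geometry column.

```sql
-- Find all the invalid polygons and report what their problem is.
SELECT name, boroname, ST_IsValidReason(geom)
FROM nyc_neighborhoods
WHERE NOT ST_IsValid(geom);
```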
Repairing invalidity involves stripping a polygon down to its simplest structures (rings), ensuring the
rings follow the rules of validity, then building up new polygons that follow the rules of ring enclosure.
Frequently the results are intuitive, but in the case of extremely ill-behaved inputs, the valid outputs may
not conform to your intuition of how they should look. Recent versions of PostGIS include different
algorithms for geometry repair: read the manual page carefully and choose the one you like best.
For example, here’s a classic invalidity – the „banana polygon“ – a single ring that encloses an area but
bends around to touch itself, leaving a „hole“ which is not actually a hole.
POLYGON((0 0, 2 0, 1 1, 2 2, 3 1, 2 0, 4 0, 4 4, 0 4, 0 0))
Running ST_MakeValid on the polygon returns a valid OGC polygon, consisting of an outer and inner
ring that touch at one point.
SELECT ST_AsText(
  ST_MakeValid(
    ST_GeometryFromText('POLYGON((0 0, 2 0, 1 1, 2 2, 3 1, 2 0, 4 0, 4 4, 0 4, 0 0))')
  )
);
POLYGON((0 0,0 4,4 4,4 0,2 0,0 0),(2 0,3 1,2 2,1 1,2 0))
Note: The "banana polygon" (or "inverted shell") is a case where the OGC topology model for
valid geometry and the model used internally by ESRI differ. The ESRI model considers rings that touch
to be invalid, and prefers the banana form for this kind of shape. The OGC model is the reverse. Neither
is "correct"; they are just different ways to model the same situation.
Here’s an example of SQL to flag invalid geometries for review while adding a repaired version to the
table.
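A minimal sketch of that workflow, applied to the neighborhoods table. The column name geom_invalid is an illustrative choice, and the sketch assumes the geom column will accept whatever geometry type ST_MakeValid returns for your data.

```sql
-- Column to hold the original, invalid form (hypothetical name).
ALTER TABLE nyc_neighborhoods
  ADD COLUMN geom_invalid geometry DEFAULT NULL;

-- Repair in place and keep the original. Within one UPDATE, the
-- right-hand "geom" still refers to the old, unrepaired value.
UPDATE nyc_neighborhoods
SET geom = ST_MakeValid(geom),
    geom_invalid = geom
WHERE NOT ST_IsValid(geom);

-- Review the cases that were flagged.
SELECT name, ST_IsValidReason(geom_invalid)
FROM nyc_neighborhoods
WHERE geom_invalid IS NOT NULL;
```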
A good tool for visually repairing invalid geometry is OpenJump (http://openjump.org) which includes
a validation routine under Tools->QA->Validate Selected Layers.
Equality
24.1 Equality
Determining equality when dealing with geometries can be tricky. PostGIS supports three different func-
tions that can be used to determine different levels of equality, though for clarity we will use the defini-
tions below. To illustrate these functions, we will use the following polygons.
Exact equality is determined by comparing two geometries, vertex by vertex, in order, to ensure they are
identical in position. The following examples show how this method can be limited in its effectiveness.
In this example, the polygons are only equal to themselves, not to other seemingly equivalent polygons
(as in the case of Polygons 1 through 3). In the case of Polygons 1, 2, and 3, the vertices are in identical
positions but are defined in differing orders. Polygon 4 has colinear (and thus redundant) vertices on the
hexagon edges causing inequality with Polygon 1.
As we saw above, exact equality does not take into account the spatial nature of the geometries. There is
a function, aptly named ST_Equals, available to test the spatial equality or equivalence of geometries.
These results are more in line with our intuitive understanding of equality. Polygons 1 through 4 are
considered equal, since they enclose the same area. Note that neither the direction in which the polygon is
drawn, nor the starting point for defining the polygon, nor the number of points used is important here.
What is important is that the polygons contain the same space.
Exact equality requires, in the worst case, comparison of each and every vertex in the geometry to deter-
mine equality. This can be slow, and may not be appropriate for comparing huge numbers of geometries.
To allow for speedier comparison, the equal bounds operator, ~=, is provided. This operates only on the
bounding box (rectangle), ensuring that the geometries occupy the same two dimensional extent, but not
necessarily the same space.
As you can see, all of our spatially equal geometries also have equal bounds. Unfortunately, Polygon 5 is
also returned as equal under this test, because it shares the same bounding box as the other geometries.
Why is this useful, then? Although this will be covered in detail later, the short answer is that this enables
the use of spatial indexing that can quickly reduce huge comparison sets into more manageable blocks
when joining or filtering data.
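The three levels of equality can be compared side by side on a few throwaway shapes. This sketch does not use the workshop polygons; the three geometries are illustrative, and it assumes ST_OrderingEquals as the vertex-by-vertex test. Shapes a and b are the same square started at different vertices, while c is a notched square that shares only the bounding box.

```sql
WITH shapes(name, geom) AS (
  VALUES
    ('a', 'POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))'::geometry),
    ('b', 'POLYGON((1 1, 1 0, 0 0, 0 1, 1 1))'::geometry),
    ('c', 'POLYGON((0 0, 0 1, 1 1, 0.5 0.5, 1 0, 0 0))'::geometry)
)
SELECT
  a.name AS left_shape,
  b.name AS right_shape,
  ST_OrderingEquals(a.geom, b.geom) AS exact_equal,     -- vertex by vertex
  ST_Equals(a.geom, b.geom)         AS spatially_equal, -- same space
  a.geom ~= b.geom                  AS equal_bounds     -- same bounding box
FROM shapes a
JOIN shapes b ON a.name < b.name
ORDER BY a.name, b.name;
```

The pair (a, b) should come out spatially equal but not exactly equal, and every pair should have equal bounds, which is why the bounds test is only a first-pass filter.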
Linear Referencing
Linear referencing (sometimes called „dynamic segmentation“) is a means of representing features that
can be described by referencing a base set of linear features. Common examples of features that are
modelled using linear referencing are:
• Highway assets, which are referenced using miles along a highway network
• Road maintenance operations, which are referenced as occurring along a road network between a
pair of mile measurements.
• Aquatic inventories, where fish presence is recorded as existing between a pair of mileage-
upstream measurements.
• Hydrologic characterizations („reaches“) of streams, recorded with a from- and to- mileage.
The benefit of linear referencing models is that the dependent spatial observations do not need to be
separately recorded from the base observations, and updates to the base observation layer can be carried
out knowing that the dependent observations will automatically track the new geometry.
Note: The Esri terminological convention for linear referencing is to have a base table of linear
spatial features, and a non-spatial table of "events" which includes a foreign key reference to the spatial
feature and a measure along the referenced feature. We will use the term "event table" to refer to the
non-spatial tables we build.
If you have an existing point table that you want to reference to a linear network, use the
ST_LineLocatePoint function, which takes a line and point, and returns the proportion along the
line that the point can be found.
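Two small cases illustrate the behavior. The geometries are made up for illustration, and the explicit ::geometry casts are there so the literals resolve unambiguously.

```sql
-- A point half-way along the line.
SELECT ST_LineLocatePoint(
  'LINESTRING(0 0, 2 2)'::geometry,
  'POINT(1 1)'::geometry
);
-- Answer 0.5

-- A point off the line is projected to the closest point on the line.
SELECT ST_LineLocatePoint(
  'LINESTRING(0 0, 2 2)'::geometry,
  'POINT(0 2)'::geometry
);
-- Answer 0.5
```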
We can convert the nyc_subway_stations into an „event table“ relative to the streets by using
ST_LineLocatePoint.
-- All the SQL below is in aid of creating the new event table
CREATE TABLE nyc_subway_station_events AS
-- We first need to get a candidate set of maybe-closest
-- streets, ordered by id and distance...
WITH ordered_nearest AS (
SELECT
ST_GeometryN(streets.geom,1) AS streets_geom,
streets.gid AS streets_gid,
subways.geom AS subways_geom,
subways.gid AS subways_gid,
ST_Distance(streets.geom, subways.geom) AS distance
FROM nyc_streets streets
JOIN nyc_subway_stations subways
ON ST_DWithin(streets.geom, subways.geom, 200)
ORDER BY subways_gid, distance ASC
)
-- We use the 'distinct on' PostgreSQL feature to get the first
-- street (the nearest) for each unique subway gid. We can then
-- pass that one street into ST_LineLocatePoint along with
-- its candidate subway station to calculate the measure.
-- DISTINCT ON keeps the first row per subways_gid, so the outer
-- query repeats the ordering rather than relying on the CTE's.
SELECT
DISTINCT ON (subways_gid)
subways_gid,
streets_gid,
ST_LineLocatePoint(streets_geom, subways_geom) AS measure,
distance
FROM ordered_nearest
ORDER BY subways_gid, distance ASC;
Once we have an event table, it’s fun to turn it back into a spatial view, so we can visualize the events
relative to the original points they were derived from.
To go from a measure to a point, we use the ST_LineInterpolatePoint function. Here is a
simple example of locating a point, reversed:
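As a minimal sketch with a made-up line, asking for the point half-way along a diagonal from (0 0) to (2 2):

```sql
-- Interpolate the point at 50% of the line's length.
SELECT ST_AsText(
  ST_LineInterpolatePoint('LINESTRING(0 0, 2 2)'::geometry, 0.5)
);
```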
-- Answer POINT(1 1)
And we can join the nyc_subway_station_events tables back to the nyc_streets table and use the mea-
sure attribute to generate the spatial event points, without referencing the original nyc_subway_stations
table.
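One way to express that join is as a view, sketched below. The view name is an illustrative choice; the event table columns are the ones created above, and ST_GeometryN takes the first linestring of each street, matching how the measures were computed.

```sql
-- Rebuild point geometries from (street, measure) pairs.
-- "nyc_subway_stations_lrs" is a hypothetical view name.
CREATE OR REPLACE VIEW nyc_subway_stations_lrs AS
SELECT
  events.subways_gid,
  events.streets_gid,
  ST_LineInterpolatePoint(
    ST_GeometryN(streets.geom, 1),
    events.measure
  ) AS geom
FROM nyc_subway_station_events events
JOIN nyc_streets streets
  ON streets.gid = events.streets_gid;
```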
Viewing the original (red star) and event (blue circle) points with the streets, you can see how the events
are snapped directly to the closest street lines.
Note: One surprising use of the linear referencing functions has nothing to do with linear
referencing models. As shown above, it’s possible to use the functions to snap points to linear features. For
use cases like GPS tracks or other inputs that are expected to reference a linear network, snapping is a
handy feature to have available.
The „Dimensionally Extended 9-Intersection Model“ (DE9IM) is a framework for modelling how two
spatial objects interact.
First, every spatial object has:
• An interior
• A boundary
• An exterior
For polygons, the interior, boundary and exterior are obvious:
The interior is the part bounded by the rings; the boundary is the rings themselves; the exterior is every-
thing else in the plane.
For linear features, the interior, boundary and exterior are less well-known:
The interior is the part of the line bounded by the ends; the boundary is the ends of the linear feature,
and the exterior is everything else in the plane.
For points, things are even stranger: the interior is the point; the boundary is the empty set and the
exterior is everything else in the plane.
For the polygons in the example above, the intersection of the interiors is a 2-dimensional area, so that
portion of the matrix is filled out with a „2“. The boundaries only intersect at points, which are zero-
dimensional, so that portion of the matrix is filled out with a 0.
When there is no intersection between components, the square in the matrix is filled out with an "F".
Here’s another example, of a linestring partially entering a polygon:
Note that the boundaries of the two objects don’t actually intersect at all (the end point of the line
interacts with the interior of the polygon, not the boundary, and vice versa), so the B/B cell is filled in
with an „F“.
While it’s fun to visually fill out DE9IM matrices, it would be nice if a computer could do it, and that’s
what the ST_Relate function is for.
The previous example can be simplified using a simple box and line, with the same spatial relationship
as our polygon and linestring:
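A sketch with illustrative coordinates: a horizontal line that starts outside a box and ends inside it, so the line crosses the box boundary exactly once.

```sql
-- Two-parameter ST_Relate returns the DE9IM matrix as a string.
SELECT ST_Relate(
  'LINESTRING(0 0, 2 0)'::geometry,
  'POLYGON((1 -1, 1 1, 3 1, 3 -1, 1 -1))'::geometry
);
```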
The answer (1010F0212) is the same as we calculated visually, but returned as a 9-character string, with
the first row, second row and third row of the table appended together.
101
0F0
212
However, the power of DE9IM matrices is not in generating them, but in using them as a matching key
to find geometries with very specific relationships to one another.
CREATE TABLE lakes ( id serial primary key, geom geometry );
CREATE TABLE docks ( id serial primary key, good boolean, geom geometry );
Suppose we have a data model that includes Lakes and Docks, and suppose further that Docks must be
inside lakes, and must touch the boundary of their containing lake at one end. Can we find all the docks
in our database that obey that rule?
So to find all the legal docks, we would want to find all the docks that intersect lakes (a super-set of
potential candidates we use for our join key), and then find all the docks in that set which have the legal
relate pattern.
SELECT docks.*
FROM docks JOIN lakes ON ST_Intersects(docks.geom, lakes.geom)
WHERE ST_Relate(docks.geom, lakes.geom, '1FF00F212');
Note the use of the three-parameter version of ST_Relate, which returns true if the pattern matches
or false if it does not. For a fully-defined pattern like this one, the three-parameter version is not needed
– we could have just used a string equality operator.
However, for looser pattern searches, the three-parameter version allows substitution characters in the
pattern string:
• „*“ means „any value in this cell is acceptable“
• „T“ means „any non-false value (0, 1 or 2) is acceptable“
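To see the substitution characters at work outside the docks example, here is a small sketch. The pattern 'T*F**F***' is the one commonly given for the "within" relationship (interiors intersect, and nothing of the first object reaches the second's exterior); it is brought in here as an illustration and is not taken from the text above.

```sql
-- A point inside a square matches the "within" pattern.
SELECT ST_Relate(
  'POINT(1 1)'::geometry,
  'POLYGON((0 0, 0 2, 2 2, 2 0, 0 0))'::geometry,
  'T*F**F***'
) AS point_within_polygon;
```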
So for example, one possible dock we did not include in our example graphic is a dock with a
one-dimensional intersection with the lake boundary:
If we are to include this case in our set of "legal" docks, we need to change the relate pattern in our
query. In particular, the intersection of the dock interior and the lake boundary can now be either 1 (our new
case) or F (our original case). So we use the "*" catchall in the pattern.
SELECT docks.*
FROM docks JOIN lakes ON ST_Intersects(docks.geom, lakes.geom)
WHERE ST_Relate(docks.geom, lakes.geom, '1*F00F212');
Confirm that the stricter SQL from the previous example does not return the new dock.
The TIGER data is carefully quality controlled when it is prepared, so we expect our data to meet strict
standards. For example: no census block should overlap any other census block. Can we test for that?
Sure!
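A sketch of such a test: two blocks overlap if their interiors share a two-dimensional area, which corresponds to a "2" in the first cell of the relate matrix. The blkid column is used here to avoid comparing a block with itself.

```sql
-- Any rows returned are pairs of blocks whose interiors overlap.
SELECT a.blkid, b.blkid
FROM nyc_census_blocks a
JOIN nyc_census_blocks b
  ON ST_Intersects(a.geom, b.geom)
WHERE ST_Relate(a.geom, b.geom, '2********')
  AND a.blkid <> b.blkid
LIMIT 10;
```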
Similarly, we would expect that the roads data is all end-noded. That is, we expect that intersections only
occur at the ends of lines, not at the mid-points.
We can test for that by looking for streets that intersect (so we have a join) but where the intersection
between the boundaries is not zero-dimensional (that is, the end points don’t touch):
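As a sketch, the boundary/boundary cell is the fifth character of the pattern, so we look for intersecting pairs that fail to match a "0" in that position:

```sql
-- Street pairs that intersect somewhere other than at their end points.
SELECT a.gid, b.gid
FROM nyc_streets a
JOIN nyc_streets b
  ON ST_Intersects(a.geom, b.geom)
WHERE NOT ST_Relate(a.geom, b.geom, '****0****')
  AND a.gid <> b.gid
LIMIT 10;
```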
ST_Relate(geometry A, geometry B): Returns a text string representing the DE9IM relationship between
the geometries.
Clustering on Indices
Databases can only retrieve information as fast as they can get it off of disk. Small databases will float
up entirely into RAM cache, and get away from physical disk limitations, but for large databases, access
to the physical disk will be the limiting factor in data access speed.
Data is written to disk opportunistically, so there is not necessarily any correlation between the order
data is stored on the disk and the way it will be accessed or organized by applications.
One way to speed up access to data is to ensure that records which are likely to be retrieved together
in the same result set are located in similar physical locations on the hard disk platters. This is called
"clustering".
The right clustering scheme to use can be tricky, but a general rule applies: indexes define a natural
ordering scheme for data which is similar to the access pattern that will be used in retrieving the data.
Because of this, ordering the data on the disk in the same order as the index can provide a speed
advantage in some cases.
Spatial data tends to be accessed in spatially correlated windows: think of the map window in a web
or desktop application. All the data in the window has a similar location value (or it wouldn’t be in the
window!).
So, clustering based on a spatial index makes sense for spatial data that is going to be accessed with
spatial queries: similar things tend to have similar locations.
Let’s cluster our nyc_census_blocks based on their spatial index:
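A minimal sketch of the command, assuming a GiST index named nyc_census_blocks_geom_gist already exists on the table (substitute the name of your own spatial index if it differs):

```sql
-- Physically re-order the table rows to follow the spatial index.
CLUSTER nyc_census_blocks USING nyc_census_blocks_geom_gist;

-- Refresh planner statistics after the rewrite.
ANALYZE nyc_census_blocks;
```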
The command re-writes the nyc_census_blocks in the order defined by the spatial index
nyc_census_blocks_geom_gist. Can you perceive a speed difference? Maybe not, because the
original data may have already had some pre-existing spatial ordering (this is not uncommon in GIS data
sets).
Most modern databases are run using SSD storage, which is much faster at random access than old
spinning magnetic media. Also, most modern databases are running on top of data which is small enough
to fit into the RAM of the database server, and ends up there as the operating system „virtual filesystem“
caches it.
Is clustering still necessary?
Surprisingly, yes. Keeping records that are "near each other" in space "near each other" in memory
increases the odds that related records will move up the server’s "memory cache hierarchy" together, and
thus make memory accesses faster.
System RAM is not the fastest memory on a modern computer. There are several levels of cache between
system RAM and the actual CPU, and the underlying operating system and processor will move data
up and down the cache hierarchy in blocks. If the block getting moved up happens to include the piece
of data the system will need next... that’s a big win. Correlating the memory structure with the spatial
structure is a way to increase the odds of that win happening.
Does the quality of the index structure matter? In theory, yes. In practice, not really. As long as the
index is a "pretty good" spatial decomposition of the data, the main determinant of performance will be
the order of the actual table tuples.
The difference between „no index“ and „index“ is generally huge and highly measurable. The difference
between „mediocre index“ and „great index“ usually takes quite careful measurement to discern, and
can be very sensitive to the workload being tested.
CLUSTER: Re-orders the data in a table to match the ordering in the index.
3-D
So far, we have been working with 2-D geometries, with only X and Y coordinates. But PostGIS supports
additional dimensions on all geometry types: a "Z" dimension to add height information and an "M"
dimension for additional dimensional information (commonly time, or road-mile, or upstream-distance
information) for each coordinate.
For 3-D and 4-D geometries, the extra dimensions are added as extra coordinates for each vertex in the
geometry, and the geometry type is enhanced to indicate how to interpret the extra dimensions. Adding
the extra dimensions results in three extra possible geometry types for each geometry primitive:
• Point (a 2-D type) is joined by PointZ, PointM and PointZM types.
• Linestring (a 2-D type) is joined by LinestringZ, LinestringM and LinestringZM types.
• Polygon (a 2-D type) is joined by PolygonZ, PolygonM and PolygonZM types.
• And so on.
For well-known text (WKT) representation, the format for higher dimensional geometries is given by
the ISO SQL/MM specification. The extra dimensionality information is simply added to the text string
after the type name, and the extra coordinates added after the X/Y information. For example:
• POINT ZM (1 2 3 4)
• LINESTRING M (1 1 0, 1 2 0, 1 3 1, 2 2 0)
• POLYGON Z ((0 0 0, 0 1 0, 1 1 0, 1 0 0, 0 0 0))
The ST_AsText() function will return the above representations when dealing with 3-D and 4-D
geometries.
For well-known binary (WKB) representation, the format for higher dimensional geometries is given
by the ISO SQL/MM specification. The BNF form of the format is available from
https://git.osgeo.org/gitea/postgis/postgis/src/branch/master/doc/bnf-wkb.txt.
In addition to higher-dimensional forms of the standard types, PostGIS includes a few new types that
make sense in a 3-D space:
• The TIN type allows you to model triangular meshes as rows in your database.
• The POLYHEDRALSURFACE allows you to model volumetric objects in your database.
Since both these types are for modelling 3-D objects, it only really makes sense to use the Z variants.
An example of a POLYHEDRALSURFACE Z would be the 1 unit cube:
POLYHEDRALSURFACE Z (
((0 0 0, 0 1 0, 1 1 0, 1 0 0, 0 0 0)),
((0 0 0, 0 1 0, 0 1 1, 0 0 1, 0 0 0)),
((0 0 0, 1 0 0, 1 0 1, 0 0 1, 0 0 0)),
((1 1 1, 1 0 1, 0 0 1, 0 1 1, 1 1 1)),
((1 1 1, 1 0 1, 1 0 0, 1 1 0, 1 1 1)),
((1 1 1, 1 1 0, 0 1 0, 0 1 1, 1 1 1))
)
There are a number of functions built to calculate relationships between 3-D objects:
• ST_3DClosestPoint: Returns the 3-dimensional point on g1 that is closest to g2. This is the first
point of the 3D shortest line.
• ST_3DDistance: Returns the 3-dimensional Cartesian minimum distance (based on the spatial
reference system) between two geometries, in projected units.
• ST_3DDWithin: Returns true if the 3-D distance between two geometries is within the given
number of units.
• ST_3DDFullyWithin: Returns true if all of the 3D geometries are within the specified distance
of one another.
• ST_3DIntersects: Returns true if the geometries "spatially intersect" in 3-D (only for points
and linestrings).
• ST_3DLongestLine: Returns the 3-dimensional longest line between two geometries.
• ST_3DMaxDistance: Returns the 3-dimensional Cartesian maximum distance (based on the spatial
reference system) between two geometries, in projected units.
• ST_3DShortestLine: Returns the 3-dimensional shortest line between two geometries.
For example, we can calculate the distance between our unit cube and a point using the ST_3DDistance
function:
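A sketch of such a calculation follows. The cube is the one defined above; the query point (2 2 2) is an arbitrary choice made for this illustration.

```sql
-- Distance from the unit cube to a point off its top corner.
SELECT ST_3DDistance(
  'POLYHEDRALSURFACE Z (
    ((0 0 0, 0 1 0, 1 1 0, 1 0 0, 0 0 0)),
    ((0 0 0, 0 1 0, 0 1 1, 0 0 1, 0 0 0)),
    ((0 0 0, 1 0 0, 1 0 1, 0 0 1, 0 0 0)),
    ((1 1 1, 1 0 1, 0 0 1, 0 1 1, 1 1 1)),
    ((1 1 1, 1 0 1, 1 0 0, 1 1 0, 1 1 1)),
    ((1 1 1, 1 1 0, 0 1 0, 0 1 1, 1 1 1))
  )'::geometry,
  'POINT Z (2 2 2)'::geometry
);

-- The nearest part of the cube is its corner at (1 1 1), so this
-- shorter point-to-point form measures the same gap: sqrt(3).
SELECT ST_3DDistance(
  'POINT Z (1 1 1)'::geometry,
  'POINT Z (2 2 2)'::geometry
);
```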
Once you have data in higher dimensions it may make sense to index it. However, you should think
carefully about the distribution of your data in all dimensions before applying a multi-dimensional index.
Indexes are only useful when they allow the database to drastically reduce the number of rows returned
by a WHERE condition. For a higher dimension index to be useful, the data must cover a wide
range of that dimension, relative to the kinds of queries you are constructing.
• A set of DEM points would probably be a poor candidate for a 3-D index, since the queries would
usually be extracting a 2-D box of points, and rarely attempting to select a Z-slice of points.
• A set of GPS traces in X/Y/T space might be a good candidate for a 3-D index, if the GPS tracks
overlapped each other frequently in all dimensions (for example, driving the same route over and
over at different times), since there would be large variability in all dimensions of the data set.
You can create a multi-dimensional index on data of any dimensionality (even mixed dimensionality).
For example, you can create a multi-dimensional index on the nyc_streets table.
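A sketch of the index definition; the index name nyc_streets_gix_nd is a placeholder chosen for this example:

```sql
-- Name the N-D operator class explicitly after the column.
CREATE INDEX nyc_streets_gix_nd
  ON nyc_streets
  USING GIST (geom gist_geometry_ops_nd);
```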
The gist_geometry_ops_nd parameter tells PostGIS to use the N-D index instead of the standard
2-D index.
Once you have the index built, you can use it in queries with the &&& index operator. &&& has the same
semantics as &&, „bounding boxes interact“, but applies those semantics using all the dimensions of the
input geometries. Geometries with mis-matching dimensionality do not interact.
-- Returns true (the volume around the linestring interacts with the point)
SELECT 'LINESTRING Z(0 0 0, 1 1 1)'::geometry &&&
'POINT(0 1 1)'::geometry;
To search the nyc_streets table using the N-D index, just replace the usual && 2-D index operator
with the &&& operator.
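For illustration, here is the same bounding-box search written with each operator. The search line is an arbitrary pair of coordinates in SRID 26918, and the gid column is assumed:

```sql
-- N-D index operator
SELECT gid, name
FROM nyc_streets
WHERE geom &&& 'SRID=26918;LINESTRING(586785 4492901, 587561 4493037)'::geometry;

-- 2-D index operator
SELECT gid, name
FROM nyc_streets
WHERE geom && 'SRID=26918;LINESTRING(586785 4492901, 587561 4493037)'::geometry;
```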
The results should be the same. In general the N-D index is very slightly slower than the 2-D index,
so only use the N-D index where you are certain that N-D queries will improve the selectivity of your
queries.
Nearest-Neighbour Searching
A frequently posed spatial query is: „what is the nearest <candidate feature> to <query feature>?“
Unlike a distance search, the "nearest neighbour" search doesn’t include any measurement restricting
how far away candidate geometries might be; features any distance away will be accepted, as long as
they are the nearest.
PostgreSQL solves the nearest neighbor problem by introducing an „order by distance“ (<->) operator
that induces the database to use an index to speed up a sorted return set. With an „order by distance“
operator in place, a nearest neighbor query can return the „N nearest features“ just by adding an ordering
and limiting the result set to N entries.
The "order by distance" operator works for both geometry and geography types. The only difference
between the two is the distance value returned. For geometry, <-> returns the same answer as
ST_Distance, which depends on the units of the spatial reference system in use. For geography, the
value returned is the sphere distance, rather than the more accurate spheroidal distance that
ST_Distance(geography,geography) returns.
Here are the 3 nearest streets to the 'Broad St' subway station, which is located at:
SRID=26918;POINT(583571.9 4506714.3)
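A sketch of the query, using the point above as the geometry literal on the right-hand side of the <-> operator (the gid column is assumed):

```sql
-- Order by distance to the station, keep the first three rows.
SELECT streets.gid, streets.name
FROM nyc_streets streets
ORDER BY streets.geom <-> 'SRID=26918;POINT(583571.9 4506714.3)'::geometry
LIMIT 3;
```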
How can we be sure we are getting an index-assisted query? It’s a good idea to check the EXPLAIN
output for a nearest-neighbor query, because it’s possible to get correct answers from non-indexed SQL
and the lack of an index might not be obvious until the size of the tables scales up.
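To see the plan, prefix the query with EXPLAIN. This sketch reuses the station point from above and assumes a gid column:

```sql
EXPLAIN
SELECT streets.gid, streets.name
FROM nyc_streets streets
ORDER BY streets.geom <-> 'SRID=26918;POINT(583571.9 4506714.3)'::geometry
LIMIT 3;
```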
This is the output from EXPLAIN. Note the index scan driven by the order by:
                                QUERY PLAN
---------------------------------------------------------------------------------
 Limit  (cost=0.28..79.58 rows=3 width=31)
   ->  Index Scan using nyc_streets_geom_idx on nyc_streets streets
         (cost=0.28..504685.12 rows=19091 width=31)
         Order By: (geom <-> '0101000020266900000EEBD4CF27CF2141BC17D69516315141'::geometry)
The index-assisted order by operator has one major drawback: it only works with a single geometry
literal on one side of the operator. This is fine for finding the objects nearest to one query object, but
does not help for a spatial join, where the goal is to find the nearest neighbor for each of a full set of
candidates.
Fortunately, there’s a SQL language feature that allows us to run a query repeatedly driven in a loop: the
LATERAL join.
Here we will find the nearest street to each subway station:
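A sketch of such a query follows. The table names and aliases match the EXPLAIN output shown for this query; the gid and name columns on nyc_subway_stations are assumptions.

```sql
SELECT subways.gid  AS subway_gid,
       subways.name AS subway,
       streets.name AS street,
       streets.gid  AS street_gid,
       streets.dist
FROM nyc_subway_stations subways
CROSS JOIN LATERAL (
  -- Runs once per subway row; subways.geom is visible here
  -- because the subquery is LATERAL.
  SELECT streets.name, streets.gid,
         streets.geom <-> subways.geom AS dist
  FROM nyc_streets AS streets
  ORDER BY streets.geom <-> subways.geom
  LIMIT 1
) streets;
```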
Note the way the CROSS JOIN LATERAL acts as the inner part of a loop driven by the subways table.
Each record in the subways table gets fed into the lateral subquery, one at a time, so you get a nearest
result for each subway record.
The explain shows the loop on the subway stations, and the index-assisted order by inside the loop where
we want it:
                                QUERY PLAN
-------------------------------------------------------------------------
 Nested Loop  (cost=0.28..13140.71 rows=491 width=37)
   ->  Seq Scan on nyc_subway_stations subways  (cost=0.00..15.91 rows=491 width=46)
   ->  Limit  (cost=0.28..1.71 rows=1 width=170)
         ->  Index Scan using nyc_streets_geom_idx on nyc_streets streets
               (cost=0.28..27410.12 rows=19091 width=170)
               Order By: (geom <-> subways.geom)
A common requirement for production databases is the ability to track history: how has the data changed
between two dates, who made the changes, and where did they occur? Some GIS systems track changes
by including change management in the client interface, but that adds a lot of complexity to editing
tools.
Using the database and the trigger system, it’s possible to add history tracking to any table, while
maintaining simple "direct edit" access to the primary table.
History tracking works by keeping a history table that records, for every edit:
• If a record was created, when it was added and by whom.
• If a record was deleted, when it was deleted and by whom.
• If a record was updated, adding a deletion record (for the old state) and a creation record (for the
new state).
The history table uses a PostgreSQL-specific feature, the "timestamp range" type, to store the time range
during which a history record was the "live" record. All the timestamp ranges in the history table for a
particular feature can be expected to be non-overlapping but adjacent.
The range for a new record will start at now() and have an open end point, so that the range covers all
time from the current time into the future.
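A range of that shape can be built with the tstzrange constructor, passing NULL as the upper bound. This is a minimal sketch; the timestamp in the result depends on when it is run.

```sql
-- Lower bound is "now", upper bound is open (unbounded).
SELECT tstzrange(current_timestamp, NULL);
```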
tstzrange
------------------------------------
["2021-06-01 14:49:40.910074-07",)
Similarly, the time range for a deleted record will be updated to include the current time as the end point
of the time range.
Searching time ranges is much simpler than searching a pair of timestamps, because of the way an open
time range encompasses all time from the start point to infinity. The „contains“ operator @> for ranges
is the one we will use.
-- Does the range of "ten minutes ago to the future" include now?
-- It should! :)
--
SELECT tstzrange(current_timestamp - '10m'::interval, NULL) @> current_timestamp;
Ranges can be very efficiently indexed using a GIST index, just like spatial data, as we will show below.
This makes history queries very efficient.
Using this information it is possible to reconstruct the state of the edit table at any point in time. In this
example, we will add history tracking to our nyc_streets table.
• First, add a new nyc_streets_history table. This is the table we will use to store all the historical
edit information. In addition to all the fields from nyc_streets, we add four more fields:
– hid: the primary key for the history table
– created_by: the database user that caused the record to be created
– deleted_by: the database user that caused the record to be marked as deleted
– valid_range: the time range within which the record was "live"
Note that we don’t actually delete any records in the history table, we just mark the time they
ceased to be part of the current state of the edit table.
• Next, we import the current state of the active table, nyc_streets into the history table, so we have
a starting point to trace history from. Note that we fill in the creation time and creation user, but
leave the end of the time range and the deleted by information NULL.
• Now we need three triggers on the active table, for INSERT, DELETE and UPDATE actions. First
we create the trigger functions, then bind them to the table as triggers.
For an insert, we just add a new record into the history table with the creation time/user.
For a deletion, we just mark the currently active history record (the one with a NULL deletion
time) as deleted.
For an update, we first mark the active history record as deleted, then insert a new record for the
updated state.
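The first two steps might look like the sketch below. It is not the workshop's own listing: to avoid guessing the column list of nyc_streets, it copies the columns with LIKE and appends the bookkeeping columns afterwards. The index names and the varchar(32) width are choices made for this example.

```sql
-- History table: every column of nyc_streets, in the same order ...
CREATE TABLE nyc_streets_history (LIKE nyc_streets);

-- ... followed by the bookkeeping columns. hid is added last so that
-- positional inserts can leave it (and deleted_by) to their defaults.
ALTER TABLE nyc_streets_history
  ADD COLUMN valid_range tstzrange,
  ADD COLUMN created_by  varchar(32),
  ADD COLUMN deleted_by  varchar(32),
  ADD COLUMN hid         serial PRIMARY KEY;

-- Index both the geometry and the time range.
CREATE INDEX nyc_streets_history_geom_x
  ON nyc_streets_history USING GIST (geom);
CREATE INDEX nyc_streets_history_tstz_x
  ON nyc_streets_history USING GIST (valid_range);

-- Seed the history with the current state: creation time and user
-- are filled in, the end of the range and deleted_by stay NULL.
INSERT INTO nyc_streets_history
  SELECT s.*, tstzrange(now(), NULL), current_user
  FROM nyc_streets AS s;
```

Because the seed insert is positional, it relies on the history table keeping the nyc_streets columns first and in the same order.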
    UPDATE nyc_streets_history
      SET valid_range = tstzrange(lower(valid_range), current_timestamp),
          deleted_by = current_user
      WHERE valid_range @> current_timestamp AND gid = OLD.gid;
    RETURN NEW;
  END;
$$
LANGUAGE plpgsql;
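Put together, the three trigger functions and their bindings could be sketched as follows. The function and trigger names are placeholders. The positional inserts assume that the history table's columns are those of nyc_streets, in the same order, followed by valid_range and created_by.

```sql
-- INSERT: open a new history record for the new row.
CREATE OR REPLACE FUNCTION nyc_streets_insert() RETURNS trigger AS
$$
BEGIN
  INSERT INTO nyc_streets_history
    SELECT NEW.*, tstzrange(current_timestamp, NULL), current_user;
  RETURN NEW;
END;
$$
LANGUAGE plpgsql;

-- DELETE: close the time range of the live history record.
CREATE OR REPLACE FUNCTION nyc_streets_delete() RETURNS trigger AS
$$
BEGIN
  UPDATE nyc_streets_history
    SET valid_range = tstzrange(lower(valid_range), current_timestamp),
        deleted_by = current_user
    WHERE valid_range @> current_timestamp AND gid = OLD.gid;
  RETURN NULL;
END;
$$
LANGUAGE plpgsql;

-- UPDATE: close the old record, then open one for the new state.
CREATE OR REPLACE FUNCTION nyc_streets_update() RETURNS trigger AS
$$
BEGIN
  UPDATE nyc_streets_history
    SET valid_range = tstzrange(lower(valid_range), current_timestamp),
        deleted_by = current_user
    WHERE valid_range @> current_timestamp AND gid = OLD.gid;
  INSERT INTO nyc_streets_history
    SELECT NEW.*, tstzrange(current_timestamp, NULL), current_user;
  RETURN NEW;
END;
$$
LANGUAGE plpgsql;

-- Bind the functions to the active table.
CREATE TRIGGER nyc_streets_insert_trigger
  AFTER INSERT ON nyc_streets
  FOR EACH ROW EXECUTE PROCEDURE nyc_streets_insert();
CREATE TRIGGER nyc_streets_delete_trigger
  AFTER DELETE ON nyc_streets
  FOR EACH ROW EXECUTE PROCEDURE nyc_streets_delete();
CREATE TRIGGER nyc_streets_update_trigger
  AFTER UPDATE ON nyc_streets
  FOR EACH ROW EXECUTE PROCEDURE nyc_streets_update();
```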
Now that the history table is enabled, we can make edits on the main table and watch the log entries
appear in the history table.
Note the power of this database-backed approach to history: no matter what tool is used to make the
edits, whether the SQL command line, a web-based JDBC tool, or a desktop tool like QGIS, the
history is consistently tracked.
Let’s turn the two streets named „Cumberland Walk“ to the more stylish „Cumberland Wynde“:
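A sketch of the edit, made directly on the active table:

```sql
UPDATE nyc_streets
SET name = 'Cumberland Wynde'
WHERE name = 'Cumberland Walk';
```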
Updating the two streets will cause the original streets to be marked as deleted in the history table, with
a deletion time of now, and two new streets with the new name added, with an addition time of now. You
can inspect the historical records:
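For example, a query along these lines lists both the closed and the newly opened records:

```sql
-- Matches both 'Cumberland Walk' and 'Cumberland Wynde'.
SELECT *
FROM nyc_streets_history
WHERE name LIKE 'Cumberland W%';
```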
Now that we have a history table, what use is it? It’s useful for time travel! To travel to a particular time
T, you need to construct a query that includes:
• All records created before T, and not yet deleted; and also
• All records created before T, but deleted after T.
We can use this logic to create a query, or a view, of the state of the data in the past. Since presumably
all your test edits have happened in the past couple minutes, let’s create a view of the history table that
shows the state of the table 10 minutes ago, before you started editing (so, the original data).
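A sketch of such a view (the view name is a placeholder). A single "contains" test on valid_range covers both cases in the list above, because a record that is still live has an open-ended range.

```sql
-- Records whose live range contains the instant ten minutes ago.
-- now() is evaluated each time the view is queried.
CREATE OR REPLACE VIEW nyc_streets_ten_min_ago AS
  SELECT *
  FROM nyc_streets_history
  WHERE valid_range @> (now() - '10 minutes'::interval);
```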
We can also create views that show just what a particular user has added, for example:
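A sketch for the postgres user (the view name is a placeholder):

```sql
CREATE OR REPLACE VIEW nyc_streets_postgres AS
  SELECT *
  FROM nyc_streets_history
  WHERE created_by = 'postgres';
```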
PostgreSQL is a very versatile database system, capable of running efficiently in very low-resource
environments and environments shared with a variety of other applications. In order to ensure it will run
properly for many different environments, the default configuration is very conservative and not terribly
appropriate for a high-performance production database. Add the fact that geospatial databases have
different usage patterns, and the data tend to consist of fewer, much larger records than non-geospatial
databases, and you can see that the default configuration will not be totally appropriate for our purposes.
All of these configuration parameters can be edited in the postgresql.conf database configuration file.
This is a regular text file and can be edited using any text editor. The changes will not take effect until
the server is restarted.
This section describes some of the configuration parameters that can be adjusted for a more production-
ready geospatial database.
Note: These values are recommendations only; each environment will differ and testing is
required to determine the optimal configuration. But this section should get you off to a good start.
31.1 shared_buffers
Sets the amount of memory the database server uses for shared memory buffers. These are shared
amongst the back-end processes, as the name suggests. The default values are typically woefully
inadequate for production databases.
Default value: typically 32MB
Recommended value: about 75% of database memory up to a max of about 2GB
31.2 effective_cache_size
In addition to the memory PostgreSQL sets aside for shared_buffers, the query planner
also takes into account how many disk blocks the operating system may have cached as part of
its virtual file system. For systems with large amounts of memory, this can be quite large. The
effective_cache_size is approximately the amount of memory on the machine, less the
shared_buffers, less the work_mem times the expected number of connections, less any memory
required for any other processes running on the machine, less about 1GB for other random operating
system needs. The database will not use the extra cache directly, but it will compute plans expecting that
the operating system has cached filesystem data in about that much memory.
Default value: typically 4GB
Recommended value: any amount of „free“ memory expected to be around under ordinary
operating conditions
31.3 work_mem
Defines the amount of memory that internal sorting operations, indexing operations and hash tables
can consume before the database switches to on-disk files. This value defines the available memory for
each operation; complex queries may have several sort or hash operations running in parallel, and each
connected session may be executing a query.
As such, you must consider how many connections you expect, and how complex their queries will be,
before increasing this value. The benefit of increasing it is that more of these operations, including
ORDER BY and DISTINCT clauses, merge and hash joins, hash-based aggregation and hash-based
processing of subqueries, can be accomplished without incurring disk writes. The cost of increasing it is
memory that will be used per connection, which can be quite high with production levels of
connections.
Default value: 1MB
Recommended value: 32MB
31.4 maintenance_work_mem
Defines the amount of memory used for maintenance operations, including vacuuming, index and
foreign key creation. As these operations are not terribly common, a higher value will only exact an
occasional cost, and may substantially speed up maintenance activities. This parameter can alternately be
increased for a single session before the execution of a number of CREATE INDEX or VACUUM calls
as shown below.
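A sketch of the per-session approach; the 128MB figure is an illustrative value, not a recommendation from the text:

```sql
-- Raise the limit for this session only.
SET maintenance_work_mem TO '128MB';

-- Run the maintenance work that benefits from it.
VACUUM ANALYZE;

-- Return to the server-wide setting.
RESET maintenance_work_mem;
```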
31.5 wal_buffers
Sets the amount of memory used for write-ahead log (WAL) data. Write-ahead logs provide a
high-performance mechanism for ensuring data integrity. During each change command, the effects of the
changes are written first to the WAL files and flushed to disk. Only once the WAL files have been
flushed will the changes be written to the data files themselves. This allows the data files to be written to
disk in an optimal and asynchronous manner while ensuring that, in the event of a crash, all data changes
can be recovered from the WAL.
The size of this buffer only needs to be large enough to hold WAL data for a single typical
transaction. While the default value is often sufficient for most data, geospatial data tends to be much
larger. Therefore, it is recommended to increase the size of this parameter.
Default value: 64kB
Recommended value: 1MB
31.6 checkpoint_segments
This value sets the maximum number of log file segments (typically 16MB) that can be filled between
automatic WAL checkpoints. A WAL checkpoint is a point in the sequence of WAL transactions at which
it is guaranteed that the data files have been updated with all information before the checkpoint. At this
time all dirty data pages are flushed to disk and a checkpoint record is written to the log file. This allows
the crash recovery process to find the latest checkpoint record and apply all following log segments to
complete the data recovery.
Because the checkpoint process requires the flushing of all dirty data pages to disk, it creates a significant
I/O load. The same argument from above applies; geospatial data is large enough to unbalance non-
geospatial optimizations. Increasing this value will prevent excessive checkpoints, though it may cause
the server to restart more slowly in the event of a crash.
Default value: 3
Recommended value: 6
Note: checkpoint_segments was removed in PostgreSQL 9.5; on later versions the max_wal_size
parameter serves the same purpose.
31.7 random_page_cost
This is a unit-less value that represents the cost of a random page access from disk. This value is relative
to a number of other cost parameters including sequential page access, and CPU operation costs. While
there is no magic bullet for this value, the default is generally conservative and assumes a database
running on spinning media. The random access cost for SSD should be set even lower.
This value can be set on a per-session basis using the SET random_page_cost TO 2.0 command,
which can be useful for testing how it affects query plans.
Default value: 4.0
Recommended value: 2.0 for spinning media, 1.0 for SSD
31.8 seq_page_cost
This is the parameter that controls the cost of a sequential page access. This value does not generally
require adjustment but the difference between this value and random_page_cost greatly affects the
choices made by the query planner. This value can also be set on a per-session basis.
Default value: 1.0
Recommended value: 1.0
After these changes are made, save changes and reload the configuration. The easiest way to do this is
to restart the PostgreSQL service.
• In pgAdmin, right-click the server PostGIS (localhost:5432) and select Disconnect.
• In Windows Services (services.msc) right-click PostgreSQL and select Restart.
• Back in pgAdmin, click the server again and select Connect.
PostgreSQL Security
PostgreSQL has a rich and flexible permissions system, with the ability to parcel out particular privileges
to particular roles, and provide users with the powers of one or more of those roles.
In addition, the PostgreSQL server can use multiple different systems to authenticate users. This means
that the database can use the same authentication infrastructure as other architecture components,
simplifying password management.
A role is a user and a user is a role. The only difference is that a „user“ can be said to be a role with the
„login“ privilege.
So functionally, the two SQL statements below are the same: they both create a "role with the login
privilege", which is to say, a "user".
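A sketch of the two equivalent forms, using a placeholder role name (mrbean). The role is dropped in between so that the second statement does not fail on a duplicate name.

```sql
-- A role with the login privilege ...
CREATE ROLE mrbean LOGIN;

DROP ROLE mrbean;

-- ... is exactly what CREATE USER produces.
CREATE USER mrbean;
```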
Our read-only user will be for a web application to use to query the nyc_streets table.
The application will have specific access to the nyc_streets table, but will inherit the necessary
system access for PostGIS operations from the postgis_reader role.
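A sketch of that set-up, using the app1 and postgis_reader names from the text:

```sql
-- A user account for the web app
CREATE USER app1;
-- Give that user the ability to read the one table it needs
GRANT SELECT ON nyc_streets TO app1;

-- A generic role for access to PostGIS functionality
CREATE ROLE postgis_reader INHERIT;
-- Give that role to the web app user
GRANT postgis_reader TO app1;
```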
Now, when we log in as app1, we can select rows from the nyc_streets table. However, we cannot
run an ST_Transform call! Why not?
-- This works!
SELECT * FROM nyc_streets LIMIT 1;
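By contrast, a reprojection such as the sketch below is the kind of call that fails for app1, on installations where that user has no SELECT privilege on spatial_ref_sys. The target SRID 4326 is an arbitrary choice for this example.

```sql
-- Expected to fail for app1 with a permission error
-- when spatial_ref_sys is not readable by that user.
SELECT ST_AsText(ST_Transform(geom, 4326))
FROM nyc_streets
LIMIT 1;
```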
The answer is contained in the error statement. Though our app1 user can view the contents of
the nyc_streets table fine, it cannot view the contents of spatial_ref_sys, so the call to
ST_Transform fails.
So, we need to also grant the postgis_reader role read access to all the PostGIS metadata tables:
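A sketch of the grants, covering the three metadata relations installed by PostGIS:

```sql
GRANT SELECT ON geometry_columns TO postgis_reader;
GRANT SELECT ON geography_columns TO postgis_reader;
GRANT SELECT ON spatial_ref_sys TO postgis_reader;
```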
Now we have a nice generic postgis_reader role we can apply to any user that needs to read from
PostGIS tables.
These kinds of permissions would be required for a read/write WFS service, for example.
For developers and analysts, a little more access is needed to the main PostGIS metadata tables. We will
need a postgis_writer role that can edit the PostGIS metadata tables!
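A sketch of such a role follows. Only spatial_ref_sys is granted write access here, because geometry_columns and geography_columns are views in PostGIS 2.0 and later; extend the grants if your installation differs.

```sql
-- Make a PostGIS writer role
CREATE ROLE postgis_writer;
-- Start by giving it the postgis_reader powers
GRANT postgis_reader TO postgis_writer;
-- Add insert/update/delete powers on the spatial reference table
GRANT INSERT, UPDATE, DELETE ON spatial_ref_sys TO postgis_writer;
-- Make app1 a PostGIS writer
GRANT postgis_writer TO app1;
```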
Now try the table creation SQL above as the app1 user and see how it goes!
32.2 Encryption
PostgreSQL provides a lot of encryption facilities, many of them optional, some of them on by default.
• By default, all passwords are MD5 encrypted. The client/server handshake double encrypts the
MD5 password to prevent re-use of the hash by anyone who intercepts the password.
• SSL connections are optionally available between the client and server, to encrypt all data and
login information. SSL certificate authentication is also available when SSL connections are used.
• Columns inside the database can be encrypted using the pgcrypto module, which includes hashing
algorithms, direct ciphers (blowfish, aes) and both public key and symmetric PGP encryption.
In order to use SSL connections, both your client and server must support SSL.
• First, turn off PostgreSQL, since activating SSL will require a restart.
• Next, we acquire or generate an SSL certificate and key. The certificate will need to have no
passphrase on it, or the database server won’t be able to start up. You can generate a self-signed
key as follows:
• Copy the server.crt and server.key into the PostgreSQL data directory.
• Enable SSL support in the postgresql.conf file by turning the „ssl“ parameter to „on“.
• Now re-start PostgreSQL; the server is ready for SSL operation.
With the server enabled for SSL, creating an encrypted connection is easy. In PgAdmin, create a new
server connection (File > Add Server...), and set the SSL parameter to "require".
Once you connect with the new connection, you can see in its properties that it is using an SSL
connection.
Since the default SSL connection mode is „prefer“, you don’t even need to specify an SSL preference
when connecting. A connection with the command line psql terminal will pick up the SSL option and
use it by default:
psql (8.4.9)
SSL connection (cipher: DHE-RSA-AES256-SHA, bits: 256)
Type "help" for help.
postgres=#
Note how the terminal reports the SSL status of the connection.
The pgcrypto module has a huge range of encryption options, so we will only demonstrate the simplest
use case: encrypting a column of data using a symmetric cipher.
• First, enable pgcrypto by loading the contrib SQL file, either in PgAdmin or psql.
pgsql/8.4/share/postgresql/contrib/pgcrypto.sql
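On PostgreSQL 9.1 and later the module is packaged as an extension, so the contrib file does not need to be loaded by hand. The sketch below enables it that way and round-trips a phrase through a symmetric cipher. The key 'mykey', the test phrase and the choice of AES are illustrative.

```sql
-- Enable the module (PostgreSQL 9.1 and later).
CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Encrypt a phrase with AES; the result is a bytea value.
SELECT encrypt('this is a test phrase', 'mykey', 'aes');

-- Round trip: decrypt with the same key and cipher, then turn
-- the bytea result back into text.
SELECT convert_from(
  decrypt(encrypt('this is a test phrase', 'mykey', 'aes'), 'mykey', 'aes'),
  'UTF8'
);
```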
32.3 Authentication
PostgreSQL supports many different authentication methods, to allow easy integration into existing
enterprise architectures. For production purposes, the following methods are commonly used:
• Password is the basic system where the passwords are stored by the database, with MD5
encryption.
• Kerberos is a standard enterprise authentication method, which is used by both the GSSAPI and
SSPI schemes in PostgreSQL. Using SSPI, PostgreSQL can authenticate against Windows servers.
• LDAP is another common enterprise authentication method. The OpenLDAP server bundled with
most Linux distributions provides an open source implementation of LDAP.
• Certificate authentication is an option if you expect all client connections to be via SSL and are
able to manage the distribution of keys.
• PAM authentication is an option if you are on Linux or Solaris and use the PAM scheme for
transparent authentication provision.
Authentication methods are controlled by the pg_hba.conf file. The „HBA“ in the file name stands
for „host based access“, because in addition to allowing you to specify the authentication method to use
for each database, it allows you to limit host access using network addresses.
Here is an example pg_hba.conf file:
and only to the nyc database. Depending on the security of your network, you will use more or less strict
versions of these rules in your production set-up.
32.4 Links
• PostgreSQL Authentication
• PostgreSQL Encryption
• PostgreSQL SSL Support
PostgreSQL Schemas
Production databases inevitably have a large number of tables and views, and managing them all in one
schema can become unwieldy quickly. Fortunately, PostgreSQL includes the concept of a "schema".
Schemas are like folders, and can hold tables, views, functions, sequences and other relations. Every
database starts out with one schema, the public schema.
Inside that schema, the default install of PostGIS creates the geometry_columns,
geography_columns and spatial_ref_sys metadata relations, as well as all the types
and functions used by PostGIS. So users of PostGIS always need access to the public schema.
In the public schema you can also see all the tables we have created so far in the workshop.
Let’s create a new schema and move a table into it. First, create a new schema in the database:
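A sketch of both steps, using the census schema name and the nyc_census_blocks table that the following paragraphs refer to:

```sql
-- Create the new schema ...
CREATE SCHEMA census;

-- ... and move the table into it.
ALTER TABLE nyc_census_blocks SET SCHEMA census;
```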
If you’re using the psql command-line program, you’ll notice that nyc_census_blocks has
disappeared from your table listing now! If you’re using PgAdmin, you might have to refresh your view to
see the new schema and the table inside it.
You can access tables inside schemas in two ways:
• by referencing them using schema.table notation
• by adding the schema to your search_path
Explicit referencing is easy, but it gets tiring to type after a while:
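For example, assuming the table was moved into the census schema:

```sql
SELECT * FROM census.nyc_census_blocks LIMIT 1;
```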
Manipulating the search_path is a nice way to provide access to tables in multiple schemas without
lots of extra typing.
You can set the search_path at run time using the SET command:
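A sketch for the current session:

```sql
-- Search census first, then public.
SET search_path = census, public;
```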
This ensures that all references to relations and functions are searched in both the census and the
public schemas. Remember that all the PostGIS functions and types are in public so we don’t want
to drop that from the search path.
Setting the search path every time you connect can get tiring too, but fortunately it’s possible to
permanently set the search path for a user:
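A sketch for the postgres user; the setting takes effect in that user's new sessions:

```sql
ALTER USER postgres SET search_path = census, public;
```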
Now the postgres user will always have the census schema in their search path.
Users like to create tables, and PostGIS users particularly so: analysis operations in SQL call for
temporary tables for visualization or interim results, so spatial SQL users tend to need
CREATE privileges more often than users of ordinary database workloads.
By default, every role in Oracle is given a personal schema. This is a nice practice to use for PostgreSQL
users too, and is easy to replicate using PostgreSQL roles, schemas, and search paths.
Create a new user with table creation privileges (see PostgreSQL Security for information about the
postgis_writer role), then create a schema with that user as the authorization:
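A possible sketch of these two statements follows. The user name myuser is a placeholder chosen for illustration, and the postgis_writer role is assumed to exist already. IN ROLE makes the new user a member of postgis_writer, so the user inherits that role's privileges:

```sql
-- "myuser" is a hypothetical user name; substitute your own.
CREATE USER myuser IN ROLE postgis_writer;

-- Create a schema with the same name as the user, owned by the user.
CREATE SCHEMA myuser AUTHORIZATION myuser;
```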
If you log in as that user, you’ll find the default search_path for PostgreSQL is actually this:
show search_path;
search_path
----------------
"$user",public
The first schema on the search path is the user’s named schema! So now the following conditions exist:
• The user exists, with the ability to create spatial tables.
• The user’s named schema exists, and the user owns it.
• The user’s search path has the user schema first, so new tables are automatically created there, and
queries automatically search there first.
That’s all there is to it: the user’s default work area is now nicely separated from any tables in other
schemas.
PostgreSQL Backup and Restore
There are lots of ways to back up a PostgreSQL database, and the one you choose will depend a great
deal on how you are using the database.
• For relatively static databases, the basic pg_dump/pg_restore tools can be used to take periodic
snapshots of the data.
• For frequently changing data, using an "online backup" scheme allows continuous archiving of
updates to a secure location.
Online backup is the basis for replication and stand-by systems for high availability, particularly for
versions of PostgreSQL >= 9.0.
As discussed in PostgreSQL Schemas, ensuring that production data is always stored in separate schemas
is a very important best practice in managing data. There are two reasons:
• Backing up and restoring data in schemas is much simpler than managing lists of tables to be
backed up individually.
• Keeping data tables out of the "public" schema allows far easier upgrades, as discussed in Software
Upgrades.
Backing up a full database is easy using the pg_dump utility. The utility is a command-line tool, which
makes it easy to automate with scripting, and it can also be invoked via a GUI in the PgAdmin utility.
To back up our nyc database using the GUI, right-click the database you want to back up.
Note that there are three backup format options: compress, tar and plain.
• Plain is just a textual SQL file. This is the simplest format and in many ways the most flexible,
since it can be edited or altered easily and then loaded back into a database, allowing offline
changes to things like ownership or other global information.
• Tar uses a UNIX archive format to hold components of the dump in separate files. Using the tar
format allows the pg_restore utility to selectively restore parts of the dump.
• Compress is like the Tar format, but compresses the internal components individually, allowing
them to be selectively restored without decompressing the entire archive.
We’ll check the Compress option and go, saving out a backup file.
The same operation can be done from the command line with pg_dump, using its custom output
format (--format=c).
Because the backup file is in Compress format, we can view the contents using the pg_restore command
to list the manifest. In the PgAdmin GUI, "View" is an option in the panel.
When you look at the manifest, one of the things you might notice is that there are a lot of "FUNCTION"
signatures in there.
That’s because the pg_dump utility dumps every non-system object in the database, and that includes
the PostGIS function definitions.
Note: PostgreSQL 9.1+ includes an "EXTENSION" feature that allows add-on packages like
PostGIS to be installed as registered system components and therefore excluded from pg_dump output.
PostGIS 2.0 and higher support installation using this extension system.
We can see the same manifest from the command line using pg_restore directly, with its --list option.
The problem with a dump file full of PostGIS function signatures is that we really wanted a dump of our
data, not our system functions.
Since every object is in the dump file, we can restore to a blank database and get full functionality. In
doing so, we are expecting that the system we are restoring to has exactly the same version of PostGIS as
the one we dumped from (since the function signature definitions reference a particular version of the
PostGIS shared library).
From the command line, the restore consists of creating a blank database and running pg_restore
against it with the backup file.
Dumping just data, without function signatures, is where having data in schemas is handy, because there
is a command-line flag, --schema, to dump only a particular schema.
Now when we list the contents of the dump, we see just the data tables we wanted:
;
; Archive created at Thu Aug 9 11:02:49 2012
; dbname: nyc
; TOC Entries: 11
; Compression: -1
; Dump Version: 1.11-0
; Format: CUSTOM
; Integer: 4 bytes
; Offset: 8 bytes
; Dumped from database version: 8.4.9
; Dumped by pg_dump version: 8.4.9
;
;
; Selected TOC Entries:
;
6; 2615 20091 SCHEMA - census postgres
146; 1259 19845 TABLE census nyc_census_blocks postgres
145; 1259 19843 SEQUENCE census nyc_census_blocks_gid_seq postgres
2691; 0 0 SEQUENCE OWNED BY census nyc_census_blocks_gid_seq postgres
2692; 0 0 SEQUENCE SET census nyc_census_blocks_gid_seq postgres
2681; 2604 19848 DEFAULT census gid postgres
2688; 0 19845 TABLE DATA census nyc_census_blocks postgres
2686; 2606 19853 CONSTRAINT census nyc_census_blocks_pkey postgres
2687; 1259 20078 INDEX census nyc_census_blocks_geom_gist postgres
Having just the data tables is handy, because it means we can restore to a database with any version of
PostGIS installed, as we talk about in Software Upgrades.
The pg_dump utility operates a database at a time (or a schema or table at a time, if you restrict it).
However, information about users is stored across an entire cluster; it’s not stored in any one database!
To back up your user information, use the pg_dumpall utility with the --globals-only flag.
You can also use pg_dumpall in its default mode to back up an entire cluster, but be aware that, as with
pg_dump, you will end up backing up the PostGIS function signatures, so the dump will have to be
restored against an identical software installation; it can’t be used as part of an upgrade process.
Online backup and restore allows an administrator to keep an extremely up-to-date set of backup files
without the overhead of repeatedly dumping the entire database. If the database is under frequent insert
and update load, then online backup might be preferable to basic backup.
Note: The best way to learn about online backup is to read the relevant sections of the PostgreSQL
manual on continuous archiving and point-in-time recovery. This section of the PostGIS workshop
will just provide a brief snapshot of online backup set-up.
Rather than continually write to the main data tables, PostgreSQL stores changes initially in "write-ahead
logs" (WAL). Taken together, these logs are a complete record of all changes made to a database.
Online backup consists of taking a copy of the database main data table, then taking a copy of each WAL
that is generated from then on.
When it is time to recover to a new database, the system starts on the main data copy, then replays all
the WAL files into the database. The end result is a restored database in the same state as the original at
the time of the last WAL received.
Because WAL files are being written anyway, and transferring copies to an archive server is computationally
cheap, online backup is an effective means of keeping a very up-to-date backup of a system without
resorting to intensive regular full dumps.
The first thing to do in setting up online backup is to create an archiving method. PostgreSQL archiving
methods are the ultimate in flexibility: the PostgreSQL backend simply calls a script specified in the
archive_command configuration parameter.
That means archiving can be as simple as copying the file to a network-mounted drive, or as complex
as encrypting and emailing the files to the remote archive. Any process you can script you can use to
archive the files.
To turn on archiving we will edit postgresql.conf, first turning on WAL archiving:
wal_level = archive
archive_mode = on
And then setting the archive_command to copy our archive files to a safe location (changing the
destination paths as appropriate):
# Unix
archive_command = 'test ! -f /archivedir/%f && cp %p /archivedir/%f'
# Windows
archive_command = 'copy "%p" "C:\\archivedir\\%f"'
It is important that the archive command not overwrite existing files, so the Unix command includes an
initial test to ensure that the files aren’t already there. It is also important that the command returns a
non-zero status if the copy process fails.
Once the changes are made you can re-start PostgreSQL to make them effective.
Once the archiving process is in place, you need to take a base back-up.
Put the database into backup mode (this doesn’t do anything to alter operation of queries or data updates,
it just forces a checkpoint and writes a label file indicating when the backup was taken).
SELECT pg_start_backup('/archivedir/basebackup.tgz');
For the label, using the path to the backup file is a good practice, as it helps you track down where the
backup was stored.
Copy the database to an archival location:
# Unix
tar cvfz /archivedir/basebackup.tgz ${PGDATA}
Finally, notify the database that the backup process is complete:
SELECT pg_stop_backup();
All these steps can of course be scripted for regular base backups.
The following steps, for restoring from the archive, are taken from the PostgreSQL manual on continuous
archiving and point-in-time recovery.
1. Stop the server, if it’s running.
2. If you have the space to do so, copy the whole cluster data directory and any tablespaces to a
temporary location in case you need them later. Note that this precaution will require that you
have enough free space on your system to hold two copies of your existing database. If you do not
have enough space, you should at least save the contents of the cluster’s pg_xlog subdirectory, as
it might contain logs which were not archived before the system went down.
3. Remove all existing files and subdirectories under the cluster data directory and under the root
directories of any tablespaces you are using.
4. Restore the database files from your file system backup. Be sure that they are restored with the
right ownership (the database system user, not root!) and with the right permissions. If you are
using tablespaces, you should verify that the symbolic links in pg_tblspc/ were correctly restored.
5. Remove any files present in pg_xlog/; these came from the file system backup and are therefore
probably obsolete rather than current. If you didn’t archive pg_xlog/ at all, then recreate it with
proper permissions, being careful to ensure that you re-establish it as a symbolic link if you had it
set up that way before.
6. If you have unarchived WAL segment files that you saved in step 2, copy them into pg_xlog/. (It
is best to copy them, not move them, so you still have the unmodified files if a problem occurs and
you have to start over.)
7. Create a recovery command file recovery.conf in the cluster data directory (see Chapter 26). You
might also want to temporarily modify pg_hba.conf to prevent ordinary users from connecting
until you are sure the recovery was successful.
8. Start the server. The server will go into recovery mode and proceed to read through the archived
WAL files it needs. Should the recovery be terminated because of an external error, the server can
simply be restarted and it will continue recovery. Upon completion of the recovery process, the
server will rename recovery.conf to recovery.done (to prevent accidentally re-entering recovery
mode later) and then commence normal database operations.
9. Inspect the contents of the database to ensure you have recovered to the desired state. If not, return
to step 1. If all is well, allow your users to connect by restoring pg_hba.conf to normal.
34.4 Links
• pg_dump
• pg_dumpall
• pg_restore
• PostgreSQL High Availability
• PostgreSQL High Availability Continuous Archiving and PITR
Software Upgrades
Because PostGIS resides within PostgreSQL, every PostGIS installation actually consists of two versions
of software: the PostgreSQL version and the PostGIS version. As a general principle, each version of
PostGIS can theoretically be run within a number of versions of PostgreSQL, and vice versa.
In practice, the exact version pair available will be dictated by the packager who has built your PostgreSQL
distribution. Most Linux packages include a couple of PostGIS versions for each PostgreSQL version
release, allowing the parts to be upgraded either independently or simultaneously, depending on your
preferences.
Upgrades can be considered in terms of upgrading each component.
For "minor upgrades", no special process is necessary. Simply install the new software, and re-start the
server.
For "major upgrades" there are two ways to carry out the upgrade.
Dump/Restore
Dumping and restoring involves converting all the data to a platform neutral format (text representations)
on dump, and back to native representations on restore, so it can be time consuming and CPU intensive.
However, if you are migrating to a new architecture or operating system, it’s a required process. It’s also
a time-tested and well-understood upgrade path, so if your database is not too big, there’s no reason not
to stick with it.
• Dump your data using pg_dumpall from the old database.
• Install the new version of PostgreSQL and the same version of PostGIS you are using in your
old database. You need to match the PostGIS version so that the dump file function definitions
reference an expected version of the PostGIS library.
• Initialize the new data area using the initdb program from the new software.
• Start the new server on the new data area.
• Restore the dump file using psql, since pg_dumpall writes a plain SQL script rather than an archive
that pg_restore can read.
pg_upgrade
The pg_upgrade utility allows PostgreSQL data directories to be upgraded without the requirement for
a dump/restore step. The utility cannot handle changes to the data files themselves, but handles the more
common and frequent changes to system tables that occur in PostgreSQL major upgrades.
Note: The full instructions for running the upgrade process are in the pg_upgrade web page at
the PostgreSQL site.
The pg_upgrade program expects to have access to both versions of PostgreSQL it is working with, the
old and the new version, so you will have to install them both.
• Install the new version of PostgreSQL you will be using.
• Install the same version of PostGIS you are using in the old PostgreSQL into the new PostgreSQL.
• Initialize the new PostgreSQL data area with the new copy of initdb.
• Ensure both the old and new PostgreSQL servers are turned off.
• Run pg_upgrade, making sure to use the binary from the new software installation.
pg_upgrade \
  --old-datadir "/var/lib/postgres/12/data" \
  --new-datadir "/var/lib/postgres/13/data" \
  --old-bindir "/usr/pgsql/12/bin" \
  --new-bindir "/usr/pgsql/13/bin"
PostGIS deals with minor and major upgrades through the EXTENSION mechanism. If you spatially enabled
your database using CREATE EXTENSION postgis, you can update your database using the same
functionality.
First, install the new software so it is available to the database.
Then, run the SQL to upgrade your PostGIS extension.
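A minimal sketch of that SQL, assuming the database was enabled with CREATE EXTENSION postgis and that the new PostGIS software is already installed on the server:

```sql
-- Update the extension to the default version provided by the
-- newly installed software.
ALTER EXTENSION postgis UPDATE;

-- Check which versions are now in use.
SELECT postgis_full_version();
```

To move to a specific version rather than the default, ALTER EXTENSION also accepts an UPDATE TO clause with a version string.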
Advanced Geometry Constructions
The nyc_subway_stations layer has provided us with lots of interesting examples so far, but there
is something striking about it:
Although it is a database of all the stations, it doesn’t allow easy visualization of routes! In this chapter
we will use advanced features of PostgreSQL and PostGIS to build up a new linear routes layer from the
station points.
In this picture, the stops are labelled with their unique gid primary key.
If we start at one of the end stations, the next station on the line seems to always be the closest. We can
repeat the process each time as long as we exclude all the previously found stations from our search.
There are two ways to run such an iterative routine in a database:
• Using a procedural language, like PL/PgSQL.
• Using recursive common table expressions.
Common table expressions (CTE) have the virtue of not requiring a function definition to run. Here’s
the CTE to calculate the route line of the 'Q' train, starting from the northernmost stop (where gid is
304).
WITH RECURSIVE next_stop(geom, idlist) AS (
(SELECT
geom,
ARRAY[gid] AS idlist
FROM nyc_subway_stations
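A complete recursive query along these lines might look like the following sketch. It assumes that the routes column holds a comma-separated list of route letters (as the string_to_array examples later in this chapter suggest) and picks out 'Q' stations with strpos; the details of the original query may differ.

```sql
WITH RECURSIVE next_stop(geom, idlist) AS (
    -- Seed: the known starting station at the northern end of the line.
    (SELECT
       geom,
       ARRAY[gid] AS idlist
     FROM nyc_subway_stations
     WHERE gid = 304)
    UNION ALL
    -- Step: from the station found last, take the nearest 'Q' station
    -- whose gid is not yet in the list of visited stations.
    (SELECT
       s.geom,
       array_append(n.idlist, s.gid) AS idlist
     FROM nyc_subway_stations s, next_stop n
     WHERE strpos(s.routes, 'Q') <> 0
       AND NOT n.idlist @> ARRAY[s.gid]
     ORDER BY ST_Distance(n.geom, s.geom) ASC
     LIMIT 1)
)
-- idlist grows by one entry per step, so its length gives the visit order.
SELECT ST_MakeLine(geom ORDER BY array_length(idlist, 1)) AS geom
FROM next_stop;
```

The recursion stops by itself: once every 'Q' station is in idlist, the step returns no row and no further iteration runs.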
Success!
Except, two problems:
• We are only calculating one subway route here, but we want to calculate all the routes.
• Our query includes a piece of a priori knowledge, the initial station identifier that serves as the
seed for the search algorithm that builds the route.
Let’s tackle the hard problem first, figuring out the first station on a route without manually eyeballing
the set of stations that make up the route.
Our 'Q' train stops can serve as a starting point. What characterizes the end stations of the route?
One answer is "they are the most northerly and southerly stations". However, imagine if the 'Q' train
ran from east to west. Would the condition still hold?
A less directional characterization of the end stations is "they are the furthest stations from the middle
of the route". With this characterization it doesn’t matter if the route runs north/south or east/west, just
that it runs in more-or-less one direction, particularly at the ends.
Since no heuristic finds the end points 100% of the time, let’s try this second rule out.
Note: An obvious failure mode of the "furthest from middle" rule is a circular line, like the
Circle Line in London, UK. Fortunately, New York doesn’t have any such lines!
To work out the end stations of every route, we first have to work out what routes there are! We find the
distinct routes.
WITH routes AS (
SELECT DISTINCT unnest(string_to_array(routes,',')) AS route
FROM nyc_subway_stations ORDER BY route
)
SELECT * FROM routes;
• string_to_array takes in a string and splits it into an array using a separator character. PostgreSQL
supports arrays of any type, so it’s possible to build arrays of strings, as in this case, but also of
geometries and geographies as we’ll see later in this example.
• unnest takes in an array and builds a new row for each entry in the array. The effect is to take a
"horizontal" array embedded in a single row and turn it into a "vertical" array with a row for each
value.
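To see the two functions in isolation, here is a small sketch that uses a literal string in place of the routes column. The value 'N,Q,R,W' is made up for illustration:

```sql
-- string_to_array turns 'N,Q,R,W' into the array {N,Q,R,W};
-- unnest then returns one row per array element.
SELECT unnest(string_to_array('N,Q,R,W', ',')) AS route;
```

This returns four rows, one for each of N, Q, R and W. The query above applies the same idea to every row of nyc_subway_stations and uses DISTINCT to remove the repeats.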
The result is a list of all the unique subway route identifiers.
route
-------
1
2
3
4
5
6
7
A
B
C
D
E
F
G
J
L
M
N
Q
R
S
V
W
Z
(24 rows)
We can build on this result by joining it back to the nyc_subway_stations table to create a new
table that has, for each route, a row for every station on that route.
WITH routes AS (
SELECT DISTINCT unnest(string_to_array(routes,',')) AS route
FROM nyc_subway_stations ORDER BY route
),
stops AS (
SELECT s.gid, s.geom, r.route
FROM routes r
JOIN nyc_subway_stations s
ON (strpos(s.routes, r.route) <> 0)
)
SELECT * FROM stops;
Now we can find the center point by collecting all the stations for each route into a single multi-point,
and calculating the centroid of that multi-point.
WITH routes AS (
SELECT DISTINCT unnest(string_to_array(routes,',')) AS route
FROM nyc_subway_stations ORDER BY route
),
stops AS (
SELECT s.gid, s.geom, r.route
FROM routes r
JOIN nyc_subway_stations s
ON (strpos(s.routes, r.route) <> 0)
),
centers AS (
SELECT ST_Centroid(ST_Collect(geom)) AS geom, route
FROM stops
GROUP BY route
)
SELECT * FROM centers;
The center point of the collection of 'Q' train stops looks like this:
So the northernmost stop, the end point, appears to also be the stop furthest from the center. Let’s
calculate the furthest point for every route.
WITH routes AS (
SELECT DISTINCT unnest(string_to_array(routes,',')) AS route
FROM nyc_subway_stations ORDER BY route
),
stops AS (
SELECT s.gid, s.geom, r.route
FROM routes r
JOIN nyc_subway_stations s
ON (strpos(s.routes, r.route) <> 0)
),
centers AS (
SELECT ST_Centroid(ST_Collect(geom)) AS geom, route
FROM stops
GROUP BY route
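One way to finish the furthest-point calculation is sketched below. The routes, stops and centers expressions are the ones built up so far (without the ORDER BY, which the later steps do not need); the first_stops expression and its use of DISTINCT ON are assumptions about how the last step could be written, not necessarily the original query.

```sql
WITH routes AS (
  SELECT DISTINCT unnest(string_to_array(routes,',')) AS route
  FROM nyc_subway_stations
),
stops AS (
  SELECT s.gid, s.geom, r.route
  FROM routes r
  JOIN nyc_subway_stations s
  ON (strpos(s.routes, r.route) <> 0)
),
centers AS (
  SELECT ST_Centroid(ST_Collect(geom)) AS geom, route
  FROM stops
  GROUP BY route
),
first_stops AS (
  -- DISTINCT ON keeps the first row of each route under the ORDER BY,
  -- which is the stop with the greatest distance from the route center.
  SELECT DISTINCT ON (s.route)
         s.gid, s.route, s.geom,
         ST_Distance(s.geom, c.geom) AS distance
  FROM stops s
  JOIN centers c ON (s.route = c.route)
  ORDER BY s.route, distance DESC
)
SELECT * FROM first_stops;
```

Each row of the result is a candidate starting station for one route, which can replace the hard-coded gid used as the seed of the recursive query.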
As usual, there are some problems with our simple understanding of the data:
• there are actually two 'S' (short distance "shuttle") trains, one in Manhattan and one in the Rockaways,
and we join them together because they are both called 'S';
• the '4' train (and a few others) splits at the end of one line into two terminuses, so the "follow one
line" assumption breaks and the result has a funny hook on the end.
Hopefully this example has provided a taste of some of the complex data manipulations that are possible
by combining the advanced features of PostgreSQL and PostGIS.
• PostgreSQL Arrays
• PostgreSQL Array Functions
• PostgreSQL Recursive Common Table Expressions
• PostGIS ST_MakeLine
37.1 Constructors
ST_MakePoint(Longitude, Latitude) Returns a new point. Note the order of the coordinates
(longitude then latitude).
ST_GeomFromText(WellKnownText, srid) Returns a new geometry from a standard WKT
string and srid.
ST_SetSRID(geometry, srid) Updates the srid on a geometry. Returns the same geometry.
This does not alter the coordinates of the geometry, it just updates the srid. This function is useful
for conditioning geometries created without an srid.
ST_Expand(geometry, Radius) Returns a new geometry that is an expanded bounding box of
the input geometry. This function is useful for creating envelopes for use in indexed searches.
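The constructors above can be tried out with a few one-line queries. This is only an illustrative sketch: the coordinates are made up, the SRID values 4326 and 26918 are arbitrary choices, and ST_AsText and ST_AsEWKT are used to make the results readable.

```sql
-- A point from longitude and latitude; it has no SRID yet.
SELECT ST_AsText(ST_MakePoint(-73.98, 40.75));

-- The same point with an SRID attached; the coordinates are unchanged.
SELECT ST_AsEWKT(ST_SetSRID(ST_MakePoint(-73.98, 40.75), 4326));

-- A geometry from WKT plus SRID, expanded by 10 units in every direction.
SELECT ST_AsText(ST_Expand(ST_GeomFromText('POINT(0 0)', 26918), 10));
```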
37.2 Outputs
37.3 Measurements
ST_Area(geometry) Returns the area of the geometry in the units of the spatial reference system.
ST_Length(geometry) Returns the length of the geometry in the units of the spatial reference
system.
ST_Perimeter(geometry) Returns the perimeter of the geometry in the units of the spatial reference
system.
ST_NumPoints(linestring) Returns the number of vertices in a linestring.
ST_NRings(polygon) Returns the number of rings in a polygon.
ST_NumGeometries(geometry) Returns the number of geometries in a geometry collection.
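As a quick illustration of the measurement functions, the following sketch runs them on small literal geometries. The shapes are made up for the example and have no SRID, so the results are in plain coordinate units.

```sql
-- A 10 x 10 square: area 100, perimeter 40.
SELECT ST_Area('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))'::geometry);
SELECT ST_Perimeter('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))'::geometry);

-- A single segment from (0,0) to (3,4): length 5, made of 2 vertices.
SELECT ST_Length('LINESTRING(0 0, 3 4)'::geometry);
SELECT ST_NumPoints('LINESTRING(0 0, 3 4)'::geometry);

-- A multi-point holding 2 geometries.
SELECT ST_NumGeometries('MULTIPOINT(0 0, 1 1)'::geometry);
```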
37.4 Relationships
Appendix B: Glossary
CRS A "coordinate reference system". The combination of a geographic coordinate system and a projected
coordinate system.
GDAL Geospatial Data Abstraction Library, pronounced "GOO-duhl", an open source raster access
library with support for a large number of formats, used widely in both open source and proprietary
software.
GeoJSON A format built on JSON ("JavaScript Object Notation") for encoding spatial features and their
geometries as text.
GIS Geographic information system or geographical information system captures, stores, analyzes,
manages, and presents data that is linked to location.
GML Geography Markup Language. GML is the OGC standard XML format for representing spatial
feature information.
JSON "JavaScript Object Notation", a text format that is very fast to parse in JavaScript virtual
machines. In spatial, the extended specification for GeoJSON is commonly used.
JSTL "JavaServer Pages Standard Tag Library", a tag library for JSP that encapsulates many of the standard
functions handled in JSP (database queries, iteration, conditionals) into a terse syntax.
JSP "JavaServer Pages", a scripting system for Java server applications that allows the interleaving of
markup and Java procedural code.
KML "Keyhole Markup Language", the spatial XML format used by Google Earth. Google Earth was
originally written by a company named "Keyhole", hence the (now obscure) reference in the name.
OGC The Open Geospatial Consortium (OGC) is a standards organization that develops specifications
for geospatial services.
OSGeo The Open Source Geospatial Foundation (OSGeo) is a non-profit foundation dedicated to the
promotion and support of open source geospatial software.
SFSQL The Simple Features for SQL (SFSQL) specification from the OGC defines the types and func-
tions that make up a standard spatial database.
SLD The Styled Layer Descriptor (SLD) specification from the OGC defines a format for describing
cartographic rendering of vector features.
SRID "Spatial reference ID", a unique number assigned to a particular "coordinate reference system".
The PostGIS table spatial_ref_sys contains a large collection of well-known srid values and text
representations of the coordinate reference systems.
SQL "Structured query language" is the standard means for querying relational databases.
SQL/MM SQL Multimedia; includes several sections on extended types, including a substantial section
on spatial types.
SVG "Scalable vector graphics" is a family of specifications of an XML-based file format for describing
two-dimensional vector graphics, both static and dynamic (i.e. interactive or animated).
WFS The Web Feature Service (WFS) specification from the OGC defines an interface for reading and
writing geographic features across the web.
WMS The Web Map Service (WMS) specification from the OGC defines an interface for requesting
rendered map images across the web.
WKB "Well-known binary". Refers to the binary representation of geometries described in the Simple
Features for SQL specification (SFSQL).
WKT "Well-known text". Can refer either to the text representation of geometries, with strings starting
"POINT", "LINESTRING", "POLYGON", etc., or to the text representation of a
CRS, with strings starting "PROJCS", "GEOGCS", etc. Well-known text representations are OGC
standards, but do not have their own specification documents. The first descriptions of WKT (for
geometries and for CRS) appeared in the SFSQL 1.0 specification.
Appendix C: License
This work is licensed under the Creative Commons Attribution-Share Alike, United States License. To
view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to
Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
Our attribution requirement is that you retain the visible copyright notices in all materials.