Amazon Redshift Best Practices
Saman Irfan
Senior GTM Specialist Solutions Architect Analytics
AWS
© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Amazon Confidential and Trademark.
Discussion Topics
• Data Models
• Table Design Best Practices
• Data Lake Modelling Best Practices
"Redshift allows us to quickly spin up clusters and provide our data scientists with a fast and easy method to access data and generate insights"
Bradley Todd, Technology Architect, Liberty Mutual
Data Models
Redshift: Use Popular Data Models
Amazon Redshift Data Storage & Data Types
Data storage in Redshift
• Data loaded into Redshift is stored in Redshift Managed Storage (RMS) in columnar format
• Redshift does not require indexes or database hints. It relies instead on sort keys, distribution keys, and compression to achieve fast performance through parallelism and efficient data storage
• Data is organized as: Namespace > database > schema > objects
Redshift Datatypes
[Figure: scalar datatypes and vector datatype; the labels that survive include INT, DOUBLE PRECISION, NCHAR, BPCHAR, TIMETZ, and TIMESTAMPTZ]
Choosing Data Types: Best Practices
• Make columns only as wide as they need to be. Redshift
performance is about efficient I/O. Do not arbitrarily assign
maximum length/precision. This can slow down query execution
time.
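As a minimal sketch of the advice above, the table below sizes each column to the data it is expected to hold rather than to the type maximum. The table name, column names, and chosen widths are hypothetical and only illustrate the idea.

```sql
-- Hypothetical table: widths follow the expected data, not the type maximum.
CREATE TABLE customer_contact (
    customer_id   BIGINT NOT NULL,
    country_code  CHAR(2),         -- assumed two-letter code
    phone         VARCHAR(20),     -- rather than VARCHAR(65535)
    account_bal   DECIMAL(12,2),   -- rather than DECIMAL(38,18)
    created_date  DATE             -- a real date type, not a string
);

-- One way to check how wide a string column really needs to be
-- (VARCHAR lengths are in bytes, so measure bytes).
SELECT MAX(OCTET_LENGTH(phone)) AS max_phone_bytes
FROM customer_contact;
```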
SUPER datatype: Best Practices
• For low latency inserts or for small batch inserts, insert into SUPER. Inserts into SUPER datatype are quicker.
• If you join frequently using attributes stored in SUPER, create separate scalar datatype columns for those attributes to improve performance
• If you filter frequently using attributes stored in SUPER, create separate scalar datatype columns for those attributes to improve performance
• Use SUPER when your queries require strong consistency, predictable […]

/*customer-orders-lineitem*/
CREATE TABLE customer_orders_lineitem
(c_custkey bigint
,c_name varchar
,c_address varchar
,c_nationkey smallint
,c_phone varchar
,c_acctbal decimal(12,2)
,c_mktsegment varchar
,c_comment varchar
,c_orders super );

[Figure: a JSON record {…} stored as scalar columns usefulAttr1, usefulAttr2, …, usefulAttrN alongside the complete JSON record]
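The sketch below illustrates the pattern described above against the customer_orders_lineitem table: insert JSON into the SUPER column, navigate it, and promote frequently used attributes to scalar columns. The attribute names o_orderkey and o_orderstatus, the sample values, and the target table customer_first_order are assumptions made for illustration.

```sql
-- Small, low-latency insert: parse JSON text into the SUPER column.
INSERT INTO customer_orders_lineitem (c_custkey, c_name, c_orders)
VALUES (1, 'Customer#000000001',
        JSON_PARSE('[{"o_orderkey": 101, "o_orderstatus": "O"}]'));

-- Navigate into SUPER with bracket and dot notation.
SELECT c_custkey, c_orders[0].o_orderkey
FROM customer_orders_lineitem;

-- Promote attributes that are joined or filtered on often to scalar
-- columns, and keep the complete record in SUPER next to them.
CREATE TABLE customer_first_order AS
SELECT c_custkey,
       c_orders[0].o_orderkey::BIGINT        AS first_orderkey,
       c_orders[0].o_orderstatus::VARCHAR(1) AS first_orderstatus,
       c_orders
FROM customer_orders_lineitem;
```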
Table Design Best Practices
Redshift table design
THREE MAIN CONCEPTS
Automatic table optimization
TABLE OPTIMIZATION DONE AUTOMATICALLY FOR YOU – NO MANUAL
INTERVENTION NEEDED
Optimizations are applied to tables/columns when load on compute is less
Best Practice: Use auto options for compression, distribution and sort keys
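A minimal sketch of the auto options in DDL form follows. The table names sales_auto and sales_legacy and their columns are hypothetical.

```sql
-- New table: leave distribution, sorting, and encoding to Redshift.
CREATE TABLE sales_auto (
    sale_id   BIGINT,
    sold_at   TIMESTAMP,
    store_id  INTEGER,
    amount    DECIMAL(12,2)
)
DISTSTYLE AUTO
SORTKEY AUTO
ENCODE AUTO;

-- Existing (hypothetical) table with manual settings: hand control back.
ALTER TABLE sales_legacy ALTER DISTSTYLE AUTO;
ALTER TABLE sales_legacy ALTER SORTKEY AUTO;
ALTER TABLE sales_legacy ALTER ENCODE AUTO;

-- Review pending automatic table optimization recommendations.
SELECT * FROM svv_alter_table_recommendations;
```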
Compression/Encoding
Goals
• Allow more data to be stored
• Improve query performance by decreasing I/O
Impact
• Allows 3x to 4x more data to be stored
How it works
• Column data is persisted to 1 MB immutable blocks
• Blocks are individually encoded with 1 of 13 encodings
• A full block can contain millions of values after compression
[Figure: example table with columns aid, loc, dt]
Compression Best Practices
• Use default compression
• By default compression encoding is set to AUTO for all columns in
a table, which means Redshift automatically determines the best
compression encoding for that column
• Rely on that default compression
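To see which encodings Redshift has chosen, one option is to query the catalog, as in this sketch. The schema and table names are placeholders.

```sql
-- pg_table_def only lists tables in schemas on the search_path.
SET search_path TO public;

-- Show the encoding currently applied to each column of a
-- hypothetical table named sales_auto.
SELECT "column", type, encoding
FROM pg_table_def
WHERE schemaname = 'public'
  AND tablename  = 'sales_auto';
```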
Distribution Style
Distribution style is a table property that dictates how that table’s data is distributed
Goal
• Distribute data evenly for parallel processing
• Minimize data movement during query processing
Impact
• Minimizes data redistribution by achieving collocation
[Figure: KEY, ALL, and EVEN distribution styles, each shown across Slice 1 to Slice 4]
Distribution Best Practices
• Use KEY style distribution for tables that are frequently joined
• Use high cardinality join column as the distribution key.
• Avoid date columns as the distribution key.
• When joining a fact table with multiple dimension tables, use the same distribution key for the fact table and the large dimension table to get a co-located join.
• Use ALL style distribution for small tables (<= 5 million rows)
• Use EVEN style distribution if a table is largely denormalized and does
not participate in joins, or if you don't have a clear choice for another
distribution style
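The three styles above can be sketched in DDL as follows. All table and column names are hypothetical, and which style suits a given table depends on its size and join pattern.

```sql
-- KEY: fact table and its large dimension share a high-cardinality join key,
-- so matching rows land on the same slice (co-located join).
CREATE TABLE orders_fact (
    order_id     BIGINT,
    customer_id  BIGINT,
    order_date   DATE,
    amount       DECIMAL(12,2)
)
DISTSTYLE KEY DISTKEY (customer_id);

CREATE TABLE customer_dim (
    customer_id    BIGINT,
    customer_name  VARCHAR(100)
)
DISTSTYLE KEY DISTKEY (customer_id);

-- ALL: small dimension copied to every node.
CREATE TABLE region_dim (
    region_id    SMALLINT,
    region_name  VARCHAR(50)
)
DISTSTYLE ALL;

-- EVEN: denormalized table that does not participate in joins.
CREATE TABLE clickstream_flat (
    event_time  TIMESTAMP,
    page_url    VARCHAR(500),
    user_agent  VARCHAR(300)
)
DISTSTYLE EVEN;

-- Check the resulting distribution style and row skew.
SELECT "table", diststyle, skew_rows
FROM svv_table_info
WHERE "table" IN ('orders_fact', 'customer_dim', 'region_dim', 'clickstream_flat');
```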
Data sorting
Redshift uses sort keys to physically order data on disk
Goal
• Make queries run faster by increasing the effectiveness of zone maps and reducing I/O
Impact
• Enables range-restricted scans to prune blocks by leveraging zone maps
Sort Key Best Practices
• Don’t use sort keys for small tables (<= 5 million rows)
• If you decide to fine-tune a large table by choosing a sort key yourself:
  • Pick the column or columns most commonly used in filters as the SORT KEY. For example, if you query the most recent data frequently, the date column is the most appropriate sort key
  • If you frequently join a table, specify the join column as both the sort key and the distribution key on both tables. This allows a merge join, which is faster than the hash join that would otherwise be used.
  • Don’t put more than 4 columns in the SORT KEY. Beyond 4, the additional columns add no benefit
  • When the SORT KEY has more than one column, the column order matters
    • Effective sort key order is lower to higher cardinality
    • Low cardinality columns come first, high cardinality columns come last
  • Always use the leading sort key column in the filter condition
• Don’t apply compression encoding on sort key columns
• Don’t apply functions to the SORT KEY column in query filters. For example, if the business_date column is the SORT KEY, don’t filter on to_char(business_date,'YYYY') = '2023'
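A minimal sketch of these points, using a hypothetical daily_sales table: the sort key column is left unencoded, and the filter compares the bare column to a range instead of wrapping it in a function.

```sql
-- Hypothetical table sorted on the column most often used in filters.
CREATE TABLE daily_sales (
    business_date  DATE ENCODE RAW,   -- sort key column left uncompressed
    store_id       INTEGER,
    amount         DECIMAL(12,2)
)
SORTKEY (business_date);

-- Avoid: the function hides the sort key from zone-map pruning.
--   WHERE TO_CHAR(business_date, 'YYYY') = '2023'

-- Prefer: a range predicate on the bare sort key column.
SELECT SUM(amount) AS total_2023
FROM daily_sales
WHERE business_date >= '2023-01-01'
  AND business_date <  '2024-01-01';
```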
Materialized Views
Materialized Views
• Improve performance of complex, SLA sensitive, predictable and repeated queries using Materialized views
• Materialized view persists the result set of the associated SQL
• Materialized views can be refreshed automatically or manually
• Redshift automatically determines best way to update data in the materialized view (incremental or full refresh)
• Automatic query rewrite leverages relevant materialized views and can improve query performance by order(s) of magnitude
• Automated materialized views: Redshift continuously monitors workload to identify queries that will benefit from having a MV and automatically creates and manages MVs for them
• Incremental Materialized views on external data lake tables: Materialized views in Redshift offer cost-effective incremental updates for external data lake tables, avoiding full recomputation.

Redshift Materialized Views
Materialized views can be created using the CREATE statement, and can be included (default) or excluded from Redshift backups. Materialized views can also have table attributes such as dist style and sort keys, and be refreshed at any time

CREATE MATERIALIZED VIEW mv_name
[ BACKUP { YES | NO } ]
[ table_attributes ]
AS query

REFRESH MATERIALIZED VIEW mv_name;
Materialized Views – Best Practices
• Create materialized views that can be incrementally refreshed in order
to avoid full refresh.
• Schedule manual refresh for nested materialized views or those not
eligible for auto refresh.
• Follow query best practices when writing Materialized View queries.
• Follow table design best practices on distribution style and sort key
when creating the Materialized View.
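As an illustration of these practices, the sketch below defines a simple aggregate view, a shape that is typically eligible for incremental refresh, with explicit distribution and sort keys. It assumes a hypothetical base table daily_sales(business_date, store_id, amount); the view name is also made up.

```sql
-- Aggregate MV with table attributes and automatic refresh.
CREATE MATERIALIZED VIEW mv_daily_store_sales
DISTSTYLE KEY DISTKEY (store_id)
SORTKEY (business_date)
AUTO REFRESH YES
AS
SELECT business_date,
       store_id,
       SUM(amount) AS total_amount,
       COUNT(*)    AS sale_count
FROM daily_sales
GROUP BY business_date, store_id;

-- Manual refresh, for example from a scheduled job.
REFRESH MATERIALIZED VIEW mv_daily_store_sales;

-- Inspect staleness and refresh settings.
SELECT name, state, is_stale, autorefresh
FROM stv_mv_info
WHERE name = 'mv_daily_store_sales';
```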
Data Lake Modelling Best Practices
Data Modeling for Data Lake Queries
• With Data Lakes, tables are collections of files
• Set the table statistics (numRows) manually for Amazon S3 external tables.
• Avoid very large files (larger than 512 MB) and large numbers of small, KB-sized files.
  • For files that support parallel reads, keep file sizes between 128 MB and 1 GB.
  • For files that do not support parallel reads, keep file sizes between 64 MB and 128 MB.
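Setting numRows can be sketched as below. The external schema spectrum_schema, the table sales_ext, and the row count are placeholders; use a value close to the real row count of the external table.

```sql
-- Tell the planner roughly how many rows the external table holds.
ALTER TABLE spectrum_schema.sales_ext
SET TABLE PROPERTIES ('numRows' = '170000');

-- Confirm the property was recorded on the external table.
SELECT schemaname, tablename, parameters
FROM svv_external_tables
WHERE schemaname = 'spectrum_schema'
  AND tablename  = 'sales_ext';
```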
Data Lake Queries Best practices
• Optimize query cost with query monitoring rules (QMR) such as spectrum_scan_size_mb or spectrum_scan_row_count, which also set performance boundaries on data lake queries.
• Use GROUP BY clause - Replace complex DISTINCT operations with GROUP BY in your queries.
• Monitor and control your Amazon Redshift Spectrum usage and costs using usage limits.
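The GROUP BY rewrite can be sketched as follows, with a follow-up query to see how much data the data lake scan read. The external table spectrum_schema.sales_ext and its columns are hypothetical.

```sql
-- Instead of:
--   SELECT DISTINCT store_id, business_date FROM spectrum_schema.sales_ext;
-- use the equivalent GROUP BY form.
SELECT store_id, business_date
FROM spectrum_schema.sales_ext
GROUP BY store_id, business_date;

-- Check rows and bytes scanned in Amazon S3 by the query just run.
SELECT query, segment, elapsed, s3_scanned_rows, s3_scanned_bytes
FROM svl_s3query_summary
WHERE query = pg_last_query_id()
ORDER BY query, segment;
```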
Amazon Redshift Data Sharing Best Practices
Data Sharing in Amazon Redshift
• Secure, live data access across clusters, accounts, and regions
Sharing Levels
• Database, schemas, tables, views and SQL UDFs
• Fine-grained access control
Data Consistency
• Transactional consistency across producer and consumer clusters
• Immediate availability of committed changes
[Figure: hub-and-spoke and data mesh sharing topologies]
Data Sharing Queries Best practices
Performance Best Practices
• Size consumer cluster compute capacity appropriately for read query performance.
• For frequently updated data, create and share materialized views from the producer cluster.
• For slowly changing data, share tables and build materialized views on the consumer cluster.
• Be aware of potential performance differences in cross-region data sharing due to network latency.
• Utilize Concurrency Scaling on both producer and consumer clusters for read/write operations.
• Use VACUUM RECLUSTER instead of full VACUUM for maintenance, especially on large shared objects.
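A rough sketch of sharing a materialized view from the producer and reading it on the consumer is shown below. The datashare, database, and object names are hypothetical, and the namespace values are placeholders to be replaced with the real namespace GUIDs.

```sql
-- On the producer: create the datashare and add objects to it.
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA public;
ALTER DATASHARE sales_share ADD TABLE public.mv_daily_store_sales;
GRANT USAGE ON DATASHARE sales_share
    TO NAMESPACE '<consumer-namespace-guid>';

-- On the consumer: expose the datashare as a local database and query it.
CREATE DATABASE sales_db
    FROM DATASHARE sales_share OF NAMESPACE '<producer-namespace-guid>';

SELECT *
FROM sales_db.public.mv_daily_store_sales
LIMIT 10;

-- On the producer: lighter-weight maintenance for a large shared table.
VACUUM RECLUSTER public.daily_sales;
```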
Workload Management Best Practices
Workload management
Allows for the separation of different query workloads
Goal
Prioritize important queries
Throttle / abort less important queries
Types of workload management
Manual WLM Auto WLM
Workload management: Best Practices
• Use Auto WLM if your workload is highly unpredictable, or you are using default WLM.
• To maximize query throughput, ensure that the total number of concurrent queries is 15 or fewer.
• Save the superuser queue for administration tasks and canceling queries.
Data Load
Data Ingestion: AWS Services
Ingestion modes: Batch, Near Real Time
S3 file
• COPY command
• Redshift Spectrum
• COPY job – S3 triggered
Amazon Data Exchange (ADX)
• Third Party Data available in ADX (Batch and Near Real Time)
COPY Command
• Used to ingest data from
  • Amazon S3 (most common source)
  • Amazon EMR
  • Amazon DynamoDB
  • Remote host (SSH)
• Encryption
[Figure: RA3.4XL compute with slices 0 to 3 loading 1 input file]
COPY Command Best Practices
• Use the COPY command to load data whenever possible.
• Use a single COPY command per table.
• When using COPY, avoid loading from many small files or large non-splittable files.
• For large amounts of data, load in small sequential blocks according to sort order. This:
  • eliminates the need to vacuum.
  • uses much less intermediate sort space during each load, and makes it easier to restart if the COPY fails and is rolled back.
• For data with a fixed retention period, organize your data as a sequence of time-series tables.
[Figure: RA3.4XL compute with slices 0 to 3]
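The sketch below shows one COPY per table loading every file under an S3 prefix, followed by a time-series layout. The bucket, prefix, IAM role ARN, and table names are placeholders, and the CSV options are assumptions about the file format.

```sql
-- One COPY per table: all files under the prefix are loaded in parallel.
COPY daily_sales
FROM 's3://example-bucket/daily_sales/2023-01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftLoadRole'
FORMAT AS CSV
GZIP
IGNOREHEADER 1;

-- Time-series tables for data with a fixed retention period:
-- one table per month, combined through a view.
CREATE TABLE sales_2023_01 (LIKE daily_sales);
CREATE TABLE sales_2023_02 (LIKE daily_sales);

CREATE OR REPLACE VIEW sales_all AS
SELECT * FROM sales_2023_01
UNION ALL
SELECT * FROM sales_2023_02;
```

With this layout, expiring a month means redefining the view without that table and then dropping the table, rather than deleting rows and vacuuming.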
Post load Best Practices
(AUTO) VACUUM
• The VACUUM process runs either manually or automatically in the background
• Goals
• VACUUM will remove rows that are marked as deleted
• VACUUM will globally sort tables
• For tables with a sort key, ingestion operations will locally sort new data and write it into the
unsorted region
• Best practices
• VACUUM should be run only as necessary
• For the majority of workloads, AUTO VACUUM DELETE will reclaim space and
AUTO TABLE SORT will sort the needed portions of the table
• In cases where you know your workload – VACUUM can be run manually
• Run vacuum operations on a regular schedule
• Run VACUUM RECLUSTER on large tables
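For cases where a manual vacuum is warranted, the following sketch first checks which tables need attention and then runs the narrower VACUUM variants. The table name daily_sales and the 95 percent threshold are illustrative.

```sql
-- Find tables with a large unsorted region or stale statistics.
SELECT "table", unsorted, stats_off, tbl_rows
FROM svv_table_info
ORDER BY unsorted DESC NULLS LAST;

-- Reclaim space from deleted rows without sorting.
VACUUM DELETE ONLY daily_sales;

-- Sort without reclaiming space, stopping once 95 percent is sorted.
VACUUM SORT ONLY daily_sales TO 95 PERCENT;

-- On a large table, sort only the unsorted portion.
VACUUM RECLUSTER daily_sales;
```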
(AUTO) ANALYZE
• The ANALYZE process collects table statistics for optimal query planning
• In the vast majority of cases, AUTO ANALYZE automatically handles statistics gathering
• Best practices
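The slide's own list is not preserved here. As a hedged illustration only, these are common ways to run ANALYZE manually when AUTO ANALYZE is not enough, for example right after a large load; the table and column names are hypothetical.

```sql
-- Refresh statistics for the whole table.
ANALYZE daily_sales;

-- Limit the work to columns that have been used as predicates.
ANALYZE daily_sales PREDICATE COLUMNS;

-- Or analyze only specific columns, such as sort and join keys.
ANALYZE daily_sales (business_date, store_id);
```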
Amazon Redshift Advisor
Amazon Redshift Advisor
To improve performance and decrease operating costs, Amazon Redshift Advisor analyzes performance and usage metrics and offers specific recommendations. Advisor ranks recommendations by order of impact.
• Amazon Redshift Advisor is available in the Amazon Redshift console
Best Practices for Multitenancy Architectures
Multi Tenant Strategy

Multi Cluster Model (Cluster1/Workgroup1 and Cluster2/Workgroup2, connected by data sharing)
• Workload Type: Completely isolated workloads, yet share data among tenants; distributed data repositories
• Scalability: Highly scalable model
• Cross-tenant R/W: Supported for Reads. No writes
• Examples: Separating ETL, BI workloads; Dev, QA, PROD DBs

Multi Database Model
• Workload Type: Workloads having different security policies, isolation levels, collation
• Scalability: Scaling is limited to cluster
• Cross-tenant R/W: Supported for Reads. No writes
• Examples: Independent business units with no/limited cross querying; Banking DB and Insurance DB

Multi Schema Model
• Workload Type: Workloads requiring common security policies, same isolation & collation, frequent queries across tenants
• Scalability: Scaling is limited to cluster
• Cross-tenant R/W: Both Reads/Writes
• Examples: Multiple departments, functional units often cross query; Sales, Finance, Marketing

Multi ID Model
• Workload Type: Workloads requiring same storage constructs, with access control through RLS; same tables, views across tenants
• Scalability: Scaling is limited to cluster
• Cross-tenant R/W: Both Reads/Writes
• Examples: Each tenant is a persona accessing data from same storage structure
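For the Multi ID model, access control with row-level security (RLS) might look like the sketch below. The table, policy, and role names are hypothetical, and matching tenant_id against current_user is an assumed convention in which each tenant connects as a database user named after its tenant ID.

```sql
-- Shared table holding rows for every tenant.
CREATE TABLE tenant_orders (
    tenant_id  VARCHAR(64),
    order_id   BIGINT,
    amount     DECIMAL(12,2)
);

-- Role granted to tenant users.
CREATE ROLE tenant_reader;
GRANT SELECT ON TABLE tenant_orders TO ROLE tenant_reader;

-- Policy: a user sees only rows whose tenant_id equals their user name.
CREATE RLS POLICY tenant_isolation
WITH (tenant_id VARCHAR(64))
USING (tenant_id = current_user);

ATTACH RLS POLICY tenant_isolation ON tenant_orders TO ROLE tenant_reader;

-- Turn row-level security on for the table.
ALTER TABLE tenant_orders ROW LEVEL SECURITY ON;
```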