Published at Oct 31, 2024

Structured vs. Semi-Structured vs. Unstructured Data in PostgreSQL

Start supercharging your PostgreSQL today.

Written by Team Timescale

When building applications, you will encounter different kinds of data. Some data fits neatly into rows and columns, while other data may be more unpredictable or lack structure. PostgreSQL’s wide range of data type support lets you easily handle both—whether you're storing, querying, or manipulating that data.

In this blog, we will explore structured, semi-structured, and unstructured data and the basics of handling them in PostgreSQL.

Understanding Data Structures

Let’s start by defining each type of data, which will set the stage for the rest of the blog.

Structured data

Structured data is highly organized and easily stored in tables with rows and columns. Each column holds a specific data type, like integers, strings, or dates. This data type follows a fixed schema, making it easy to query using SQL.

Example: a typical customer table with predefined columns like id, name, email, and address.

Semi-structured data

Semi-structured data doesn’t conform to a rigid schema but still maintains some organizational properties, such as key-value pairs. JSON and XML are common examples. Semi-structured data offers more flexibility than structured data but is still queryable using specialized functions.

Example: a JSON object containing customer data where fields like phone or address can vary between records.

Unstructured data

Unstructured data is data that has no defined structure and doesn’t easily fit into a relational database. It could be anything from text documents, emails, images, or videos. Managing and searching unstructured data calls for specialized techniques, such as full-text search can use special indices or semantic search uses vector and vector distances to find similarity.

Example: a text field storing customer feedback or support tickets.

Developers often spend much of their time converting unstructured data into more structured types, i.e., parsing data. In a previous blog post, developer advocate Jônatas explained one of his methods to parse data using open-source tools, such as pgai. This extension enhances PostgreSQL with AI capabilities like embeddings and similarity search, allowing you to incorporate machine learning features seamlessly.

Handling Structured Data in PostgreSQL

Since PostgreSQL is a relational database, handling structured data is its superpower. Tables, schemas, columns, and constraints are built to work seamlessly with structured data, ensuring integrity, performance, and ease of querying.

Data types in PostgreSQL

PostgreSQL supports a rich set of data types. Here are some examples:

Numeric types: INTEGER, BIGINT, DECIMAL
Textual types: VARCHAR, TEXT
Date/time types: DATE, TIMESTAMP
Arrays: you can store arrays of any data type
UUID: globally unique identifiers for entities like users or orders
ENUM: useful for storing predefined sets of values, such as status codes (active, inactive).

Example: defining a structured customer table with various data types.

CREATE TABLE customers ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), name VARCHAR(100) NOT NULL, email VARCHAR(100) UNIQUE NOT NULL, phone VARCHAR(20), address TEXT, date_of_birth DATE, signup_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP );

Constraints and relationships

PostgreSQL enforces data integrity using constraints such as NOT NULL, UNIQUE, and FOREIGN KEY. These constraints ensure that the data is accurate and valid at all times.

Example: enforcing data integrity with constraints.

CREATE TABLE orders ( id SERIAL PRIMARY KEY, customer_id UUID REFERENCES customers(id), order_date DATE NOT NULL, total_amount DECIMAL(10, 2) CHECK (total_amount > 0) );

Here, the orders table references the customers table through the customer_id field, ensuring relational integrity between orders and customers.

Indexing for performance

By default, PostgreSQL leverages B-tree indexes to handle most indexing tasks. These indexes enhance query performance, especially with large structured datasets.

Example: creating an index on the email field.

CREATE INDEX idx_customers_email ON customers(email);

Working With Semi-Structured Data in PostgreSQL

Semi-structured data lies somewhere between the orderliness of structured data and the chaos of unstructured data. PostgreSQL’s support for JSON and XML makes it an ideal choice for storing and querying semi-structured data without needing to overhaul your schema every time the data format changes.

JSON and JSONB in PostgreSQL

PostgreSQL supports two JSON types:

JSON: stores data as raw text, preserving formatting, whitespace, and key ordering.
JSONB: stores data in a binary format, which is more efficient for indexing and querying, though it doesn’t preserve formatting or key ordering.

When to use JSON vs. JSONB

Use json if you need to preserve the exact input format or frequently insert/update JSON data.
Use jsonb for better performance when querying and indexing JSON data.

Storing semi-structured data

Let’s create an orders table that stores order details in a semi-structured format using JSONB:

CREATE TABLE orders ( id SERIAL PRIMARY KEY, customer_id UUID REFERENCES customers(id), order_date DATE NOT NULL, order_details JSONB );

In this setup, the order_details field can store various attributes for each order, such as the product, quantity, and price, which may differ between orders.

Inserting data:

INSERT INTO orders (customer_id, order_date, order_details) VALUES ('469bfefb-4ab9-471e-bfda-3e3422f43af4', '2024-10-09', '{"product": "Laptop", "quantity": 1, "price": 999.99}');

Querying JSON data

One of the advantages of semi-structured data is that you can query it just like structured data if your database can handle it natively. PostgreSQL provides operators and increasing functions with every new release to extract and manipulate JSONB data.

Example: fetch all orders where the product is a laptop.

SELECT customer_id, order_details->>'product' AS product_name FROM orders WHERE order_details->>'product' = 'Laptop';

Updating JSON fields

PostgreSQL also allows you to update specific fields inside a JSON document without rewriting the entire structure.

Example: update the price of a product in an order.

UPDATE orders SET order_details = jsonb_set(order_details, '{price}', '899.99') WHERE id = 1;

Indexing JSONB fields

Indexing JSON fields can dramatically improve performance for complex queries. PostgreSQL offers the GIN (Generalized Inverted Index) for JSONB data, making it fast to search nested objects or arrays.

Example: creating a GIN index on order_details.

CREATE INDEX idx_orders_order_details ON orders USING GIN (order_details);With this index, querying nested JSON fields will be much faster, even with large datasets.

Managing Unstructured Data in PostgreSQL

Unstructured data is the trickiest to manage because it doesn’t have a predefined schema or structure. PostgreSQL, however, provides several tools to handle this type of data, whether it’s large text, logs, or even binary data.

Storing unstructured text

You can store unstructured text in a TEXT field or even a BYTEA field for binary data. Here’s an example of how you can store customer feedback in an unstructured format.

Example: storing unstructured text data.

CREATE TABLE feedback ( id SERIAL PRIMARY KEY, customer_id UUID REFERENCES customers(id), feedback_content TEXT );

You can store long-form customer feedback in the feedback_content column.

Full-text search in PostgreSQL

For large amounts of unstructured text, PostgreSQL’s built-in full-text search is incredibly useful. It allows you to efficiently search for terms or phrases within the text.

To enable full-text search, you need to convert the text into a tsvector data type and use the GIN index.

Example: setting up a full-text search for the feedback_content column:

CREATE INDEX idx_feedback_content ON feedback USING GIN (to_tsvector('english', feedback_content));

SELECT customer_id, feedback_content FROM feedback WHERE to_tsvector('english', feedback_content) @@ to_tsquery('refund');

This query retrieves all customer feedback containing the word “refund.”

Storing binary data

For unstructured data like images or other binary files, PostgreSQL provides the BYTEA data type. You can store binary objects, such as images, directly in the database, although external storage solutions might be better for larger files.

Example: storing an image in a BYTEA field.

CREATE TABLE images ( id SERIAL PRIMARY KEY, image_data BYTEA );

-- Insert image data INSERT INTO images (image_data) VALUES (pg_read_binary_file('/path/to/image.jpg'));

Performance considerations for unstructured data

Storing large volumes of unstructured data requires careful attention to performance optimization. Here are some tips:

Index text data: use GIN indexes for full-text search to speed up querying large text fields.
External storage: for very large binary objects (like videos), consider using an external file system or cloud storage service (e.g., S3) and storing just the references (URLs) in PostgreSQL.

Performance Optimization and Best Practices

No matter the type of data you are dealing with, PostgreSQL provides several options for performance optimization:

1. Indexing strategies

For structured data, use B-tree indexes for columns involved in filtering or sorting. For JSONB fields, GIN indexes are the best option when querying nested objects or arrays.

2. Schema design

For structured data, follow good schema design principles such as normalization and using foreign keys to enforce relationships. For semi-structured data, balance flexibility with performance and ensure you don’t overload your JSON fields with too much data.

3. Regular maintenance

VACUUM is a process in PostgreSQL that removes dead tuples (row versions) left after updates or deletes, reclaiming space for reuse. (Check this blog post for some best practices on how to use VACUUM.) However, if a table grows too large, vacuuming can become time-consuming. One solution is using hypertables.

Hypertables partition data into smaller, more manageable chunks, which allows for more targeted vacuuming. Only the most active chunks need frequent vacuuming, reducing the overall time and system overhead. This structure also ensures accurate statistics for recent data, leading to better query planning and improved performance.

Advanced Tools and Extensions

PostgreSQL also has a rich ecosystem of extensions that can help you handle structured, semi-structured, and unstructured data more efficiently.

Pg_trgm for text search

The pg_trgm extension provides additional full-text search capabilities, which are particularly useful for fuzzy matching and similarity searches on unstructured data.

Example: fuzzy search using pg_trgm.

CREATE EXTENSION pg_trgm; SELECT customer_id, feedback_content FROM feedback WHERE feedback_content % 'refund'; -- % operator allows fuzzy search

Foreign Data Wrappers (FDW)

Foreign Data Wrappers are one of our favorite PostgreSQL extensions. They allow PostgreSQL to query external data sources like regular tables. For instance, you could use FDW to query unstructured data stored in Hadoop or MongoDB while still using PostgreSQL for structured and semi-structured data.

For those using a Timescale database, you can use FDWs to fetch and query data from other PostgreSQL databases in the same Timescale project or outside of Timescale, including time-series databases with hypertables (regular PostgreSQL tables that automatically partition your data, speeding up your queries). Read the docs to learn more.

Conclusion: Choosing the Right Data Type

Whether you are managing highly structured relational data, flexible semi-structured data in JSON, or unstructured data like text or binary files, PostgreSQL provides robust tools to handle them effectively.

Structured data: use traditional tables with well-defined schemas and B-tree indexing for fast, reliable queries.
Semi-structured data: Use JSONB fields for flexibility, but ensure proper indexing and performance tuning.
Unstructured data: take advantage of PostgreSQL’s full-text search and consider external storage for large binary files.

By understanding how to manage these types of data in PostgreSQL, you can build scalable and flexible applications that perform well and adapt to changing data requirements. To learn how you can parse data with ease, converting it from unstructured to structured, check out this article, where we use an open-source PostgreSQL extension called pgai (GitHub ⭐s welcome!).

Pgai brings AI workflows to PostgreSQL and is open source under the PostgreSQL License. You can find installation instructions on the pgai GitHub repository. You can also access pgai on any database service on Timescale’s cloud PostgreSQL platform.