Published at Oct 16, 2024

A Guide to Cosine Similarity

Written by Haziqa Sajid

Embeddings bridge the gap between human language and AI comprehension by representing words or sentences as points in a multidimensional space that captures their meanings. Various metrics, such as cosine similarity, are then used to measure how closely related these points are in that space.

Cosine similarity is a core computation that powers many systems in natural language processing (NLP). It gives computers a practical way to compare the meaning of pieces of text. In this article, we discuss cosine similarity, explore its connection with NLP, and review tools that enable cosine similarity computations.

What Is Cosine Similarity?

Cosine similarity calculates the cosine of the angle between two vectors, revealing how closely the vectors are aligned. For instance, words like "cat" and "dog" will have a higher cosine similarity than "cat" and "car."

Inner product and cosine angle

The inner product (or dot product) of two vectors, X and Y, involves multiplying their corresponding components and summing the results. For example, if X and Y are 100-dimensional vectors:

\[ X = (x_1, x_2, \ldots, x_{100}), \qquad Y = (y_1, y_2, \ldots, y_{100}) \]

The inner product ⟨X, Y⟩ is calculated as:

\[ \langle X, Y \rangle = \sum_{i=1}^{100} x_i y_i \]

Geometrically, this inner product relates to the cosine of the angle θ between X and Y:

\[ \langle X, Y \rangle = |X| \, |Y| \cos\theta \]

|X| and |Y| represent the vectors' Euclidean lengths (or magnitudes). The length |X| is given by:

\[ |X| = \sqrt{\sum_{i=1}^{100} x_i^2} \]

Thus, cosine similarity is computed as:

\[ \text{Cosine\_sim}(X, Y) = \cos\theta = \frac{\langle X, Y \rangle}{|X| \, |Y|} \]

This computation uses basic operations like multiplication, addition, and square roots.
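
As a small illustration of the formula, the sketch below computes the inner product, the two lengths, and their ratio using only the Python standard library. The function names (`inner_product`, `length`, `cosine_sim`) and the sample vectors are choices made for this example, not part of any library.

```python
import math

def inner_product(x, y):
    # Multiply corresponding components and sum the results
    return sum(a * b for a, b in zip(x, y))

def length(x):
    # Euclidean length: square root of the sum of squared components
    return math.sqrt(inner_product(x, x))

def cosine_sim(x, y):
    # Undefined if either vector is all zeros, since its length is 0
    return inner_product(x, y) / (length(x) * length(y))

X = [3.0, 4.0]
Y = [4.0, 3.0]
print(inner_product(X, Y))   # 3*4 + 4*3 = 24.0
print(length(X), length(Y))  # 5.0 5.0
print(cosine_sim(X, Y))      # 24 / 25 = 0.96
```

The same three functions work unchanged for 100-dimensional vectors; only the number of terms in each sum grows.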

Explanation of cosine similarity

Cosine similarity is always in the interval [-1, 1]:

  • Cosine_sim = 1: The angle θ is 0, meaning the vectors point in the same direction (one is a positive scalar multiple of the other).

  • Cosine_sim = 0: The angle θ is 90 degrees, meaning the vectors are orthogonal, which often indicates they are unrelated in context.

  • Cosine_sim = -1: The angle θ is 180 degrees, meaning the vectors are aligned but point in opposite directions.

While cosine similarity effectively compares the direction of vectors, it does not define a proper distance between them.
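
The three boundary cases can be checked with small two-dimensional vectors. The sketch below uses a `cosine_sim` helper written for this example. The first case also hints at why cosine similarity is not a proper distance: the vectors [1, 0] and [2, 0] are different, yet their similarity is exactly 1 because only direction is compared.

```python
import math

def cosine_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

same = cosine_sim([1, 0], [2, 0])        # same direction, different length
orthogonal = cosine_sim([1, 0], [0, 1])  # 90 degrees apart
opposite = cosine_sim([1, 0], [-1, 0])   # 180 degrees apart

print(same, orthogonal, opposite)  # 1.0 0.0 -1.0
```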

How Does Cosine Similarity Relate to Natural Language Processing?

In natural language processing, embeddings transform text data into high-dimensional vectors, capturing the meaning of words and phrases. These vectors encode the frequency and context of words within a text corpus. The vector for a single word is designed to be cosine-similar to the vectors of phrases or contexts where it frequently appears. This is achieved by training neural networks, such as word2vec, to create these embedding functions.

When applying semantic embedding to a body of text, cosine similarity effectively measures the relatedness of meanings between vectors. Vectors with cosine similarity values close to one correspond to texts that have similar meanings.

While other vector comparison metrics, like cosine distance or Euclidean distance, exist, cosine similarity is particularly popular for embedding vectors for several reasons:

  • Scalability: Cosine similarity involves basic computations like multiplication and addition, making it efficient even when dealing with vectors with millions of dimensions.

  • Sparse vectors: Embedding vectors are often sparse, meaning many entries are zero. The inner product and the vector length can be heavily optimized in such cases, since zero entries contribute nothing to either sum.

  • Simplicity: Cosine similarity is frequently chosen for its straightforwardness. Unlike Euclidean distance, which can grow unbounded towards infinity, cosine similarity is confined to a range between -1 and 1. This bounded behavior helps avoid overflow issues, making it more practical for most models.

  • Alignment with embedding algorithms: Embedding algorithms are often designed with cosine-similarity-based comparisons in mind, making cosine similarity a natural and efficient choice for relating the meaning of vectors.
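
To make the sparse-vector point concrete, here is a minimal sketch that stores each vector as a dictionary mapping an index to its nonzero value. This representation and the names `sparse_dot` and `sparse_cosine_sim` are assumptions made for the example. The work done depends on the number of nonzero entries, not on the nominal dimension.

```python
import math

def sparse_dot(x, y):
    # Iterate over the smaller dict; absent indices are zeros and add nothing
    if len(x) > len(y):
        x, y = y, x
    return sum(v * y[i] for i, v in x.items() if i in y)

def sparse_cosine_sim(x, y):
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return sparse_dot(x, y) / (norm_x * norm_y)

# Two vectors in a million-dimensional space with three nonzeros each
a = {7: 1.0, 42: 2.0, 999_999: 2.0}
b = {42: 2.0, 500: 1.0, 999_999: 2.0}

print(sparse_cosine_sim(a, b))  # 8 / 9, roughly 0.889
```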

How Is Cosine Similarity Used in AI Systems?

Cosine similarity in AI systems measures the similarity between text data. This powers applications like search engines and recommendation systems to identify related content. Here are the typical steps involved:

Collect and store text documents 

Gather a corpus of text documents, including anything from an organization’s internal records to public datasets. These documents are stored in a database like PostgreSQL for easy access and management.

Convert text into vectors 

The next step involves transforming the text data into a numerical format using semantic embeddings. Embeddings are high-dimensional vectors that represent the semantic meaning of text. For instance, models like OpenAI's text-embedding-ada-002 can convert a piece of text into a 1,536-dimensional vector. This is just one approach to converting text into embeddings; many others exist. 

Alternatives include word embeddings like Word2Vec, which captures semantic relationships between individual words, and sentence embeddings like SBERT (Sentence-BERT), which represents entire sentences as vectors. Proprietary, closed-source options include OpenAI’s text-embedding-ada-002 and Cohere’s embedding models.

Implement a search function

A function is developed to facilitate semantic-based search by converting the user's search query into a vector using the same embedding model. This vector serves as a query to search through the stored document vectors.

Calculate cosine similarity

Cosine similarity compares the query vector against the document vectors. It computes the cosine of the angle between two vectors, providing a measure of similarity that ranges from -1 (opposite in direction) to 1 (identical in direction). Documents with a similarity score closer to 1 are considered more relevant to the query.
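
The steps above can be sketched end to end. In this example, the three-dimensional vectors are made-up stand-ins for real embeddings, and `search` is a hypothetical helper. In practice, an embedding model would produce both the document vectors and the query vector, and the vectors would live in a database rather than a dictionary.

```python
import math

def cosine_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Toy "embeddings" (assumed values, not produced by a real model)
documents = {
    "cats and kittens": [0.9, 0.1, 0.0],
    "dog training tips": [0.7, 0.3, 0.1],
    "car maintenance": [0.0, 0.2, 0.9],
}

def search(query_vector, docs, top_k=2):
    # Score every document against the query, then keep the best top_k
    scored = [(cosine_sim(query_vector, vec), text) for text, vec in docs.items()]
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]

query = [0.8, 0.2, 0.0]  # pretend embedding of a query about pets
print(search(query, documents))  # ['cats and kittens', 'dog training tips']
```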

Check out this blog post on creating, storing, and querying OpenAI embeddings for a step-by-step guide with code snippets.

Cosine Similarity Tools

Several tools and methods are available for computing cosine similarity, each suited to different use cases:

  • Manual implementation: You can implement cosine similarity directly in Python or other programming languages. We just need to replicate the formula discussed above in a function. Here’s an implementation:

from numpy import dot
from numpy.linalg import norm

def cosine_similarity(v1, v2):
    # Dot product divided by the product of the two Euclidean lengths
    return dot(v1, v2) / (norm(v1) * norm(v2))

dot and norm are NumPy functions for the dot product and the Euclidean norm (vector length), respectively.

  • Vector databases: Implementing many optimizations ourselves might result in us overlooking important details. Consequently, specialized databases have been developed specifically for handling vectors, incorporating various operations such as cosine similarity. Traditional databases like PostgreSQL use extensions like pgvector, which include functions for cosine similarity, making it easier to work with high-dimensional data.

  • Nearest neighbor search algorithms: Searching for the closest vectors, or “nearest neighbors,” can be computationally expensive, especially with high-dimensional vectors. Various algorithms optimize this process by instead finding approximate nearest neighbors (ANNs):

    • Hierarchical Navigable Small World (HNSW) algorithms improve search efficiency by structuring data in a network-like format. In simple terms, HNSW builds a multi-layered graph structure that resembles a small-world network. Think of it like a digital map of a road network. When you zoom out, you see major highways connecting cities. Zooming in shows connections in a town, and at the closest level, you can see how neighborhoods and communities are connected.

    • Centroid-based indexing (for example, IVFFlat in pgvector) partitions the vectors into clusters, each represented by a centroid. At query time, the search compares the query only against the vectors in the clusters whose centroids are closest, rather than scanning the whole dataset.

    • DiskANN Algorithm from Microsoft is a graph-based approximate nearest neighbor search algorithm that scales to large amounts of data while retaining high recall.
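
For reference, the exact search that these approximate algorithms try to avoid can be sketched as a linear scan. The example below also applies one common optimization: if every stored vector is normalized to unit length once, cosine similarity reduces to a plain dot product at query time. The names `normalize` and `exact_nearest` are hypothetical, and the vectors are toy data.

```python
import heapq
import math

def normalize(v):
    # Scale a vector to unit length so that dot product equals cosine similarity
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def exact_nearest(query, unit_vectors, k):
    # Linear scan: one dot product per stored vector, so cost grows with
    # both the number of vectors and their dimension
    q = normalize(query)
    scores = (
        (sum(a * b for a, b in zip(q, v)), idx)
        for idx, v in enumerate(unit_vectors)
    )
    return heapq.nlargest(k, scores)

stored = [normalize(v) for v in [[1, 0], [1, 1], [0, 1], [-1, 0]]]
top = exact_nearest([2, 1], stored, k=2)
print([idx for _, idx in top])  # [1, 0]
```

An approximate index trades a small amount of recall for visiting only a fraction of the stored vectors.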

How Timescale Does Cosine Similarity Search

Timescale Cloud extends PostgreSQL’s capabilities by offering a built-in cosine similarity search function through the pgvector extension. This inclusion allows comparisons between high-dimensional vectors for semantic search. Timescale also supports other comparison metrics if needed.

Timescale uses a state-of-the-art nearest neighbor algorithm based on DiskANN—StreamingDiskANN—ensuring top-tier performance for ANN queries. This new index method for pgvector data is part of a PostgreSQL open-source extension developed by the Timescale team called pgvectorscale.

Pgvectorscale’s Streaming DiskANN index has no “ef_search” type cutoff. Instead, it uses a streaming model allowing the index to continuously retrieve the “next closest” item for a given query, potentially traversing the entire graph!
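
To illustrate what a "next closest" streaming interface offers a caller, here is a toy sketch. It is not how StreamingDiskANN works internally: this stand-in scores every item up front and uses a heap, whereas the real index traverses a graph. The generator `stream_closest`, the sample vectors, and the `allowed` filter are all assumptions made for the example.

```python
import heapq
import math

def cosine_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def stream_closest(query, items):
    # Toy stand-in: yields items one at a time in order of decreasing
    # similarity, for as long as the caller keeps asking
    heap = [(-cosine_sim(query, vec), name) for name, vec in items.items()]
    heapq.heapify(heap)
    while heap:
        neg_score, name = heapq.heappop(heap)
        yield name, -neg_score

items = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.5, 0.5], "d": [0.0, 1.0]}
allowed = {"c", "d"}  # hypothetical extra condition, e.g. a metadata filter

results = []
for name, score in stream_closest([1.0, 0.0], items):
    if name in allowed:
        results.append(name)
    if len(results) == 1:
        break

print(results)  # ['c']
```

Because the caller decides when to stop, it can keep pulling candidates until enough of them satisfy the extra condition, instead of being limited to a fixed number of candidates chosen in advance.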

Benchmark tests reveal that, compared to Pinecone’s storage-optimized index (s1), Timescale’s supercharged PostgreSQL with pgvector and pgvectorscale achieves 28x lower p95 latency and 16x higher query throughput for approximate nearest neighbor queries at 99% recall, all at 75% lower monthly cost when self-hosted on AWS EC2.


Additionally, there are several other benefits:

  • Rich support for backups: Timescale Cloud supports consistent backups, streaming backups, and incremental and full backups. In contrast, Pinecone only supports a manual operation to take a non-consistent copy of its data called “Collections.”

  • Point-in-time recovery: Timescale Cloud offers PITR to recover from operator errors.

  • High availability: This feature is designed for applications that need high-uptime guarantees.

  • Security: Timescale Cloud offers enterprise-grade security, including SOC2 Type II and GDPR compliance, with encryption and multi-factor authentication.

  • Flexibility: It supports various index types, allowing tailored solutions for AI needs.

  • Familiarity: Timescale Cloud complements, rather than replaces, pgvector, ensuring a low barrier to adoption for users already familiar with PostgreSQL.

Next Steps

Embeddings serve as points in space, and cosine similarity gives us a number that quantifies how closely they are related. With this approach, you can group content with similar meanings, allowing systems to deliver results beyond simple keyword matching and offer deeper, context-aware responses.

To learn more about cosine similarity, check out these blog posts:

  • PostgreSQL as a Vector Database: Create, Store, and Query OpenAI Embeddings With pgvector

  • A Beginner’s Guide to Vector Embeddings

Developers can leverage powerful tools like Timescale Cloud to unlock the full potential of cosine similarity. Its open-source AI stack includes pgvector and pgvectorscale, a PostgreSQL extension that builds on pgvector for more performance and scale. Pgvectorscale adds a StreamingDiskANN index to pgvector and statistical binary quantization, unlocking large-scale, high-performance AI use cases previously achievable only with specialized vector databases like Pinecone.

Pgvectorscale is open source under the PostgreSQL License and is available for you to use in your AI projects today. You can find installation instructions on the pgvectorscale GitHub repository. You can also access pgvectorscale on any database service on Timescale’s cloud PostgreSQL platform.
