Add new experimental rank_vectors mapping for late-interaction second order ranking #118804

Merged
21 commits
b5decc4
Moving rank_vectors to a plugin and removing feature flag
benwtrent Dec 16, 2024
ef15068
cleanups and docs
benwtrent Dec 16, 2024
9a65da8
Update docs/changelog/118804.yaml
benwtrent Dec 16, 2024
023f11f
giving whitelist resource access to painless spi
benwtrent Dec 17, 2024
341662a
[CI] Auto commit changes from spotless
elasticsearchmachine Dec 17, 2024
fedd90a
fixing tests and headers
benwtrent Dec 17, 2024
01cb425
fixing notes
benwtrent Dec 17, 2024
0155f60
adding tests and doc values
benwtrent Dec 17, 2024
3c27c9d
Merge remote-tracking branch 'upstream/main' into feature/rank-vector…
benwtrent Dec 17, 2024
e12ba1e
Merge remote-tracking branch 'upstream/main' into feature/rank-vector…
benwtrent Dec 18, 2024
b8b1338
Merge remote-tracking branch 'upstream/main' into feature/rank-vector…
benwtrent Dec 18, 2024
239564d
fixing checks
benwtrent Dec 18, 2024
996c6ec
Merge branch 'main' into feature/rank-vectors-plugin
benwtrent Dec 18, 2024
5922e2a
fixing javadocs
benwtrent Dec 18, 2024
ad75198
Merge branch 'feature/rank-vectors-plugin' of github.com:benwtrent/el…
benwtrent Dec 18, 2024
63094f1
Merge remote-tracking branch 'upstream/main' into feature/rank-vector…
benwtrent Dec 19, 2024
399e9d7
trying to fix tests
benwtrent Dec 19, 2024
e1bb29b
Merge remote-tracking branch 'upstream/main' into feature/rank-vector…
benwtrent Jan 6, 2025
e9ad982
fixing tests
benwtrent Jan 6, 2025
99276ad
fixing gradle
benwtrent Jan 6, 2025
ca4a8a2
fixing tests
benwtrent Jan 6, 2025
15 changes: 15 additions & 0 deletions docs/changelog/118804.yaml
@@ -0,0 +1,15 @@
pr: 118804
summary: Add new experimental `rank_vectors` mapping for late-interaction second order
  ranking
area: Vector Search
type: feature
issues: []
highlight:
  title: Add new experimental `rank_vectors` mapping for late-interaction second order
    ranking
  body:
    Late-interaction models are powerful rerankers. While their size and overall
    cost don't lend themselves to HNSW indexing, using them for second order reranking
    can provide excellent boosts in relevance. The new `rank_vectors` mapping allows for rescoring
    with multi-vector late-interaction models like ColBERT or ColPali.
  notable: true
2 changes: 2 additions & 0 deletions docs/reference/mapping/types.asciidoc
@@ -180,6 +180,8 @@ include::types/rank-feature.asciidoc[]

include::types/rank-features.asciidoc[]

include::types/rank-vectors.asciidoc[]

include::types/search-as-you-type.asciidoc[]

include::types/semantic-text.asciidoc[]
1 change: 0 additions & 1 deletion docs/reference/mapping/types/dense-vector.asciidoc
@@ -1,4 +1,3 @@
[role="xpack"]
[[dense-vector]]
=== Dense vector field type
++++
201 changes: 201 additions & 0 deletions docs/reference/mapping/types/rank-vectors.asciidoc
@@ -0,0 +1,201 @@
[role="xpack"]
[[rank-vectors]]
=== Rank Vectors
++++
<titleabbrev> Rank Vectors </titleabbrev>
++++
experimental::[]

The `rank_vectors` field type enables late-interaction dense vector scoring in Elasticsearch. The number of vectors
per field can vary, but they must all share the same number of dimensions and element type.

Vectors stored in this field are intended for second-order ranking of documents using max-sim similarity.

Here is a simple example of using this field with `float` elements.

[source,console]
--------------------------------------------------
PUT my-rank-vectors-float
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "rank_vectors"
      }
    }
  }
}

PUT my-rank-vectors-float/_doc/1
{
  "my_vector" : [[0.5, 10, 6], [-0.5, 10, 10]]
}

--------------------------------------------------
// TESTSETUP

In addition to the `float` element type, `byte` and `bit` element types are also supported.

Here is an example of using this field with `byte` elements.

[source,console]
--------------------------------------------------
PUT my-rank-vectors-byte
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "rank_vectors",
        "element_type": "byte"
      }
    }
  }
}

PUT my-rank-vectors-byte/_doc/1
{
  "my_vector" : [[1, 2, 3], [4, 5, 6]]
}
--------------------------------------------------

Here is an example of using this field with `bit` elements.

[source,console]
--------------------------------------------------
PUT my-rank-vectors-bit
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "rank_vectors",
        "element_type": "bit"
      }
    }
  }
}

POST /my-rank-vectors-bit/_bulk?refresh
{"index": {"_id" : "1"}}
{"my_vector": [127, -127, 0, 1, 42]}
{"index": {"_id" : "2"}}
{"my_vector": "8100012a7f"}
--------------------------------------------------
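The two documents above supply their bit vectors in different forms: a list of byte values and a hexadecimal string. The following Python sketch is illustrative only and is not part of this change. It assumes each pair of hex digits maps to one signed byte, and the helper names `hex_to_signed_bytes` and `bit_dims` are hypothetical. It shows why both documents have 5 bytes, that is, 40 bit dimensions.

```python
from typing import List


def hex_to_signed_bytes(hex_string: str) -> List[int]:
    """Decode a hex string into signed byte values (-128..127).

    Assumption: two hex digits encode one byte of the packed bit vector.
    """
    raw = bytes.fromhex(hex_string)
    return [b - 256 if b > 127 else b for b in raw]


def bit_dims(byte_values: List[int]) -> int:
    """Number of bit dimensions represented by a packed byte vector."""
    return len(byte_values) * 8


doc_1 = [127, -127, 0, 1, 42]               # document 1, given as byte values
doc_2 = hex_to_signed_bytes("8100012a7f")   # document 2, given as a hex string

print(doc_2)                                # [-127, 0, 1, 42, 127]
print(bit_dims(doc_1), bit_dims(doc_2))     # 40 40
```

Both vectors occupy 5 bytes, so they satisfy the requirement that the number of bit dimensions is a multiple of 8.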

[role="child_attributes"]
[[rank-vectors-params]]
==== Parameters for rank vectors fields

The `rank_vectors` field type supports the following parameters:

[[rank-vectors-element-type]]
`element_type`::
(Optional, string)
The data type used to encode vectors. The supported data types are
`float` (default), `byte`, and `bit`.

.Valid values for `element_type`
[%collapsible%open]
====
`float`:::
indexes a 4-byte floating-point
value per dimension. This is the default value.

`byte`:::
indexes a 1-byte integer value per dimension.

`bit`:::
indexes a single bit per dimension. Useful for very high-dimensional vectors or models that specifically support bit vectors.
NOTE: when using `bit`, the number of dimensions must be a multiple of 8 and must represent the number of bits.

====

`dims`::
(Optional, integer)
Number of vector dimensions. Can't exceed `4096`. If `dims` is not specified,
it will be set to the length of the first vector added to the field.

[[rank-vectors-synthetic-source]]
==== Synthetic `_source`

IMPORTANT: Synthetic `_source` is Generally Available only for TSDB indices
(indices that have `index.mode` set to `time_series`). For other indices
synthetic `_source` is in technical preview. Features in technical preview may
be changed or removed in a future release. Elastic will work to fix
any issues, but features in technical preview are not subject to the support SLA
of official GA features.

`rank_vectors` fields support <<synthetic-source,synthetic `_source`>>.

[[rank-vectors-scoring]]
==== Scoring with rank vectors

Rank vectors can be accessed and used in <<query-dsl-script-score-query,`script_score` queries>>.

For example, the following query scores documents based on the maxSim similarity between the query vector and the vectors stored in the `my_vector` field:

[source,console]
--------------------------------------------------
GET my-rank-vectors-float/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "maxSimDotProduct(params.query_vector, 'my_vector')",
        "params": {
          "query_vector": [[0.5, 10, 6], [-0.5, 10, 10]]
        }
      }
    }
  }
}
--------------------------------------------------
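To make the scoring concrete, here is a minimal Python sketch of max-sim with dot product as it is usually defined for late-interaction models such as ColBERT: for each query vector, take the largest dot product against any document vector, then sum those maxima. This is an illustration under that assumption, not the plugin's implementation, and the function names are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product; both vectors must have the same dimensions."""
    if len(a) != len(b):
        raise ValueError("vectors must share the same number of dimensions")
    return sum(x * y for x, y in zip(a, b))


def max_sim_dot_product(query_vectors: Sequence[Vector],
                        doc_vectors: Sequence[Vector]) -> float:
    """Sum, over query vectors, of the best dot product among doc vectors."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


# Query vectors from the request above and the vectors indexed in document 1.
query = [[0.5, 10, 6], [-0.5, 10, 10]]
doc = [[0.5, 10, 6], [-0.5, 10, 10]]

score = max_sim_dot_product(query, doc)
print(score)  # 360.0 (159.75 for the first query vector + 200.25 for the second)
```

Note that the first query vector's best match is the second document vector (159.75), not its identical copy (136.25), because dot product rewards magnitude as well as direction.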

Additionally, asymmetric similarity functions can be used to score against `bit` vectors. For example, the following query scores documents based on the maxSimDotProduct similarity between a floating point query vector and bit vectors stored in the `my_vector` field:

[source,console]
--------------------------------------------------
PUT my-rank-vectors-bit
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "rank_vectors",
        "element_type": "bit"
      }
    }
  }
}

POST /my-rank-vectors-bit/_bulk?refresh
{"index": {"_id" : "1"}}
{"my_vector": [127, -127, 0, 1, 42]}
{"index": {"_id" : "2"}}
{"my_vector": "8100012a7f"}

GET my-rank-vectors-bit/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "maxSimDotProduct(params.query_vector, 'my_vector')",
        "params": {
          "query_vector": [
            [0.35, 0.77, 0.95, 0.15, 0.11, 0.08, 0.58, 0.06, 0.44, 0.52, 0.21,
             0.62, 0.65, 0.16, 0.64, 0.39, 0.93, 0.06, 0.93, 0.31, 0.92, 0.0,
             0.66, 0.86, 0.92, 0.03, 0.81, 0.31, 0.2 , 0.92, 0.95, 0.64, 0.19,
             0.26, 0.77, 0.64, 0.78, 0.32, 0.97, 0.84]
          ] <1>
        }
      }
    }
  }
}
--------------------------------------------------
<1> Note that the query vector has 40 elements, matching the number of bits in the bit vectors.
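One way to read an asymmetric dot product between a floating-point query vector and a packed bit vector is as the sum of the query values at positions where the corresponding bit is set. The Python sketch below illustrates that reading only. It assumes the most significant bit of each byte corresponds to the lowest dimension index within that byte, and `float_bit_dot_product` is a hypothetical name, not the plugin's API.

```python
from typing import Sequence


def float_bit_dot_product(query: Sequence[float],
                          packed_bits: Sequence[int]) -> float:
    """Sum the query values at positions whose bit is set.

    Assumption: within each byte, the most significant bit is the
    lowest-numbered dimension.
    """
    if len(query) != len(packed_bits) * 8:
        raise ValueError("query length must equal the number of bits")
    total = 0.0
    for byte_index, value in enumerate(packed_bits):
        unsigned = value & 0xFF  # treat signed byte values as raw bits
        for bit in range(8):
            if unsigned & (0x80 >> bit):
                total += query[byte_index * 8 + bit]
    return total


# -127 as a signed byte is 0b10000001: the first and last bits are set.
example = float_bit_dot_product([1, 2, 3, 4, 5, 6, 7, 8], [-127])
print(example)  # 9.0
```

The length check mirrors the callout above: a 5-byte bit vector needs a 40-element query vector.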

@@ -50,7 +50,5 @@ static_import {
double cosineSimilarity(org.elasticsearch.script.ScoreScript, Object, String) bound_to org.elasticsearch.script.VectorScoreScriptUtils$CosineSimilarity
double dotProduct(org.elasticsearch.script.ScoreScript, Object, String) bound_to org.elasticsearch.script.VectorScoreScriptUtils$DotProduct
double hamming(org.elasticsearch.script.ScoreScript, Object, String) bound_to org.elasticsearch.script.VectorScoreScriptUtils$Hamming
double maxSimDotProduct(org.elasticsearch.script.ScoreScript, Object, String) bound_to org.elasticsearch.script.RankVectorsScoreScriptUtils$MaxSimDotProduct
double maxSimInvHamming(org.elasticsearch.script.ScoreScript, Object, String) bound_to org.elasticsearch.script.RankVectorsScoreScriptUtils$MaxSimInvHamming
}
