Make knn search a query #98916
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
pr: 98916 | ||
summary: Make knn search a query | ||
area: Vector Search | ||
type: feature | ||
issues: [] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,223 @@ | ||
[[query-dsl-knn-query]] | ||
=== Knn query | ||
++++ | ||
<titleabbrev>Knn</titleabbrev> | ||
++++ | ||
|
||
Finds the _k_ nearest vectors to a query vector, as measured by a similarity | ||
metric. The _knn_ query finds nearest vectors through approximate search on indexed | ||
dense_vectors. The preferred way to do approximate kNN search is through the | ||
<<knn-search,top level knn section>> of a search request. The _knn_ query is reserved for | ||
expert cases, where there is a need to combine this query with other queries. | ||
|
||
[[knn-query-ex-request]] | ||
==== Example request | ||
|
||
[source,console] | ||
---- | ||
PUT my-image-index | ||
{ | ||
"mappings": { | ||
"properties": { | ||
"image-vector": { | ||
"type": "dense_vector", | ||
"dims": 3, | ||
"index": true, | ||
"similarity": "l2_norm" | ||
}, | ||
"file-type": { | ||
"type": "keyword" | ||
} | ||
} | ||
} | ||
} | ||
---- | ||
|
||
. Index your data. | ||
+ | ||
[source,console] | ||
---- | ||
POST my-image-index/_bulk?refresh=true | ||
{ "index": { "_id": "1" } } | ||
{ "image-vector": [1, 5, -20], "file-type": "jpg" } | ||
{ "index": { "_id": "2" } } | ||
{ "image-vector": [42, 8, -15], "file-type": "png" } | ||
{ "index": { "_id": "3" } } | ||
{ "image-vector": [15, 11, 23], "file-type": "jpg" } | ||
---- | ||
//TEST[continued] | ||
|
||
. Run the search using the `knn` query, asking for the top 3 nearest vectors. | ||
+ | ||
[source,console] | ||
---- | ||
POST my-image-index/_search | ||
{ | ||
"size" : 3, | ||
"query" : { | ||
"knn": { | ||
"field": "image-vector", | ||
"query_vector": [-5, 9, -12], | ||
"num_candidates": 10 | ||
} | ||
} | ||
} | ||
---- | ||
//TEST[continued] | ||
|
||
NOTE: The `knn` query doesn't have a separate `k` parameter. As with other | ||
queries, `k` is defined by the `size` parameter of the search request. The `knn` | ||
query collects `num_candidates` results from each shard, then merges them to get | ||
the top `size` results. | ||
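
The collect-then-merge behaviour described in the note can be sketched in plain Python. This is a minimal illustration, not how {es} implements it: it uses exact brute-force L2 distance instead of approximate search, and the two-shard layout and the helper names (`shard_candidates`, `knn_query`) are assumptions made for the example.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def shard_candidates(shard_docs, query_vector, num_candidates):
    """Return up to num_candidates (distance, doc_id) pairs from one shard."""
    scored = [(l2_distance(vec, query_vector), doc_id)
              for doc_id, vec in shard_docs.items()]
    return heapq.nsmallest(num_candidates, scored)


def knn_query(shards, query_vector, num_candidates, size):
    """Collect candidates per shard, then merge and keep the top `size`."""
    merged = []
    for shard_docs in shards:
        merged.extend(shard_candidates(shard_docs, query_vector, num_candidates))
    return [doc_id for _, doc_id in heapq.nsmallest(size, merged)]


# Hypothetical two-shard layout of the three sample documents.
shards = [
    {"1": [1, 5, -20], "2": [42, 8, -15]},
    {"3": [15, 11, 23]},
]

top = knn_query(shards, [-5, 9, -12], num_candidates=10, size=3)
print(top)
```

Here `size` plays the role of `k`: changing it changes how many merged results are kept, while `num_candidates` only controls how many candidates each shard contributes.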
|
||
|
||
[[knn-query-top-level-parameters]] | ||
==== Top-level parameters for `knn` | ||
|
||
`field`:: | ||
+ | ||
-- | ||
(Required, string) The name of the vector field to search against. Must be a | ||
<<index-vectors-knn-search, `dense_vector` field with indexing enabled>>. | ||
-- | ||
|
||
`query_vector`:: | ||
+ | ||
-- | ||
(Required, array of floats) Query vector. Must have the same number of dimensions | ||
as the vector field you are searching against. | ||
-- | ||
|
||
`num_candidates`:: | ||
+ | ||
-- | ||
(Required, integer) The number of nearest neighbor candidates to consider per shard. | ||
Cannot exceed 10,000. {es} collects `num_candidates` results from each shard, then | ||
merges them to find the top results. Increasing `num_candidates` tends to improve the | ||
accuracy of the final results. | ||
-- | ||
|
||
`filter`:: | ||
+ | ||
-- | ||
(Optional, query object) Query to filter the documents that can match. | ||
The kNN search will return the top documents that also match this filter. | ||
The value can be a single query or a list of queries. If `filter` is not provided, | ||
all documents are allowed to match. | ||
|
||
The filter is a pre-filter, meaning that it is applied **during** the approximate | ||
kNN search to ensure that `num_candidates` matching documents are returned. | ||
-- | ||
|
||
`similarity`:: | ||
+ | ||
-- | ||
(Optional, float) The minimum similarity required for a document to be considered | ||
a match. The similarity value relates to the raw | ||
<<dense-vector-similarity, `similarity`>> used, not the document score. The matched | ||
documents are then scored according to <<dense-vector-similarity, `similarity`>> | ||
and the provided `boost` is applied. | ||
-- | ||
|
||
`boost`:: | ||
+ | ||
-- | ||
(Optional, float) Floating point number used to multiply the | ||
scores of matched documents. This value cannot be negative. Defaults to `1.0`. | ||
-- | ||
|
||
`_name`:: | ||
+ | ||
-- | ||
(Optional, string) Name field to identify the query. | ||
-- | ||
|
||
[[knn-query-filtering]] | ||
==== Pre-filters and post-filters in knn query | ||
|
||
There are two ways to filter documents that match a kNN query: | ||
|
||
. **pre-filtering** – the filter is applied during the approximate kNN search | ||
to ensure that `k` matching documents are returned. | ||
. **post-filtering** – the filter is applied after the approximate kNN search | ||
completes, which can result in fewer than `k` results, even when there are enough | ||
matching documents. | ||
|
||
Pre-filtering is supported through the `filter` parameter of the `knn` query. | ||
Filters from <<filter-alias,aliases>> are also applied as pre-filters. | ||
|
||
All other filters found in the Query DSL tree are applied as post-filters. | ||
In the following example, the `knn` query finds the top 3 documents with the | ||
nearest vectors (`num_candidates=3`), and the `term` filter is then applied to | ||
them as a post-filter. The final result set contains only the single document | ||
that passes the post-filter. | ||
|
||
|
||
[source,console] | ||
---- | ||
POST my-image-index/_search | ||
{ | ||
"size" : 10, | ||
"query" : { | ||
"bool" : { | ||
"must" : { | ||
"knn": { | ||
"field": "image-vector", | ||
"query_vector": [-5, 9, -12], | ||
"num_candidates": 3 | ||
} | ||
}, | ||
"filter" : { | ||
"term" : { "file-type" : "png" } | ||
} | ||
} | ||
} | ||
} | ||
---- | ||
//TEST[continued] | ||
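
The difference between the two filtering modes can be sketched on the three sample documents. This is a simplified illustration under stated assumptions: it uses exact L2 distance on a single shard rather than approximate search, and the function names (`pre_filtered`, `post_filtered`) are hypothetical.

```python
import math

# The three sample documents indexed earlier.
DOCS = {
    "1": {"image-vector": [1, 5, -20], "file-type": "jpg"},
    "2": {"image-vector": [42, 8, -15], "file-type": "png"},
    "3": {"image-vector": [15, 11, 23], "file-type": "jpg"},
}


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(doc_ids, query_vector, num_candidates):
    """Return the num_candidates nearest doc ids, closest first."""
    ranked = sorted(
        doc_ids,
        key=lambda d: l2_distance(DOCS[d]["image-vector"], query_vector),
    )
    return ranked[:num_candidates]


def pre_filtered(query_vector, num_candidates, file_type):
    """Filter first, then search only among the matching documents."""
    allowed = [d for d, doc in DOCS.items() if doc["file-type"] == file_type]
    return nearest(allowed, query_vector, num_candidates)


def post_filtered(query_vector, num_candidates, file_type):
    """Search first, then drop candidates that fail the filter."""
    candidates = nearest(DOCS, query_vector, num_candidates)
    return [d for d in candidates if DOCS[d]["file-type"] == file_type]


query = [-5, 9, -12]
post_3 = post_filtered(query, 3, "png")  # mirrors the request above
post_1 = post_filtered(query, 1, "png")  # nearest doc is a jpg, so nothing survives
pre_1 = pre_filtered(query, 1, "png")    # the png doc is found despite being far away
print(post_3, post_1, pre_1)
```

With `num_candidates` set to 1, the post-filtered variant returns nothing even though a matching `png` document exists, which is the "fewer than k results" case described above.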
|
||
[[knn-query-with-nested-query]] | ||
==== Knn query inside a nested query | ||
|
||
The `knn` query can be used inside a nested query. The behaviour here is similar | ||
to <<nested-knn-search, top level nested kNN search>>: | ||
|
||
* kNN search over nested dense_vectors diversifies the top results over | ||
the top-level document | ||
* `filter` over the top-level document metadata is supported and acts as a | ||
post-filter | ||
* `filter` over `nested` field metadata is not supported | ||
|
||
A sample query might look like this: | ||
|
||
[source,js] | ||
---- | ||
{ | ||
"query" : { | ||
"nested" : { | ||
"path" : "paragraph", | ||
"query" : { | ||
"knn": { | ||
"query_vector": [ | ||
0.45, | ||
45 | ||
], | ||
"field": "paragraph.vector", | ||
"num_candidates": 2 | ||
} | ||
} | ||
} | ||
} | ||
} | ||
---- | ||
// NOTCONSOLE | ||
|
||
[[knn-query-aggregations]] | ||
==== Knn query with aggregations | ||
The `knn` query calculates aggregations on the `num_candidates` documents | ||
collected from each shard. Thus, the final results from aggregations contain | ||
`num_candidates * number_of_shards` documents. This is different from | ||
the <<knn-search,top level knn section>>, where aggregations are | ||
calculated on the global top k nearest documents. | ||
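
The difference in aggregation scope can be sketched with a simple terms-style count. This is only an illustration: the shard contents, distances, and function names below are made up, and the per-shard count reaches `num_candidates * number_of_shards` only when every shard holds at least `num_candidates` documents.

```python
from collections import Counter
import heapq

# Hypothetical shards: each entry is (distance_to_query, file_type).
SHARDS = [
    [(0.1, "jpg"), (0.4, "png"), (0.9, "jpg")],
    [(0.2, "png"), (0.3, "png"), (0.8, "gif")],
]


def agg_knn_query(shards, num_candidates):
    """Aggregate over every shard's num_candidates nearest documents."""
    counts = Counter()
    for shard in shards:
        for _, file_type in heapq.nsmallest(num_candidates, shard):
            counts[file_type] += 1
    return counts


def agg_top_level_knn(shards, num_candidates, k):
    """Aggregate over the global top k after merging shard candidates."""
    merged = []
    for shard in shards:
        merged.extend(heapq.nsmallest(num_candidates, shard))
    return Counter(file_type for _, file_type in heapq.nsmallest(k, merged))


query_counts = agg_knn_query(SHARDS, num_candidates=2)
top_level_counts = agg_top_level_knn(SHARDS, num_candidates=2, k=2)
print(sum(query_counts.values()), sum(top_level_counts.values()))
```

In this sketch the query-style aggregation sees four documents (two per shard), while the top-level style sees only the two globally nearest ones.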
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -61,6 +61,7 @@ | |
import org.elasticsearch.index.query.QueryShardException; | ||
import org.elasticsearch.index.query.Rewriteable; | ||
import org.elasticsearch.index.query.SearchExecutionContext; | ||
import org.elasticsearch.search.vectors.KnnVectorQueryBuilder; | ||
import org.elasticsearch.xcontent.XContentParser; | ||
|
||
import java.io.ByteArrayOutputStream; | ||
|
@@ -438,6 +439,8 @@ static QueryBuilder parseQueryBuilder(DocumentParserContext context) { | |
throw new IllegalArgumentException("the [has_child] query is unsupported inside a percolator query"); | ||
} else if (queryName.equals("has_parent")) { | ||
throw new IllegalArgumentException("the [has_parent] query is unsupported inside a percolator query"); | ||
} else if (queryName.equals(KnnVectorQueryBuilder.NAME)) { | ||
throw new IllegalArgumentException("the [knn] query is unsupported inside a percolator query"); | ||
Comment on lines +442 to +443
is this just because its too difficult to make work?
I think this is OK for now. I will have to think more about if we should support it or not. IMO, it seems like we should. We are just matching the nearest "k", so it seems to fit OK. But I can also see the argument against (as you laid out). Additionally, we should have a "more_like_this" query that utilizes knn as well. Maybe open a Github issue to track discussion?
This as well, but more importantly, I think it does not make sense to percolate a document against knn query, as knn query matches any single document.
I understand that semantic similarity has no natural bounds like lexical. But, using this to find semantically similar queries to a stored/new docs is powerful. Especially when you consider hybrid search, and similarity filtering.
@benwtrent Let's discuss this offline (about percolator and more_like_this queries) and create github issues if we find them necessary. I will consider this PR will go without those queries. |
||
} | ||
}); | ||
} catch (IOException e) { | ||
|