Make knn search a query #98916

Merged Nov 1, 2023: 24 commits (this view shows changes from 21 commits)

Commits
5a05304
Make knn search a query
mayya-sharipova Aug 22, 2023
5aca3c2
Update docs/changelog/98916.yaml
mayya-sharipova Aug 27, 2023
897266a
Merge remote-tracking branch 'upstream/main' into knn-as-query
mayya-sharipova Aug 27, 2023
8b02234
Correct transport version, remove byteQueryVector
mayya-sharipova Aug 27, 2023
88c3233
Merge remote-tracking branch 'upstream/main' into knn-as-query
mayya-sharipova Aug 29, 2023
49f2b80
Other adjustments
mayya-sharipova Aug 29, 2023
9b1e7c8
Add aliasFilter to the SearchExecutionContext instead of QueryBuilder
mayya-sharipova Aug 29, 2023
cf1c0ae
Remove query_vector_builder
mayya-sharipova Aug 30, 2023
9046812
Add query _name to tests
mayya-sharipova Sep 1, 2023
8a4cb9b
Add filter alias during doToQuery
mayya-sharipova Sep 1, 2023
dccd3c7
Merge remote-tracking branch 'upstream/main' into knn-as-query
mayya-sharipova Sep 1, 2023
d4a5758
Simplify over-wire protocol
mayya-sharipova Sep 1, 2023
00a5d5c
Merge remote-tracking branch 'upstream/main' into knn-as-query
mayya-sharipova Oct 23, 2023
7d2d091
Add nested support for knn query
mayya-sharipova Oct 26, 2023
56b8bbb
Merge remote-tracking branch 'upstream/main' into knn-as-query
mayya-sharipova Oct 26, 2023
b3ca311
Add documentation and other queries
mayya-sharipova Oct 30, 2023
21f40b4
Merge remote-tracking branch 'upstream/main' into knn-as-query
mayya-sharipova Oct 30, 2023
9d825cf
Updates to knn-query documentation
mayya-sharipova Oct 31, 2023
e0ae1dc
Merge remote-tracking branch 'upstream/main' into knn-as-query
mayya-sharipova Oct 31, 2023
348305f
Merge remote-tracking branch 'upstream/main' into knn-as-query
mayya-sharipova Oct 31, 2023
dc0012c
Adjust docs
mayya-sharipova Oct 31, 2023
6876c5a
Merge remote-tracking branch 'upstream/main' into knn-as-query
mayya-sharipova Oct 31, 2023
c4b80ba
Merge branch 'main' into knn-as-query
mayya-sharipova Nov 1, 2023
6450681
Fix an error in docs
mayya-sharipova Nov 1, 2023
5 changes: 5 additions & 0 deletions docs/changelog/98916.yaml
@@ -0,0 +1,5 @@
pr: 98916
summary: Make knn search a query
area: Vector Search
type: feature
issues: []
223 changes: 223 additions & 0 deletions docs/reference/query-dsl/knn-query.asciidoc
@@ -0,0 +1,223 @@
[[query-dsl-knn-query]]
=== Knn query
++++
<titleabbrev>Knn</titleabbrev>
++++

Finds the _k_ nearest vectors to a query vector, as measured by a similarity
metric. The _knn_ query finds the nearest vectors through approximate search on
indexed dense_vectors. The preferred way to do approximate kNN search is through the
<<knn-search,top level knn section>> of a search request. The _knn_ query is reserved
for expert cases where you need to combine it with other queries.

[[knn-query-ex-request]]
==== Example request

. Create an index with a `dense_vector` field.
+
[source,console]
----
PUT my-image-index
{
  "mappings": {
    "properties": {
      "image-vector": {
        "type": "dense_vector",
        "dims": 3,
        "index": true,
        "similarity": "l2_norm"
      },
      "file-type": {
        "type": "keyword"
      }
    }
  }
}
----

. Index your data.
+
[source,console]
----
POST my-image-index/_bulk?refresh=true
{ "index": { "_id": "1" } }
{ "image-vector": [1, 5, -20], "file-type": "jpg" }
{ "index": { "_id": "2" } }
{ "image-vector": [42, 8, -15], "file-type": "png" }
{ "index": { "_id": "3" } }
{ "image-vector": [15, 11, 23], "file-type": "jpg" }
----
//TEST[continued]

. Run the search using the `knn` query, asking for the top 3 nearest vectors.
+
[source,console]
----
POST my-image-index/_search
{
  "size" : 3,
  "query" : {
    "knn": {
      "field": "image-vector",
      "query_vector": [-5, 9, -12],
      "num_candidates": 10
    }
  }
}
----
//TEST[continued]

NOTE: The `knn` query doesn't have a separate `k` parameter. As with other
queries, `k` is defined by the `size` parameter of the search request. The `knn`
query collects `num_candidates` results from each shard, then merges them to get
the top `size` results.
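
The collect-then-merge behaviour described in the note can be illustrated with a small, self-contained Java sketch. This is not {es} code: `KnnMergeSketch`, `Hit`, and the scores are hypothetical names and values chosen for illustration, and each shard is modelled as a plain list of already-scored hits.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class KnnMergeSketch {

    /** A scored hit; a higher score means a nearer vector. Hypothetical type. */
    record Hit(String id, double score) {}

    /** Keeps the best `limit` hits by descending score. */
    static List<Hit> top(List<Hit> hits, int limit) {
        return hits.stream()
            .sorted(Comparator.comparingDouble(Hit::score).reversed())
            .limit(limit)
            .collect(Collectors.toList());
    }

    /** Each shard contributes at most numCandidates hits; the merged list is cut to size. */
    static List<Hit> search(List<List<Hit>> shards, int numCandidates, int size) {
        List<Hit> merged = new ArrayList<>();
        for (List<Hit> shard : shards) {
            merged.addAll(top(shard, numCandidates));
        }
        return top(merged, size);
    }

    static List<String> ids(List<Hit> hits) {
        return hits.stream().map(Hit::id).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<Hit>> shards = List.of(
            List.of(new Hit("a", 0.9), new Hit("b", 0.4), new Hit("c", 0.1)),
            List.of(new Hit("d", 0.8), new Hit("e", 0.7))
        );
        // num_candidates = 2 per shard, size = 3 for the whole request
        System.out.println(ids(search(shards, 2, 3))); // prints [a, d, e]
    }
}
```

In this model, `size` caps the final result while `num_candidates` only caps what each shard contributes, which is why no separate `k` is needed.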


[[knn-query-top-level-parameters]]
==== Top-level parameters for `knn`

`field`::
+
--
(Required, string) The name of the vector field to search against. Must be a
<<index-vectors-knn-search, `dense_vector` field with indexing enabled>>.
--

`query_vector`::
+
--
(Required, array of floats) Query vector. Must have the same number of dimensions
as the vector field you are searching against.
--

`num_candidates`::
+
--
(Required, integer) The number of nearest neighbor candidates to consider per shard.
Cannot exceed 10,000. {es} collects `num_candidates` results from each shard, then
merges them to find the top results. Increasing `num_candidates` tends to improve the
accuracy of the final results.
--

`filter`::
+
--
(Optional, query object) Query to filter the documents that can match.
The kNN search will return the top documents that also match this filter.
The value can be a single query or a list of queries. If `filter` is not provided,
all documents are allowed to match.

The filter is a pre-filter, meaning that it is applied **during** the approximate
kNN search to ensure that `num_candidates` matching documents are returned.
--

`similarity`::
+
--
(Optional, float) The minimum similarity required for a document to be considered
a match. The similarity value is calculated with the raw
<<dense-vector-similarity, `similarity`>> metric in use, not the document score. The matched
documents are then scored according to <<dense-vector-similarity, `similarity`>>
and the provided `boost` is applied.
--

`boost`::
+
--
(Optional, float) Floating point number used to multiply the
scores of matched documents. This value cannot be negative. Defaults to `1.0`.
--

`_name`::
+
--
(Optional, string) Name field to identify the query.
--

[[knn-query-filtering]]
==== Pre-filters and post-filters in knn query

There are two ways to filter documents that match a kNN query:

. **pre-filtering** – the filter is applied during the approximate kNN search
to ensure that `k` matching documents are returned.
. **post-filtering** – the filter is applied after the approximate kNN search
completes, which can result in fewer than `k` results, even when there are enough
matching documents.

Pre-filtering is supported through the `filter` parameter of the `knn` query.
Filters from <<filter-alias,aliases>> are also applied as pre-filters.

All other filters found in the Query DSL tree are applied as post-filters.
In the following example, the `knn` query finds the top 3 documents with the
nearest vectors (`num_candidates=3`), and the `term` filter is then applied to
them as a post-filter. The final set of documents contains only the single
document that passes the post-filter.


[source,console]
----
POST my-image-index/_search
{
  "size" : 10,
  "query" : {
    "bool" : {
      "must" : {
        "knn": {
          "field": "image-vector",
          "query_vector": [-5, 9, -12],
          "num_candidates": 3
        }
      },
      "filter" : {
        "term" : { "file-type" : "png" }
      }
    }
  }
}
----
//TEST[continued]
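
The difference between the two filtering modes can be reproduced offline with a rough Java sketch over the three documents indexed above. It makes several assumptions: an exact brute-force search stands in for the approximate kNN search, squared Euclidean distance stands in for `l2_norm` ranking, and `KnnFilterSketch` and its methods are hypothetical names, not {es} APIs.

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class KnnFilterSketch {

    /** Mirrors the documents indexed into my-image-index above. */
    record Doc(String id, double[] vector, String fileType) {}

    static final List<Doc> DOCS = List.of(
        new Doc("1", new double[] { 1, 5, -20 }, "jpg"),
        new Doc("2", new double[] { 42, 8, -15 }, "png"),
        new Doc("3", new double[] { 15, 11, 23 }, "jpg")
    );

    /** Squared Euclidean distance; smaller means nearer. */
    static double squaredL2(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Exact nearest-neighbour search over the docs accepted by `allowed`. */
    static List<Doc> nearest(double[] query, int candidates, Predicate<Doc> allowed) {
        return DOCS.stream()
            .filter(allowed)
            .sorted(Comparator.comparingDouble((Doc d) -> squaredL2(d.vector(), query)))
            .limit(candidates)
            .collect(Collectors.toList());
    }

    static List<String> ids(List<Doc> docs) {
        return docs.stream().map(Doc::id).collect(Collectors.toList());
    }

    /** Pre-filter: the filter restricts which docs the kNN search may visit. */
    static List<String> preFiltered(double[] query, int candidates, Predicate<Doc> filter) {
        return ids(nearest(query, candidates, filter));
    }

    /** Post-filter: kNN runs unrestricted, then the filter drops non-matching hits. */
    static List<String> postFiltered(double[] query, int candidates, Predicate<Doc> filter) {
        List<Doc> hits = nearest(query, candidates, d -> true);
        return ids(hits.stream().filter(filter).collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        double[] query = { -5, 9, -12 };
        Predicate<Doc> isPng = d -> d.fileType().equals("png");
        System.out.println(postFiltered(query, 3, isPng)); // prints [2]
        System.out.println(postFiltered(query, 2, isPng)); // prints []
        System.out.println(preFiltered(query, 2, isPng));  // prints [2]
    }
}
```

With three candidates the post-filter keeps only document `2`, matching the request above. With two candidates the post-filter returns nothing, because the two nearest documents are both `jpg`, while the pre-filter still finds document `2`.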

[[knn-query-with-nested-query]]
==== Knn query inside a nested query

The `knn` query can be used inside a nested query. The behaviour here is similar
to <<nested-knn-search, top level nested kNN search>>:

* kNN search over nested dense_vectors diversifies the top results over
the top-level document
* `filter` over the top-level document metadata is supported and acts as a
post-filter
* `filter` over `nested` field metadata is not supported

A sample query looks like this:

[source,js]
----
{
  "query" : {
    "nested" : {
      "path" : "paragraph",
      "query" : {
        "knn": {
          "query_vector": [
            0.45,
            45
          ],
          "field": "paragraph.vector",
          "num_candidates": 2
        }
      }
    }
  }
}
----
// NOTCONSOLE

[[knn-query-aggregations]]
==== Knn query with aggregations
The `knn` query calculates aggregations on the top `num_candidates` documents
from each shard. Thus, the final aggregation results cover up to
`num_candidates * number_of_shards` documents. This is different from
the <<knn-search,top level knn section>> where aggregations are
calculated on the global top k nearest documents.
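
To make the difference concrete, here is a rough Java sketch that counts documents per `file-type` in both modes. It only models the behaviour described above and is not how {es} computes aggregations. `KnnAggregationSketch`, `Hit`, and the shard contents are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class KnnAggregationSketch {

    /** A scored hit with one keyword field to aggregate on. Hypothetical type. */
    record Hit(String id, double score, String fileType) {}

    /** Two hypothetical shards with pre-computed similarity scores. */
    static final List<List<Hit>> SHARDS = List.of(
        List.of(new Hit("a", 0.9, "jpg"), new Hit("b", 0.6, "png"), new Hit("c", 0.2, "png")),
        List.of(new Hit("d", 0.8, "jpg"), new Hit("e", 0.5, "png"))
    );

    static List<Hit> top(List<Hit> hits, int limit) {
        return hits.stream()
            .sorted(Comparator.comparingDouble(Hit::score).reversed())
            .limit(limit)
            .collect(Collectors.toList());
    }

    /** Collects the per-shard candidates, at most numCandidates from each shard. */
    static List<Hit> candidates(List<List<Hit>> shards, int numCandidates) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> shard : shards) {
            all.addAll(top(shard, numCandidates));
        }
        return all;
    }

    /** A terms-style aggregation: document count per file type. */
    static Map<String, Long> countByFileType(List<Hit> hits) {
        return hits.stream()
            .collect(Collectors.groupingBy(Hit::fileType, TreeMap::new, Collectors.counting()));
    }

    /** knn query: aggregations see every per-shard candidate. */
    static Map<String, Long> aggregateAsQuery(List<List<Hit>> shards, int numCandidates) {
        return countByFileType(candidates(shards, numCandidates));
    }

    /** Top-level knn section: aggregations see only the global top k hits. */
    static Map<String, Long> aggregateAsTopLevel(List<List<Hit>> shards, int numCandidates, int k) {
        return countByFileType(top(candidates(shards, numCandidates), k));
    }

    public static void main(String[] args) {
        System.out.println(aggregateAsQuery(SHARDS, 2));       // prints {jpg=2, png=2}
        System.out.println(aggregateAsTopLevel(SHARDS, 2, 2)); // prints {jpg=2}
    }
}
```

With two shards and `num_candidates` of 2, the query-style aggregation covers four documents, while the top-level style covers only the two global nearest.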

6 changes: 6 additions & 0 deletions docs/reference/query-dsl/special-queries.asciidoc
@@ -17,6 +17,10 @@ or collection of documents.
This query finds queries that are stored as documents that match with
the specified document.

<<query-dsl-knn-query,`knn` query>>::
A query that finds the _k_ nearest vectors to a query
vector, as measured by a similarity metric.

<<query-dsl-rank-feature-query,`rank_feature` query>>::
A query that computes scores based on the values of numeric features and is
able to efficiently skip non-competitive hits.
@@ -43,6 +47,8 @@ include::mlt-query.asciidoc[]

include::percolate-query.asciidoc[]

include::knn-query.asciidoc[]

include::rank-feature-query.asciidoc[]

include::script-query.asciidoc[]
5 changes: 3 additions & 2 deletions docs/reference/search/search-your-data/knn-search.asciidoc
@@ -43,7 +43,7 @@ based on a similarity metric, the better its match.
{es} supports two methods for kNN search:

* <<approximate-knn,Approximate kNN>> using the `knn` search
- option
+ option or `knn` query

* <<exact-knn,Exact, brute-force kNN>> using a `script_score` query with a
vector function
@@ -129,7 +129,8 @@ POST image-index/_bulk?refresh=true
//TEST[continued]
//TEST[s/\.\.\.//]

- . Run the search using the <<search-api-knn, `knn` option>>.
+ . Run the search using the <<search-api-knn, `knn` option>> or the
+ <<query-dsl-knn-query,`knn` query>> (expert case).
+
[source,console]
----
@@ -9,6 +9,7 @@

import org.apache.lucene.search.join.ScoreMode;
import org.elasticsearch.ElasticsearchException;
import org.elasticsearch.action.index.IndexRequestBuilder;
import org.elasticsearch.action.search.MultiSearchResponse;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.support.WriteRequest;
@@ -22,10 +23,12 @@
import org.elasticsearch.index.query.MatchPhraseQueryBuilder;
import org.elasticsearch.index.query.MultiMatchQueryBuilder;
import org.elasticsearch.index.query.Operator;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.plugins.Plugin;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;
import org.elasticsearch.search.sort.SortOrder;
import org.elasticsearch.search.vectors.KnnVectorQueryBuilder;
import org.elasticsearch.test.ESIntegTestCase;
import org.elasticsearch.xcontent.XContentBuilder;
import org.elasticsearch.xcontent.XContentFactory;
@@ -1295,4 +1298,34 @@ public void testWithWildcardFieldNames() throws Exception {
).get();
assertEquals(1, response.getHits().getTotalHits().value);
}

    public void testKnnQueryNotSupportedInPercolator() throws IOException {
        String mappings = org.elasticsearch.common.Strings.format("""
            {
              "properties": {
                "my_query" : {
                  "type" : "percolator"
                },
                "my_vector" : {
                  "type" : "dense_vector",
                  "dims" : 5,
                  "index" : true,
                  "similarity" : "l2_norm"
                }

              }
            }
            """);
        indicesAdmin().prepareCreate("index1").setMapping(mappings).get();
        ensureGreen();
        QueryBuilder knnVectorQueryBuilder = new KnnVectorQueryBuilder("my_vector", new float[] { 1, 1, 1, 1, 1 }, 10, null);

        IndexRequestBuilder indexRequestBuilder = client().prepareIndex("index1")
            .setId("knn_query1")
            .setSource(jsonBuilder().startObject().field("my_query", knnVectorQueryBuilder).endObject());

        DocumentParsingException exception = expectThrows(DocumentParsingException.class, () -> indexRequestBuilder.get());
        assertThat(exception.getMessage(), containsString("the [knn] query is unsupported inside a percolator"));
    }

}
@@ -61,6 +61,7 @@
import org.elasticsearch.index.query.QueryShardException;
import org.elasticsearch.index.query.Rewriteable;
import org.elasticsearch.index.query.SearchExecutionContext;
import org.elasticsearch.search.vectors.KnnVectorQueryBuilder;
import org.elasticsearch.xcontent.XContentParser;

import java.io.ByteArrayOutputStream;
@@ -438,6 +439,8 @@ static QueryBuilder parseQueryBuilder(DocumentParserContext context) {
throw new IllegalArgumentException("the [has_child] query is unsupported inside a percolator query");
} else if (queryName.equals("has_parent")) {
throw new IllegalArgumentException("the [has_parent] query is unsupported inside a percolator query");
} else if (queryName.equals(KnnVectorQueryBuilder.NAME)) {
throw new IllegalArgumentException("the [knn] query is unsupported inside a percolator query");
Comment on lines +442 to +443
Member:

Is this just because it's too difficult to make work?

Member:

What do you think? By default, knn and percolate don't work together and this requires more investigation. But more importantly, I think it does not make sense to percolate a document against knn query, as knn query matches any single document.

I think this is OK for now.

I will have to think more about if we should support it or not. IMO, it seems like we should. We are just matching the nearest "k", so it seems to fit OK. But I can also see the argument against (as you laid out).

Additionally, we should have a "more_like_this" query that utilizes knn as well.

Maybe open a Github issue to track discussion?

Contributor Author:

This as well, but more importantly, I think it does not make sense to percolate a document against knn query, as knn query matches any single document.

Member:

> as knn query matches any single document.

I understand that semantic similarity has no natural bounds like lexical. But, using this to find semantically similar queries to a stored/new docs is powerful. Especially when you consider hybrid search, and similarity filtering.

Contributor Author:

@benwtrent Let's discuss this offline (about percolator and more_like_this queries) and create github issues if we find them necessary.

I will assume this PR goes in without those queries.

}
});
} catch (IOException e) {