Make knn search a query #98916

mayya-sharipova · 2023-08-27T13:51:02Z

This adds a new knn query:

- The knn query is executed during the Query phase, like all other queries.
- There is no k parameter; k defaults to size.
- num_candidates is the number of closest neighbours that the knn query returns on each shard; it is the size of the queue of candidates considered while searching the graph on that shard.
- For aggregations: "size" results are collected on each shard, so the total is size * shards. Aggregations will see size * shards results.
- All filters from the DSL are applied as post-filters, except: 1) the alias filter, which is applied as a pre-filter, and 2) a filter provided as a parameter inside the knn query.
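As a back-of-the-envelope illustration of the aggregation note above, the sketch below computes how many documents aggregations would see if every shard contributes up to size hits. The class name, method name, and shard figures are hypothetical; this is arithmetic only, not the collection logic in the PR.

```java
// Back-of-the-envelope sketch; the names and shard figures are hypothetical.
class ShardCollectionSketch {

    // Upper bound on the documents aggregations see when each shard
    // contributes up to `size` knn hits.
    static int docsSeenByAggregations(int[] matchingDocsPerShard, int size) {
        int total = 0;
        for (int matches : matchingDocsPerShard) {
            total += Math.min(matches, size);
        }
        return total;
    }

    public static void main(String[] args) {
        // Three shards, each holding at least `size` candidate documents: 3 * 10.
        System.out.println(docsSeenByAggregations(new int[] {100, 250, 40}, 10)); // prints 30
        // A shard with fewer than `size` documents contributes only what it has.
        System.out.println(docsSeenByAggregations(new int[] {100, 4}, 10)); // prints 14
    }
}
```

With size = 10 and three well-populated shards, aggregations operate on 30 documents even though the response returns only 10 hits.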

elasticsearchmachine · 2023-08-27T13:51:26Z

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine · 2023-08-27T13:51:27Z

Hi @mayya-sharipova, I've created a changelog YAML for you.

benwtrent

I have a concern about inheriting designs that were required because of how the top-level knn was. I think we can do better here when adding a new query.

server/src/main/java/org/elasticsearch/search/SearchService.java

benwtrent · 2023-08-29T17:23:31Z

server/src/main/java/org/elasticsearch/search/vectors/KnnVectorQueryBuilder.java

+    public static final ParseField NUM_CANDS_FIELD = new ParseField("num_candidates");
+    public static final ParseField QUERY_VECTOR_FIELD = new ParseField("query_vector");
+    public static final ParseField QUERY_VECTOR_BUILDER_FIELD = new ParseField("query_vector_builder");
+    public static final ParseField VECTOR_SIMILARITY_FIELD = new ParseField("similarity");


I am not 100% sure we should have similarity here in the knn query object.

The reason it was added to the top level knn object was because we couldn't have a vectorSimilarity query. But now we have opened the flood gates for allowing more than one kind of vector query.

Do we think we should have similarity here or not?

@benwtrent Thanks for the comment.
In my opinion, it seems excessive to introduce a new user-facing vectorSimilarity query just to exclude documents that don't pass a certain threshold of a knn query.

I think we should either:

1. Remove similarity from the knn query, and ask users to use the min_score search request parameter. But in this case it will be a minimum score (not a similarity). This is how all other queries can exclude non-relevant docs.

2. Keep the similarity parameter in the query.

WDYT of these options? I am ok with going with option #1.

min_score in the search request applies to the entire search clause. It seems like it misses a key usage where knn is used within a filter clause and executed with aggregations. I don't think we can use the search request min_score as a substitute.

I also see that function_score has a min_score parameter :(. But that thing is so complicated.

So, my counter options would be:

1. Keep it in the query.

2. Remove it and add a separate query called vectorSimilarity.

I am just trying to think of what we would have done if knn was always a query and we added similarity thresholds later. Would we have made it a separate query or added a parameter?

I agree that min_score doesn't seem like a good candidate since it's global within a search. If we continue down the path of additional queries to simplify the API then it makes sense to me to have a wrapper for either a min_score query in general or a specific threshold query to wrap a knn query.

I am ok with having an extra vector_similarity query that wraps a knn query or semantic_search query (a new query that we will introduce for query_vector_builder):

"vector_similarity": { "knn": { "field": "dense-vector-field", "num_candidates": 100, "query_vector" : [...] }, "similarity" : 10 }

"vector_similarity": { "semantic_search": { "field": "dense-vector-field", "num_candidates": 100, "model_id": "my-text-embedding-model", "text": "The opposite of blue" }, "similarity" : 10 }

@giladgal What would be your opinion?

I don't think there is an obvious correct choice between these two options. Having a parent query seems more organized and then that same query could be used with different types of vector queries, as long as you know which queries it can be used with.

Personally I prefer a parameter to a parent query. I think having all vectors being candidates for any query vector is not how we humans think of similarity and relevance. When the engine returns irrelevant results just because they are the most similar results, it is probably not what the end user wanted. In other words, I see the similarity threshold as a natural part of a vector search, and I would like it to be documented so that anyone who reads about vector search immediately becomes aware of that important option. I also like that a parameter is simpler to add, and I find it easier to read.

Thanks @giladgal and others for your comments, it seems the majority of us agree to keep similarity as a parameter.
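To make the parameter option concrete, here is a minimal sketch of a nearest-neighbour search that takes a similarity threshold as an argument. It assumes cosine similarity and brute-force scoring purely for illustration; the class and method names are hypothetical, and the semantics of the real similarity parameter are not modelled here.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Sketch only: cosine similarity and every name here are illustrative assumptions.
class SimilarityThresholdSketch {

    // Cosine similarity between two vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Indices of the k most similar vectors whose similarity is at least minSimilarity.
    static List<Integer> search(double[][] vectors, double[] query, int k, double minSimilarity) {
        return IntStream.range(0, vectors.length)
            .boxed()
            .filter(i -> cosine(vectors[i], query) >= minSimilarity)
            .sorted(Comparator.comparingDouble((Integer i) -> -cosine(vectors[i], query)))
            .limit(k)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        double[][] vectors = { {1, 0}, {1, 1}, {0, 1}, {-1, 0} };
        double[] query = {1, 0};
        // Without a threshold, the three nearest vectors are returned however dissimilar they are.
        System.out.println(search(vectors, query, 3, Double.NEGATIVE_INFINITY)); // prints [0, 1, 2]
        // With a threshold of 0.5, the orthogonal vector at index 2 is dropped.
        System.out.println(search(vectors, query, 3, 0.5)); // prints [0, 1]
    }
}
```

The second call shows the point made above: without a threshold, a vector orthogonal to the query is still returned simply because it is among the k nearest.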

server/src/main/java/org/elasticsearch/search/vectors/KnnVectorQueryBuilder.java

benwtrent · 2023-08-31T14:03:54Z

@mayya-sharipova could you add a test that ensures the knn query supports _name? It would be good to know which knn vectors contributed to a score or not. The top level knn currently doesn't support this.

jdconrad

This is a great change! I added a couple of thoughts inline.

jdconrad · 2023-08-31T16:14:41Z

server/src/main/java/org/elasticsearch/search/vectors/KnnVectorQueryBuilder.java

+    public static final ParseField NUM_CANDS_FIELD = new ParseField("num_candidates");
+    public static final ParseField QUERY_VECTOR_FIELD = new ParseField("query_vector");
+    public static final ParseField QUERY_VECTOR_BUILDER_FIELD = new ParseField("query_vector_builder");
+    public static final ParseField VECTOR_SIMILARITY_FIELD = new ParseField("similarity");


I agree that min_score doesn't seem like a good candidate since it's global within a search. If we continue down the path of additional queries to simplify the API then it makes sense to me to have a wrapper for either a min_score query in general or a specific threshold query to wrap a knn query.

jdconrad · 2023-08-31T16:19:33Z

server/src/main/java/org/elasticsearch/index/query/SearchExecutionContext.java

+    // Set alias filter, so it can be applied for queries that need it (e.g. knn query)
+    public void setAliasFilter(QueryBuilder aliasFilter) {
+        this.aliasFilter = aliasFilter;
+    }
+
+    public QueryBuilder getAliasFilter() {
+        return aliasFilter;
+    }
+


I wonder if SearchContext should be made available as part of rewrite instead of SearchExecutionContext since alias filter is available as part of that? I do concede this may be too large a change for now.

Thanks for the feedback, but it looks like this would indeed be too large a change.

server/src/main/java/org/elasticsearch/search/vectors/KnnVectorQueryBuilder.java

mayya-sharipova · 2023-09-01T18:44:54Z

> @mayya-sharipova could you add a test that ensures the knn query supports _name? It would be good to know which knn vectors contributed to a score or not. The top level knn currently doesn't support this.

@benwtrent Great feedback, indeed interesting and necessary test cases, addressed in 9046812

mshameti · 2023-09-03T22:28:12Z

...ec/src/yamlRestTest/resources/rest-api-spec/test/search.vectors/90_knn_query_with_filter.yml

+                    field: my_vector
+                    query_vector: [ 1, 1, 1, 1 ]
+                    num_candidates: 5
+                    filter:


Speaking purely from the viewpoint of a consumer of this API without deep knowledge of the underlying implementation of kNN search in ES:

In the spirit of simplifying the API, is there an opportunity to a) remove the 'filter' field from the knn query and b) have the kNN query be the post-filter instead?

Assuming the current spec changes, if I were to provide a 'term' and a 'knn' clause then I am actually more likely to see fewer results than if I were to provide a 'term' clause and a 'knn' clause with a 'filter', due to kNN trying to satisfy the 'size' requirement as quickly as possible.

More clauses yielding more results feels counter-intuitive in a context other than 'should'.

@mshameti Thanks for your feedback.

Are you suggesting that filters for the knn query should always be applied as post-filters?
Pre-filtering is a very important aspect of the knn query: it ensures you will get some results, while a post-filter can eliminate all results.

> 'term' and a 'knn' clause then I am actually more likely to see fewer results than if I were to provide a 'term' clause and a 'knn' clause with a 'filter', due to kNN trying to satisfy the 'size' requirement as quickly as possible.

I'm not sure how it is more likely; it is actually less likely. knn with a filter returns the same number of results as the same knn query without a filter, or fewer, but never more.

Thanks for looking!

I've created a Gist to demonstrate this a bit better hopefully. I've listed out my assumptions in there as well.

https://gist.github.com/mshameti/774639631363457dec310deee4f15766

> More clauses yielding more results feels counter-intuitive in a context other than 'should'

It doesn't return more results. knn will apply the filter and when combined with other queries, those may be "post filtered" out. The doc count hit may still be k as we found there are at least k documents that match that filter and scored them.

But the scored hits returned must satisfy all query provided filters. I am not sure how it returns MORE results. Reading through the gist, I am still not 100% sure I understand the concern.

Could you clarify what you mean by "More clauses yielding more results feels counter-intuitive in a context other than 'should'"?

The scenario with a must

GET /knn-test/_search { "query": { "bool": { "must": [ { "match": { "author": "Martin" } }, { "knn": { "query_vector": [1, 1, 1, 1], "field": "paragraph_embedding", "num_candidates": 1, "filter": { "match": { "author": "Martin" } } } } ] } }, "size": 1 }

This would result in only documents that satisfy

{ "match": { "author": "Martin" } }

And knn will only score documents where

"filter": { "match": { "author": "Martin" } }

Is true.

So, it isn't finding MORE documents. It is just scoring different documents, which still matches the typical way a query works; it's just that other queries have a "natural pre-filter", which is the matching terms :).

Hi @benwtrent! Thanks for the response.

In variant A, I have 2 clauses:

1. match
2. knn

I get 0 hits.

In variant B, I have 3 clauses:

1. match
2. knn
3. knn.match

I get 1 hit.

I misspoke when I said it flat out returns 'more' results.

What I meant to say is that it returns 'better' results in my scenario of trying to find 1 document authored by 'Martin' among 4 documents that have the same vector.

In Variant A, I've already made my intention pretty clear that I want author: 'Martin'

GET /knn-test/_search { "query": { "bool": { "must": [ { "match": { "author": "Martin" } }, { "knn": { "query_vector": [1, 1, 1, 1], "field": "paragraph_embedding", "num_candidates": 1 } } ] } }, "size": 1 }

and it returns 0 hits in this case, because we're starting with the approximate kNN search, then applying the match.

But if I were to add the additional sub-match pre-filter that behaves like I originally 'intended' Variant A to behave, then I get the document, and that's the odd part imo.
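The difference between the two variants can be reproduced with a toy model, sketched below. It is not Elasticsearch code: the search is exact rather than approximate, the Doc class and sample data are hypothetical, and the documents are given distinct distances (rather than identical vectors, as in the scenario above) so that the outcome is deterministic.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Toy model only: Doc, nearest() and the sample data are hypothetical,
// and the search is exact rather than approximate.
class KnnFilterOrderSketch {

    static final class Doc {
        final String id;
        final String author;
        final double distance; // distance to the query vector; smaller is closer

        Doc(String id, String author, double distance) {
            this.id = id;
            this.author = author;
            this.distance = distance;
        }
    }

    // Top-k closest documents among those accepted by the pre-filter.
    static List<Doc> nearest(List<Doc> docs, int k, Predicate<Doc> preFilter) {
        return docs.stream()
            .filter(preFilter)
            .sorted(Comparator.comparingDouble((Doc d) -> d.distance))
            .limit(k)
            .collect(Collectors.toList());
    }

    // Variant A: knn has no filter; the sibling must clause prunes its hits afterwards.
    static List<Doc> variantA(List<Doc> docs, int k, Predicate<Doc> mustClause) {
        return nearest(docs, k, d -> true).stream()
            .filter(mustClause)
            .collect(Collectors.toList());
    }

    // Variant B: the same clause is also given to knn as its filter, so knn only
    // scores matching documents before the must clause is applied.
    static List<Doc> variantB(List<Doc> docs, int k, Predicate<Doc> mustClause) {
        return nearest(docs, k, mustClause).stream()
            .filter(mustClause)
            .collect(Collectors.toList());
    }

    static List<Doc> sampleDocs() {
        return Arrays.asList(
            new Doc("1", "Anna", 0.1),
            new Doc("2", "Martin", 0.2),
            new Doc("3", "Lee", 0.3),
            new Doc("4", "Omar", 0.4)
        );
    }

    public static void main(String[] args) {
        Predicate<Doc> byMartin = d -> d.author.equals("Martin");
        List<Doc> docs = sampleDocs();
        System.out.println("variant A hits: " + variantA(docs, 1, byMartin).size()); // prints variant A hits: 0
        System.out.println("variant B hits: " + variantB(docs, 1, byMartin).size()); // prints variant B hits: 1
    }
}
```

In variant A the single nearest document is not by Martin, so the must clause removes it and nothing is left; in variant B the nearest document is chosen only among Martin's documents, so one hit survives.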

So my original thought was: is there a possibility to align the API around the 'arguably most-likely' user intent and the pattern of any other queries that include 'must'? That would look like:

a. treat the top level match as a pre-filter
b. apply knn as the post-filter
c. remove the knn.filter pre-filter altogether (because we already provided it in a).

> a. treat the top level match as a pre-filter

I understand now.

We considered that, but it gets very difficult to get correct and would have non-obvious edge cases.

What do you do in a multiple nested bool query?

What if a higher level bool has a knn query?

Do we ignore should clauses?

We may in the future consider making the filter optional and dynamically applying the correct or expected pre-filter.

Great questions! For future use, I created a Gist of the situations you mentioned and a couple of others to help me reason through this. https://gist.github.com/mshameti/2b87802869d83df6b06abade142a922e

I most likely have not covered all the bases but if you do decide to revisit this it's here for future use.

Thanks again for engaging in the discussion, and as the proposed API change is already an improvement to the existing API, please consider this conversation resolved from my end!

mayya-sharipova · 2023-10-26T19:17:57Z

@benwtrent @jdconrad This is ready for another round of review with nested support added.

benwtrent

This makes me so happy!

🎉 🎉 🎉 🎉 🎉

A couple of things we should make sure of:

Does this work with function_score, dis_max etc.? I think it should without issue, but we should make 100% sure.
How about percolate?
Does pinned work? (I would think so, since we just rewrite to the scoring docs on the shard...).
We need to add query-dsl-knn-query.asciidoc and put it under "specialized queries".

benwtrent · 2023-10-26T18:59:46Z

...src/yamlRestTest/resources/rest-api-spec/test/search.vectors/130_knn_query_nested_search.yml

@@ -0,0 +1,212 @@
+setup:


Awesome tests ❤️

benwtrent · 2023-10-26T19:29:17Z

...c/src/yamlRestTest/resources/rest-api-spec/test/search.vectors/110_knn_query_with_filter.yml

+  - match: { hits.hits.2._id: "3" }
+  - match: { hits.hits.2.fields.my_name.0: v1 }
+---
+"PRE_FILTER: knn query with alias filter as pre-filter":


IT JUST WORKS! I love this.

benwtrent · 2023-10-26T19:31:45Z

...src/yamlRestTest/resources/rest-api-spec/test/search.vectors/130_knn_query_nested_search.yml

+  # no hits because, regardless of num_candidates knn returns top 3 child vectors from distinct parents
+  # and they don't pass the post-filter
+  # TODO: fix it on Lucene leve so nested knn respects num_candidates
+  - match: {hits.total.value: 0}


This is indeed tricky. It does satisfy num_candidates, but how we are pre-filtering things can act weird with other nested queries. You are correct, it is worth revisiting in the future.

Along with allowing nested pre-filtering (filtering on child field values instead of just parent values).

rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/search.vectors/40_knn_search.yml

benwtrent · 2023-10-26T19:34:34Z

server/src/main/java/org/elasticsearch/search/vectors/KnnVectorQueryBuilder.java

+    public static final ParseField FIELD_FIELD = new ParseField("field");
+    public static final ParseField NUM_CANDS_FIELD = new ParseField("num_candidates");
+    public static final ParseField QUERY_VECTOR_FIELD = new ParseField("query_vector");
+    public static final ParseField VECTOR_SIMILARITY_FIELD = new ParseField("similarity");


@mayya-sharipova I am ok having it there for now. It is simplest and keeps uniformity with top-level knn.

If we ever introduce a vector_similarity_query, we would deprecate this parameter and point folks to use it instead.

mayya-sharipova · 2023-10-30T18:08:20Z

@benwtrent Thanks for your review so far.

> A couple of things we should make sure of:
> Does this work with function_score, dis_max etc.? I think it should without issue, but we should make 100% sure.
> How about percolate?
> Does pinned work? (I would think so, since we just rewrite to the scoring docs on the shard...).
> We need to add query-dsl-knn-query.asciidoc and put it under "specialized queries".

With the latest commit, I've done the following:

Added more tests with other queries such as function_score, dis_max and pinned queries.
Modified percolate query NOT to accept knn as an internal query. What do you think? By default, knn and percolate don't work together and this requires more investigation. But more importantly, I think it does not make sense to percolate a document against knn query, as knn query matches any single document.
Added documentation for a new knn query and also put this query under "specialized queries".

benwtrent

Some minor things.

docs/reference/query-dsl/knn-query.asciidoc

benwtrent · 2023-10-30T18:16:20Z

modules/percolator/src/main/java/org/elasticsearch/percolator/PercolatorFieldMapper.java

+                } else if (queryName.equals(KnnVectorQueryBuilder.NAME)) {
+                    throw new IllegalArgumentException("the [knn] query is unsupported inside a percolator query");


Is this just because it's too difficult to make work?

> What do you think? By default, knn and percolate don't work together and this requires more investigation. But more importantly, I think it does not make sense to percolate a document against knn query, as knn query matches any single document.

I think this is OK for now.

I will have to think more about if we should support it or not. IMO, it seems like we should. We are just matching the nearest "k", so it seems to fit OK. But I can also see the argument against (as you laid out).

Additionally, we should have a "more_like_this" query that utilizes knn as well.

Maybe open a GitHub issue to track discussion?

> This as well, but more importantly, I think it does not make sense to percolate a document against knn query, as knn query matches any single document.

> as knn query matches any single document.

I understand that semantic similarity has no natural bounds like lexical. But using this to find queries that are semantically similar to stored/new docs is powerful, especially when you consider hybrid search and similarity filtering.

@benwtrent Let's discuss this offline (about percolator and more_like_this queries) and create GitHub issues if we find them necessary.

I will assume this PR will go ahead without those queries.

...ugin/src/yamlRestTest/resources/rest-api-spec/test/search-business-rules/10_pinned_query.yml

mayya-sharipova · 2023-10-31T15:48:45Z

@elasticmachine run elasticsearch-ci/docs

mayya-sharipova · 2023-10-31T18:16:17Z

@elasticmachine run elasticsearch-ci/docs

docs/reference/query-dsl/knn-query.asciidoc

edwineve · 2024-05-08T09:30:39Z

> This as well, but more importantly, I think it does not make sense to percolate a document against knn query, as knn query matches any single document.

I use knn in combination with a "min_score", which does not match every document

mayya-sharipova · 2024-05-08T10:58:02Z

@edwineve You can submit your question in https://discuss.elastic.co/. This PR is closed.

mayya-sharipova added >feature :Search Relevance/Vectors Vector search v8.11.0 labels Aug 27, 2023

elasticsearchmachine added the Team:Search Meta label for search team label Aug 27, 2023

Update docs/changelog/98916.yaml

5aca3c2

Merge remote-tracking branch 'upstream/main' into knn-as-query

897266a

mayya-sharipova marked this pull request as draft August 27, 2023 21:06

Correct transport version, remove byteQueryVector

8b02234

mayya-sharipova force-pushed the knn-as-query branch from f540d9a to 8b02234 Compare August 29, 2023 13:18

mayya-sharipova added 2 commits August 29, 2023 11:24

Merge remote-tracking branch 'upstream/main' into knn-as-query

88c3233

Other adjustments

49f2b80

mayya-sharipova marked this pull request as ready for review August 29, 2023 16:47

mayya-sharipova requested review from benwtrent and jdconrad August 29, 2023 16:47

benwtrent reviewed Aug 29, 2023

mayya-sharipova added 2 commits August 29, 2023 15:58

Add aliasFilter to the SearchExecutionContext instead of QueryBuilder

9b1e7c8

Remove query_vector_builder

cf1c0ae

jdconrad reviewed Aug 31, 2023

benwtrent reviewed Aug 31, 2023

server/src/main/java/org/elasticsearch/search/vectors/KnnVectorQueryBuilder.java (outdated)

mayya-sharipova added 4 commits September 1, 2023 13:41

Add query _name to tests

9046812

Add filter alias during doToQuery

8a4cb9b

Merge remote-tracking branch 'upstream/main' into knn-as-query

dccd3c7

Simplify over-wire protocol

d4a5758

mshameti mentioned this pull request Sep 3, 2023

Added the similarity parameter to the KnnQuery type elastic/elasticsearch-specification#2261

Merged

mshameti reviewed Sep 3, 2023

benwtrent approved these changes Oct 26, 2023

mayya-sharipova added 2 commits October 30, 2023 13:56

Add documentation and other queries

b3ca311

Merge remote-tracking branch 'upstream/main' into knn-as-query

21f40b4

benwtrent approved these changes Oct 30, 2023

mayya-sharipova added 3 commits October 31, 2023 10:29

Updates to knn-query documentation

9d825cf

Merge remote-tracking branch 'upstream/main' into knn-as-query

e0ae1dc

Merge remote-tracking branch 'upstream/main' into knn-as-query

348305f

Adjust docs

dc0012c

mayya-sharipova force-pushed the knn-as-query branch from 964d435 to dc0012c Compare October 31, 2023 18:46

Merge remote-tracking branch 'upstream/main' into knn-as-query

6876c5a

abdonpijpelink reviewed Nov 1, 2023

docs/reference/query-dsl/knn-query.asciidoc (outdated)

mayya-sharipova added 2 commits November 1, 2023 09:55

Merge branch 'main' into knn-as-query

c4b80ba

Fix an error in docs

6450681

mayya-sharipova merged commit 61c7483 into elastic:main Nov 1, 2023

mayya-sharipova deleted the knn-as-query branch November 1, 2023 18:21

mayya-sharipova mentioned this pull request Nov 20, 2023

Make knn search as a query #97940

Closed

mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this pull request Jan 18, 2024

Add hybrid search to knn query documentation

0e4aabd

Relates to PR elastic#98916 Closes elastic/developer-docs-team#39

mayya-sharipova mentioned this pull request Jan 18, 2024

Add hybrid search to knn query documentation #104562

Merged

mayya-sharipova added a commit that referenced this pull request Jan 18, 2024

Add hybrid search to knn query documentation (#104562)

669d4ae

Relates to PR #98916 Closes elastic/developer-docs-team#39

mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this pull request Jan 18, 2024

Add hybrid search to knn query documentation (elastic#104562)

add0f9b

Relates to PR elastic#98916 Closes elastic/developer-docs-team#39

elasticsearchmachine pushed a commit that referenced this pull request Jan 18, 2024

Add hybrid search to knn query documentation (#104562) (#104565)

5cf2adb

Relates to PR #98916 Closes elastic/developer-docs-team#39

pmpailis mentioned this pull request Feb 15, 2024

Adding support for hex-encoded byte vectors on knn-search #105393

Merged

israellias mentioned this pull request Apr 7, 2024

Add a knn method to elasticsearch_dsl.search.Search elastic/elasticsearch-dsl-py#1691

Merged
