-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Make knn search a query #98916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make knn search a query #98916
Conversation
This introduced a new knn query: - knn query is executed during the Query phase similar to all other queries. - No k parameter, k defaults to size - num_candidates is a size of queue for candidates to consider while search a graph on each shard - For aggregations: "size" results are collected with total = size * shards. Aggregations will see size * shards results. - All filters from DSL are applied as post-filters, except: 1) alias filter is applied as pre-filter or 2) a filter provided as a parameter inside knn query.
Pinging @elastic/es-search (Team:Search) |
Hi @mayya-sharipova, I've created a changelog YAML for you. |
f540d9a
to
8b02234
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a concern about inheriting designs that were required because of how the top-level knn
was. I think we can do better here when adding a new query.
server/src/main/java/org/elasticsearch/search/SearchService.java
Outdated
Show resolved
Hide resolved
public static final ParseField NUM_CANDS_FIELD = new ParseField("num_candidates"); | ||
public static final ParseField QUERY_VECTOR_FIELD = new ParseField("query_vector"); | ||
public static final ParseField QUERY_VECTOR_BUILDER_FIELD = new ParseField("query_vector_builder"); | ||
public static final ParseField VECTOR_SIMILARITY_FIELD = new ParseField("similarity"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not 100% sure we should have similarity
here in the knn
query object.
The reason it was added to the top level knn
object was because we couldn't have a vectorSimilarity
query. But now we have opened the flood gates for allowing more than one kind of vector query.
Do we think we should have similarity
here or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@benwtrent Thanks for the comment.
In my opinion, it seems excessive to introduce a new user facing vectorSimilarity
query just to exclude documents that don't pass certain threshold of a knn query.
I think we either should:
- Remove
similarity
from a knn query, and ask users to usemin_score
search request parameter. But, in this case it will be minimum score (not similarity). This is the way how all other queries can exclude non-relevant docs. - Keep
similarity
parameter in the query.
WDYT of these options? I am ok with going with option#1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
min_score
in the search request applies to the entire search clause. It seems like it misses a key usage where knn
is used within a filter
clause and executed with aggregations. I don't think we can use the search request min_score
as a substitute.
I also see that function_score
has a min_score
parameter :(. But that thing is so complicated.
So, my counter options would be:
- keep it in the query
- we remove it and add a separate one called
vectorSimilarity
.
I am just trying to think of what we would have done if knn
was always a query and we added similarity thresholds later. Would we have made it a separate query or added a parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that min_score
doesn't seem like a good candidate since it's global within a search. If we continue down the path of additional queries to simplify the API then it makes sense to me to have a wrapper for either a min_score query in general or a specific threshold query to wrap a knn query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am ok with having an extra vector_similarity
query that wraps a knn query or semantic_search
query (a new query that we will introduce for query_vector_builder):
"vector_similarity": {
"knn": {
"field": "dense-vector-field",
"num_candidates": 100,
"query_vector" : [...]
},
"similarity" : 10
}
"vector_similarity": {
"semantic_search": {
"field": "dense-vector-field",
"num_candidates": 100,
"model_id": "my-text-embedding-model",
"text": "The opposite of blue"
},
"similarity" : 10
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@giladgal What would be your opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is an obvious correct choice between these two options. Having a parent query seems more organized and then that same query could be used with different types of vector queries, as long as you know which queries it can be used with.
Personally I prefer a parameter than a parent query. I think having all vectors being candidates for any query vector is not how we humans think of similarity and relevance. When the engine returns irrelevant results just because they are the most similar results, it is probably not what the end user wanted. In other words I see the similarity threshold as a natural part of a vector search and I would like for it to be documented so that anyone that reads about vector search immediately becomes aware of that important option. I also like that a parameter is simpler to add and I find it easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @giladgal and others for your comments, it seems the majority of us agree to keep similarity
as a parameter.
server/src/main/java/org/elasticsearch/search/vectors/KnnVectorQueryBuilder.java
Outdated
Show resolved
Hide resolved
@mayya-sharipova could you add a test that ensures the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great change! I added a couple of thoughts inline.
public static final ParseField NUM_CANDS_FIELD = new ParseField("num_candidates"); | ||
public static final ParseField QUERY_VECTOR_FIELD = new ParseField("query_vector"); | ||
public static final ParseField QUERY_VECTOR_BUILDER_FIELD = new ParseField("query_vector_builder"); | ||
public static final ParseField VECTOR_SIMILARITY_FIELD = new ParseField("similarity"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that min_score
doesn't seem like a good candidate since it's global within a search. If we continue down the path of additional queries to simplify the API then it makes sense to me to have a wrapper for either a min_score query in general or a specific threshold query to wrap a knn query.
// Set alias filter, so it can be applied for queries that need it (e.g. knn query) | ||
public void setAliasFilter(QueryBuilder aliasFilter) { | ||
this.aliasFilter = aliasFilter; | ||
} | ||
|
||
public QueryBuilder getAliasFilter() { | ||
return aliasFilter; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if SearchContext
should be made available as part of rewrite
instead of SearchExecutionContext
since alias filter is available as part of that? I do concede this may be too large a change for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback, but looks like indeed this would be too large a change
server/src/main/java/org/elasticsearch/search/vectors/KnnVectorQueryBuilder.java
Outdated
Show resolved
Hide resolved
@benwtrent Great feedback, indeed interesting and necessary test cases, addressed in 9046812 |
field: my_vector | ||
query_vector: [ 1, 1, 1, 1 ] | ||
num_candidates: 5 | ||
filter: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Speaking purely from the viewpoint of a consumer of this API without deep knowledge of the underlying implementation of kNN search in ES:
In the spirit of simplifying the API, is there an opportunity to a) remove the 'filter' field from the knn query and b) have the kNN query be the post-filter instead?
Assuming the current spec changes, if I were to provide a 'term' and a 'knn' clause then I am actually more likely to see fewer results than if I were to provide a 'term' clause and a 'knn' clause with a 'filter', due to kNN trying to satisfy the 'size' requirement as quickly as possible.
More clauses yielding more results feels counter-intuituive in a context other than 'should'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mshameti Thanks for your feedback.
Are you suggesting to always exercise filters for knn query as post filters?
Pre-filtering is very important aspect of knn query that ensures you will get some results, while post-filter can eliminate all results.
'term' and a 'knn' clause then I am actually more likely to see fewer results than if I were to provide a 'term' clause and a 'knn' clause with a 'filter', due to kNN trying to satisfy the 'size' requirement as quickly as possible.
Not sure how is it more likely? It is actually less likely. knn
with a filter
returns the same or less number of results than the same knn query without filter, but never more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking!
I've created a Gist to demonstrate this a bit better hopefully. I've listed out my assumptions in there as well.
https://gist.github.com/mshameti/774639631363457dec310deee4f15766
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More clauses yielding more results feels counter-intuitive in a context other than 'should'
It doesn't return more results. knn
will apply the filter and when combined with other queries, those may be "post filtered" out. The doc count hit may still be k
as we found there are at least k
documents that match that filter and scored them.
But the scored hits returned must satisfy all query provided filters. I am not sure how it returns MORE results. Reading through the gist, I am still not 100% sure I understand the concern.
Could you clarify what you mean by "More clauses yielding more results feels counter-intuitive in a context other than 'should'"?
The scenario with a must
GET /knn-test/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"author": "Martin"
}
},
{
"knn": {
"query_vector": [1, 1, 1, 1],
"field": "paragraph_embedding",
"num_candidates": 1,
"filter": {
"match": {
"author": "Martin"
}
}
}
}
]
}
},
"size": 1
}
This would result in only documents that satisfy
{
"match": {
"author": "Martin"
}
}
And knn
will only score documents where
"filter": {
"match": {
"author": "Martin"
}
}
Is true.
So, it isn't finding MORE documents. It is just scoring different documents, which still matches the typical way a query works, its just that other queries have a "natural pre-filter" which is the matching terms :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @benwtrent! Thanks for the response.
In variant A: I have 2 clauses
- match
- knn
I get 0 hits.
In variant B: I have 3 clauses.
- match
- knn
- knn.match
I get 1 hit.
I mispoke when I said it flat out returns 'more' results.
What I meant to say it returns 'better' results in my scenario of trying to find 1 document authored by 'Martin' among 4 documents that have the same vector.
In Variant A, I've already made my intention pretty clear that I want author: 'Martin'
GET /knn-test/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"author": "Martin"
}
},
{
"knn": {
"query_vector": [1, 1, 1, 1],
"field": "paragraph_embedding",
"num_candidates": 1
}
}
]
}
},
"size": 1
}
and it returns 0 hits in this case, because we're starting with the approximate kNN search, then applying the match.
But if I were to add the additional sub-match pre-filter that behaves like I originally 'intended' Variant A to behave, then I get the document, and that's the odd part imo.
So my original thought was: is there a possibility to align the API around the 'arguably most-likely' user intent and the pattern of any other queries that include 'must'? That would look like:
a. treat the top level match as a pre-filter
b. apply knn as the post-filter
c. remove the knn.filter pre-filter altogether (because we already provided it in a).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a. treat the top level match as a pre-filter
I understand now.
We considered that, but it gets very difficult to get correct and would have non-obvious edge cases.
What do you do in a multiple nested bool
query?
What if a higher level bool
has a knn
query?
Do we ignore should
clauses?
We may in the future consider making the filter optional and dynamically applying the correct or expected pre-filter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great questions! For future use, I created a Gist of the situtations you mentioned and a couple others to help me reason through this. https://gist.github.com/mshameti/2b87802869d83df6b06abade142a922e
I most likely have not covered all the bases but if you do decide to revisit this it's here for future use.
Thanks again for engaging in the discussion, and as the proposed API change is already an improvement to the existing API, please consider this conversation resolved from my end!
@benwtrent @jdconrad This is ready for another round of review with nested support added. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes me so happy!
🎉 🎉 🎉 🎉 🎉
A couple of things we should make sure of:
- Does this work with
function_score
dis_max
etc.? I am think it should without issue, but we should make 100% sure. - How about
percolate
? - Does
pinned
work? (I would think so, since we just rewrite to the scoring docs on the shard...). - We need to add
query-dsl-knn-query.asciidoc
and put it under "specialized queries".
@@ -0,0 +1,212 @@ | |||
setup: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome tests ❤️
- match: { hits.hits.2._id: "3" } | ||
- match: { hits.hits.2.fields.my_name.0: v1 } | ||
--- | ||
"PRE_FILTER: knn query with alias filter as pre-filter": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IT JUST WORKS! I love this.
# no hits because, regardless of num_candidates knn returns top 3 child vectors from distinct parents | ||
# and they don't pass the post-filter | ||
# TODO: fix it on Lucene leve so nested knn respects num_candidates | ||
- match: {hits.total.value: 0} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is indeed tricky. It does satisfy num_candidates
, but how we are pre-filtering things can act weird with other nested queries. You are correct, it is worth revisiting in the future.
Along with allowing nested pre-filtering (filtering on child field values instead of just parent values).
rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/search.vectors/40_knn_search.yml
Show resolved
Hide resolved
public static final ParseField FIELD_FIELD = new ParseField("field"); | ||
public static final ParseField NUM_CANDS_FIELD = new ParseField("num_candidates"); | ||
public static final ParseField QUERY_VECTOR_FIELD = new ParseField("query_vector"); | ||
public static final ParseField VECTOR_SIMILARITY_FIELD = new ParseField("similarity"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mayya-sharipova I am ok having it their for now. It is simplest and keeps uniformity with top-level knn.
If we ever introduce a vector_similarity_query
, we would deprecate this parameter and point folks to use it instead.
@benwtrent Thanks for your review so far.
With the latest commit, I've do the following:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor things.
} else if (queryName.equals(KnnVectorQueryBuilder.NAME)) { | ||
throw new IllegalArgumentException("the [knn] query is unsupported inside a percolator query"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this just because its too difficult to make work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think? By default, knn and percolate don't work together and this requires more investigation. But more importantly, I think it does not make sense to percolate a document against knn query, as knn query matches any single document.
I think this is OK for now.
I will have to think more about if we should support it or not. IMO, it seems like we should. We are just matching the nearest "k", so it seems to fit OK. But I can also see the argument against (as you laid out).
Additionally, we should have a "more_like_this" query that utilizes knn as well.
Maybe open a Github issue to track discussion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This as well, but more importantly, I think it does not make sense to percolate a document against knn query, as knn query matches any single document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as knn query matches any single document.
I understand that semantic similarity has no natural bounds like lexical. But, using this to find semantically similar queries to a stored/new docs is powerful. Especially when you consider hybrid search, and similarity filtering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@benwtrent Let's discuss this offline (about percolator and more_like_this queries) and create github issues if we find them necessary.
I will consider this PR will go without those queries.
...ugin/src/yamlRestTest/resources/rest-api-spec/test/search-business-rules/10_pinned_query.yml
Show resolved
Hide resolved
@elasticmachine run elasticsearch-ci/docs |
1 similar comment
@elasticmachine run elasticsearch-ci/docs |
964d435
to
dc0012c
Compare
Relates to PR elastic#98916 Closes elastic/developer-docs-team#39
I use knn in combination with a "min_score", which does not match every document |
@edwineve You can submit your question in https://discuss.elastic.co/. This PR is closed. |
This adds a new knn query:
Closes #97940