Skip to content

Make knn search a query #98916

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 24 commits into from
Nov 1, 2023
Merged

Conversation

mayya-sharipova
Copy link
Contributor

@mayya-sharipova mayya-sharipova commented Aug 27, 2023

This adds a new knn query:

  • knn query is executed during the Query phase similar to all other queries.
  • No k parameter, k defaults to size
  • num_candidates - number of closest neighbours that knn query returns on each shard
  • For aggregations: "size" results are collected with total = size * shards. Aggregations will see size * shards results.
  • All filters from DSL are applied as post-filters, except: 1) alias filter is applied as pre-filter or 2) a filter provided as a parameter inside knn query.

Closes #97940

This introduced a new knn query:
- knn query is executed during the Query phase similar to all other queries.
- No k parameter, k defaults to  size
- num_candidates is a size of queue for candidates to consider while
  search a graph on each shard
- For aggregations: "size" results are collected with total = size * shards.
   Aggregations will see size * shards results.
- All filters from DSL are applied as post-filters, except: 1) alias filter
 is applied as  pre-filter or 2) a filter provided as a parameter
 inside knn query.
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine
Copy link
Collaborator

Hi @mayya-sharipova, I've created a changelog YAML for you.

@mayya-sharipova mayya-sharipova marked this pull request as draft August 27, 2023 21:06
@mayya-sharipova mayya-sharipova marked this pull request as ready for review August 29, 2023 16:47
Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a concern about inheriting designs that were required because of how the top-level knn was. I think we can do better here when adding a new query.

public static final ParseField NUM_CANDS_FIELD = new ParseField("num_candidates");
public static final ParseField QUERY_VECTOR_FIELD = new ParseField("query_vector");
public static final ParseField QUERY_VECTOR_BUILDER_FIELD = new ParseField("query_vector_builder");
public static final ParseField VECTOR_SIMILARITY_FIELD = new ParseField("similarity");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not 100% sure we should have similarity here in the knn query object.

The reason it was added to the top level knn object was because we couldn't have a vectorSimilarity query. But now we have opened the flood gates for allowing more than one kind of vector query.

Do we think we should have similarity here or not?

Copy link
Contributor Author

@mayya-sharipova mayya-sharipova Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benwtrent Thanks for the comment.
In my opinion, it seems excessive to introduce a new user facing vectorSimilarity query just to exclude documents that don't pass certain threshold of a knn query.

I think we either should:

  1. Remove similarity from a knn query, and ask users to use min_score search request parameter. But, in this case it will be minimum score (not similarity). This is the way how all other queries can exclude non-relevant docs.
  2. Keep similarity parameter in the query.

WDYT of these options? I am ok with going with option#1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

min_score in the search request applies to the entire search clause. It seems like it misses a key usage where knn is used within a filter clause and executed with aggregations. I don't think we can use the search request min_score as a substitute.

I also see that function_score has a min_score parameter :(. But that thing is so complicated.

So, my counter options would be:

  • keep it in the query
  • we remove it and add a separate one called vectorSimilarity.

I am just trying to think of what we would have done if knn was always a query and we added similarity thresholds later. Would we have made it a separate query or added a parameter?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that min_score doesn't seem like a good candidate since it's global within a search. If we continue down the path of additional queries to simplify the API then it makes sense to me to have a wrapper for either a min_score query in general or a specific threshold query to wrap a knn query.

Copy link
Contributor Author

@mayya-sharipova mayya-sharipova Sep 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with having an extra vector_similarity query that wraps a knn query or semantic_search query (a new query that we will introduce for query_vector_builder):

  "vector_similarity": {
    "knn": {
	    "field": "dense-vector-field",
	    "num_candidates": 100,
	    "query_vector" : [...]
     },
    "similarity" : 10
  }
  "vector_similarity": {
    "semantic_search": {
	    "field": "dense-vector-field",
	    "num_candidates": 100,
	    "model_id": "my-text-embedding-model",
	    "text": "The opposite of blue"
    },
   "similarity" : 10
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@giladgal What would be your opinion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is an obvious correct choice between these two options. Having a parent query seems more organized and then that same query could be used with different types of vector queries, as long as you know which queries it can be used with.

Personally I prefer a parameter than a parent query. I think having all vectors being candidates for any query vector is not how we humans think of similarity and relevance. When the engine returns irrelevant results just because they are the most similar results, it is probably not what the end user wanted. In other words I see the similarity threshold as a natural part of a vector search and I would like for it to be documented so that anyone that reads about vector search immediately becomes aware of that important option. I also like that a parameter is simpler to add and I find it easier to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @giladgal and others for your comments, it seems the majority of us agree to keep similarity as a parameter.

@benwtrent
Copy link
Member

@mayya-sharipova could you add a test that ensures the knn query supports _name? It would be good to know which knn vectors contributed to a score or not. The top level knn currently doesn't support this.

Copy link
Contributor

@jdconrad jdconrad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great change! I added a couple of thoughts inline.

public static final ParseField NUM_CANDS_FIELD = new ParseField("num_candidates");
public static final ParseField QUERY_VECTOR_FIELD = new ParseField("query_vector");
public static final ParseField QUERY_VECTOR_BUILDER_FIELD = new ParseField("query_vector_builder");
public static final ParseField VECTOR_SIMILARITY_FIELD = new ParseField("similarity");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that min_score doesn't seem like a good candidate since it's global within a search. If we continue down the path of additional queries to simplify the API then it makes sense to me to have a wrapper for either a min_score query in general or a specific threshold query to wrap a knn query.

Comment on lines +233 to +241
// Set alias filter, so it can be applied for queries that need it (e.g. knn query)
public void setAliasFilter(QueryBuilder aliasFilter) {
this.aliasFilter = aliasFilter;
}

public QueryBuilder getAliasFilter() {
return aliasFilter;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if SearchContext should be made available as part of rewrite instead of SearchExecutionContext since alias filter is available as part of that? I do concede this may be too large a change for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback, but looks like indeed this would be too large a change

@mayya-sharipova
Copy link
Contributor Author

@mayya-sharipova could you add a test that ensures the knn query supports _name? It would be good to know which knn vectors contributed to a score or not. The top level knn currently doesn't support this.

@benwtrent Great feedback, indeed interesting and necessary test cases, addressed in 9046812

field: my_vector
query_vector: [ 1, 1, 1, 1 ]
num_candidates: 5
filter:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking purely from the viewpoint of a consumer of this API without deep knowledge of the underlying implementation of kNN search in ES:

In the spirit of simplifying the API, is there an opportunity to a) remove the 'filter' field from the knn query and b) have the kNN query be the post-filter instead?

Assuming the current spec changes, if I were to provide a 'term' and a 'knn' clause then I am actually more likely to see fewer results than if I were to provide a 'term' clause and a 'knn' clause with a 'filter', due to kNN trying to satisfy the 'size' requirement as quickly as possible.

More clauses yielding more results feels counter-intuituive in a context other than 'should'

Copy link
Contributor Author

@mayya-sharipova mayya-sharipova Sep 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mshameti Thanks for your feedback.

Are you suggesting to always exercise filters for knn query as post filters?
Pre-filtering is very important aspect of knn query that ensures you will get some results, while post-filter can eliminate all results.

'term' and a 'knn' clause then I am actually more likely to see fewer results than if I were to provide a 'term' clause and a 'knn' clause with a 'filter', due to kNN trying to satisfy the 'size' requirement as quickly as possible.

Not sure how is it more likely? It is actually less likely. knn with a filter returns the same or less number of results than the same knn query without filter, but never more.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking!

I've created a Gist to demonstrate this a bit better hopefully. I've listed out my assumptions in there as well.

https://gist.github.com/mshameti/774639631363457dec310deee4f15766

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More clauses yielding more results feels counter-intuitive in a context other than 'should'

It doesn't return more results. knn will apply the filter and when combined with other queries, those may be "post filtered" out. The doc count hit may still be k as we found there are at least k documents that match that filter and scored them.

But the scored hits returned must satisfy all query provided filters. I am not sure how it returns MORE results. Reading through the gist, I am still not 100% sure I understand the concern.

Could you clarify what you mean by "More clauses yielding more results feels counter-intuitive in a context other than 'should'"?

The scenario with a must

GET /knn-test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "author": "Martin"
          }
        },
        {
          "knn": {
            "query_vector": [1, 1, 1, 1],
            "field": "paragraph_embedding",
            "num_candidates": 1,
            "filter": {
              "match": {
                "author": "Martin"
              }
            }
          }
        }
      ]
    }
  },
  "size": 1
}

This would result in only documents that satisfy

{
          "match": {
            "author": "Martin"
          }
        }

And knn will only score documents where

"filter": {
              "match": {
                "author": "Martin"
              }
            }

Is true.

So, it isn't finding MORE documents. It is just scoring different documents, which still matches the typical way a query works, its just that other queries have a "natural pre-filter" which is the matching terms :).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @benwtrent! Thanks for the response.

In variant A: I have 2 clauses

  • match
  • knn

I get 0 hits.

In variant B: I have 3 clauses.

  • match
  • knn
  • knn.match

I get 1 hit.

I mispoke when I said it flat out returns 'more' results.

What I meant to say it returns 'better' results in my scenario of trying to find 1 document authored by 'Martin' among 4 documents that have the same vector.

In Variant A, I've already made my intention pretty clear that I want author: 'Martin'

GET /knn-test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "author": "Martin"
          }
        },
        {
          "knn": {
            "query_vector": [1, 1, 1, 1],
            "field": "paragraph_embedding",
            "num_candidates": 1
          }
        }
      ]
    }
  },
  "size": 1
}

and it returns 0 hits in this case, because we're starting with the approximate kNN search, then applying the match.

But if I were to add the additional sub-match pre-filter that behaves like I originally 'intended' Variant A to behave, then I get the document, and that's the odd part imo.

So my original thought was: is there a possibility to align the API around the 'arguably most-likely' user intent and the pattern of any other queries that include 'must'? That would look like:

a. treat the top level match as a pre-filter
b. apply knn as the post-filter
c. remove the knn.filter pre-filter altogether (because we already provided it in a).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a. treat the top level match as a pre-filter

I understand now.

We considered that, but it gets very difficult to get correct and would have non-obvious edge cases.

What do you do in a multiple nested bool query?

What if a higher level bool has a knn query?

Do we ignore should clauses?

We may in the future consider making the filter optional and dynamically applying the correct or expected pre-filter.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great questions! For future use, I created a Gist of the situtations you mentioned and a couple others to help me reason through this. https://gist.github.com/mshameti/2b87802869d83df6b06abade142a922e

I most likely have not covered all the bases but if you do decide to revisit this it's here for future use.

Thanks again for engaging in the discussion, and as the proposed API change is already an improvement to the existing API, please consider this conversation resolved from my end!

@mayya-sharipova
Copy link
Contributor Author

@benwtrent @jdconrad This is ready for another round of review with nested support added.

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me so happy!

🎉 🎉 🎉 🎉 🎉

A couple of things we should make sure of:

  • Does this work with function_score dis_max etc.? I am think it should without issue, but we should make 100% sure.
  • How about percolate?
  • Does pinned work? (I would think so, since we just rewrite to the scoring docs on the shard...).
  • We need to add query-dsl-knn-query.asciidoc and put it under "specialized queries".

@@ -0,0 +1,212 @@
setup:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome tests ❤️

- match: { hits.hits.2._id: "3" }
- match: { hits.hits.2.fields.my_name.0: v1 }
---
"PRE_FILTER: knn query with alias filter as pre-filter":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IT JUST WORKS! I love this.

Comment on lines 209 to 212
# no hits because, regardless of num_candidates knn returns top 3 child vectors from distinct parents
# and they don't pass the post-filter
# TODO: fix it on Lucene leve so nested knn respects num_candidates
- match: {hits.total.value: 0}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed tricky. It does satisfy num_candidates, but how we are pre-filtering things can act weird with other nested queries. You are correct, it is worth revisiting in the future.

Along with allowing nested pre-filtering (filtering on child field values instead of just parent values).

public static final ParseField FIELD_FIELD = new ParseField("field");
public static final ParseField NUM_CANDS_FIELD = new ParseField("num_candidates");
public static final ParseField QUERY_VECTOR_FIELD = new ParseField("query_vector");
public static final ParseField VECTOR_SIMILARITY_FIELD = new ParseField("similarity");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mayya-sharipova I am ok having it their for now. It is simplest and keeps uniformity with top-level knn.

If we ever introduce a vector_similarity_query, we would deprecate this parameter and point folks to use it instead.

@mayya-sharipova
Copy link
Contributor Author

mayya-sharipova commented Oct 30, 2023

@benwtrent Thanks for your review so far.

A couple of things we should make sure of:
Does this work with function_score dis_max etc.? I am think it should without issue, but we should make 100% sure.
How about percolate?
Does pinned work? (I would think so, since we just rewrite to the scoring docs on the shard...).
We need to add query-dsl-knn-query.asciidoc and put it under "specialized queries".

With the latest commit, I've do the following:

  • Added more tests with other queries such as function_score, dis_max and pinned queries.
  • Modified percolate query NOT to accept knn as an internal query. What do you think? By default, knn and percolate don't work together and this requires more investigation. But more importantly, I think it does not make sense to percolate a document against knn query, as knn query matches any single document.
  • Added documentation for a new knn query and also put this query under "specialized queries".

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor things.

Comment on lines +442 to +443
} else if (queryName.equals(KnnVectorQueryBuilder.NAME)) {
throw new IllegalArgumentException("the [knn] query is unsupported inside a percolator query");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this just because its too difficult to make work?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think? By default, knn and percolate don't work together and this requires more investigation. But more importantly, I think it does not make sense to percolate a document against knn query, as knn query matches any single document.

I think this is OK for now.

I will have to think more about if we should support it or not. IMO, it seems like we should. We are just matching the nearest "k", so it seems to fit OK. But I can also see the argument against (as you laid out).

Additionally, we should have a "more_like_this" query that utilizes knn as well.

Maybe open a Github issue to track discussion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This as well, but more importantly, I think it does not make sense to percolate a document against knn query, as knn query matches any single document.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as knn query matches any single document.

I understand that semantic similarity has no natural bounds like lexical. But, using this to find semantically similar queries to a stored/new docs is powerful. Especially when you consider hybrid search, and similarity filtering.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benwtrent Let's discuss this offline (about percolator and more_like_this queries) and create github issues if we find them necessary.

I will consider this PR will go without those queries.

@mayya-sharipova
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/docs

1 similar comment
@mayya-sharipova
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/docs

@mayya-sharipova mayya-sharipova merged commit 61c7483 into elastic:main Nov 1, 2023
@mayya-sharipova mayya-sharipova deleted the knn-as-query branch November 1, 2023 18:21
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this pull request Jan 18, 2024
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this pull request Jan 18, 2024
@edwineve
Copy link

edwineve commented May 8, 2024

This as well, but more importantly, I think it does not make sense to percolate a document against knn query, as knn query matches any single document.

I use knn in combination with a "min_score", which does not match every document

@mayya-sharipova
Copy link
Contributor Author

@edwineve You can submit your question in https://discuss.elastic.co/. This PR is closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make knn search as a query
9 participants