Make k and num_candidates optional in the knn section #97533
Comments
Pinging @elastic/es-search (Team:Search)
Another way to streamline this could involve substituting
We discussed the results & the conclusions with @mayya-sharipova and agreed to verify a few more things before reaching a final conclusion on the default value for `num_candidates`.
Could you put these latencies and recalls in a Pareto graph (probably each k deserves its own graph per data set)? This way we can easily see the optimal point where recall starts to level off as latency increases.
Now that we have also introduced knn as a query, I was wondering what we want the target behavior to be for kNN searches as far as param validation goes, as there are different scenarios and combinations of how parameters may be set. As discussed, the defaults could be set as:
Now on to some examples (many are derived from some really nice tests that @mayya-sharipova added):

Case 1

Case 2

In this scenario, what should we expect the result to be? Should we assume the default size of 10 and act similarly to Case 1, or

Case 3

For knn-queries we do not specify

Case 4

Similarly to Case 2

Case 5
I would think that for knn-search, validation is much more straightforward and intuitive. For knn-query, while it would be nice to have similar restrictions and validations to knn-search, that could also very easily end up being a bit confusing the more complex the query gets. Maybe we could "auto-correct" the param with a warning instead of throwing? :thinking_face: WDYT?
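For concreteness, here is a sketch of the two request shapes the cases above are comparing: a top-level kNN search with both parameters set, and one where `k` is omitted (one of the ambiguous combinations under discussion, not a documented guarantee). Index/field names and the query vector are placeholders.

```json
{
  "knn": {
    "field": "my_vector",
    "query_vector": [1.0, 2.5, -0.3],
    "k": 10,
    "num_candidates": 100
  }
}
```

And with `k` omitted, where the question is whether the request should fall back to `size`:

```json
{
  "size": 10,
  "knn": {
    "field": "my_vector",
    "query_vector": [1.0, 2.5, -0.3],
    "num_candidates": 100
  }
}
```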
Preamble: kNN requires running in the DFS phase, where we at most return

This is very different from the

I think only returning

For consistent behavior, we should enforce
No, we don't want to throw. Note, there is no
Again, we don't want to throw, this is fine. We gather
Again, this is fine.
To me this is the opposite of nice. Knn as a query is meant to be more expert and less restrictive.
I see your point. By "nice" I mean that it would be nice to be consistent. E.g. one might assume that the following two would yield the same results, but that might not be the case:
So, if someone is met with an exception in the first case, instead of adjusting the exploration params, they could just as easily "nest" the search under
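The kind of pair being described might look like the following (a sketch with placeholder names and vectors): the same parameters expressed once through the top-level `knn` section and once as a `knn` query nested under `query`.

```json
{
  "knn": {
    "field": "my_vector",
    "query_vector": [1.0, 2.5, -0.3],
    "k": 10,
    "num_candidates": 100
  }
}
```

```json
{
  "size": 10,
  "query": {
    "knn": {
      "field": "my_vector",
      "query_vector": [1.0, 2.5, -0.3],
      "num_candidates": 100
    }
  }
}
```

As noted later in the thread, the query form gathers `num_candidates` per shard and collects `size` per shard, so on a multi-shard index the two requests can return different results.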
The query shouldn't throw. The top-level
Fair enough - given that the existing behavior was to throw, I was under the impression that we wanted to keep this as an API design. So if we omit
I would like @mayya-sharipova's input as well. But that is my current thought process.
This would only be for the knn query. The top-level kNN clause should return a consistent number of values no matter the number of shards.
@pmpailis @benwtrent Thanks for the discussion so far.

@pmpailis If

If
I am ok with options 1 or 2. But since we already throw an exception for the top level

@pmpailis @benwtrent What do you think?
I agree with @benwtrent's above statement that the
Intuitively |
I don't think we should "auto-correct", as the user is specifically requesting a lower num_candidates without supplying a

I think the only valid options are:
To move this forward, I am fine with throwing an error for now. It's always easier to be more lenient in the future if users consider this to be frustrating :)
Thanks @mayya-sharipova, @jpountz & @benwtrent! Personally, in the broader view of things I agree with @mayya-sharipova's conclusion that we should either throw or autocorrect - not just let the query proceed, as this could lead to weird & potentially inconsistent behavior for the user, as in the example here. I would also consider autocorrect, but as @benwtrent mentioned, it'll always be easier to be more lenient in the future than to start throwing exceptions for previously working functionality (if for example we discover later that for some reason this configuration isn't working as expected), so I'd opt for throwing in all cases (both knn search & query). Would potentially autocorrecting

So, if everyone's ok with this, I'd suggest proceeding with throwing an exception for now for the following cases (making sure that the exception message is clear and easy to understand):

knn-search

knn-query
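As an illustration of the knn-search side, this is the kind of combination that would be rejected: `num_candidates` lower than `k` (placeholder names and vector; the exact exception message is not specified here).

```json
{
  "knn": {
    "field": "my_vector",
    "query_vector": [1.0, 2.5, -0.3],
    "k": 50,
    "num_candidates": 10
  }
}
```

Since each shard can only return `k` results out of the `num_candidates` it explores, exploring fewer candidates than `k` cannot satisfy the request, which is why throwing a clear error is the suggested behavior.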
No, knn search only. I stand firm that we shouldn't throw on the query.
Ok, fair enough. If we don't autocorrect either though and just proceed with what the user specified, then as mentioned the results will vary depending on the number of shards. Which in turn could be a bit confusing, as the same query might return different results on different setups (e.g.
The query already has this problem IMO. The number of hits and recall is affected directly by the number of shards (we gather num_candidates per shard and collect size per shard). It's considered "expert" for a reason.
TBH I believe that adoption will be very wide - despite it being labeled as "expert use". And if people already have pipelines for generating
@pmpailis It looks like we have a way forward for the top level knn search that we all agree on. Let's proceed with it, and leave
Ok - will set up a PR covering the knn-search part and leave validation for knn-query as-is (except the already existing max value limit).
Ran some more experiments using Rally with the following 5 tracks/datasets:
Furthermore, for the

For each of the above datasets we explored

Attaching a pdf with the full results & will post recall/latency graphs for each dataset as a separate comment for better readability - following with an interpretation of the results & the suggestion on how to move forward.
Thanks for sharing @pmpailis. One aspect to keep in mind is that the number of candidates should also be a factor of the total number of indexed vectors as well as the number of segments. The previously shared results indicate that some of the tested queries need to visit a fairly significant percentage of the total documents. So rather than the latency, I think it would be easier to look at the percentage of documents visited.
Thanks for taking a look @jimczi! TBH we also discussed briefly with @benwtrent an alternative to make

I've also updated the comment above, as we also ran a couple of tests (with the
++, there's also apache/lucene#12794 that should help to reason about
Took some time as I had to work around some issues with OOM exceptions on local runs after upgrading to Lucene 9.9.0 (and due to the holiday period!), but I'm finally attaching the results for the runs described in the comment above. The recall/latency graphs above are also slightly different for the new runs, so I have updated them in-place. Based on the results, and as discussed with @benwtrent and @mayya-sharipova, there are two main suggestions on how to proceed with setting the default value for `num_candidates`:
IMHO both options provide good enough defaults, with some tradeoffs on potential applications/latencies. Looking forward to hearing your thoughts!

Note: the results posted in a previous comment for multi-segment
@pmpailis thanks a bunch for digging into all this. Given the results, I still think
Don't have a strong preference on this, so I'll gladly move on with the suggested
Description

`k` and `num_candidates` are required parameters of the knn section. Could they have sensible defaults instead? E.g. could we default `k` to `size`, and `num_candidates` to a value that would yield a sensible latency/recall trade-off depending on `k`?
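Under that proposal, something like the following could become a valid request, with `k` falling back to `size` and `num_candidates` derived from `k`. This is a sketch of the proposed ergonomics, not current syntax; index/field names and the vector are placeholders.

```json
{
  "size": 20,
  "knn": {
    "field": "my_vector",
    "query_vector": [1.0, 2.5, -0.3]
  }
}
```

Here `k` would be taken as 20 (the request `size`, or 10 when `size` is also omitted), and `num_candidates` would be set to whatever function of `k` the experiments in this thread converge on.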