Speedup concurrent multi-segment HNSW graph search #12794
Conversation
@mayya-sharipova With those experiments, I am guessing these are over multiple segments; could you include that information in the table? It would also be awesome to see how the QPS & recall compare to a fully merged graph. One more thing, does recall increase as expected as "num_candidates/fanout" increases with the candidate patch?
@mayya-sharipova two important measurements we need to check here:
Force-pushed from 533d779 to f52b5e4
Experiments

Search setups (k=10 and k=100 for each):
- 1M vectors of 100 dims
- 10M vectors of 100 dims
- 10M vectors of 768 dims

Summary:
This is a very exciting change!
I'm new to this part of Lucene, and a lot of my comments are to improve my own understanding of the system. Overall, I think this is a promising optimization.
@@ -26,26 +26,71 @@
 * @lucene.experimental
 */
public final class TopKnnCollector extends AbstractKnnCollector {
  private static final float DEFAULT_GREEDINESS = 0.9f;
So greediness is essentially how "greedy" our algorithm for picking top matches gets to be. At 1, we go with the global min competitive similarity. And by default, we continue to search the segment as long as we find neighbors that are better than the top 10% of our local picks from the segment ((1 - 0.9) * k).
Should we add some documentation around this param? My initial, naive impression was that higher greediness might mean we pick more nodes per segment, which is quite the opposite. :)
@vigyasharma Thanks for your feedback. Indeed, more documentation is needed; I will add it after we finalize the experiments.
The general idea behind introducing a second, shorter local queue is that searches of different graphs can progress differently. We don't want to stop searching a graph if we are just starting and still in a bad neighbourhood where similarity can be worse than the globally collected results. We still want to make some progress.
As you correctly noticed, greediness is meant to express how greedy our local segment-based search is when we are not competitive globally. A good approach could be to be greedy and not do much exploration in this case, keeping the size of the second queue small.
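For a concrete sense of the sizes involved, here is a small illustration (the k values are examples chosen for this demo, not benchmark settings from the PR) that mirrors the secondary-queue sizing quoted further down in this review:

```java
// Illustrative sketch of the secondary ("greediness") queue sizing.
public class GreedinessQueueSizeDemo {
  public static void main(String[] args) {
    float greediness = 0.9f; // DEFAULT_GREEDINESS in the patch
    for (int k : new int[] {10, 100}) {
      // at least 1 entry, otherwise roughly 10% of k when greediness is 0.9
      int queuegSize = Math.max(1, Math.round((1 - greediness) * k));
      System.out.println("k=" + k + " -> secondary queue size " + queuegSize); // 1 and 10
    }
  }
}
```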
  TaskExecutor taskExecutor = indexSearcher.getTaskExecutor();
  List<LeafReaderContext> leafReaderContexts = reader.leaves();
  List<Callable<TopDocs>> tasks = new ArrayList<>(leafReaderContexts.size());
  for (LeafReaderContext context : leafReaderContexts) {
-   tasks.add(() -> searchLeaf(context, filterWeight));
+   tasks.add(() -> searchLeaf(context, filterWeight, globalMinSimAcc));
Is the MaxScoreAccumulator thread safe?
Doesn't look like it is shared across tasks in the IndexSearcher, right? For each slice, we have a single collector. And each collector gets its own MaxScoreAccumulator?
MaxScoreAccumulator is thread safe; it uses a LongAccumulator, which is enough for our purposes to collect the max global similarity. And the same instance of MaxScoreAccumulator is shared among all slices.
This approach is similar to TopScoreDocCollector.
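For intuition, here is a minimal, self-contained sketch (not the PR's code) of how a LongAccumulator can track a maximum score across threads without locks. It assumes non-negative scores so that Float.floatToIntBits preserves ordering; Lucene's MaxScoreAccumulator also packs extra state into the long, which is omitted here:

```java
import java.util.concurrent.atomic.LongAccumulator;
import java.util.stream.IntStream;

// Several threads publish scores; the accumulator keeps the maximum seen so far.
public class GlobalMaxScoreDemo {
  public static void main(String[] args) {
    LongAccumulator globalMax = new LongAccumulator(Long::max, Long.MIN_VALUE);

    // Simulate per-segment searches running in parallel, each reporting its local best score.
    float[] localBestScores = {0.42f, 0.91f, 0.73f};
    IntStream.range(0, localBestScores.length)
        .parallel()
        .forEach(i -> globalMax.accumulate(Float.floatToIntBits(localBestScores[i])));

    float best = Float.intBitsToFloat((int) globalMax.get());
    System.out.println("global max score = " + best); // 0.91
  }
}
```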
  this.greediness = DEFAULT_GREEDINESS;
  this.queue = new NeighborQueue(k, false);
  int queuegSize = Math.max(1, Math.round((1 - greediness) * k));
  this.queueg = new NeighborQueue(queuegSize, false);
This queue is still local to each leaf, and hence to each task in the executor, right? I ask because I see some references to a global queue for greediness, in the issue description.
Indeed, it is local.
  // periodically update the local copy of global similarity
  if (reachedKResults || (visitedCount & globalMinSimAcc.modInterval) == 0) {
    MaxScoreAccumulator.DocAndScore docAndScore = globalMinSimAcc.get();
    cachedGlobalMinSim = docAndScore.score;
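A side note on the `(visitedCount & globalMinSimAcc.modInterval) == 0` check in the snippet above: assuming modInterval is a mask of the form 2^n - 1 (a common choice; the quoted snippet doesn't show its actual value), the bitwise AND is a cheap substitute for a modulo, so the shared accumulator is only read every 2^n visited nodes. A tiny demo with an assumed interval of 16:

```java
// Hedged illustration of the mask trick; the interval value is assumed for this example only.
public class ModIntervalDemo {
  public static void main(String[] args) {
    int modInterval = 16 - 1; // mask for "every 16th visited node"
    for (int visitedCount = 1; visitedCount <= 48; visitedCount++) {
      // true exactly when visitedCount is a multiple of 16
      if ((visitedCount & modInterval) == 0) {
        System.out.println("read the global accumulator at visitedCount=" + visitedCount); // 16, 32, 48
      }
    }
  }
}
```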
We only peek at the global minimum similarity after we have accumulated k local results. For my understanding, what would happen if we were to start using globalMinSim as soon as any leaf has collected its top k results?
I'm curious how the graph structure and entry point calculation affects this. Looking at graphSearcher.findBestEntryPoint(...), I think we navigate to the closest node we can find to our query on layer-1 and use that as the starting point for layer-0. Now layer-0 is a lot more dense than layer-1, and our best candidates are likely in the space between the entry point on layer-1 and all its neighbors.
Is it always a good idea to at least look at k nodes around the entry point, regardless of the global min competitive score?
Thanks for your idea; it looks like it should provide better speedups than the current approach.
We can check globalMinSim before starting the search and see if it makes a difference.
  boolean result = queue.insertWithOverflow(docId, similarity);
  queueg.insertWithOverflow(docId, similarity);

  boolean reachedKResults = (kResultsCollected == false && queue.size() == k());
So reachedKResults is supposed to be true only for the first time we get to k results, while kResultsCollected continues to be true even as we collect subsequent hits?
Minor: Should we try to make it more explicit in the variable name? Maybe something like firstKResultsCollected?
We seem to consistently see an improvement in recall between single-segment and multi-segment runs (both seq. and conc.) on baseline. Is this because with multiple segments, we get multiple entry points into the overall graph? Whereas in a single merged segment, we only have access to a sparser set of nodes in layer-1 while finding the single best entry point?
What is a good mental model for what kind of graphs would see minimal loss of recall between baseline and candidate? Is this change better with denser (higher fanout) graphs? Would it depend on graph construction params like …
@vigyasharma Answering other questions:
Indeed, this is the correct observation. For multiple segments, we retrieve k results from each segment, and then merge k * num_of_segments results to get the best …
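A rough sketch of that merge step (my own illustration, not code from the PR; the scores and doc ids are made up): each segment contributes its local top k, and the global top k is selected from the combined pool with a bounded min-heap:

```java
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative merge of per-segment top-k results into a global top-k.
public class MergeSegmentResultsDemo {
  record Hit(int doc, float score) {}

  static List<Hit> mergeTopK(List<List<Hit>> perSegmentTopK, int k) {
    // min-heap of size k over the combined k * num_of_segments candidates
    PriorityQueue<Hit> heap = new PriorityQueue<>(Comparator.comparingDouble(Hit::score));
    for (List<Hit> segmentHits : perSegmentTopK) {
      for (Hit hit : segmentHits) {
        heap.offer(hit);
        if (heap.size() > k) {
          heap.poll(); // drop the current worst hit
        }
      }
    }
    // return best hits, highest score first
    return heap.stream().sorted(Comparator.comparingDouble(Hit::score).reversed()).toList();
  }

  public static void main(String[] args) {
    List<List<Hit>> perSegment =
        List.of(
            List.of(new Hit(1, 0.9f), new Hit(2, 0.5f)),
            List.of(new Hit(7, 0.8f), new Hit(9, 0.6f)));
    System.out.println(mergeTopK(perSegment, 2)); // [Hit[doc=1, score=0.9], Hit[doc=7, score=0.8]]
  }
}
```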
Force-pushed from c8af48a to 0adfaac
Add tracking of the global minimum similarity collected so far across segments. As soon as k results are collected locally, use the global similarity as the minimum required similarity for knn search.
Force-pushed from 0adfaac to 98a921f
Force-pushed from 98a921f to f78f60e
A second implementation of apache#12794 using a Queue instead of MaxScoreAccumulator. Speedup concurrent multi-segment HNSW graph search by exchanging the global top scores collected so far across segments. These global top scores set the minimum threshold that candidates need to pass to be considered. This allows earlier stopping for segments that don't have good candidates.
Closed in favour of #12962
Speedup concurrent multi-segment HNSW graph search by exchanging
the global minimum similarity collected so far across segments. As the global
similarity is used as a minimum threshold that candidates need to pass to be considered,
this allows earlier stopping for segments that don't have good candidates.
Implementation details:
- TopKnnCollector has a new argument, MaxScoreAccumulator, used to keep track of the global minimum similarity collected so far.
- As soon as the TopKnnCollector for a segment has collected its local k results, the global similarity is used as the minimum required similarity for knn search.

Having a global queue is a crude approach and the 1st step for concurrent search.
We don’t want to stop ourselves from reaching a portion of a graph that is a good neighbourhood
of the query just because we use a bound from a graph where the search is much further along.
Because of this, we would like to continue exploring local graphs, with some limitations,
even if the candidates don't pass the global min similarity threshold. For this reason,
we have added a second queue with a size of (1 - g) * k, where g is the greediness.
A node will be considered a candidate if its similarity with the query is larger
than the current bottom similarity of the 2nd queue.
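Putting the two bounds together, the effective per-segment threshold might look roughly like the sketch below. This is my paraphrase of the description above, not the PR's actual code, and all names are placeholders: the collector never asks for less than its own top-k bound, but it relaxes the global bound to the bottom of the small secondary queue, so a segment that is still in a promising neighbourhood can keep exploring.

```java
// Hypothetical sketch of combining the local top-k bound, the secondary ("greediness")
// queue bottom, and the cached global minimum similarity; names are illustrative.
public class SegmentThresholdSketch {
  static float minCompetitiveSimilarity(
      float localTopKBottom,      // worst similarity currently in the local top-k queue
      float secondaryBottom,      // worst similarity in the small (1 - g) * k queue
      float cachedGlobalMinSim,   // latest value read from the shared accumulator
      boolean kResultsCollected) {
    if (kResultsCollected == false) {
      return Float.NEGATIVE_INFINITY; // keep exploring until k local results exist
    }
    // Never below the local top-k bound; above that, a candidate only needs to beat
    // the smaller of the global bound and the secondary queue's bottom similarity.
    return Math.max(localTopKBottom, Math.min(secondaryBottom, cachedGlobalMinSim));
  }

  public static void main(String[] args) {
    // Example: the global bound is already high (0.8), but this segment's secondary queue
    // bottom is 0.55, so locally anything above max(0.5, min(0.55, 0.8)) = 0.55 is still accepted.
    System.out.println(minCompetitiveSimilarity(0.5f, 0.55f, 0.8f, true)); // 0.55
  }
}
```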