
Speed up concurrent multi-segment HNSW graph search #12794


Closed

Conversation

mayya-sharipova
Contributor

@mayya-sharipova mayya-sharipova commented Nov 10, 2023

Speed up concurrent multi-segment HNSW graph search by exchanging
the global minimum similarity collected so far across segments. Because this global
similarity is used as the minimum threshold that candidates must pass to be considered,
segments that don't have good candidates can stop their search earlier.

Implementation details:

  • TopKnnCollector has a new argument, MaxScoreAccumulator, used to keep track of the global minimum similarity collected so far.
  • As soon as a TopKnnCollector for a segment has collected its local k results, the global similarity is used as the minimum required similarity for the kNN search.

Having a global queue is a crude approach and only the first step for concurrent search.
We don't want to stop ourselves from reaching a portion of a graph that is a good neighbourhood
of the query just because we use a bound from a graph where the search is already much further along.
Because of this, we would like to continue exploring local graphs, with some limitations,
even if the candidates don't pass the global min similarity threshold. For this reason,
we have added a second queue with a size of (1 - g) * k, where g is the greediness.

A node is then considered a candidate if its similarity with the query is larger
than the current bottom similarity of this second queue.
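
A rough sketch of that collection and threshold logic, for readers following along (simplified, not the exact patch; GlobalMinSim is a hypothetical stand-in for the shared MaxScoreAccumulator, and the cached global value is refreshed on every collect here, whereas the patch only refreshes it periodically):

```java
import org.apache.lucene.util.hnsw.NeighborQueue;

// Hypothetical stand-in for the shared MaxScoreAccumulator described above.
interface GlobalMinSim {
  void accumulate(float score); // publish a segment's k-th best similarity
  float get();                  // best published value so far, or -Infinity if none
}

// Simplified sketch of the per-segment collector (not the exact patch).
class SketchTopKnnCollector {
  private static final float DEFAULT_GREEDINESS = 0.9f;

  private final int k;
  private final NeighborQueue queue;  // local top-k results
  private final NeighborQueue queueg; // small non-competitive queue of size (1 - greediness) * k
  private final GlobalMinSim globalMinSim;
  private boolean kResultsCollected = false;
  private float cachedGlobalMinSim = Float.NEGATIVE_INFINITY;

  SketchTopKnnCollector(int k, GlobalMinSim globalMinSim) {
    this.k = k;
    this.queue = new NeighborQueue(k, false);
    int queuegSize = Math.max(1, Math.round((1 - DEFAULT_GREEDINESS) * k));
    this.queueg = new NeighborQueue(queuegSize, false);
    this.globalMinSim = globalMinSim;
  }

  boolean collect(int docId, float similarity) {
    boolean accepted = queue.insertWithOverflow(docId, similarity);
    queueg.insertWithOverflow(docId, similarity);
    if (kResultsCollected == false && queue.size() == k) {
      kResultsCollected = true;
      // publish the local k-th best score so other segments can stop earlier
      globalMinSim.accumulate(queue.topScore());
    }
    if (kResultsCollected) {
      // the patch refreshes this local copy only periodically, not on every visit
      cachedGlobalMinSim = globalMinSim.get();
    }
    return accepted;
  }

  float minCompetitiveSimilarity() {
    if (kResultsCollected == false) {
      return Float.NEGATIVE_INFINITY;
    }
    // Use the global bound, but soften it with the small local queue so a segment
    // still exploring a poor neighbourhood can keep making some progress;
    // never fall below the local k-th best similarity.
    return Math.max(queue.topScore(), Math.min(queueg.topScore(), cachedGlobalMinSim));
  }
}
```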

@benwtrent
Member

benwtrent commented Nov 10, 2023

@mayya-sharipova with those experiments, I am guessing these are over multiple segments; could you include that information in the table?

It would also be awesome to see how the QPS & recall compare to a fully merged graph.

One more thing: does recall increase as expected as "num_candidates/fanout" increases with the candidate patch?

@benwtrent
Member

@mayya-sharipova two important measurements we need to check here:

  • When comparing baseline & candidate, can the candidate get to higher recall than baseline with lower latency?
  • When using the same query parameters (fanout, k, etc.), how does candidate measure against a fully merged graph (1 segment)? I am guessing a single graph (at least when using only one search thread) will always be faster; I am just wondering by how much.

@mayya-sharipova mayya-sharipova force-pushed the multi-search-hnsw-graph branch 2 times, most recently from 533d779 to f52b5e4 on November 15, 2023, 13:59
@mayya-sharipova
Contributor Author

mayya-sharipova commented Nov 16, 2023

Experiments

  • Available processors: 10; thread pool size: 16
  • luceneutil tool

Search:

  • baseline: Lucene main branch
  • candidate1: only global queue
  • candidate2: global queue + non-competitive local queue with greediness of 0.9

1M vectors of 100 dims

k=10

| Configuration | Fanout | Avg visited nodes | QPS | Recall |
|---|---|---|---|---|
| Baseline, single segment | 90 | 980 | 2336 | 0.739 |
| Baseline, 3 segments sequential | 90 | 2627 | 1179 | 0.772 |
| Baseline, 3 segments concurrent | 90 | 2627 | 1897 | 0.772 |
| Candidate1 | 90 | 2383 | 2070 | 0.758 |
| Candidate2 | 90 | 2437 | 1831 | 0.765 |

k=100

| Configuration | Fanout | Avg visited nodes | QPS | Recall |
|---|---|---|---|---|
| Baseline, single segment | 900 | 6722 | 430 | 0.921 |
| Baseline, 3 segments sequential | 900 | 17595 | 169 | 0.949 |
| Baseline, 3 segments concurrent | 900 | 17595 | 372 | 0.949 |
| Candidate1 | 900 | 16423 | 385 | 0.947 |
| Candidate2 | 900 | 16434 | 363 | 0.947 |

10M vectors of 100 dims

k=10

| Configuration | Fanout | Avg visited nodes | QPS | Recall |
|---|---|---|---|---|
| Baseline, single segment | 90 | 1081 | 1798 | 0.634 |
| Baseline, 13 segments sequential | 90 | 11869 | 227 | 0.680 |
| Baseline, 13 segments concurrent | 90 | 11869 | 974 | 0.680 |
| Candidate1 | 90 | 7571 | 1215 | 0.553 |
| Candidate2 | 90 | 8489 | 1269 | 0.609 |

k=100

| Configuration | Fanout | Avg visited nodes | QPS | Recall |
|---|---|---|---|---|
| Baseline, single segment | 900 | 7213 | 272 | 0.824 |
| Baseline, 13 segments sequential | 900 | 78069 | 34 | 0.894 |
| Baseline, 13 segments concurrent | 900 | 78069 | 194 | 0.894 |
| Candidate1 | 900 | 49965 | 253 | 0.865 |
| Candidate2 | 900 | 48547 | 258 | 0.860 |
| Baseline, single segment | 9900 | 55168 | 36 | 0.916 |
| Baseline, 13 segments sequential | 9900 | 521851 | 4 | 0.962 |
| Baseline, 13 segments concurrent | 9900 | 521851 | 23 | 0.962 |
| Candidate1 | 9900 | 309378 | 31 | 0.953 |
| Candidate2 | 9900 | 302826 | 31 | 0.953 |

10M vectors of 768 dims

k=10

| Configuration | Fanout | Avg visited nodes | QPS | Recall |
|---|---|---|---|---|
| Baseline, single segment | 90 | 1095 | 974 | 0.542 |
| Baseline, 19 segments concurrent | 90 | 18091 | 267 | 0.541 |
| Candidate1 | 90 | 9770 | 586 | 0.427 |
| Candidate2 | 90 | 10700 | 482 | 0.475 |

k=100

| Configuration | Fanout | Avg visited nodes | QPS | Recall |
|---|---|---|---|---|
| Baseline, single segment | 900 | 7118 | 152 | 0.688 |
| Baseline, 19 segments concurrent | 900 | 120453 | 69 | 0.732 |
| Candidate1 | 900 | 65705 | 111 | 0.690 |
| Candidate2 | 900 | 65596 | 107 | 0.692 |
| Baseline, single segment | 9900 | 59308 | 19 | 0.797 |
| Baseline, 19 segments concurrent | 9900 | 818268 | 9 | 0.847 |
| Candidate1 | 9900 | 466235 | 15 | 0.830 |
| Candidate2 | 9900 | 462975 | 14 | 0.829 |

Summary:

  • For the 100 dims index:
    • using global similarity leads to a 1.1x speedup (1M docs) and a 1.3x speedup (10M docs)
    • recall is slightly lower than a single segment (for small k) or slightly higher than a single segment (for larger k).
  • For the 768 dims index:
    • using global similarity leads to 1.6-2x speedups
    • recall is slightly lower than a single segment (for small k) or slightly higher than a single segment (for larger k).
  • Sharing global similarity is more beneficial for larger indices and larger dims.
  • Having an extra non-competitive local queue doesn't seem to be useful: in half of the cases it led to increased recall, but in the other half recall was the same or even worse.

Contributor

@vigyasharma vigyasharma left a comment


This is a very exciting change!
I'm new to this part of Lucene, and a lot of my comments are to improve my own understanding of the system. Overall, I think this is a promising optimization.

@@ -26,26 +26,71 @@
* @lucene.experimental
*/
public final class TopKnnCollector extends AbstractKnnCollector {
private static final float DEFAULT_GREEDINESS = 0.9f;
Contributor


So greediness is essentially how "greedy" our algorithm for picking top matches gets to be.
At 1, we go with the global min competitive similarity. And by default, we continue to search the segment as long as we find neighbors that are better than the top 10% of our local picks from the segment ((1 - 0.9) * k).

Should we add some documentation around this param? My initial, naive impression was that higher greediness might mean we pick more nodes per segment, which is quite the opposite. :)

Contributor Author


@vigyasharma Thanks for your feedback. Indeed, more documentation is needed; I will add it after we finalize the experiments.

The general idea behind introducing a second, shorter local queue is that searches over different graphs can progress differently. We don't want to stop searching a graph if we have just started and are still in a bad neighbourhood where similarity can be worse than the globally collected results. We still want to make some progress.

As you correctly noticed, greediness is meant to express how greedy our local, segment-based search is when we are not globally competitive. A good approach could be to be greedy and not do much exploration in this case, keeping the size of the second queue small.
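
For example, with the patch's sizing of max(1, round((1 - greediness) * k)) and the default greediness of 0.9, the second queue holds just 1 entry at k=10 and 10 entries at k=100, so the extra local exploration stays small relative to the main queue.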

TaskExecutor taskExecutor = indexSearcher.getTaskExecutor();
List<LeafReaderContext> leafReaderContexts = reader.leaves();
List<Callable<TopDocs>> tasks = new ArrayList<>(leafReaderContexts.size());
for (LeafReaderContext context : leafReaderContexts) {
tasks.add(() -> searchLeaf(context, filterWeight));
tasks.add(() -> searchLeaf(context, filterWeight, globalMinSimAcc));
Contributor


Is the MaxScoreAccumulator thread safe?

Contributor


Doesn't look like it is shared across tasks in the IndexSearcher, right? For each slice, we have a single collector. And each collector gets its own MaxScoreAccumulator?

Contributor Author


MaxScoreAccumulator is thread safe; it uses a LongAccumulator, which is enough for our purposes to collect the max global similarity.
And the same instance of MaxScoreAccumulator is shared among all slices.

This approach is similar to TopScoreDocCollector.
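
As a minimal illustration (not the actual Lucene class), a LongAccumulator can track a shared maximum similarity across threads, assuming similarities are non-negative so their raw float bits preserve ordering:

```java
import java.util.concurrent.atomic.LongAccumulator;

// Sketch only: a thread-safe tracker of the best (maximum) similarity
// published by any segment so far. Assumes non-negative similarities.
final class GlobalMaxSimilaritySketch {
  private final LongAccumulator acc = new LongAccumulator(Long::max, Long.MIN_VALUE);

  void accumulate(float similarity) {
    // For non-negative floats the int bit pattern preserves ordering,
    // so Long::max over the widened bits yields the max similarity.
    acc.accumulate(Float.floatToIntBits(similarity));
  }

  float get() {
    long bits = acc.get();
    // Long.MIN_VALUE is the identity, meaning nothing has been published yet.
    return bits == Long.MIN_VALUE ? Float.NEGATIVE_INFINITY : Float.intBitsToFloat((int) bits);
  }
}
```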

this.greediness = DEFAULT_GREEDINESS;
this.queue = new NeighborQueue(k, false);
int queuegSize = Math.max(1, Math.round((1 - greediness) * k));
this.queueg = new NeighborQueue(queuegSize, false);
Contributor


This queue is still local to each leaf, and hence to each task in the executor, right? I ask because I see some references to a global queue for greediness, in the issue description.

Contributor Author


Indeed, it is local.

// periodically update the local copy of global similarity
if (reachedKResults || (visitedCount & globalMinSimAcc.modInterval) == 0) {
MaxScoreAccumulator.DocAndScore docAndScore = globalMinSimAcc.get();
cachedGlobalMinSim = docAndScore.score;
Contributor


We only peek at the global minimum similarity after we have accumulated k local results. For my understanding, what would happen if we were to start using globalMinSim as soon as any leaf has collected its top K results?

I'm curious how the graph structure and entry point calculation affect this. Looking at graphSearcher.findBestEntryPoint(...), I think we navigate to the closest node we can find to our query on layer-1 and use that as the starting point for layer-0. Now layer-0 is a lot more dense than layer-1, and our best candidates are likely in the space between the entry-point on layer-1 and all its neighbors.

Is it always a good idea to at least look at k nodes around the entry point, regardless of the global min. competitive score?

Contributor Author


Thanks for your idea; it looks like it should provide better speedups than the current approach.
We can check globalMinSim before starting the search and see if it makes a difference.

boolean result = queue.insertWithOverflow(docId, similarity);
queueg.insertWithOverflow(docId, similarity);

boolean reachedKResults = (kResultsCollected == false && queue.size() == k());
Contributor


So reachedKResults is supposed to be true only for the first time we get to k results, while kResultsCollected continues to be true even as we collect subsequent hits?

Minor: Should we try to make it more explicit in the variable name? Maybe something like firstKResultsCollected?

@vigyasharma
Contributor

We seem to consistently see an improvement in recall between single-segment and multi-segment runs (both sequential and concurrent) on baseline. Is this because with multiple segments, we get multiple entry points into the overall graph? Whereas in a single merged segment, we only have access to a sparser set of nodes in layer-1 while finding the single best entry point?

@vigyasharma
Contributor

vigyasharma commented Nov 18, 2023

What is a good mental model for what kind of graphs would see minimal loss of recall between baseline and candidate? Is this change better with denser (higher fanout) graphs? Would it depend on graph construction params like beam-width?

@mayya-sharipova
Contributor Author

@vigyasharma Answering other questions:

We seem to consistently see an improvement in recall between single-segment and multi-segment runs (both sequential and concurrent) on baseline. Is this because with multiple segments, we get multiple entry points into the overall graph? Whereas in a single merged segment, we only have access to a sparser set of nodes in layer-1 while finding the single best entry point?

Indeed, this is the correct observation. For multiple segments, we retrieve k results from each segment and then merge the k * num_of_segments results to get the best k overall. As we are retrieving more results, we also get better recall than from a single segment.
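
As a hedged sketch of that merge step (method and variable names here are illustrative, not the actual query code):

```java
import org.apache.lucene.search.TopDocs;

// Each segment contributes its local top-k; TopDocs.merge keeps the best k overall.
// Considering k * num_of_segments hits before truncation is why multi-segment
// recall can exceed single-segment recall.
static TopDocs mergeSegmentResults(int k, TopDocs[] perSegmentResults) {
  return TopDocs.merge(k, perSegmentResults);
}
```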

Add tracking of the global minimum similarity collected so far across segments.
As soon as k results are collected locally, use the global similarity as the minimum
required similarity for the kNN search.
mayya-sharipova added a commit to mayya-sharipova/lucene that referenced this pull request Dec 21, 2023
A second implementation of apache#12794 using Queue instead of
MaxScoreAccumulator.

Speed up concurrent multi-segment HNSW graph search by exchanging
the global top scores collected so far across segments. These global
top scores set the minimum threshold that candidates need to pass to be
considered. This allows earlier stopping for segments that don't have
good candidates.
@mayya-sharipova
Contributor Author

Closed in favour of #12962
