
Speed up concurrent multi-segment HNSW graph search #12794


Closed

Conversation

mayya-sharipova
Contributor

@mayya-sharipova mayya-sharipova commented Nov 10, 2023

Speed up concurrent multi-segment HNSW graph search by exchanging
the global minimum similarity collected so far across segments. Because this global
similarity is used as the minimum threshold that candidates must pass to be considered,
segments that don't have good candidates can stop their search earlier.

Implementation details:

  • TopKnnCollector has a new argument, MaxScoreAccumulator, used to keep track of the global minimum similarity collected so far.
  • As soon as a TopKnnCollector for a segment has collected its local k results, the global similarity is used as the minimum required similarity for the kNN search.

Having a global queue is a crude approach and only the first step for concurrent search.
We don't want to stop ourselves from reaching a portion of a graph that is a good neighbourhood
of the query just because we use a bound from a graph where the search is already much further along.
Because of this, we would like to continue exploring local graphs, with some limitations,
even if the candidates don't pass the global min similarity threshold. For this reason,
we have added a second queue with a size of (1 - g) * k, where g is the greediness.

A node is then considered a candidate if its similarity with the query is larger
than the current bottom similarity of this second queue.
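
A rough sketch of that collection and threshold logic, for readers following along (simplified, not the exact patch; GlobalMinSim is a hypothetical stand-in for the shared MaxScoreAccumulator, and the cached global value is refreshed on every collect here, whereas the patch only refreshes it periodically):

```java
import org.apache.lucene.util.hnsw.NeighborQueue;

// Hypothetical stand-in for the shared MaxScoreAccumulator described above.
interface GlobalMinSim {
  void accumulate(float score); // publish a segment's k-th best similarity
  float get();                  // best published value so far, or -Infinity if none
}

// Simplified sketch of the per-segment collector (not the exact patch).
class SketchTopKnnCollector {
  private static final float DEFAULT_GREEDINESS = 0.9f;

  private final int k;
  private final NeighborQueue queue;  // local top-k results
  private final NeighborQueue queueg; // small non-competitive queue of size (1 - greediness) * k
  private final GlobalMinSim globalMinSim;
  private boolean kResultsCollected = false;
  private float cachedGlobalMinSim = Float.NEGATIVE_INFINITY;

  SketchTopKnnCollector(int k, GlobalMinSim globalMinSim) {
    this.k = k;
    this.queue = new NeighborQueue(k, false);
    int queuegSize = Math.max(1, Math.round((1 - DEFAULT_GREEDINESS) * k));
    this.queueg = new NeighborQueue(queuegSize, false);
    this.globalMinSim = globalMinSim;
  }

  boolean collect(int docId, float similarity) {
    boolean accepted = queue.insertWithOverflow(docId, similarity);
    queueg.insertWithOverflow(docId, similarity);
    if (kResultsCollected == false && queue.size() == k) {
      kResultsCollected = true;
      // publish the local k-th best score so other segments can stop earlier
      globalMinSim.accumulate(queue.topScore());
    }
    if (kResultsCollected) {
      // the patch refreshes this local copy only periodically, not on every visit
      cachedGlobalMinSim = globalMinSim.get();
    }
    return accepted;
  }

  float minCompetitiveSimilarity() {
    if (kResultsCollected == false) {
      return Float.NEGATIVE_INFINITY;
    }
    // Use the global bound, but soften it with the small local queue so a segment
    // still exploring a poor neighbourhood can keep making some progress;
    // never fall below the local k-th best similarity.
    return Math.max(queue.topScore(), Math.min(queueg.topScore(), cachedGlobalMinSim));
  }
}
```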

@benwtrent
Member

benwtrent commented Nov 10, 2023

@mayya-sharipova with those experiments, I am guessing these are over multiple segments; could you include that information in the table?

It would also be awesome to see how the QPS & recall compare to a fully merged graph.

One more thing: does recall increase as expected as "num_candidates/fanout" increases with the candidate patch?

@benwtrent
Member

@mayya-sharipova two important measurements we need to check here:

  • When comparing baseline & candidate, can the candidate get to higher recall than baseline with lower latency?
  • When using the same query parameters (fanout, k, etc.), how does candidate measure against a fully merged graph (1 segment)? I am guessing a single graph (at least when using only one search thread) will always be faster; I am just wondering by how much.

@mayya-sharipova mayya-sharipova force-pushed the multi-search-hnsw-graph branch 2 times, most recently from 533d779 to f52b5e4 on November 15, 2023, 13:59
@mayya-sharipova
Contributor Author

mayya-sharipova commented Nov 16, 2023

Experiments

  • Available processors: 10; thread pool size: 16
  • luceneutil tool

Search:

  • baseline: Lucene main branch
  • candidate1: only global queue
  • candidate2: global queue + non-competitive local queue with greediness of 0.9

1M vectors of 100 dims

k=10

| Configuration | Fanout | Avg visited nodes | QPS | Recall |
|---|---|---|---|---|
| Baseline, single segment | 90 | 980 | 2336 | 0.739 |
| Baseline, 3 segments sequential | 90 | 2627 | 1179 | 0.772 |
| Baseline, 3 segments concurrent | 90 | 2627 | 1897 | 0.772 |
| Candidate1 | 90 | 2383 | 2070 | 0.758 |
| Candidate2 | 90 | 2437 | 1831 | 0.765 |

k=100

| Configuration | Fanout | Avg visited nodes | QPS | Recall |
|---|---|---|---|---|
| Baseline, single segment | 900 | 6722 | 430 | 0.921 |
| Baseline, 3 segments sequential | 900 | 17595 | 169 | 0.949 |
| Baseline, 3 segments concurrent | 900 | 17595 | 372 | 0.949 |
| Candidate1 | 900 | 16423 | 385 | 0.947 |
| Candidate2 | 900 | 16434 | 363 | 0.947 |

10M vectors of 100 dims

k=10

| Configuration | Fanout | Avg visited nodes | QPS | Recall |
|---|---|---|---|---|
| Baseline, single segment | 90 | 1081 | 1798 | 0.634 |
| Baseline, 13 segments sequential | 90 | 11869 | 227 | 0.680 |
| Baseline, 13 segments concurrent | 90 | 11869 | 974 | 0.680 |
| Candidate1 | 90 | 7571 | 1215 | 0.553 |
| Candidate2 | 90 | 8489 | 1269 | 0.609 |

k=100

| Configuration | Fanout | Avg visited nodes | QPS | Recall |
|---|---|---|---|---|
| Baseline, single segment | 900 | 7213 | 272 | 0.824 |
| Baseline, 13 segments sequential | 900 | 78069 | 34 | 0.894 |
| Baseline, 13 segments concurrent | 900 | 78069 | 194 | 0.894 |
| Candidate1 | 900 | 49965 | 253 | 0.865 |
| Candidate2 | 900 | 48547 | 258 | 0.860 |
| Baseline, single segment | 9900 | 55168 | 36 | 0.916 |
| Baseline, 13 segments sequential | 9900 | 521851 | 4 | 0.962 |
| Baseline, 13 segments concurrent | 9900 | 521851 | 23 | 0.962 |
| Candidate1 | 9900 | 309378 | 31 | 0.953 |
| Candidate2 | 9900 | 302826 | 31 | 0.953 |

10M vectors of 768 dims

k=10

| Configuration | Fanout | Avg visited nodes | QPS | Recall |
|---|---|---|---|---|
| Baseline, single segment | 90 | 1095 | 974 | 0.542 |
| Baseline, 19 segments concurrent | 90 | 18091 | 267 | 0.541 |
| Candidate1 | 90 | 9770 | 586 | 0.427 |
| Candidate2 | 90 | 10700 | 482 | 0.475 |

k=100

| Configuration | Fanout | Avg visited nodes | QPS | Recall |
|---|---|---|---|---|
| Baseline, single segment | 900 | 7118 | 152 | 0.688 |
| Baseline, 19 segments concurrent | 900 | 120453 | 69 | 0.732 |
| Candidate1 | 900 | 65705 | 111 | 0.690 |
| Candidate2 | 900 | 65596 | 107 | 0.692 |
| Baseline, single segment | 9900 | 59308 | 19 | 0.797 |
| Baseline, 19 segments concurrent | 9900 | 818268 | 9 | 0.847 |
| Candidate1 | 9900 | 466235 | 15 | 0.830 |
| Candidate2 | 9900 | 462975 | 14 | 0.829 |

Summary:

  • For the 100 dims index:
    • using global similarity leads to a 1.1x speedup (1M docs) and a 1.3x speedup (10M docs)
    • recall is slightly lower than a single segment (for small k) or slightly higher than a single segment (for larger k).
  • For the 768 dims index:
    • using global similarity leads to 1.6-2x speedups
    • recall is slightly lower than a single segment (for small k) or slightly higher than a single segment (for larger k).
  • Sharing global similarity is more beneficial for larger indices and larger dims.
  • Having an extra non-competitive local queue doesn't seem to be useful: in half of the cases it led to increased recall, but in the other half recall was the same or even worse.

Contributor

@vigyasharma vigyasharma left a comment


This is a very exciting change!
I'm new to this part of Lucene, and a lot of my comments are to improve my own understanding of the system. Overall, I think this is a promising optimization.

@@ -26,26 +26,71 @@
* @lucene.experimental
*/
public final class TopKnnCollector extends AbstractKnnCollector {
private static final float DEFAULT_GREEDINESS = 0.9f;
Contributor


So greediness is essentially how "greedy" our algorithm for picking top matches gets to be.
At 1, we go with the global min competitive similarity. And by default, we continue to search the segment as long as we find neighbors that are better than the top 10% of our local picks from the segment ((1 - 0.9) * k).

Should we add some documentation around this param? My initial, naive impression was that higher greediness might mean we pick more nodes per segment, which is quite the opposite. :)

Contributor Author


@vigyasharma Thanks for your feedback. Indeed, more documentation is needed; I will add it after we finalize the experiments.

The general idea behind introducing a second, shorter local queue is that searches over different graphs can progress differently. We don't want to stop searching a graph if we have just started and are still in a bad neighbourhood where similarity can be worse than the globally collected results. We still want to make some progress.

As you correctly noticed, greediness is meant to express how greedy our local, segment-based search is when we are not globally competitive. A good approach could be to be greedy and not do much exploration in this case, keeping the size of the second queue small.
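
For example, with the patch's sizing of max(1, round((1 - greediness) * k)) and the default greediness of 0.9, the second queue holds just 1 entry at k=10 and 10 entries at k=100, so the extra local exploration stays small relative to the main queue.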

TaskExecutor taskExecutor = indexSearcher.getTaskExecutor();
List<LeafReaderContext> leafReaderContexts = reader.leaves();
List<Callable<TopDocs>> tasks = new ArrayList<>(leafReaderContexts.size());
for (LeafReaderContext context : leafReaderContexts) {
tasks.add(() -> searchLeaf(context, filterWeight));
tasks.add(() -> searchLeaf(context, filterWeight, globalMinSimAcc));
Contributor


Is the MaxScoreAccumulator thread safe?

Contributor


Doesn't look like it is shared across tasks in the IndexSearcher, right? For each slice, we have a single collector. And each collector gets its own MaxScoreAccumulator?

Contributor Author


MaxScoreAccumulator is thread safe; it uses a LongAccumulator, which is enough for our purposes to collect the max global similarity.
And the same instance of MaxScoreAccumulator is shared among all slices.

This approach is similar to TopScoreDocCollector.
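
As a minimal illustration (not the actual Lucene class), a LongAccumulator can track a shared maximum similarity across threads, assuming similarities are non-negative so their raw float bits preserve ordering:

```java
import java.util.concurrent.atomic.LongAccumulator;

// Sketch only: a thread-safe tracker of the best (maximum) similarity
// published by any segment so far. Assumes non-negative similarities.
final class GlobalMaxSimilaritySketch {
  private final LongAccumulator acc = new LongAccumulator(Long::max, Long.MIN_VALUE);

  void accumulate(float similarity) {
    // For non-negative floats the int bit pattern preserves ordering,
    // so Long::max over the widened bits yields the max similarity.
    acc.accumulate(Float.floatToIntBits(similarity));
  }

  float get() {
    long bits = acc.get();
    // Long.MIN_VALUE is the identity, meaning nothing has been published yet.
    return bits == Long.MIN_VALUE ? Float.NEGATIVE_INFINITY : Float.intBitsToFloat((int) bits);
  }
}
```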

this.greediness = DEFAULT_GREEDINESS;
this.queue = new NeighborQueue(k, false);
int queuegSize = Math.max(1, Math.round((1 - greediness) * k));
this.queueg = new NeighborQueue(queuegSize, false);
Contributor


This queue is still local to each leaf, and hence to each task in the executor, right? I ask because I see some references to a global queue for greediness, in the issue description.

Contributor Author


Indeed, it is local.

// periodically update the local copy of global similarity
if (reachedKResults || (visitedCount & globalMinSimAcc.modInterval) == 0) {
MaxScoreAccumulator.DocAndScore docAndScore = globalMinSimAcc.get();
cachedGlobalMinSim = docAndScore.score;
Contributor


We only peek at the global minimum similarity after we have accumulated k local results. For my understanding, what would happen if we were to start using globalMinSim as soon as any leaf has collected its top K results?

I'm curious how the graph structure and entry point calculation affect this. Looking at graphSearcher.findBestEntryPoint(...), I think we navigate to the closest node we can find to our query on layer-1 and use that as the starting point for layer-0. Now layer-0 is a lot more dense than layer-1, and our best candidates are likely in the space between the entry-point on layer-1 and all its neighbors.

Is it always a good idea to at least look at k nodes around the entry point, regardless of the global min. competitive score?

Contributor Author


Thanks for your idea; it looks like it should provide better speedups than the current approach.
We can check globalMinSim before starting the search and see if it makes a difference.

boolean result = queue.insertWithOverflow(docId, similarity);
queueg.insertWithOverflow(docId, similarity);

boolean reachedKResults = (kResultsCollected == false && queue.size() == k());
Contributor


So reachedKResults is supposed to be true only for the first time we get to k results, while kResultsCollected continues to be true even as we collect subsequent hits?

Minor: Should we try to make it more explicit in the variable name? Maybe something like firstKResultsCollected?

@vigyasharma
Contributor

We seem to consistently see an improvement in recall between single-segment and multi-segment runs (both sequential and concurrent) on baseline. Is this because with multiple segments, we get multiple entry points into the overall graph? Whereas in a single merged segment, we only have access to a sparser set of nodes in layer-1 while finding the single best entry point?

@vigyasharma
Contributor

vigyasharma commented Nov 18, 2023

What is a good mental model for what kind of graphs would see minimal loss of recall between baseline and candidate? Is this change better with denser (higher fanout) graphs? Would it depend on graph construction params like beam-width?

@mayya-sharipova
Contributor Author

@vigyasharma Answering other questions:

We seem to consistently see an improvement in recall between single-segment and multi-segment runs (both sequential and concurrent) on baseline. Is this because with multiple segments, we get multiple entry points into the overall graph? Whereas in a single merged segment, we only have access to a sparser set of nodes in layer-1 while finding the single best entry point?

Indeed, this is the correct observation. For multiple segments, we retrieve k results from each segment and then merge the k * num_of_segments results to get the best k overall. As we are retrieving more results, we also get better recall than from a single segment.
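
As a hedged sketch of that merge step (method and variable names here are illustrative, not the actual query code):

```java
import org.apache.lucene.search.TopDocs;

// Each segment contributes its local top-k; TopDocs.merge keeps the best k overall.
// Considering k * num_of_segments hits before truncation is why multi-segment
// recall can exceed single-segment recall.
static TopDocs mergeSegmentResults(int k, TopDocs[] perSegmentResults) {
  return TopDocs.merge(k, perSegmentResults);
}
```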

Add tracking of the global minimum similarity collected so far across segments.
As soon as k results are collected locally, use the global similarity as the minimum
required similarity for the kNN search.
mayya-sharipova added a commit to mayya-sharipova/lucene that referenced this pull request Dec 21, 2023
A second implementation of apache#12794 using Queue instead of
MaxScoreAccumulator.

Speed up concurrent multi-segment HNSW graph search by exchanging
the global top scores collected so far across segments. These global
top scores set the minimum threshold that candidates need to pass to be
considered. This allows earlier stopping for segments that don't have
good candidates.
@mayya-sharipova
Contributor Author

Closed in favour of #12962
