Adding support for hex-encoded byte vectors on knn-search #105393
Conversation
Pinging @elastic/es-search (Team:Search)
Hi @pmpailis, I've created a changelog YAML for you.
    return vector;
}

private static float[] parseQueryVectorArray(XContentParser parser) throws IOException {
This code is duplicated, maybe we could extract it to a common class?
++ Did some refactoring in DenseVectorFieldMapper to avoid code duplication but could very well do the same here. Will update :)
dotProduct += value * value;
index++;
XContentParser.Token token = context.parser().currentToken();
if (token == XContentParser.Token.START_ARRAY) {
Nit: Maybe we could use a switch expression here
Addressed in 3833517
...tTest/resources/rest-api-spec/test/search.vectors/175_knn_query_hex_encoded_byte_vectors.yml
LGTM, thanks Panos!
Good progress! High-level concerns:
- I wonder if we should ever allow hex-encoded strings to query float-encoded vectors. I get this may help with backwards compatibility.
- We shouldn't allow single-value numbers to be parsed as an array.
- We should parse the hex string directly into byte[] and write that between nodes, only transforming to float[] for backwards compatibility.
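A minimal, dependency-free sketch of that last suggestion (hypothetical helper, not the PR's actual code): decode the hex string straight into a byte[], and only widen to float[] when an older node requires it.

```java
import java.util.Arrays;

public class HexVector {
    // Decode a hex string such as "400ae2" directly into the signed bytes
    // [64, 10, -30], with no intermediate float[] allocation.
    public static byte[] decode(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have an even length");
        }
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex string: " + hex);
            }
            bytes[i] = (byte) ((hi << 4) | lo);
        }
        return bytes;
    }

    // Widening conversion, needed only when talking to nodes that expect float[].
    public static float[] toFloats(byte[] bytes) {
        float[] floats = new float[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            floats[i] = bytes[i];
        }
        return floats;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(decode("400ae2")));            // [64, 10, -30]
        System.out.println(Arrays.toString(toFloats(decode("400ae2")))); // [64.0, 10.0, -30.0]
    }
}
```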
---
"Knn search with hex string for float field" :
  # [64, 10, -30] - is encoded as '400ae2'
  # this will be properly decoded but only because:
  # (i) the provided input is compatible as the values are within [Byte.MIN_VALUE, Byte.MAX_VALUE] range
  # (ii) we do not differentiate between byte and float fields when initially parsing a query
  - do:
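The '400ae2'-style comments in these fixtures can be sanity-checked with a tiny encoder (hypothetical helper, for illustration only):

```java
public class HexEncode {
    // Encode signed bytes into the lowercase hex form used in these fixtures,
    // e.g. [64, 10, -30] -> "400ae2" and [-128, 127, 10] -> "807f0a".
    public static String encode(byte[] vector) {
        StringBuilder sb = new StringBuilder(vector.length * 2);
        for (byte b : vector) {
            sb.append(Character.forDigit((b >> 4) & 0xf, 16)); // high nibble
            sb.append(Character.forDigit(b & 0xf, 16));        // low nibble
        }
        return sb.toString();
    }
}
```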
I don't think we should support hexadecimal strings for float at all.
I am conflicted on this. I realize now we allow byte[] (which is just always parsed as float[]). I wonder if we should allow the hex encoded strings. I am flip flopping here :/. Gonna have to think some more.
Agree, we should not allow this
index: knn_hex_vector_index
id: "4"
body:
  my_vector_float: "807f0a"
same concerns as above, I don't think we should allow this.
# [-128, 127, 10] - is encoded as '807f0a'
- do:
    catch: /Failed to parse object./
    index:
      index: knn_hex_vector_index
      id: "4"
      body:
        my_vector_float: "807f0a"
While I agree we shouldn't allow hex-encoded elements indexed into a float field, we should not have test code in setup. Please move it to its own test (same goes for the query one).
return switch (token) {
    case START_ARRAY -> parseVectorArray(context, fieldMapper, (val, idx) -> byteBuffer.put(val));
    case VALUE_STRING -> parseHexEncodedVector(context, fieldMapper, (val, idx) -> byteBuffer.put(val));
    case VALUE_NUMBER -> parseNumberVector(context, fieldMapper, (val, idx) -> byteBuffer.put(val));
We shouldn't allow this. If this is something we want to support (just a number that gets put into a vector of dim==1), then it should be a separate PR. Personally, I am against it as it encourages bad behavior. If folks have a single value, they should index it as keyword or int or float.
I'm definitely with you on this, but added this because it is what we already have and did not want to change the existing API. Currently, when parsing the request we make use of declareFloatArray, which ends up being defined as FLOAT_ARRAY(START_ARRAY, VALUE_NUMBER, VALUE_STRING); hence, while far from ideal, we already support single-valued numbers. I'm +1 if you agree to remove this (don't think that it'd actually affect anyone tbf)
hence, while far from ideal, we already support single valued numbers
POST vectors/_doc
{
"vector": 1
}
"caused_by": {
"type": "parsing_exception",
"reason": "Failed to parse object: expecting token of type [VALUE_NUMBER] but found [END_OBJECT]",
"line": 4,
"col": 1
}
That's when indexing. But we do allow it on the query side. So, let's disallow it on indexing, but continue to allow it query side.
Guess I missed that :/ Somehow I was under the impression that this was consistently allowed on both indexing & searching. Will proceed to disallow it on indexing and keep the existing behavior on query-time.
server/src/main/java/org/elasticsearch/search/vectors/KnnSearchBuilder.java
return switch (token) {
    case START_ARRAY -> parseQueryVectorArray(parser);
    case VALUE_STRING -> parseHexEncodedVector(parser);
    case VALUE_NUMBER -> parseNumberVector(parser);
We should throw. We shouldn't allow this.
static float[] parseQueryVector(XContentParser parser) throws IOException {
    XContentParser.Token token = parser.currentToken();
    return switch (token) {
        case START_ARRAY -> parseQueryVectorArray(parser);
I think it's good to assume float[] when passed an array. But we should be able to handle byte[] directly via hex.
}

private static float[] parseHexEncodedVector(XContentParser parser) throws IOException {
    // TODO optimize this as the array returned will be recomputed later again as a byte array
We should do this now :).
    void accept(byte value, int index);
}

public static class VectorData {
Not sure if this is the correct place or not - but thought to add it here initially, as almost all related classes already had a dependency on DenseVectorFieldMapper - happy to move it to its own class / discuss alternatives.
- I think it should be its own top-level class in org.elasticsearch.search.vectors.
- It should be a record; you can override the canonical constructor for uniqueness checks.
- It should also implement Writeable and possibly ToXContent and handle all the serialization stuff.
- It should handle its own XContent parsing; this way users of this only need to use this class, and it can correctly parse numerical array or string values.
++ Tbh the main reason that I defined it as a class instead of a record was to hide constructors and enable new object creation only via static methods (to ensure uniqueness).
was to hide constructors and enable new object creation only via static methods (to ensure uniqueness).
You can do that in the canonical ctor and still have static methods that are preferred.
I would do something like:
record VectorData(float[] floats, byte[] bytes) {
    public VectorData {
        // exactly one of floats/bytes must be non-null
        if ((floats == null) == (bytes == null)) {
            throw new IllegalArgumentException("You must supply exactly either floats or bytes");
        }
    }
}
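Note the canonical-constructor check has to fire when the two representations are both present or both absent (a plain xor on the null checks would reject the valid case). A self-contained sketch of the record plus the preferred static creation paths mentioned in the thread (hypothetical names, not the merged code):

```java
public record VectorData(float[] floats, byte[] bytes) {
    public VectorData {
        // exactly one representation must be supplied
        if ((floats == null) == (bytes == null)) {
            throw new IllegalArgumentException("You must supply exactly either floats or bytes");
        }
    }

    // preferred creation paths, keeping callers away from the two-arg ctor
    public static VectorData fromFloats(float[] floats) {
        return new VectorData(floats, null);
    }

    public static VectorData fromBytes(byte[] bytes) {
        return new VectorData(null, bytes);
    }

    public boolean isFloatVector() {
        return floats != null;
    }

    public boolean isByteVector() {
        return bytes != null;
    }
}
```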
@elasticmachine update branch
merge conflict between base and head
Looks like we are headed in the right direction! I think this code becomes easier to maintain and read if VectorData is a standalone public record class that handles all its own serialization & parsing.
public byte[] asByteVector() {
    if (isByteVector()) {
        return byteVector;
    } else if (isFloatVector()) {
        ElementType.BYTE.checkVectorBounds(floatVector);
        byte[] vec = new byte[floatVector.length];
        for (int i = 0; i < floatVector.length; i++) {
            vec[i] = (byte) floatVector[i];
        }
        return vec;
    }
    return new byte[0];
}
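A one-pass version of the float-to-byte conversion discussed here, with the bounds check and the cast in the same loop, could look roughly like this (hypothetical helper; the real code delegates to ElementType.BYTE.checkVectorBounds):

```java
public class ByteConversion {
    // Convert a float vector to bytes in a single pass, rejecting values that
    // are fractional, NaN, or outside [Byte.MIN_VALUE, Byte.MAX_VALUE].
    public static byte[] asByteVector(float[] floatVector) {
        byte[] vec = new byte[floatVector.length];
        for (int i = 0; i < floatVector.length; i++) {
            float v = floatVector[i];
            // NaN % 1.0f is NaN, so NaN also fails this whole-number check
            if (v % 1.0f != 0.0f) {
                throw new IllegalArgumentException(
                    "element [" + v + "] at index [" + i + "] is not a whole number");
            }
            if (v < Byte.MIN_VALUE || v > Byte.MAX_VALUE) {
                throw new IllegalArgumentException(
                    "element [" + v + "] at index [" + i + "] is out of byte range");
            }
            vec[i] = (byte) v;
        }
        return vec;
    }
}
```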
It is nice to have this for the mapper. However, this should have checks to ensure that if this is called, it's actually a byte vector (meaning, whole numbers between Byte.MIN_VALUE and Byte.MAX_VALUE).
There is a call to ElementType.BYTE.checkVectorBounds(floatVector); which would throw if any of the elements is outside of the BYTE range or decimal. However, to avoid code duplication, we do have to iterate over the array twice here, which is not great either :/ Will refactor and add all necessary checks in-place.
Pretty sure we iterate twice already. Though, it doesn't make sense to "bounds check" once the vector has switched to byte[] already, as you know for sure that byte[] values are whole and between min/max ;) (as they are byte values).
It seems like VectorData should have an as... method and an asElementType(ElementType) method to handle these weird scenarios.
The asByteVector and asFloatVector methods are currently called only when moving to explicit implementations (e.g. DenseVectorFieldType#createKnnByteQuery and DenseVectorFieldType#createKnnFloatQuery), and in the toXContent methods. So, in most cases, once the underlying vector has been converted to byte, the VectorData record itself is no longer used, so we won't need to re-transform the data.
Would it make sense to have a "rewrite"-like method to return a new VectorData with byte[] instead of float[]?
It seems like VectorData should have an as... method and an asElementType(ElementType) method to handle these weird scenarios.
Not sure the distinction of the two usages is clear to me. Could you please provide an example of a scenario that the asElementType method would handle, to help me understand the intent?
Not sure the distinction of the two usages is clear to me. Could you please provide an example of a scenario that the asElementType method would handle to help me understand the intent
Maybe I misunderstand then. I was thinking a query:
- Starting from an old node (parsed as float[])
- Serialized to a new node (read into VectorData)
- toQuery is called, but really it's an element_type: byte field, so we have to check dimensions and transform into byte[] to create the correct Lucene query.
Yeah, the process remains pretty much the same - exactly as you described it. There was also a parseFloat method that tried to eagerly read into byte[] during initial parsing (from either XContent or older nodes), but it has now been removed to better distinguish between explicit byte & float vectors and fail for the hex-float combination.
So now, hex aside, we pass around VectorData records holding float vectors (this hasn't changed) and call asFloatVector / asByteVector in DenseVectorFieldMapper in the createKnnQuery and createExactKnnQuery methods, depending on the element type. At that point, as you've mentioned, we do the dimension check / bound check etc. and pass the byte[] instance from then on.
I might have misunderstood/missed something, but AFAICT the dimensionality check and conversion happens only at that point, hence why the need for an additional asElementType method isn't clear to me.
Also, please note that I've just pushed a new set of changes moving VectorData outside of DenseVectorFieldMapper and taking care of serialization as suggested.
if (out.getTransportVersion().onOrAfter(TransportVersions.KNN_EXPLICIT_BYTE_QUERY_VECTOR_PARSING)) {
    boolean isFloat = query.isFloatVector();
    out.writeBoolean(isFloat);
    if (isFloat) {
        out.writeFloatArray(query.asFloatVector());
    } else {
        out.writeByteArray(query.asByteVector());
    }
} else {
    out.writeFloatArray(query.asFloatVector());
}
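The boolean-tagged wire format above can be sketched generically; Elasticsearch's StreamOutput is not available here, so this hypothetical helper uses plain DataOutputStream / DataInputStream purely to illustrate the tag-then-payload encoding that the thread suggests encapsulating in VectorData:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class TaggedVectorIO {
    // Write a boolean tag, then whichever representation is present.
    public static byte[] write(float[] floats, byte[] bytes) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            boolean isFloat = floats != null;
            out.writeBoolean(isFloat);
            if (isFloat) {
                out.writeInt(floats.length);
                for (float f : floats) out.writeFloat(f);
            } else {
                out.writeInt(bytes.length);
                out.write(bytes);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // in-memory streams never fail
        }
    }

    // Read the tag back and return float[] or byte[] accordingly.
    public static Object read(byte[] wire) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
            if (in.readBoolean()) {
                float[] floats = new float[in.readInt()];
                for (int i = 0; i < floats.length; i++) floats[i] = in.readFloat();
                return floats;
            }
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            return bytes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

An older transport version would skip the tag and always send the float form, which is why the byte-to-float widening path has to stay around for BWC.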
This and even the transport version checks should be encapsulated in VectorData.
out.writeString(field);
}

@Override
protected void doXContent(XContentBuilder builder, Params params) throws IOException {
    builder.startObject(NAME);
    builder.field("query", query);
    builder.field("query", query.asFloatVector());
Transforming byte -> float here isn't necessary. The VectorData should determine which "kind" it is and not bother transforming between them. Encapsulating the toXContent should fix this.
    return asFloatVector(true);
}

public float[] asFloatVector(boolean failIfByte) {
Don't really like this - as we only need to force the conversion for serialization. Maybe it'd be better to have a separate (albeit very similar) method instead, or include this logic somehow in the writeTo? The main challenge there is that in certain cases (e.g. KnnVectorQueryBuilder) there is additional logic in-between handling the query_vector param.
Thinking more on it, I don't think we should do this. Right now we allow "byte" arrays to query float-indexed values.
Consider the following:
- A new coordinator accepts the hex-encoded byte array
- It's serialized to an older node (thus transformed to float[])
- Then we successfully query an element_type: float field.
I think it's OK for us to be lenient here.
What do you think @mayya-sharipova? It seems that preventing byte[] array queries against an element_type: float field is causing more trouble than it's worth :(
Yeah, I don't think we should fail here. float v = b is valid in Java as it's widening the values. It seems like an unnecessary restriction the more I think about it.
Sorry for flip-flopping on this so much. What do you think @pmpailis, should we restrict it? @mayya-sharipova what say you? I am happy to go with the majority as I obviously cannot make up my mind :)
Currently we do support byte values for float fields (we won't know about the element type until much later), so I guess it makes sense to be "consistent" for hex as well and let it pass for float vectors. I do understand the reasoning for restricting this, but we don't do that for a standard byte array now either, so this could potentially cause some confusion.
+ the scenario you mentioned, unless we decide to throw for hex & old nodes (which I don't really think would be nice), would make us be lenient anyway :)
Updated (temporarily) to not fail for byte -> float conversion. Happy to change back if we decide to do so :) @mayya-sharipova wdyt?
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
    if (floatVector != null) {
        builder.array(params.param(XCONTENT_PARAM_NAME, DEFAULT_XCONTENT_NAME), floatVector);
    } else {
        builder.array(params.param(XCONTENT_PARAM_NAME, DEFAULT_XCONTENT_NAME), byteVector);
    }
    return builder;
}
It would be much simpler if this would only write the non-null array without a field name. Then the containing objects would do builder.field(fieldName, vectorData); which will get written as fieldName: [1, 2, 3...]. I think the use of XContentParams here, while interesting, is complex.
Yeap, I agree with your point. Tbh that's how I had it initially, but some serialization tests were failing with the following, so I resorted to the more complex approach using XContentParams:
...
Caused by: com.fasterxml.jackson.core.JsonGenerationException: Can not start an array, expecting field name
at com.fasterxml.jackson.core.JsonGenerator._reportError(JsonGenerator.java:2849)
at com.fasterxml.jackson.dataformat.yaml.YAMLGenerator._verifyValueWrite(YAMLGenerator.java:916)
at com.fasterxml.jackson.dataformat.yaml.YAMLGenerator.writeStartArray(YAMLGenerator.java:586)
at org.elasticsearch.xcontent.provider.json.JsonXContentGenerator.writeStartArray(JsonXContentGenerator.java:169)
I ended up re-writing the tests using mocks instead of concrete instances for the validation of VectorData#toXContent, but failed to change this one back.
++ for the change, will update it.
Done in 2436b03
if (vec == null) return null;
return new VectorData(vec);
Suggested change:
return vec == null ? null : new VectorData(vec);
But not a big deal.
@@ -400,33 +445,20 @@ public void parseKnnVectorAndIndex(DocumentParserContext context, DenseVectorFie
@Override
double parseKnnVectorToByteBuffer(DocumentParserContext context, DenseVectorFieldMapper fieldMapper, ByteBuffer byteBuffer)
I like how much cleaner this is becoming :D
@elasticmachine update branch
whether we want to support converting to float when a user has provided a hex vector (have to also consider desired bwc for this)
I think this is OK. float = byte is an acceptable conversion in almost every programming language AND it's something that HAS to happen for BWC to not be completely busted.
There is no technical reason for the restriction that I can think of.
@mayya-sharipova what do you think?
…ex_encoded_byte_vectors
@@ -384,11 +396,44 @@ public void parseKnnVectorAndIndex(DocumentParserContext context, DenseVectorFie
        + "];"
    );
}
vector[index++] = (byte) value;
consumer.accept((byte) value, index++);
Why do we need to be so tricky with a Consumer here and have a ByteBuffer and byte[] path? As far as I understand, all our ByteBuffer are array-backed anyway? Can't we do without the indirection and non-static callsite here and just always insert into an array?
Thanks for taking a look and for the suggestion @original-brownbear! Updated to remove the Consumer overhead and to parse directly into a byte array.
server/src/main/java/org/elasticsearch/index/mapper/vectors/DenseVectorFieldMapper.java
…ex_encoded_byte_vectors
server/src/main/java/org/elasticsearch/index/mapper/vectors/DenseVectorFieldMapper.java
server/src/main/java/org/elasticsearch/index/mapper/vectors/DenseVectorFieldMapper.java
@pmpailis Thanks for persisting and addressing all the comments! Great work, Panos, I very much like how the code looks now.
This is good stuff.
@elasticmachine update branch
run elasticsearch-ci/part-1
@elasticmachine update branch
@elasticmachine update branch
run elasticsearch-ci/part-1
…ex_encoded_byte_vectors
The following tests are currently failing, most likely as a side-effect of another test (
Once this PR is merged, we can proceed with merging this one as well.
Thanks everyone for the thorough reviews and the discussions ❤️
This PR updates the parsing of the query_vector param in both knn-search & knn-query to support hex-encoded byte vectors. This means that the following 2 requests are now equivalent (same goes for the knn query) and would yield the same results. The same parsing also takes place during indexing, so similarly, we now support both of the following (equivalent) formats.