Adding support for hex-encoded byte vectors on knn-search #105393


Merged

Conversation

Contributor

@pmpailis pmpailis commented Feb 12, 2024

This PR updates the parsing of the query_vector param in both knn-search & knn-query to support hex-encoded byte vectors. This means that the following two requests are now equivalent (the same goes for the knn query) and yield the same results.

POST my_index/_search
{
    "knn":{
        "query_vector": [64, 10, -30],
        "field": "my_vector_byte",
        "k": 10,
        "num_candidates": 100
    },
    "size": 10
}
POST my_index/_search
{
    "knn":{
        "query_vector": "400ae2",
        "field": "my_vector_byte",
        "k": 10,
        "num_candidates": 100
    },
    "size": 10
}

The same parsing also takes place during indexing, so we now support both of the following (equivalent) formats:

POST my_index/_doc
{
    "my_vector_byte": [64, -10, -30]
}
POST my_index/_doc
{
    "my_vector_byte": "40f6e2"
}
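For reference, the byte-to-hex mapping shown above can be reproduced with the JDK's java.util.HexFormat (Java 17+). This is an illustrative client-side sketch (the class and method names are made up for the example), not code from this PR: each signed byte is written as two lowercase hex digits of its unsigned two's-complement value, so -10 becomes f6.

```java
import java.util.Arrays;
import java.util.HexFormat;

public class HexVectorExample {
    // Encode a signed byte vector as the lowercase hex string the API accepts.
    public static String encode(byte[] vector) {
        return HexFormat.of().formatHex(vector);
    }

    // Decode a hex string back into the equivalent signed byte vector.
    public static byte[] decode(String hex) {
        return HexFormat.of().parseHex(hex);
    }

    public static void main(String[] args) {
        byte[] vector = {64, -10, -30};
        String hex = encode(vector);                      // "40f6e2"
        System.out.println(hex);
        System.out.println(Arrays.toString(decode(hex))); // [64, -10, -30]
    }
}
```

Note that parseHex throws on a string with an odd number of digits, so malformed input fails fast.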

@pmpailis pmpailis added Team:Search Meta label for search team :Search Relevance/Vectors Vector search v8.13.0 labels Feb 12, 2024
Contributor

Documentation preview:

@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine
Collaborator

Hi @pmpailis, I've created a changelog YAML for you.

return vector;
}

private static float[] parseQueryVectorArray(XContentParser parser) throws IOException {
Member

This code is duplicated, maybe we could extract it to a common class?

Contributor Author

++ Did some refactoring in DenseVectorFieldMapper to avoid code duplication but could very well do the same here. Will update :)

Contributor Author
@pmpailis pmpailis Feb 13, 2024

Addressed in 3833517 & 638a737

dotProduct += value * value;
index++;
XContentParser.Token token = context.parser().currentToken();
if (token == XContentParser.Token.START_ARRAY) {
Member

Nit: Maybe we could use a switch expression here

Contributor Author

Addressed in 3833517

Member
@carlosdelest carlosdelest left a comment

LGTM, thanks Panos!

@benwtrent benwtrent self-requested a review February 13, 2024 13:24
Member
@benwtrent benwtrent left a comment

Good progress! High-level concerns:

  • I wonder if we should ever allow hex-encoded strings to query float-encoded vectors. I get this may help with backwards compatibility.
  • We shouldn't allow single-value numbers to be parsed as an array.
  • We should parse the hex string directly into byte[] and write that between nodes, transforming to float[] only for backwards compatibility.
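The last point — parsing the hex string straight into byte[] and widening to float[] only when an older node requires it — can be sketched as follows. The helper names are hypothetical, not the PR's actual methods:

```java
import java.util.HexFormat;

public class HexQueryVector {
    // Decode the query_vector hex string straight into the byte[] used for byte fields.
    public static byte[] parseHexVector(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex-encoded vectors must have an even number of digits");
        }
        return HexFormat.of().parseHex(hex);
    }

    // Only for BWC: widen the bytes to float[] before serializing to an older node.
    public static float[] widenToFloats(byte[] bytes) {
        float[] floats = new float[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            floats[i] = bytes[i]; // byte -> float is a lossless widening conversion
        }
        return floats;
    }
}
```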

Comment on lines 72 to 78
---
"Knn search with hex string for float field" :
# [64, 10, -30] - is encoded as '400ae2'
# this will be properly decoded but only because:
# (i) the provided input is compatible as the values are within [Byte.MIN_VALUE, Byte.MAX_VALUE] range
# (ii) we do not differentiate between byte and float fields when initially parsing a query
- do:
Member

I don't think we should support hexadecimal strings for float fields at all.

Member

I am conflicted on this. I realize now we allow byte[] (which is just always parsed as float[]).

I wonder if we should allow the hex-encoded strings. I am flip-flopping here :/ Gonna have to think some more.

Contributor

Agreed, we should not allow this.

index: knn_hex_vector_index
id: "4"
body:
my_vector_float: "807f0a"
Member

Same concerns as above; I don't think we should allow this.

Comment on lines 53 to 60
# [-128, 127, 10] - is encoded as '807f0a'
- do:
    catch: /Failed to parse object./
    index:
      index: knn_hex_vector_index
      id: "4"
      body:
        my_vector_float: "807f0a"
Member

While I agree we shouldn't allow hex-encoded elements to be indexed into a float field, we should not have test code in the setup. Please move it to its own test (same goes for the query one).

return switch (token) {
    case START_ARRAY -> parseVectorArray(context, fieldMapper, (val, idx) -> byteBuffer.put(val));
    case VALUE_STRING -> parseHexEncodedVector(context, fieldMapper, (val, idx) -> byteBuffer.put(val));
    case VALUE_NUMBER -> parseNumberVector(context, fieldMapper, (val, idx) -> byteBuffer.put(val));
Member

We shouldn't allow this. If this is something we want to support (just a number that gets put into a vector of dim==1), then it should be a separate PR. Personally, I am against it as it encourages bad behavior. If folks have a single value, they should index it as keyword, int, or float.

Contributor Author

I'm definitely with you on this, but I added it because it is what we already have, and I did not want to change the existing API.

Currently, when parsing the request we make use of declareFloatArray, which ends up being defined as

        FLOAT_ARRAY(START_ARRAY, VALUE_NUMBER, VALUE_STRING)

hence, while far from ideal, we already support single-valued numbers. I'm +1 if you agree to remove this (I don't think it'd actually affect anyone tbf).

Member

hence, while far from ideal, we already support single-valued numbers

POST vectors/_doc
{
	"vector": 1
}
"caused_by": {
            "type": "parsing_exception",
            "reason": "Failed to parse object: expecting token of type [VALUE_NUMBER] but found [END_OBJECT]",
            "line": 4,
            "col": 1
        }

That's when indexing. But we do allow it on the query side. So, let's disallow it on indexing, but continue to allow it on the query side.

Contributor Author

Guess I missed that :/ Somehow I was under the impression that this was consistently allowed on both indexing & searching. I'll proceed to disallow it on indexing and keep the existing behavior at query time.

return switch (token) {
    case START_ARRAY -> parseQueryVectorArray(parser);
    case VALUE_STRING -> parseHexEncodedVector(parser);
    case VALUE_NUMBER -> parseNumberVector(parser);
Member

We should throw. We shouldn't allow this.

static float[] parseQueryVector(XContentParser parser) throws IOException {
    XContentParser.Token token = parser.currentToken();
    return switch (token) {
        case START_ARRAY -> parseQueryVectorArray(parser);
Member

I think it's good to assume float[] when passed an array. But we should be able to handle byte[] directly via hex.

}

private static float[] parseHexEncodedVector(XContentParser parser) throws IOException {
// TODO optimize this as the array returned will be recomputed later again as a byte array
Member

We should do this now :).

void accept(byte value, int index);
}

public static class VectorData {
Contributor Author

Not sure if this is the correct place or not, but I thought to add it here initially, as almost all related classes already had a dependency on DenseVectorFieldMapper. Happy to move it to its own class / discuss alternatives.

Member

  • I think it should be its own top-level class in org.elasticsearch.search.vectors.
  • It should be a record; you can override the canonical constructor for uniqueness checks.
  • It should also implement Writeable and possibly ToXContent and handle all the serialization.
  • It should handle its own XContent parsing; this way users only need this class, and it can correctly parse numerical array or string values.

Contributor Author

++ Tbh the main reason I defined it as a class instead of a record was to hide the constructors and enable new object creation only via static methods (to ensure uniqueness).

Member

was to hide constructors and enable new object creation only via static methods (to ensure uniqueness).

You can do that in the canonical ctor and still have static methods that are preferred.

I would do something like:

record VectorData(float[] floats, byte[] bytes) {
    public VectorData {
        // exactly one of the two must be supplied
        if ((floats == null) == (bytes == null)) {
            throw new IllegalArgumentException("You must supply exactly either floats or bytes");
        }
    }
}
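Extending that sketch with the static factories discussed above (so the uniqueness check stays in one place while callers use named constructors) might look like the following; fromFloats/fromBytes are illustrative names, not necessarily what was merged:

```java
import java.util.Objects;

// Sketch: exactly one of the two arrays is ever non-null.
public record VectorData(float[] floats, byte[] bytes) {
    public VectorData {
        // throws when both or neither array is supplied
        if ((floats == null) == (bytes == null)) {
            throw new IllegalArgumentException("you must supply exactly one of floats or bytes");
        }
    }

    public static VectorData fromFloats(float[] floats) {
        return new VectorData(Objects.requireNonNull(floats), null);
    }

    public static VectorData fromBytes(byte[] bytes) {
        return new VectorData(null, Objects.requireNonNull(bytes));
    }

    public boolean isFloatVector() {
        return floats != null;
    }
}
```

A real implementation would also override equals/hashCode, since records compare array components by reference, not by content.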

@pmpailis
Contributor Author

@elasticmachine update branch

@elasticmachine
Collaborator

merge conflict between base and head

Member
@benwtrent benwtrent left a comment

Looks like we are headed in the right direction!

I think this code becomes easier to maintain and read if VectorData is a standalone public record class that handles all of its own serialization & parsing.


Comment on lines 145 to 157
public byte[] asByteVector() {
    if (isByteVector()) {
        return byteVector;
    } else if (isFloatVector()) {
        ElementType.BYTE.checkVectorBounds(floatVector);
        byte[] vec = new byte[floatVector.length];
        for (int i = 0; i < floatVector.length; i++) {
            vec[i] = (byte) floatVector[i];
        }
        return vec;
    }
    return new byte[0];
}
Member

It is nice to have this for the mapper. However, this should have checks to ensure that if this is called, it's actually a byte vector (meaning whole numbers between Byte.MIN_VALUE and Byte.MAX_VALUE).

Contributor Author

There is a call to ElementType.BYTE.checkVectorBounds(floatVector); which would throw if any of the elements is outside the byte range or has a decimal part. However, to avoid code duplication, we do have to iterate over the array twice here, which is not great either :/ Will refactor and add all necessary checks in-place.
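A single-pass version of that refactor, validating each element while copying instead of iterating once for bounds checks and once for the conversion, could look roughly like this standalone sketch (names are illustrative; the real checks live in DenseVectorFieldMapper):

```java
public class ByteConversion {
    // Convert a float vector to bytes, validating each element in the same pass.
    public static byte[] toByteVector(float[] floatVector) {
        byte[] vec = new byte[floatVector.length];
        for (int i = 0; i < floatVector.length; i++) {
            float v = floatVector[i];
            // reject out-of-range, fractional, and NaN values
            if (v < Byte.MIN_VALUE || v > Byte.MAX_VALUE || v % 1.0f != 0.0f) {
                throw new IllegalArgumentException(
                    "element [" + v + "] at index [" + i + "] is not a valid byte value");
            }
            vec[i] = (byte) v;
        }
        return vec;
    }
}
```

The `v % 1.0f != 0.0f` test also catches NaN, since NaN compares unequal to everything including zero.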

Member

Pretty sure we iterate twice already. Though, it doesn't make sense to "bounds check" once the vector has switched to byte[] already, as you know for sure that byte[] values are whole and between min/max ;) (as they are byte values).

It seems like VectorData should have an as... method and an asElementType(ElementType) to handle these weird scenarios.

Contributor Author

The asByteVector and asFloatVector methods are currently called only when moving to explicit implementations (e.g. DenseVectorFieldType#createKnnByteQuery and DenseVectorFieldType#createKnnFloatQuery), and in the toXContent methods. So, in most cases, while the underlying vector has been converted to byte, the VectorData record itself is no longer used, so we won't need to re-transform the data.

Would it make sense to have a "rewrite"-like method to return a new VectorData with byte[] instead of float[]?

It seems like VectorData should have an as... method and a asElementType(ElementType) to handle these weird scenarios.

Not sure the distinction of the two usages is clear to me. Could you please provide an example of a scenario that the asElementType method would handle to help me understand the intent?

Member

Not sure the distinction of the two usages is clear to me. Could you please provide an example of a scenario that the asElementType method would handle to help me understand the intent?

Maybe I misunderstand then. I was thinking of a query:

  • Starting from an old node (parsed as float[])
  • Serialized to a new node (read into VectorData)
  • toQuery is called, but really it's an element_type: byte field, so we have to check dimensions and transform into byte[] to create the correct Lucene query.

Contributor Author

Yeah, the process remains pretty much the same - exactly as you described it. There was also a parseFloat method that tried to eagerly read into byte[] during initial parsing (from either XContent or older nodes), but it has now been removed to better distinguish between explicit byte & float vectors and fail for the hex/float combination.

So now, hex aside, we pass around VectorData records holding float vectors (this hasn't changed) and call asFloatVector / asByteVector in DenseVectorFieldMapper in the createKnnQuery and createExactKnnQuery methods, depending on the element type. At that point, as you've mentioned, we do the dimension check / bounds check etc. and pass the byte[] instance from then on.

I might have misunderstood / missed something, but AFAICT the dimensionality check and conversion happen only at that point, hence why the need for an additional asElementType method isn't clear to me.

Also, please note that I've just pushed a new set of changes moving VectorData outside of DenseVectorFieldMapper and taking care of serialization as suggested.

Comment on lines 87 to 97
if (out.getTransportVersion().onOrAfter(TransportVersions.KNN_EXPLICIT_BYTE_QUERY_VECTOR_PARSING)) {
    boolean isFloat = query.isFloatVector();
    out.writeBoolean(isFloat);
    if (isFloat) {
        out.writeFloatArray(query.asFloatVector());
    } else {
        out.writeByteArray(query.asByteVector());
    }
} else {
    out.writeFloatArray(query.asFloatVector());
}
Member

This, and even the transport version checks, should be encapsulated in VectorData.

out.writeString(field);
}

@Override
protected void doXContent(XContentBuilder builder, Params params) throws IOException {
builder.startObject(NAME);
builder.field("query", query);
builder.field("query", query.asFloatVector());
Member

Transforming byte -> float here isn't necessary. The VectorData should determine which "kind" it is and not bother transforming between them. Encapsulating the toXContent should fix this.

return asFloatVector(true);
}

public float[] asFloatVector(boolean failIfByte) {
Contributor Author

Don't really like this, as we only need to force the conversion for serialization. Maybe it'd be better to have a separate (albeit very similar) method instead, or include this logic somehow in writeTo? The main challenge there is that in certain cases (e.g. KnnVectorQueryBuilder) there is additional logic in-between handling the query_vector param.

Member

Thinking more on it, I don't think we should do this. Right now we allow "byte" arrays to query float-indexed values.

Consider the following:

  • A new coordinator accepts the hex-encoded byte array
  • It's serialized to an older node (thus transformed to float[])
  • Then we successfully query an element_type: float field.

I think it's OK for us to be lenient here.

What do you think @mayya-sharipova? It seems that preventing byte[] array queries against an element_type: float field is causing more trouble than it's worth :(

Member

Yeah, I don't think we should fail here. float v = b is valid in Java as it's a widening conversion. It seems like an unnecessary restriction the more I think about it.

Sorry for flip-flopping on this so much. What do you think @pmpailis should we restrict it? @mayya-sharipova what say you? I am happy to go with the majority as I obviously cannot make up my mind :)

Contributor Author

Currently we do support byte values for float fields (we won't know about the element type until much later), so I guess it makes sense to be "consistent" for hex as well and let it pass for float vectors. I do understand the reasoning for restricting this, but we don't do that for a standard byte array now either, so this could potentially cause some confusion.

Plus, the scenario you mentioned would force us to be lenient anyway, unless we decide to throw for hex & old nodes (which I don't think would be nice) :)

Contributor Author

Updated (temporarily) to not fail for byte -> float conversion. Happy to change back if we decide to do so :) @mayya-sharipova wdyt?

Comment on lines 92 to 99
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
    if (floatVector != null) {
        builder.array(params.param(XCONTENT_PARAM_NAME, DEFAULT_XCONTENT_NAME), floatVector);
    } else {
        builder.array(params.param(XCONTENT_PARAM_NAME, DEFAULT_XCONTENT_NAME), byteVector);
    }
    return builder;
}
Member

It would be much simpler if this only wrote the non-null array, without a field name. Then the containing objects would do builder.field(fieldName, vectorData); which will get written as fieldName: [1, 2, 3...]

I think the use of XContentParams here, while interesting, is complex.

Contributor Author

Yep, I agree with your point. Tbh that's how I had it initially, but some serialization tests were failing with the following, so I resorted to the more complex approach using XContentParams:

...
Caused by: com.fasterxml.jackson.core.JsonGenerationException: Can not start an array, expecting field name
	at com.fasterxml.jackson.core.JsonGenerator._reportError(JsonGenerator.java:2849)
	at com.fasterxml.jackson.dataformat.yaml.YAMLGenerator._verifyValueWrite(YAMLGenerator.java:916)
	at com.fasterxml.jackson.dataformat.yaml.YAMLGenerator.writeStartArray(YAMLGenerator.java:586)
	at org.elasticsearch.xcontent.provider.json.JsonXContentGenerator.writeStartArray(JsonXContentGenerator.java:169)

I ended up re-writing the tests using mocks instead of concrete instances for the validation of VectorData#toXContent, but failed to change this one back.

++ for the change, will update it.

Contributor Author

Done in 2436b03

Comment on lines 160 to 161
if (vec == null) return null;
return new VectorData(vec);
Member

Suggested change
if (vec == null) return null;
return new VectorData(vec);
return vec == null ? null : new VectorData(vec);

But not a big deal.

@@ -400,33 +445,20 @@ public void parseKnnVectorAndIndex(DocumentParserContext context, DenseVectorFie
@Override
double parseKnnVectorToByteBuffer(DocumentParserContext context, DenseVectorFieldMapper fieldMapper, ByteBuffer byteBuffer)
Member

I like how much cleaner this is becoming :D


@pmpailis
Contributor Author

@elasticmachine update branch

Member
@benwtrent benwtrent left a comment

whether we want to support converting to float when a user has provided a hex vector (we also have to consider the desired bwc for this)

I think this is OK. float = byte is an acceptable conversion in almost every programming language AND it's something that HAS to happen for BWC to not be completely busted.

There is no technical reason for the restriction that I can think of.

@mayya-sharipova what do you think?

@@ -384,11 +396,44 @@ public void parseKnnVectorAndIndex(DocumentParserContext context, DenseVectorFie
+ "];"
);
}
vector[index++] = (byte) value;
consumer.accept((byte) value, index++);
Contributor

Why do we need to be so tricky with a Consumer here and have both a ByteBuffer and a byte[] path? As far as I understand, all our ByteBuffers are array-backed anyway? Can't we do without the indirection and non-static callsite here and just always insert into an array?

Contributor Author

Thanks for taking a look, and for the suggestion @original-brownbear! Updated to remove the Consumer overhead and to parse directly into a byte array.

Contributor
@mayya-sharipova mayya-sharipova left a comment

@pmpailis Thanks for persisting and addressing all the comments! Great work, Panos, I very much like how the code looks now.

Member
@benwtrent benwtrent left a comment

This is good stuff.

@pmpailis
Contributor Author

@elasticmachine update branch

@pmpailis
Contributor Author

run elasticsearch-ci/part-1

@pmpailis
Contributor Author

@elasticmachine update branch

@pmpailis
Contributor Author

@elasticmachine update branch

@pmpailis
Contributor Author

run elasticsearch-ci/part-1

@pmpailis
Contributor Author

The following tests are currently failing, most likely as a side-effect of another test (LoggerTests) updating the log-level for the root logger.

Tests with failures:
 - org.elasticsearch.snapshots.SnapshotResiliencyTests.testIndexNotFoundExceptionLogging
 - org.elasticsearch.snapshots.SnapshotResiliencyTests.testFullSnapshotUnassignedShards
 - org.elasticsearch.snapshots.SnapshotResiliencyTests.testIllegalArgumentExceptionLogging
 - org.elasticsearch.snapshots.SnapshotResiliencyTests.testSnapshotNameAlreadyInUseExceptionLogging

Once that fix is merged, we can proceed with merging this one as well.

@pmpailis
Contributor Author

Thanks everyone for the thorough reviews and the discussions ❤️

@pmpailis pmpailis merged commit d471ccb into elastic:main Mar 13, 2024
@pmpailis pmpailis deleted the feature/support_for_hex_encoded_byte_vectors branch May 27, 2025 03:50