Adding support for hex-encoded byte vectors on knn-search #105393

Merged
24 commits
135d8d0
adding parsing for hex-encoded byte vectors
pmpailis Feb 12, 2024
0e4d7e6
Update docs/changelog/105393.yaml
pmpailis Feb 12, 2024
3833517
addressing PR comments - removing duplicated code and opting for swit…
pmpailis Feb 13, 2024
638a737
changing visibility of parseQueryVector method
pmpailis Feb 13, 2024
4d05b3f
iter
pmpailis Feb 13, 2024
325a900
addressing PR comments - adding VectorData DTO
pmpailis Feb 20, 2024
9933da3
Merge branch 'main' into feature/support_for_hex_encoded_byte_vectors
pmpailis Feb 20, 2024
5cec209
adding VectorData as a record
pmpailis Feb 21, 2024
2436b03
addressing PR comments - simplifying toXContent for VectorData
pmpailis Feb 21, 2024
fc405a0
Merge branch 'main' into feature/support_for_hex_encoded_byte_vectors
elasticmachine Feb 21, 2024
2c34682
addressing PR comments - removing mocks, updating bwc tests, and supp…
pmpailis Feb 22, 2024
db4ba3d
Merge remote-tracking branch 'origin/main' into feature/support_for_h…
pmpailis Feb 23, 2024
e2d0b1d
minor iter
pmpailis Feb 23, 2024
089b212
Merge remote-tracking branch 'origin/main' into feature/support_for_h…
pmpailis Mar 5, 2024
d0355ca
minor iter
pmpailis Mar 5, 2024
9c3840e
Merge branch 'main' into feature/support_for_hex_encoded_byte_vectors
pmpailis Mar 11, 2024
71591fe
Merge branch 'main' into feature/support_for_hex_encoded_byte_vectors
elasticmachine Mar 11, 2024
98cee5e
Merge branch 'main' into feature/support_for_hex_encoded_byte_vectors
elasticmachine Mar 11, 2024
d3a2838
Merge branch 'main' into feature/support_for_hex_encoded_byte_vectors
pmpailis Mar 11, 2024
80d3538
Merge branch 'main' into feature/support_for_hex_encoded_byte_vectors
elasticmachine Mar 12, 2024
c5282d9
Merge remote-tracking branch 'origin/main' into feature/support_for_h…
pmpailis Mar 12, 2024
ce0b82e
Merge remote-tracking branch 'origin/main' into feature/support_for_h…
pmpailis Mar 12, 2024
fffd8c3
Merge branch 'main' into feature/support_for_hex_encoded_byte_vectors
pmpailis Mar 12, 2024
727d747
fixing compilation error
pmpailis Mar 12, 2024
5 changes: 5 additions & 0 deletions docs/changelog/105393.yaml
@@ -0,0 +1,5 @@
pr: 105393
summary: Adding support for hex-encoded byte vectors on knn-search
area: Vector Search
type: feature
issues: []
4 changes: 2 additions & 2 deletions docs/reference/query-dsl/knn-query.asciidoc
@@ -87,8 +87,8 @@ the top `size` results.
`query_vector`::
+
--
-(Required, array of floats) Query vector. Must have the same number of dimensions
-as the vector field you are searching against.
+(Required, array of floats or string) Query vector. Must have the same number of dimensions
+as the vector field you are searching against. Must be either an array of floats or a hex-encoded byte vector.
--

`num_candidates`::
2 changes: 1 addition & 1 deletion docs/reference/rest-api/common-parms.asciidoc
@@ -597,7 +597,7 @@ end::knn-num-candidates[]

tag::knn-query-vector[]
Query vector. Must have the same number of dimensions as the vector field you
-are searching against.
+are searching against. Must be either an array of floats or a hex-encoded byte vector.
end::knn-query-vector[]

tag::knn-similarity[]
2 changes: 1 addition & 1 deletion docs/reference/search/knn-search.asciidoc
@@ -121,7 +121,7 @@ include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=knn-k]
include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=knn-num-candidates]

`query_vector`::
-(Required, array of floats)
+(Required, array of floats or string)
include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=knn-query-vector]
====

21 changes: 21 additions & 0 deletions docs/reference/search/search-your-data/knn-search.asciidoc
@@ -242,6 +242,27 @@ POST byte-image-index/_search
// TEST[s/"k": 10/"k": 3/]
// TEST[s/"num_candidates": 100/"num_candidates": 3/]


_Note_: In addition to the standard byte array, one can also provide a hex-encoded string value
for the `query_vector` param. As an example, the search request above can also be expressed as follows,
which would yield the same results:
[source,console]
----
POST byte-image-index/_search
{
"knn": {
"field": "byte-image-vector",
"query_vector": "fb09",
"k": 10,
"num_candidates": 100
},
"fields": [ "title" ]
}
----
// TEST[continued]
// TEST[s/"k": 10/"k": 3/]
// TEST[s/"num_candidates": 100/"num_candidates": 3/]
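
The hex form in the note above is two hex digits per dimension, each being the two's-complement value of one signed byte, so `"fb09"` corresponds to the signed bytes `[-5, 9]`. A minimal Python sketch of that client-side encoding follows. The helper name `to_hex_vector` is hypothetical and is not part of this PR or of any Elasticsearch client.

```python
def to_hex_vector(values):
    """Encode signed bytes (-128..127) as a lowercase hex string, two digits per dimension.

    Hypothetical helper for illustration only; not an Elasticsearch API.
    """
    for v in values:
        if not -128 <= v <= 127:
            raise ValueError(f"{v} does not fit in a signed byte")
    # Masking with 0xFF maps a negative value to its two's-complement byte.
    return bytes(v & 0xFF for v in values).hex()


print(to_hex_vector([-5, 9]))        # fb09
print(to_hex_vector([64, 10, -30]))  # 400ae2
```

The resulting string can be passed as `query_vector` in place of the array form.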

[discrete]
[[knn-search-quantized-example]]
==== Byte quantized kNN search
@@ -0,0 +1,163 @@
setup:
  - skip:
      version: ' - 8.13.99'
      reason: 'hex encoding for byte vectors was added in 8.14'

  - do:
      indices.create:
        index: knn_hex_vector_index
        body:
          settings:
            number_of_shards: 1
          mappings:
            dynamic: false
            properties:
              my_vector_byte:
                type: dense_vector
                dims: 3
                index : true
                similarity : l2_norm
                element_type: byte
              my_vector_float:
                type: dense_vector
                dims: 3
                index: true
                element_type: float
                similarity : l2_norm

  # [-128, 127, 10] - is encoded as '807f0a'
  - do:
      index:
        index: knn_hex_vector_index
        id: "1"
        body:
          my_vector_byte: "807f0a"


  # [0, 1, 0] - is encoded as '000100'
  - do:
      index:
        index: knn_hex_vector_index
        id: "2"
        body:
          my_vector_byte: "000100"

  # [64, -10, -30] - is encoded as '40f6e2'
  - do:
      index:
        index: knn_hex_vector_index
        id: "3"
        body:
          my_vector_byte: "40f6e2"

  - do:
      index:
        index: knn_hex_vector_index
        id: "4"
        body:
          my_vector_float: [10.5, -10, 1024]

  - do:
      indices.refresh: {}

---
"Fail to index hex-encoded vector on float field":

  # [-128, 127, 10] - is encoded as '807f0a'
  - do:
      catch: /Failed to parse object./
      index:
        index: knn_hex_vector_index
        id: "5"
        body:
          my_vector_float: "807f0a"

---
"Knn search with hex string for float field" :
  # [64, 10, -30] - is encoded as '400ae2'
  # this will be properly decoded but only because:
  # (i) the provided input is compatible as the values are within [Byte.MIN_VALUE, Byte.MAX_VALUE] range
  # (ii) we do not differentiate between byte and float fields when initially parsing a query even for hex
  # (iii) we support expansion from byte to float

  - do:
      search:
        index: knn_hex_vector_index
        body:
          size: 3
          knn:
            field: my_vector_float
            query_vector: "400ae2"
            num_candidates: 100
            k: 10

  - match: { hits.total.value: 1 }
  - match: { hits.hits.0._id: "4" }

---
"Knn search with hex string for byte field" :
  # [64, 10, -30] - is encoded as '400ae2'
  - do:
      search:
        index: knn_hex_vector_index
        body:
          size: 3
          knn:
            field: my_vector_byte
            query_vector: "400ae2"
            num_candidates: 100
            k: 10

  - match: { hits.total.value: 3 }
  - match: { hits.hits.0._id: "3" }
  - match: { hits.hits.1._id: "2" }
  - match: { hits.hits.2._id: "1" }

---
"Knn search with hex string for byte field - dimensions mismatch" :
  # [64, 10, -30, 10] - is encoded as '400ae20a'
  - do:
      catch: /the query vector has a different dimension \[4\] than the index vectors \[3\]/
      search:
        index: knn_hex_vector_index
        body:
          size: 3
          knn:
            field: my_vector_byte
            query_vector: "400ae20a"
            num_candidates: 100
            k: 10


---
"Knn search with hex string for byte field - cannot decode string" :
  # '40af20a' is garbage :)
  - do:
      catch: /failed to parse field \[query_vector\]/
      search:
        index: knn_hex_vector_index
        body:
          size: 3
          knn:
            field: my_vector_byte
            query_vector: "40af20a"
            num_candidates: 100
            k: 10

---
"Knn search with standard byte vector matching against hex-encoded indexed docs" :
  - do:
      search:
        index: knn_hex_vector_index
        body:
          size: 3
          knn:
            field: my_vector_byte
            query_vector: [64, 10, -30]
            num_candidates: 100
            k: 10

  - match: { hits.total.value: 3 }
  - match: { hits.hits.0._id: "3" }
  - match: { hits.hits.1._id: "2" }
  - match: { hits.hits.2._id: "1" }
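
The tests above exercise three behaviours: a well-formed hex string decodes to signed bytes (which can then be widened to floats for a float field), a string with the wrong number of bytes fails the dimension check, and an odd-length string such as `'40af20a'` cannot be decoded at all. The server-side Java parsing is not shown in this section, so the following is only a rough Python sketch of the same checks. The function name `from_hex_vector` is hypothetical, and the error text merely echoes the message matched by the test's `catch` pattern.

```python
def from_hex_vector(text, dims):
    """Decode a hex string into signed bytes and check the dimension count.

    Hypothetical helper for illustration; it does not reproduce the PR's Java code.
    """
    # bytes.fromhex raises ValueError on odd length or non-hex characters.
    raw = bytes.fromhex(text)
    if len(raw) != dims:
        raise ValueError(
            f"the query vector has a different dimension [{len(raw)}] "
            f"than the index vectors [{dims}]"
        )
    # Reinterpret each unsigned byte (0..255) as a signed byte (-128..127).
    return [b - 256 if b > 127 else b for b in raw]


signed = from_hex_vector("400ae2", dims=3)
print(signed)                      # [64, 10, -30]
print([float(v) for v in signed])  # [64.0, 10.0, -30.0]
```

Widening to floats is lossless here because every decoded value already lies within the signed-byte range, which is condition (i) in the float-field test comment.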
@@ -0,0 +1,162 @@
setup:
  - skip:
      version: ' - 8.13.99'
      reason: 'hex encoding for byte vectors was added in 8.14'

  - do:
      indices.create:
        index: knn_hex_vector_index
        body:
          settings:
            number_of_shards: 1
          mappings:
            dynamic: false
            properties:
              my_vector_byte:
                type: dense_vector
                dims: 3
                index : true
                similarity : l2_norm
                element_type: byte
              my_vector_float:
                type: dense_vector
                dims: 3
                index: true
                element_type: float
                similarity : l2_norm

  # [-128, 127, 10] - is encoded as '807f0a'
  - do:
      index:
        index: knn_hex_vector_index
        id: "1"
        body:
          my_vector_byte: "807f0a"


  # [0, 1, 0] - is encoded as '000100'
  - do:
      index:
        index: knn_hex_vector_index
        id: "2"
        body:
          my_vector_byte: "000100"

  # [64, -10, -30] - is encoded as '40f6e2'
  - do:
      index:
        index: knn_hex_vector_index
        id: "3"
        body:
          my_vector_byte: "40f6e2"

  - do:
      index:
        index: knn_hex_vector_index
        id: "4"
        body:
          my_vector_float: [10.5, -10, 1024]

  - do:
      indices.refresh: {}

---
"Fail to index hex-encoded vector on float field":

  # [-128, 127, 10] - is encoded as '807f0a'
  - do:
      catch: /Failed to parse object./
      index:
        index: knn_hex_vector_index
        id: "5"
        body:
          my_vector_float: "807f0a"

---
"Knn query with hex string for float field" :
  # [64, 10, -30] - is encoded as '400ae2'
  # this will be properly decoded but only because:
  # (i) the provided input is compatible as the values are within [Byte.MIN_VALUE, Byte.MAX_VALUE] range
  # (ii) we do not differentiate between byte and float fields when initially parsing a query even for hex
  # (iii) we support expansion from byte to float

  - do:
      search:
        index: knn_hex_vector_index
        body:
          size: 3
          query:
            knn:
              field: my_vector_float
              query_vector: "400ae2"
              num_candidates: 100

  - match: { hits.total.value: 1 }
  - match: { hits.hits.0._id: "4" }

---
"Knn query with hex string for byte field" :
  # [64, 10, -30] - is encoded as '400ae2'
  - do:
      search:
        index: knn_hex_vector_index
        body:
          size: 3
          query:
            knn:
              field: my_vector_byte
              query_vector: "400ae2"
              num_candidates: 100

  - match: { hits.total.value: 3 }
  - match: { hits.hits.0._id: "3" }
  - match: { hits.hits.1._id: "2" }
  - match: { hits.hits.2._id: "1" }

---
"Knn query with hex string for byte field - dimensions mismatch" :
  # [64, 10, -30, 10] - is encoded as '400ae20a'
  - do:
      catch: /the query vector has a different dimension \[4\] than the index vectors \[3\]/
      search:
        index: knn_hex_vector_index
        body:
          size: 3
          query:
            knn:
              field: my_vector_byte
              query_vector: "400ae20a"
              num_candidates: 100

---
"Knn query with hex string for byte field - cannot decode string" :
  # '40af20a' is garbage :)
  - do:
      catch: /failed to parse field \[query_vector\]/
      search:
        index: knn_hex_vector_index
        body:
          size: 3
          query:
            knn:
              field: my_vector_byte
              query_vector: "40af20a"
              num_candidates: 100

---
"Knn query with standard byte vector matching against hex-encoded indexed docs" :
  - do:
      search:
        index: knn_hex_vector_index
        body:
          size: 3
          query:
            knn:
              field: my_vector_byte
              query_vector: [64, 10, -30]
              num_candidates: 100

  - match: { hits.total.value: 3 }
  - match: { hits.hits.0._id: "3" }
  - match: { hits.hits.1._id: "2" }
  - match: { hits.hits.2._id: "1" }
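
The expected hit order `3, 2, 1` in the byte-field tests can be sanity-checked by hand, since the field uses `l2_norm` similarity and a smaller distance to the query ranks higher. The sketch below recomputes squared Euclidean distances from the decoded document vectors in the setup comments. It is a plain brute-force check, not the approximate kNN search Elasticsearch actually runs; with three documents the two would be expected to agree.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Document vectors as decoded in the setup comments of the test file.
docs = {
    "1": [-128, 127, 10],  # '807f0a'
    "2": [0, 1, 0],        # '000100'
    "3": [64, -10, -30],   # '40f6e2'
}
query = [64, 10, -30]      # '400ae2'

# Nearest first: a smaller distance means a better match under l2_norm.
ranked = sorted(docs, key=lambda doc_id: squared_l2(docs[doc_id], query))
print(ranked)  # ['3', '2', '1']
```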
@@ -143,6 +143,7 @@ static TransportVersion def(int id) {
public static final TransportVersion ADD_DATA_STREAM_GLOBAL_RETENTION = def(8_603_00_0);
public static final TransportVersion ALLOCATION_STATS = def(8_604_00_0);
public static final TransportVersion ESQL_EXTENDED_ENRICH_TYPES = def(8_605_00_0);
public static final TransportVersion KNN_EXPLICIT_BYTE_QUERY_VECTOR_PARSING = def(8_606_00_0);

/*
* STOP! READ THIS FIRST! No, really,