-
Notifications
You must be signed in to change notification settings - Fork 28
Added UTF-8 validation #13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thanks for the PR. I'll take a look at it soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done the first pass through the PR and left a few comments. I'm going to dive deeper into the implementation next week.
Regarding the performance degradation you mentioned in the PR description:
I've put the utf8 validation in a stage zero method rather than the step method as it was significantly degrading the performance. The performance degradation seemed to be coming from having state, such as storing the previous vector or errors.
I've verified this, and indeed, plugging the validator into StructuralIndexer::step
causes performance degradation. In the compilation logs, I've seen that the JIT struggles with inlining Utf8Validator::validate
when it's called from the step
method, which could potentially be the cause of the performance drop.
Hi @Nostimo, just to let you know: I remember about this PR, but I've been busy with other tasks. I'll try to do the second pass as soon as I can. |
…SimdJsonParser::parse rather than the internal classes
I’m getting back to this. First, I need to read the paper to properly review the PR. My plan is as follows:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Nostimo I've left a few additional comments. Overall, you did a great job! Thank you for that.
…idths, code cleanup
Thank you, @Nostimo, for your contribution. I'm merging it. I'll add support for 512-bit vectors in a follow-up PR. |
I've put the utf8 validation in a stage zero method rather than the step method as it was significantly degrading the performance. The performance degradation seemed to be coming from having state, such as storing the previous vector or errors. Hopefully this will improve as the vector API comes out of incubation.
JMH benchmarks before change:
After:
Additional benchmarks of the SIMD utf8 validation vs utf8 validation from guava: