Added UTF-8 validation #13

Merged
merged 7 commits into from
Nov 26, 2023

@Nostimo (Contributor) commented Aug 10, 2023

I've put the UTF-8 validation in a separate stage-zero method rather than in the step method, because validating inside step significantly degraded performance. The degradation seemed to come from having state, such as storing the previous vector or errors. Hopefully this will improve once the Vector API comes out of incubation.
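To illustrate why a validator called once per step has to keep state, here is a minimal scalar sketch. It is not the SIMD code from this PR: the class `ChunkedUtf8Sketch` and its methods are hypothetical, and the check is structural only (lead and continuation byte patterns), so it does not reject overlong encodings, surrogates, or code points above U+10FFFF. The point is that a multi-byte sequence can straddle a chunk boundary, so something must be carried from one step to the next.

```java
/**
 * Hypothetical, simplified illustration: a structural UTF-8 check fed one chunk at a time.
 * NOT a full validator (no overlong, surrogate, or range checks) and not the PR's code.
 */
final class ChunkedUtf8Sketch {
    // State carried across steps: continuation bytes still expected, plus a sticky error flag.
    private int pendingContinuations;
    private boolean error;

    /** Checks bytes in [from, to) and updates the carried state. */
    void step(byte[] buf, int from, int to) {
        for (int i = from; i < to; i++) {
            int b = buf[i] & 0xFF;
            if (pendingContinuations > 0) {
                // Inside a multi-byte sequence: only 10xxxxxx is allowed.
                if ((b & 0xC0) == 0x80) {
                    pendingContinuations--;
                } else {
                    error = true;
                }
            } else if (b >= 0x80) {
                // Start of a multi-byte sequence: derive the expected length from the lead byte.
                if ((b & 0xE0) == 0xC0) {
                    pendingContinuations = 1;
                } else if ((b & 0xF0) == 0xE0) {
                    pendingContinuations = 2;
                } else if ((b & 0xF8) == 0xF0) {
                    pendingContinuations = 3;
                } else {
                    error = true; // stray continuation byte or invalid lead byte
                }
            }
        }
    }

    /** True if no error was seen and the input did not end mid-sequence. */
    boolean finish() {
        return !error && pendingContinuations == 0;
    }

    /** Feeds the buffer in chunks of chunkSize (must be positive), carrying state between them. */
    static boolean validateInChunks(byte[] buf, int chunkSize) {
        ChunkedUtf8Sketch v = new ChunkedUtf8Sketch();
        for (int from = 0; from < buf.length; from += chunkSize) {
            v.step(buf, from, Math.min(buf.length, from + chunkSize));
        }
        return v.finish();
    }
}
```

With the input `a`, `0xC3`, `0xA9`, `b` and a chunk size of 2, the two-byte sequence is split across chunks. The check only succeeds because `pendingContinuations` survives from the first step to the second.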

JMH benchmarks before change:

| Benchmark | Throughput (ops/s) |
|---|---|
| ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_fastjson | 619.632 |
| ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jackson | 472.165 |
| ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jsoniter_scala | 1418.109 |
| ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjson | 1440.042 |
| ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjsonPadded | 1521.377 |
| ParseBenchmark.simdjson /twitter.json | 1535.270 |
| ParseBenchmark.simdjson /github_events.json | 17744.694 |
| ParseBenchmark.simdjsonPadded /twitter.json | 1685.552 |
| ParseBenchmark.simdjsonPadded /github_events.json | 18544.664 |

After:

| Benchmark | Throughput (ops/s) |
|---|---|
| ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_fastjson | 617.064 |
| ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jackson | 478.122 |
| ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jsoniter_scala | 1431.893 |
| ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjson | 1311.505 |
| ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjsonPadded | 1303.818 |
| ParseBenchmark.simdjson /twitter.json | 1428.842 |
| ParseBenchmark.simdjson /github_events.json | 16368.646 |
| ParseBenchmark.simdjsonPadded /twitter.json | 1458.352 |
| ParseBenchmark.simdjsonPadded /github_events.json | 16739.397 |
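For a quick read of the two result sets, the relative change per simdjson benchmark can be computed directly from the figures above. The small helper below is only a convenience sketch (the class name `ThroughputDelta` is made up for illustration). On these numbers the simdjson variants come out roughly 7 to 14 percent lower with validation enabled.

```java
import java.util.Locale;

/** Hypothetical helper: relative throughput change between the "before" and "after" runs. */
final class ThroughputDelta {
    /** Percentage change from before to after; negative means throughput dropped. */
    static double percentChange(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        // Figures copied from the before/after results in this comment (ops/s).
        String[] names = {
            "countUniqueUsersWithDefaultProfile_simdjson",
            "countUniqueUsersWithDefaultProfile_simdjsonPadded",
            "ParseBenchmark.simdjson /twitter.json",
            "ParseBenchmark.simdjson /github_events.json",
            "ParseBenchmark.simdjsonPadded /twitter.json",
            "ParseBenchmark.simdjsonPadded /github_events.json"
        };
        double[] before = {1440.042, 1521.377, 1535.270, 17744.694, 1685.552, 18544.664};
        double[] after = {1311.505, 1303.818, 1428.842, 16368.646, 1458.352, 16739.397};

        for (int i = 0; i < names.length; i++) {
            System.out.printf(Locale.ROOT, "%-52s %+.1f%%%n",
                    names[i], percentChange(before[i], after[i]));
        }
    }
}
```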

Additional benchmarks comparing the SIMD UTF-8 validator with Guava's UTF-8 validation:

| Benchmark | Throughput (ops/s) |
|---|---|
| Utf8ValidatorBenchmark.guava /twitter.json | 3802.453 |
| Utf8ValidatorBenchmark.guava /gsoc-2018.json | 1420.639 |
| Utf8ValidatorBenchmark.guava /github_events.json | 71402.918 |
| Utf8ValidatorBenchmark.utf8Validator /twitter.json | 16678.892 |
| Utf8ValidatorBenchmark.utf8Validator /gsoc-2018.json | 10601.135 |
| Utf8ValidatorBenchmark.utf8Validator /github_events.json | 185121.703 |
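When comparing validators like this, a slow but trusted reference is handy for cross-checking that a fast implementation accepts and rejects the same inputs. The sketch below uses the JDK's strict `CharsetDecoder` for that purpose. It is neither the validator from this PR nor Guava's, and the class name `ReferenceUtf8Check` is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical reference check built on the JDK decoder; intended for tests, not for speed. */
final class ReferenceUtf8Check {
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw on malformed input instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A test could run both this reference and the fast validator over the same valid and deliberately corrupted inputs and assert that they agree.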

@piotrrzysko (Member) commented:

Thanks for the PR. I'll take a look at it soon.

@piotrrzysko (Member) left a comment:

I've done the first pass through the PR and left a few comments. I'm going to dive deeper into the implementation next week.

Regarding the performance degradation you mentioned in the PR description:

> I've put the utf8 validation in a stage zero method rather than the step method as it was significantly degrading the performance. The performance degradation seemed to be coming from having state, such as storing the previous vector or errors.

I've verified this, and indeed, plugging the validator into StructuralIndexer::step causes performance degradation. In the compilation logs, I've seen that the JIT struggles with inlining Utf8Validator::validate when it's called from the step method, which could potentially be the cause of the performance drop.
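The comment does not say how the compilation logs were produced. One common way to look at HotSpot inlining decisions is the diagnostic flag `-XX:+PrintInlining`, which has to be unlocked with `-XX:+UnlockDiagnosticVMOptions`. The toy program below is only a stand-in for experimenting with that output: `InliningProbe` and its methods are hypothetical and unrelated to `StructuralIndexer` or `Utf8Validator`.

```java
/**
 * Hypothetical probe for inspecting JIT inlining decisions. One way to run it:
 *
 *   javac InliningProbe.java
 *   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining InliningProbe
 *
 * The log then lists, per compiled method, which callees were inlined and which were not.
 */
final class InliningProbe {
    /** Small helper that the JIT may or may not inline into step. */
    static int countNonAscii(byte[] chunk) {
        int n = 0;
        for (byte b : chunk) {
            if (b < 0) { // high bit set means a non-ASCII byte
                n++;
            }
        }
        return n;
    }

    /** Hot loop standing in for a per-chunk step method that calls the helper. */
    static long step(byte[] chunk, int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            total += countNonAscii(chunk);
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] chunk = new byte[64];
        chunk[3] = (byte) 0xC3;
        chunk[4] = (byte) 0xA9;
        // Enough iterations for the loop to get hot and be compiled.
        System.out.println(step(chunk, 1_000_000));
    }
}
```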

@piotrrzysko (Member) commented Sep 4, 2023

Hi @Nostimo, just to let you know: I haven't forgotten about this PR, but I've been busy with other tasks. I'll do the second pass as soon as I can.

@piotrrzysko (Member) commented:

I’m getting back to this. First, I need to read the paper to properly review the PR.

My plan is as follows:

  • Investigate if we can improve the current performance of the validator. Recently, support for 512-bit vectors has been added to the parser. It would be great to add it to the validator as well. The question is: would you like to do it? If you don't have time, I can work on it.
  • If we can't improve the performance, I think we should consider adding an option to disable the validator. Perhaps, in some cases, validation is unnecessary. For example, when we are certain that the data we are processing is already validated.
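The second point, an option to disable validation, could look roughly like the sketch below. Everything here is an assumption for illustration: `SketchParser` and its `validateUtf8` flag are hypothetical and not part of the library's API, the validation uses the JDK decoder rather than the SIMD validator, and the "parsing" is a placeholder.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

/** Hypothetical parser front end with an opt-out for UTF-8 validation. */
final class SketchParser {
    private final boolean validateUtf8;

    /** Pass false only when the caller knows the input has already been validated. */
    SketchParser(boolean validateUtf8) {
        this.validateUtf8 = validateUtf8;
    }

    /** Optionally validates, then "parses" the input; parsing is a placeholder here. */
    int parse(byte[] input) {
        if (validateUtf8 && !isValidUtf8(input)) {
            throw new IllegalArgumentException("Input is not valid UTF-8");
        }
        return input.length; // stand-in for the real parsing stages
    }

    private static boolean isValidUtf8(byte[] bytes) {
        try {
            // A freshly created decoder reports malformed input by throwing.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Keeping validation on by default and making the opt-out explicit means the faster path is only taken by callers who have consciously accepted the risk of unvalidated input.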

@piotrrzysko (Member) left a comment:

@Nostimo I've left a few additional comments. Overall, you did a great job! Thank you for that.

@piotrrzysko (Member) commented:

Thank you, @Nostimo, for your contribution. I'm merging it. I'll add support for 512-bit vectors in a follow-up PR.

@piotrrzysko piotrrzysko merged commit bc5e14d into simdjson:main Nov 26, 2023
@piotrrzysko piotrrzysko mentioned this pull request Nov 26, 2023
Squiry pushed a commit to Squiry/simdjson-java that referenced this pull request Feb 29, 2024