Feature/simdjson utils #58

Merged: 3 commits, Nov 1, 2020
@knapply (Collaborator) commented Nov 1, 2020

@eddelbuettel This is largely complete.

It adds is_valid_utf8(), is_valid_json(), and fminify() (I don't think there's a built-in way to do an fprettify() at the moment).

They're all vectorized (no more vapply(json, jsonlite::validate, logical(1L))!) and work on characters, raws, and lists of raws.
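As a rough illustration of the vectorized interface described above, a minimal sketch might look like the following. The input strings are made up for illustration and do not come from this PR; only the function names `is_valid_utf8()`, `is_valid_json()`, and `fminify()` are taken from the description, and the sketch assumes an RcppSimdJson build that includes them.

```r
# Illustrative inputs only (hypothetical, not from the PR or its test suite).
library(RcppSimdJson)

json <- c('{"a": [1, 2, 3]}', '{"b": ', '[ true, null ]')

# One call validates the whole character vector; no vapply() loop is needed.
valid <- is_valid_json(json)

# UTF-8 validation is vectorized in the same way.
utf8_ok <- is_valid_utf8(json)

# Minify only the elements that passed validation.
minified <- fminify(json[valid])

# The same functions also accept a raw vector (a single document) ...
raw_valid <- is_valid_json(charToRaw(json[[1L]]))

# ... or a list of raw vectors (one document per element).
raws_valid <- is_valid_json(lapply(json, charToRaw))
```

Filtering on `valid` before calling `fminify()` is a choice made for this sketch so that only well-formed documents are minified; how `fminify()` treats malformed input is not covered here.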

I need to step away and come back with fresh eyes, but all that should be needed is a fresh coat of paint on the documentation with some examples and to ensure the arguments are sufficiently validated (it's not quite as rigorous as fload()/fparse() yet).

Needless to say, the wizards working upstream have made everything obscenely fast...

all_files <- list.files(system.file("jsonexamples", package = "RcppSimdJson"),
                        recursive = TRUE, full.names = TRUE)
all_text <- vapply(all_files, function(.x) readChar(.x, file.size(.x)), character(1L))

microbenchmark::microbenchmark(
    simdjson = RcppSimdJson::is_valid_utf8(all_text),
    base = base::validUTF8(all_text),
    check = "identical"
)
#> Unit: microseconds
#>      expr      min        lq      mean   median        uq      max neval
#>  simdjson  172.679  177.3735  266.5388  210.508  299.5485  870.676   100
#>      base 2950.096 2974.0835 3249.8992 3155.039 3486.4265 4042.592   100




all_json_files <- list.files(system.file("jsonexamples", package = "RcppSimdJson"),
                             pattern = "\\.json$",
                             recursive = TRUE, full.names = TRUE)
all_json <- vapply(all_json_files, function(.x) readChar(.x, file.size(.x)), character(1L))

# validate single JSON string
microbenchmark::microbenchmark(
    simdjson = RcppSimdJson::is_valid_json(all_json[[1L]]),
    jsonify = jsonify::validate_json(all_json[[1L]]),
    jsonlite = jsonlite::validate(all_json[[1L]]),
    rjsonio = RJSONIO::isValidJSON(all_json[[1L]], asText = TRUE),
    check = "identical"
)
#> Registered S3 method overwritten by 'jsonlite':
#>   method     from   
#>   print.json jsonify
#> Unit: microseconds
#>      expr     min       lq      mean   median       uq     max neval
#>  simdjson  52.935  55.5880  57.88668  56.8310  58.8025  81.423   100
#>   jsonify 422.420 429.0455 443.11886 436.1015 443.3170 569.762   100
#>  jsonlite 585.439 593.1465 613.10092 602.6540 612.7265 703.933   100
#>   rjsonio 199.920 204.9015 213.77264 208.0955 213.8235 284.242   100

# validate many JSON strings
microbenchmark::microbenchmark(
    simdjson = RcppSimdJson::is_valid_json(all_json),
    jsonify = jsonify::validate_json(all_json),
    jsonlite = vapply(all_json, jsonlite::validate, logical(1L), USE.NAMES = FALSE),
    rjsonio = vapply(all_json, RJSONIO::isValidJSON, logical(1L), asText = TRUE, USE.NAMES = FALSE),
    check = "identical"
)
#> Unit: milliseconds
#>      expr       min        lq      mean    median        uq      max neval
#>  simdjson  2.642492  2.824298  3.369945  3.214475  3.503424 12.00913   100
#>   jsonify 13.668904 14.001894 14.977280 14.495898 15.279176 24.76140   100
#>  jsonlite 29.720667 31.003672 32.481978 32.183417 33.693321 38.51637   100
#>   rjsonio  6.834025  7.142605  7.701896  7.370767  7.895685 20.69166   100

# minify single JSON
microbenchmark::microbenchmark(
    simdjson = RcppSimdJson::fminify(all_json[[1L]]),
    jsonify = jsonify::minify_json(all_json[[1L]]),
    jsonlite = jsonlite::minify(all_json[[1L]])
)
#> Unit: microseconds
#>      expr     min       lq      mean   median        uq      max neval
#>  simdjson 241.586 253.3720  270.2938 260.8855  265.6235  538.340   100
#>   jsonify 851.640 896.4275  931.5250 914.9115  948.8345 1158.812   100
#>  jsonlite 911.802 962.1780 1006.2681 981.4720 1015.7915 1492.091   100

# minify many JSON
microbenchmark::microbenchmark(
    simdjson = RcppSimdJson::fminify(all_json),
    jsonify = vapply(all_json, jsonify::minify_json, character(1L), USE.NAMES = FALSE),
    jsonlite = vapply(all_json, jsonlite::minify, character(1L), USE.NAMES = FALSE)
)
#> Unit: milliseconds
#>      expr      min       lq     mean   median       uq      max neval
#>  simdjson 11.86990 12.60303 13.39678 13.14752 13.97499 18.53361   100
#>   jsonify 30.26910 31.48481 33.11033 32.93337 34.30477 38.01050   100
#>  jsonlite 41.14253 42.86855 44.70557 44.63316 46.02481 51.91243   100

codecov bot commented Nov 1, 2020

Codecov Report

Merging #58 into master will increase coverage by 1.66%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #58      +/-   ##
==========================================
+ Coverage   94.28%   95.95%   +1.66%     
==========================================
  Files          17       18       +1     
  Lines        1312     1408      +96     
==========================================
+ Hits         1237     1351     +114     
+ Misses         75       57      -18     

| Impacted Files | Coverage Δ |
|---|---|
| inst/include/RcppSimdJson/deserialize.hpp | 91.86% <ø> (+5.08%) ⬆️ |
| src/exported-utils.cpp | 100.00% <100.00%> (ø) |
| src/simdjson_example.cpp | 100.00% <100.00%> (+3.70%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0a2537d...1108268.

…overage on some older tests, and update ChangeLog.
@eddelbuettel (Owner) left a comment

Looks good as usual!

@eddelbuettel (Owner) commented

Do you want to flip it from draft to genuine PR? Or are there more parts you think are missing?

@knapply marked this pull request as ready for review November 1, 2020 20:25
@knapply (Collaborator, Author) commented Nov 1, 2020

> Do you want to flip it from draft to genuine PR? Or are there more parts you think are missing?

Sure thing. I was just waiting for the CI to finish, but we should be good.

@eddelbuettel (Owner) commented

Which seems to start in slo-mo these days. [ And because I have more than $THRESHOLD repos I can't even auto-migrate to travis-ci.com. Rock, meet hard place. ]

Any reason not to fold this up and ship it to CRAN? (After one more round of win-builder / rhub of course.)

@knapply (Collaborator, Author) commented Nov 1, 2020

Nope, I can't think of anything.

@eddelbuettel (Owner) commented

Alrighty, merging and moving right along then.

@eddelbuettel merged commit 993927d into master Nov 1, 2020
@eddelbuettel deleted the feature/simdjson-utils branch November 1, 2020 20:40
@eddelbuettel (Owner) commented

Wrapped up and shipped to CRAN. Tickled a 'needs human review' because (I think) the Windows box has no GitHub PAT and hits link-access limits 😿, and possibly also because of two existing build failures on the old box. I would expect it to fly through once they get to it, likely tomorrow morning (European hours).
