overhaul deserialize.cpp for extremely flexible queries, generalize decompression #45
@eddelbuettel I (finally) got back to this. I think the only point of concern you might have is regarding the parameters. Originally, we had After approaching this a dozen different ways, I landed on separating parsing errors (where the JSON is simply invalid) and query errors (where a query doesn't return anything, but there's nothing wrong with the JSON itself). With that in mind, I split them into Other than that, there's some cool query functionality and we can handle .gz, .bz2, and .xz compressed files. I'd like to reinforce the tests and re-walk through (and yes, hit the ChangeLog and such), but almost there.
I think we're not at a stage where we have to worry about breaking params. We are still evolving quite a bit. Will try to take a look later or tomorrow but still have another burning issue to take care of myself. No rush. And still really lovely to see you chipping away at it and building something awesome.
Merge branch 'experimental/pointer' of https://github.com/eddelbuettel/rcppsimdjson into experimental/pointer

# Conflicts:
#   inst/include/RcppSimdJson/deserialize.hpp
#   inst/include/RcppSimdJson/utils.hpp
#   inst/tinytest/test_fparse_fload.R
Naturally, I forgot NEWS/ChangeLog... will fix ASAP
Ooof. That is a big one. Not pretending I looked line by line -- but it looks like pretty amazing and extensive work, once again.
I guess next is merge and platform testing...
I'm working on the ChangeLog and decided I need some better notes myself (which uncovered some refinements that I'm now fixing). These are the "big-picture" ideas:

library(RcppSimdJson)

Better Queries

We can still pass a single

json_to_query <- c(json1 = '["a",{"b":{"c": [[1,2,3],[4,5,6]]}}]',
                   json2 = '["a",{"b":{"c": [[7,8,9],[10,11,12]],"d":[[13,14,15,16],[17,18,19,20]]}}]')
# ^^^ json1 doesn't have "d"

fparse(json_to_query, query = "1/b/c")

But now we can also pass multiple “flat” queries (a named or unnamed character vector). Each element of This is the preferred method if each

fparse(json_to_query, query = c(query1 = "1/b/c",
                                query2 = "1/b/c/0",
                                query3 = "1/b/c/1"))

When we want to extract different data from each

fparse(json_to_query,
       query = list(queries1 = c(c1 = "1/b/c/0",
                                 c2 = "1/b/c/1"),
                    queries2 = c(d1 = "1/b/d/0",
                                 d2 = "1/b/d/1")))
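
To make the query shapes above concrete, here is a rough base-R sketch of how a "/"-separated path could be followed and how flat versus nested queries might pair up with the JSON elements. Everything in it is an assumption for illustration: `docs` is a hand-built stand-in for the parsed documents, `follow()` is a hypothetical helper, and the pairing rules are inferred from the examples in this comment. It is not how RcppSimdJson implements queries (that happens on the simdjson side in C++).

```r
# Hand-built stand-ins for the two parsed documents (assumption: JSON arrays
# become lists or vectors, JSON objects become named lists).
docs <- list(
  json1 = list("a", list(b = list(c = list(c(1, 2, 3), c(4, 5, 6))))),
  json2 = list("a", list(b = list(
    c = list(c(7, 8, 9), c(10, 11, 12)),
    d = list(c(13, 14, 15, 16), c(17, 18, 19, 20))
  )))
)

# Hypothetical helper: walk a "/"-separated path. All-digit tokens are treated
# as zero-based array positions, anything else as an object key.
follow <- function(doc, query) {
  for (token in strsplit(query, "/", fixed = TRUE)[[1]]) {
    key <- if (grepl("^[0-9]+$", token)) as.integer(token) + 1L else token
    doc <- doc[[key]]
  }
  doc
}

# Flat queries: every query is applied to every document.
flat <- c(query1 = "1/b/c/0", query2 = "1/b/c/1")
flat_result <- lapply(docs, function(doc) lapply(flat, follow, doc = doc))

# Nested queries: the i-th set of queries is applied only to the i-th document,
# so "d" is only ever requested from json2.
nested <- list(queries1 = c(c1 = "1/b/c/0", c2 = "1/b/c/1"),
               queries2 = c(d1 = "1/b/d/0", d2 = "1/b/d/1"))
nested_result <- Map(function(doc, queries) lapply(queries, follow, doc = doc),
                     docs, nested)

str(flat_result)
str(nested_result)
```

The nested form is what lets the second document be asked for "d" without the first one, which lacks it, producing a query error.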

Compressed Files

We now handle .gz, .bz2, and .xz files that are decompressed to a raw vector (via

.read_compress_write_load <- function(file_path, temp_dir) {
    types <- c("gzip", "bzip2", "xz")
    exts <- c("gz", "bz2", "xz")
    init <- readBin(file_path, n = file.size(file_path), what = "raw")
    mapply(function(type, ext) {
        target_path <- paste0(temp_dir, "/", basename(file_path), ".", ext)
        writeBin(memCompress(init, type = type), target_path)
        RcppSimdJson::fload(target_path)
    }, types, exts, SIMPLIFY = FALSE)
}

my_temp_dir <- sprintf("%s/rcppsimdjson-compressed-files", tempdir())
dir.create(my_temp_dir)

all_files <- dir(
    system.file("jsonexamples", package = "RcppSimdJson"),
    recursive = TRUE,
    pattern = "\\.json$",
    full.names = TRUE
)
names(all_files) <- basename(all_files)

res <- t(sapply(all_files, .read_compress_write_load, my_temp_dir))

# recursive = TRUE is required: unlink() does not delete directories otherwise
unlink(my_temp_dir, recursive = TRUE)

stopifnot(all(apply(
    res, 1L,
    function(.x) identical(.x[[1]], .x[[2]]) &&
        identical(.x[[1]], .x[[3]])
)))

res
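
The decompression step itself can be tried in isolation. The sketch below assumes in-memory decompression along the lines of base R's `memDecompress()`; the mechanism the package actually uses is not shown in this comment. It round-trips a small JSON payload through the three supported formats. `compression_type()` is a hypothetical helper for picking a type from a file extension, not part of RcppSimdJson.

```r
# A small JSON payload as raw bytes, standing in for a file's contents.
json_raw <- charToRaw('{"a":[1,2,3],"b":"text"}')

# Extension -> compression type as understood by memCompress()/memDecompress().
types <- c(gz = "gzip", bz2 = "bzip2", xz = "xz")

# Compress and decompress in memory, checking that the bytes survive intact.
round_trip <- vapply(types, function(type) {
  compressed <- memCompress(json_raw, type = type)
  restored <- memDecompress(compressed, type = type)
  identical(restored, json_raw)
}, logical(1))

# Hypothetical helper: choose a decompression type from the file extension,
# falling back to "none" for anything unrecognized.
compression_type <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext, gz = "gzip", bz2 = "bzip2", xz = "xz", "none")
}

round_trip
compression_type("demo.json.xz")
```

Once the bytes are back in a raw vector, they can be handed to a parser that accepts raw input, which is the reason for decompressing to a raw vector rather than to a temporary file.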

Smarter URL Handling

With compressed files supported, we can better leverage the Additionally, remote JSON files are now downloaded simultaneously

json_urls <- c(
    "https://raw.githubusercontent.com/eddelbuettel/rcppsimdjson/master/inst/jsonexamples/small/smalldemo.json",
    "https://raw.githubusercontent.com/eddelbuettel/rcppsimdjson/master/inst/jsonexamples/small/demo.json"
)

my_temp_dir <- sprintf("%s/rcppsimdjson-downloads", tempdir())
dir.create(my_temp_dir)

fload(json_urls,
      query = list(c(width = "Thumbnail/Width",
                     height = "Thumbnail/Height"),
                   c(width = "Image/Thumbnail/Width",
                     height = "Image/Thumbnail/Height")),
      temp_dir = my_temp_dir,
      keep_temp_files = TRUE,
      compressed_download = TRUE)

list.files(my_temp_dir)
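
For readers curious what "downloaded simultaneously" can look like in plain R, here is a minimal sketch. It relies on the documented behavior that `download.file()` with `method = "libcurl"` accepts vectors of URLs and destination files. The helper names `build_dest_paths()` and `download_all()` are hypothetical, and whether `fload()` does it this way internally is an assumption.

```r
# Hypothetical helper: one destination path per URL inside `temp_dir`, made
# unique so two URLs with the same file name do not overwrite each other.
build_dest_paths <- function(urls, temp_dir) {
  file.path(temp_dir, make.unique(basename(urls), sep = "-"))
}

# Hypothetical wrapper (defined but not called here, since it needs network
# access). With method = "libcurl", download.file() takes vectors of URLs and
# destination files and fetches them in one call.
download_all <- function(urls, temp_dir) {
  dest <- build_dest_paths(urls, temp_dir)
  status <- download.file(urls, dest, method = "libcurl", quiet = TRUE)
  if (status != 0L) stop("at least one download failed")
  dest
}

# Offline check of the path handling only; example.com URLs are placeholders.
build_dest_paths(
  c("https://example.com/x/demo.json", "https://example.com/y/demo.json"),
  "downloads"
)
```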

Lurking Windows Trap Fixed

Windows was mangling non-ASCII UTF-8. The issue/fix are essentially the same as SymbolixAU/jsonify#57 and it’s now tested (rather, a test is present) that uses a mix of 1-4 byte characters.

extended_unicode <- '"լ ⿕ ٷ 豈 ٸ 㐀 ٹ 丂 Ɗ 一 á ٵ ̝ ѵ ̇ ˥ ɳ Ѡ · վ й ף ޑ ц Ґ ӎ Љ ß ϧ ͎ ƽ ޜ է ϖ y Î վ Ο Ӊ ٻ ʡ ө ȭ ˅ ޠ ɧ ɻ ث ́ ܇ ܧ ɽ Ո 戸 Ð 坮 ٳ 䔢 찅 곂 묨 ß ᇂ ƻ 䏐 ܄ 㿕 ս ّ 昩 僫 똠 Ɯ ٰ É"'

fparse(extended_unicode)
fparse(charToRaw(extended_unicode))
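
As a side note on why the test mixes 1-4 byte characters, the following base-R sketch builds a string from raw UTF-8 bytes, marks it explicitly as UTF-8, and counts the bytes per code point. It only illustrates the encoding widths involved; it makes no claim about the actual C++ fix, and the four sample characters are chosen for illustration.

```r
# UTF-8 bytes for "y" (1 byte), U+00E1 (2 bytes), U+4E00 (3 bytes) and
# U+1F600 (4 bytes), written out explicitly so the source file's own
# encoding does not matter.
bytes <- as.raw(c(0x79,
                  0xc3, 0xa1,
                  0xe4, 0xb8, 0x80,
                  0xf0, 0x9f, 0x98, 0x80))

x <- rawToChar(bytes)
# Without an explicit mark, the bytes would be read in the native encoding,
# which is where non-UTF-8 locales can mangle them.
Encoding(x) <- "UTF-8"

code_points <- utf8ToInt(x)

# Number of UTF-8 bytes needed for each code point.
bytes_per_char <- vapply(
  code_points,
  function(cp) length(charToRaw(intToUtf8(cp))),
  integer(1)
)

code_points
bytes_per_char
```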
commit 342b08c19ecf2bea802be2426665222142c73e9f
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date: Sat Aug 8 17:12:35 2020 -0700

    more clean up, update ChangeLog/NEWS, add notes

commit 62d74ff
Author: Brendan <brendan.g.knapp@gmail.com>
Date: Fri Aug 7 12:36:58 2020 -0700

    add more encoding tests, verify on windows system

commit 89f5bf8
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date: Fri Aug 7 07:45:20 2020 -0700

    more clean up

commit c31c15d
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date: Thu Aug 6 17:46:31 2020 -0700

    comfirmed windows string mangling... checking likely solution

commit 6e41d18
Merge: a16908c 4e8337d
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date: Thu Aug 6 16:41:57 2020 -0700

    fix merge

    Merge branch 'experimental/pointer' of https://github.com/eddelbuettel/rcppsimdjson into experimental/pointer

    # Conflicts:
    #   inst/include/RcppSimdJson/deserialize.hpp
    #   inst/include/RcppSimdJson/utils.hpp
    #   inst/tinytest/test_fparse_fload.R

commit a16908c
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date: Thu Aug 6 16:30:22 2020 -0700

    rebase, more cleaning, add likely fix/check for potential Windows encoding issue

commit 2b5bf82
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date: Sun Aug 2 10:25:26 2020 -0700

    cleaning up structure and vestigial junk

commit dac90d8
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date: Sat Aug 1 15:47:46 2020 -0700

    queries and compressed files passing

commit 289b66f
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date: Thu Jul 23 20:34:46 2020 -0700

    overhaul deserialize.cpp for extremely flexible querie, generalize decompression

commit f1088bb
Author: Dirk Eddelbuettel <edd@debian.org>
Date: Wed Aug 5 07:05:11 2020 -0500

    fix README thinko s/fparse/fload/ (closes #46)

commit 4e8337d
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date: Sun Aug 2 10:25:26 2020 -0700

    cleaning up structure and vestigial junk

commit 6120334
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date: Sat Aug 1 15:47:46 2020 -0700

    queries and compressed files passing

commit df7e711
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date: Thu Jul 23 20:34:46 2020 -0700

    overhaul deserialize.cpp for extremely flexible querie, generalize decompression

commit 6e4a27c
Author: Dirk Eddelbuettel <edd@debian.org>
Date: Thu Jul 16 13:34:18 2020 -0500

    updated changelog, rolled minor version

    also ran M-x untabify on ChangeLog so unholy amount of whitespace change
addresses #43
Life got busy and I underestimated how tricky the more flexible queries would get.
Still WIP, but on track.