overhaul deserialize.cpp for extremely flexible queries, generalize decompression #45


Merged: 13 commits into master on Aug 9, 2020

Conversation

@knapply (Collaborator) commented Jul 24, 2020

addresses #43

Life got busy and I underestimated how tricky the more flexible queries would get.

Still WIP, but on track.

@knapply (Collaborator, Author) commented Aug 1, 2020

@eddelbuettel I (finally) got back to this.

I think the only point of concern you might have is regarding the parameters.

Originally, we had error_ok= and on_error=.

After approaching this a dozen different ways, I landed on separating parsing errors (where the JSON is simply invalid) and query errors (where a query doesn't return anything, but there's nothing wrong with the JSON itself).

With that in mind, I split them into parse_error_ok=/on_parse_error= and query_error_ok=/on_query_error=. So this would be a breaking change in that regard, as error_ok=/on_error= are gone.
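For illustration, a minimal sketch of the split. The parameter names come from this PR; the exact fallback behavior (returning the on_*= value when the corresponding *_ok= flag is TRUE) is an assumption:

```r
library(RcppSimdJson)

# Parse error: the JSON itself is invalid. Assumed behavior: with
# parse_error_ok = TRUE, the on_parse_error= value is returned
# instead of an error being thrown.
fparse("not json at all", parse_error_ok = TRUE, on_parse_error = NA)

# Query error: the JSON is valid, but the query matches nothing.
# Assumed behavior: with query_error_ok = TRUE, the on_query_error=
# value is returned.
fparse('{"a":1}', query = "b", query_error_ok = TRUE, on_query_error = NULL)
```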

Other than that, there's some cool query functionality and we can handle .gz, .bz2, and .xz compressed files.

I'd like to reinforce the tests and walk through everything again (and yes, hit the ChangeLog and such), but we're almost there.

@eddelbuettel (Owner) commented

I think we're not at a stage where we have to worry about breaking parameters. We are still evolving quite a bit.

Will try to take a look later or tomorrow but still have another burning issue to take care of myself. No rush. And still really lovely to see you chipping away at it and building something awesome.

@knapply knapply marked this pull request as ready for review August 7, 2020 20:35
@knapply knapply requested a review from eddelbuettel August 7, 2020 20:35
@knapply (Collaborator, Author) commented Aug 7, 2020

Naturally, I forgot NEWS/ChangeLog..... will fix ASAP

@eddelbuettel (Owner) left a comment

Ooof. That is a big one. Not pretending I looked line by line -- but it looks like pretty amazing and extensive work, once again.

I guess next is merge and platform testing...

@knapply (Collaborator, Author) commented Aug 8, 2020

I'm working on the ChangeLog and decided I need some better notes myself (which uncovered some refinements that I'm now fixing).

These are the "big-picture" ideas:

library(RcppSimdJson)

Better Queries

We can still pass a single query= that’s applied to each json= element.

json_to_query <- c(json1 = '["a",{"b":{"c": [[1,2,3],[4,5,6]]}}]',
                   json2 = '["a",{"b":{"c": [[7,8,9],[10,11,12]],"d":[[13,14,15,16],[17,18,19,20]]}}]')
#                                                                 ^^^ json1 doesn't have "d"
fparse(json_to_query, query = "1/b/c")
## $json1
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## 
## $json2
##      [,1] [,2] [,3]
## [1,]    7    8    9
## [2,]   10   11   12

But now we can also pass multiple “flat” queries (a named or unnamed character vector). Each element of query= is applied to all elements of json=.

This is the preferred method if each json= has roughly the same schema and we want to extract the same data from each of them.

fparse(json_to_query, query = c(query1 = "1/b/c",
                                query2 = "1/b/c/0",
                                query3 = "1/b/c/1"))
## $json1
## $json1$query1
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## 
## $json1$query2
## [1] 1 2 3
## 
## $json1$query3
## [1] 4 5 6
## 
## 
## $json2
## $json2$query1
##      [,1] [,2] [,3]
## [1,]    7    8    9
## [2,]   10   11   12
## 
## $json2$query2
## [1] 7 8 9
## 
## $json2$query3
## [1] 10 11 12

When we want to extract different data from each json=, such as when the schemata aren’t related, we can also specify a “nested” query. This is a list of character vectors that are applied in a zip-like fashion.

fparse(json_to_query,
       query = list(queries1 = c(c1 = "1/b/c/0",
                                 c2 = "1/b/c/1"),
                    queries2 = c(d1 = "1/b/d/0",
                                 d2 = "1/b/d/1")))
## $queries1
## $queries1$c1
## [1] 1 2 3
## 
## $queries1$c2
## [1] 4 5 6
## 
## 
## $queries2
## $queries2$d1
## [1] 13 14 15 16
## 
## $queries2$d2
## [1] 17 18 19 20

Compressed Files

We now handle .gz, .bz2, and .xz files that are decompressed to a raw vector (via memDecompress()).

.read_compress_write_load <- function(file_path, temp_dir) {
    types <- c("gzip", "bzip2", "xz")
    exts <- c("gz",    "bz2",   "xz")

    init <- readBin(file_path, n = file.size(file_path), what = "raw")
    
    mapply(function(type, ext) {
        target_path <- paste0(temp_dir, "/", basename(file_path), ".", ext)
        writeBin(memCompress(init, type = type), target_path)
        RcppSimdJson::fload(target_path)
    }, types, exts, SIMPLIFY = FALSE)
}

my_temp_dir <- sprintf("%s/rcppsimdjson-compressed-files", tempdir())
dir.create(my_temp_dir)
all_files <- dir(
    system.file("jsonexamples", package = "RcppSimdJson"),
    recursive = TRUE,
    pattern = "\\.json$",
    full.names = TRUE
)
names(all_files) <- basename(all_files)
res <- t(sapply(all_files, .read_compress_write_load, my_temp_dir))
unlink(my_temp_dir)

stopifnot(all(apply(
    res, 1L, 
    function(.x) identical(.x[[1]], .x[[2]]) && 
        identical(.x[[1]], .x[[3]])
)))

res
##                                       gzip          bzip2         xz           
## apache_builds.json                    List,15       List,15       List,15      
## github_events.json                    List,8        List,8        List,8       
## instruments.json                      List,9        List,9        List,9       
## mesh.json                             List,8        List,8        List,8       
## numbers.json                          Numeric,10001 Numeric,10001 Numeric,10001
## random.json                           List,4        List,4        List,4       
## adversarial.json                      List,1        List,1        List,1       
## demo.json                             List,1        List,1        List,1       
## flatadversarial.json                  List,2        List,2        List,2       
## che-1.geo.json                        List,2        List,2        List,2       
## che-2.geo.json                        List,2        List,2        List,2       
## che-3.geo.json                        List,2        List,2        List,2       
## google_maps_api_compact_response.json List,4        List,4        List,4       
## google_maps_api_response.json         List,4        List,4        List,4       
## twitter_api_compact_response.json     List,16       List,16       List,16      
## twitter_api_response.json             List,25       List,25       List,25      
## repeat.json                           List,4        List,4        List,4       
## smalldemo.json                        List,8        List,8        List,8       
## truenull.json                         Logical,2000  Logical,2000  Logical,2000 
## twitter_timeline.json                 List,21       List,21       List,21      
## twitter.json                          List,2        List,2        List,2       
## twitterescaped.json                   List,2        List,2        List,2       
## update-center.json                    List,6        List,6        List,6

Smarter URL Handling

With compressed files supported, we can better leverage the compressed_download= parameter.

Additionally, remote JSON files are now downloaded concurrently when getOption("download.file.method", default = "auto") == "libcurl".

json_urls <- c(
    "https://raw.githubusercontent.com/eddelbuettel/rcppsimdjson/master/inst/jsonexamples/small/smalldemo.json",
    "https://raw.githubusercontent.com/eddelbuettel/rcppsimdjson/master/inst/jsonexamples/small/demo.json"
)
my_temp_dir <- sprintf("%s/rcppsimdjson-downloads", tempdir())
dir.create(my_temp_dir)

fload(json_urls,
      query = list(c(width = "Thumbnail/Width", 
                     height = "Thumbnail/Height"),
                   c(width = "Image/Thumbnail/Width", 
                     height = "Image/Thumbnail/Height")),
      temp_dir = my_temp_dir,
      keep_temp_files = TRUE,
      compressed_download = TRUE)
## $smalldemo.json
## $smalldemo.json$width
## [1] 100
## 
## $smalldemo.json$height
## [1] 125
## 
## 
## $demo.json
## $demo.json$width
## [1] 100
## 
## $demo.json$height
## [1] 125
list.files(my_temp_dir)
## [1] "demo52a53d7b405d.json.gz"      "smalldemo52a53feccca8.json.gz"

Lurking Windows Trap Fixed

Windows was mangling non-ASCII UTF-8 strings.

The issue and fix are essentially the same as SymbolixAU/jsonify#57, and a test is now present that uses a mix of 1-to-4-byte UTF-8 characters.

extended_unicode <- '"լ ⿕  ٷ 豈 ٸ 㐀 ٹ 丂 Ɗ 一 á ٵ ̝ ѵ ̇ ˥ ɳ Ѡ · վ  й ף ޑ  ц Ґ  ӎ Љ ß ϧ ͎ ƽ ޜ է ϖ y Î վ Ο Ӊ ٻ ʡ ө ȭ ˅ ޠ ɧ ɻ ث ́ ܇ ܧ ɽ Ո 戸 Ð 坮 ٳ 䔢 찅 곂 묨 ß ᇂ ƻ 䏐 ܄ 㿕 ս ّ 昩 僫 똠 Ɯ ٰ É"'
fparse(extended_unicode)
## [1] "լ ⿕  ٷ 豈 ٸ 㐀 ٹ 丂 Ɗ 一 á ٵ ̝ ѵ ̇ ˥ ɳ Ѡ · վ  й ף ޑ  ц Ґ  ӎ Љ ß ϧ ͎ ƽ ޜ է ϖ y Î վ Ο Ӊ ٻ ʡ ө ȭ ˅ ޠ ɧ ɻ ث ́ ܇ ܧ ɽ Ո 戸 Ð 坮 ٳ 䔢 찅 곂 묨 ß ᇂ ƻ 䏐 ܄ 㿕 ս ّ 昩 僫 똠 Ɯ ٰ É"
fparse(charToRaw(extended_unicode))
## [1] "լ ⿕  ٷ 豈 ٸ 㐀 ٹ 丂 Ɗ 一 á ٵ ̝ ѵ ̇ ˥ ɳ Ѡ · վ  й ף ޑ  ц Ґ  ӎ Љ ß ϧ ͎ ƽ ޜ է ϖ y Î վ Ο Ӊ ٻ ʡ ө ȭ ˅ ޠ ɧ ɻ ث ́ ܇ ܧ ɽ Ո 戸 Ð 坮 ٳ 䔢 찅 곂 묨 ß ᇂ ƻ 䏐 ܄ 㿕 ս ّ 昩 僫 똠 Ɯ ٰ É"

@eddelbuettel eddelbuettel merged commit f8b48f1 into master Aug 9, 2020
eddelbuettel added a commit that referenced this pull request Aug 9, 2020
commit 342b08c19ecf2bea802be2426665222142c73e9f
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date:   Sat Aug 8 17:12:35 2020 -0700

    more clean up, update ChangeLog/NEWS, add notes

commit 62d74ff
Author: Brendan <brendan.g.knapp@gmail.com>
Date:   Fri Aug 7 12:36:58 2020 -0700

    add more encoding tests, verify on windows system

commit 89f5bf8
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date:   Fri Aug 7 07:45:20 2020 -0700

    more clean up

commit c31c15d
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date:   Thu Aug 6 17:46:31 2020 -0700

    comfirmed windows string mangling... checking likely solution

commit 6e41d18
Merge: a16908c 4e8337d
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date:   Thu Aug 6 16:41:57 2020 -0700

    fix merge

    Merge branch 'experimental/pointer' of https://github.com/eddelbuettel/rcppsimdjson into experimental/pointer

    # Conflicts:
    #	inst/include/RcppSimdJson/deserialize.hpp
    #	inst/include/RcppSimdJson/utils.hpp
    #	inst/tinytest/test_fparse_fload.R

commit a16908c
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date:   Thu Aug 6 16:30:22 2020 -0700

    rebase, more cleaning, add likely fix/check for potential Windows encoding issue

commit 2b5bf82
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date:   Sun Aug 2 10:25:26 2020 -0700

    cleaning up structure and vestigial junk

commit dac90d8
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date:   Sat Aug 1 15:47:46 2020 -0700

    queries and compressed files passing

commit 289b66f
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date:   Thu Jul 23 20:34:46 2020 -0700

    overhaul deserialize.cpp for extremely flexible querie, generalize decompression

commit f1088bb
Author: Dirk Eddelbuettel <edd@debian.org>
Date:   Wed Aug 5 07:05:11 2020 -0500

    fix README thinko s/fparse/fload/ (closes #46)

commit 4e8337d
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date:   Sun Aug 2 10:25:26 2020 -0700

    cleaning up structure and vestigial junk

commit 6120334
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date:   Sat Aug 1 15:47:46 2020 -0700

    queries and compressed files passing

commit df7e711
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date:   Thu Jul 23 20:34:46 2020 -0700

    overhaul deserialize.cpp for extremely flexible querie, generalize decompression

commit 6e4a27c
Author: Dirk Eddelbuettel <edd@debian.org>
Date:   Thu Jul 16 13:34:18 2020 -0500

    updated changelog, rolled minor version

    also ran M-x untabify on ChangeLog so unholy amount of whitespace change
@eddelbuettel eddelbuettel deleted the experimental/pointer branch August 11, 2020 16:17