rcppsimdjson has inefficient network downloads #44

@lemire

I think that @melsiddieg was correct to complain about the performance of RcppSimdJson::fload. Something does not add up. Given how fast RcppSimdJson is, fload should be roughly as fast as curl::curl_download. But it is not!

> url<-"http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest/v4/hsapiens/feature/gene/TET1/snp?limit=200&skip=-1&skipCount=false&count=false&Output%20format=json&merge=false"
> res <- microbenchmark::microbenchmark(straight = curl::curl_download(url, tempfile()),
                                       jsonlite = jsonlite::fromJSON(url),
                                       simdjson = RcppSimdJson::fload(url),
                                       times = 5L)
> print(res)
Unit: milliseconds
     expr       min        lq      mean    median        uq       max neval
 straight  567.2850  568.8655  595.1786  570.3718  580.1068  689.2641     5
 jsonlite  714.3094  721.5960  733.4962  737.1303  744.9391  749.5061     5
 simdjson 2498.4776 2616.6448 2620.8897 2629.1768 2641.9882 2718.1610     5

The file is 784 KB. If you replace tempfile() with a file name and inspect the result, you will find that curl is indeed grabbing every byte.

So it seems that RcppSimdJson could go faster by invoking curl_download and then parsing the resulting temporary file. It is also possible to load the file directly into memory (curl_fetch_memory), but I did not use that as a benchmark since you might argue (rightly so) that it is cheating.
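
A minimal sketch of that download-then-parse idea, assuming the curl and RcppSimdJson packages are installed. The helper name fload_via_curl is hypothetical and not part of RcppSimdJson; it only illustrates the workaround and is not a proposed patch.

```r
# Hypothetical helper (not part of RcppSimdJson): let curl fetch the
# document to a temporary file, then let fload() parse the local copy.
fload_via_curl <- function(url, ...) {
  tmp <- tempfile(fileext = ".json")
  on.exit(unlink(tmp), add = TRUE)             # remove the temp file on exit
  curl::curl_download(url, tmp, quiet = TRUE)  # network transfer done by curl
  RcppSimdJson::fload(tmp, ...)                # parse from the local path
}
```

Adding fload_via_curl(url) as another expression in the microbenchmark call above would show whether the gap comes from the download step or from the parsing step.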

I am not 100% clear on why there is such a difference, but it does warrant investigation.
