arrow backend #197

Open
JanMarvin wants to merge 5 commits into jeroen:master from JanMarvin:arrow_backend

@JanMarvin
Contributor

Hi @jeroen,

This is a draft of an Arrow backend. I tried to integrate what I experimented with yesterday in #6. It passes the tests, but not everything can be done with this Arrow backend. Since the same thing can be done without any PR, this is just elaborate toying around; I have no real-world use case for it.

# Arrow backend: get() returns a tibble
ct <- V8::v8(backend = "arrow")
ct$source("https://unpkg.com/underscore@1.13.7/underscore-min.js")
#> [1] "true"
ct$assign("flights", nycflights13::flights)
js_code <- 'var filtered = _.filter(flights, x => x.arr_delay > 720);'
ct$eval(js_code)
ct$get("filtered")[1:5, 1:8]
#> # A tibble: 5 × 8
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>          <dbl>
#> 1  2013     1     1      848           1835       853     1001           1950
#> 2  2013     1     9      641            900      1301     1242           1530
#> 3  2013     1    10     1121           1635      1126     1239           1810
#> 4  2013    11     3      603           1645       798      829           1913
#> 5  2013    12     5      756           1700       896     1058           2020

# Default backend: get() returns a plain data.frame
ct <- V8::v8()
ct$source("https://unpkg.com/underscore@1.13.7/underscore-min.js")
#> [1] "true"
ct$assign("flights", nycflights13::flights)
js_code <- 'var filtered = _.filter(flights, x => x.arr_delay > 720);'
ct$eval(js_code)
ct$get("filtered")[1:5, 1:8]
#>   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> 1 2013     1   1      848           1835       853     1001           1950
#> 2 2013     1   9      641            900      1301     1242           1530
#> 3 2013     1  10     1121           1635      1126     1239           1810
#> 4 2013    11   3      603           1645       798      829           1913
#> 5 2013    12   5      756           1700       896     1058           2020

@jeroen
Owner

jeroen commented Nov 24, 2025

Wow this is pretty cool, thanks! Going to tinker with this some more when I have time. I wonder if we can even go further and share the in-memory data object without copying.

@JanMarvin
Contributor Author

Hm, I tried a bit, but apparently this is not so simple due to the sandbox in V8. Every hint I could find caused the sandbox to complain and terminate the session because of a potential security violation.

But I asked Gemini to check the existing functions, and this is what I see:

tmp <- tempfile(fileext = ".js")
curl::curl_download(
  url = "https://unpkg.com/underscore@1.13.7/underscore-min.js",
  destfile = tmp)

test <- function(backend = NULL, tmp = NULL) {
  ct <- V8::v8(backend = backend)
  ct$source(tmp)
  ct$call("_.filter", nycflights13::flights, V8::JS("function(x){return x.arr_delay > 720}"))
}

res <- microbenchmark::microbenchmark(
  test(backend = "arrow", tmp = tmp),
  test(backend = "jsonlite", tmp = tmp),
  times = 25, unit = "ms"
); res
#> Warning in microbenchmark::microbenchmark(test(backend = "arrow", tmp = tmp), :
#> less accurate nanosecond times to avoid potential integer overflows
#> Unit: milliseconds
#>                                   expr       min        lq      mean   median
#>     test(backend = "arrow", tmp = tmp)  287.8694  303.9558  354.6052  316.310
#>  test(backend = "jsonlite", tmp = tmp) 2621.3725 2676.6829 2761.8196 2703.873
#>         uq      max neval
#>   322.2923 1267.931    25
#>  2750.2416 4039.942    25

But the following is a better comparison. In the benchmark above, the call function is tweaked to avoid constructing a function with the entire data included in its body. The following should be a fair one-to-one comparison: both backends create identical objects in V8, so they are directly comparable. To get there, the arrow backend creates a JavaScript table from the Arrow table read from the IPC stream on assign, and on get it builds an Arrow table again and returns it as an IPC stream.

test <- function(backend = NULL) {
  ct <- V8::v8(backend = backend)
  ct$assign("flights", nycflights13::flights)
  ct$get("flights")
}

res <- microbenchmark::microbenchmark(
  test(backend = "arrow"),
  test(backend = "jsonlite"),
  times = 5, unit = "ms"
); res
#> Warning in microbenchmark::microbenchmark(test(backend = "arrow"), test(backend
#> = "jsonlite"), : less accurate nanosecond times to avoid potential integer
#> overflows
#> Unit: milliseconds
#>                        expr      min       lq     mean   median       uq
#>     test(backend = "arrow") 6986.343 7040.548 7451.106 7641.929 7677.567
#>  test(backend = "jsonlite") 7421.892 7603.722 7905.523 7674.219 7795.122
#>       max neval
#>  7909.143     5
#>  9032.658     5
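
To make "identical objects in V8" concrete, here is a minimal sketch of the kind of expansion an assign would need. It assumes the object in V8 is an array of row objects, which is what the `_.filter(flights, x => ...)` calls in this thread operate on. `columnsToRows` is a hypothetical helper and the sample values are made up; this is not the code from the PR.

```javascript
// Hypothetical helper, not taken from the PR: expand column-oriented data
// (one array per column) into an array of row objects.
function columnsToRows(columns) {
  const names = Object.keys(columns);
  const nrow = names.length === 0 ? 0 : columns[names[0]].length;
  const rows = new Array(nrow);
  for (let i = 0; i < nrow; i++) {
    const row = {};
    for (const name of names) {
      row[name] = columns[name][i];
    }
    rows[i] = row;
  }
  return rows;
}

// Made-up sample values, for illustration only.
const sampleColumns = { year: [2013, 2013, 2013], arr_delay: [11, 851, 1272] };
const sampleRows = columnsToRows(sampleColumns);

// Plain-JS equivalent of the underscore filter used in this thread.
const delayed = sampleRows.filter(x => x.arr_delay > 720);
```

Once the data is in this row-wise shape, user code such as the underscore filter behaves the same regardless of which backend transported the data.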

@JanMarvin
Contributor Author

Getting the data is the slow part, probably because the conversion from the JavaScript table to an Arrow table happens in JS. It might be faster with a WASM build of Arrow.

test <- function(backend = NULL) {
  ct <- V8::v8(backend = backend)
  ct$assign("flights", nycflights13::flights)
  # ct$get("flights")
}

res <- microbenchmark::microbenchmark(
  test(backend = "arrow"),
  test(backend = "jsonlite"),
  times = 5, unit = "ms"
); res
#> Warning in microbenchmark::microbenchmark(test(backend = "arrow"), test(backend
#> = "jsonlite"), : less accurate nanosecond times to avoid potential integer
#> overflows
#> Unit: milliseconds
#>                        expr      min        lq      mean    median        uq
#>     test(backend = "arrow")  180.601  186.3381  384.6102  195.7693  219.9445
#>  test(backend = "jsonlite") 2322.650 2326.7051 2604.6440 2390.2020 2410.7080
#>       max neval
#>  1140.398     5
#>  3572.955     5
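
For illustration, the conversion back from a JavaScript table to a columnar layout has to touch every cell of every row. The sketch below shows that pivot in plain JS. It assumes the table is an array of row objects that all share the keys of the first row; `rowsToColumns` is a hypothetical helper with made-up sample values, not the PR's implementation, and the actual Arrow table construction is left out.

```javascript
// Hypothetical helper, not taken from the PR: pivot an array of row objects
// into one array per column. Column names are taken from the first row.
function rowsToColumns(rows) {
  const columns = {};
  if (rows.length === 0) return columns;
  const names = Object.keys(rows[0]);
  for (const name of names) {
    columns[name] = new Array(rows.length);
  }
  for (let i = 0; i < rows.length; i++) {
    for (const name of names) {
      const value = rows[i][name];
      // Represent a missing cell as null so every column keeps full length.
      columns[name][i] = value === undefined ? null : value;
    }
  }
  return columns;
}

// Made-up sample values, for illustration only.
const exampleRows = [
  { year: 2013, arr_delay: 851 },
  { year: 2013, arr_delay: 1272 },
  { year: 2013 },
];
const exampleColumns = rowsToColumns(exampleRows);
```

The loop performs one property lookup per cell, so the work grows with rows times columns, which is in line with get being the slow half of the round trip.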

@JanMarvin
Contributor Author

Avoiding JSON speeds things up:

test <- function(backend = NULL) {
  ct <- V8::v8(backend = backend)
  ct$assign("flights", nycflights13::flights)
  ct$get("flights")
}

res <- microbenchmark::microbenchmark(
  test(backend = "arrow"),
  test(backend = "jsonlite"),
  times = 5, unit = "ms"
); res
#> Warning in microbenchmark::microbenchmark(test(backend = "arrow"), test(backend
#> = "jsonlite"), : less accurate nanosecond times to avoid potential integer
#> overflows
#> Unit: milliseconds
#>                        expr      min       lq     mean   median       uq
#>     test(backend = "arrow") 4645.056 4645.183 4741.127 4655.773 4667.055
#>  test(backend = "jsonlite") 7717.207 8181.782 8765.938 8965.483 9028.921
#>       max neval
#>  5092.569     5
#>  9936.294     5
