
Conversation

@alexmgns (Collaborator) commented Sep 3, 2025

Hi @rafapereirabr,

In the current state I have only implemented travel_time_matrix() to use Arrow. To achieve this I had to temporarily disable expanded_travel_time_matrix(), which is why the tests won't pass.

We also now have to include Arrow as part of the bundled jar. A side effect is that its size grows from 0.15 MB to 6 MB. Just something to note, not an issue.

@rafapereirabr (Member) commented:

Hi Alex. I've been able to run a few performance tests. See my reprex and a couple of comments / questions below.

Reprex

options(java.parameters = "-Xmx20G")

devtools::load_all(".")
library(bench)

path <- system.file("extdata/poa", package = "r5r")
r5r_network <- setup_r5(data_path = path, verbose = FALSE)

# load origin/destination points and set arguments
points <- read.csv(system.file("extdata/poa/poa_hexgrid.csv", package = "r5r"))
points <- rbind(points, points, points, points)
mode <- c("WALK", "TRANSIT")
max_walk_time <- 30   # minutes
max_trip_duration <- 60 # minutes
departure_datetime <- as.POSIXct("13-05-2019 14:00:00",
                                 format = "%d-%m-%Y %H:%M:%S")

bench::system_time(iterations = 1,
  ttm <- travel_time_matrix(r5r_network = r5r_network,
                            origins = points,
                            destinations = points,
                            mode = mode,
                            departure_datetime = departure_datetime,
                            max_walk_time = max_walk_time,
                            max_trip_duration = max_trip_duration,
                            progress = TRUE)
  )

Results

TLDR: the Arrow implementation is faster, but currently the difference is not huge.

P.S. I think bench::system_time() only captures memory use on the R side, so the results below may not reflect the memory used by Java.

#  expression    min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
#  java_to_dt    29s  30.6s    0.0332     229MB   0.0774     3     7      1.51m <NULL>
#   arrow df   24.3s  25.2s    0.0389     570MB   0.0519     3     4      1.28m <NULL>
# arrow arrow  25.3s  26.7s    0.0363     569MB   0.0485     3     4      1.38m <NULL>

Comments / Questions

1. Returning a data.frame or an Arrow table

One strange behavior I've found is this. As it stands in your PR, the travel time matrix function internally returns a data.frame, so the code on line 243 is:

travel_times <- arrow::read_ipc_stream(travel_times, as_data_frame = T) 

As a test, I changed it to as_data_frame = FALSE. With this change we do not materialize the output and simply return an Arrow table, so I was expecting a quicker computation time, but the function actually became a bit slower. This is a bit strange, isn't it?

2. R5 and Arrow competing for CPU?

I don't understand the details of the Java code, but it seems to me that you are streaming the R5 results to Arrow in parallel via ArrowR5Process and BatchWithSeq. If that's the case, my concern is that R5 already runs in parallel, so R5 and Arrow would in practice be competing for CPU, which could degrade performance. Is my understanding correct? I'm sorry if I have misunderstood something.

@alexmgns (Collaborator, Author) commented Sep 7, 2025

Hi Rafa,

To be completely honest, I am not 100% sure how the multithreaded code is handled in the background. However, I believe the two processes do not happen in parallel: the collector only accumulates the batches during runtime, and the actual joining happens after all the batches are received. This is because collector.join() is only called after the for loop:

            for (ForkJoinTask<?> t : tasks) {
                t.get();
            }

My understanding is that this loop waits until all the R5 threads finish their processing, so the collecting only starts once R5 is done.
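The blocking behavior of that loop can be illustrated with a minimal, self-contained sketch. This is not the PR's actual code (ArrowR5Process and BatchWithSeq are not shown here); the queue of int arrays simply stands in for the stream of Arrow record batches, and the class/variable names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;

public class Main {
    public static void main(String[] args) throws Exception {
        // Stand-in for the stream of Arrow record batches produced by R5 workers.
        ConcurrentLinkedQueue<int[]> batches = new ConcurrentLinkedQueue<>();
        ForkJoinPool pool = ForkJoinPool.commonPool();

        // Phase 1: R5-style workers run in parallel, each emitting one batch.
        List<ForkJoinTask<?>> tasks = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            final int id = i;
            tasks.add(pool.submit(() -> batches.add(new int[]{id, id * 10})));
        }

        // This loop blocks until every worker has finished...
        for (ForkJoinTask<?> t : tasks) {
            t.get();
        }

        // ...so the single-threaded merge below only starts after all
        // batches exist. This is the sequential phase described above.
        int total = 0;
        for (int[] b : batches) total += b[1];
        System.out.println(total);
    }
}
```

Because t.get() blocks on each task in turn, the merge loop cannot begin until the slowest worker is done, which is why the collection phase adds strictly sequential time at the end.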

That said, I think your concern about R5 and Arrow clashing actually points to a way to speed the function up further. I don't think running the two processes simultaneously would degrade performance; the JVM will distribute resources as needed, so there shouldn't be clashing. But the collecting can only happen on a single thread, so at the end of the function, once all the batches are processed, we have to wait for a single thread to merge all the tables into one while the other threads sit idle.
Perhaps if the batches were merged in parallel while R5 was still calculating, the wall-clock time of the computation would drop, because at the very end we wouldn't have to wait as long for that single thread to finish while the rest sit idle.
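A common way to overlap the merge with the computation is a producer/consumer setup: workers push batches onto a blocking queue while a dedicated merger thread drains it concurrently, with a sentinel value signalling the end of the stream. This is only a sketch of that idea under invented names (it is not code from the PR, and real Arrow table concatenation would replace the toy list merge):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Main {
    private static final int[] POISON = new int[0]; // sentinel: no more batches

    public static void main(String[] args) throws Exception {
        BlockingQueue<int[]> queue = new ArrayBlockingQueue<>(16);
        List<Integer> merged = new ArrayList<>();

        // Dedicated merger thread: consumes and merges batches while
        // producers are still running, instead of waiting for all of them.
        Thread merger = new Thread(() -> {
            try {
                while (true) {
                    int[] b = queue.take();
                    if (b == POISON) break;        // end-of-stream signal
                    for (int v : b) merged.add(v); // toy stand-in for a table merge
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        merger.start();

        // Producers stand in for R5 worker threads emitting batches.
        ExecutorService workers = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            final int id = i;
            workers.submit(() -> { queue.put(new int[]{id, id}); return null; });
        }
        workers.shutdown();
        workers.awaitTermination(10, TimeUnit.SECONDS);

        queue.put(POISON); // all producers done: tell the merger to stop
        merger.join();

        System.out.println(merged.size()); // 4 batches x 2 values each
    }
}
```

With this shape, most of the merge work happens during the computation, so only the final few batches remain to be merged after the last worker finishes.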
