Progress bar/indicator for "by" operations #6228

joshhwuu · 2024-07-08T21:40:05Z

Similar to fwrite's progress printing: triggers when the ETA is >= 3s and prints the following, updating every second until complete.

> dt = data.table(a = 1:100000000)
> dt[, 1, by = a]
# Processed 48203243 groups out of 100000000. 48% done. Time elapsed: 10s. ETA: 10s

- Custom progress() function to show groups done, groups remaining, time elapsed
- Add new option or argument to trigger progress printing
- ~~Add to gforce operations?~~
- Benchmark against current master to measure overhead (calls to wallclock() are quite expensive)
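
A minimal sketch in R of the arithmetic behind the message above, assuming the ETA is a linear extrapolation from the average rate so far. The helper names `progress_line` and `should_report` are hypothetical and exist only for illustration; the PR does its reporting in C inside dogroups, and its exact formula may differ.

```r
# Hypothetical helpers, not part of data.table.

# Format one progress line from the number of groups done, the total
# number of groups, and the seconds elapsed so far.
progress_line = function(done, total, elapsed) {
  stopifnot(done > 0, done <= total, elapsed >= 0)
  pct = floor(100 * done / total)
  # Assumption: remaining groups take as long, on average, as finished ones.
  eta = elapsed * (total - done) / done
  sprintf(
    "Processed %d groups out of %d. %d%% done. Time elapsed: %ds. ETA: %ds",
    as.integer(done), as.integer(total), as.integer(pct),
    as.integer(round(elapsed)), as.integer(round(eta))
  )
}

# Report only when the estimated remaining time reaches the threshold.
should_report = function(done, total, elapsed, threshold = 3) {
  elapsed * (total - done) / done >= threshold
}

cat(progress_line(50, 200, 4), "\n")
```

With 50 of 200 groups done after 4 seconds, the extrapolated ETA is 4 * 150 / 50 = 12 seconds, so `should_report(50, 200, 4)` is TRUE under the 3 second threshold.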

Benchmarking

With Progress Printing

dt = data.table(a = 1:10000000)
microbenchmark(copy(dt)[, 1, by = a, showProgress = FALSE], times = 50)
Unit: seconds
                                        expr      min       lq     mean   median       uq      max neval
 copy(dt)[, 1, by = a, showProgress = FALSE] 1.994391 2.036843 2.096827 2.069326 2.111358 2.365085    50

microbenchmark(copy(dt)[, 1, by = a, showProgress = TRUE], times = 50)
Unit: seconds
                                       expr      min      lq     mean   median       uq     max neval
 copy(dt)[, 1, by = a, showProgress = TRUE] 2.103596 2.14602 2.182603 2.168911 2.205532 2.40131    50

Current Master

dt = data.table(a = 1:10000000)
microbenchmark(copy(dt)[, 1, by = a], times = 50)
Unit: seconds
                  expr     min     lq     mean   median       uq      max neval
 copy(dt)[, 1, by = a] 2.01209 2.0507 2.113709 2.086281 2.105044 3.377778    50

Performance analysis using atime:

a.result = atime::atime(
  N = 2^seq(2, 24),
  setup = {
    rows.per.group = 10
    N.rows = N*rows.per.group
    DT = data.table(i=1:N.rows, g=rep(1:N, each=rows.per.group))
  },
  seconds.limit = 3,
  default = DT[, 1, by=g],
  "TRUE" = DT[, 1, by=g, showProgress=TRUE],
  "FALSE" = DT[, 1, by=g, showProgress=FALSE]
)
plot(a.result)

github-actions · 2024-07-08T21:56:44Z

Generated via commit 78807e9

Download link for the artifact containing the test results: ↓ atime-results.zip

Time taken to finish the standard R installation steps: 12 minutes and 23 seconds

Time taken to run atime::atime_pkg on the tests: 6 minutes and 12 seconds

tdhock · 2024-07-16T11:01:49Z

please add an argument to [.data.table as in fread

     showProgress=getOption("datatable.showProgress", interactive()),

Also please add docs similar to ?fread

showProgress: ‘TRUE’ displays progress on the console if the ETA is
          greater than 3 seconds. It is produced in fread's C code
          where the very nice (but R level) txtProgressBar and
          tkProgressBar are not easily available.
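
To illustrate how that default resolves, here is a small base R sketch. The wrapper `resolve_show_progress` is a hypothetical name used only for this example; the expression inside it is the default quoted above.

```r
# Hypothetical wrapper around the default expression quoted above.
resolve_show_progress = function() {
  getOption("datatable.showProgress", interactive())
}

# Unset the option: the default falls back to interactive().
old = options(datatable.showProgress = NULL)
stopifnot(identical(resolve_show_progress(), interactive()))

# An explicit option value takes precedence over interactive().
options(datatable.showProgress = FALSE)
stopifnot(identical(resolve_show_progress(), FALSE))

# Restore whatever was set before.
options(old)
```

An explicit `showProgress=` argument in a call would in turn override both, since the option is only consulted when the argument is left at its default.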

joshhwuu · 2024-07-16T20:24:58Z

New argument added, allows us to do:

DT[, 1, by = a, showProgress=TRUE] (Will also automatically start if interactive() and > ?1s)

R CMD check should pass now.

~~However, there's a memory leak associated with the changes found by valgrind, will investigate.~~

Edit: Apparently this also happens with current master... not sure if this is known:

my_test_file.R:

library(devtools)
load_all()
dt = data.table(a = 1:1000000)
dt[, 1, by = a]

R -d "valgrind --tool=memcheck --leak-check=full" --vanilla < my_test_file.R

==190100== 
==190100== HEAP SUMMARY:
==190100==     in use at exit: 144,647,575 bytes in 26,707 blocks
==190100==   total heap usage: 309,984 allocs, 283,277 frees, 448,491,612 bytes allocated
==190100== 
==190100== 2,352 bytes in 7 blocks are possibly lost in loss record 438 of 2,164
==190100==    at 0x484DA83: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==190100==    by 0x40147D9: calloc (rtld-malloc.h:44)
==190100==    by 0x40147D9: allocate_dtv (dl-tls.c:375)
==190100==    by 0x40147D9: _dl_allocate_tls (dl-tls.c:634)
==190100==    by 0x4D597B4: allocate_stack (allocatestack.c:430)
==190100==    by 0x4D597B4: pthread_create@@GLIBC_2.34 (pthread_create.c:647)
==190100==    by 0x575D25F: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==190100==    by 0x5753A10: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==190100==    by 0x16BBEB60: forder (forder.c:491)
==190100==    by 0x4958279: ??? (in /usr/lib/R/lib/libR.so)
==190100==    by 0x495888C: ??? (in /usr/lib/R/lib/libR.so)
==190100==    by 0x49AFB97: Rf_eval (in /usr/lib/R/lib/libR.so)
==190100==    by 0x49B3237: ??? (in /usr/lib/R/lib/libR.so)
==190100==    by 0x49AF935: Rf_eval (in /usr/lib/R/lib/libR.so)
==190100==    by 0x49B16DE: ??? (in /usr/lib/R/lib/libR.so)
==190100== 
==190100== LEAK SUMMARY:
==190100==    definitely lost: 0 bytes in 0 blocks
==190100==    indirectly lost: 0 bytes in 0 blocks
==190100==      possibly lost: 2,352 bytes in 7 blocks
==190100==    still reachable: 144,645,223 bytes in 26,700 blocks
==190100==                       of which reachable via heuristic:
==190100==                         newarray           : 4,264 bytes in 1 blocks
==190100==         suppressed: 0 bytes in 0 blocks
==190100== Reachable blocks (those to which a pointer was found) are not shown.
==190100== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==190100== 
==190100== For lists of detected and suppressed errors, rerun with: -s
==190100== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

@tdhock @Anirban166 have you seen this before? Should I file an issue?

After doing a quick search on some forums, apparently Valgrind often doesn't play too well with OpenMP, so this is most likely a false positive from Valgrind:
https://medium.com/@auraham/pseudo-memory-leaks-when-using-openmp-11a383cc4cf9
https://stackoverflow.com/questions/6973489/valgrind-and-openmp-still-reachable-and-possibly-lost-is-that-bad
dmlc/xgboost#8238


tdhock · 2024-07-18T03:57:03Z

I did some benchmarking https://tdhock.github.io/blog/2024/atime-showProgress/ and it looks like the proposed code with the new showProgress arg has the same performance as previous master, at least for small data sizes / compute times (<0.1 sec).

joshhwuu · 2024-07-18T08:17:13Z

> I did some benchmarking https://tdhock.github.io/blog/2024/atime-showProgress/ and it looks like the proposed code with new showProgress arg has the same performance as previous master, at least for small data sizes / compute times (<0.1 sec).

Great! Really cool article! I included some benchmarks for large (>2s) operations in the PR description above, and it seems that on average the new changes aren't much slower than current master (certainly not by orders of magnitude). I think performance is nearly identical for small (<1s) operations because the progress indicator is designed not to print unless the operation takes longer than 1 second, similar to fwrite (3s).

When I log on tomorrow I'll do some more comprehensive benchmarking with some of the steps you've outlined in your article!

tdhock · 2024-07-18T10:04:15Z

> it doesn't print unless the operation is longer than 1 second, similar to fwrite (3s)

I wonder if we can easily make that time delay (1 or 3 seconds) the same and user controlled? If not in this PR then please create another issue.

tdhock · 2024-07-18T10:34:07Z

For longer by operations to benchmark before and after showProgress, https://github.com/Rdatatable/data.table/wiki/Benchmarks-:-Grouping

joshhwuu · 2024-07-18T17:34:31Z

> I wonder if we can easily make that time delay (1 or 3 seconds) the same and user controlled? If not in this PR then please create another issue.

An option to select how long the function waits before progress printing starts should be easy to add; it is simply a check. (This means yes, I can change this to match fwrite's 3 seconds. I only used 1 second in this draft so I could easily test the feature without having to use very large data.) I'm not sure how useful it would be, though, seeing as fwrite and fread don't have this option and I've yet to see someone ask for it.
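
A rough sketch in R of what "simply a check" could look like, assuming a plain time-based gate. The function `run_with_progress` and its `delay` and `clock` parameters are hypothetical names invented for this example and are not data.table API; the injectable clock exists only so the behaviour can be checked without waiting in real time.

```r
# Hypothetical sketch, not data.table code: loop over groups and collect
# progress reports once `delay` seconds have passed, then at most once
# per second after that.
run_with_progress = function(ngrp, work, delay = 3,
                             clock = function() proc.time()[["elapsed"]]) {
  start = clock()
  next_print = start + delay   # earliest time the first report may appear
  reports = character()
  for (i in seq_len(ngrp)) {
    work(i)                    # process group i
    now = clock()
    if (now >= next_print && i < ngrp) {
      reports = c(reports, sprintf("Processed %d groups out of %d.", i, as.integer(ngrp)))
      next_print = now + 1     # throttle to one update per second
    }
  }
  reports
}

# A fake clock that advances one second per call, for a deterministic demo.
tick = 0
fake_clock = function() { tick <<- tick + 1; tick }
print(run_with_progress(6, function(i) NULL, delay = 3, clock = fake_clock))
```

With the fake clock, no report appears for the first two groups, and groups 3, 4 and 5 each produce one. Changing `delay` between 1 and 3 is the only difference between the two behaviours discussed here.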

joshhwuu · 2024-07-18T17:47:52Z

Also @tdhock and anyone that's interested in this feature, there currently isn't support for any gforce optimized operations (ie sum, mean, order etc), this is because I've only added progress printing to dogroups which is only called whenever gforce = FALSE

Would we want progress printing for gforce operations as well? It will add a little overhead but can be good for consistency

On second thought: I took a quick glance at the implementations of the gforce optimized functions, and it seems that if we were to implement this for gforce, I would have to do it separately for every single function and figure out how to handle parallelization. I think that may be a little out of scope for this PR, unless I find an easier way to do it while I dig around.

MichaelChirico · 2024-07-18T18:12:26Z

I also don't see a particular need to implement for GForce, besides consistency.

With how GForce actually operates, it's ~roughly equivalent to just doing an un-grouped operation, for which we don't offer any progress bar. So in that sense, offering a progress bar in this case would then be inconsistent with the lack of bar for any j computation.

So, I'm happy to at the very least leave this as out of scope for this issue & wait to see whether such a request is made later.

joshhwuu · 2024-07-18T19:13:22Z

Ok, then I believe this PR should be ready to go. Are there any preferences for a 1 second vs 3 second wait before progress printing? Otherwise, I can add a NEWS entry (does this count as a new feature or a note?) and write better documentation.

Although I would prefer that we merge #6296 first, as there are some overlapping variables between the two now that we no longer want to rely on clock()

tdhock · 2024-07-18T19:19:00Z

I would prefer consistent 3 second wait.
Is there any other code/data that can obviously be shared between the different progress indicators?

MichaelChirico · 2024-07-18T19:22:42Z

I agree on the need for consistency, and a preference to share code between the implementations.

One thing I immediately see is the current implementation uses Rprintf(), which can't be called from fread.c (since that's R-agnostic). I think it will require a bit more investment to get the implementations to share code, therefore I'm OK to defer that as a follow-up issue. WDYT @tdhock?

tdhock · 2024-07-18T19:25:12Z

Rather than deferring code sharing to a follow up issue, I would rather get it right here.
At least look at what is common to fread and fwrite, and use the same here.

joshhwuu · 2024-07-18T19:27:41Z

> I would prefer consistent 3 second wait.

Ok 👍

> Is there any other code/data that can obviously be shared between the different progress indicators?

fread's progress printing is just a progress bar: a custom progress function prints equals signs until the read is done. fwrite iteratively prints operation-specific information, similar to 'by' operations, although the information it prints differs. As for gforce, the information would depend on how each of the optimized functions works internally, i.e. the differences between gmean and gsum and how they handle groups. AFAIU, gforce doesn't handle each group iteratively; instead it handles every group at once via vectorization, so how progress printing would work isn't immediately clear to me.

joshhwuu · 2024-07-18T19:50:15Z

There's definitely a world where all of the code can use the void progress() implemented for fread (the prototype for this PR actually called that); it just means the progress printing wouldn't be very specific. I did see here that there might've been an intention at one point to use it for fwrite:

data.table/src/fwrite.c

Line 957 in 0030b15

// TODO: use progress() as in fread

tdhock · 2024-07-29T15:06:19Z

ok great thanks for the analysis about fread/fwrite, so I guess we create a follow-up issue about sharing progress() between all three.

please add a NEWS item.

tdhock · 2024-07-29T20:02:31Z

great thanks

progress prototype

20b2137

fwrite-esque reporting

6b09d1b

joshhwuu mentioned this pull request Jul 15, 2024

Master List of data.table Issues for GSoC '24 (Josh) joshhwuu/gsoc-2024#1

Open

11 tasks

add as an argument like in fread, update data.table.Rd

dc2320d

rd change

2bafec9

joshhwuu mentioned this pull request Jul 16, 2024

'by' operations much slower when verbose=TRUE #6286

Open

MichaelChirico reviewed Jul 17, 2024

Update R/data.table.R

b3d9a8d

Co-authored-by: Michael Chirico <[email protected]>

progress printing starts after >3s

f30bf9e

Merge branch 'master' into progressforbyops

0463cb2

joshhwuu marked this pull request as ready for review July 19, 2024 18:09

NEWS entry

ad76d61

Merge branch 'master' into progressforbyops

78807e9

tdhock merged commit 4704c82 into master Jul 29, 2024
5 checks passed

joshhwuu added a commit that referenced this pull request Aug 1, 2024

reorganize and clarify time variables following #6228

8133a6d
