Parallelising #1148

Arthfael · 2024-10-02T14:10:06Z

I apologize in advance if this idea seems "out-there". I do not know how relevant it would be for other people, or how feasible it is given the underlying code.
I have large amounts of proteomics data, which I am trying to write with formatting. Using a single core, this is slow (> 1 h). I am wondering whether I could speed this up using parallelization, considering that I have N > 50 vCPUs at my disposal.

The idea would be to:

1. Write N chunks of my table without formatting into temporary .csv files, so that each node of the cluster needs to read only one chunk.
2. On each vCPU, read one .csv, create a wbWorkbook, then apply formatting and save into a temporary Excel file.
3. In my main R environment, create an empty tab, then sequentially read each Excel file and paste the formatted data from the latter into the former.

The whole strategy hinges on two points:

Feasibility:
- I know I can use wb_load to load an Excel file as a new wbWorkbook object.
- I can in theory also clone a worksheet from one wbWorkbook into another using wb_clone_worksheet; my table does not feature any formulas, charts or pivot tables, nor any other exotic features (some conditional formatting: would that be a problem?), so I think the limitations do not apply.
- I can use wb_copy_cells to copy within a worksheet. Can I move data within a wbWorkbook from one tab into a specific position in another tab?
Speed: would this be efficient, i.e. worth it? That is, would the advantage of creating the formatting in a multi-threaded way be offset by the fact that I then need to move the data around?

JanMarvin · 2024-10-02T17:22:31Z

Maintainer

Hi @Arthfael ,

There are certainly parts that could be parallelized. Let's put aside that I have doubts that it's worth creating many formatted worksheets (the row limit in OOXML is something like 1 million and the column limit ~16 thousand; that's a lot of cells per worksheet, and I can tell you, I don't see anything if I'm simply buried in data). But that was not the point of the discussion.

Internally, openxlsx2 uses a large character data frame, cc, per worksheet. This frame is a long representation of the entire worksheet's data and consists of ~10 variables. What takes a long time is constructing the data frame and writing into it. When saving the workbook object, one has to construct the worksheet XML file, which will be huge for a huge table. When writing the xlsx file, the XML is constructed in memory, then written into the workbook and zipped.
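
To make the idea of a long representation concrete, here is a toy sketch in base R. It is not the actual layout of cc: the helper names (col_letter, to_long_cells) and the reduced set of columns (r, row_r, c_r, v) are assumptions chosen for illustration only.

```r
# Toy "long" cell table: one row per cell, every value stored as character.
# NOT the real openxlsx2 cc layout; names and columns are illustrative.

# Convert a 1-based column index to an Excel column letter (1 -> "A", 27 -> "AA")
col_letter <- function(i) {
  out <- character(0)
  while (i > 0) {
    rem <- (i - 1) %% 26
    out <- c(LETTERS[rem + 1], out)
    i <- (i - 1) %/% 26
  }
  paste(out, collapse = "")
}

# Turn an n x k data frame into an (n * k)-row long character data frame
to_long_cells <- function(df, start_row = 1L) {
  n <- nrow(df)
  k <- ncol(df)
  cols <- vapply(seq_len(k), col_letter, character(1))
  row_r <- rep(seq_len(n) + start_row - 1L, each = k) # sheet row of each cell
  c_r <- rep(cols, times = n)                         # column letter of each cell
  # Values in row-major order: all cells of row 1, then all cells of row 2, ...
  v <- as.vector(t(vapply(df, as.character, character(n))))
  data.frame(r = paste0(c_r, row_r), row_r = as.character(row_r),
             c_r = c_r, v = v, stringsAsFactors = FALSE)
}

df <- data.frame(id = 1:2, name = c("a", "b"), stringsAsFactors = FALSE)
long <- to_long_cells(df, start_row = 2L)
print(long)
```

A 2 x 2 table becomes 4 rows here, so a table with n rows and k columns needs n * k rows in the long form, which is where the memory cost comes from.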

Given that one has enough memory, the construction of the worksheet XMLs could be parallelized. What cannot be parallelized, and takes a lot of memory, is the construction of the cc frame. After all, it is required to create a character data frame of roughly (n * k) x 10. Creating this object and writing into it is not possible in parallel with the current code. If you know the dimensions, it would be possible to construct it piecewise and rbind() the pieces together. You could create multiple workbooks, write the pieces into different regions of each workbook and paste them together.

wb <- wb_workbook()$add_worksheet()$add_data(x = ..., dims = ...)

Afterwards you can get cc with

wb$worksheets[[1]]$sheet_data$cc

Construct a few workbook objects, collect ccs and rbind them together. Still this would require even more memory, having all the multiple sheets in a workbook.
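
The piecewise idea can be checked on toy data without openxlsx2: build a long table chunk by chunk with the right row offset, rbind() the pieces, and compare with the table built in one go. In this sketch long_cells() is a hypothetical stand-in for whatever add_data() writes into cc (it is limited to 26 columns and carries only four columns), and the lapply() call is where something like parallel::parLapply() could be swapped in.

```r
# Hypothetical stand-in for the long cell table of one chunk (max. 26 columns)
long_cells <- function(df, first_row) {
  stopifnot(ncol(df) <= 26)
  n <- nrow(df)
  k <- ncol(df)
  row_r <- rep(first_row + seq_len(n) - 1L, each = k) # sheet row of each cell
  c_r <- rep(LETTERS[seq_len(k)], times = n)          # column letter of each cell
  v <- as.vector(t(vapply(df, as.character, character(n))))
  data.frame(r = paste0(c_r, row_r), row_r = row_r, c_r = c_r, v = v,
             stringsAsFactors = FALSE)
}

dat <- data.frame(x = 1:6, y = letters[1:6], stringsAsFactors = FALSE)

# Two chunks: rows 1-3 and rows 4-6, each written at its own row offset
starts <- c(1L, 4L)
ends <- c(3L, 6L)
pieces <- lapply(seq_along(starts), function(i) {
  long_cells(dat[starts[i]:ends[i], , drop = FALSE], first_row = starts[i])
})

combined <- do.call(rbind, pieces)
rownames(combined) <- NULL

whole <- long_cells(dat, first_row = 1L)
same <- isTRUE(all.equal(combined, whole))
print(same)
```

The check only works because each chunk knows its absolute starting row, which is why the dimensions have to be known up front.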

But you are definitely on your own in this endeavor, as I have no further interest in this (writing this response already took 30 minutes, working on this ...). I would push the data you want into some database, maybe MS SQL, and use this as a data source for some xlsx file if needed. This should be faster and more robust, and you can still fill spreadsheets if required. Or maybe have a look at the writexl package (writexl::write_xlsx()), which is probably way better at simply writing an xlsx file.

1 reply

JanMarvin Oct 3, 2024
Maintainer

See here for a more detailed explanation of how to use Power Query to fill a workbook from external sources.

Arthfael · 2024-10-05T15:21:22Z

Author

Thank you, I do agree that this may not be ideal... but what can I say, these reports are expected of me, and an external database is currently not an option. I appreciate the 30 minutes ^^

I have played a bit with the idea and can provide, for any poor soul who would stumble on this in the future and wonder whether this can be made to work, the following code:

require(openxlsx2)
require(data.table)
require(plyr)

wd <- getwd()
temp <- ... # the data you want to write into your Excel table
sheetnm <- "My_first_tab"

# Create parallel cluster
N.clust <- parallel::detectCores()-1
parClust <- parallel::makeCluster(N.clust, type = "PSOCK") # "PSOCK" is the socket cluster type built into parallel; "SOCK" needs the snow package

# Define N.clust chunks of temp
sq <- round((1:N.clust)*nrow(temp)/N.clust)
sq <- data.frame(Start = c(1,sq[1:(N.clust-1)]+1), End = sq) # We are deciding how to partition the data

# Write each chunk locally: this seems much faster for me than exporting the whole table to the cluster
cn <- colnames(temp) # The chunk files are written without a header, so keep the column names to restore them on the workers
tstWrt <- lapply(1:N.clust, function(x) {
  data.table::fwrite(temp[sq$Start[x]:sq$End[x],], paste0(wd, "/tempDat_", x, ".tsv"), quote = FALSE, sep = "\t", col.names = FALSE, na = "NA")
})
parallel::clusterExport(parClust, list("sheetnm", "sq", "wd", "cn"), envir = environment())
tstWB <- parallel::parLapply(parClust, 1:N.clust, function(x) { #x <- 1
  rg <- sq$Start[x]:sq$End[x]
  WB <- openxlsx2::wb_workbook()
  WB <- openxlsx2::wb_add_worksheet(WB, sheetnm)
  tmp <- data.table::fread(paste0(wd, "/tempDat_", x, ".tsv"), header = FALSE, col.names = cn) # No header in the file: restore the original column names
  if (x == 1) {
    # We are only writing the column names for the first chunk
    WB <- openxlsx2::wb_add_data_table(WB, sheetnm, tmp,
                                       dims = openxlsx2::wb_dims(rows = 2, cols = 1), # The rows = 2 here is because I leave one blank line above the table for writing a "super header" with merged column categories
                                       col_names = TRUE, table_style = "TableStyleMedium2",
                                       banded_rows = TRUE, banded_cols = FALSE)
  } else {
    WB <- openxlsx2::wb_add_data_table(WB, sheetnm, tmp,
                                       dims = openxlsx2::wb_dims(rows = sq$Start[x]+2, cols = 1),
                                       col_names = FALSE, table_style = "TableStyleMedium2",
                                       banded_rows = TRUE, banded_cols = FALSE)
  }
  return(WB)
})
tstCC <- lapply(1:N.clust, function(x) { #x <- 1
  tstWB[[x]]$worksheets[[1]]$sheet_data$cc
})
CCs <- plyr::rbind.fill(tstCC)
tstRA <- lapply(1:N.clust, function(x) { #x <- 1
  tstWB[[x]]$worksheets[[1]]$sheet_data$row_attr
})
RAs <- plyr::rbind.fill(tstRA)

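tmpWB <- tstWB[[1]]
m <- match(sheetnm, openxlsx2::wb_get_sheet_names(tmpWB)) # Index of our sheet; must be defined before it is used on the next line
tmpWB$worksheets[[m]]$sheet_data$cc_out <- NULL # (Not sure if necessary)
openxlsx2::wb_save(tmpWB, paste0(wd, "/tst1.xlsx")) # Test: only rows from the first chunk are written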
# After some tests, it turned out that I could not just get away with editing just ...$sheet_data$cc
# I clearly also need to edit ...$sheet_data$row_attr
# As part of debugging this, I also found other references to the table dimensions in the data structure, and am also editing those.
# This may be unnecessary, but then again, it takes almost no time and is cleaner.
fullDims <- openxlsx2::wb_dims(rows = (0:nrow(temp))+2,
                               cols = 1:ncol(temp))
m <- match(sheetnm, openxlsx2::wb_get_sheet_names(tmpWB))
tmpWB$worksheets[[m]]$dimension <- paste0("<dimension ref=\"", fullDims, "\"/>")
tmpWB$tables$tab_ref[[m]] <- fullDims
tmpWB$tables$tab_xml[[m]] <- gsub("ref=\"[A-Z]+[0-9]+:[A-Z]+[0-9]+\"",
                                  paste0("ref=\"", fullDims, "\""),
                                  tmpWB$tables$tab_xml[[m]])
tmpWB$worksheets[[m]]$sheet_data$cc <- CCs
tmpWB$worksheets[[m]]$sheet_data$row_attr <- RAs
openxlsx2::wb_save(tmpWB, paste0(wd, "/tst2.xlsx")) # TADAM! This should have as many rows as temp!!!
# Cleanup: stop the cluster and remove the temporary tsv files:
parallel::stopCluster(parClust)
tstRmv <- lapply(1:N.clust, function(x) { unlink(paste0(wd, "/tempDat_", x, ".tsv")) })
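
The partitioning step at the top of the code above can also be written as a small helper. This is only a sketch: chunk_bounds() is a hypothetical name, and compared to the inline version it additionally copes with a single chunk and with more chunks than rows.

```r
# Split rows 1..n_rows into at most n_chunks contiguous, non-overlapping ranges
chunk_bounds <- function(n_rows, n_chunks) {
  stopifnot(n_rows >= 1, n_chunks >= 1)
  n_chunks <- min(n_chunks, n_rows)   # never create empty chunks
  end <- round(seq_len(n_chunks) * n_rows / n_chunks)
  start <- c(1, head(end, -1) + 1)    # each chunk starts right after the previous one ends
  data.frame(Start = start, End = end)
}

sq_demo <- chunk_bounds(10, 3)
print(sq_demo)

# Sanity check: the ranges cover every row exactly once
covered <- unlist(Map(seq, sq_demo$Start, sq_demo$End))
stopifnot(identical(as.numeric(covered), as.numeric(1:10)))
```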

A first benchmark suggests some significant time gains writing the table into the workbook - ~4x faster, which is well short of the maximum expected improvement (I have 55 threads in my cluster, and memory isn't an issue) but to be expected because of the overheads, and still a nice boost.
Now, here I am just saving the table data - no conditional formatting applied. My actual code contains a lot of additional steps, which I have yet to add to this. Since these will go into the parallel bit, I expect that the ratio of parallel/overheads will increase, so the speed improvements should be even better.
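
One rough way to read the ~4x figure is through Amdahl's law, S = 1 / ((1 - p) + p / N), where p is the fraction of the run time that is parallelized and N the number of workers. This is an idealized model that ignores communication and I/O costs, so the sketch below (function names are mine) is only indicative.

```r
# Amdahl's law: speed-up for parallel fraction p on n workers
amdahl_speedup <- function(p, n) 1 / ((1 - p) + p / n)

# Inverse: parallel fraction implied by an observed speed-up s on n workers
implied_parallel_fraction <- function(s, n) (1 - 1 / s) / (1 - 1 / n)

# Observed: ~4x on 55 threads
p <- implied_parallel_fraction(s = 4, n = 55)
print(round(p, 3))

# What the same model predicts if more of the work moves into the parallel part
print(amdahl_speedup(p = 0.9, n = 55))
```

Under this model a 4x gain on 55 threads corresponds to roughly three quarters of the run time being parallel, which is consistent with the expectation that moving the formatting steps into the parallel part should improve the ratio.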

So... I think that this will prove very useful for me, thanks for the support.

5 replies

JanMarvin Oct 5, 2024
Maintainer

Thanks for the follow up. Just for reference, we don’t try to be slow by design. For me personally speed isn’t an issue, my data objects are slim enough. Also while you are hopefully paid for the hour it takes to create the workbook, I’m not for spending time developing open source software and it’s not like there’s just a forgotten Sys.sleep() in the code. At least not that I’m aware of :)

If you stumble over things, please feel free to open a PR. After all, that’s how most of the R ecosystem works.

JanMarvin Oct 5, 2024
Maintainer

Once you have written the first two rows of cc (assuming that the first is the column header and that the remaining rows are styled like the second, the data row), you could try to construct the rest of cc outside of the wbWorkbook object, modeled on the data row, with data.table, duckdb or this new Rust-based package.

JanMarvin Oct 6, 2024
Maintainer

You could have a look at the branches unsorted_cc and export_wo_pugi. Those could provide further insight into things that could be tweaked when writing xlsx files.

The checks for the correct order are probably not necessary and it should be possible to export without pugixml to avoid the potentially costly creation of the worksheet xml.

Arthfael Nov 5, 2024
Author

So, I finally got time to return to this, and I think I was able to make it work - it's done marvels to reduce the time spent. I am providing below a minimal example of my code, for anyone foolish enough to ever attempt to explore this dark corner:

WB <- openxlsx2::wb_workbook()
SheetNames # Names of the desired Excel tabs
xlTabs # Named list, containing the data frame to write as table for each tab
for (sheetnm in SheetNames) {
  myData <- xlTabs[[sheetnm]]
  nRws <- nrow(myData)
  nCol <- ncol(myData)
  hdRg <- c(1, 2) # Header range: the table actually starts at row 2, I have a super-header (column groupings, not actually part of the Excel table range) on row 1
  tblRws <- c(1, nRws) + 2 # First and last rows where table data (not header) will be written
  # NB: Excel apparently can take 1048576 rows max!
  # At some point I may need a solution for this, but for now I am not exceeding this
  if (nRws > 1048576 - 2) {
    stop("Unhandled extreme case: more rows than the maximum allowed limit (= 1048576) in Excel!")
  }
  #
  if (sheetnm %in% openxlsx2::wb_get_sheet_names(WB)) { WB <- openxlsx2::wb_remove_worksheet(WB, sheetnm) }
  WB <- openxlsx2::wb_add_worksheet(WB, sheetnm)
  #
  tblNm <- tolower(gsub(" |\\$|\\.|-", "_", sheetnm)) # You do not have to give the table a name, but in my case it made subsequent steps easier.
  # However, I had to replace more characters than just " " and "$" for this to work.  
  #
  dims <- openxlsx2::wb_dims(rows = hdRg[2], cols = 1) # Where to start
  #
  # Step 1: write template table
  # ----------------------------
  #  - Create dummy data
  #    In the first row we will not write the actual data, but a nicely behaved dummy row without any NAs, NaNs or Inf... which could cause havoc.
  #
  dummyData <- myData[1, , drop = FALSE]
  tst <- sapply(colnames(myData), function(x) { sum(c("numeric", "integer") %in% class(myData[[x]])) }) > 0
  wNum <- setNames(which(tst), NULL)
  wTxt <- setNames(which(!tst), NULL)
  dummyData[wNum] <- 0
  dummyData[wTxt] <- "Hello world!"
  #
  #  - Write dummy data to table
  WB <- openxlsx2::wb_add_data_table(WB,
                                     sheetnm,
                                     dummyData,
                                     dims,
                                     table_name = tblNm,
                                     table_style = "TableStyleMedium2",
                                     banded_rows = TRUE,
                                     banded_cols = FALSE)
  #wb_save(WB, paste0(wd, "/tst.xlsx"));xl_open(paste0(wd, "/tst.xlsx"))
  #
  # Step 2: Global formatting
  # -------------------------
  # Now apply all of your global, column-wise formattings to the table
  # ...
  #
  # e.g.
  # dims <- openxlsx2::wb_dims(tblRws[1], 1:5)
  # WB <- openxlsx2::wb_add_cell_style(WB, sheetnm, dims, vertical = "top")
  #
  # Step 3: Create real table
  # -------------------------
  sheetMtch <- match(sheetnm, openxlsx2::wb_get_sheet_names(WB))
  #  - Edit cc object
  cc <- WB$worksheets[[sheetMtch]]$sheet_data$cc
  cc_3 <- cc[which(cc$row_r == "3"),] # My dummy row
  cc_12 <- cc[which(cc$row_r %in% c("1", "2")),] # Header row
  rownames(cc_12) <- NULL # Not sure if row names are relevant...
  #  - Sanity check:
  uNum <- unique(cc_3$typ[wNum])
  uTxt <- unique(cc_3$typ[wTxt])
  stopifnot((length(uNum) == 1)&&(uNum == "2"),
            (length(uTxt) == 1)&&(uTxt == "4"),
            length(unique(c(wNum, wTxt))) == nCol,
            sum(wNum %in% wTxt) == 0) # We only cover types 2 and 4
  #  - Create small template cc
  tmp <- cc_3
  tmp$v <- tmp$is <- ""
  tmp$c_t[wTxt] <- "inlineStr"
  tmp$c_t[wNum] <- ""
  #  - Expand it to cover the whole table
  cc_Rest <- 1:(nRws*nCol) - 1
  cc_Rest <- cc_Rest %% nCol
  cc_Rest <- cc_Rest + 1
  cc_Rest <- tmp[cc_Rest,]
  #  - Edit cell and cell columns
  a <- 1:nrow(cc_Rest)-1
  a <- a - (a %% nCol)
  a <- a/nCol
  a <- a + 3
  cc_Rest$row_r <- a
  cc_Rest$r <- do.call(paste0, c(cc_Rest[, c("c_r", "row_r")]))
  #  - Write data:
  #    Numeric values are written as text into column "v",
  #    Text goes into "is" and is subject to xml formatting.
  lngRws <- ((1:nRws)-1)*nCol
  opt <- getOption("scipen") # To avoid scientific notation when writing as text
  options(scipen = 999)
  #    -> numbers:
  if (length(wNum)) {
    for (i in wNum) {
      rg <- lngRws+i
      cc_Rest$v[rg] <- myData[[i]]
    }
  }
  options(scipen = opt)
  #    -> text:
  if (length(wTxt)) {
    for (i in wTxt) {
      rg <- lngRws+i
      g <- grepl(" ", myData[[i]])+1
      cc_Rest$is[rg] <- paste0("<is><t", c("", " xml:space=\"preserve\"")[g], ">", myData[[i]], "</t></is>")
      # Not sure if xml:space=\"preserve\" is needed systematically,
      # but it was added in some (not all!) cases where the data contains spaces.
      # I do not think that it would hurt to include it every time a space is present, or even systematically.
    }
  }
  #    Errors: those are going to v, and are marked as errors "e" in c_t
  #    NB: this could be done more elegantly.
  #      a) infinites
  w <- which(cc_Rest$v %in% c("-Inf", "Inf"))
  if (length(w)) {
    cc_Rest$v[w] <- "#NUM!" 
    cc_Rest$c_t[w] <- "e"
  }
  w <- which((cc_Rest$is %in% c("<is><t>-Inf</t></is>", "<is><t>Inf</t></is>",
                                "<is><t xml:space=\"preserve\">-Inf</t></is>", "<is><t xml:space=\"preserve\">Inf</t></is>"))|(is.infinite(cc_Rest$is)))
  if (length(w)) {
    cc_Rest$v[w] <- "#NUM!" 
    cc_Rest$c_t[w] <- "e"
    cc_Rest$is[w] <- "" # Clear the inline string, as is done for NaN and NA below
  }
  #      b) NaNs
  w <- which(cc_Rest$v == "NaN")
  if (length(w)) {
    cc_Rest$v[w] <- "#VALUE!" 
    cc_Rest$c_t[w] <- "e"
  }
  w <- which((cc_Rest$is %in% c("<is><t>NaN</t></is>", "<is><t xml:space=\"preserve\">NaN</t></is>"))|(is.nan(cc_Rest$is)))
  if (length(w)) {
    cc_Rest$v[w] <- "#VALUE!" 
    cc_Rest$c_t[w] <- "e"
    cc_Rest$is[w] <- ""
  }
  #      c) NAs
  w <- which((is.na(cc_Rest$v))|(cc_Rest$v == "NA"))
  if (length(w)) {
    cc_Rest$v[w] <- "#N/A"
    cc_Rest$c_t[w] <- "e"
  }
  w <- which((cc_Rest$is %in% c("<is><t>NA</t></is>", "<is><t xml:space=\"preserve\">NA</t></is>"))|(is.na(cc_Rest$is)))
  if (length(w)) {
    cc_Rest$v[w] <- "#N/A"
    cc_Rest$c_t[w] <- "e"
    cc_Rest$is[w] <- ""
  }
  #
  rownames(cc_Rest) <- NULL # Again, is this useful?
  cc <- rbind(cc_12, cc_Rest) # Build full CC table
  rownames(cc) <- as.character(1:nrow(cc)) # As above...
  #
  #  - Fix range of conditional formatting
  # Conditional formattings are a named object, where the name is the applicable range
  # Here we need to change that range from that of our original dummy to that of the real table
  cf <- names(WB$worksheets[[sheetMtch]]$conditionalFormatting)
  if (length(cf)) {
    cfNms0 <- cfNms <- names(WB$worksheets[[sheetMtch]]$conditionalFormatting)
    w <- which(gsub("[A-Z]", "", cfNms) == "3:3")
    if (length(w)) {
      cfNms <- gsub("3$", nRws+2, cfNms[w])
      names(WB$worksheets[[sheetMtch]]$conditionalFormatting)[w] <- cfNms
    }
  }
  #
  WB$worksheets[[sheetMtch]]$sheet_data$cc <- cc
  #
  #  - Edit row attribute
  ra <- WB$worksheets[[sheetMtch]]$sheet_data$row_attr
  tmp <- rep(3, nRws-1)
  tmp <- ra[tmp, ]
  tmp$r <- (2:nRws)+2
  ra <- rbind(ra, tmp)
  rownames(ra) <- as.character(ra$r)
  WB$worksheets[[sheetMtch]]$sheet_data$row_attr <- ra
  WB$worksheets[[sheetMtch]]$sheet_data$cc_out <- NULL # Shouldn't be a problem (cc_out is only created when the object is used to write a WB with openxlsx2::wb_save())
  #
  #  - Edit sheet and table dimensions
  fullDims <- openxlsx2::wb_dims(rows = 1:(nRws+2), cols = 1:ncol(myData))
  tblDims <- openxlsx2::wb_dims(rows = 2:(nRws+2), cols = 1:ncol(myData))
  WB$worksheets[[sheetMtch]]$dimension <- paste0("<dimension ref=\"", fullDims, "\"/>")
  WB$tables$tab_ref[match(sheetMtch, WB$tables$tab_sheet)] <- tblDims
  #
  #  - Edit xml
  #    Those appear to be added sequentially to WB$tables$tab_xml
  #    There may be some internal table-to-table_xml map (besides each table being referenced by name in its xml) but I did not find it.
  #    This editing is even hackier than the rest.
  #    There are much safer ways to edit the xml, such as first parsing it as xml but... well it was faster this way.
  xml <- WB$tables$tab_xml[length(WB$tables$tab_xml)]
  xml <- as.character(xml) # Fix weird character encoding shenanigans which result in Chinese characters appearing at the end of the xml!!!
  tmp <- gsub("\".*", "", unlist(strsplit(xml, " id=\"")))
  stopifnot(length(tmp) >= 2)
  xmlID <- tmp[2]
  xml <- gsub("ref=\"[A-Z]+[0-9]+:[A-Z]+[0-9]+\"", paste0("ref=\"", tblDims, "\""), xml)
  pat <- paste0(" id=\"", xmlID,"\" name=\"", tblNm, "\" displayName=\"", tblNm, "\" ")
  xml <- paste0(unlist(strsplit(xml, pat)))
  xml <- paste0(xml[1], " id=\"", xmlID, "\" name=\"", tblNm, "\" displayName=\"", tblNm, "\" ", xml[2])
  WB$tables$tab_xml[length(WB$tables$tab_xml)] <- xml
  #
  # Step 4: Here you could then apply any formatting which does not affect the whole rows range (e.g. I color in red text in specific rows of interest, group specific cells in my super-header...)
  ...
}
openxlsx2::wb_save(WB, paste0(wd, "/tst.xlsx"))
openxlsx2::xl_open(paste0(wd, "/tst.xlsx"))
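
The index arithmetic in step 3 is easier to follow on a tiny case. The sketch below reproduces it for a table of 3 rows and 2 columns, using integer division in place of the subtract-then-divide steps; the variable names other than nRws, nCol and lngRws are mine.

```r
nRws <- 3 # data rows
nCol <- 2 # columns

idx <- seq_len(nRws * nCol) - 1 # 0-based position of each cell, row by row

# Which cell of the one-row template each long-table row is copied from
template_pos <- idx %% nCol + 1

# Sheet row of each cell: data starts on row 3 (row 1 = super-header, row 2 = header)
sheet_row <- idx %/% nCol + 3

# Positions in the long table that hold the cells of column i
lngRws <- (seq_len(nRws) - 1) * nCol
pos_col2 <- lngRws + 2

print(rbind(template_pos, sheet_row))
print(pos_col2)
```

Each column's values are therefore written to every nCol-th position of the long table, which is what lets a whole column be assigned in one vectorized step.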

Please note that this whole thing is based on assumptions which must be verified for doing this to even make sense:

- Each column should have one style across the whole length of the column. This is usually the case.
- I am currently only dealing with ...$cc$typ values of 2 (numeric data) and 4 (text). Presumably formulas and... logicals(?)... other stuff (?) have different types and must be dealt with separately.
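
Even within numeric columns, NA, NaN and infinite values need special treatment, as the code above shows. That part can be sketched more compactly by classifying the values before they are converted to text. encode_numeric_cells() is a hypothetical helper; it uses the same mapping as above (#NUM! for infinities, #VALUE! for NaN, #N/A for NA) and plain as.character() for the regular values, which is a simplification.

```r
# Map a numeric vector to the text value and cell type used in the long table.
# Regular numbers keep an empty cell type; special values become Excel errors ("e").
encode_numeric_cells <- function(x) {
  stopifnot(is.numeric(x))
  v <- as.character(x)
  c_t <- rep("", length(x))
  is_inf <- is.infinite(x)
  is_nan <- is.nan(x)
  is_na <- is.na(x) & !is_nan # is.na() is also TRUE for NaN, so exclude those
  v[is_inf] <- "#NUM!"
  v[is_nan] <- "#VALUE!"
  v[is_na] <- "#N/A"
  c_t[is_inf | is_nan | is_na] <- "e"
  data.frame(v = v, c_t = c_t, stringsAsFactors = FALSE)
}

res <- encode_numeric_cells(c(1.5, NA, NaN, Inf, -Inf))
print(res)
```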

For me it works for now. I am not claiming that it will solve every potential case, but I get the table that I want, formatting - including conditional formatting - works, filters work... it does not throw any error when I open the Excel table (that took some time to fix), and is as far as I can tell indistinguishable from the original table. Importantly, because that was the whole original point, it is MUCH MUCH FASTER for me.
If you ever re-use this, proceed with caution and check that the output table is at least functionally identical to what you used to get from the classic methods (all.equal() is your best friend).
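
One more thing to check before re-using this: the text columns are pasted into the inline-string XML as they are, so a value containing &, < or > would produce invalid XML. I have not verified whether anything downstream escapes these strings, so the sketch below assumes it does not. xml_escape() and make_inline_string() are hypothetical helpers, and xml:space="preserve" is added systematically, which the comments in the code above suggest should be harmless.

```r
# Escape the characters that are not allowed verbatim in XML text.
# "&" must be replaced first, otherwise the other entities get escaped twice.
xml_escape <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  x
}

# Build the inline-string XML for a character vector
make_inline_string <- function(x) {
  paste0("<is><t xml:space=\"preserve\">", xml_escape(x), "</t></is>")
}

cells <- make_inline_string(c("P12345", "a & b < c"))
print(cells)
```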

JanMarvin Nov 19, 2024
Maintainer

Could you have a look at whether #1184 provides some speedups? I have spent some time thinking about bottlenecks and worked on the C++ functions. It's a bit quicker on my end, but as stated, my resources (and my interest 😄) to create meaningful benchmarks are limited. With a 2000 x 100 data frame of random numbers it is a blink of an eye faster, but previously the same would already run satisfyingly fast. Therefore I'm interested in how it will impact your large datasets.

What I did since 1.10:

- Reduce the number of times Rcpp::checkUserInterrupt() is called (this allows Rcpp functions to be interrupted gracefully). Previously it was called for every cell; now it is called every 10,000 cells. See "[misc] Cleanup cpp code" #1177.
- Try to avoid making string copies. Previously we took a cell from one data frame, put it in a string inside a struct and afterwards copied it from the struct into the second data frame. Now we try to make use of internal C functions to move only pointers around. See "[Rcpp] avoid copying objects" #1184.
