Adding assertion to check for regular JSON inputs of size greater than `INT_MAX` bytes (#17057)

Addresses #17017 

Libcudf does not support parsing regular JSON inputs of size greater than `INT_MAX` bytes. Note that the batched reader can only be used for JSON lines inputs.
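
The rule described above can be sketched as a standalone check. This is a minimal illustration, not libcudf's code: `validate_json_input_size` is a hypothetical name, and `std::invalid_argument` stands in for `CUDF_EXPECTS`. The condition mirrors the one this commit adds to `read_json`.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical helper (not part of libcudf): rejects regular JSON inputs whose
// size reaches the int32_t limit. JSON Lines inputs pass through because the
// batched reader can split them.
inline void validate_json_input_size(std::size_t total_source_size, bool lines_enabled)
{
  constexpr auto limit = static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max());
  if (!lines_enabled && total_source_size >= limit) {
    throw std::invalid_argument(
      "Parsing Regular JSON inputs of size greater than INT_MAX bytes is not supported");
  }
}
```

With this sketch, a 1 MiB regular JSON input is accepted, a 2 GiB JSON Lines input is accepted, and a 2 GiB regular JSON input throws.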

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #17057
shrshi authored Oct 14, 2024
1 parent 86db980 commit 319ec3b
Showing 2 changed files with 11 additions and 6 deletions.
3 changes: 1 addition & 2 deletions cpp/src/io/json/nested_json_gpu.cu
@@ -83,8 +83,7 @@ struct tree_node {
 void check_input_size(std::size_t input_size)
 {
   // Transduce() writes symbol offsets that may be as large input_size-1
-  CUDF_EXPECTS(input_size == 0 ||
-                 (input_size - 1) <= std::numeric_limits<cudf::io::json::SymbolOffsetT>::max(),
+  CUDF_EXPECTS(input_size == 0 || (input_size - 1) <= std::numeric_limits<int32_t>::max(),
                "Given JSON input is too large");
 }
 }  // namespace
14 changes: 10 additions & 4 deletions cpp/src/io/json/read_json.cu
@@ -351,10 +351,16 @@ table_with_metadata read_json(host_span<std::unique_ptr<datasource>> sources,
    * JSON inputs.
    */
   std::size_t const total_source_size = sources_size(sources, 0, 0);
-  std::size_t chunk_offset             = reader_opts.get_byte_range_offset();
-  std::size_t chunk_size               = reader_opts.get_byte_range_size();
-  chunk_size = !chunk_size ? total_source_size - chunk_offset
-                           : std::min(chunk_size, total_source_size - chunk_offset);
+
+  // Batching is enabled only for JSONL inputs, not regular JSON files
+  CUDF_EXPECTS(
+    reader_opts.is_enabled_lines() || total_source_size < std::numeric_limits<int32_t>::max(),
+    "Parsing Regular JSON inputs of size greater than INT_MAX bytes is not supported");
+
+  std::size_t chunk_offset = reader_opts.get_byte_range_offset();
+  std::size_t chunk_size   = reader_opts.get_byte_range_size();
+  chunk_size               = !chunk_size ? total_source_size - chunk_offset
+                                         : std::min(chunk_size, total_source_size - chunk_offset);

   std::size_t const size_per_subchunk      = estimate_size_per_subchunk(chunk_size);
   std::size_t const batch_size_upper_bound = get_batch_size_upper_bound();
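
The `chunk_size` expression in this hunk treats a requested byte range size of zero as "read to the end of the input" and otherwise clamps the request to the bytes remaining after the offset. A minimal sketch of that arithmetic, assuming `chunk_offset` does not exceed the total size; `effective_chunk_size` is a hypothetical name, not a libcudf function:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper mirroring the chunk_size expression in read_json.
// Assumes chunk_offset <= total_source_size; otherwise the unsigned
// subtraction would wrap around.
inline std::size_t effective_chunk_size(std::size_t total_source_size,
                                        std::size_t chunk_offset,
                                        std::size_t requested_size)
{
  std::size_t const remaining = total_source_size - chunk_offset;
  // A requested size of 0 means "everything from the offset onward".
  return requested_size == 0 ? remaining : std::min(requested_size, remaining);
}
```

For a 100-byte input and an offset of 20, a request of 0 yields 80, a request of 50 yields 50, and a request of 500 is clamped to 80.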