Skip to content

Conversation

@Aydin-ab
Copy link
Contributor

@Aydin-ab Aydin-ab commented Jan 6, 2026

follow up of this (closed for inactivity over the holidays):
#58571

There are a lot of files changed but the main technical content to review are the README.ipynb

For context, the goal is to refactor this current template
https://console.anyscale.com/template-preview/llm_batch_inference
And split it into 2: one on text data, and the other on vision data
Both are very similar, but should be read independently as two distinct content 👍

Aydin Abiar added 30 commits November 12, 2025 12:10
Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces two new beginner-level Ray Data LLM batch inference examples: one for text data (reformatting dates) and another for vision data (generating image captions). The changes include adding new vocabulary terms like 'postprocess' and 'reformat' to the Vale style guide, updating documentation configuration to include the new example READMEs, and adding entries for these examples in examples.yml. For each example, new files were added, including Jupyter notebooks, corresponding Python scripts (for small and scaled datasets), Anyscale job configurations, CI scripts for testing and converting notebooks to markdown, and cluster configurations for AWS and GCE. Review comments highlighted several inconsistencies and areas for improvement: a grammatical error ('On contrary' instead of 'On the contrary'), trailing blank lines in YAML files, duplication of the nb2py.py utility script across examples, and an out-of-sync README for the vision example. Additionally, there were inconsistencies in the number of partitions used in the vision example's Python script versus its notebook, and a missing repartitioning step for large datasets in the vision notebook.

@ray-gardener ray-gardener bot added docs An issue or change related to documentation data Ray Data-related issues labels Jan 6, 2026
Aydin Abiar added 2 commits January 6, 2026 11:24
Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Copy link
Contributor

@nrghosh nrghosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • left comments inline
  • broader and other detailed comments below

Both Notebooks

  • detokenize=false issue
  • Job config filename mismatches

Comments & Explanations

  • max_model_len is a hard sequence length cap, not an "estimate." Sequences exceeding this will error at runtime
  • Repartition numbers (64, 128, 256) are arbitrary magic numbers with no explained heuristic (should be ~2-4x worker count)
  • batch_size recommendations don't explain the actual constraint (KV cache memory per sequence)

Performance Section

  • concurrency and batch size advice is kind of circular, vague and not super actionable—no guidance on how to determine or actually think about those values
  • quantization examples need caveats / disclaimers and explanation about model arch / hardware type / quantization method, they are not interchangable
  • Model parallelism example suddenly switches to different model (Llama 90B) without explanation— and model ref switches between a lot of the examples - confusing to walk through

Missing Content

  • No error handling for malformed inputs (corrupt images, bad data)
  • No checkpointing example or discussion despite intro mentioning it

Text Notebook Specific

Use Case & Prompt

  • Date reformatting seems like a trivial use-case even for a demo, why do something that a regex could do?
  • temperature=0.3 why? why not 0?
  • No output validation in postprocess—should probably validate MM-DD-YYYY format
  • mention guided/constrained decoding to guarantee valid output?
  • String concatenation is missing a space: "...MM-DD-YYYY.""Be concise..."

Vision Notebook Specific

Code

  • explain why 225x225 resize? model specific? users may not understand why or how to adapt.
  • Notebook does %pip install datasets - maybe pin version?

Explanations

  • explain why max_model_len=8192 vs text's 256 (32x difference) - talk about vision overhead?
  • "go over your Anyscale Job" → "go to your Anyscale Job"
  • show image validation as part of the demo?
def preprocess(row):
    try:
        image = Image.open(BytesIO(row['jpg']['bytes'])).convert('RGB')
    except Exception as e:
        return None

and maybe show off how Ray Data handles None returns (or use filter() to drop failures).


Both

Performance remarks / tips and pasted code examples doesn't read super smoothly and is confusing (if I put on my user hat and look at it like the first time) - jumping between models, magic numbners, etc - could be a lot more thorough in explanation. Will defer to https://github.com/anyscale/docs/pull/1626/changes for more detailed tuning and performance focus.

Aydin Abiar added 2 commits January 7, 2026 17:25
…of batch size, concurrency, more refs to docs links, refactor quantization and model parallelism section for more readability, add image validation, mention anyscale runtime, pin datasets version

Signed-off-by: Aydin Abiar <[email protected]>
@Aydin-ab
Copy link
Contributor Author

Aydin-ab commented Jan 8, 2026

@nrghosh
Followed your suggestions + the ones in your main comment (changing the prompt task etc)
I'll test the new code tomorrow but if the content looks ok now let me know 👍 Thanks a lot

@nrghosh
Copy link
Contributor

nrghosh commented Jan 8, 2026

/gemini review

Copy link
Contributor

@nrghosh nrghosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @Aydin-ab - see also cursor/gemini comments when it comes to code. As long as you are able to run them successfully, should be free of serious bugs now.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request introduces new LLM batch inference examples for both text and vision data, along with their corresponding CI configurations and helper scripts. The changes effectively split the existing template into two distinct, independently readable content pieces, which is a good refactoring step. The new examples demonstrate how to use Ray Data LLM APIs for batch inference, including data preparation, processor configuration, and scaling considerations. The addition of CI scripts ensures these examples remain functional.

However, there are a few areas that could be improved for robustness and clarity:

  • The nb2py.py scripts use specific string matching to modify dataset limits for CI. This approach is brittle and could break if the exact string in the notebook changes.
  • Some comments in the Jupyter notebooks are slightly misleading regarding dataset size limits.
  • The standalone Python scripts contain hardcoded configuration values that would ideally be configurable for real-world use cases.

Aydin Abiar added 3 commits January 8, 2026 11:26
Aydin Abiar added 4 commits January 8, 2026 13:43
Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues docs An issue or change related to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants