Batch Prediction with Long Context and Context Caching using Google Gemini API in Colab - #550 #602

Open · wants to merge 1 commit into main

Conversation

william-Dic

Description of the feature request:

This feature request aims to develop a robust, production-ready code sample that demonstrates how to perform batch prediction with the Google Gemini API using long context and context caching. The primary use case is extracting information from large video content—such as lectures or documentaries—by asking multiple, potentially interconnected questions.

Key aspects of the feature include:

  • Batch Prediction: Efficiently submitting a batch of questions in a way that minimizes API calls and handles rate limits, possibly by dividing the questions into smaller batches.
  • Long Context Handling: Leveraging Gemini’s long context capabilities to provide the entire video transcript or segmented summaries as context. This includes strategies to segment and summarize transcripts that exceed maximum context limits.
  • Context Caching: Implementing persistent context caching (using, for example, a JSON file) to store and reuse previous summarizations and conversation history, thereby reducing redundant API calls and improving response times (a rough sketch follows this list).
  • Interconnected Questions: Supporting conversational history so that each question can build upon previous answers, leading to more accurate and relevant responses.
  • Output Formatting: Delivering clear, structured, and user-friendly outputs, with potential enhancements like clickable links to relevant video timestamps.
  • Robust Error Handling: Ensuring the solution gracefully handles network errors, API failures, and invalid inputs through retries and exponential backoff.
  • Multi-Language Support: Allowing the user to specify the transcript language, accommodating videos in different languages.
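
To make the proposal concrete, here is a minimal sketch of the caching-plus-batching flow, assuming the `google-genai` Python SDK. The model name, TTL, cache file path, and helper names are illustrative choices rather than part of this PR, and cache expiry is not handled:

```python
import json
import os
import time

from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
MODEL = "models/gemini-1.5-flash-001"   # assumed: a model version that supports caching
CACHE_FILE = "context_cache.json"       # assumed location of the persistent cache record


def get_or_create_cache(transcript: str) -> str:
    """Reuse a previously created cached-content name if one was stored on disk."""
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)["cache_name"]   # note: cache expiry is not handled here
    cache = client.caches.create(
        model=MODEL,
        config=types.CreateCachedContentConfig(
            system_instruction="Answer questions about the provided video transcript.",
            contents=[transcript],
            ttl="3600s",
        ),
    )
    with open(CACHE_FILE, "w") as f:
        json.dump({"cache_name": cache.name}, f)
    return cache.name


def ask_batch(questions: list[str], cache_name: str, max_retries: int = 3) -> list[str]:
    """Ask each question against the cached context, retrying with exponential backoff."""
    answers = []
    for question in questions:
        for attempt in range(max_retries):
            try:
                response = client.models.generate_content(
                    model=MODEL,
                    contents=question,
                    config=types.GenerateContentConfig(cached_content=cache_name),
                )
                answers.append(response.text)
                break
            except Exception:
                time.sleep(2 ** attempt)  # back off on transient/rate-limit errors
    return answers
```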

What problem are you trying to solve with this feature?

The feature addresses the challenge of extracting meaningful insights from lengthy video transcripts. When dealing with large amounts of text, it's difficult to efficiently process and query the information without running into API context limits or making redundant calls. This solution tackles that problem by segmenting and summarizing the transcript, caching context to reduce unnecessary API usage, and maintaining conversation history to answer interconnected questions accurately.
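
For the segmentation step, one possible shape (reusing the `client` and `MODEL` names from the sketch above; the chunk size and prompt wording are arbitrary assumptions) is:

```python
def summarize_long_transcript(transcript: str, chunk_chars: int = 100_000) -> str:
    """Split an oversized transcript into chunks, summarize each, and join the summaries."""
    chunks = [transcript[i:i + chunk_chars] for i in range(0, len(transcript), chunk_chars)]
    summaries = []
    for chunk in chunks:
        response = client.models.generate_content(
            model=MODEL,
            contents="Summarize this transcript segment, keeping any timestamps:\n\n" + chunk,
        )
        summaries.append(response.text)
    return "\n\n".join(summaries)
```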

Demonstration of the Current Gemini Video Analysis Solution

In this demonstration, I use Gemini to analyze an almost two-hour-long video and then ask it questions. The system returns responses asynchronously in under one second.

Demo.mp4

@Giom-V (Collaborator) commented on Mar 24, 2025

Thanks a lot @william-Dic. I'll check internally whether everything is fine on our side and will get back to you.

Giom-V self-assigned this on Mar 31, 2025
@Giom-V (Collaborator) commented on Mar 31, 2025

Hello @william-Dic, I think the example is interesting, but now that we have the YT integration, wouldn't it be easier to just use chat mode, load the video, and ask questions? What's the added value of your workflow?

@JonathanMShaw commented on Apr 3, 2025

How does the billing on this play out? Say I have a 1M-token db and I want to make 1k tiny requests against it, each of which generates a tiny answer. The naive approach is to include the db in every input prompt, for a total of 1M * 1k * input_token_cost. If instead I cache it, I would expect to be charged 1M * 1k * cached_token_cost, which is 1/4 of the naive cost (plus a few dollars an hour for keeping the cache alive). If I cache the db and batch the 1k prompts, does that halve the cost again, making the whole thing 1/8th as expensive as the naive implementation?
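
For reference, here is the arithmetic implied by the question. The relative multipliers (cached tokens at 1/4 the standard input rate, batch mode at half price) are the questioner's assumptions rather than confirmed pricing, whether the two discounts actually stack is exactly what is being asked, and cache storage fees are ignored:

```python
# Back-of-the-envelope cost comparison for the scenario above.
DB_TOKENS = 1_000_000
NUM_REQUESTS = 1_000
INPUT_TOKEN_COST = 1.0                    # arbitrary unit price per input token

naive = DB_TOKENS * NUM_REQUESTS * INPUT_TOKEN_COST
cached = naive * 0.25                     # cached tokens billed at ~1/4 the input rate
cached_and_batched = cached * 0.5         # if batch mode also halves the per-token price

print(cached / naive)                     # 0.25  -> a quarter of the naive cost
print(cached_and_batched / naive)         # 0.125 -> an eighth, if the discounts stack
```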

Labels: component:examples (Issues/PRs referencing the examples folder), status:awaiting review (PR awaiting review from a maintainer)