Using dbpedia-entities-openai-1M as a dataset #1
Most of the work is I/O bound now that we're pulling from an existing dataset.
Pull Request Overview
This PR transitions the Tobias gem from generating random vector data to using the real-world "dbpedia-entities-openai-1M" dataset from Hugging Face. The change improves testing realism by using actual data instead of synthetic vectors.
Key changes:
- Replaces random vector generation with Hugging Face dataset download and processing
- Implements parallel processing for improved data loading performance
- Refactors database access patterns for consistency across scripts
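Because the work is described as I/O bound, concurrent loading is where the gain would come from. The sketch below illustrates the idea with plain Ruby threads rather than the parallel-processing dependency the PR adds, so it runs with the standard library alone. `load_shard` and `load_shards_concurrently` are hypothetical names, and `sleep` stands in for the real download or read.

```ruby
# Hypothetical stand-in for reading one dataset shard; sleep simulates I/O wait.
def load_shard(index)
  sleep 0.05
  Array.new(3) { |row| "shard#{index}-row#{row}" }
end

# Start one thread per shard, then collect the results in shard order.
# Thread#value waits for the thread to finish and returns its block's result.
def load_shards_concurrently(count)
  threads = Array.new(count) { |i| Thread.new { load_shard(i) } }
  threads.flat_map(&:value)
end

rows = load_shards_concurrently(4)
puts rows.length
```

Threads help here only because the simulated work waits on I/O; CPU-bound processing in CRuby would more likely need separate processes.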
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tobias.gemspec | Adds parallel processing dependency |
| scripts/vector.rb | Replaces random data with Hugging Face dataset processing |
| scripts/tpcc.rb | Updates database access to use consistent db. prefix |
| scripts/stress.rb | Implements parallel data loading and adds new queries |
| lib/tobias/evaluations/work_mem.rb | Fixes typo and simplifies query execution |
| lib/tobias/evaluations/base.rb | Removes database parameter passing |
| lib/tobias/container.rb | Major refactor to inject database dependency and add parallel processing |
| lib/tobias/cli.rb | Updates container initialization with database injection |
| lib/tobias.rb | Adds parallel library requirement and PostgreSQL environment fix |
| README.md | Documents Hugging Face CLI prerequisite |

    end

    def download_from_hugging_face(repo, local_dir="/tmp/#{repo}")
      `hf download #{repo} --repo-type=dataset --local-dir #{local_dir}`

Direct string interpolation in shell commands is vulnerable to command injection. Use proper shell escaping or consider using a Ruby library like Open3 with proper parameter sanitization.
Suggested change:

    - `hf download #{repo} --repo-type=dataset --local-dir #{local_dir}`
    + stdout, status = Open3.capture2("hf", "download", repo, "--repo-type=dataset", "--local-dir", local_dir)
    + stdout

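A minimal sketch of why passing the command as separate arguments avoids the injection problem. It uses `echo` as a stand-in for the `hf` CLI so that it runs without the Hugging Face tooling installed, and assumes a Unix-like system where `echo` is available as an executable. The helper name `run_without_shell` is hypothetical and not part of Tobias.

```ruby
require "open3"

# Hypothetical helper (not part of Tobias): runs a command given as an
# argument vector, so no shell ever parses the arguments.
def run_without_shell(*argv)
  stdout, status = Open3.capture2(*argv)
  raise "command failed: #{argv.first} (exit #{status.exitstatus})" unless status.success?
  stdout
end

# A repo name containing shell metacharacters, as an attacker might supply.
malicious_repo = "some/repo; echo injected"

# `echo` stands in for `hf download`; the whole string arrives as one argument.
output = run_without_shell("echo", malicious_repo)
puts output
```

With backticks, the same string would be handed to a shell, which would treat `; echo injected` as a second command. Here it is printed back literally as a single argument.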

      run_action(query, context).sql
    end

    def run_query(query)
      @database.run(run_action(query).sql)

The method calls run_action(query).sql, which assumes the result responds to .sql. That holds when the query block returns a Sequel dataset, but the call can fail if the block returns any other type. Consider adding error handling or type checking.
Suggested change:

    - @database.run(run_action(query).sql)
    + result = run_action(query)
    + sql = if result.respond_to?(:sql)
    +   result.sql
    + elsif result.is_a?(String)
    +   result
    + else
    +   raise TypeError, "Query block must return a Sequel dataset or a String, got #{result.class}"
    + end
    + @database.run(sql)

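The type check in the suggestion can be exercised in isolation, without a database. The sketch below is an illustration only: `FakeDataset` is a stand-in for a Sequel dataset (anything that responds to `sql`), and `extract_sql` is a hypothetical helper name that does not appear in the PR.

```ruby
# Stand-in for a Sequel dataset: any object that responds to #sql.
FakeDataset = Struct.new(:sql)

# Hypothetical helper mirroring the suggested type check.
def extract_sql(result)
  if result.respond_to?(:sql)
    result.sql
  elsif result.is_a?(String)
    result
  else
    raise TypeError, "Query block must return a Sequel dataset or a String, got #{result.class}"
  end
end

from_dataset = extract_sql(FakeDataset.new("SELECT 1"))
from_string = extract_sql("SELECT 2")

# Any other return type raises with a message naming the offending class.
error_message =
  begin
    extract_sql(42)
  rescue TypeError => e
    e.message
  end

puts from_dataset, from_string, error_message
```

Pulling the check into its own method keeps `run_query` short and makes the failure case easy to test separately.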