
Using dbpedia-entities-openai-1M as a dataset #1

Merged
binarycleric merged 14 commits into main from using-real-data-for-vectors on Aug 23, 2025
Conversation

@binarycleric

Owner

No description provided.

@binarycleric binarycleric requested a review from Copilot August 18, 2025 18:08

Copilot AI left a comment


Pull Request Overview

This PR transitions the Tobias gem from generating random vector data to using the real-world "dbpedia-entities-openai-1M" dataset from Hugging Face. The change improves testing realism by using actual data instead of synthetic vectors.

Key changes:

  • Replaces random vector generation with Hugging Face dataset download and processing
  • Implements parallel processing for improved data loading performance
  • Refactors database access patterns for consistency across scripts

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tobias.gemspec | Adds parallel processing dependency |
| scripts/vector.rb | Replaces random data with Hugging Face dataset processing |
| scripts/tpcc.rb | Updates database access to use consistent `db.` prefix |
| scripts/stress.rb | Implements parallel data loading and adds new queries |
| lib/tobias/evaluations/work_mem.rb | Fixes typo and simplifies query execution |
| lib/tobias/evaluations/base.rb | Removes database parameter passing |
| lib/tobias/container.rb | Major refactor to inject database dependency and add parallel processing |
| lib/tobias/cli.rb | Updates container initialization with database injection |
| lib/tobias.rb | Adds parallel library requirement and PostgreSQL environment fix |
| README.md | Documents Hugging Face CLI prerequisite |


Comment thread scripts/vector.rb Outdated
Comment thread scripts/vector.rb Outdated
end

def download_from_hugging_face(repo, local_dir="/tmp/#{repo}")
  `hf download #{repo} --repo-type=dataset --local-dir #{local_dir}`

Copilot AI Aug 18, 2025


Direct string interpolation in shell commands is vulnerable to command injection. Use proper shell escaping or consider using a Ruby library like Open3 with proper parameter sanitization.

Suggested change
`hf download #{repo} --repo-type=dataset --local-dir #{local_dir}`
stdout, status = Open3.capture2("hf", "download", repo, "--repo-type=dataset", "--local-dir", local_dir)
stdout
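
The suggestion above can be sketched as a runnable example. This is only an illustration under stated assumptions: `hf_download_argv` and `run_argv` are hypothetical helper names that do not appear in the PR, and the exit-status check is an addition the suggested change does not include. Passing each argument separately to `Open3.capture2` skips the shell, so metacharacters in `repo` or `local_dir` are treated as literal text.

```ruby
require "open3"

# Hypothetical helper (not from the PR): builds the argument vector for
# `hf download`, keeping repo and local_dir as separate, unescaped arguments.
def hf_download_argv(repo, local_dir)
  ["hf", "download", repo, "--repo-type=dataset", "--local-dir", local_dir]
end

# Hypothetical helper (not from the PR): runs a command given as an argv
# array. With more than one argument, Open3 spawns the program directly
# instead of going through a shell.
def run_argv(argv)
  stdout, status = Open3.capture2(*argv)
  raise "#{argv.first} exited with status #{status.exitstatus}" unless status.success?
  stdout
end

# Usage (assumes the Hugging Face CLI is installed and on PATH):
#   run_argv(hf_download_argv(repo, "/tmp/#{repo}"))
```

Raising on a non-zero exit status is a design choice of this sketch; the backtick version in the diff would silently return whatever output was produced.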

Comment thread scripts/vector.rb Outdated
Comment thread lib/tobias/evaluations/work_mem.rb
Comment thread lib/tobias/container.rb
run_action(query, context).sql
end
def run_query(query)
@database.run(run_action(query).sql)

Copilot AI Aug 18, 2025


The method calls run_action(query).sql, which assumes the result responds to .sql. This works when the query block returns a Sequel dataset, but fails if it returns any other type. Consider adding error handling or type checking.

Suggested change
@database.run(run_action(query).sql)
result = run_action(query)
sql = if result.respond_to?(:sql)
  result.sql
elsif result.is_a?(String)
  result
else
  raise TypeError, "Query block must return a Sequel dataset or a String, got #{result.class}"
end
@database.run(sql)
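
To see how the suggested type check behaves, here is a minimal standalone sketch. It assumes nothing about Tobias internals: `FakeDataset` is a hypothetical stand-in for a Sequel dataset (any object that responds to `sql`), and `extract_sql` is a hypothetical name for the branching logic pulled out of `run_query`.

```ruby
# Hypothetical stand-in for a Sequel dataset: an object that responds to #sql.
FakeDataset = Struct.new(:sql)

# Hypothetical helper (not from the PR): turns a query block's return value
# into an SQL string, or raises if the value is neither dataset-like nor a String.
def extract_sql(result)
  if result.respond_to?(:sql)
    result.sql
  elsif result.is_a?(String)
    result
  else
    raise TypeError, "Query block must return a Sequel dataset or a String, got #{result.class}"
  end
end

extract_sql(FakeDataset.new("SELECT 1")) # dataset-like object: uses #sql
extract_sql("SELECT 2")                  # plain String: passed through unchanged
# extract_sql(42) raises TypeError, since Integer is neither
```

Checking `respond_to?(:sql)` rather than a specific class keeps the check duck-typed, so it would accept any object exposing `sql`.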

@binarycleric binarycleric marked this pull request as ready for review August 23, 2025 13:33
@binarycleric binarycleric changed the title Trying to use dbpedia-entities-openai-1M as a dataset Using dbpedia-entities-openai-1M as a dataset Aug 23, 2025
@binarycleric binarycleric merged commit 53547ce into main Aug 23, 2025
1 check passed
@binarycleric binarycleric deleted the using-real-data-for-vectors branch August 23, 2025 13:46
binarycleric added a commit that referenced this pull request Aug 23, 2025
Using dbpedia-entities-openai-1M as a dataset