Add parquet input support (fixes #40) #41
Pull request overview
This PR adds support for loading documents from Parquet files, addressing issue #40. The implementation refactors the document loading code by extracting common DataFrame processing logic into a reusable helper function, which improves code maintainability and reduces duplication between CSV and Parquet format handling.
- Introduced `InputDataType.PARQUET` enum value and implemented `load_parquet_doc()` and `load_parquet_dir()` functions for Parquet file support
- Refactored common DataFrame processing logic into `_load_docs_from_dataframe()` helper function to support both CSV and Parquet formats
- Added comprehensive parametrized tests covering both CSV and Parquet formats with file and directory scenarios
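The refactoring pattern described above can be illustrated with a minimal sketch: format-specific readers produce a DataFrame, and one shared helper handles everything after that. This is not the PR's actual code. The names `_rows_from_dataframe`, `load_csv_rows`, `load_parquet_rows`, `loader_for`, and the `text_column` parameter are hypothetical, and the `CSV = "csv"` member and the `str`-based enum are assumptions; only `PARQUET = "parquet"` is confirmed by the PR.

```python
from enum import Enum
from pathlib import Path
from typing import Any, Callable


class InputDataType(str, Enum):
    """Illustrative stand-in for the project's enum (assumed shape)."""

    CSV = "csv"          # assumption: not shown in the PR summary
    PARQUET = "parquet"  # value stated in the PR summary


def _rows_from_dataframe(df: Any, text_column: str = "text") -> list[dict]:
    """Hypothetical shared step: validate the frame and return row dicts."""
    if text_column not in df.columns:
        raise ValueError(f"missing required column: {text_column}")
    return df.to_dict(orient="records")


def load_csv_rows(path: str, text_column: str = "text") -> list[dict]:
    """Read a CSV file, then delegate to the shared helper."""
    import pandas as pd  # imported lazily so the dispatch logic has no hard dependency

    return _rows_from_dataframe(pd.read_csv(path), text_column)


def load_parquet_rows(path: str, text_column: str = "text") -> list[dict]:
    """Read a Parquet file (pandas needs pyarrow or fastparquet), then delegate."""
    import pandas as pd

    return _rows_from_dataframe(pd.read_parquet(path), text_column)


# One entry per supported format; adding a format means adding one reader.
LOADERS: dict[InputDataType, Callable[..., list[dict]]] = {
    InputDataType.CSV: load_csv_rows,
    InputDataType.PARQUET: load_parquet_rows,
}


def loader_for(path: str) -> Callable[..., list[dict]]:
    """Pick a reader from the file extension; unknown extensions raise ValueError."""
    suffix = Path(path).suffix.lstrip(".").lower()
    return LOADERS[InputDataType(suffix)]
```

With this shape, the CSV and Parquet readers differ only in the single `pd.read_*` call, which is the kind of duplication the helper extraction is meant to remove.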
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/autod/io/document_test.py | New comprehensive test suite for document loading functionality, with parametrized tests covering CSV, Parquet, JSON, and text formats |
| tests/autod/io/__init__.py | Added package initialization file with copyright header |
| tests/autod/__init__.py | Added package initialization file with copyright header |
| benchmark_qed/autod/io/document.py | Refactored CSV loading to use new _load_docs_from_dataframe() helper, implemented load_parquet_doc() and load_parquet_dir(), updated create_documents() to support Parquet format, and improved load_documents() attribute handling |
| benchmark_qed/autod/io/enums.py | Added PARQUET = "parquet" enum value to InputDataType |
| ruff.toml | Added flake8-copyright configuration with regex pattern for Microsoft copyright headers |
| pyproject.toml | Updated test coverage path from tests/unit to tests and added pytest configuration for temporary path retention |
| .gitignore | Uncommented .idea/ to ignore JetBrains IDE files |
| .github/workflows/python-ci.yml | Added test execution step to CI workflow |
This PR adds support for loading documents from Parquet files, refactors document loading code for better reusability, and improves the testing infrastructure.