@pbourke commented Dec 17, 2025

This PR adds support for loading documents from Parquet files, refactors document loading code for better reusability, and improves the testing infrastructure.

  • Introduced InputDataType.PARQUET enum value and implemented load_parquet_doc() and load_parquet_dir() functions
  • Extracted common DataFrame processing logic into _load_docs_from_dataframe() helper function to support both CSV and Parquet formats
  • Added comprehensive tests in tests/autod/io/document_test.py
  • Added test step to GitHub Actions workflow
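The refactoring described in the list above can be sketched roughly as follows. This is not the code from the PR: only the enum value PARQUET = "parquet" comes from the summary, while the Document class, docs_from_rows(), load_csv(), load_parquet(), load(), and the "text" column name are hypothetical stand-ins. To stay runnable with the standard library alone, the shared helper takes plain row dictionaries instead of a DataFrame, and the Parquet reader assumes pandas with a Parquet engine (pyarrow or fastparquet) is installed.

```python
import csv
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import Any, Iterable, Mapping


class InputDataType(str, Enum):
    CSV = "csv"
    PARQUET = "parquet"  # value as listed in the PR summary


@dataclass
class Document:
    """Hypothetical stand-in for the project's document type."""

    id: str
    text: str
    attributes: dict[str, Any] = field(default_factory=dict)


def docs_from_rows(
    rows: Iterable[Mapping[str, Any]], text_column: str = "text"
) -> list[Document]:
    """Shared conversion step: every format reduces to rows, then to Documents."""
    docs = []
    for index, row in enumerate(rows):
        # Every column other than the text column is kept as an attribute.
        attributes = {k: v for k, v in row.items() if k != text_column}
        docs.append(
            Document(id=str(index), text=str(row[text_column]), attributes=attributes)
        )
    return docs


def load_csv(path: str) -> list[Document]:
    """Read a CSV file and hand its rows to the shared helper."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return docs_from_rows(csv.DictReader(handle))


def load_parquet(path: str) -> list[Document]:
    """Read a Parquet file; assumes pandas plus a Parquet engine is available."""
    import pandas as pd  # imported lazily so the CSV path has no extra dependency

    return docs_from_rows(pd.read_parquet(path).to_dict(orient="records"))


# Format-specific readers differ only in how they produce rows.
LOADERS = {InputDataType.CSV: load_csv, InputDataType.PARQUET: load_parquet}


def load(path: str, input_type: str) -> list[Document]:
    """Dispatch on the input type, e.g. load("docs.parquet", "parquet")."""
    return LOADERS[InputDataType(input_type)](path)
```

With this shape, supporting a further format means adding one enum member and one small reader, while the row-to-document logic stays in a single place.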

@pbourke marked this pull request as ready for review December 17, 2025 23:19

Copilot AI left a comment

Pull request overview

This PR adds support for loading documents from Parquet files, addressing issue #40. The implementation refactors the document loading code by extracting common DataFrame processing logic into a reusable helper function, which improves code maintainability and reduces duplication between CSV and Parquet format handling.

  • Introduced InputDataType.PARQUET enum value and implemented load_parquet_doc() and load_parquet_dir() functions for Parquet file support
  • Refactored common DataFrame processing logic into _load_docs_from_dataframe() helper function to support both CSV and Parquet formats
  • Added comprehensive parametrized tests covering both CSV and Parquet formats with file and directory scenarios

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/autod/io/document_test.py | New comprehensive test suite for document loading functionality, with parametrized tests covering CSV, Parquet, JSON, and text formats |
| tests/autod/io/\_\_init\_\_.py | Added package initialization file with copyright header |
| tests/autod/\_\_init\_\_.py | Added package initialization file with copyright header |
| benchmark_qed/autod/io/document.py | Refactored CSV loading to use new _load_docs_from_dataframe() helper, implemented load_parquet_doc() and load_parquet_dir(), updated create_documents() to support Parquet format, and improved load_documents() attribute handling |
| benchmark_qed/autod/io/enums.py | Added PARQUET = "parquet" enum value to InputDataType |
| ruff.toml | Added flake8-copyright configuration with regex pattern for Microsoft copyright headers |
| pyproject.toml | Updated test coverage path from tests/unit to tests and added pytest configuration for temporary path retention |
| .gitignore | Uncommented .idea/ to ignore JetBrains IDE files |
| .github/workflows/python-ci.yml | Added test execution step to CI workflow |

@pbourke merged commit b85285f into main Dec 19, 2025
4 checks passed
@pbourke deleted the feat/add_parquet_input branch December 19, 2025 00:53