-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Cleanup/userfile indexing #7221
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Greptile SummaryThis PR successfully completes the cleanup of legacy user file document ID migration infrastructure. The changes remove special-case handling for user files throughout the indexing pipeline, unifying them with standard connector processing. Key Changes:
Impact:
Confidence Score: 5/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant Beat as Celery Beat
participant CheckIndexing as check_for_indexing
participant DB as Database
participant TaskCreation as try_creating_docfetching_task
participant DocFetchQueue as CONNECTOR_DOC_FETCHING Queue
participant DocProcessing as docprocessing_task
participant VespaIndex as Vespa Index
Note over Beat,VespaIndex: Before: User files had separate queue and special handling
Note over Beat,VespaIndex: After: All connectors use unified indexing pipeline
Beat->>CheckIndexing: Periodic trigger
CheckIndexing->>DB: fetch_indexable_standard_connector_credential_pair_ids()
Note over CheckIndexing,DB: No longer fetches user files separately
DB-->>CheckIndexing: cc_pair_ids (includes all connectors)
loop For each cc_pair
CheckIndexing->>TaskCreation: Create indexing task
Note over TaskCreation: Removed is_user_file queue selection logic
TaskCreation->>DocFetchQueue: Queue task (all use same queue)
end
DocFetchQueue->>DocProcessing: Process document
DocProcessing->>VespaIndex: Index document
Note over DocProcessing,VespaIndex: Removed document_id migration logic
DocProcessing->>DB: Update cc_pair status
Note over DocProcessing,DB: Set to ACTIVE (no longer pauses user files)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (1)
-
backend/onyx/configs/app_configs.py, line 672 (link)style:
USER_FILE_INDEXING_LIMITis no longer used after removingfetch_indexable_user_file_connector_credential_pair_ids. Remove this unused constant.Context Used: Rule from
dashboard- When hardcoding a boolean variable to a constant value, remove the variable entirely and clean up al... (source)
25 files reviewed, 1 comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 issue found across 25 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/alembic/versions/a3c1a7904cd0_remove_userfile_related_deprecated_.py">
<violation number="1" location="backend/alembic/versions/a3c1a7904cd0_remove_userfile_related_deprecated_.py:31">
P1: Downgrade will fail if `user_file` table has existing rows. Adding a NOT NULL column without a `server_default` causes PostgreSQL to reject the operation when rows exist. Consider using `nullable=True` (like the similar migration in `2b75d0a8ffcb`) since data is lost during upgrade anyway.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| ) | ||
| op.add_column( | ||
| "user_file", | ||
| sa.Column("document_id", sa.String(), nullable=False), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P1: Downgrade will fail if user_file table has existing rows. Adding a NOT NULL column without a server_default causes PostgreSQL to reject the operation when rows exist. Consider using nullable=True (like the similar migration in 2b75d0a8ffcb) since data is lost during upgrade anyway.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/alembic/versions/a3c1a7904cd0_remove_userfile_related_deprecated_.py, line 31:
<comment>Downgrade will fail if `user_file` table has existing rows. Adding a NOT NULL column without a `server_default` causes PostgreSQL to reject the operation when rows exist. Consider using `nullable=True` (like the similar migration in `2b75d0a8ffcb`) since data is lost during upgrade anyway.</comment>
<file context>
@@ -0,0 +1,32 @@
+ )
+ op.add_column(
+ "user_file",
+ sa.Column("document_id", sa.String(), nullable=False),
+ )
</file context>
✅ Addressed in 643df9b
643df9b to
eec9f48
Compare
| project_ids=project_ids, | ||
| ) | ||
|
|
||
| old_doc_id_to_new_doc_id: dict[str, str] = dict() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
| op.drop_column("connector_credential_pair", "is_user_file") | ||
|
|
||
|
|
||
| def downgrade() -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
want someone who was around when this was implemented and has context to sanity check this, once this migration runs on prod we're done this data is gone
c0483af to
be59d18
Compare
- Removed user files indexing worker from supervisord configuration. - Deleted user file doc ID migration task and its references from the codebase. - Updated task queues to eliminate user files indexing. - Cleaned up related constants and configurations in the Helm charts.
…aIndex - Deleted the legacy document_id field from UserFile model. - Removed temporary document ID handling from VespaIndex update method and related functions. - Cleaned up associated comments and TODOs regarding document ID migration.
- Deleted user file related fields from the database schema, including and . - Updated various functions and queries to eliminate user file handling. - Cleaned up related code and comments across multiple modules
…datable update method
…ve user_file indexing limit configuration
… model and update migration script - Deleted the document_id_migrated field from the UserFile model. - Updated the migration script to drop the corresponding column from the user_file table.
…guration - Bumped the Onyx Helm chart version from 0.4.17 to 0.4.18. - Removed the celery_worker_user_files_indexing configuration from values.yaml to clean up unused settings.
…rove logging - Removed the old_doc_id_to_new_doc_id parameter from the OpenSearchDocumentIndex update method. - Updated logging to use a defined priority variable for clarity in the connector module.
…erfile deprecations
be59d18 to
c726b72
Compare
Description
This pull request removes all code, database fields, and Celery tasks related to legacy user file document ID migration, as this migration is now complete. It simplifies the codebase by eliminating special handling for user files in indexing and processing logic, and cleans up related monitoring and configuration.
How Has This Been Tested?
Tested byindexing documents from connectors and user files
Additional Options
Summary by cubic
Remove legacy user-file indexing and document ID migration. Consolidates all indexing into the standard connector pipeline and removes the user-files indexing worker, queues, tasks, and deprecated schema.
Refactors
Migration
Written for commit c726b72. Summary will update on new commits.