@Subash-Mohan
Contributor

@Subash-Mohan Subash-Mohan commented Jan 6, 2026

Description

This pull request removes all code, database fields, and Celery tasks related to legacy user file document ID migration, as this migration is now complete. It simplifies the codebase by eliminating special handling for user files in indexing and processing logic, and cleans up related monitoring and configuration.

How Has This Been Tested?

Tested by indexing documents from connectors and user files.

Additional Options

  • [Optional] Override Linear Check

Summary by cubic

Remove legacy user-file indexing and document ID migration. Consolidates all indexing into the standard connector pipeline and removes the user-files indexing worker, queues, tasks, and deprecated schema.

  • Refactors

    • Dropped user_file.document_id and document_id_migrated, and connector_credential_pair.is_user_file via migration.
    • Removed the user file doc ID migration task and related helpers.
    • Deleted the USER_FILES_INDEXING queue and worker; cleaned up constants, beat schedule, task creation, monitoring, tests, scripts, supervisord, and Helm charts.
    • Simplified indexing: no special handling for user files; unified queues and priorities; removed user-file CC pair queries; connector status no longer forced to PAUSED.
    • Removed temporary support for updating document_id during metadata updates across index implementations.
  • Migration

    • Run the DB migration.
    • Remove any Helm values/deployments for the user-files indexing worker; redeploy Celery with only connector_doc_fetching.
    • Update monitoring to stop tracking the user_files_indexing queue.

Written for commit c726b72. Summary will update on new commits.
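
The status change mentioned in the summary ("connector status no longer forced to PAUSED") can be illustrated with a minimal sketch. The names `Status`, `CCPair`, `status_after_success_before`, and `status_after_success_after` are hypothetical and exist only for this illustration; they are not taken from the Onyx codebase, and the real logic lives in the docprocessing tasks.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    ACTIVE = "ACTIVE"
    PAUSED = "PAUSED"


@dataclass
class CCPair:
    # Hypothetical stand-in for a connector/credential pair.
    id: int
    is_user_file: bool = False  # the flag this PR drops from the schema


def status_after_success_before(cc_pair: CCPair) -> Status:
    # Assumed legacy behavior: user-file pairs were paused after a successful run.
    return Status.PAUSED if cc_pair.is_user_file else Status.ACTIVE


def status_after_success_after(cc_pair: CCPair) -> Status:
    # Assumed new behavior: no special case, every pair follows the standard lifecycle.
    return Status.ACTIVE


user_file_pair = CCPair(id=1, is_user_file=True)
connector_pair = CCPair(id=2)
```

With these assumptions, only the user-file pair changes outcome between the two functions; regular connectors end up ACTIVE either way.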

@Subash-Mohan Subash-Mohan requested a review from a team as a code owner January 6, 2026 08:56
@greptile-apps
Contributor

greptile-apps bot commented Jan 6, 2026

Greptile Summary

This PR successfully completes the cleanup of legacy user file document ID migration infrastructure. The changes remove special-case handling for user files throughout the indexing pipeline, unifying them with standard connector processing.

Key Changes:

  • Removes database fields user_file.document_id and connector_credential_pair.is_user_file via migration
  • Deletes entire dedicated Celery worker for user files indexing (celery_worker_user_files_indexing) and related Helm deployment resources
  • Removes USER_FILE_DOCID_MIGRATION periodic task and USER_FILES_INDEXING queue
  • Simplifies indexing logic by removing user file special casing - all connectors now use the same CONNECTOR_DOC_FETCHING queue
  • Removes document ID migration logic from Vespa index update methods
  • Eliminates 60+ test cases for deprecated user file queue routing

Impact:

  • Reduces code complexity by ~750 lines
  • Removes infrastructure overhead of maintaining separate worker and queue
  • Simplifies monitoring by removing user files indexing queue metrics
  • User files now follow standard connector lifecycle (ACTIVE status on success instead of PAUSED)

Confidence Score: 5/5

  • This PR is safe to merge - it's a well-executed cleanup of completed migration code
  • The cleanup is comprehensive and systematic across all layers (database, application logic, infrastructure, and tests). Only one minor unused constant remains. The removal of deprecated migration code is straightforward with no complex logic changes.
  • No files require special attention - the only issue is removing the unused USER_FILE_INDEXING_LIMIT constant in backend/onyx/configs/app_configs.py

Important Files Changed

| Filename | Overview |
| --- | --- |
| backend/alembic/versions/a3c1a7904cd0_remove_userfile_related_deprecated_.py | Removes deprecated document_id from user_file and is_user_file from connector_credential_pair tables |
| backend/onyx/db/models.py | Removes is_user_file field from ConnectorCredentialPair and document_id field from UserFile models |
| backend/onyx/configs/constants.py | Removes legacy constants: USER_FILE_DOCID_MIGRATION task, USER_FILES_INDEXING queue, and related lock definitions |
| backend/onyx/db/connector_credential_pair.py | Removes is_user_file filtering logic and fetch_indexable_user_file_connector_credential_pair_ids function |
| backend/onyx/background/celery/tasks/docprocessing/tasks.py | Removes user file special casing in indexing logic - no longer pauses user file connectors on success or fetches them separately |
| backend/onyx/document_index/vespa/vespa_document_index.py | Removes old_doc_id_to_new_doc_id parameter and related document ID migration logic from update methods |
| backend/supervisord.conf | Removes celery_worker_user_files_indexing worker configuration and related logging |
| deployment/helm/charts/onyx/templates/celery-worker-user-files-indexing.yaml | Deletes entire Helm deployment for user files indexing worker |

Sequence Diagram

sequenceDiagram
    participant Beat as Celery Beat
    participant CheckIndexing as check_for_indexing
    participant DB as Database
    participant TaskCreation as try_creating_docfetching_task
    participant DocFetchQueue as CONNECTOR_DOC_FETCHING Queue
    participant DocProcessing as docprocessing_task
    participant VespaIndex as Vespa Index

    Note over Beat,VespaIndex: Before: User files had separate queue and special handling
    Note over Beat,VespaIndex: After: All connectors use unified indexing pipeline

    Beat->>CheckIndexing: Periodic trigger
    CheckIndexing->>DB: fetch_indexable_standard_connector_credential_pair_ids()
    Note over CheckIndexing,DB: No longer fetches user files separately
    DB-->>CheckIndexing: cc_pair_ids (includes all connectors)
    
    loop For each cc_pair
        CheckIndexing->>TaskCreation: Create indexing task
        Note over TaskCreation: Removed is_user_file queue selection logic
        TaskCreation->>DocFetchQueue: Queue task (all use same queue)
    end

    DocFetchQueue->>DocProcessing: Process document
    DocProcessing->>VespaIndex: Index document
    Note over DocProcessing,VespaIndex: Removed document_id migration logic
    
    DocProcessing->>DB: Update cc_pair status
    Note over DocProcessing,DB: Set to ACTIVE (no longer pauses user files)
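
The unified routing in the diagram can be sketched in a few lines of plain Python. This is an in-memory illustration, not the Celery implementation: the queue name `connector_doc_fetching` comes from the PR summary, while `enqueue_all`, `process_all`, and the `deque`-backed queue are assumptions made for the example.

```python
from collections import deque

# Queue name as given in the PR summary; the deque is a stand-in for a Celery queue.
CONNECTOR_DOC_FETCHING = "connector_doc_fetching"


def enqueue_all(cc_pair_ids, queues):
    # After the cleanup there is no per-pair queue selection:
    # every cc_pair id goes to the same queue.
    for cc_pair_id in cc_pair_ids:
        queues[CONNECTOR_DOC_FETCHING].append(cc_pair_id)


def process_all(queues):
    # Drain the single queue in FIFO order and report what was processed.
    processed = []
    pending = queues[CONNECTOR_DOC_FETCHING]
    while pending:
        processed.append(pending.popleft())
    return processed


queues = {CONNECTOR_DOC_FETCHING: deque()}
enqueue_all([101, 102, 103], queues)  # ids for connectors and user files alike
queued = list(queues[CONNECTOR_DOC_FETCHING])
processed = process_all(queues)
```

The point of the sketch is that the `queues` mapping needs only one key; a second entry for user files is no longer required.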

Contributor

@greptile-apps greptile-apps bot left a comment


Additional Comments (1)

  1. backend/onyx/configs/app_configs.py, line 672

    style: USER_FILE_INDEXING_LIMIT is no longer used after removing fetch_indexable_user_file_connector_credential_pair_ids. Remove this unused constant.

    Context Used: Rule from dashboard - When hardcoding a boolean variable to a constant value, remove the variable entirely and clean up al... (source)

25 files reviewed, 1 comment


Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 25 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/alembic/versions/a3c1a7904cd0_remove_userfile_related_deprecated_.py">

<violation number="1" location="backend/alembic/versions/a3c1a7904cd0_remove_userfile_related_deprecated_.py:31">
P1: Downgrade will fail if `user_file` table has existing rows. Adding a NOT NULL column without a `server_default` causes PostgreSQL to reject the operation when rows exist. Consider using `nullable=True` (like the similar migration in `2b75d0a8ffcb`) since data is lost during upgrade anyway.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

    )
    op.add_column(
        "user_file",
        sa.Column("document_id", sa.String(), nullable=False),
Contributor

@cubic-dev-ai cubic-dev-ai bot Jan 6, 2026


P1: Downgrade will fail if user_file table has existing rows. Adding a NOT NULL column without a server_default causes PostgreSQL to reject the operation when rows exist. Consider using nullable=True (like the similar migration in 2b75d0a8ffcb) since data is lost during upgrade anyway.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/alembic/versions/a3c1a7904cd0_remove_userfile_related_deprecated_.py, line 31:

<comment>Downgrade will fail if `user_file` table has existing rows. Adding a NOT NULL column without a `server_default` causes PostgreSQL to reject the operation when rows exist. Consider using `nullable=True` (like the similar migration in `2b75d0a8ffcb`) since data is lost during upgrade anyway.</comment>

<file context>
@@ -0,0 +1,32 @@
+    )
+    op.add_column(
+        "user_file",
+        sa.Column("document_id", sa.String(), nullable=False),
+    )
</file context>

✅ Addressed in 643df9b
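
The failure mode flagged in this thread can be reproduced in miniature. The sketch below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL, purely so the example runs without a database server; the error text differs between the two engines, but both reject adding a NOT NULL column with no default to a table that already holds rows. The two-column `user_file` table and the `try_add_column` helper are hypothetical.

```python
import sqlite3


def try_add_column(conn, ddl):
    """Return True if the ALTER TABLE succeeds, False if the database rejects it."""
    try:
        conn.execute(ddl)
        return True
    except sqlite3.Error:
        return False


conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table with one existing row.
conn.execute("CREATE TABLE user_file (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO user_file (name) VALUES ('report.pdf')")

# NOT NULL without a default: there is no value to put in the existing row.
not_null_ok = try_add_column(
    conn, "ALTER TABLE user_file ADD COLUMN document_id TEXT NOT NULL"
)

# Nullable column: existing rows simply get NULL.
nullable_ok = try_add_column(
    conn, "ALTER TABLE user_file ADD COLUMN document_id TEXT"
)

rows = conn.execute("SELECT name, document_id FROM user_file").fetchall()
```

Because the upgrade drops the column's data anyway, a nullable column in the downgrade loses nothing that a NOT NULL column would have preserved.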

@Subash-Mohan Subash-Mohan force-pushed the cleanup/userfile-indexing branch from 643df9b to eec9f48 Compare January 6, 2026 10:13
project_ids=project_ids,
)

old_doc_id_to_new_doc_id: dict[str, str] = dict()
Contributor


nice

op.drop_column("connector_credential_pair", "is_user_file")


def downgrade() -> None:
Contributor


Want someone who was around when this was implemented and has context to sanity-check this. Once this migration runs on prod we're done: this data is gone.

@Subash-Mohan Subash-Mohan force-pushed the cleanup/userfile-indexing branch from c0483af to be59d18 Compare January 7, 2026 08:21
- Removed user files indexing worker from supervisord configuration.
- Deleted user file doc ID migration task and its references from the codebase.
- Updated task queues to eliminate user files indexing.
- Cleaned up related constants and configurations in the Helm charts.
…aIndex

- Deleted the legacy document_id field from UserFile model.
- Removed temporary document ID handling from VespaIndex update method and related functions.
- Cleaned up associated comments and TODOs regarding document ID migration.
- Deleted user file related fields from the database schema, including  and .
- Updated various functions and queries to eliminate user file handling.
- Cleaned up related code and comments across multiple modules
… model and update migration script

- Deleted the document_id_migrated field from the UserFile model.
- Updated the migration script to drop the corresponding column from the user_file table.
…guration

- Bumped the Onyx Helm chart version from 0.4.17 to 0.4.18.
- Removed the celery_worker_user_files_indexing configuration from values.yaml to clean up unused settings.
…rove logging

- Removed the old_doc_id_to_new_doc_id parameter from the OpenSearchDocumentIndex update method.
- Updated logging to use a defined priority variable for clarity in the connector module.
@Subash-Mohan Subash-Mohan force-pushed the cleanup/userfile-indexing branch from be59d18 to c726b72 Compare January 8, 2026 04:53
@Subash-Mohan Subash-Mohan enabled auto-merge January 8, 2026 04:57
@Subash-Mohan Subash-Mohan added this pull request to the merge queue Jan 8, 2026
Merged via the queue into main with commit 8ef8dfd Jan 8, 2026
71 of 73 checks passed
@Subash-Mohan Subash-Mohan deleted the cleanup/userfile-indexing branch January 8, 2026 05:12
3 participants