Cleanup/userfile indexing #7221

Subash-Mohan · 2026-01-06T08:56:30Z

Description

This pull request removes all code, database fields, and Celery tasks related to legacy user file document ID migration, as this migration is now complete. It simplifies the codebase by eliminating special handling for user files in indexing and processing logic, and cleans up related monitoring and configuration.

How Has This Been Tested?

Tested by indexing documents from connectors and user files.

Additional Options

[Optional] Override Linear Check

Summary by cubic

Remove legacy user-file indexing and document ID migration. Consolidates all indexing into the standard connector pipeline and removes the user-files indexing worker, queues, tasks, and deprecated schema.

Refactors
- Dropped user_file.document_id and document_id_migrated, and connector_credential_pair.is_user_file via migration.
- Removed the user file doc ID migration task and related helpers.
- Deleted the USER_FILES_INDEXING queue and worker; cleaned up constants, beat schedule, task creation, monitoring, tests, scripts, supervisord, and Helm charts.
- Simplified indexing: no special handling for user files; unified queues and priorities; removed user-file CC pair queries; connector status no longer forced to PAUSED.
- Removed temporary support for updating document_id during metadata updates across index implementations.

Migration
- Run the DB migration.
- Remove any Helm values/deployments for the user-files indexing worker; redeploy Celery with only connector_doc_fetching.
- Update monitoring to stop tracking the user_files_indexing queue.
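The monitoring step above can be sketched as a small check that flags queue names which no longer exist after this PR. This is only an illustration: `REMOVED_QUEUES` and `prune_removed_queues` are hypothetical names, not part of the Onyx codebase, and the queue strings are taken from the migration notes above.

```python
# Hypothetical helper: not part of the Onyx codebase.
# Queue names come from the migration notes in this PR.
REMOVED_QUEUES = {"user_files_indexing"}


def prune_removed_queues(queues: list[str]) -> tuple[list[str], list[str]]:
    """Split a list of monitored queue names into (kept, stale).

    `stale` holds names that refer to queues removed by this PR and
    should be dropped from dashboards and alerts.
    """
    kept = [q for q in queues if q not in REMOVED_QUEUES]
    stale = [q for q in queues if q in REMOVED_QUEUES]
    return kept, stale


# Example: a monitoring config that still lists the removed queue.
monitored = ["connector_doc_fetching", "user_files_indexing"]
kept, stale = prune_removed_queues(monitored)
```

Running a check like this against the deployed monitoring config before redeploying would surface any leftover references to the removed queue.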

Written for commit c726b72. Summary will update on new commits.

greptile-apps · 2026-01-06T09:00:14Z

Greptile Summary

This PR successfully completes the cleanup of legacy user file document ID migration infrastructure. The changes remove special-case handling for user files throughout the indexing pipeline, unifying them with standard connector processing.

Key Changes:

- Removes database fields user_file.document_id and connector_credential_pair.is_user_file via migration
- Deletes entire dedicated Celery worker for user files indexing (celery_worker_user_files_indexing) and related Helm deployment resources
- Removes USER_FILE_DOCID_MIGRATION periodic task and USER_FILES_INDEXING queue
- Simplifies indexing logic by removing user file special casing - all connectors now use the same CONNECTOR_DOC_FETCHING queue
- Removes document ID migration logic from Vespa index update methods
- Eliminates 60+ test cases for deprecated user file queue routing

Impact:

- Reduces code complexity by ~750 lines
- Removes infrastructure overhead of maintaining separate worker and queue
- Simplifies monitoring by removing user files indexing queue metrics
- User files now follow standard connector lifecycle (ACTIVE status on success instead of PAUSED)

Confidence Score: 5/5

- This PR is safe to merge - it's a well-executed cleanup of completed migration code
- The cleanup is comprehensive and systematic across all layers (database, application logic, infrastructure, and tests). Only one minor unused constant remains. The removal of deprecated migration code is straightforward with no complex logic changes.
- No files require special attention - the only issue is removing the unused USER_FILE_INDEXING_LIMIT constant in backend/onyx/configs/app_configs.py

Important Files Changed

Filename	Overview
backend/alembic/versions/a3c1a7904cd0_remove_userfile_related_deprecated_.py	Removes deprecated `document_id` from `user_file` and `is_user_file` from `connector_credential_pair` tables
backend/onyx/db/models.py	Removes `is_user_file` field from ConnectorCredentialPair and `document_id` field from UserFile models
backend/onyx/configs/constants.py	Removes legacy constants: USER_FILE_DOCID_MIGRATION task, USER_FILES_INDEXING queue, and related lock definitions
backend/onyx/db/connector_credential_pair.py	Removes `is_user_file` filtering logic and `fetch_indexable_user_file_connector_credential_pair_ids` function
backend/onyx/background/celery/tasks/docprocessing/tasks.py	Removes user file special casing in indexing logic - no longer pauses user file connectors on success or fetches them separately
backend/onyx/document_index/vespa/vespa_document_index.py	Removes `old_doc_id_to_new_doc_id` parameter and related document ID migration logic from update methods
backend/supervisord.conf	Removes celery_worker_user_files_indexing worker configuration and related logging
deployment/helm/charts/onyx/templates/celery-worker-user-files-indexing.yaml	Deletes entire Helm deployment for user files indexing worker

Sequence Diagram

sequenceDiagram
    participant Beat as Celery Beat
    participant CheckIndexing as check_for_indexing
    participant DB as Database
    participant TaskCreation as try_creating_docfetching_task
    participant DocFetchQueue as CONNECTOR_DOC_FETCHING Queue
    participant DocProcessing as docprocessing_task
    participant VespaIndex as Vespa Index

    Note over Beat,VespaIndex: Before: User files had separate queue and special handling
    Note over Beat,VespaIndex: After: All connectors use unified indexing pipeline

    Beat->>CheckIndexing: Periodic trigger
    CheckIndexing->>DB: fetch_indexable_standard_connector_credential_pair_ids()
    Note over CheckIndexing,DB: No longer fetches user files separately
    DB-->>CheckIndexing: cc_pair_ids (includes all connectors)
    
    loop For each cc_pair
        CheckIndexing->>TaskCreation: Create indexing task
        Note over TaskCreation: Removed is_user_file queue selection logic
        TaskCreation->>DocFetchQueue: Queue task (all use same queue)
    end

    DocFetchQueue->>DocProcessing: Process document
    DocProcessing->>VespaIndex: Index document
    Note over DocProcessing,VespaIndex: Removed document_id migration logic
    
    DocProcessing->>DB: Update cc_pair status
    Note over DocProcessing,DB: Set to ACTIVE (no longer pauses user files)
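The before/after behavior in the diagram can be sketched in a few lines of Python. This is a simplified illustration, not the actual code in `tasks.py`: the function names and the `Status` enum are hypothetical, and the queue strings are assumed from the migration notes in this PR.

```python
from enum import Enum


class Status(str, Enum):
    """Hypothetical stand-in for the connector status values named in this PR."""

    ACTIVE = "ACTIVE"
    PAUSED = "PAUSED"


# Queue names assumed from the PR summary.
CONNECTOR_DOC_FETCHING = "connector_doc_fetching"
USER_FILES_INDEXING = "user_files_indexing"  # removed by this PR


def select_queue_before(is_user_file: bool) -> str:
    """Old behavior: user files were routed to a dedicated queue."""
    return USER_FILES_INDEXING if is_user_file else CONNECTOR_DOC_FETCHING


def select_queue_after() -> str:
    """New behavior: every cc_pair uses the same queue, so no flag is needed."""
    return CONNECTOR_DOC_FETCHING


def status_on_success_before(is_user_file: bool) -> Status:
    """Old behavior: user file connectors were paused after a successful run."""
    return Status.PAUSED if is_user_file else Status.ACTIVE


def status_on_success_after() -> Status:
    """New behavior: all connectors stay ACTIVE after a successful run."""
    return Status.ACTIVE
```

The point of the sketch is that both "after" functions take no `is_user_file` argument, which is why the column can be dropped from `connector_credential_pair`.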

greptile-apps

Additional Comments (1)

backend/onyx/configs/app_configs.py, line 672

style: USER_FILE_INDEXING_LIMIT is no longer used after removing fetch_indexable_user_file_connector_credential_pair_ids. Remove this unused constant.

Context Used: Rule from dashboard - When hardcoding a boolean variable to a constant value, remove the variable entirely and clean up al... (source)

25 files reviewed, 1 comment

cubic-dev-ai

1 issue found across 25 files

Prompt for AI agents (all issues)


Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/alembic/versions/a3c1a7904cd0_remove_userfile_related_deprecated_.py">

<violation number="1" location="backend/alembic/versions/a3c1a7904cd0_remove_userfile_related_deprecated_.py:31">
P1: Downgrade will fail if `user_file` table has existing rows. Adding a NOT NULL column without a `server_default` causes PostgreSQL to reject the operation when rows exist. Consider using `nullable=True` (like the similar migration in `2b75d0a8ffcb`) since data is lost during upgrade anyway.</violation>
</file>

cubic-dev-ai · 2026-01-06T09:00:49Z

backend/alembic/versions/a3c1a7904cd0_remove_userfile_related_deprecated_.py

+    )
+    op.add_column(
+        "user_file",
+        sa.Column("document_id", sa.String(), nullable=False),


P1: Downgrade will fail if user_file table has existing rows. Adding a NOT NULL column without a server_default causes PostgreSQL to reject the operation when rows exist. Consider using nullable=True (like the similar migration in 2b75d0a8ffcb) since data is lost during upgrade anyway.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At backend/alembic/versions/a3c1a7904cd0_remove_userfile_related_deprecated_.py, line 31: <comment>Downgrade will fail if `user_file` table has existing rows. Adding a NOT NULL column without a `server_default` causes PostgreSQL to reject the operation when rows exist. Consider using `nullable=True` (like the similar migration in `2b75d0a8ffcb`) since data is lost during upgrade anyway.</comment> <file context> @@ -0,0 +1,32 @@ + ) + op.add_column( + "user_file", + sa.Column("document_id", sa.String(), nullable=False), + ) </file context>

✅ Addressed in 643df9b
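The downgrade concern can be reproduced with a small standalone sketch. It uses SQLite from the Python standard library as a stand-in, which is an assumption made for portability: the PR targets PostgreSQL, which rejects the NOT NULL column only when rows exist, while SQLite rejects it even on an empty table. The table contents and the `try_add_column` helper are hypothetical.

```python
import sqlite3


def try_add_column(conn: sqlite3.Connection, ddl: str):
    """Run an ALTER TABLE statement; return None on success or the error text."""
    try:
        conn.execute(ddl)
        return None
    except sqlite3.OperationalError as exc:
        return str(exc)


# State after the upgrade: user_file exists, document_id has been dropped.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_file (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO user_file (name) VALUES ('report.pdf')")

# Downgrade as first written: NOT NULL with no default cannot be satisfied
# for the existing row, so the statement is rejected.
not_null_error = try_add_column(
    conn, "ALTER TABLE user_file ADD COLUMN document_id TEXT NOT NULL"
)

# Downgrade as suggested in the review: a nullable column is accepted and
# existing rows get NULL, since the old values are gone anyway.
nullable_error = try_add_column(
    conn, "ALTER TABLE user_file ADD COLUMN document_id TEXT"
)

restored = conn.execute("SELECT name, document_id FROM user_file").fetchall()
```

In the Alembic migration itself the equivalent change is passing `nullable=True` to `sa.Column` in `downgrade()`, as the review comment suggests.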

acaprau · 2026-01-07T03:50:08Z

backend/onyx/document_index/vespa/index.py

            project_ids=project_ids,
        )

-        old_doc_id_to_new_doc_id: dict[str, str] = dict()


backend/onyx/document_index/interfaces_new.py

backend/onyx/server/documents/connector.py

acaprau · 2026-01-07T03:59:51Z

backend/alembic/versions/a3c1a7904cd0_remove_userfile_related_deprecated_.py

+    op.drop_column("connector_credential_pair", "is_user_file")
+
+
+def downgrade() -> None:


Want someone who was around when this was implemented and has context to sanity-check this. Once this migration runs on prod we're done, this data is gone.

- Removed user files indexing worker from supervisord configuration.
- Deleted user file doc ID migration task and its references from the codebase.
- Updated task queues to eliminate user files indexing.
- Cleaned up related constants and configurations in the Helm charts.

…aIndex - Deleted the legacy document_id field from UserFile model. - Removed temporary document ID handling from VespaIndex update method and related functions. - Cleaned up associated comments and TODOs regarding document ID migration.


Subash-Mohan requested a review from a team as a code owner January 6, 2026 08:56

greptile-apps bot reviewed Jan 6, 2026

cubic-dev-ai bot reviewed Jan 6, 2026

Subash-Mohan force-pushed the cleanup/userfile-indexing branch from 643df9b to eec9f48 Compare January 6, 2026 10:13

acaprau reviewed Jan 7, 2026

Subash-Mohan force-pushed the cleanup/userfile-indexing branch from c0483af to be59d18 Compare January 7, 2026 08:21

acaprau approved these changes Jan 7, 2026

Subash-Mohan added 11 commits January 8, 2026 10:21

refactor: remove user file references and related fields

f6fc9cd

- Deleted user file related fields from the database schema, including and . - Updated various functions and queries to eliminate user file handling. - Cleaned up related code and comments across multiple modules

refactor: remove document_id field from user file creation

54e81ba

refactor: remove temporary old_doc_id_to_new_doc_id parameter from Up…

74f5102

…datable update method

refactor: update user_file document_id column to be nullable and remo…

10ffc74

…ve user_file indexing limit configuration

refactor: remove deprecated document_id_migrated field from user_file…

dd9c5ed

… model and update migration script
- Deleted the document_id_migrated field from the UserFile model.
- Updated the migration script to drop the corresponding column from the user_file table.

chore: update Helm chart version and remove user files indexing confi…

9177d2b

…guration
- Bumped the Onyx Helm chart version from 0.4.17 to 0.4.18.
- Removed the celery_worker_user_files_indexing configuration from values.yaml to clean up unused settings.

refactor: simplify update method by removing unused parameter and imp…

d4a0fe9

…rove logging
- Removed the old_doc_id_to_new_doc_id parameter from the OpenSearchDocumentIndex update method.
- Updated logging to use a defined priority variable for clarity in the connector module.

refactor: update migration script to reflect new down_revision for us…

e66af3e

…erfile deprecations

refactor: update migration script to correct down_revision for userfi…

c726b72

…le deprecations

Subash-Mohan force-pushed the cleanup/userfile-indexing branch from be59d18 to c726b72 Compare January 8, 2026 04:53

Subash-Mohan enabled auto-merge January 8, 2026 04:57

Subash-Mohan added this pull request to the merge queue Jan 8, 2026

Merged via the queue into main with commit 8ef8dfd Jan 8, 2026
71 of 73 checks passed

Subash-Mohan deleted the cleanup/userfile-indexing branch January 8, 2026 05:12
