@SuaYoo SuaYoo commented Oct 28, 2025

Resolves #2862
Depends on #2868

Changes

  • Adds new "Deduplication" section to workflows
  • Allows users to use a collection for deduplication
  • Various refactors for consistency

Screenshots

| Page | Image/video |
| --- | --- |
| Edit Workflow / Deduplication | Screenshot 2025-10-09 at 8 29 28 PM |
| Edit Workflow / Deduplication | Screenshot 2025-10-27 at 5 54 28 PM |
| Edit Workflow / Deduplication | Screenshot 2025-10-09 at 8 32 52 PM |
| Edit Workflow / Deduplication | Screenshot 2025-10-09 at 8 32 57 PM |
| Edit Workflow / Deduplication | Screenshot 2025-10-09 at 8 34 10 PM |
| Edit Workflow / Collections | Screenshot 2025-10-09 at 8 34 35 PM |
| Edit Workflow / Collections | Screenshot 2025-10-09 at 8 34 55 PM |
| Workflow / Settings | Screenshot 2025-10-09 at 8 33 57 PM |

Follow-ups

This PR adds a stub section to the user guide, which will be filled out as part of #2933.


Cron schedules are always in [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time).

## Deduplication

We should add more documentation before merging the feature branch to main. Appreciate you adding this here as a stub/reminder.


Yes, tracking in #2933

@SuaYoo SuaYoo force-pushed the feature-dedup--frontend-dedupe-form branch from 452568c to e4378a2 Compare November 3, 2025 17:02
@SuaYoo SuaYoo merged commit 43c4dc2 into feature-dedupe Nov 3, 2025
29 checks passed
@SuaYoo SuaYoo deleted the feature-dedup--frontend-dedupe-form branch November 3, 2025 17:12
emma-sg added a commit that referenced this pull request Nov 5, 2025
commit 43c4dc2
Author: sua yoo <[email protected]>
Date:   Mon Nov 3 09:12:50 2025 -0800

    task: Add dedupe form control to workflow (#2932)

    - Adds new "Deduplication" section to workflows
    - Allows users to use a collection for deduplication
    - Various refactors for consistency

commit 2fcf6d7
Author: Ilya Kreymer <[email protected]>
Date:   Tue Oct 28 09:47:25 2025 -0700

    Dedup Backend Initial Implementation (#2868)

    Fixes #2867

    The backend implementation involves:

    Operator
    - A new CollIndex CRD type, btrix-crds updated to 0.2.0
    - Operator that manages the new CRD type, creating a new Redis instance
    when the index should exist (uses redis_dedupe_memory and redis_dedupe_storage chart values)
    - dedupe_importer_channel can configure crawler channel for index imports
    - Operator starts the crawler in 'indexer' mode

    Workflows & Crawls:
    - Workflows have a new `dedupeCollId` field for dedupe while crawling.
    The `dedupeCollId` must also be a collection that the crawl is
    auto-added to.
    - There is a new waiting state: `waiting_for_dedupe_index` that is
    entered if a crawl is starting, but index is not yet ready.
    - Each crawl has bi-directional links for crawls that it requires for
    dedupe via `requiresCrawls` and other crawls for which this crawl is
    required via `requiredByCrawls`.
    - autoAddCollections automatically updated to always include
    `dedupeCollId` collection.

    Collection:
    - Collection has a new `hasDedupeIndex` field
    - Adding items to or removing items from a collection marks the CollIndex
    object for update by updating the collItemsUpdatedAt timestamp, which
    triggers a reindex
    - CollIndex object deleted on collection delete

    Indexing depends on the crawler version from
    webrecorder/browsertrix-crawler#884,
    which supports indexing mode.

    ---------
    Co-authored-by: Tessa Walsh <[email protected]>
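
The workflow and crawl rules in the commit message above can be illustrated with a minimal sketch. This is not the Browsertrix backend code: the `Workflow` and `Crawl` classes, the three helper functions, and the `"starting"` placeholder state are hypothetical and exist only for illustration. Only the field names `dedupeCollId`, `autoAddCollections`, `requiresCrawls`, `requiredByCrawls` and the state `waiting_for_dedupe_index` come from the commit message.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Workflow:
    """Hypothetical stand-in for a workflow; not the real Browsertrix model."""
    autoAddCollections: list[str] = field(default_factory=list)
    dedupeCollId: Optional[str] = None


@dataclass
class Crawl:
    """Hypothetical stand-in for a crawl with bi-directional dedupe links."""
    id: str
    requiresCrawls: set[str] = field(default_factory=set)
    requiredByCrawls: set[str] = field(default_factory=set)


def ensure_dedupe_auto_add(workflow: Workflow) -> Workflow:
    """Make sure the dedupe collection is also an auto-add collection."""
    coll_id = workflow.dedupeCollId
    if coll_id and coll_id not in workflow.autoAddCollections:
        workflow.autoAddCollections.append(coll_id)
    return workflow


def initial_state(workflow: Workflow, index_ready: bool) -> str:
    """Pick a start state; "starting" is an assumed placeholder name."""
    if workflow.dedupeCollId and not index_ready:
        return "waiting_for_dedupe_index"
    return "starting"


def link_dedupe_dependency(crawl: Crawl, required: Crawl) -> None:
    """Record both directions of a dedupe dependency between two crawls."""
    crawl.requiresCrawls.add(required.id)
    required.requiredByCrawls.add(crawl.id)


workflow = ensure_dedupe_auto_add(
    Workflow(autoAddCollections=["coll-a"], dedupeCollId="coll-b")
)
older, newer = Crawl(id="crawl-1"), Crawl(id="crawl-2")
link_dedupe_dependency(newer, older)
```

In this sketch the dedupe collection is appended rather than validated, which mirrors the note that `autoAddCollections` is updated automatically; a real implementation might instead reject inconsistent input.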
ikreymer pushed a commit that referenced this pull request Nov 5, 2025