Ensure wordlist data generation is checked into repo and uses cirrussearch #2

@asimihsan

Problem statement

The script that I use to generate the wordlist language model data is not version controlled in this repo. Moreover, it currently assumes that you take a raw Wikipedia dump and pass it through spaCy to extract words.

Change the script to use the easier-to-parse Wikipedia cirrussearch dumps instead, so that spaCy is no longer required, then version control it. Why not add some tests too?

Acceptance criteria

  • Version control the Rust tool that processes a Wikipedia dump into a wordlist data file.
  • Change the Rust tool to be able to process a cirrussearch dump instead.
  • Add integration tests (preferable to unit tests).
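As a rough illustration of the second criterion, here is a minimal sketch of turning cirrussearch-style lines into a word-frequency list. It assumes the dump is newline-delimited JSON in Elasticsearch bulk format, where each `{"index": ...}` action line is followed by a document line carrying a `text` field, and that the file has already been decompressed. All function names are hypothetical, the output shape (word plus count) is a guess at what the wordlist data file needs, and `toy_extract_text` is a deliberately naive stand-in for a real JSON parser.

```rust
use std::collections::HashMap;

/// Count lowercase alphabetic words across all page texts.
fn count_words<I, S>(texts: I) -> HashMap<String, u64>
where
    I: Iterator<Item = S>,
    S: AsRef<str>,
{
    let mut counts = HashMap::new();
    for text in texts {
        let words = text
            .as_ref()
            .split(|c: char| !c.is_alphabetic())
            .filter(|w| !w.is_empty());
        for word in words {
            *counts.entry(word.to_lowercase()).or_insert(0) += 1;
        }
    }
    counts
}

/// Build a wordlist from bulk-format lines. Assumption: non-empty lines
/// alternate strictly between an action line and a document line.
/// `extract_text` is a hypothetical hook; a real tool would parse the
/// document line as JSON and read its `text` field.
fn wordlist_from_dump<'a>(
    lines: impl Iterator<Item = &'a str>,
    extract_text: impl Fn(&str) -> Option<String>,
) -> Vec<(String, u64)> {
    let texts = lines
        .filter(|line| !line.trim().is_empty())
        .skip(1) // skip the first action line
        .step_by(2) // then keep every second line: the documents
        .filter_map(|line| extract_text(line));
    let mut entries: Vec<(String, u64)> = count_words(texts).into_iter().collect();
    // Most frequent first; ties broken alphabetically for stable output.
    entries.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    entries
}

/// Toy extractor for illustration only: it does not handle escaped quotes
/// or whitespace around the colon, so it is not a substitute for a parser.
fn toy_extract_text(line: &str) -> Option<String> {
    let key = "\"text\":\"";
    let start = line.find(key)? + key.len();
    let end = line[start..].find('"')? + start;
    Some(line[start..end].to_string())
}

fn main() {
    // Hand-written sample lines, not real dump content.
    let dump = [
        r#"{"index":{"_type":"page","_id":"1"}}"#,
        r#"{"title":"A","text":"The cat sat. The end"}"#,
        r#"{"index":{"_type":"page","_id":"2"}}"#,
        r#"{"title":"B","text":"A cat"}"#,
    ];
    for (word, count) in wordlist_from_dump(dump.iter().copied(), toy_extract_text) {
        println!("{word}\t{count}");
    }
}
```

Keeping the text extraction behind a function parameter means an integration test can feed a small hand-written fixture through `wordlist_from_dump` and compare the result against an expected wordlist, without needing a full dump.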

    Labels

    • P1: Priority 1 - Not a launch blocker but urgent issue.
    • passwordgen: Affects the Rust library implementation.
