replaces wordlist-5-dice with a new word list #39
…om Wikipedia as well as Google Ngram data
I've given this list a bit of a refresh for spring, incorporating words from Wikipedia, thanks to this project, and ensuring that some (more) British spellings of English words are not on the list (sorry Britain!).
Information/attributes of the updated list
Licensing (updated)
Given that Wikipedia text is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License ("CC BY-SA"), I'm using that license for this list as well (see updated comment at top of the file). Hope that doesn't disqualify this PR! But given that this project is licensed under GPL-2.0, I think it should be fine!
BTW, thanks for making these--I need to dive into them, as well as some other wordlists at some point, but it will require front-end changes. (I've got some ideas on that, but suggestions are always welcome.)
No worries. I'm happy to have the time to push new changes, hopefully improving each proposed list with each pre-merge commit. Let me know if I can be of assistance diving into them.
Curious to find out what other lists you're considering. Separately from my two PRs to this project, I've been working on a set of wordlists, so I'm interested in what passphrase generator developers like yourself are looking for in word lists.
I'm looking at the ones from Strongbox: https://github.com/strongbox-password-safe/Strongbox/tree/master/resources/wordlists
I've noticed the stats on your wordlists. Is there a utility that generates those? We might be getting to the point where it would make sense to start putting the stats into a spreadsheet for analysis purposes.
Yes, from a Rust tool I built called Tidy. Once installed, running
That sounds like a great project! I coincidentally started a short list of password managers and the word lists they use.
I've actually had a look at those word lists recently. While I love that Strongbox offers many non-English lists, those lists in particular seem a bit under-developed. For example, most lists start with 200 lines of symbols and numbers, plus non-words like "aa" (see their French list for example), which, imo, betrays the promise of a "passphrase". Also, as further evidence of poor list work, at least four of the lists have issues:
Likewise, some of the EFF fandom lists in the Strongbox repo have some profane words and some words with non-ASCII characters in 2 or 3 of them. I've offered two solutions to this issue for another password project, if you want to take a look at that. However, note that there are only 4,000 unique words on each of the fandom lists -- they're doubled to make it to 8,000 -- so they couldn't make it to the necessary 7,776 words for a 5-dice list without adding words. If you want to add word lists in foreign languages, I'd consider starting with the Wikipedia word frequency project I used to create this PR, which has word frequency data from multiple languages. It wouldn't be too difficult to use a tool like Tidy to cut them to 7,776-word lists for this project (something like
I did a deep dive on my code last night, and have been thinking about this. Here's what I came up with:
After I finish up #46, I think we can talk about next steps in terms of this specific PR.
Totally get it.
Do you mean word lists that other password managers and generators currently use, plus information about the passphrases they generate (word count, entropy per word, etc.)? I can try getting a start on that. I don't think there are too many out there in use... Update: Here's a first pass at it: https://github.com/sts10/wordlist-information
That's a great start! So I think I'm gonna continue my refactoring work over in #46, then I want to play around with Tidy (I am also teaching myself Rust, so that's great timing!), and run it against the Strongbox lists. I have some ideas for the
All done with #46, and deployed it last night. I'll start poking at Tidy tomorrow. You may see some PRs from me for the wordlist-information repo later on.
I see that my branch is a bit behind now, especially after #47. I'd update my branch and PR, but I'm not sure if you want word lists to be in
As a peek under the hood, what I do in the JavaScript code is create a random number between 1 and 7776 and then pick that line out of the list. The dice rolls that are shown on the page are the results of me converting the number from Base 10 to Base 6. 😹 Going forward, I'm just going to have lists be a text file with one word per line, because that's the format that is easiest to work with. I should add a list selection dropdown to the page, so I can just grab your list and put it in the dropdown I'll create. The big question I have is what would you like your list called? I was thinking something like -- Doug
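For reference, here is a minimal sketch of the number-to-dice conversion described above. It is an illustration under stated assumptions, not the app's actual code: the function names `lineNumberToDice` and `wordAtLine` are hypothetical, and shifting the line number down by one before the base-6 conversion (so that line 1 maps to five 1s) is an assumed convention.

```javascript
// Hypothetical helper: convert a 1-based line number (1..6^diceCount)
// into the dice rolls that would select that line.
// Assumption: line 1 corresponds to rolling all 1s.
function lineNumberToDice(lineNumber, diceCount = 5) {
  const max = 6 ** diceCount; // 7776 for five dice
  if (!Number.isInteger(lineNumber) || lineNumber < 1 || lineNumber > max) {
    throw new RangeError(`lineNumber must be an integer from 1 to ${max}`);
  }
  return (lineNumber - 1)
    .toString(6)                         // base 10 -> base 6, digits 0..5
    .padStart(diceCount, "0")            // always diceCount digits
    .split("")
    .map((digit) => Number(digit) + 1);  // digit 0..5 -> die face 1..6
}

// Hypothetical helper: look up the word on a given 1-based line of a
// one-word-per-line list that has already been split into an array.
function wordAtLine(words, lineNumber) {
  return words[lineNumber - 1];
}
```

With these assumptions, `lineNumberToDice(1)` gives `[1, 1, 1, 1, 1]` and `lineNumberToDice(7776)` gives `[6, 6, 6, 6, 6]`. The random line number itself should come from a cryptographically secure source; that part is left out of the sketch.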
The list has frequent word data from Wikipedia mixed in now, so "Google 2012 Common Words" doesn't fit anymore. I'll try to think of a name for it! Separately, we can consider adding my Orchard Street Medium list instead or in addition. The Orchard Street Medium list is uniquely decodable, which brings us to an interesting question regarding your project. If your app continues to enforce a delimiter between words, the word lists you use need not be uniquely decodable. Not being uniquely decodable usually allows the list to have shorter, more common words. This is why, in this PR, I submitted a list that is not uniquely decodable.
You mean the CamelCase capitalization? Yep, I plan on keeping that, because it makes the words much easier to read.
Yes, sorry -- CamelCase is effectively a delimiter.
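To illustrate why capitalization works as a delimiter, here is a small sketch. The function names are hypothetical, and it assumes every word on the list consists only of ASCII letters, so a capital letter can only ever mark the start of a word.

```javascript
// Hypothetical helper: join words the way the app displays them,
// capitalizing the first character of each word.
// Assumption: words contain only ASCII letters a-z.
function toCamelCase(words) {
  return words
    .map((w) => w.charAt(0).toUpperCase() + w.slice(1).toLowerCase())
    .join("");
}

// Hypothetical helper: recover the words by splitting at each capital letter.
function splitCamelCase(passphrase) {
  return passphrase.match(/[A-Z][a-z]*/g) || [];
}
```

Under that assumption, a list containing both "news" and "newspaper" stays unambiguous: the two words "news" and "paper" become `NewsPaper`, while the single word "newspaper" becomes `Newspaper`.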
Similar to #38, I made a new word list for assets/wordlist/wordlist-5-dice.js. I understand that, for the 5-dice list, I'm going up against the EFF long list, so in that way it's a bit more controversial than #38, but we'll see.
I understand that currently, this program uses a slightly modified version of the EFF long list as its 7,776-word list. One interesting property of the original EFF long list is that it is free of prefix words ("We also ensured that no word is an exact prefix of any other word."). This offers a key advantage: users can combine words from the list without delimiters or camel case (e.g. twigstarfishrefusalretentiontheftfreezing is safe to use). A better term to describe this property is to say that the list is "uniquely decodable".
However, there is a trade-off to removing all prefix words: since words that are prefixes of other words are often themselves common words, a prefix-word-free list has to use slightly less common words than a list that does contain prefix words.
This diceware web app capitalizes the first character of every word (e.g. AloofUncladCartridgeAlike), meaning that it is free to use word lists that do have prefix words.
With this in mind, I made a new 7,776-word list for this project. It's based on 2012 Google Ngram data, and I used a tool I call Tidy to create it. The list does contain prefix words. Of course, we may still prefer the EFF list. But again, thought I'd submit this PR.
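As an illustration of what "contains prefix words" means, here is a minimal sketch of a check for them. This is not how Tidy does it; `findPrefixWords` is a hypothetical name, and the approach (sort, then compare each word with its successor) is just one simple way to find them. Note that being free of prefix words is sufficient for unique decodability but not strictly necessary.

```javascript
// Hypothetical helper: return every word that is an exact prefix of
// another word on the list.
// After de-duplicating and sorting by code unit, any word that is a prefix
// of some other word is also a prefix of the word immediately after it,
// so comparing neighbors is enough.
function findPrefixWords(words) {
  const sorted = [...new Set(words)].sort();
  const prefixWords = [];
  for (let i = 0; i + 1 < sorted.length; i++) {
    if (sorted[i + 1].startsWith(sorted[i])) {
      prefixWords.push(sorted[i]);
    }
  }
  return prefixWords;
}
```

For example, `findPrefixWords(["news", "newspaper", "paper", "twig"])` returns `["news"]`. An empty result means no word is a prefix of another, which is the property quoted from the EFF above.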
Some attributes describing the new list: