replaces wordlist-5-dice with a new word list #39

sts10 · 2023-01-17T00:43:06Z

Similar to #38, I made a new word list for assets/wordlist/wordlist-5-dice.js. I understand that, for the 5-dice list, I'm going up against the EFF long list, so in that way it's a bit more controversial than #38, but we'll see.

I understand that currently, this program uses a slightly modified version of the EFF long list as a 7,776-word list. One interesting property of the original EFF long list is that it is free of prefix words ("We also ensured that no word is an exact prefix of any other word."). This offers a key advantage: users can combine words from the list without delimiters or camel case (e.g. twigstarfishrefusalretentiontheftfreezing is safe to use). A more general term for this property is "uniquely decodable".
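To make the prefix-word property concrete, here is a minimal Python sketch. It is not part of this project or of Tidy; the function names and the toy word list are made up for illustration. The first function finds prefix words in a list, and the second shows why they matter when words are joined with no delimiter.

```python
def find_prefix_words(words):
    """Return (prefix, longer_word) pairs where one word is a proper prefix of another."""
    ordered = sorted(set(words))
    pairs = []
    for i, word in enumerate(ordered):
        # In sorted order, all words that start with `word` follow it contiguously.
        for other in ordered[i + 1:]:
            if not other.startswith(word):
                break
            pairs.append((word, other))
    return pairs


def segmentations(text, words):
    """List every way to split `text` into words from `words` (small inputs only)."""
    if text == "":
        return [[]]
    results = []
    for word in words:
        if text.startswith(word):
            for rest in segmentations(text[len(word):], words):
                results.append([word] + rest)
    return results


# Toy list (hypothetical, not the proposed list): "run" is a prefix of "runway".
toy_list = ["run", "way", "runway", "villa"]
print(find_prefix_words(toy_list))
print(segmentations("runwayvilla", toy_list))
```

With this toy list, "runwayvilla" can be read as either run + way + villa or runway + villa, so an attacker guessing two-word phrases would also cover this three-word phrase. Note that the check above only tests the prefix-free condition, which is sufficient for unique decodability but not necessary for it.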

However, there is a trade-off to removing all prefix words: since words that are prefixes of other words are often themselves common words, a prefix-word-free list has to use slightly less common words than a list that does contain prefix words.

This diceware web app capitalizes the first character of every word (e.g. AloofUncladCartridgeAlike), meaning that it is free to use word lists that do have prefix words.
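A short Python sketch of why the capitalization acts as a separator. It assumes every word on the list is lowercase ASCII letters only; the helper names are hypothetical and this is not the app's own code.

```python
import re


def camel_join(words):
    """Join lowercase words, capitalizing the first character of each."""
    return "".join(word.capitalize() for word in words)


def camel_split(passphrase):
    """Recover the words: each capital letter marks the start of a new word."""
    return [chunk.lower() for chunk in re.findall(r"[A-Z][a-z]*", passphrase)]


print(camel_split("AloofUncladCartridgeAlike"))
# Prefix words no longer collide, because the capitals mark the boundaries:
print(camel_join(["run", "way"]), camel_join(["runway"]))
```

Because "RunWay" and "Runway" are different strings, a list containing both "run" and "runway" stays unambiguous under this scheme.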

With this in mind, I made a new 7,776-word list for this project. It's based on 2012 Google Ngram data, and I used a tool I call Tidy to create it. The list does contain prefix words. Of course, we may still prefer the EFF list. But again, thought I'd submit this PR.

Some attributes describing the new list:

List length               : 7776 words
Mean word length          : 7.08 characters
Length of shortest word   : 3 characters (act)
Length of longest word    : 11 characters (willingness)
Free of prefix words?     : false
Entropy per word          : 12.925 bits

Pseudorandomly generated sample passphrases
-------------------------------------------
interact developers terrace rolling factory legislative 
alloy almost pathways duties percentages address 
tiny orbit truth resident villa injury 
run suite ran regarded chemicals enclosed 
citizen overt structures substitute figured villagers


sts10 · 2023-04-06T17:02:56Z

I've given this list a bit of a refresh for spring, incorporating words from Wikipedia, thanks to this project, and ensuring that some (more) British spellings of English words are not on the list (sorry Britain!).

Information/attributes of the updated list

List length               : 7776 words
Mean word length          : 7.04 characters
Length of shortest word   : 3 characters (ace)
Length of longest word    : 11 characters (willingness)
Uniquely decodable?       : false
Entropy per word          : 12.925 bits
Efficiency per character  : 1.835 bits
Assumed entropy per char  : 4.308 bits
Mean edit distance        : 7.035
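For readers who want to sanity-check these figures, here is a rough Python sketch. The formulas are my reading of the numbers above (entropy per word as log2 of the list length, efficiency as entropy divided by mean word length, assumed entropy per character as entropy divided by the shortest word length); they are assumptions, not definitions taken from Tidy's source. Mean edit distance is left out.

```python
import math


def list_stats(words):
    """Compute a few per-list statistics for a list of unique words."""
    entropy_per_word = math.log2(len(words))
    mean_length = sum(len(word) for word in words) / len(words)
    shortest = min(len(word) for word in words)
    return {
        "entropy_per_word": entropy_per_word,
        # Bits contributed per typed character, on average.
        "efficiency_per_char": entropy_per_word / mean_length,
        # Worst case: assume every word is as short as the shortest one.
        "assumed_entropy_per_char": entropy_per_word / shortest,
    }


# A 7,776-word list gives log2(7776) bits per word regardless of its contents.
print(round(math.log2(7776), 3))       # 12.925
print(round(math.log2(7776) / 3, 3))   # 4.308, with a 3-character shortest word
print(round(6 * math.log2(7776), 3))   # 77.549 bits for a six-word passphrase
```

Dividing 12.925 by the reported mean length of 7.04 gives roughly 1.84, in line with the efficiency figure above; the last digit depends on the unrounded mean length.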

Pseudorandomly generated sample passphrases
-------------------------------------------
straw provinces humble impressions ion gradually
transmitted readings defenders thrown whenever leaned
actress things reversed troy management specialist
whatever obvious wide literal risk operational
sensible bodily matched schedules blocked damages

Licensing (updated)

Given that Wikipedia text is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License ("CC BY-SA"), I'm using that license for this list as well (see updated comment at top of the file). Hope that doesn't disqualify this PR! But given that this project is licensed under GPL-2.0, I think it should be fine!

dmuth · 2023-05-05T23:28:19Z

BTW, thanks for making these--I need to dive into them, as well as some other wordlists at some point, but it will require front-end changes. (I got some ideas on that, but suggestions are always welcome)

sts10 · 2023-05-05T23:36:37Z

No worries. I'm happy to have the time to push new changes, hopefully improving each proposed list with each pre-merge commit. Let me know if I can be of assistance diving into them.

...as well as some other wordlists at some point...

Curious to find out what other lists you're considering. Separately from my two PRs to this project, I've been working on a set of wordlists, so I'm interested in what passphrase generator developers like yourself are looking for in word lists.

dmuth · 2023-05-05T23:39:44Z

Curious to find out what other lists you're considering.

I'm looking at the ones from Strongbox:

https://github.com/strongbox-password-safe/Strongbox/tree/master/resources/wordlists

I've noticed the stats on your wordlists, is there a utility that generates those? We might be getting to the point where it would make sense to start putting the stats into a spreadsheet for analysis purposes.

sts10 · 2023-05-06T00:03:36Z

I've noticed the stats on your wordlists, is there a utility that generates those?

Yes, from a Rust tool I built called Tidy. Once installed, running tidy -AAAA --samples wordlist.txt prints the list, then the full suite of stats, then some passphrase samples. Tidy does far more than just print stats about a word list: You can also combine multiple lists and perform numerous other edits.

We might be getting to the point where it would make sense to start putting the stats into a spreadsheet for analysis purposes.

That sounds like a great project! I coincidentally started a short list of password managers and the word lists they use.

I'm looking at the ones from Strongbox...

I've actually had a look at those word lists recently. While I love that Strongbox offers many non-English lists, those lists in particular seem a bit under-developed.

For example, most lists start with 200 lines of symbols and numbers, plus non-words like "aa" (see their French list for example), which, imo, betrays the promise of a "passphrase". Also, as further evidence of poor list work, at least four of the lists have issues:

finnish-diceware.wordlist.utf8.txt has two copies of small words such as "a", "aa", "ab", "abc", "ad", "ar", "cj"
french-diceware.wordlist.utf8.txt seems to have a blank line (word) at line 40
icelandic-diceware.wordlist.utf8.txt has two copies of small words like "aa", "ad" and "ae"
swedish-diceware.wordlist.utf8.txt has two copies of the line "abc"
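Issues like these can be caught with a quick check. Below is a minimal Python sketch, assuming a file with one word per line and no dice-number column; the function name is hypothetical and this is not how Tidy does it.

```python
from collections import Counter


def audit_word_list(lines):
    """Return (blank_line_numbers, duplicated_words) for a one-word-per-line list."""
    stripped = [line.strip() for line in lines]
    # Line numbers are 1-based, matching what a text editor would show.
    blank_line_numbers = [n for n, word in enumerate(stripped, start=1) if word == ""]
    counts = Counter(word for word in stripped if word)
    duplicated_words = sorted(word for word, count in counts.items() if count > 1)
    return blank_line_numbers, duplicated_words


# Made-up sample lines, not the contents of any Strongbox file.
sample = ["a", "aa", "", "abc", "aa", "abc"]
print(audit_word_list(sample))
```

Duplicates matter because a list with repeated words has fewer unique entries than its line count suggests, so the real entropy per word is lower than log2 of the number of lines.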

Likewise, some of the EFF fandom lists in the Strongbox repo have some profane words and some words with non-ASCII characters in 2 or 3 of them. I've offered two solutions to this issue for another password project, if you want to take a look at that. However, note that there are only 4,000 unique words on each of the fandom lists -- they're doubled to make it to 8,000 -- so they couldn't make it to the necessary 7,776 words for a 5-dice list without adding words.

If you want to add word lists in foreign languages, I'd consider starting with the Wikipedia word frequency project I used to create this PR, which has word frequency data from multiple languages. It wouldn't be too difficult to use a tool like Tidy to cut them to 7,776-word lists for this project (something like tidy -C -l --print-first 7776 --locale es-ES -z nfc -d s -m 3 -M 12 --straighten -o spanish-diceware-list.txt eswiki-2022-08-29.txt ) -- my only hesitation is not knowing which words are profane or otherwise inappropriate in languages I don't speak/read.

dmuth · 2023-05-07T19:14:15Z

I did a deep dive on my code last night, and have been thinking about this. Here's what I came up with:

First, I gotta do some refactoring, so I just opened Refactor Javascript In Preparation for Multiple Wordlists #46 to track that. The big benefit in relation to additional wordlists is that I'll be able to just use text files going forward. (Right now, my wordlist is Javascript)
Second, I agree with some of the quality concerns you raised about the other wordlists.
And that brings me to my third point--I wonder if it might be in the best interests of us, and any other project that has password generation capabilities, to create a separate repo that simply holds password lists as plaintext files with one word per line, along with details that relate to the quality of that password file.

After I finish up #46, I think we can talk about next steps in terms of this specific PR.

sts10 · 2023-05-07T20:54:14Z

First, I gotta do some refactoring... After I finish up #46, I think we can talk about next steps in terms of this specific PR.

Totally get it.

create a separate repo that simply holds password lists as plaintext files with one word per line, along with details that relate to the quality of that password file.

Do you mean word lists that other password managers and generators currently use, plus information about the passphrases they generate (word count, entropy per word, etc.)? I can try getting a start on that. I don't think there are too many out there in use...

Update: Here's a first pass at it: https://github.com/sts10/wordlist-information

dmuth · 2023-05-07T23:42:46Z

Update: Here's a first pass at it: https://github.com/sts10/wordlist-information

That's a great start!

So I think I'm gonna continue my refactoring work over in #46, then I want to play around with Tidy (I am also teaching myself Rust, so that's great timing!), and run it against the Strongbox lists.

I have some ideas for the wordlist-information repo, but I'm gonna let them bounce around in my head while I work on Diceware for the next few nights..

dmuth · 2023-05-10T01:12:20Z

All done with #46, and deployed it last night. I'll start poking at Tidy tomorrow. You may see some PRs from me for the wordlist-information repo later on.

sts10 · 2023-05-14T13:49:40Z

I see that my branch is a bit behind now, especially after #47.

I'd update my branch and PR, but I'm not sure if you want word lists to be in .txt files, with no quotes or commas, now? Or maybe my proposed new lists don't make sense anymore. FYI this proposed list lives here.

dmuth · 2023-05-15T21:39:36Z

I'm not sure if you want word lists to be in .txt files, with no quotes or commas, now

As a peek under the hood, what I do in the Javascript code is create a random number between 1 and 7776 and then pick the line out from the list. The dice rolls that are shown on the page are the results of me converting the number from Base 10 to Base 6. 😹
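The number-to-dice conversion described above can be sketched as follows. This is Python rather than the app's Javascript, and it is not the project's actual code: the function names, the use of the `secrets` module, and the choice to subtract 1 before converting are my assumptions.

```python
import secrets


def number_to_dice(n, dice=5):
    """Map a number in 1..6**dice to `dice` die rolls, each in 1..6."""
    if not 1 <= n <= 6 ** dice:
        raise ValueError("number out of range for this many dice")
    index = n - 1  # shift to 0-based so the base-6 digits run 0..5
    rolls = []
    for _ in range(dice):
        index, digit = divmod(index, 6)
        rolls.append(digit + 1)  # base-6 digit 0..5 becomes die face 1..6
    return rolls[::-1]  # most significant die first


def pick_word(words):
    """Pick one word uniformly at random and report the matching dice rolls."""
    if len(words) != 6 ** 5:
        raise ValueError("expected a 7,776-word list")
    n = secrets.randbelow(len(words)) + 1  # uniform in 1..7776
    return n, number_to_dice(n), words[n - 1]


print(number_to_dice(1))     # [1, 1, 1, 1, 1]
print(number_to_dice(7))     # [1, 1, 1, 2, 1]
print(number_to_dice(7776))  # [6, 6, 6, 6, 6]
```

Since 6^5 = 7776, every line number corresponds to exactly one sequence of five rolls, which is why a 5-dice list needs exactly 7,776 words.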

Going forward, I'm just going to have lists be a text file with one word per line, because that's a format that is the easiest to work with.

I should add a list selection dropdown to the page, so I can just grab your list and put it in the dropdown I'll create. The big question I have is what would you like your list called? I was thinking something like Google 2012 Common Words or something similar? I'm also open to clever/fancy names with me putting details into the README.

-- Doug

sts10 · 2023-05-15T21:50:36Z

I should add a list selection dropdown to the page, so I can just grab your list and put it in the dropdown I'll create. The big question I have is what would you like your list called? I was thinking something like Google 2012 Common Words or something similar? I'm also open to clever/fancy names with me putting details into the README.

The list has frequent word data from Wikipedia mixed in now, so "Google 2012 Common Words" doesn't fit anymore. I'll try to think of a name for it!

Separately, we can consider adding my Orchard Street Medium list instead or in addition.

The Orchard Street Medium list is uniquely decodable, which brings us to an interesting question regarding your project. If your app continues to enforce a delimiter between words, the word lists you use need not be uniquely decodable. Not being uniquely decodable usually allows the list to have shorter, more common words. This is why, in this PR, I submitted a list that is not uniquely decodable.

dmuth · 2023-05-17T00:47:46Z

If your app continues to enforce a delimiter between words, the word lists you use need not be uniquely decodable. Not being uniquely decodable usually allows the list to have shorter, more common words.

You mean the CamelCase capitalization? Yep, I plan on keeping that, because it makes the words much easier to read.

sts10 · 2023-05-17T03:59:44Z

You mean the CamelCase capitalization?

Yes, sorry -- CamelCase is effectively a delimiter.

sts10 added 3 commits January 16, 2023 19:30

replaces wordlist-5-dice with a new word list that hopefully rivals EFF's list

1d3324a

a couple more edits to my new wordlist-5-dice list

2cb6111

more updates to 5-dice wordlist, incorporating word frequency data from Wikipedia as well as Google Ngram data

95aa6fb

another update to wordlist-5-dice.js, making a few word swaps

1d8ced1

dmuth mentioned this pull request May 7, 2023

Refactor Javascript In Preparation for Multiple Wordlists #46

Closed

performs a few words swaps on wordlist-5-dice.js

da792fd
