Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

7,776 possible words per dice roll #2

Closed
dmuth opened this issue Jan 13, 2017 · 9 comments
Closed

7,776 possible words per dice roll #2

dmuth opened this issue Jan 13, 2017 · 9 comments
Assignees

Comments

@dmuth
Copy link
Owner

dmuth commented Jan 13, 2017

It was pointed out to me that with only 7,776 words per dice roll (https://twitter.com/shazow/status/819991188388323329), even though we have a decent amount of randomness, there are only 7,776 possibilities which may be a concern. As such, I should consider a greater number of dice rolls per word. Here are my options

6 dice per word: 46,656 words
7 dice per word: 279,936 words
8 dice per word: 1,679,616 words

This is easy enough to write code for, but I need to figure out the following things:

  • Would a longer wordlist.js file cause problems in web browsers on both desktops and mobile phones?
  • Would there be UI issues related to the size of the dice, especially on mobile phones?

Suggestions are welcome! I'm going to wait a few days before starting on this one...

@dmuth dmuth self-assigned this Jan 13, 2017
@shazow
Copy link

shazow commented Jan 13, 2017

I suggest increasing the number of words. The word list in this project is an improvement.

If you're worried about the size, I suggest encoding it into a Trie datastructure which should compress quite nicely.

@dmuth
Copy link
Owner Author

dmuth commented Jan 16, 2017

Hmm... just to make sure we're on the same page regarding the Trie data structure, this is the structure I'm using now (basically a hash table):

var wordlist = {
        11111:"aaron",
        11112:"abandon",
        11113:"abbey",
        11114:"abbott",
        11115:"abide",
        11116:"ability",
        11121:"able",
        11122:"aboard",
};

If I were to use a Trie, the graph would look something like this (for the words given above):

image

And the underlying Javascript implementation of that graph might look something like this:

var wordlist = {};
wordlist[1] = {};
wordlist[1][1] = {};
wordlist[1][1][1] = {};
wordlist[1][1][1][1] = {};
wordlist[1][1][1][1][1] = "aaron";
wordlist[1][1][1][1][2] = "abandon";
wordlist[1][1][1][1][3] = "abbey";
wordlist[1][1][1][1][4] = "abbott";
wordlist[1][1][1][1][5] = "abide";
wordlist[1][1][1][1][6] = "ability";
wordlist[1][1][1][2][1] = "able";
wordlist[1][1][1][2][2] = "aboard";

So from a code perspective, I'd still have at least as many lines of code as I do words, with a few extra lines for initializing the hashtables for each possible die roll.

However... I feel like I might be misunderstanding what you had in mind for implementing Tries. If I'm way off the mark, I'd appreciate a nudge in the right direction. :-)

Thanks,

-- Doug

@shazow
Copy link

shazow commented Jan 16, 2017

Indeed, that's not what I had in mind.

First, a quick aside: The current approach could be improved by using an array. Keying on the dice roll doesn't get us anything, because that can just be represented as a random number rather than 6 dice. So if I were to improve the current approach, I'd do...

var wordlist = [
        "aaron",
        "abandon",
        "abbey",
        ...
];

And then pick a random number between 0 and wordlist.length-1, call it i, then we pick wordlist[i] and that's our word. We can represent that as a dice visualization by converting the number into base 6. This should cut down the size of the javascript file by almost half.

Similarly, with the Trie, we don't gain anything by keying on the dice representation. A Trie exists to query word prefixes, so we'd key on our word list.

Then, same as before, we pick a random number of the size of the dictionary and traverse the Trie until we settle on the appropriate word. Mapping the random number to a word in the Trie is a bit more work because we can't just access it directly the way we do with an array. We'd need to write a Trie traversal and stop at the i'th valid word.

Good luck!

@dmuth
Copy link
Owner Author

dmuth commented Jan 16, 2017

Cool, thanks for the detailed explanation! 👍

A bit of history on why I have things the way they are now: when I first built the app, I sought to emulate the physical diceware technique as closely as I could. It seems to work fine for 5 dice but, as we can see, might not scale so well for more dice.

Since I've never actually built a Trie data structure before, I think I'm going to try approaching this in phases. First phase will be to convert the array into a list and the dice rolls into random numbers that are then base-6ed back to dice roles to show the user. Then I'll try turning the array of 7,776 elements into a Trie (which should be fun!). Finally, I'll try the same approach with a larger word list.

I'm doing some traveling later in the week, so I don't expect this to be done soon. But do watch this space for updates when they happen. :-)

-- Doug

@shazow
Copy link

shazow commented Jan 16, 2017

That's right. In practice, there's no difference between choosing 5 random numbers between 1 and 6, or choosing one random number between 1 and 7776. The only difference aesthetic in how we choose to present it to the user afterwards.

Your steps sound good. :)

@dmuth
Copy link
Owner Author

dmuth commented Feb 2, 2017

Sorry for the delay, but I got the first step of this done, which was to rebuild the wordlist as an array, and rewrite the dice rolling logic to just be random number selection. Turns out doing base-6 conversion was a bit of work, at least when all of the edge cases were factored in.

main.js now weighs in at 535 LOC vs 382, but I think that's largely due to the edge case testing I have now--if someone one of those functions gets passed a negative number, I want an exception thrown right then and there, since these are passwords we're talking about.

But, hey, I got into Qunit for the first time and we have unit tests now, so that's nice:

https://www.dmuth.org/diceware/tests/

Now, getting back to the Trie, building one seems easily enough. But I'll need a way to navigate that Trie. Putting words into a Trie is easy enough--I read online code samples, and it makes perfect sense. I should be able to build that with no difficulty.

Coming up with a way to navigate the Trie, however, is proving to be a little more elusive to me. Looking at your comments higher up, you mentioned writing a Trie traversal and stopping at the ith word, but with thousands of possible words and an algorithm which seems to be O(n), that could get really CPU-expensive really quickly.

Unless, each Parent node in the Trie included, in addition to its children, metadata that told how many total children were under it. Something like this maybe:

image

Then traversing the Trie would be much quicker, because I'll know how many words are present under the current node, and can navigate accordingly.

What do you think? Good idea? Awful idea? :-)

Thanks,

-- Doug

@shazow
Copy link

shazow commented Feb 2, 2017

Sorry I can't give everything a thorough review, but here are some hints:

  • To convert to base6, use Number(foo).toString(6)
  • There are some accidentally-global variables in the code, e.g. for(i=0; ...), the i will get global scope. Should be for(var i=0; ...).
  • CPUs perform many billions of operations per second; traversing even a 100,000 word trie is not going to take very long. If it's implemented efficiently (closer to O(logn)), it shouldn't be more than several operations per layer and if an average word is say 6 letters, then we're talking about just a few hundred operations.
  • While metadata per-node might be helpful (I'm not convinced that it's necessary), adding extra layers into the trie for metadata is probably counterproductive.
  • Traditionally tries represent the possibility that a word can be completed at this layer with the existence of a null/space node.

This is a fairly advanced datastructure so it may take some effort to get right.

Edit: Might be better off continuing to use the array implementation and add the rest of the dictionary, it should gzip compress to a reasonable size (~250kb).

@dmuth
Copy link
Owner Author

dmuth commented Feb 3, 2017

I wish I knew about toString() last week. :-) Oh well, it was a good exercise, and I got to learn Qunit. Ain't no harm.

Thank you for catching the unintentional global issue--I don't know how I missed that! Filed as #3 and fixed.

Thanks for the other thoughts on Tries. I think what I'll do is try scaling up the array to more and more words and see how that performs first. Depending on how that goes, I may opt to close this bug out in the future and file a separate one for Trie research.

And thanks for all the info, I really appreciate it.

-- Doug

@dmuth
Copy link
Owner Author

dmuth commented Feb 9, 2017

In the interest of not having this issue expand into like 100 comments, I've gone and split the work that needs to be done into two separate issue: #4 and #5. As such, I'll be closing this one out, and anyone who would like to follow the work on either/both of those two may. :-)

@dmuth dmuth closed this as completed Feb 9, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants