Distributed Mnesia cache guide #220

Closed
danschultzer opened this issue Jun 11, 2019 · 28 comments · Fixed by #233
Labels: enhancement, help wanted

Comments

@danschultzer
Collaborator

danschultzer commented Jun 11, 2019

Based on #219, I think it would be great to have a distributed :mnesia guide. There isn't all that much documentation for :mnesia distribution, so it would be difficult for a lot of developers to get started.

One caveat that may be important to document is split-brain scenarios and how to handle recovery. However, I believe most use cases will just be a single machine, e.g. a blue-green deployment setup.

I don't really know that much about this stuff, and what risks there may be so any help would be much appreciated.

danschultzer added the enhancement and help wanted labels on Jun 11, 2019
@sensiblearts

I don't know much about it either, but after I understand more and get this working, I'll try to draft an outline; it might not be worth much, but it should at least highlight the places where someone who knows more can help.

@sensiblearts

@danschultzer, if you don't recognize the problem immediately, don't feel obligated to spend time on it; just point me to the Mnesia-related files in the code so I can start digging.

I wasn't sure whether to start a new issue, but this is related:

Regarding the quick manual-connection code you posted yesterday (my version is shown below): it breaks if the first node that was started goes down.

I have HAProxy in front of backends A and B, with a health check that takes either out of rotation while it's down, until it comes back up.

If I do this:

  1. start A
  2. start B
  3. click around, all is fine
  4. kill B
  5. click around, all is fine
  6. bring B back online

then, from the frontend you never notice B was gone. No errors.

But if I do this:

  1. start A
  2. start B
  3. kill A
    ...
    then I get this error from B:
[error] #PID<0.860.0> running GjwappWeb.Endpoint (connection #PID<0.859.0>, stream id 1) terminated
Server: localhost:80 (http)
Request: HEAD /
** (exit) {:aborted, {:no_exists, [Pow.Store.Backend.MnesiaCache, "credentials:"]}}

The same thing happens if you reverse the order: start sname "b" first (leaving the code below unchanged), then it breaks if you kill "b" but not if you kill "a".

It's as if the first node to launch owns the key namespace, and takes it with it when it dies.

(Aside: there is also a conflict with the :mnesia :dir config setting when using the Que library for background jobs, see this issue -- just the presence of the :dir setting makes Que think I want persistence, which I don't. Probably unrelated; FYI.)

defmodule Gjwapp.Application do
  use Application

  def start(_type, _args) do
    init_mnesia_cluster(node())

    children = [
      Gjwapp.Repo,
      GjwappWeb.Endpoint,
      {Pow.Store.Backend.MnesiaCache, nodes: Node.list()}
    ]

    opts = [strategy: :one_for_one, name: Gjwapp.Supervisor]
    Supervisor.start_link(children, opts)
  end

  defp init_mnesia_cluster(node) do
    # Connect to the other nodes, start Mnesia, then join the cluster
    # and replicate the cache table to this node
    connect_nodes()
    :mnesia.start()
    :mnesia.change_config(:extra_db_nodes, Node.list())
    :mnesia.change_table_copy_type(:schema, node, :disc_copies)
    :mnesia.add_table_copy(Pow.Store.Backend.MnesiaCache, node, :disc_copies)
  end

  defp connect_nodes(), do: Enum.each(nodes(), &Node.connect/1)

  defp nodes() do
    {:ok, hostname} = :inet.gethostname()

    for sname <- ["a", "b"], do: :"#{sname}@#{hostname}"
  end

  def config_change(changed, _new, removed) do
    GjwappWeb.Endpoint.config_change(changed, removed)
    :ok
  end
end

@danschultzer
Collaborator Author

It's no problem for me to debug this as well. It's something I think I'll need in the near future, so it's good for me to get a better understanding. Also, it would be great if Pow worked out of the box in a multi-node setup.

I wasn't able to replicate the error by killing the first node. However, I was able to trigger a split-brain partition error by killing and starting the different nodes in various orders.

I think I should make the MnesiaCache distribution-friendly, based on how Mnesiac handles it. My guess is that there's an issue in how the current Mnesia cache initializes the table when it's joining an existing cluster, and it would be better if it just copied from the existing cluster when there is one. Also, I can build in recovery, since potential data loss can be ignored with Pow (at worst, a data loss would mean that a few users have to sign in again).

@sensiblearts

sensiblearts commented Jun 13, 2019

I'll do some reading the next few days on distributed erlang, mnesia, etc. Time to get to know this stuff. One thing I'm realizing: Once you go multi-node, you have to learn a lot more about the guts of the system. Bringing in mnesia will probably be worth it in the long run. I wanted to try it rather than fall back on redis.

Somewhat related: do you have any idea why:

  1. I would get a circular dependency between my app and memento?
  2. This circular dependency would not show up with mix phx.server (prod or dev), but shows up only with mix release?

This popped up when I went multi-node with Mnesia; mix release worked fine before that.

My app is gjwapp. I did not fork or modify memento.

MIX_ENV=prod mix release beta1
...
Generated gjwapp app
* assembling beta1-0.1.0 on MIX_ENV=prod
* skipping runtime configuration (config/releases.exs not found)
** (Mix) Circular dependencies among applications: [{memento,"0.3.1"},{gjwapp,"0.1.0"}]

UPDATE: SOLVED and maybe relevant to your docs (or to the author of que lib):

I ran mix app.tree and found these dependencies:

gjwapp --> que --> memento --> mnesia   (memento has :mnesia in :extra_applications)

gjwapp --> mnesia   (I had :mnesia in :included_applications, per the Pow docs)

I still don't understand how this would be circular, but when I remove my included_applications line, the error disappears and the release builds.

Your docs might want to say something like "add :mnesia to :included_applications if it is not already loaded and started by one of your other dependencies", or something along those lines.

@danschultzer
Collaborator Author

danschultzer commented Jun 13, 2019

Good catch! Having :mnesia in :extra_applications is the proper way; :included_applications is for Elixir pre-1.4. I've updated the README, thanks.
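
For reference, a minimal sketch of the relevant mix.exs section (using the Gjwapp module name from the snippets above):

def application do
  [
    mod: {Gjwapp.Application, []},
    # :mnesia has to start with the app; listing it here also ensures
    # it's included in a release
    extra_applications: [:logger, :mnesia]
  ]
end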

@danschultzer
Collaborator Author

danschultzer commented Jun 16, 2019

I've been working on a distributed version of the MnesiaCache that can handle clustering automatically these last few days. I'll push some code as soon as I get it fully working.

@sensiblearts I suspect the issue you experienced was that the second node was in replication mode, rather than also being a master node. Also, the first node should have cleaned out its data before joining the cluster when you restarted it.

Hopefully I can get the distributed MnesiaCache working soon, and it'll handle all of this, including self-healing after split-brain. It would be a great addition to Pow that you won't have to deal with the multi-node setup yourself :)

@sensiblearts

Very cool! I'll help with this, and with documentation, in a few weeks. Right now I'm pushing to get this app launched and the Flutter app submitted to the app store. It's a gardening app, and I'm about to miss the gardening season :-(

Erlang multi-node technology is interesting and I look forward to learning a lot about it and helping with Pow. Your lib has been a joy to use and I plan to study it, too, to learn good practices.

@sensiblearts

Hi Dan,
I got my web app up and the Flutter app into the store. Not quite ready to announce/launch yet, but I can get back to the Mnesia / distributed node issues. Where does it stand, or where would you like me to dig in?
David

@danschultzer
Collaborator Author

danschultzer commented Jul 3, 2019

Yeah, I have an almost-working version, but have been occupied with other stuff. Let me clean up the code and push a WIP PR. Having the MnesiaCache work for distribution by default would be best.

I believe that if you just clear out the previous Mnesia data in A before reconnecting to B after a restart, it should work. Something like :mnesia.delete_schema([node()]), but obviously this should only run if there are other nodes already running (e.g. by checking Node.list()); otherwise, if the whole cluster went down, all data would be reset.
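
A rough sketch of that guard (maybe_reset_mnesia/0 is a made-up name, and connect_nodes/0 is the helper from the earlier snippet; note that :mnesia.delete_schema/1 requires Mnesia to be stopped):

defp maybe_reset_mnesia() do
  connect_nodes()

  # Only purge the local schema when other nodes are already up;
  # if the whole cluster went down, purging would reset all data
  unless Enum.empty?(Node.list()) do
    :mnesia.stop()
    :mnesia.delete_schema([node()])
  end

  :mnesia.start()
end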

@danschultzer
Collaborator Author

Ok, you can check it out in #233. I can't get the tests to work, but I'll continue working on it over the next few days.

@danschultzer
Collaborator Author

Finally got the tests working! Basically, when a node reconnects to the cluster it'll just purge all its data, unless it's the first node to start (no other nodes connected). Let me know if it works for you @sensiblearts 😄

@sensiblearts

Will do. I'll dig into it this week.

@sensiblearts

sensiblearts commented Jul 8, 2019

@danschultzer, sorry to trouble you with elementary questions, but what is the proper workflow for me to work on this? Should I fork it, point my deps to a local copy, then check out the commit with the cluster fix? (I've never worked as part of a team, so I never learned this stuff.)

Also, do I still loop over nodes and connect (in application.ex), or is the extra_db_nodes: Node.list() config sufficient?

It will take a while for me to understand all this, but I'll stick with it until I do, and make some docs.

@sensiblearts

I cloned the repository and then

git checkout 8b7a988d3c6dfd88fc3a2837d038a4df8e212d87
...
You are in 'detached HEAD' state. You can look around, make experimental
...
HEAD is now at 8b7a988... Distributed cluster support in MnesiaCache

Then I rebuilt and started 2 Phoenix servers behind HAProxy:

MIX_ENV=dev PORT=4000 elixir --sname a -S mix phx.server
MIX_ENV=dev PORT=4002 elixir --sname b -S mix phx.server

And, as before, the load is balanced fine and the Pow backend is shared across nodes; however, if I kill the first node that was started, I still get

** (exit) {:aborted, {:no_exists, [Pow.Store.Backend.MnesiaCache, "credentials:"]}}

which does not happen if I kill the node that was started second.

Here's part of my application.ex:

  def start(_type, _args) do
    init_mnesia_cluster(node())

    children = [
      Gjwapp.Repo,
      GjwappWeb.Endpoint,
      {Pow.Store.Backend.MnesiaCache, nodes: Node.list()}
    ]

    opts = [strategy: :one_for_one, name: Gjwapp.Supervisor]
    Supervisor.start_link(children, opts)
  end

  defp init_mnesia_cluster(node) do
    # connect_nodes()  # no longer doing this
    :mnesia.start()
    :mnesia.change_config(:extra_db_nodes, Node.list())
    :mnesia.change_table_copy_type(:schema, node, :disc_copies)
    :mnesia.add_table_copy(Pow.Store.Backend.MnesiaCache, node, :disc_copies)
  end

@danschultzer
Collaborator Author

danschultzer commented Jul 8, 2019

You can use the PR branch by setting {:pow, github: "danschultzer/pow", ref: "distributed-mnesia-cache"} in your mix.exs.

Now that the MnesiaCache handles clustering, you can remove the custom init_mnesia_cluster/1 and just set the :extra_db_nodes setting: {Pow.Store.Backend.MnesiaCache, extra_db_nodes: Node.list()}. It should work from there on, as long as your nodes are already connected 😄
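
To illustrate, start/2 from the earlier snippet could then be reduced to something like this (a sketch; connect_nodes/0 is the manual-connection helper from before):

def start(_type, _args) do
  # Connect the nodes first (manually here; libcluster would work too)
  connect_nodes()

  children = [
    Gjwapp.Repo,
    GjwappWeb.Endpoint,
    # The MnesiaCache now handles joining and replicating the cluster itself
    {Pow.Store.Backend.MnesiaCache, extra_db_nodes: Node.list()}
  ]

  opts = [strategy: :one_for_one, name: Gjwapp.Supervisor]
  Supervisor.start_link(children, opts)
end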

There's a caveat with the current version: I realized that the clear-all approach I've taken may interrupt other applications' use of Mnesia, e.g. your use of Que. I'll see if I can make it clear just the Pow-related data and let the rest run as usual, but since the Pow MnesiaCache already sets up Mnesia to replicate across the cluster, it may not make much sense anyway.

@sensiblearts

Well, I spent about 3 hours experimenting and drafting a note to you about what was not working. And then... I realized I was running my 2 Phoenix backends from the same folder! And here I was wondering why the Mnesia files disappeared.

The upside is that it forced me to examine mnesia_cache.ex, do some Google searches, and think about Mnesia.

Anyway, it seems to be working fine. This week I'll draft an outline of docs for distributed mnesia cache.

@danschultzer
Collaborator Author

That's excellent! Thanks for testing it out. I've updated the docs so they now explain the caveat with the Mnesia directory when running multiple nodes from the same directory:

https://github.com/danschultzer/pow/blob/1fee285eae7ccc0770a0cfddd0e32c4ded2aa496/lib/pow/store/backend/mnesia_cache.ex#L23-L28

I'll also create a thread on the Elixir forum to get some more input on this PR, since I want to be sure that I haven't opened up any potential pitfalls with this implementation.

From all my research, this does seem to be the right way of handling replication with Mnesia (the purge-all approach might be problematic if you use Mnesia for other things, but then again, the MnesiaCache should then run on a separate node from the rest of what uses Mnesia).

@sensiblearts

I drafted an outline of what I'm thinking for the guide.

No rush for feedback, and be as critical as you wish, regarding what is already there, or what is planned.

@danschultzer
Collaborator Author

Sorry for the delay; I wanted to read the guide carefully and was too occupied last week.

The guide looks good, great work! I would add a note about adding :mnesia to :extra_applications in mix.exs to ensure that it's also included in the release.

Also, I think I would go lighter on explaining distribution/Mnesia in general, and instead start out by explaining how Pow.Store.Backend.MnesiaCache handles distribution (e.g. how you only have to pass in :extra_db_nodes to get it running, and what happens upon reconnecting to the cluster), and after that go into strategies.

The overall structure is how I prefer it: starting out as basic as possible with just manually connecting the nodes when starting the app, and then discussing alternatives such as libcluster. I think some of the paragraphs can be condensed/rearranged, but that's something that can be done once all the information is there.

Does the current setup work for you with the updated MnesiaCache? Is it in production?

I'll review the PR asap, get some more eyes on it, and if all is good, get it merged so a new version of Pow can be released that deals with distribution. I just want to make sure there are no issues with it, since it wasn't super easy to implement/understand 😄

@sensiblearts

@danschultzer No need to apologize; I don't consider that a delay.

Your suggestions sound good and I'll incorporate them when I get back to it in a few days.

I wouldn't consider this a "Pull Request" just yet -- I intend to test what I write by doing it in production (and figure out what to do in production by writing!), and when it works, at that point you can consider it a PR. But you can take what you like at any time. It worked fine on the same machine, either by changing the Mnesia file location or by cloning the entire Phoenix folder.

I have a lot to learn about how Erlang nodes communicate; e.g., I just learned that the .hosts.erlang file, which lists the nodes, does not really have to list all the nodes, because nodes will learn about other nodes from the nodes they connect to..? Now I'm wondering when you would use .hosts.erlang vs. sys.config (as in some tutorials). I'm thinking it's an OTP/OS-level vs. application-level approach to config, but I'm not sure. Do you know of a good review article or book that covers this?

Also, as far as testing in "production," I don't really have any traffic yet (this is my app, https://gardenjournal.app/ ), but I want to have it ready so I can add a second server on short notice.

(BTW, you mentioned that you were working on a Flutter app, too. How do you like it? I was quite pleased with the experience. I'm using the Couchbase Lite 1.4 Java plugin to sync with server-side CouchDB. I was impressed with how easy it is to understand the plugin API and work on the Java side, not having used Java in 20 years!)

@sensiblearts

@danschultzer update: I found Learn You Some Erlang, which is pretty good on node networking.

Also, I now understand the libcluster erlang_hosts strategy; specifically, that the hosts file contains hosts -- not node names (duh!), and that after booting, a node uses the list of hosts to call :net_adm.names/1 (against the epmd daemon port) on each host to get whatever nodes are running there.
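
To illustrate (hostnames and ports here are made up): .hosts.erlang lists one quoted host atom per line, each terminated by a period, and a node can then ask the epmd daemon on each of those hosts which nodes are registered there:

# .hosts.erlang:
#
#   'host1.example.com'.
#   'host2.example.com'.

iex> :net_adm.names(:"host1.example.com")
{:ok, [{'a', 52620}, {'b', 52621}]}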

And the "epmd strategy" is where node names are specified (either in code or in a sys.config file).

(For a typical web application, I expect there is one node per host, which is why all the Elixir tutorials I've seen talk about using the config file with [a@host, b@host, etc.] rather than a .hosts.erlang file.)

Anyway, I'm stuck at the stage of trying to think through use cases:

(node == host == vps)

  1. I have 1 backend Phoenix node and want to add a second without having to restart all nodes

  2. I have 2 phoenix nodes running, one goes down, I create a new node and add it without having to restart the others

In both cases I'm not sure how to handle :extra_db_nodes -- since it is an initialization option, I could not see anywhere in the code where changing it would affect an already running server. All the Mnesia-related functions are defp.

Maybe I'm asking too much; just update the sys.config file (with any new node names), and restart the nodes, one at a time. No big deal..?

@danschultzer
Collaborator Author

danschultzer commented Jul 16, 2019

As I understand it, you wouldn't need to restart the other nodes. You actually only need one node in :extra_db_nodes to connect to the cluster; all other nodes will automatically be connected. That makes it very easy to join a cluster, since you won't have to update the config on the old nodes at all. And as long as an old node connects to at least one node that's in the cluster when restarting, it'll automatically connect to the new node as well, even without updating the :extra_db_nodes setting.
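
So for a fresh node, something as minimal as this should be enough to join (the node name is made up):

{Pow.Store.Backend.MnesiaCache, extra_db_nodes: [:"a@host1"]}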

I would probably still update the sys.config with the new nodes, just to be sure a restarted node has access to all nodes in the cluster, since some of the old nodes may have been removed entirely by then.

@danschultzer
Collaborator Author

> BTW, you mentioned that you were working on a flutter app, too. How do you like it?

I have only just started on the client I'm building, but so far I enjoy it too 😄

Thanks for all the notes by the way. It's great info. I hope I'll get the chance to deploy a distributed app soon.

@danschultzer
Collaborator Author

Hey @sensiblearts, just FYI I've finished up #233 now. There's a GenServer that can be added to automatically heal after netsplit, and I think the solution overall is pretty solid now! I'll get it merged in today, and we can take a second look at the guide 😄
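
(In the released version of Pow, that GenServer ended up as Pow.Store.Backend.MnesiaCache.Unsplit; assuming that module name, the supervision tree would look something like this sketch:)

children = [
  Gjwapp.Repo,
  GjwappWeb.Endpoint,
  {Pow.Store.Backend.MnesiaCache, extra_db_nodes: Node.list()},
  # Automatically recovers the cache after a netsplit
  Pow.Store.Backend.MnesiaCache.Unsplit
]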

@sensiblearts

sensiblearts commented Aug 16, 2019 via email

@danschultzer
Collaborator Author

Not participating, but sure, shoot me an e-mail :)

@sensiblearts

sensiblearts commented Aug 16, 2019 via email

@danschultzer
Collaborator Author

For the email confirmation live view, the issue is that auth is rejected unless the email has been confirmed. But I think it can be dealt with using a custom controller, and I really like the idea that the live view just waits for confirmation and then redirects. I'm not sure if it goes against what LiveView/sockets should be used for (I think Chris has mentioned that handling auth in LiveView is a bad idea).

As for your other comment, I'll reply over e-mail instead of in this issue 😄
