Distributed Mnesia cache guide #220
I don't know much about it either, but after I understand more and get this working I'll try to draft an outline; it might not be worth much, but it should highlight the places where someone who knows more can help.
@danschultzer, if you don't know the problem immediately, don't feel obligated to spend time; just point me to the mnesia-related file in the code so I can start digging. I wasn't sure whether to start a new issue, but this is related: regarding that quick manual connection code that you posted yesterday (my code shown below), it breaks if the first node started goes down. I have HAProxy in front of backends A and B, with a health check that ignores A or B if either is down, until it comes back up. If I do this:
then, from the frontend you never notice B was gone. No errors. But if I do this:
The same thing happens if you reverse the order: start sname "b" first (leaving the code below unchanged); then it breaks if you kill "b" but not if you kill "a". It's as if the first node to launch owns the key namespace and takes it with it when it dies. (Aside: there is also an issue of a conflict with the :mnesia :dir config setting when using the Que lib for background jobs, see this issue -- just the presence of the config setting for :dir makes Que think I want persistence, which I don't. Probably unrelated; FYI.)

```elixir
defmodule Gjwapp.Application do
  use Application

  def start(_type, _args) do
    init_mnesia_cluster(node())

    children = [
      Gjwapp.Repo,
      GjwappWeb.Endpoint,
      {Pow.Store.Backend.MnesiaCache, nodes: Node.list()}
    ]

    opts = [strategy: :one_for_one, name: Gjwapp.Supervisor]
    Supervisor.start_link(children, opts)
  end

  # Connect to the other nodes, then join/extend the mnesia cluster.
  defp init_mnesia_cluster(node) do
    connect_nodes()
    :mnesia.start()
    :mnesia.change_config(:extra_db_nodes, Node.list())
    :mnesia.change_table_copy_type(:schema, node, :disc_copies)
    :mnesia.add_table_copy(Pow.Store.Backend.MnesiaCache, node, :disc_copies)
  end

  defp connect_nodes(), do: Enum.each(nodes(), &Node.connect/1)

  # Hardcoded short names "a" and "b" on the local host.
  defp nodes() do
    {:ok, hostname} = :inet.gethostname()
    for sname <- ["a", "b"], do: :"#{sname}@#{hostname}"
  end

  def config_change(changed, _new, removed) do
    GjwappWeb.Endpoint.config_change(changed, removed)
    :ok
  end
end
```
It's no problem for me to debug this as well. It's something I think I'll need in the near future, so it's good for me to get a better understanding. Also, it would be great if Pow worked out of the box in a multi-node setup. I wasn't able to replicate the error by killing the first node. However, I was able to trigger a split-brain partition error by messing around with killing and starting the different nodes. I think I should make MnesiaCache distribution-friendly based on how Mnesiac handles it. My guess is that there is an issue in how the current mnesia cache initializes the table when it's actually joining a cluster, and it would be better if it just copied from the existing cluster if there is one. Also, I can build in recovery, since I can just ignore potential data loss with Pow (at worst, a data loss would mean that a few users have to sign in again).
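A rough sketch of that join-by-copying idea, reusing the same :mnesia calls as the setup code above (not the current implementation; `cluster_nodes` is assumed to be the list of already-running nodes):

```elixir
# Sketch: join an existing Mnesia cluster by replicating from it,
# rather than initializing a fresh table. `cluster_nodes` is a
# hypothetical list of already-running nodes.
defp join_cluster(cluster_nodes) do
  :ok = :mnesia.start()

  # Pull the schema from the running cluster instead of creating one.
  {:ok, _} = :mnesia.change_config(:extra_db_nodes, cluster_nodes)

  # Keep copies of the schema and the cache table on this node.
  :mnesia.change_table_copy_type(:schema, node(), :disc_copies)
  :mnesia.add_table_copy(Pow.Store.Backend.MnesiaCache, node(), :disc_copies)
end
```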
I'll do some reading the next few days on distributed Erlang, mnesia, etc. Time to get to know this stuff. One thing I'm realizing: once you go multi-node, you have to learn a lot more about the guts of the system. Bringing in mnesia will probably be worth it in the long run; I wanted to try it rather than fall back on Redis. Somewhat related: do you have any idea why:
This popped up since I went multi-node / mnesia. Mix release worked fine before that. My app is gjwapp. I did not fork or modify memento.
UPDATE: SOLVED, and maybe relevant to your docs (or to the author of the que lib). The dependency chains were:

- gjwapp -> que -> memento -> mnesia (memento has :mnesia in its :extra_applications)
- gjwapp -> mnesia (I had :mnesia in :included_applications, per the Pow docs)

I still don't understand how this would be circular, but when I remove my :included_applications line, the error disappears and the release builds. Your docs might want to say something like "add :mnesia to :included_applications if it is not already loaded and started by one of your other dependencies," or something like that.
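For context, here's roughly what the mix.exs change looks like (a sketch; the commented line is the one I removed):

```elixir
# Sketch of the application/0 callback in mix.exs. Since memento
# already starts :mnesia via its :extra_applications, the commented
# line below is the one that had to be removed to fix the release.
def application do
  [
    mod: {Gjwapp.Application, []},
    extra_applications: [:logger, :runtime_tools]
    # included_applications: [:mnesia]  # removed; memento already starts it
  ]
end
```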
Good catch! Having
I've been working on a distributed version of the MnesiaCache that can handle clustering automatically these last few days. I'll push some code as soon as I get it working fully. @sensiblearts I suspect the issue you experienced was that the second node was in replication mode, rather than also being a master node. Also, the first node should have cleaned out its data before joining the cluster when you restarted it. Hopefully I can get the distributed MnesiaCache working soon, and it'll handle all of this, including self-healing after a split-brain. It would be a great addition to Pow that you won't have to deal with the multi-node setup yourself :)
Very cool! I'll help with this, and the documentation, in a few weeks. Right now I'm pushing to try to get this app launched and get the flutter app submitted to the app store. It's a gardening app, and I'm about to miss the gardening season :-( Erlang multi-node technology is interesting and I look forward to learning a lot about it and helping with Pow. Your lib has been a joy to use and I plan to study it, too, to learn good practices.
Hi Dan,
Yeah, I have an almost-working version, but have been occupied with other stuff. Let me clean up the code and push a WIP PR. Having the MnesiaCache work by default for distribution would be the best. I believe that if you just clear out the previous Mnesia data in A before reconnecting to B after restart, it should work. Something like
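(A sketch of that clear-out idea; the exact calls here are an assumption:)

```elixir
# Sketch: wipe node A's stale Mnesia data before it rejoins the
# cluster. The directory-deletion flow is an assumption.
:mnesia.stop()

mnesia_dir = :mnesia.system_info(:directory)
{:ok, _deleted} = File.rm_rf(to_string(mnesia_dir))

:mnesia.start()
```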
Ok, you can check it out in #233. I can't get the tests to work, but will continue working on it over the next few days.
Finally got the tests working! Basically, when a node reconnects to the cluster it'll just purge all the data unless it's the first node to start (no other nodes connected). Let me know if it works for you @sensiblearts 😄
Will do. I'll dig into it this week.
Also, do I still (in application.ex) loop over the nodes and connect, or is the config enough? It will take a while for me to understand all this, but I'll stick with it until I do, and make some docs.
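If the new MnesiaCache handles clustering itself, I'd guess the child spec shrinks to something like this (a sketch; the :extra_db_nodes option name is my assumption):

```elixir
# Sketch: supervision tree if the PR's MnesiaCache joins the cluster
# itself. The :extra_db_nodes option name is an assumption here.
children = [
  Gjwapp.Repo,
  GjwappWeb.Endpoint,
  {Pow.Store.Backend.MnesiaCache, extra_db_nodes: Node.list()}
]
```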
I cloned the repository and then
And I rebuilt and started 2 Phoenix servers behind HAProxy:
And, as before, the load is balanced fine and the Pow backend is shared across nodes; however, if I kill the first node that was started, I still get
which does not happen if I kill the node that was started second. Here's part of my application.ex:

```elixir
def start(_type, _args) do
  init_mnesia_cluster(node())

  children = [
    Gjwapp.Repo,
    GjwappWeb.Endpoint,
    {Pow.Store.Backend.MnesiaCache, nodes: Node.list()}
  ]

  opts = [strategy: :one_for_one, name: Gjwapp.Supervisor]
  Supervisor.start_link(children, opts)
end

defp init_mnesia_cluster(node) do
  # connect_nodes() # no longer doing this
  :mnesia.start()
  :mnesia.change_config(:extra_db_nodes, Node.list())
  :mnesia.change_table_copy_type(:schema, node, :disc_copies)
  :mnesia.add_table_copy(Pow.Store.Backend.MnesiaCache, node, :disc_copies)
end
```
You can use the PR branch by setting the dependency to it in mix.exs. Now that the MnesiaCache handles the cluster, you can remove the custom `init_mnesia_cluster/1` setup. There's a caveat with the current version; I realized that the clear-all approach I've taken may interrupt other applications' use of mnesia, e.g. if you are using que. I'll see if I can make it just clear the Pow-related stuff and let the rest run as usual, but since the Pow MnesiaCache already sets up Mnesia to replicate a cluster, it may not make much sense anyway.
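Something like this in mix.exs (the branch name below is a placeholder, not the actual branch for #233):

```elixir
# Placeholder sketch; swap in the actual branch name for PR #233.
defp deps do
  [
    {:pow, github: "danschultzer/pow", branch: "some-branch"}
  ]
end
```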
Well, I spent about 3 hours experimenting and drafting a note to you about what was not working. And then... I realized I was running my 2 Phoenix backends from the same folder! And wondering why the mnesia files disappeared. The upside is that it forced me to examine mnesia_cache.ex, do some Google searches, and think about mnesia. Anyway, it seems to be working fine. This week I'll draft an outline of docs for the distributed mnesia cache.
That's excellent! Thanks for testing it out. I've updated the docs so they now explain this caveat of running multiple nodes from the same directory. I'll also create a thread on Elixir Forum to get some more input on this PR, since I want to be sure that I haven't opened up any potential pitfalls with this implementation. From all my research, this does seem to be the right way of handling replication with Mnesia (the purge-all approach might be problematic if you use Mnesia for other stuff, but then again, the MnesiaCache should then be run on a separate node from the rest of the stuff that uses Mnesia).
I drafted an outline of what I'm thinking for the guide. No rush on feedback, and be as critical as you wish regarding what is already there or what is planned.
Sorry for the delay, I wanted to carefully read the guide and was too occupied the last week. The guide looks good, great work! I would add a note about adding

Also, I think I would go lighter on explaining distribution/mnesia in general, and instead just start out with explaining how

The overall structure is how I prefer it, starting out as basic as possible with just manually connecting the nodes when starting the app, and then discussing alternatives, e.g. libcluster. I think some of the paragraphs can be condensed/rearranged, but that's something that can be done once all the information is there. Does the current setup work for you with the updated MnesiaCache? Is it in production? I'll review the PR asap, get some more eyes on it, and if all is good get it merged so a new version of Pow can be released that deals with distribution. I just want to make sure there are no issues with it, since it wasn't super easy to implement/understand 😄
@danschultzer No need to apologize; I don't consider that a delay. Your suggestions sound good and I'll incorporate them when I get back to it in a few days. I wouldn't consider this a "Pull Request" just yet -- I intend to test what I write by doing it in production (and figure out what to do in production by writing!), and when it works, at that point you can consider it a PR. But you can take what you like at any time. It worked fine on the same machine, either changing the mnesia file location or cloning the entire phoenix folder.

I have a lot to learn about how erlang nodes communicate; e.g., I just learned that the .hosts.erlang file, which lists the nodes, does not really have to list all the nodes, because the nodes will learn about nodes from the nodes they connect to..? Now I'm wondering when you would use .hosts.erlang vs. sys.config (as in some tutorials). I'm thinking it's an OTP/OS-level vs. application-level approach to config, but I'm not sure. Do you know of a good review article or book that covers this?

Also, as far as testing in "production," I don't really have any traffic yet (this is my app, https://gardenjournal.app/ ), but I want to have it ready so I can add a second server on short notice.

(BTW, you mentioned that you were working on a flutter app, too. How do you like it? I was quite pleased with the experience. I'm using the Couchbase Lite 1.4 java plugin to sync with server-side CouchDB. I was impressed with how easy it is to understand the plugin api and work on the java side, not having used java in 20 years!)
@danschultzer Update: I found Learn You Some Erlang, which is pretty good on the node networking. Also, I now understand the libcluster erlang_hosts strategy; specifically, that the host file contains hosts -- not names (duh!), and that after booting, a node uses the list of hosts to call :net_adm.names/1 (on the epmd daemon port) at each host to get whatever nodes are running there. And the "epmd strategy" is where node names are specified, either in code or in a sys.config file (a config sketch follows at the end of this comment). (For a typical web application, I expect that there is one node per host, which is why all the elixir tutorials I've seen talk about using the config file with [a@host, b@host, etc.] rather than using a .hosts.erlang file.) Anyway, I'm stuck at the stage of trying to think through use cases: (node == host == vps)
In both cases I'm not sure how to handle :extra_db_nodes -- since it is an initialization option, I could not see anywhere in the code where changing it would have an effect on an already-running server. All the mnesia-related functions are defp. Maybe I'm asking too much; just update the sys.config file (with any new node names) and restart the nodes, one at a time. No big deal?
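For reference, here's roughly what the epmd strategy config looks like (a sketch; the topology key and node names are placeholders):

```elixir
# Sketch of a libcluster topology using the Epmd strategy; the
# topology key and node names are placeholders.
config :libcluster,
  topologies: [
    gjwapp: [
      strategy: Cluster.Strategy.Epmd,
      config: [hosts: [:"a@myhost", :"b@myhost"]]
    ]
  ]
```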
As I understand it, you wouldn't need to restart the other nodes. You actually only need one node in :extra_db_nodes. I would probably still update the sys.config file, though.
I have only just started on the client I'm building, but so far I enjoy it too 😄 Thanks for all the notes by the way. It's great info. I hope I'll get the chance to deploy a distributed app soon.
Hey @sensiblearts, just FYI I've finished up #233 now. There's a GenServer that can be added to automatically heal after netsplit, and I think the solution overall is pretty solid now! I'll get it merged in today, and we can take a second look at the guide 😄
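Wiring it in might look like this (a sketch; the Unsplit module name and the :extra_db_nodes option are assumptions based on what later shipped in Pow):

```elixir
# Sketch: MnesiaCache plus the netsplit-recovery GenServer in the
# supervision tree. The Unsplit module name is an assumption.
children = [
  {Pow.Store.Backend.MnesiaCache, extra_db_nodes: Node.list()},
  Pow.Store.Backend.MnesiaCache.Unsplit
]
```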
Hi, I just got back from a (too short) vacation. I'll take a look at your work today and start working on the guide again (and on testing multi-node in "production." I put it in quotes because there are currently only 3 users, all family.) Thanks for the notice / nudge :-)

BTW, you're not by chance participating in this, are you? https://phoenixphrenzy.com/ I have a simple idea, related to a (premium) feature that I may add to my site. But 2 heads are better than 1, so I thought I'd see if you have time/interest.

David
Not participating, but sure, shoot me an e-mail :)
Actually, I had an idea a while back that might be a use for LiveView in Pow: in a mobile app (e.g., flutter) webview, when you sign up and then go to your email app to confirm your email, it opens a separate browser window; then, when you go back to your mobile app webview, it does not know yet, so you have to sign in. LiveView could serve to push-notify the webview that you're confirmed and just let you in. I haven't spent much time thinking about it, so it's a rough idea.

Also, I don't recall the details, but I started to do something like this in my app, using Channels (before LV was released). I abandoned it because I changed the UX flow or something. (Also, there's a flutter lib for Channels out there.)

D
For the email confirmation LiveView, the issue is that the auth is rejected unless the email has been confirmed. But I think with a custom controller it can be dealt with, and I really like the idea that the live view just waits for confirmation and then redirects. Not sure if it goes against what live view/socket should be used for (I think Chris has mentioned that using auth in LiveView is a bad idea). As for your other comment, I'll comment on it over e-mail instead of this issue 😄
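A very rough sketch of the idea, in later LiveView syntax (module name, session key, PubSub topic, and redirect path are all made up for illustration):

```elixir
# Sketch: a LiveView that waits for an "email confirmed" broadcast
# and then redirects. All names here are hypothetical.
defmodule GjwappWeb.AwaitConfirmationLive do
  use Phoenix.LiveView

  # Subscribe to a per-user topic; the confirmation controller would
  # broadcast on it once the email link is clicked.
  def mount(_params, %{"user_id" => user_id}, socket) do
    if connected?(socket) do
      Phoenix.PubSub.subscribe(Gjwapp.PubSub, "confirmation:#{user_id}")
    end

    {:ok, socket}
  end

  def render(assigns) do
    ~H"<p>Waiting for you to confirm your email...</p>"
  end

  # Redirect into the app as soon as confirmation arrives.
  def handle_info(:email_confirmed, socket) do
    {:noreply, redirect(socket, to: "/")}
  end
end
```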
Based on #219, I think it would be great to have a distributed :mnesia guide. There isn't all that much documentation for :mnesia distribution, so it would be difficult for a lot of developers to get started.

One caveat that may be important to document is split-brain and how to handle recovery. However, I believe most use cases will just be single machine, e.g. a blue-green deployment setup.

I don't really know that much about this stuff, and what risks there may be, so any help would be much appreciated.