Using Supervisors to Keep ErlyBank Afloat

Posted by on September 13, 2008

This is the fourth article in the OTP introduction series. If you haven’t yet, I recommend you start with the first article, which covers gen_server and lays the foundation for our bank system. If you are a quick learner, you can view the Erlang files completed so far: eb_server.erl, eb_event_manager.erl, eb_withdrawal_handler.erl, and eb_atm.erl.

The Scenario: The thing that makes us feel good about banks and ATMs is that they are always there. We can deposit and withdraw money whenever we want, 24 hours a day, using an ATM. Or we can walk into any bank branch while it’s open and know we have complete access to our funds. To provide this assurance, we need to make sure the system running ErlyBank always stays up: the processes must always be running. ErlyBank has commissioned us to achieve this goal. 100% uptime! (Or as close to that as we can get.)

The Result: Using an OTP supervisor, we will create a process whose responsibility it is to watch the running processes and make sure they stay up.

What is a Supervisor?

A supervisor is a process that monitors what are called child processes. If a child process goes down, the supervisor uses its restart strategy and the child’s specification to decide how to restart it. This mechanism can keep Erlang systems running indefinitely.

The supervisor is part of what is called a supervision tree. A well written Erlang/OTP application starts with a root supervisor, which watches over child supervisors, which in turn watch over more supervisors or processes. The idea is that if a supervisor goes down, the parent supervisor will restart it, all the way up to the root supervisor. The Erlang runtime has a heart option which will watch the entire system and restart it if the root supervisor were to die. This way, the supervision tree will always be intact.

There is only one callback for a supervisor: init/1. Its role is to return a list of child processes and restart strategies for each process, so the supervisor knows what to watch and what actions to take if something goes wrong.
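Concretely, the return value of init/1 has the following shape; both parts are explained in the sections that follow:

{ok, {{RestartStrategy, MaxRetries, MaxTime}, [ChildSpec]}}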

Decoupling eb_server and the Event Manager

One of the things I did in the last article on gen_event was explicitly start the event manager process in the init method of eb_server. I did this at the time because it was the only option I really had if I wanted to easily start the server with that dependency. But now that we’re going to be handling startup and shutdown with a supervisor, we can start the event manager within the supervision tree. So let’s take the eb_event_manager startup out of the server.

To do this, simply remove line 84 from eb_server.erl, which starts the event manager. In its place, I added an add_handler call that registers eb_withdrawal_handler with the event manager. The init method of eb_server now looks like this:

init([]) ->
  eb_event_manager:add_handler(eb_withdrawal_handler),
  {ok, dict:new()}.

 

Click here to view eb_server.erl after this change.

The Supervisor Skeleton

A basic skeleton for writing a supervisor can be viewed here. As you can see, it has a start function and a basic init function, which returns a restart strategy and a placeholder child spec for now. Restart strategies and child specifications are covered in the next sections of this article.
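In case the linked skeleton isn’t handy, here is a minimal sketch of roughly what it contains, assuming the supervisor registers itself locally under its module name; the placeholder child spec is just a stand-in until we write real specs below:

-module(eb_sup).
-behaviour(supervisor).

-export([start_link/0]).
-export([init/1]).

%% Start the supervisor and register it locally as eb_sup.
start_link() ->
  supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Return a restart strategy and a placeholder child spec for now.
%% 'example_module' does not exist; replace this spec before starting children.
init([]) ->
  AChild = {example_id, {example_module, start_link, []},
            permanent, 2000, worker, [example_module]},
  {ok, {{one_for_one, 5, 10}, [AChild]}}.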

Save the skeleton as eb_sup.erl. The naming of this file is another convention: the supervisor for a certain group is always suffixed with “_sup”. It’s not mandatory, but it’s standard practice.

Restart Strategies

A supervisor has one restart strategy, which it uses in conjunction with the child specifications to determine what it should do if one of its child processes dies. The following are the possible restart strategies:

  • one_for_one - When one of the child processes dies, the supervisor restarts it. Other child processes aren’t affected.
  • one_for_all - When one of the child processes dies, all the other child processes are terminated, and then all restarted.
  • rest_for_one - When one of the child processes dies, the “rest” of the child processes defined after it in the child specification list are terminated, then all restarted.

When specifying the restart strategy, it takes the following format:

{RestartStrategy, MaxRetries, MaxTime}

 

This is very simple to understand once you wrap your mind around the English: if the children are restarted more than MaxRetries times within MaxTime seconds, the supervisor terminates all of its child processes and then itself. This avoids an infinite loop of restarting a crashing child process.
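For example, the strategy tuple we will use later in this article allows at most five restarts within any ten-second window:

{one_for_one, 5, 10}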

Child Specification Syntax and Concepts

The init callback for the supervisor is responsible for returning a list of child specifications. These specs tell the supervisor which processes to start and how to start them. The supervisor starts the processes in order from left to right (beginning of the list to the end of the list). A child specification is a tuple with the following format:

{Id, StartFunc, Restart, Shutdown, Type, Modules}

Definitions:
Id = term()
 StartFunc = {M,F,A}
  M = F = atom()
  A = [term()]
 Restart = permanent | transient | temporary
 Shutdown = brutal_kill | int()>=0 | infinity
 Type = worker | supervisor
 Modules = [Module] | dynamic
  Module = atom()

 

Id is only used internally by the supervisor to identify the child specification. It’s a general convention to make the Id the same as the module name of the child process; if you’re starting multiple instances of the same module, suffix the Id with a number.

StartFunc is a tuple in the format {Module, Function, Args}, which specifies the function to call to start the process. REALLY IMPORTANT: The start function must create and link to the process, and should return {ok, Pid}, {ok, Pid, Other}, or {error, Reason}. The standard OTP start_link functions follow this rule, but if you implement a module which starts its own custom processes, make sure you use spawn_link to start them (hence the blog title, if you didn’t know).

Restart is one of three atoms, defined above in the code block. If restart is “permanent” then the process is always restarted. If the value is “temporary” then the process is never restarted. And if the value is “transient” the process is only restarted if it terminated abnormally.

Shutdown tells the supervisor how to terminate the child process. The atom “brutal_kill” kills the child immediately, without giving it a chance to clean up in terminate. An integer greater than zero is a timeout in milliseconds for a graceful shutdown. And the atom “infinity” will shut the process down gracefully and wait forever for it to stop.

Type tells the supervisor whether the child is another supervisor or any other process. If it is a supervisor, use the atom “supervisor” otherwise use the atom “worker.”

Modules is either a list of modules this process affects or the atom “dynamic.” 95% of the time, you will just use the single OTP callback module in a list for this value. You use “dynamic” if the process is a gen_event process, since the modules it affects are dynamic (multiple handlers that can’t be determined right away). This list is only used for release handling and is not important in the context of this article, but will be used in a future article about release handling.
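To tie the six fields together, here is a generic worker spec with each position labeled (some_worker is just a hypothetical module name for illustration):

{some_worker,                    % Id: how the supervisor refers to this child
 {some_worker, start_link, []},  % StartFunc: {Module, Function, Args}
 permanent,                      % Restart: always restart this child
 2000,                           % Shutdown: allow 2000 ms for graceful shutdown
 worker,                         % Type: a plain process, not a supervisor
 [some_worker]}                  % Modules: the callback module(s) it uses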

Whew! That was a lot of information to soak up in so little time. It took me quite a long time to remember the format of the child specs and the different restart strategies, so don’t sweat it if you can’t remember. You can always reference this information on the supervisor manual page.

Event Manager Child Spec

The first thing we want to start is the event manager, since the server depends on it. The child specification looks something like this:

  EventManager = {eb_event_manager,{eb_event_manager, start_link,[]},
            permanent,2000,worker,dynamic}.

 

After reading the child specification syntax section, this piece of code should be fairly straightforward. You will probably need to go back and reference the spec to see what each parameter does, and that is completely normal! It’s better to go back and understand the code than to nod your head and forget it in a few minutes. The one “weird” thing in the spec, I suppose, is that the module list is set to “dynamic”. This is because the child is a gen_event process and the set of modules it uses is dynamic because of the handlers plugging into it. In other cases, you would list all the modules the process uses.

Here is the init method after adding this child spec:

init([]) ->
  EventManager = {eb_event_manager,{eb_event_manager, start_link,[]},
            permanent,2000,worker,dynamic},
  {ok,{{one_for_one,5,10}, [EventManager]}}.

 

I like to assign each child spec to a variable, and then use these variables for the return value, rather than putting the specs directly into the return value. One of my biggest peeves in Erlang is when a programmer nests lists and tuples so deeply that you can’t see where one ends and another begins, so I recommend you assign each to a variable too.

If you compile and run the supervisor now (I think you should!), then after calling the supervisor’s start_link function, type whereis(eb_event_manager) and it should return the pid of the event manager process. Then, if you kill the supervisor with exit(whereis(eb_sup), kill) and try to get the eb_event_manager pid again, the result should be undefined, since the process was taken down along with its supervisor.

Also, for fun, kill the eb_event_manager while it is running under the supervisor. Wait a couple seconds and check the process again. It should be back up!
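A quick shell session for this experiment might look something like the following; the pids are placeholders and will differ on your machine:

1> eb_sup:start_link().
{ok,<0.35.0>}
2> whereis(eb_event_manager).
<0.36.0>
3> exit(whereis(eb_event_manager), kill).
true
4> whereis(eb_event_manager).  % after a moment, it's back up
<0.40.0>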

Server and ATM

With the child spec reference and the example given above, you should have enough know-how to get the server and ATM up and running. So if you feel like challenging yourself, do that now. If not, I’ve posted the specs for both below:

  Server = {eb_server, {eb_server, start_link, []},
            permanent,2000,worker,[eb_server]},
  ATM = {eb_atm, {eb_atm, start_link, []},
         permanent,2000,worker,[eb_atm]},

 

After you create these specs, add them to the list returned by the init method. Make sure that you add them after the event manager.
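For reference, after adding the two new specs, the init function should look something like this (the event manager stays first, since the server depends on it):

init([]) ->
  EventManager = {eb_event_manager,{eb_event_manager, start_link,[]},
            permanent,2000,worker,dynamic},
  Server = {eb_server, {eb_server, start_link, []},
            permanent,2000,worker,[eb_server]},
  ATM = {eb_atm, {eb_atm, start_link, []},
         permanent,2000,worker,[eb_atm]},
  {ok,{{one_for_one,5,10}, [EventManager, Server, ATM]}}.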

You can view the completed eb_sup.erl by clicking here.

Adding and Removing Children at Runtime

Unfortunately I couldn’t think of a witty scenario to fit this into ErlyBank, but I felt it was important to mention that you can dynamically add child specs to, and remove them from, an already running supervisor process by using the start_child and delete_child functions.

They are pretty straightforward, so I won’t repeat what the manual says here; I’ve linked the functions above so you can go directly to them and check them out.
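As a rough illustration (eb_audit_log is a made-up module used only as a placeholder here), adding and later removing a child of our running supervisor looks like this; note that a running child must first be stopped with terminate_child before delete_child will remove its spec:

%% Add a new child spec to the running eb_sup supervisor and start it.
NewChild = {eb_audit_log, {eb_audit_log, start_link, []},
            permanent, 2000, worker, [eb_audit_log]},
{ok, _Pid} = supervisor:start_child(eb_sup, NewChild),

%% Later: stop the child, then remove its spec.
ok = supervisor:terminate_child(eb_sup, eb_audit_log),
ok = supervisor:delete_child(eb_sup, eb_audit_log).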

Final Notes

In this article about supervisors I introduced concepts such as the supervision tree, restart strategies, child specifications, and dynamically adding and removing children.

This concludes article four of the Erlang/OTP introduction series. Article five is already written and queued up for publishing in another few days and will introduce applications.


Comments


  1. Jonathon Mah Sep 13, 2008 15:00

    Thanks again, Mitchell! I’m really taking a lot away from your series.

    Errata: The event manager child spec uses “[eb_event_manager]” in the first occurrence, but “dynamic” in init/1.

  2. Mitchell Sep 13, 2008 18:40

    I fixed the error you caught! Awesome, thanks ;) And I’m glad the series has been helpful!

  3. Jeremy Sep 16, 2008 17:48

    These tutorials have been immensely helpful. I appreciate them very much. I do have one question concerning the supervisor and the event manager. Using the supervisor, how should event handlers be registered? Particularly, how should the eb_withdrawal_handler be registered? Obviously, it can be done in the repl, but how would this be done in a running system?

  4. Mitchell Sep 17, 2008 07:46

    Jeremy,

Right, this is a good question. The technique I use is the gen_event:add_sup_handler/3 function. It’s not as simple as throwing something into a supervision tree, but with that function the calling process will receive messages if the event manager crashes or if the handler somehow goes bad.

Using these messages you can reattach the event handler as soon as the event server re-registers itself. Although theoretically this would require a timer to check if the event server is up, in my project it’s always been back up almost instantaneously. But to be safe, you should implement a timer that retries adding the handler to the event server every two seconds or so.

Also, if you do use this, be sure to add some logic so it doesn’t retry forever and caps out at some point :)

    This is what I do but I haven’t seen any “official” word on how to do it so if any of the other readers has a better way of doing this, I’d greatly appreciate it!

  5. David Weldon Oct 04, 2008 14:27

    As I’m writing this I’m watching Kevin Smith’s “Erlang in Practice” episode 8. In it he claims that you need to explicitly trap exits in order for a supervisor to notice that something has gone wrong. In our example we would add:

init([]) ->
  process_flag(trap_exit, true),

    to eb_server.erl. Did Kevin get it wrong on this one?
