How to Capture All Errors Returned by a Function Call in Elixir

Semaphore Engineering Blog

If there is an Elixir library function that needs to be called, how can we be sure that all possible errors coming from it will be captured?

Elixir/Erlang vs. Mainstream Languages

Elixir is an unusual language because it is built on top of another one: it compiles to the same bytecode as Erlang, runs on the Erlang VM, and uses Erlang's rock-solid libraries to build new concepts on top of them. Erlang is also different compared to what one may call usual or mainstream languages, e.g. Java, C++, Python, Ruby, etc. in that it's a functional programming language, designed with distributed computing in mind.

Process and Concurrency

The languages we are used to working with (at least here at Rendered Text) are imperative programming languages. They're quite different compared to Erlang and Elixir. It's taken for granted that those languages do not provide any significant abstraction over the operating system process model. In contrast, threads of execution (i.e. units of scheduling) in Erlang are implemented as lightweight user-space processes managed by the Erlang VM.

Also, these mainstream programming languages do not support concurrency themselves. They support it through libraries, which are usually based on OS capabilities. There is usually either no native support for concurrency in the language at all, or there is minimal support which is backward compatible with the initial sequential model of the language. Consequently, in mainstream languages there are no error handling mechanisms designed for concurrent/distributed processing.

The Consequence

When a new language is introduced, we search for familiar concepts. In the context of error handling, we look for exceptions, and try to use them the way we are used to. And, with Elixir — we fail. Miserably.

Error Capturing

Why is error capturing a challenge? Isn't it trivial? Well, in Erlang/Elixir, errors can be propagated using different mechanisms. When a library function is called, it's sometimes unclear which mechanism it's using for error propagation. Also, it might happen that an error is generated in some other process (created by an invoked function) and propagated to the invoking function/process using some of the language mechanisms.

Let's consider the foo/0 library function. All we know is that it returns a numeric value on successful execution. It can also fail for different reasons and notify the caller in non-obvious ways. Here's a trivial example of foo/0:

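defmodule Library do
  # Picks one of four outcomes at random.
  def foo, do: 1..4 |> Enum.random() |> choice()

  defp choice(1), do: 1 / 3                                  # success: returns a number
  defp choice(2), do: 1 / 0                                  # raises ArithmeticError
  defp choice(3), do: Process.exit(self(), :unknown_error)   # exit signal sent to itself
  defp choice(4), do: throw(:overflow)                       # thrown value
end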

What can we do to capture all possible errors generated by foo/0?

Error Propagation

In mainstream languages, there is only one way to interrupt processing and propagate errors up the call stack — exceptions. Nothing else. Exceptions operate within the boundaries of a single operating system thread. They cannot reach outside the thread scope, because the language does not recognize anything beyond that scope.

In Elixir, error handling works differently. There are multiple mechanisms for error notification, and this can be quite confusing to novice users.

The error condition can be propagated as an exception, or as an exit signal. There are two mutually exclusive flavors of exceptions: raised and thrown. When it comes to signals, a process can send an exit signal either to itself or to other processes. The reaction to an exit signal depends on the state of the receiving process and on the signal value. Also, a process can choose to terminate itself because of an error by calling the exit/1 function.

Exceptions

There are two mechanisms to create an exception, raise and throw, and two clauses to handle them, rescue and catch. In the forms shown here, these mechanisms are mutually exclusive!

A raised exception is handled by rescue, and a thrown value by a one-argument catch. (The two-argument catch kind, value form can also match the :error and :exit kinds.) So, the following exceptions will be captured:

try do raise "error notification" rescue e -> e end
try do throw :error_notification  catch  e -> e end

But these won't:

try do raise "error notification" catch  e -> e end
try do throw :error_notification  rescue e -> e end

This will do the job:

try do raise "error notification" rescue e -> e catch e -> e end
try do throw :error_notification  rescue e -> e catch e -> e end
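
The exit/1 call mentioned earlier is a third in-process kind that neither clause above matches, because a one-argument catch only matches thrown values. A minimal sketch, assuming a hypothetical TrySketch.classify/1 helper that is not part of the original code, uses the two-argument catch kind, value form to report which of the three in-process mechanisms fired:

```elixir
defmodule TrySketch do
  # Hypothetical helper: runs `fun` in the current process and reports how it ended.
  # `catch kind, value` matches the kinds :error (raise), :throw and :exit.
  def classify(fun) do
    try do
      {:ok, fun.()}
    catch
      kind, value -> {kind, value}
    end
  end
end

# Each match below succeeds, which documents the shape of the result.
{:ok, 3} = TrySketch.classify(fn -> 1 + 2 end)
{:error, %RuntimeError{}} = TrySketch.classify(fn -> raise "error notification" end)
{:throw, :error_notification} = TrySketch.classify(fn -> throw(:error_notification) end)
{:exit, :error_notification} = TrySketch.classify(fn -> exit(:error_notification) end)
```

This covers only failures that unwind the stack of the current process.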

However, this is still not good enough because neither rescue nor catch will handle the exit signal sent from this or any other process:

try do Process.exit(self(), :error_notification) rescue e -> e catch e -> e end

From a single-process (and try block mechanism) perspective, there are just too many moving parts to get it right. And at the end of the day, we cannot cover all possible scenarios anyway.

Something is obviously wrong with this approach. Let's look for a different solution.

The Erlang Way

Now, let's move one step back and look at Erlang again. It's an intrinsically concurrent language. Everything in it is designed to support distributed computing. Not Google scale distributed, but still distributed. Elixir is built on top of that.

An Elixir application, no matter how simple, should not be perceived as a single entity. It's a distributed system on its own, consisting of tens, and often even hundreds of processes.

The try block is useful in the scope of a single process, and that's where it should be used: to capture errors generated in the same process. However, if we need to handle all errors that might affect a process while a particular function is being executed (possibly originating in some other process), we'll need to use some other mechanism. The try block cannot take care of that. This is a mindset change we'll have to accept.

When in Rome, Do as the Romans Do

The Erlang philosophy is "fail fast". In theory, this is a sound fault-tolerance approach. It basically means that you shouldn't try to fix the unexpected! This makes much more sense than the alternative, since the unexpected is difficult to test. Instead, you should let the process or the entire process group die, and start over, from a known state. This can be easily tested.

So, what happens when an error notification is propagated above a process's initial function? The process is terminated, and a notification is sent to all interested parties — all the processes that need to be notified. This is done consistently for all processes, and for all termination reasons, including a normal exit of the initial function.

If you want to capture all errors, you will need to engage an interprocess notification mechanism. This cannot be done using an intraprocess mechanism like the try block, at least not in Elixir.

Now, let's discuss some approaches to capturing errors.

Approach 1: Exit Signals

Erlang's "fail fast" mechanism is built on exit signals combined with Erlang messages. When a process terminates for any reason (whether it's a normal exit, or an error), it sends an exit signal to all processes it is linked with.

When a process receives an exit signal with a reason other than :normal, it dies, unless it's trapping exit signals. In that case, the signal is transformed into a message and delivered to the process mailbox. The one signal that cannot be trapped is :kill sent with Process.exit/2.

So, to capture all errors from a function, we can:

  • enable exit signal trapping in the calling process,
  • execute the function in a separate but linked process, and
  • wait for the process exit signal message and determine whether the process/function finished successfully or failed, and if it failed, for what reason.

def capture_link(callback) do
  # From now on, exit signals arrive as {:EXIT, pid, reason} messages.
  Process.flag(:trap_exit, true)
  pid = spawn_link(callback)
  receive do
    {:EXIT, ^pid, :normal} -> :ok
    {:EXIT, ^pid, reason}   -> {:error, reason}
  end
end

This approach is acceptable, but it's a little intrusive, since capture_link/1 changes the invoking process state by calling the Process.flag/2 function. A non-intrusive approach (with no side effects involving the running process) is preferable.
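
One way to narrow that side effect, though not remove it, is to put the flag back once the call returns. The sketch below assumes a hypothetical LinkSketch module and relies on Process.flag/2 returning the previous value of the flag:

```elixir
defmodule LinkSketch do
  # Hypothetical variant of capture_link/1: it still traps exits while waiting,
  # but restores the caller's previous :trap_exit setting afterwards.
  def capture_link(callback) do
    # Process.flag/2 returns the old value of the flag.
    previous = Process.flag(:trap_exit, true)

    try do
      pid = spawn_link(callback)

      receive do
        {:EXIT, ^pid, :normal} -> :ok
        {:EXIT, ^pid, reason} -> {:error, reason}
      end
    after
      Process.flag(:trap_exit, previous)
    end
  end
end

before = Process.info(self(), :trap_exit)
:ok = LinkSketch.capture_link(fn -> :done end)
{:error, :unknown_error} = LinkSketch.capture_link(fn -> exit(:unknown_error) end)
# The flag is back to whatever it was before the calls.
^before = Process.info(self(), :trap_exit)
```

While the call is in progress, exit signals from any other linked process are still converted into messages, so this only limits the side effect in time.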

Approach 2: Process Monitoring

Instead of linking to (and possibly dying with) the process whose lifecycle is to be observed, a process can simply monitor it. The process that requested monitoring will be informed when the monitored process terminates for any reason. The algorithm becomes as follows:

  • execute the function in a separate process that is monitored, but not linked to,
  • wait for the process termination message delivered by the monitor, and determine if the process/function has successfully completed or failed, and if it has failed, what is the reason behind the failure.

Here's an example of a successfully completed monitored process:

iex> spawn_monitor fn -> :a end
{#PID<0.88.0>, #Reference<0.0.2.114>}
iex> flush
{:DOWN, #Reference<0.0.2.114>, :process, #PID<0.88.0>, :normal}

When a monitored process terminates, the process that requested monitoring receives a message in the following form: {:DOWN, MonitorRef, Type, Object, Info}.
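
To see what the Info element looks like for each failure mechanism discussed so far, here is a small sketch. DownSketch.down_reason/1 is a hypothetical helper, not part of the original code. Crashing processes also print an error report to the log, which is expected.

```elixir
defmodule DownSketch do
  # Hypothetical helper: runs `fun` in a monitored process and returns
  # the Info element of the resulting :DOWN message.
  def down_reason(fun) do
    {pid, ref} = spawn_monitor(fun)

    receive do
      {:DOWN, ^ref, :process, ^pid, info} -> info
    end
  end
end

# Normal return of the initial function.
:normal = DownSketch.down_reason(fn -> :a end)
# exit/1 and an exit signal sent to itself: the reason is passed through as-is.
:unknown_error = DownSketch.down_reason(fn -> exit(:unknown_error) end)
:unknown_error = DownSketch.down_reason(fn -> Process.exit(self(), :unknown_error) end)
# An uncaught throw is wrapped in {:nocatch, value} and paired with a stacktrace.
{{:nocatch, :overflow}, _stacktrace} = DownSketch.down_reason(fn -> throw(:overflow) end)
# An unrescued raise arrives as the exception struct paired with a stacktrace.
{%RuntimeError{message: "boom"}, _stacktrace} = DownSketch.down_reason(fn -> raise "boom" end)
```

Whatever the mechanism, the monitoring process sees a single message shape and only the reason differs.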

Here's a non-intrusive example of capturing all errors:

def capture_monitor do
  {pid, monitor} = spawn_monitor(&Library.foo/0)
  receive do
    {:DOWN, ^monitor, :process, ^pid, :normal} -> :ok
    {:DOWN, ^monitor, :process, ^pid, reason}  -> {:error, reason}
  end
end

Let's take a look at an example implementation of the described capturing mechanism that can:

  • invoke any function and capture whatever output the invoked function generates (a return value or the reason behind the error) and
  • transfer it to the caller in a uniform way:
    • {:ok, state} or
    • {:error, reason}

The example implementation is as follows:

# Assumes `require Logger` in the enclosing module.
def capture(callback, timeout_ms) do
  {pid, monitor} = callback |> propagate_return_value_wrapper() |> spawn_monitor()
  receive do
    {:DOWN, ^monitor, :process, ^pid, :normal} ->
      # The wrapper sent the return value before exiting, so it is already in the mailbox.
      receive do
        {__MODULE__, :response, response} -> {:ok, response}
      end
    {:DOWN, ^monitor, :process, ^pid, reason}  ->
      Logger.error "#{__MODULE__}: Error in handled function: #{inspect reason}"
      {:error, reason}
  after timeout_ms ->
    pid |> Process.exit(:kill)
    # Drop the monitor and any :DOWN message it has already delivered.
    Process.demonitor(monitor, [:flush])
    Logger.error "#{__MODULE__}: Timeout..."
    {:error, {:timeout, timeout_ms}}
  end
end

defp propagate_return_value_wrapper(callback) do
  caller_pid = self()
  fn -> caller_pid |> send({__MODULE__, :response, callback.()}) end
end
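
For experimenting outside a project, the same idea can be packed into a self-contained module. This is a sketch under stated assumptions rather than the code above: CaptureSketch is a hypothetical name, logging is left out, and the reply is tagged with a fresh reference (an addition of this sketch) so that a stale reply from an earlier timed-out call can never be mistaken for the current one.

```elixir
defmodule CaptureSketch do
  # Hypothetical module: monitor-based capture without logging.
  def capture(callback, timeout_ms) do
    caller = self()
    # Assumption of this sketch: a unique tag per call instead of __MODULE__.
    tag = make_ref()
    {pid, monitor} = spawn_monitor(fn -> send(caller, {tag, callback.()}) end)

    receive do
      {:DOWN, ^monitor, :process, ^pid, :normal} ->
        # The reply was sent before the process exited, so it is already here.
        receive do
          {^tag, response} -> {:ok, response}
        end

      {:DOWN, ^monitor, :process, ^pid, reason} ->
        {:error, reason}
    after
      timeout_ms ->
        Process.exit(pid, :kill)
        Process.demonitor(monitor, [:flush])
        {:error, {:timeout, timeout_ms}}
    end
  end
end

{:ok, 3} = CaptureSketch.capture(fn -> 1 + 2 end, 1_000)
{:error, :unknown_error} =
  CaptureSketch.capture(fn -> Process.exit(self(), :unknown_error) end, 1_000)
{:error, {{:nocatch, :overflow}, _stacktrace}} =
  CaptureSketch.capture(fn -> throw(:overflow) end, 1_000)
{:error, {:timeout, 50}} = CaptureSketch.capture(fn -> Process.sleep(:infinity) end, 50)
```

Every outcome, including the exit signal that try could not intercept, comes back as an ordinary return value in the calling process.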

Approach 3: The Wormhole

We’ve covered some possible approaches to ensuring that all errors coming from an Elixir function are captured. To simplify error capturing, we created the Wormhole module, a production-ready callback wrapper. You can find it here, feel free to use it!

In Wormhole, we used Task.Supervisor to monitor the callback lifecycle. Here is the most important part of the code:

def capture(callback, timeout_ms) do
  Task.Supervisor.start_link
  |> callback_exec_and_response(callback, timeout_ms)
end

defp callback_exec_and_response({:ok, sup}, callback, timeout_ms) do
  Task.Supervisor.async_nolink(sup, callback)
  |> Task.yield(timeout_ms)
  |> supervisor_stop(sup)
  |> response_format(timeout_ms)
end
defp callback_exec_and_response(start_link_response, _callback, _timeout_ms) do
  {:error, {:failed_to_start_supervisor, start_link_response}}
end

defp supervisor_stop(response, sup) do
  Process.unlink(sup)
  Process.exit(sup, :kill)

  response
end

defp response_format({:ok,   state},  _),          do: {:ok,    state}
defp response_format({:exit, reason}, _),          do: {:error, reason}
defp response_format(nil,             timeout_ms), do: {:error, {:timeout, timeout_ms}}

Wormhole.capture starts Task.Supervisor, executes callback under it, waits for the response at most timeout_ms milliseconds, stops the supervisor, and returns a response in the :ok/:error tuple form.
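
The same flow can be tried with nothing but Elixir's standard Task and Task.Supervisor modules. The sketch below is not the Wormhole implementation: TaskSketch is a hypothetical module, and it stops the supervisor with Supervisor.stop/1 and uses the Task.yield/2 and Task.shutdown/1 pairing instead of killing the supervisor.

```elixir
defmodule TaskSketch do
  # Hypothetical module built only on standard-library calls.
  def capture(callback, timeout_ms) do
    {:ok, sup} = Task.Supervisor.start_link()
    # async_nolink/2: a crash in the task does not take the caller down.
    task = Task.Supervisor.async_nolink(sup, callback)

    # Task.yield/2 returns nil on timeout; Task.shutdown/1 then stops the task
    # and returns a reply if one arrived in the meantime.
    result = Task.yield(task, timeout_ms) || Task.shutdown(task)

    # Stops the supervisor with reason :normal.
    Supervisor.stop(sup)

    case result do
      {:ok, value} -> {:ok, value}
      {:exit, reason} -> {:error, reason}
      nil -> {:error, {:timeout, timeout_ms}}
    end
  end
end

{:ok, 3} = TaskSketch.capture(fn -> 1 + 2 end, 1_000)
{:error, {{:nocatch, :overflow}, _stacktrace}} =
  TaskSketch.capture(fn -> throw(:overflow) end, 1_000)
{:error, {:timeout, 50}} = TaskSketch.capture(fn -> Process.sleep(:infinity) end, 50)
```

Starting a supervisor per call keeps the sketch short; a long-lived supervisor in the application's supervision tree would avoid that cost.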

Takeaways

Elixir is inherently a concurrent language designed for developing highly distributed, fault-tolerant applications. Elixir provides multiple mechanisms for error handling. A user needs to be precise about what kinds of errors are to be handled and where they are coming from. If our intention is to handle errors originating from the same process they are being handled in, we can use the try block, a mechanism familiar from mainstream, sequential languages.

When capturing errors originating from a nontrivial logical unit (involving multiple processes), well-known sequential mechanisms will not be appropriate. In these types of situations, process monitoring mechanisms and a supervisor-like approach are in order.

A logical unit entry function (callback) needs to be executed in a separate process, in which it can succeed or fail without affecting the function-invoking process. In such a scenario, the function-invoking process spawns a supervisor. Then, it engages the language mechanism to transport the pass or fail information from the callback-executing process to the supervisor. All of this can be achieved without making any changes in the code from which errors are being captured, which makes this approach generally applicable.

