
SIGPIPE can happen to you

Some recent flak outages were mysterious. One day things would be working, but the next they wouldn’t. All the flak.lua processes had disappeared. No error messages were reported in any observable location. No unusual looking requests were observed in any recorded location. Sometimes a process would survive days of heavy traffic. Other times it would die after only a few hours of light traffic. It was as if the process involved simply lost the will to live.

Finally, after I hooked up ktrace to the process and waited a few days, the culprit was revealed: SIGPIPE after a write. It’s unclear what I changed to cause this to be a problem now, after several years of successfully ignoring the problem, but that’s life.

SIGPIPE is a little strange if you’re not expecting it. By default, any time a write system call fails with errno set to EPIPE, the process also receives a signal just before the call returns. The default action for this signal is to terminate the process. As the name implies, this error can occur with pipes as created by pipe, but also with sockets or FIFOs.
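To see the default behavior in isolation, here is a minimal sketch in C. The function name signal_that_killed_writer is made up for illustration. It forks a child that writes to a pipe whose read end has been closed, then reports how the child died. It assumes SIGPIPE is not blocked in the inherited signal mask.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that writes to a pipe with no reader
 * and report how it ended. Returns the number of the signal that killed
 * the child, 0 if it exited on its own, or -1 if fork/waitpid failed.
 */
int signal_that_killed_writer(void)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        int fds[2];

        /* Restore the default action in case the parent ignores SIGPIPE. */
        signal(SIGPIPE, SIG_DFL);
        if (pipe(fds) == -1)
            _exit(2);
        close(fds[0]);                      /* no reader remains */
        ssize_t n = write(fds[1], "x", 1);  /* SIGPIPE is delivered here */
        _exit(n == -1 ? 1 : 3);             /* reached only if we survived */
    }

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    /* Without WUNTRACED, waitpid only reports terminated children. */
    assert(WIFEXITED(status) || WIFSIGNALED(status));
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Called from a small main, this should return SIGPIPE. The child never gets to look at the return value of write, which is exactly why nothing shows up in a log.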

The Stevens Unix programming book discusses SIGPIPE in the context of TCP connections, as does the socket FAQ answer. The connection has received a RST in response to a previous write, and now that RST is being returned to the process in the form of a signal. However, the explanation of the mechanism doesn’t contain much in the way of rationale. Nor does it explain why it happens for purely local pipes that have nothing to do with TCP.

If you ask the internet, the common advice is to always ignore SIGPIPE and check the return value of write. Also not a rationale.
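For reference, the common advice looks roughly like this in C. This is a sketch, not anybody’s canonical implementation: ignore_sigpipe and write_all are names invented here, and the retry on EINTR and short writes is my assumption about what “check the return value” should cover.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Ignore SIGPIPE process-wide so that write reports EPIPE instead. */
int ignore_sigpipe(void)
{
    return signal(SIGPIPE, SIG_IGN) == SIG_ERR ? -1 : 0;
}

/*
 * Hypothetical helper: write all len bytes of buf to fd, retrying on
 * EINTR and short writes. Returns 0 on success, or -1 with errno set
 * (EPIPE if the other end is gone).
 */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

Call ignore_sigpipe once at startup. After that, a write to a dead pipe or socket comes back from write_all as -1 with errno set to EPIPE, and the caller decides what to do about it.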

Some people imagine a scenario where a program ignores write errors, but we want to stop processing: cat /dev/zero | cat | cat | cat. If we kill the last cat, closing the read end of its input pipe, we want the previous cat to also quit instead of continuing to read. That’s true. Or we can use a version of cat that does proper error checking. Signal delivery is also an unsatisfactory solution for a process which performs a great deal of calculation before writing. cat /dev/zero | sha512 | cat will not be terminated in any finite amount of time by killing the final cat.
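A cat that does proper error checking is not much code. The sketch below is illustrative only: copy_fd and run_cat are hypothetical names, and it is not the source of any real cat. It ignores SIGPIPE and stops copying the moment a write fails.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy everything from in_fd to out_fd.
 * Returns 0 at end of input, or -1 with errno set on the first error.
 * With SIGPIPE ignored, a vanished reader surfaces here as EPIPE.
 */
int copy_fd(int in_fd, int out_fd)
{
    char buf[4096];

    assert(in_fd >= 0 && out_fd >= 0);
    for (;;) {
        ssize_t nread = read(in_fd, buf, sizeof(buf));
        if (nread == 0)
            return 0;
        if (nread == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        for (ssize_t off = 0; off < nread; ) {
            ssize_t n = write(out_fd, buf + off, (size_t)(nread - off));
            if (n == -1) {
                if (errno == EINTR)
                    continue;
                return -1;  /* stop reading as soon as writing fails */
            }
            off += n;
        }
    }
}

/* What a main for such a cat might do; returns the exit status. */
int run_cat(void)
{
    if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
        return 1;
    return copy_fd(STDIN_FILENO, STDOUT_FILENO) == 0 ? 0 : 1;
}
```

In the pipeline above, a cat built this way quits on its next write after the downstream reader goes away, no signal required. It does nothing for the sha512 case, which never writes until its input ends.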

Ignoring SIGPIPE is certainly the surest way to avoid an untimely death. And it puts some of the “everything is a file” semantics back into everything being a file by eliminating a class of file one may not touch. Reviewing some network software shows that pretty much everyone does exactly the same thing: ignore the signal. It’s not really a question of what to do. It’s only a question of whether you remember to do it.

Posted 2015-12-02 16:06:12 by tedu Updated: 2015-12-02 16:06:12
Tagged: network programming