
select works poorly

At the bottom of the OpenBSD man page for select is a little note. “Internally to the kernel, select() and pselect() work poorly if multiple processes wait on the same file descriptor.” There’s a similar warning in the poll man page. Where does this warning come from and what does it mean?

The code to implement these system calls lives in src/sys/kern/sys_generic.c. Despite differences in interface, the internal implementation is mostly shared, which is why they both have the same affliction. select and poll both scan a set of file descriptors for readiness, then if none are ready we sleep and wait.

The primary function for sleeping is tsleep, which requires a wait channel, conceptually similar to a condition variable. At some later point, when something changes, another process or interrupt will call wakeup on the same wait channel and we’ll resume running. For example, if we’re trying to read from a pipe, but there’s no data, we’ll sleep using the address of the pipe data structure. When data is written to the pipe, the writer calls wakeup with the same address. We only wake up the reader(s) of this pipe, and don’t disturb the slumber of all the readers blocked waiting on other pipes.
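To make the wait channel idea concrete, here is a toy userland model, not kernel code. The names toy_proc, toy_sleep and toy_wakeup are made up for illustration, and nothing actually blocks. It only shows that a wakeup on one address leaves sleepers on other addresses alone.

```c
#include <assert.h>
#include <stddef.h>

enum state { RUNNING, SLEEPING };

/* Toy stand-in for a kernel process: a state plus the channel it sleeps on. */
struct toy_proc {
        enum state stat;
        const void *wchan;      /* address slept on, NULL when running */
};

/* Mark p as asleep on chan, roughly what tsleep records (nothing blocks). */
void
toy_sleep(struct toy_proc *p, const void *chan)
{
        assert(chan != NULL);
        p->stat = SLEEPING;
        p->wchan = chan;
}

/* Make every process sleeping on chan runnable; return how many woke. */
int
toy_wakeup(struct toy_proc *procs, size_t n, const void *chan)
{
        size_t i;
        int woken = 0;

        for (i = 0; i < n; i++) {
                if (procs[i].stat == SLEEPING && procs[i].wchan == chan) {
                        procs[i].stat = RUNNING;
                        procs[i].wchan = NULL;
                        woken++;
                }
        }
        return woken;
}
```

With two readers asleep on one pipe and a third asleep on another, toy_wakeup on the first pipe’s address returns 2 and the third reader stays asleep.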

Now the question is what wait channel should select use? We could be watching a dozen different files. How do we choose? The answer is we don’t. Instead, there is a single global wait channel for all select and poll operations. It’s called selwait. The main loop of select lives in a function called dopselect. Minus some code we don’t care about, we scan for changes in the relevant files, then sleep, then try again.

        error = selscan(p, pibits[0], pobits[0], nd, ni, retval);
        if (error || *retval)
                goto done;
        error = tsleep(&selwait, PSOCK | PCATCH, "select", timo);
        if (error == 0)
                goto retry;

And then whenever some data gets written, we call wakeup(&selwait). Based on what we’ve seen so far, one can conclude that this is likely to be inefficient. Every time any socket has some data available, we wake up every selecting process in the system. Works poorly indeed.

But that’s not the whole story. Behind the scenes, selscan calls a function called selrecord to record our interest. It contains a funny bit of code that checks if anybody is already waiting.

selrecord(struct proc *selector, struct selinfo *sip)
{
        if (sip->si_selpid && (p = pfind(sip->si_selpid)) &&
            p->p_wchan == (caddr_t)&selwait)
                sip->si_flags |= SI_COLL;       /* someone is already waiting */
        else
                sip->si_selpid = mypid;
}

If this selinfo already has a pid, and that pid refers to an existing process, and that process is already sleeping on the global select wait channel, we set a flag indicating that there has been a collision. Otherwise we save our pid.

On the flip side, selwakeup is a wrapper for wakeup that checks both pid and flag.

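selwakeup(struct selinfo *sip)
{
        if (sip->si_flags & SI_COLL) {
                /* collision: broadcast to every selecting process */
                sip->si_flags &= ~SI_COLL;
                wakeup(&selwait);
        }
        p = pfind(sip->si_selpid);
        if (p != NULL) {
                if (p->p_wchan == (caddr_t)&selwait) {
                        if (p->p_stat == SSLEEP)
                                setrunnable(p); /* wake only this process */
                }
        }
}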

We only call the full broadcast wakeup in the event of a collision. Otherwise, if only a single process is selecting on this file descriptor, we cheat and wake it manually. Thus avoiding the thundering herd. However, if there’s a collision, it’s not just the colliding selectors that wake up. It’s everybody.
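The interplay between recording and waking is easier to see in a small simulation. What follows is a sketch under loud assumptions: pids are just array indexes, sleeping is a pointer in a table, and every toy_ name is hypothetical. It mirrors only the logic in the excerpts above, not the locking or the rest of the real functions.

```c
#include <assert.h>
#include <stddef.h>

#define NPROC   8
#define SI_COLL 0x1              /* same name as the kernel flag, toy value */

static int selwait;              /* only its address matters */
static const void *wchan[NPROC]; /* wchan[pid]: channel pid sleeps on, or NULL */

struct toy_selinfo {
        int si_selpid;           /* 0 means nobody recorded */
        int si_flags;
};

/* Model of selrecord: remember one selector, otherwise flag a collision. */
void
toy_selrecord(int mypid, struct toy_selinfo *sip)
{
        assert(mypid > 0 && mypid < NPROC);
        if (sip->si_selpid != 0 && wchan[sip->si_selpid] == &selwait)
                sip->si_flags |= SI_COLL;
        else
                sip->si_selpid = mypid;
}

/* Model of sleeping in select: everybody uses the one global channel. */
void
toy_select_sleep(int pid)
{
        assert(pid > 0 && pid < NPROC);
        wchan[pid] = &selwait;
}

/* Model of selwakeup: returns how many processes were woken. */
int
toy_selwakeup(struct toy_selinfo *sip)
{
        int pid, woken = 0;

        if (sip->si_flags & SI_COLL) {
                /* Collision: broadcast, waking every selector in the system. */
                sip->si_flags &= ~SI_COLL;
                for (pid = 1; pid < NPROC; pid++) {
                        if (wchan[pid] == &selwait) {
                                wchan[pid] = NULL;
                                woken++;
                        }
                }
        }
        pid = sip->si_selpid;
        if (pid != 0 && wchan[pid] == &selwait) {
                /* No collision: wake only the recorded process. */
                wchan[pid] = NULL;
                woken++;
        }
        return woken;
}
```

With one selector per selinfo, toy_selwakeup wakes exactly one process. Once a second process records interest in the same selinfo while the first is asleep, the next wakeup also wakes a bystander selecting on something else entirely.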

Referring back to the warning in the man page, select usually manages to avoid poor behavior, but bad things will happen if two or more processes try selecting on the same descriptor. The term “select collisions” appears in documentation and reference material of a certain age.
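For reference, this is roughly what the triggering situation looks like from userland: several processes blocked in select on the same descriptor. The helper below is a hypothetical sketch (count_ready_selectors and MAXKIDS are invented names). It only shows the user-visible behavior, that every waiter sees the descriptor become ready. It does not measure what the kernel does internally, and whether a collision is actually recorded depends on timing.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAXKIDS 16

/*
 * Fork nchildren processes that all select() on the read end of one pipe,
 * make the pipe readable, and return how many children saw it as ready.
 * Returns -1 if setup fails.
 */
int
count_ready_selectors(int nchildren)
{
        pid_t pids[MAXKIDS];
        int fds[2], i, nforked, ok, ready = 0;

        assert(nchildren >= 0 && nchildren <= MAXKIDS);
        if (pipe(fds) == -1)
                return -1;

        for (i = 0; i < nchildren; i++) {
                pid_t pid = fork();

                if (pid == -1)
                        break;
                if (pid == 0) {
                        fd_set rfds;
                        int n;

                        close(fds[1]);
                        FD_ZERO(&rfds);
                        FD_SET(fds[0], &rfds);
                        /* Block until the shared descriptor is readable. */
                        n = select(fds[0] + 1, &rfds, NULL, NULL, NULL);
                        /* Never read the byte, so siblings still see it. */
                        _exit(n == 1 && FD_ISSET(fds[0], &rfds) ? 0 : 1);
                }
                pids[i] = pid;
        }
        nforked = i;

        /* One byte makes the descriptor readable for every waiter. */
        ok = nforked == nchildren && write(fds[1], "x", 1) == 1;
        close(fds[1]);  /* EOF unblocks the children even if the write failed */

        for (i = 0; i < nforked; i++) {
                int status;

                if (waitpid(pids[i], &status, 0) == pids[i] &&
                    WIFEXITED(status) && WEXITSTATUS(status) == 0)
                        ready++;
        }
        close(fds[0]);
        return ok ? ready : -1;
}
```

Calling count_ready_selectors(3) forks three waiters on one pipe and reports how many of them select marked as ready.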

This is not an intractable problem. kevent avoids it entirely. Other implementations may too. But practically, does it need to be solved? One can monitor sysctl kern.nselcoll to see how many select collisions have occurred. My laptop says it’s happened 43 times. A server with substantially more uptime says 0. Doesn’t seem so bad.

Posted 2016-06-07 13:59:14 by tedu Updated: 2016-06-07 13:59:14
Tagged: c openbsd programming