Section 14.5. I/O Multiplexing

14.5. I/O Multiplexing

When we read from one descriptor and write to another, we can use blocking I/O in a loop, such as

        while ((n = read(STDIN_FILENO, buf, BUFSIZ)) > 0)
            if (write(STDOUT_FILENO, buf, n) != n)
                err_sys("write error");

We see this form of blocking I/O over and over again. What if we have to read from two descriptors? In this case, we can't do a blocking read on either descriptor, as data may appear on one descriptor while we're blocked in a read on the other. A different technique is required to handle this case.

Let's look at the structure of the telnet(1) command. In this program, we read from the terminal (standard input) and write to a network connection, and we read from the network connection and write to the terminal (standard output). At the other end of the network connection, the telnetd daemon reads what we typed and presents it to a shell as if we were logged in to the remote machine. The telnetd daemon sends any output generated by the commands we type back to us through the telnet command, to be displayed on our terminal. Figure 14.20 shows a picture of this.

Figure 14.20. Overview of `telnet` program

The telnet process has two inputs and two outputs. We can't do a blocking read on either of the inputs, as we never know which input will have data for us.

One way to handle this particular problem is to divide the process in two pieces (using fork), with each half handling one direction of data. We show this in Figure 14.21. (The cu(1) command provided with System V's uucp communication package was structured like this.)

Figure 14.21. The `telnet` program using two processes

If we use two processes, we can let each process do a blocking read. But this leads to a problem when the operation terminates. If an end of file is received by the child (the network connection is disconnected by the telnetd daemon), then the child terminates, and the parent is notified by the SIGCHLD signal. But if the parent terminates (the user enters an end of file at the terminal), then the parent has to tell the child to stop. We can use a signal for this (SIGUSR1, for example), but it does complicate the program somewhat.

Instead of two processes, we could use two threads in a single process. This avoids the termination complexity, but requires that we deal with synchronization between the threads, which could add more complexity than it saves.

We could use nonblocking I/O in a single process by setting both descriptors nonblocking and issuing a read on the first descriptor. If data is present, we read it and process it. If there is no data to read, the call returns immediately. We then do the same thing with the second descriptor. After this, we wait for some amount of time (a few seconds, perhaps) and then try to read from the first descriptor again. This type of loop is called polling. The problem is that it wastes CPU time. Most of the time, there won't be data to read, so we waste time performing the read system calls. We also have to guess how long to wait each time around the loop. Although it works on any system that supports nonblocking I/O, polling should be avoided on a multitasking system.

Another technique is called asynchronous I/O. To do this, we tell the kernel to notify us with a signal when a descriptor is ready for I/O. There are two problems with this. First, not all systems support this feature (it is an optional facility in the Single UNIX Specification). System V provides the SIGPOLL signal for this technique, but this signal works only if the descriptor refers to a STREAMS device. BSD has a similar signal, SIGIO, but it has similar limitations: it works only on descriptors that refer to terminal devices or networks. The second problem with this technique is that there is only one of these signals per process (SIGPOLL or SIGIO). If we enable this signal for two descriptors (in the example we've been talking about, reading from two descriptors), the occurrence of the signal doesn't tell us which descriptor is ready. To determine which descriptor is ready, we still need to set each nonblocking and try them in sequence. We describe asynchronous I/O briefly in Section 14.6.

A better technique is to use I/O multiplexing. To do this, we build a list of the descriptors that we are interested in (usually more than one descriptor) and call a function that doesn't return until one of the descriptors is ready for I/O. On return from the function, we are told which descriptors are ready for I/O.

Three functionspoll, pselect, and selectallow us to perform I/O multiplexing. Figure 14.22 summarizes which platforms support them. Note that select is defined by the base POSIX.1 standard, but poll is an XSI extension to the base.

Figure 14.22. I/O multiplexing supported by various UNIX systems
System
poll
pselect
select
<sys/select.h>
SUS
XSI
•
•
•
FreeBSD 5.2.1
•
•
•

Linux 2.4.22
•
•
•
•
Mac OS X 10.3
•
•
•

Solaris 9
•

•
•

POSIX specifies that <sys/select> be included to pull the information for select into your program. Historically, however, we have had to include three other header files, and some of the implementations haven't yet caught up to the standard. Check the select manual page to see what your system supports. Older systems require that you include <sys/types.h>, <sys/time.h>, and <unistd.h>.

I/O multiplexing was provided with the select function in 4.2BSD. This function has always worked with any descriptor, although its main use has been for terminal I/O and network I/O. SVR3 added the poll function when the STREAMS mechanism was added. Initially, however, poll worked only with STREAMS devices. In SVR4, support was added to allow poll to work on any descriptor.

14.5.1. `select` and `pselect` Functions

The select function lets us do I/O multiplexing under all POSIX-compatible platforms. The arguments we pass to select tell the kernel

Which descriptors we're interested in.
What conditions we're interested in for each descriptor. (Do we want to read from a given descriptor? Do we want to write to a given descriptor? Are we interested in an exception condition for a given descriptor?)
How long we want to wait. (We can wait forever, wait a fixed amount of time, or not wait at all.)

On the return from select, the kernel tells us

The total count of the number of descriptors that are ready
Which descriptors are ready for each of the three conditions (read, write, or exception condition)

With this return information, we can call the appropriate I/O function (usually read or write) and know that the function won't block.

[View full width]
#include <sys/select.h> int select(int maxfdp1, fd_set *restrict readfds, fd_set *restrict writefds, fd_set *restrict exceptfds, struct timeval *restrict tvptr);

Returns: count of ready descriptors, 0 on timeout, 1 on error

Let's look at the last argument first. This specifies how long we want to wait:

   struct timeval {
     long tv_sec;     /* seconds */
     long tv_usec;    /* and microseconds */
   };

There are three conditions.

   tvptr == NULL

Wait forever. This infinite wait can be interrupted if we catch a signal. Return is made when one of the specified descriptors is ready or when a signal is caught. If a signal is caught, select returns 1 with errno set to EINTR.

   tvptr->tv_sec == 0 && tvptr->tv_usec == 0

Don't wait at all. All the specified descriptors are tested, and return is made immediately. This is a way to poll the system to find out the status of multiple descriptors, without blocking in the select function.

   tvptr->tv_sec != 0 || tvptr->tv_usec != 0

Wait the specified number of seconds and microseconds. Return is made when one of the specified descriptors is ready or when the timeout value expires. If the timeout expires before any of the descriptors is ready, the return value is 0. (If the system doesn't provide microsecond resolution, the tvptr>tv_usec value is rounded up to the nearest supported value.) As with the first condition, this wait can also be interrupted by a caught signal.

POSIX.1 allows an implementation to modify the timeval structure, so after select returns, you can't rely on the structure containing the same values it did before calling select. FreeBSD 5.2.1, Mac OS X 10.3, and Solaris 9 all leave the structure unchanged, but Linux 2.4.22 will update it with the time remaining if select returns before the timeout value expires.

The middle three argumentsreadfds, writefds, and exceptfdsare pointers to descriptor sets. These three sets specify which descriptors we're interested in and for which conditions (readable, writable, or an exception condition). A descriptor set is stored in an fd_set data type. This data type is chosen by the implementation so that it can hold one bit for each possible descriptor. We can consider it to be just a big array of bits, as shown in Figure 14.23.

Figure 14.23. Specifying the read, write, and exception descriptors for `select`

The only thing we can do with the fd_set data type is allocate a variable of this type, assign a variable of this type to another variable of the same type, or use one of the following four functions on a variable of this type.

#include <sys/select.h> int FD_ISSET(int fd, fd_set *fdset);

Returns: nonzero if fd is in set, 0 otherwise

void FD_CLR(int fd, fd_set *fdset); void FD_SET(int fd, fd_set *fdset); void FD_ZERO(fd_set *fdset);

These interfaces can be implemented as either macros or functions. An fd_set is set to all zero bits by calling FD_ZERO. To turn on a single bit in a set, we use FD_SET. We can clear a single bit by calling FD_CLR. Finally, we can test whether a given bit is turned on in the set with FD_ISSET.

After declaring a descriptor set, we must zero the set using FD_ZERO. We then set bits in the set for each descriptor that we're interested in, as in

   fd_set   rset;
   int      fd;

   FD_ZERO(&rset);
   FD_SET(fd, &rset);
   FD_SET(STDIN_FILENO, &rset);

On return from select, we can test whether a given bit in the set is still on using FD_ISSET:

   if (FD_ISSET(fd, &rset)) {
       ...
   }

Any (or all) of the middle three arguments to select (the pointers to the descriptor sets) can be null pointers if we're not interested in that condition. If all three pointers are NULL, then we have a higher precision timer than provided by sleep. (Recall from Section 10.19 that sleep waits for an integral number of seconds. With select, we can wait for intervals less than 1 second; the actual resolution depends on the system's clock.) Exercise 14.6 shows such a function.

The first argument to select, maxfdp1, stands for "maximum file descriptor plus 1." We calculate the highest descriptor that we're interested in, considering all three of the descriptor sets, add 1, and that's the first argument. We could just set the first argument to FD_SETSIZE, a constant in <sys/select.h> that specifies the maximum number of descriptors (often 1,024), but this value is too large for most applications. Indeed, most applications probably use between 3 and 10 descriptors. (Some applications need many more descriptors, but these UNIX programs are atypical.) By specifying the highest descriptor that we're interested in, we can prevent the kernel from going through hundreds of unused bits in the three descriptor sets, looking for bits that are turned on.

As an example, Figure 14.24 shows what two descriptor sets look like if we write

   fd_set readset, writeset;

   FD_ZERO(&readset);
   FD_ZERO(&writeset);
   FD_SET(0, &readset);
   FD_SET(3, &readset);
   FD_SET(1, &writeset);
   FD_SET(2, &writeset);
   select(4, &readset, &writeset, NULL, NULL);

Figure 14.24. Example descriptor sets for `select`

The reason we have to add 1 to the maximum descriptor number is that descriptors start at 0, and the first argument is really a count of the number of descriptors to check (starting with descriptor 0).

There are three possible return values from select.

A return value of 1 means that an error occurred. This can happen, for example, if a signal is caught before any of the specified descriptors are ready. In this case, none of the descriptor sets will be modified.
A return value of 0 means that no descriptors are ready. This happens if the time limit expires before any of the descriptors are ready. When this happens, all the descriptor sets will be zeroed out.
A positive return value specifies the number of descriptors that are ready. This value is the sum of the descriptors ready in all three sets, so if the same descriptor is ready to be read and written, it will be counted twice in the return value. The only bits left on in the three descriptor sets are the bits corresponding to the descriptors that are ready.

We now need to be more specific about what "ready" means.

A descriptor in the read set (readfds) is considered ready if a read from that descriptor won't block.
A descriptor in the write set (writefds) is considered ready if a write to that descriptor won't block.
A descriptor in the exception set (exceptfds) is considered ready if an exception condition is pending on that descriptor. Currently, an exception condition corresponds to either the arrival of out-of-band data on a network connection or certain conditions occurring on a pseudo terminal that has been placed into packet mode. (Section 15.10 of Stevens [1990] describes this latter condition.)
File descriptors for regular files always return ready for reading, writing, and exception conditions.

It is important to realize that whether a descriptor is blocking or not doesn't affect whether select blocks. That is, if we have a nonblocking descriptor that we want to read from and we call select with a timeout value of 5 seconds, select will block for up to 5 seconds. Similarly, if we specify an infinite timeout, select blocks until data is ready for the descriptor or until a signal is caught.

If we encounter the end of file on a descriptor, that descriptor is considered readable by select. We then call read and it returns 0, the way to signify end of file on UNIX systems. (Many people incorrectly assume that select indicates an exception condition on a descriptor when the end of file is reached.)

POSIX.1 also defines a variant of select called pselect.

[View full width]
#include <sys/select.h> int pselect(int maxfdp1, fd_set *restrict readfds, fd_set *restrict writefds, fd_set *restrict exceptfds, const struct timespec *restrict tsptr, const sigset_t *restrict sigmask);

Returns: count of ready descriptors, 0 on timeout, 1 on error

The pselect function is identical to select, with the following exceptions.

The timeout value for select is specified by a timeval structure, but for pselect, a timespec structure is used. (Recall the definition of the timespec structure in Section 11.6.) Instead of seconds and microseconds, the timespec structure represents the timeout value in seconds and nanoseconds. This provides a higher-resolution timeout if the platform supports that fine a level of granularity.
The timeout value for pselect is declared const, and we are guaranteed that its value will not change as a result of calling pselect.
An optional signal mask argument is available with pselect. If sigmask is null, pselect behaves as select does with respect to signals. Otherwise, sigmask points to a signal mask that is atomically installed when pselect is called. On return, the previous signal mask is restored.

14.5.2. `poll` Function

The poll function is similar to select, but the programmer interface is different. As we'll see, poll is tied to the STREAMS system, since it originated with System V, although we are able to use it with any type of file descriptor.

[View full width]
#include <poll.h> int poll(struct pollfd fdarray[], nfds_t nfds, int timeout);

Returns: count of ready descriptors, 0 on timeout, 1 on error

With poll, instead of building a set of descriptors for each condition (readability, writability, and exception condition), as we did with select, we build an array of pollfd structures, with each array element specifying a descriptor number and the conditions that we're interested in for that descriptor:

   struct pollfd {
     int   fd;       /* file descriptor to check, or <0 to ignore */
     short events;   /* events of interest on fd */
     short revents;  /* events that occurred on fd */
   };

The number of elements in the fdarray array is specified by nfds.

Historically, there have been differences in how the nfds parameter was declared. SVR3 specified the number of elements in the array as an unsigned long, which seems excessive. In the SVR4 manual [AT&T 1990d], the prototype for poll showed the data type of the second argument as size_t. (Recall the primitive system data types, Figure 2.20.) But the actual prototype in the <poll.h> header still showed the second argument as an unsigned long. The Single UNIX Specification defines the new type nfds_t to allow the implementation to select the appropriate type and hide the details from applications. Note that this type has to be large enough to hold an integer, since the return value represents the number of entries in the array with satisfied events.

The SVID corresponding to SVR4 [AT&T 1989] showed the first argument to poll as struct pollfd fdarray[], whereas the SVR4 manual page [AT&T 1990d] showed this argument as struct pollfd *fdarray. In the C language, both declarations are equivalent. We use the first declaration to reiterate that fdarray points to an array of structures and not a pointer to a single structure.

To tell the kernel what events we're interested in for each descriptor, we have to set the events member of each array element to one or more of the values in Figure 14.25. On return, the revents member is set by the kernel, specifying which events have occurred for each descriptor. (Note that poll doesn't change the events member. This differs from select, which modifies its arguments to indicate what is ready.)

Figure 14.25. The events and revents flags for poll
Name
Input to events?
Result from revents?
Description
POLLIN
•
•
Data other than high priority can be read without blocking (equivalent to POLLRDNORM|POLLRDBAND).
POLLRDNORM
•
•
Normal data (priority band 0) can be read without blocking.
POLLRDBAND
•
•
Data from a nonzero priority band can be read without blocking.
POLLPRI
•
•
High-priority data can be read without blocking.
POLLOUT
•
•
Normal data can be written without blocking.
POLLWRNORM
•
•
Same as POLLOUT.
POLLWRBAND
•
•
Data for a nonzero priority band can be written without blocking.
POLLERR

•
An error has occurred.
POLLHUP

•
A hangup has occurred.
POLLNVAL

•
The descriptor does not reference an open file.

The first four rows of Figure 14.25 test for readability, the next three test for writability, and the final three are for exception conditions. The last three rows in Figure 14.25 are set by the kernel on return. These three values are returned in revents when the condition occurs, even if they weren't specified in the events field.

When a descriptor is hung up (POLLHUP), we can no longer write to the descriptor. There may, however, still be data to be read from the descriptor.

The final argument to poll specifies how long we want to wait. As with select, there are three cases.

   timeout == -1

Wait forever. (Some systems define the constant INFTIM in <stropts.h> as 1.) We return when one of the specified descriptors is ready or when a signal is caught. If a signal is caught, poll returns 1 with errno set to EINTR.

   timeout == 0

Don't wait. All the specified descriptors are tested, and we return immediately. This is a way to poll the system to find out the status of multiple descriptors, without blocking in the call to poll.

   timeout > 0

Wait timeout milliseconds. We return when one of the specified descriptors is ready or when the timeout expires. If the timeout expires before any of the descriptors is ready, the return value is 0. (If your system doesn't provide millisecond resolution, timeout is rounded up to the nearest supported value.)

It is important to realize the difference between an end of file and a hangup. If we're entering data from the terminal and type the end-of-file character, POLLIN is turned on so we can read the end-of-file indication (read returns 0). POLLHUP is not turned on in revents. If we're reading from a modem and the telephone line is hung up, we'll receive the POLLHUP notification.

As with select, whether a descriptor is blocking or not doesn't affect whether poll blocks.

Interruptibility of `select` and `poll`

When the automatic restarting of interrupted system calls was introduced with 4.2BSD (Section 10.5), the select function was never restarted. This characteristic continues with most systems even if the SA_RESTART option is specified. But under SVR4, if SA_RESTART was specified, even select and poll were automatically restarted. To prevent this from catching us when we port software to systems derived from SVR4, we'll always use the signal_intr function (Figure 10.19) if the signal could interrupt a call to select or poll.

None of the implementations described in this book restart poll or select when a signal is received, even if the SA_RESTART flag is used.

14.5. I/O Multiplexing

Figure 14.20. Overview of telnet program

Figure 14.21. The telnet program using two processes

Figure 14.22. I/O multiplexing supported by various UNIX systems

14.5.1. select and pselect Functions

Figure 14.23. Specifying the read, write, and exception descriptors for select

Figure 14.24. Example descriptor sets for select

14.5.2. poll Function

Figure 14.25. The events and revents flags for poll

Interruptibility of select and poll