14.5. I/O Multiplexing

When we read from one descriptor and write to another, we can use blocking I/O in a loop, such as

    while ((n = read(STDIN_FILENO, buf, BUFSIZ)) > 0)
        if (write(STDOUT_FILENO, buf, n) != n)
            err_sys("write error");

We see this form of blocking I/O over and over again. What if we have to read from two descriptors? In this case, we can't do a blocking read on either descriptor, as data may appear on one descriptor while we're blocked in a read on the other. A different technique is required to handle this case.

Let's look at the structure of the telnet(1) command. In this program, we read from the terminal (standard input) and write to a network connection, and we read from the network connection and write to the terminal (standard output). At the other end of the network connection, the telnetd daemon reads what we typed and presents it to a shell as if we were logged in to the remote machine. The telnetd daemon sends any output generated by the commands we type back to us through the telnet command, to be displayed on our terminal. Figure 14.20 shows a picture of this.

Figure 14.20. Overview of telnet program
The telnet process has two inputs and two outputs. We can't do a blocking read on either of the inputs, as we never know which input will have data for us. One way to handle this particular problem is to divide the process into two pieces (using fork), with each half handling one direction of data. We show this in Figure 14.21. (The cu(1) command provided with System V's uucp communication package was structured like this.)

Figure 14.21. The telnet program using two processes
If we use two processes, we can let each process do a blocking read. But this leads to a problem when the operation terminates. If an end of file is received by the child (the network connection is disconnected by the telnetd daemon), then the child terminates, and the parent is notified by the SIGCHLD signal. But if the parent terminates (the user enters an end of file at the terminal), then the parent has to tell the child to stop. We can use a signal for this (SIGUSR1, for example), but it does complicate the program somewhat.

Instead of two processes, we could use two threads in a single process. This avoids the termination complexity, but requires that we deal with synchronization between the threads, which could add more complexity than it saves.

We could use nonblocking I/O in a single process by setting both descriptors nonblocking and issuing a read on the first descriptor. If data is present, we read it and process it. If there is no data to read, the call returns immediately. We then do the same thing with the second descriptor. After this, we wait for some amount of time (a few seconds, perhaps) and then try to read from the first descriptor again. This type of loop is called polling. The problem is that it wastes CPU time. Most of the time, there won't be data to read, so we waste time performing the read system calls. We also have to guess how long to wait each time around the loop. Although it works on any system that supports nonblocking I/O, polling should be avoided on a multitasking system.

Another technique is called asynchronous I/O. To do this, we tell the kernel to notify us with a signal when a descriptor is ready for I/O. There are two problems with this. First, not all systems support this feature (it is an optional facility in the Single UNIX Specification). System V provides the SIGPOLL signal for this technique, but this signal works only if the descriptor refers to a STREAMS device. BSD has a similar signal, SIGIO, but it has similar limitations: it works only on descriptors that refer to terminal devices or networks. The second problem with this technique is that there is only one of these signals per process (SIGPOLL or SIGIO). If we enable this signal for two descriptors (in the example we've been talking about, reading from two descriptors), the occurrence of the signal doesn't tell us which descriptor is ready. To determine which descriptor is ready, we still need to set each nonblocking and try them in sequence. We describe asynchronous I/O briefly in Section 14.6.

A better technique is to use I/O multiplexing. To do this, we build a list of the descriptors that we are interested in (usually more than one descriptor) and call a function that doesn't return until one of the descriptors is ready for I/O. On return from the function, we are told which descriptors are ready for I/O.

Three functions (poll, pselect, and select) allow us to perform I/O multiplexing. Figure 14.22 summarizes which platforms support them. Note that select is defined by the base POSIX.1 standard, but poll is an XSI extension to the base.
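For comparison with the multiplexing functions just introduced, one attempt of the nonblocking polling approach described earlier might look like the following. This is only a minimal sketch: the names set_nonblocking, try_read, and NO_DATA are hypothetical and do not come from the text.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#define NO_DATA (-2)    /* hypothetical sentinel: nothing to read right now */

/* Turn on O_NONBLOCK for fd, preserving its other file status flags. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * One polling attempt on a nonblocking descriptor.
 * Returns the number of bytes read (0 means end of file), NO_DATA if
 * nothing is available yet, or -1 on any other error.
 */
ssize_t try_read(int fd, char *buf, size_t len)
{
    ssize_t n;

    assert(buf != NULL);
    n = read(fd, buf, len);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return NO_DATA;     /* the read would have blocked */
    return n;
}
```

A polling loop would call try_read on each descriptor in turn and then sleep before trying again. Most passes return NO_DATA, which is exactly the wasted work that I/O multiplexing avoids.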
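14.5.1. select and pselect Functions

The select function lets us do I/O multiplexing under all POSIX-compatible platforms. The arguments we pass to select tell the kernel which descriptors we're interested in, which conditions we're interested in for each descriptor (readable, writable, or an exception condition), and how long we want to wait.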
On return from select, the kernel tells us how many descriptors are ready and which descriptors are ready for each of the three conditions.
With this return information, we can call the appropriate I/O function (usually read or write) and know that the function won't block.
Let's look at the last argument first. This specifies how long we want to wait:

    struct timeval {
        long tv_sec;     /* seconds */
        long tv_usec;    /* and microseconds */
    };

There are three conditions.
tvptr == NULL
tvptr->tv_sec == 0 && tvptr->tv_usec == 0
tvptr->tv_sec != 0 || tvptr->tv_usec != 0
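The three timeout cases above can be illustrated with a small conversion helper. This is a sketch only: ms_to_timeval is a hypothetical name, and the convention that a negative millisecond count means "wait forever" is an assumption made for the example, not part of select.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/*
 * Hypothetical helper: map a millisecond count onto the three cases.
 *   ms < 0   returns NULL, so select waits until a descriptor is
 *            ready or a signal is caught
 *   ms == 0  both fields are zero, so select tests the descriptors
 *            and returns immediately
 *   ms > 0   select waits at most the given amount of time
 * The return value is meant to be passed as the last argument of select.
 */
struct timeval *ms_to_timeval(long ms, struct timeval *tv)
{
    assert(tv != NULL);
    if (ms < 0)
        return NULL;
    tv->tv_sec = ms / 1000;             /* whole seconds */
    tv->tv_usec = (ms % 1000) * 1000;   /* remainder, in microseconds */
    return tv;
}
```

Some systems update the timeval structure to reflect the time remaining, so it is safest to refill it before every call to select.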
The middle three arguments (readfds, writefds, and exceptfds) are pointers to descriptor sets. These three sets specify which descriptors we're interested in and for which conditions (readable, writable, or an exception condition). A descriptor set is stored in an fd_set data type. This data type is chosen by the implementation so that it can hold one bit for each possible descriptor. We can consider it to be just a big array of bits, as shown in Figure 14.23.

Figure 14.23. Specifying the read, write, and exception descriptors for select
The only thing we can do with the fd_set data type is allocate a variable of this type, assign a variable of this type to another variable of the same type, or use one of the following four functions on a variable of this type.
These interfaces can be implemented as either macros or functions. An fd_set is set to all zero bits by calling FD_ZERO. To turn on a single bit in a set, we use FD_SET. We can clear a single bit by calling FD_CLR. Finally, we can test whether a given bit is turned on in the set with FD_ISSET.

After declaring a descriptor set, we must zero the set using FD_ZERO. We then set bits in the set for each descriptor that we're interested in, as in

    fd_set rset;
    int    fd;

    FD_ZERO(&rset);
    FD_SET(fd, &rset);
    FD_SET(STDIN_FILENO, &rset);

On return from select, we can test whether a given bit in the set is still on using FD_ISSET:

    if (FD_ISSET(fd, &rset)) {
        ...
    }

Any (or all) of the middle three arguments to select (the pointers to the descriptor sets) can be null pointers if we're not interested in that condition. If all three pointers are NULL, then we have a higher precision timer than provided by sleep. (Recall from Section 10.19 that sleep waits for an integral number of seconds. With select, we can wait for intervals less than 1 second; the actual resolution depends on the system's clock.) Exercise 14.6 shows such a function.

The first argument to select, maxfdp1, stands for "maximum file descriptor plus 1." We calculate the highest descriptor that we're interested in, considering all three of the descriptor sets, add 1, and that's the first argument. We could just set the first argument to FD_SETSIZE, a constant in <sys/select.h> that specifies the maximum number of descriptors (often 1,024), but this value is too large for most applications. Indeed, most applications probably use between 3 and 10 descriptors. (Some applications need many more descriptors, but these UNIX programs are atypical.) By specifying the highest descriptor that we're interested in, we can prevent the kernel from going through hundreds of unused bits in the three descriptor sets, looking for bits that are turned on.
As an example, Figure 14.24 shows what two descriptor sets look like if we write

    fd_set readset, writeset;

    FD_ZERO(&readset);
    FD_ZERO(&writeset);
    FD_SET(0, &readset);
    FD_SET(3, &readset);
    FD_SET(1, &writeset);
    FD_SET(2, &writeset);
    select(4, &readset, &writeset, NULL, NULL);

Figure 14.24. Example descriptor sets for select
The reason we have to add 1 to the maximum descriptor number is that descriptors start at 0, and the first argument is really a count of the number of descriptors to check (starting with descriptor 0). There are three possible return values from select.
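The three return values are -1 on error, 0 when the timeout expires before any descriptor is ready, and a positive count of ready descriptors otherwise. A minimal sketch of handling each case follows. The names read_with_timeout and TIMED_OUT are hypothetical, and only the read set is used to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * For reference, POSIX declares select as
 *   int select(int nfds, fd_set *restrict readfds,
 *              fd_set *restrict writefds, fd_set *restrict exceptfds,
 *              struct timeval *restrict timeout);
 * The text calls the first argument maxfdp1 and the last one tvptr.
 */

#define TIMED_OUT (-2)   /* hypothetical sentinel for the timeout case */

/*
 * Wait up to timeout_ms for fd to become readable, then read from it.
 * Returns the result of read (0 means end of file), TIMED_OUT if no
 * data arrived in time, or -1 on error.
 */
ssize_t read_with_timeout(int fd, char *buf, size_t len, long timeout_ms)
{
    fd_set rset;
    struct timeval tv;
    int n;

    assert(fd >= 0 && fd < FD_SETSIZE && timeout_ms >= 0);

    FD_ZERO(&rset);                 /* always clear the set first */
    FD_SET(fd, &rset);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* First argument: highest descriptor of interest, plus 1. */
    n = select(fd + 1, &rset, NULL, NULL, &tv);
    if (n < 0)
        return -1;                  /* error; errno is EINTR if a signal was caught */
    if (n == 0)
        return TIMED_OUT;           /* no descriptor became ready */
    if (!FD_ISSET(fd, &rset))
        return -1;                  /* not expected with a single descriptor */
    return read(fd, buf, len);      /* will not block */
}
```

Because select overwrites the descriptor set with the ready descriptors, the set is rebuilt on every call rather than reused.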
We now need to be more specific about what "ready" means.
It is important to realize that whether a descriptor is blocking or not doesn't affect whether select blocks. That is, if we have a nonblocking descriptor that we want to read from and we call select with a timeout value of 5 seconds, select will block for up to 5 seconds. Similarly, if we specify an infinite timeout, select blocks until data is ready for the descriptor or until a signal is caught.

If we encounter the end of file on a descriptor, that descriptor is considered readable by select. We then call read and it returns 0, the way to signify end of file on UNIX systems. (Many people incorrectly assume that select indicates an exception condition on a descriptor when the end of file is reached.)

POSIX.1 also defines a variant of select called pselect.
The pselect function is identical to select, with the following exceptions.
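As a rough illustration of how a pselect call differs in practice, the sketch below passes the timeout as a timespec structure (seconds and nanoseconds) and supplies a signal mask that is in effect only while the call runs. The helper name wait_readable_masked is hypothetical, and blocking a single signal is just one possible use of the mask argument.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for fd to become readable.
 * While pselect runs, the process signal mask is replaced by a mask
 * that blocks only signo; the previous mask is restored on return.
 * Returns what pselect returns: -1 on error, 0 on timeout, otherwise
 * the number of ready descriptors (here at most 1).
 */
int wait_readable_masked(int fd, long timeout_ms, int signo)
{
    fd_set rset;
    struct timespec ts;     /* nanosecond field instead of microseconds */
    sigset_t mask;

    assert(fd >= 0 && fd < FD_SETSIZE && timeout_ms >= 0);

    FD_ZERO(&rset);
    FD_SET(fd, &rset);
    ts.tv_sec = timeout_ms / 1000;
    ts.tv_nsec = (timeout_ms % 1000) * 1000000L;

    sigemptyset(&mask);
    sigaddset(&mask, signo);

    /* The timeout is const for pselect, so ts is left unchanged. */
    return pselect(fd + 1, &rset, NULL, NULL, &ts, &mask);
}
```

Passing a null pointer as the last argument makes pselect leave the signal mask alone, in which case it behaves like select apart from the timeout type.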
14.5.2. poll Function

The poll function is similar to select, but the programmer interface is different. As we'll see, poll is tied to the STREAMS system, since it originated with System V, although we are able to use it with any type of file descriptor.
With poll, instead of building a set of descriptors for each condition (readability, writability, and exception condition), as we did with select, we build an array of pollfd structures, with each array element specifying a descriptor number and the conditions that we're interested in for that descriptor:

    struct pollfd {
        int   fd;         /* file descriptor to check, or <0 to ignore */
        short events;     /* events of interest on fd */
        short revents;    /* events that occurred on fd */
    };

The number of elements in the fdarray array is specified by nfds.
To tell the kernel what events we're interested in for each descriptor, we have to set the events member of each array element to one or more of the values in Figure 14.25. On return, the revents member is set by the kernel, specifying which events have occurred for each descriptor. (Note that poll doesn't change the events member. This differs from select, which modifies its arguments to indicate what is ready.)
The first four rows of Figure 14.25 test for readability, the next three test for writability, and the final three are for exception conditions. The last three rows in Figure 14.25 are set by the kernel on return. These three values are returned in revents when the condition occurs, even if they weren't specified in the events field. When a descriptor is hung up (POLLHUP), we can no longer write to the descriptor. There may, however, still be data to be read from the descriptor. The final argument to poll specifies how long we want to wait. As with select, there are three cases.
timeout == -1
timeout == 0
timeout > 0
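The pieces described above (the pollfd array, the events and revents members, and the timeout argument) can be put together in a short sketch. The helper poll_two and its return convention are hypothetical. The timeout is in milliseconds: -1 waits forever, 0 returns immediately, and a positive value sets an upper limit on the wait.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait for either of two descriptors to become
 * readable. Returns the index (0 or 1) of the first descriptor that
 * has data or a hangup pending, -2 on timeout, or -1 on error.
 */
int poll_two(int fd0, int fd1, int timeout_ms)
{
    struct pollfd fds[2];
    int n, i;

    assert(timeout_ms >= -1);

    fds[0].fd = fd0;
    fds[0].events = POLLIN;     /* we ask only about readability */
    fds[0].revents = 0;
    fds[1].fd = fd1;
    fds[1].events = POLLIN;
    fds[1].revents = 0;

    n = poll(fds, 2, timeout_ms);
    if (n < 0)
        return -1;              /* error, possibly EINTR */
    if (n == 0)
        return -2;              /* timed out, nothing ready */

    /* events is untouched; the kernel reports results in revents. */
    for (i = 0; i < 2; i++)
        if (fds[i].revents & (POLLIN | POLLHUP))
            return i;
    return -1;                  /* only POLLERR or POLLNVAL was reported */
}
```

Since poll leaves the events member alone, the same array can be passed to poll again without being rebuilt, unlike the descriptor sets used with select.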
It is important to realize the difference between an end of file and a hangup. If we're entering data from the terminal and type the end-of-file character, POLLIN is turned on so we can read the end-of-file indication (read returns 0). POLLHUP is not turned on in revents. If we're reading from a modem and the telephone line is hung up, we'll receive the POLLHUP notification.

As with select, whether a descriptor is blocking or not doesn't affect whether poll blocks.

Interruptibility of select and poll

When the automatic restarting of interrupted system calls was introduced with 4.2BSD (Section 10.5), the select function was never restarted. This characteristic continues with most systems even if the SA_RESTART option is specified. But under SVR4, if SA_RESTART was specified, even select and poll were automatically restarted. To prevent this from catching us when we port software to systems derived from SVR4, we'll always use the signal_intr function (Figure 10.19) if the signal could interrupt a call to select or poll.
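The idea of installing a handler so that select and poll are interrupted rather than restarted can be sketched with sigaction. This is not a reproduction of Figure 10.19. The names install_intr_handler, on_signal, got_signal, and wait_unless_signaled are hypothetical, and the sketch assumes that leaving SA_RESTART out of sa_flags is enough to request interruption on the target system.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <stddef.h>

volatile sig_atomic_t got_signal = 0;

/* Handler that only records that the signal arrived. */
void on_signal(int signo)
{
    (void)signo;
    got_signal = 1;
}

/* Install func for signo without SA_RESTART, so slow calls return EINTR. */
int install_intr_handler(int signo, void (*func)(int))
{
    struct sigaction act;

    assert(func != NULL);
    act.sa_handler = func;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;               /* deliberately no SA_RESTART */
#ifdef SA_INTERRUPT
    act.sa_flags |= SA_INTERRUPT;   /* older systems: request interruption */
#endif
    return sigaction(signo, &act, NULL);
}

/*
 * Wait up to timeout_ms using poll with no descriptors of interest.
 * Returns 1 if a caught signal cut the wait short, 0 if the full
 * timeout elapsed, or -1 on any other error.
 */
int wait_unless_signaled(int timeout_ms)
{
    struct pollfd pfd;
    int n;

    pfd.fd = -1;                    /* negative fd: entry is ignored */
    pfd.events = 0;
    pfd.revents = 0;
    n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return (errno == EINTR) ? 1 : -1;
    return 0;
}
```

A caller that sees the interrupted result can check a flag such as got_signal and decide whether to call poll again or to shut down.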