20.8. Source Code
We start with the apue_db.h header shown first. This header is included by the library source code and all applications that call the library.
For the remainder of this text, we depart from the style of the previous examples in several ways. First, because the source code example is longer than usual, we number the lines. This makes it easier to link the discussion with the corresponding source code. Second, we place the description of the source code immediately below the source code on the same page.
This style was inspired by John Lions in his book documenting the UNIX Version 6 operating system source code [Lions 1977, 1996]. It simplifies the task of studying large amounts of source code.
Note that we do not bother to number blank lines. Although this departs from the normal behavior of such tools as pr(1), we have nothing interesting to say about blank lines.
1 #ifndef _APUE_DB_H
2 #define _APUE_DB_H
3 typedef void * DBHANDLE;
4 DBHANDLE db_open(const char *, int, ...);
5 void db_close(DBHANDLE);
6 char *db_fetch(DBHANDLE, const char *);
7 int db_store(DBHANDLE, const char *, const char *, int);
8 int db_delete(DBHANDLE, const char *);
9 void db_rewind(DBHANDLE);
10 char *db_nextrec(DBHANDLE, char *);
11 /*
12 * Flags for db_store().
13 */
14 #define DB_INSERT 1 /* insert new record only */
15 #define DB_REPLACE 2 /* replace existing record */
16 #define DB_STORE 3 /* replace or insert */
17 /*
18 * Implementation limits.
19 */
20 #define IDXLEN_MIN 6 /* key, sep, start, sep, length, \n */
21 #define IDXLEN_MAX 1024 /* arbitrary */
22 #define DATLEN_MIN 2 /* data byte, newline */
23 #define DATLEN_MAX 1024 /* arbitrary */
24 #endif /* _APUE_DB_H */
[1 3] | We use the _APUE_DB_H symbol to ensure that the contents of the header file are included only once. The DBHANDLE type represents an active reference to the database and is used to isolate applications from the implementation details of the database. Compare this technique with the way the standard I/O library exposes the FILE structure to applications. | [4 10] | Next, we declare the prototypes for the database library's public functions. Since this header is included by applications that want to use the library, we don't declare the prototypes for the library's private functions here. | [1124] | The legal flags that can be passed to the db_store function are defined next, followed by fundamental limits of the implementation. These limits can be changed, if desired, to support bigger databases. | | The minimum index record length is specified by IDXLEN_MIN. This represents a 1-byte key, a 1-byte separator, a 1-byte starting offset, another 1-byte separator, a 1-byte length, and a terminating newline character. (Recall the format of an index record from Figure 20.2.) An index record will usually be larger than IDXLEN_MIN bytes, but this is the bare minimum size. |
The next file is db.c, the C source file for the library. For simplicity, we include all functions in a single file. This has the advantage that we can hide private functions by declaring them static.
1 #include "apue.h"
2 #include "apue_db.h"
3 #include <fcntl.h> /* open & db_open flags */
4 #include <stdarg.h>
5 #include <errno.h>
6 #include <sys/uio.h> /* struct iovec */
7 /*
8 * Internal index file constants.
9 * These are used to construct records in the
10 * index file and data file.
11 */
12 #define IDXLEN_SZ 4 /* index record length (ASCII chars) */
13 #define SEP ':' /* separator char in index record */
14 #define SPACE ' ' /* space character */
15 #define NEWLINE '\n' /* newline character */
16 /*
17 * The following definitions are for hash chains and free
18 * list chain in the index file.
19 */
20 #define PTR_SZ 6 /* size of ptr field in hash chain */
21 #define PTR_MAX 999999 /* max file offset = 10**PTR_SZ - 1 */
22 #define NHASH_DEF 137 /* default hash table size */
23 #define FREE_OFF 0 /* free list offset in index file */
24 #define HASH_OFF PTR_SZ /* hash table offset in index file */
25 typedef unsigned long DBHASH; /* hash values */
26 typedef unsigned long COUNT; /* unsigned counter */
[1 6] | We include apue.h because we use some of the functions from our private library. In turn, apue.h includes several standard header files, including <stdio.h> and <unistd.h>. We include <stdarg.h> because the db_open function uses the variable-argument functions declared by <stdarg.h>. | [7 26] | The size of an index record is specified by IDXLEN_SZ. We use some characters, such as colon and newline, as delimiters in the database. We use the space character as "white out" when we delete a record. | | Some of the values that we have defined as constants could also be made variable, with some added complexity in the implementation. For example, we set the size of the hash table to 137 entries. A better technique would be to let the caller specify this as an argument to db_open, based on the expected size of the database. We would then have to store this size at the beginning of the index file. |
27 /*
28 * Library's private representation of the database.
29 */
30 typedef struct {
31 int idxfd; /* fd for index file */
32 int datfd; /* fd for data file */
33 char *idxbuf; /* malloc'ed buffer for index record */
34 char *datbuf; /* malloc'ed buffer for data record*/
35 char *name; /* name db was opened under */
36 off_t idxoff; /* offset in index file of index record */
37 /* key is at (idxoff + PTR_SZ + IDXLEN_SZ) */
38 size_t idxlen; /* length of index record */
39 /* excludes IDXLEN_SZ bytes at front of record */
40 /* includes newline at end of index record */
41 off_t datoff; /* offset in data file of data record */
42 size_t datlen; /* length of data record */
43 /* includes newline at end */
44 off_t ptrval; /* contents of chain ptr in index record */
45 off_t ptroff; /* chain ptr offset pointing to this idx record */
46 off_t chainoff; /* offset of hash chain for this index record */
47 off_t hashoff; /* offset in index file of hash table */
48 DBHASH nhash; /* current hash table size */
49 COUNT cnt_delok; /* delete OK */
50 COUNT cnt_delerr; /* delete error */
51 COUNT cnt_fetchok; /* fetch OK */
52 COUNT cnt_fetcherr; /* fetch error */
53 COUNT cnt_nextrec; /* nextrec */
54 COUNT cnt_stor1; /* store: DB_INSERT, no empty, appended */
55 COUNT cnt_stor2; /* store: DB_INSERT, found empty, reused */
56 COUNT cnt_stor3; /* store: DB_REPLACE, diff len, appended */
57 COUNT cnt_stor4; /* store: DB_REPLACE, same len, overwrote */
58 COUNT cnt_storerr; /* store error */
59 } DB;
[27 48] | The DB structure is where we keep all the information for each open database. The DBHANDLE value that is returned by db_open and used by all the other functions is really just a pointer to one of these structures, but we hide that from the callers. | | Since we store pointers and lengths as ASCII in the database, we convert these to numeric values and save them in the DB structure. We also save the hash table size even though it is fixed, just in case we decide to enhance the library to allow callers to specify the size when the database is created (see Exercise 20.7). | [49 59] | The last ten fields in the DB structure count both successful and unsuccessful operations. If we want to analyze the performance of our database, we can write a function to return these statistics, but for now, we only maintain the counters. |
60 /*
61 * Internal functions.
62 */
63 static DB *_db_alloc(int);
64 static void _db_dodelete(DB *);
65 static int _db_find_and_lock(DB *, const char *, int);
66 static int _db_findfree(DB *, int, int);
67 static void _db_free(DB *);
68 static DBHASH _db_hash(DB *, const char *);
69 static char *_db_readdat(DB *);
70 static off_t _db_readidx(DB *, off_t);
71 static off_t _db_readptr(DB *, off_t);
72 static void _db_writedat(DB *, const char *, off_t, int);
73 static void _db_writeidx(DB *, const char *, off_t, int, off_t);
74 static void _db_writeptr(DB *, off_t, off_t);
75 /*
76 * Open or create a database. Same arguments as open(2).
77 */
78 DBHANDLE
79 db_open(const char *pathname, int oflag, ...)
80 {
81 DB *db;
82 int len, mode;
83 size_t i;
84 char asciiptr[PTR_SZ + 1],
85 hash[(NHASH_DEF + 1) * PTR_SZ + 2];
86 /* +2 for newline and null */
87 struct stat statbuff;
88 /*
89 * Allocate a DB structure, and the buffers it needs.
90 */
91 len = strlen(pathname);
92 if ((db = _db_alloc(len)) == NULL)
93 err_dump("db_open: _db_alloc error for DB");
[60 74] | We have chosen to name all the user-callable (public) functions starting with db_ and all the internal (private) functions starting with _db_. The public functions were declared in the library's header file, apue_db.h. We declare the internal functions as static so they are visible only to functions residing in the same file (the file containing the library implementation). | [75 93] | The db_open function has the same arguments as open(2). If the caller wants to create the database files, the optional third argument specifies the file permissions. The db_open function opens the index file and the data file, initializing the index file, if necessary. The function starts by calling _db_alloc to allocate and initialize a DB structure. |
94 db->nhash = NHASH_DEF;/* hash table size */
95 db->hashoff = HASH_OFF; /* offset in index file of hash table */
96 strcpy(db->name, pathname);
97 strcat(db->name, ".idx");
98 if (oflag & O_CREAT) {
99 va_list ap;
100 va_start(ap, oflag);
101 mode = va_arg(ap, int);
102 va_end(ap);
103 /*
104 * Open index file and data file.
105 */
106 db->idxfd = open(db->name, oflag, mode);
107 strcpy(db->name + len, ".dat");
108 db->datfd = open(db->name, oflag, mode);
109 } else {
110 /*
111 * Open index file and data file.
112 */
113 db->idxfd = open(db->name, oflag);
114 strcpy(db->name + len, ".dat");
115 db->datfd = open(db->name, oflag);
116 }
117 if (db->idxfd < 0 || db->datfd < 0) {
118 _db_free(db);
119 return(NULL);
120 }
[94 97] | We continue to initialize the DB structure. The pathname passed in by the caller specifies the prefix of the database filenames. We append the suffix .idx to create the name for the database index file. | [98 108] | If the caller wants to create the database files, we use the variable argument functions from <stdarg.h> to find the optional third argument. Then we use open to create and open the index file and data file. Note that the filename of the data file starts with the same prefix as the index file but has .dat as a suffix instead. | [109 116] | If the caller doesn't specify the O_CREAT flag, then we're opening existing database files. In this case, we simply call open with two arguments. | [117 120] | If we hit an error opening or creating either database file, we call _db_free to clean up the DB structure and then return NULL to the caller. If one open succeeded and one failed, _db_free will take care of closing the open file descriptor, as we shall see shortly. |
121 if ((oflag & (O_CREAT | O_TRUNC)) == (O_CREAT | O_TRUNC)) {
122 /*
123 * If the database was created, we have to initialize
124 * it. Write lock the entire file so that we can stat
125 * it, check its size, and initialize it, atomically.
126 */
127 if (writew_lock(db->idxfd, 0, SEEK_SET, 0) < 0)
128 err_dump("db_open: writew_lock error");
129 if (fstat(db->idxfd, &statbuff) < 0)
130 err_sys("db_open: fstat error");
131 if (statbuff.st_size == 0) {
132 /*
133 * We have to build a list of (NHASH_DEF + 1) chain
134 * ptrs with a value of 0. The +1 is for the free
135 * list pointer that precedes the hash table.
136 */
137 sprintf(asciiptr, "%*d", PTR_SZ, 0);
[121 130] | We encounter locking if the database is being created. Consider two processes trying to create the same database at about the same time. Assume that the first process calls fstat and is blocked by the kernel after fstat returns. The second process calls db_open, finds that the length of the index file is 0, and initializes the free list and hash chain. The second process then writes one record to the database. At this point, the second process is blocked, and the first process continues executing right after the call to fstat. The first process finds the size of the index file to be 0 (since fstat was called before the second process initialized the index file), so the first process initializes the free list and hash chain, wiping out the record that the second process stored in the database. The way to prevent this is to use locking. We use the macros readw_lock, writew_lock, and un_lock from Section 14.3. | [131 137] | If the size of the index file is 0, we have just created it, so we need to initialize the free list and hash chain pointers it contains. Note that we use the format string %*d to convert a database pointer from an integer to an ASCII string. (We'll use this type of format again in _db_writeidx and _db_writeptr.) This format tells sprintf to take the PTR_SZ argument and use it as the minimum field width for the next argument, which is 0 in this instance (here we are initializing the pointers to 0, since we are creating a new database). This has the effect of forcing the string created to be at least PTR_SZ characters (padded on the left with spaces). In _db_writeidx and _db_writeptr, we will pass a pointer value instead of zero, but we will first verify that the pointer value isn't greater than PTR_MAX, to guarantee that every pointer string we write to the database occupies exactly PTR_SZ (6) characters. |
138 hash[0] = 0;
139 for (i = 0; i < NHASH_DEF + 1; i++)
140 strcat(hash, asciiptr);
141 strcat(hash, "\n");
142 i = strlen(hash);
143 if (write(db->idxfd, hash, i) != i)
144 err_dump("db_open: index file init write error");
145 }
146 if (un_lock(db->idxfd, 0, SEEK_SET, 0) < 0)
147 err_dump("db_open: un_lock error");
148 }
149 db_rewind(db);
150 return(db);
151 }
152 /*
153 * Allocate & initialize a DB structure and its buffers.
154 */
155 static DB *
156 _db_alloc(int namelen)
157 {
158 DB *db;
159 /*
160 * Use calloc, to initialize the structure to zero.
161 */
162 if ((db = calloc(1, sizeof(DB))) == NULL)
163 err_dump("_db_alloc: calloc error for DB");
164 db->idxfd = db->datfd = -1; /* descriptors */
165 /*
166 * Allocate room for the name.
167 * +5 for ".idx" or ".dat" plus null at end.
168 */
169 if ((db->name = malloc(namelen + 5)) == NULL)
170 err_dump("_db_alloc: malloc error for name");
[138 151] | We continue to initialize the newly created database. We build the hash table and write it to the index file. Then we unlock the index file, reset the database file pointers, and return a pointer to the DB structure as the opaque handle for the caller to use with the other database functions. | [152 164] | The _db_alloc function is called by db_open to allocate storage for the DB structure, an index buffer, and a data buffer. We use calloc to allocate memory to hold the DB structure and ensure that it is initialized to all zeros. Since this has the side effect of setting the database file descriptors to zero, we need to reset them to 1 to indicate that they are not yet valid. | [165 170] | We allocate space to hold the name of the database file. We use this buffer to create both filenames by changing the suffix to refer to either the index file or the data file, as we saw in db_open. |
171 /*
172 * Allocate an index buffer and a data buffer.
173 * +2 for newline and null at end.
174 */
175 if ((db->idxbuf = malloc(IDXLEN_MAX + 2)) == NULL)
176 err_dump("_db_alloc: malloc error for index buffer");
177 if ((db->datbuf = malloc(DATLEN_MAX + 2)) == NULL)
178 err_dump("_db_alloc: malloc error for data buffer");
179 return(db);
180 }
181 /*
182 * Relinquish access to the database.
183 */
184 void
185 db_close(DBHANDLE h)
186 {
187 _db_free((DB *)h); /* closes fds, free buffers & struct */
188 }
189 /*
190 * Free up a DB structure, and all the malloc'ed buffers it
191 * may point to. Also close the file descriptors if still open.
192 */
193 static void
194 _db_free(DB *db)
195 {
196 if (db->idxfd >= 0)
197 close(db->idxfd);
198 if (db->datfd >= 0)
199 close(db->datfd);
[171 180] | We allocate space for buffers for the index and data files. The buffer sizes are defined in apue_db.h. An enhancement to the database library would be to allow these buffers to expand as required. We could keep track of the size of these two buffers and call realloc whenever we find we need a bigger buffer. Finally, we return a pointer to the DB structure that we allocated. | [181 188] | The db_close function is a wrapper that casts a database handle to a DB structure pointer, passing it to _db_free to release any resources and free the DB structure. | [189 199] | The _db_free function is called by db_open if an error occurs while opening the index file or data file and is also called by db_close when an application is done using the database. If the file descriptor for the database index file is valid, we close it. The same is done with the file descriptor for the data file. (Recall that when we allocate a new DB structure in _db_alloc, we initialize each file descriptor to 1. If we are unable to open one of the database files, the corresponding file descriptor will still be set to 1, and we will avoid trying to close it.) |
200 if (db->idxbuf != NULL)
201 free(db->idxbuf);
202 if (db->datbuf != NULL)
203 free(db->datbuf);
204 if (db->name != NULL)
205 free(db->name);
206 free(db);
207 }
208 /*
209 * Fetch a record. Return a pointer to the null-terminated data.
210 */
211 char *
212 db_fetch(DBHANDLE h, const char *key)
213 {
214 DB *db = h;
215 char *ptr;
216 if (_db_find_and_lock(db, key, 0) < 0) {
217 ptr = NULL; /* error, record not found */
218 db->cnt_fetcherr++;
219 } else {
220 ptr = _db_readdat(db); /* return pointer to data */
221 db->cnt_fetchok++;
222 }
223 /*
224 * Unlock the hash chain that _db_find_and_lock locked.
225 */
226 if (un_lock(db->idxfd, db->chainoff, SEEK_SET, 1) < 0)
227 err_dump("db_fetch: un_lock error");
228 return(ptr);
229 }
[200 207] | Next, we free any dynamically-allocated buffers. We can safely pass a null pointer to free, so we don't need to check the value of each buffer pointer beforehand, but we do so anyway because we consider it better style to free only those objects that we allocated. (Not all deallocator functions are as forgiving as free.) Finally, we free the memory backing the DB structure. | [208 218] | The db_fetch function is used to read a record given its key. We first try to find the record by calling _db_find_and_lock. If the record can't be found, we set the return value (ptr) to NULL and increment the count of unsuccessful record searches. Because _db_find_and_lock returns with the database index file locked, we can't return until we unlock it. | [219 229] | If the record is found, we call _db_readdat to read the corresponding data record and increment the count of the successful record searches. Before returning, we unlock the index file by calling un_lock. Then we return a pointer to the record found (or NULL if the record wasn't found). |
230 /*
231 * Find the specified record. Called by db_delete, db_fetch,
232 * and db_store. Returns with the hash chain locked.
233 */
234 static int
235 _db_find_and_lock(DB *db, const char *key, int writelock)
236 {
237 off_t offset, nextoffset;
238 /*
239 * Calculate the hash value for this key, then calculate the
240 * byte offset of corresponding chain ptr in hash table.
241 * This is where our search starts. First we calculate the
242 * offset in the hash table for this key.
243 */
244 db->chainoff = (_db_hash(db, key) * PTR_SZ) + db->hashoff;
245 db->ptroff = db->chainoff;
246 /*
247 * We lock the hash chain here. The caller must unlock it
248 * when done. Note we lock and unlock only the first byte.
249 */
250 if (writelock) {
251 if (writew_lock(db->idxfd, db->chainoff, SEEK_SET, 1) < 0)
252 err_dump("_db_find_and_lock: writew_lock error");
253 } else {
254 if (readw_lock(db->idxfd, db->chainoff, SEEK_SET, 1) < 0)
255 err_dump("_db_find_and_lock: readw_lock error");
256 }
257 /*
258 * Get the offset in the index file of first record
259 * on the hash chain (can be 0).
260 */
261 offset = _db_readptr(db, db->ptroff);
[230 237] | The _db_find_and_lock function is used internally by the library to find a record given its key. We set the writelock parameter to a nonzero value if we want to acquire a write lock on the index file while we search for the record. If we set writelock to zero, we read-lock the index file while we search it. | [238 256] | We prepare to traverse a hash chain in _db_find_and_lock. We convert the key into a hash value, which we use to calculate the starting address of the hash chain in the file (chainoff). We wait for the lock to be granted before going through the hash chain. Note that we lock only the first byte in the start of the hash chain. This increases concurrency by allowing multiple processes to search different hash chains at the same time. | [257 261] | We call _db_readptr to read the first pointer in the hash chain. If this returns zero, the hash chain is empty. |
262 while (offset != 0) {
263 nextoffset = _db_readidx(db, offset);
264 if (strcmp(db->idxbuf, key) == 0)
265 break; /* found a match */
266 db->ptroff = offset; /* offset of this (unequal) record */
267 offset = nextoffset; /* next one to compare */
268 }
269 /*
270 * offset == 0 on error (record not found).
271 */
272 return(offset == 0 ? -1 : 0);
273 }
274 /*
275 * Calculate the hash value for a key.
276 */
277 static DBHASH
278 _db_hash(DB *db, const char *key)
279 {
280 DBHASH hval = 0;
281 char c;
282 int i;
283 for (i = 1; (c = *key++) != 0; i++)
284 hval += c * i; /* ascii char times its 1-based index */
285 return(hval % db->nhash);
286 }
[262 268] | In the while loop, we go through each index record on the hash chain, comparing keys. We call _db_readidx to read each index record. It populates the idxbuf field with the key of the current record. If _db_readidx returns zero, we've reached the last entry in the chain. | [269 273] | If offset is zero after the loop, we've reached the end of a hash chain without finding a matching key, so we return 1. Otherwise, we found a match (and exited the loop with the break statement), so we return success (0). In this case, the ptroff field contains the address of the previous index record, datoff contains the address of the data record, and datlen contains the size of the data record. As we make our way through the hash chain, we save the previous index record that points to the current index record. We'll use this when we delete a record, since we have to modify the chain pointer of the previous record to delete the current record. | [274 286] | _db_hash calculates the hash value for a given key. It multiplies each ASCII character times its 1-based index and divides the result by the number of hash table entries. The remainder from the division is the hash value for this key. Recall that the number of hash table entries is 137, which is a prime number. According to Knuth [1998], prime hashes generally provide good distribution characteristics. |
287 /*
288 * Read a chain ptr field from anywhere in the index file:
289 * the free list pointer, a hash table chain ptr, or an
290 * index record chain ptr.
291 */
292 static off_t
293 _db_readptr(DB *db, off_t offset)
294 {
295 char asciiptr[PTR_SZ + 1];
296 if (lseek(db->idxfd, offset, SEEK_SET) == -1)
297 err_dump("_db_readptr: lseek error to ptr field");
298 if (read(db->idxfd, asciiptr, PTR_SZ) != PTR_SZ)
299 err_dump("_db_readptr: read error of ptr field");
300 asciiptr[PTR_SZ] = 0; /* null terminate */
301 return(atol(asciiptr));
302 }
303 /*
304 * Read the next index record. We start at the specified offset
305 * in the index file. We read the index record into db->idxbuf
306 * and replace the separators with null bytes. If all is OK we
307 * set db->datoff and db->datlen to the offset and length of the
308 * corresponding data record in the data file.
309 */
310 static off_t
311 _db_readidx(DB *db, off_t offset)
312 {
313 ssize_t i;
314 char *ptr1, *ptr2;
315 char asciiptr[PTR_SZ + 1], asciilen[IDXLEN_SZ + 1];
316 struct iovec iov[2];
[287 302] | _db_readptr reads any one of three different chain pointers: (a) the pointer at the beginning of the index file that points to the first index record on the free list, (b) the pointers in the hash table that point to the first index record on each hash chain, and (c) the pointers that are stored at the beginning of each index record (whether the index record is part of a hash chain or on the free list). We convert the pointer from ASCII to a long integer before returning it. No locking is done by this function; that is up to the caller. | [303 316] | The _db_readidx function is used to read the record at the specified offset from the index file. On success, the function will return the offset of the next record in the list. In this case, the function will populate several fields in the DB structure: idxoff contains the offset of the current record in the index file, ptrval contains the offset of the next index entry in the list, idxlen contains the length of the current index record, idxbuf contains the actual index record, datoff contains the offset of the record in the data file, and datlen contains the length of the data record. |
317 /*
318 * Position index file and record the offset. db_nextrec
319 * calls us with offset==0, meaning read from current offset.
320 * We still need to call lseek to record the current offset.
321 */
322 if ((db->idxoff = lseek(db->idxfd, offset,
323 offset == 0 ? SEEK_CUR : SEEK_SET)) == -1)
324 err_dump("_db_readidx: lseek error");
325 /*
326 * Read the ascii chain ptr and the ascii length at
327 * the front of the index record. This tells us the
328 * remaining size of the index record.
329 */
330 iov[0].iov_base = asciiptr;
331 iov[0].iov_len = PTR_SZ;
332 iov[1].iov_base = asciilen;
333 iov[1].iov_len = IDXLEN_SZ;
334 if ((i = readv(db->idxfd, &iov[0], 2)) != PTR_SZ + IDXLEN_SZ) {
335 if (i == 0 && offset == 0)
336 return(-1); /* EOF for db_nextrec */
337 err_dump("_db_readidx: readv error of index record");
338 }
339 /*
340 * This is our return value; always >= 0.
341 */
342 asciiptr[PTR_SZ] = 0; /* null terminate */
343 db->ptrval = atol(asciiptr); /* offset of next key in chain */
344 asciilen[IDXLEN_SZ] = 0; /* null terminate */
345 if ((db->idxlen = atoi(asciilen)) < IDXLEN_MIN ||
346 db->idxlen > IDXLEN_MAX)
347 err_dump("_db_readidx: invalid length");
[317 324] | We start by seeking to the index file offset provided by the caller. We record the offset in the DB structure, so even if the caller wants to read the record at the current file offset (by setting offset to 0), we still need to call lseek to determine the current offset. Since an index record will never be stored at offset 0 in the index file, we can safely overload the value of 0 to mean "read from the current offset." | [325 338] | We call readv to read the two fixed-length fields at the beginning of the index record: the chain pointer to the next index record and the size of the variable-length index record that follows. | [339 347] | We convert the offset of the next record to an integer and store it in the ptrval field (this will be used as the return value for this function). Then we convert the length of the index record into an integer and save it in the idxlen field. |
348 /*
349 * Now read the actual index record. We read it into the key
350 * buffer that we malloced when we opened the database.
351 */
352 if ((i = read(db->idxfd, db->idxbuf, db->idxlen)) != db->idxlen)
353 err_dump("_db_readidx: read error of index record");
354 if (db->idxbuf[db->idxlen-1] != NEWLINE) /* sanity check */
355 err_dump("_db_readidx: missing newline");
356 db->idxbuf[db->idxlen-1] = 0; /* replace newline with null */
357 /*
358 * Find the separators in the index record.
359 */
360 if ((ptr1 = strchr(db->idxbuf, SEP)) == NULL)
361 err_dump("_db_readidx: missing first separator");
362 *ptr1++ = 0; /* replace SEP with null */
363 if ((ptr2 = strchr(ptr1, SEP)) == NULL)
364 err_dump("_db_readidx: missing second separator");
365 *ptr2++ = 0; /* replace SEP with null */
366 if (strchr(ptr2, SEP) != NULL)
367 err_dump("_db_readidx: too many separators");
368 /*
369 * Get the starting offset and length of the data record.
370 */
371 if ((db->datoff = atol(ptr1)) < 0)
372 err_dump("_db_readidx: starting offset < 0");
373 if ((db->datlen = atol(ptr2)) <= 0 || db->datlen > DATLEN_MAX)
374 err_dump("_db_readidx: invalid length");
375 return(db->ptrval); /* return offset of next key in chain */
376 }
[348 356] | We read the variable-length index record into the idxbuf field in the DB structure. The record should be terminated with a newline, which we replace with a null byte. If the index file is corrupt, we terminate and drop core by calling err_dump. | [357 367] | We separate the index record into three fields: the key, the offset of the corresponding data record, and the length of the data record. The strchr function finds the first occurrence of the specified character in the given string. Here we look for the character that separates fields in the record (SEP, which we define to be a colon). | [368 376] | We convert the data record offset and length into integers and store them in the DB structure. Then we return the offset of the next record in the hash chain. Note that we do not read the data record. That is left to the caller. In db_fetch, for example, we don't read the data record until _db_find_and_lock has read the index record that matches the key that we're looking for. |
377 /*
378 * Read the current data record into the data buffer.
379 * Return a pointer to the null-terminated data buffer.
380 */
381 static char *
382 _db_readdat(DB *db)
383 {
384 if (lseek(db->datfd, db->datoff, SEEK_SET) == -1)
385 err_dump("_db_readdat: lseek error");
386 if (read(db->datfd, db->datbuf, db->datlen) != db->datlen)
387 err_dump("_db_readdat: read error");
388 if (db->datbuf[db->datlen-1] != NEWLINE) /* sanity check */
389 err_dump("_db_readdat: missing newline");
390 db->datbuf[db->datlen-1] = 0; /* replace newline with null */
391 return(db->datbuf); /* return pointer to data record */
392 }
393 /*
394 * Delete the specified record.
395 */
396 int
397 db_delete(DBHANDLE h, const char *key)
398 {
399 DB *db = h;
400 int rc = 0; /* assume record will be found */
401 if (_db_find_and_lock(db, key, 1) == 0) {
402 _db_dodelete(db);
403 db->cnt_delok++;
404 } else {
405 rc = -1; /* not found */
406 db->cnt_delerr++;
407 }
408 if (un_lock(db->idxfd, db->chainoff, SEEK_SET, 1) < 0)
409 err_dump("db_delete: un_lock error");
410 return(rc);
411 }
[377 392] | The _db_readdat function populates the datbuf field in the DB structure with the contents of the data record, expecting that the datoff and datlen fields have been properly initialized already. | [393 411] | The db_delete function is used to delete a record given its key. We use _db_find_and_lock to determine whether the record exists in the database. If it does, we call _db_dodelete to do the work needed to delete the record. The third argument to _db_find_and_lock controls whether the chain is read-locked or write-locked. Here we are requesting a write lock, since we will potentially change the list. Since _db_find_and_lock returns with the lock still held, we need to unlock it, regardless of whether the record was found. |
412 /*
413 * Delete the current record specified by the DB structure.
414 * This function is called by db_delete and db_store, after
415 * the record has been located by _db_find_and_lock.
416 */
417 static void
418 _db_dodelete(DB *db)
419 {
420 int i;
421 char *ptr;
422 off_t freeptr, saveptr;
423 /*
424 * Set data buffer and key to all blanks.
425 */
426 for (ptr = db->datbuf, i = 0; i < db->datlen - 1; i++)
427 *ptr++ = SPACE;
428 *ptr = 0; /* null terminate for _db_writedat */
429 ptr = db->idxbuf;
430 while (*ptr)
431 *ptr++ = SPACE;
432 /*
433 * We have to lock the free list.
434 */
435 if (writew_lock(db->idxfd, FREE_OFF, SEEK_SET, 1) < 0)
436 err_dump("_db_dodelete: writew_lock error");
437 /*
438 * Write the data record with all blanks.
439 */
440 _db_writedat(db, db->datbuf, db->datoff, SEEK_SET);
[412 431] | The _db_dodelete function does all the work necessary to delete a record from the database. (This function is also called by db_store.) Most of the function just updates two linked lists: the free list and the hash chain for this key. When a record is deleted, we set its key and data record to blanks. This fact is used by db_nextrec, which we'll examine later in this section. | [432 440] | We call writew_lock to write-lock the free list. This is to prevent two processes that are deleting records at the same time, on two different hash chains, from interfering with each other. Since we'll add the deleted record to the free list, which changes the free-list pointer, only one process at a time can be doing this. | | We write the all-blank data record by calling _db_writedat. Note that there is no need for _db_writedat to lock the data file in this case. Since db_delete has write-locked the hash chain for this record, we know that no other process is reading or writing this particular data record. |
441 /*
442 * Read the free list pointer. Its value becomes the
443 * chain ptr field of the deleted index record. This means
444 * the deleted record becomes the head of the free list.
445 */
446 freeptr = _db_readptr(db, FREE_OFF);
447 /*
448 * Save the contents of index record chain ptr,
449 * before it's rewritten by _db_writeidx.
450 */
451 saveptr = db->ptrval;
452 /*
453 * Rewrite the index record. This also rewrites the length
454 * of the index record, the data offset, and the data length,
455 * none of which has changed, but that's OK.
456 */
457 _db_writeidx(db, db->idxbuf, db->idxoff, SEEK_SET, freeptr);
458 /*
459 * Write the new free list pointer.
460 */
461 _db_writeptr(db, FREE_OFF, db->idxoff);
462 /*
463 * Rewrite the chain ptr that pointed to this record being
464 * deleted. Recall that _db_find_and_lock sets db->ptroff to
465 * point to this chain ptr. We set this chain ptr to the
466 * contents of the deleted record's chain ptr, saveptr.
467 */
468 _db_writeptr(db, db->ptroff, saveptr);
469 if (un_lock(db->idxfd, FREE_OFF, SEEK_SET, 1) < 0)
470 err_dump("_db_dodelete: un_lock error");
471 }
[441 461] | We read the free-list pointer and then update the index record so that its next record pointer is set to the first record on the free list. (If the free list was empty, this new chain pointer is 0.) We have already cleared the key. Then we update the free-list pointer with the offset of the index record we are deleting. This means that the free list is handled on a last-in, first-out basis; that is, deleted records are added to the front of the free list (although we remove entries from the free list on a first-fit basis). | | We don't have a separate free list for each file. When we add a deleted index record to the free list, the index record still points to the deleted data record. There are better ways to do this, in exchange for added complexity. | [462 471] | We update the previous record in the hash chain to point to the record after the one we are deleting, thus removing the deleted record from the hash chain. Finally, we unlock the free list. |
472 /*
473 * Write a data record. Called by _db_dodelete (to write
474 * the record with blanks) and db_store.
475 */
476 static void
477 _db_writedat(DB *db, const char *data, off_t offset, int whence)
478 {
479 struct iovec iov[2];
480 static char newline = NEWLINE;
481 /*
482 * If we're appending, we have to lock before doing the lseek
483 * and write to make the two an atomic operation. If we're
484 * overwriting an existing record, we don't have to lock.
485 */
486 if (whence == SEEK_END) /* we're appending, lock entire file */
487 if (writew_lock(db->datfd, 0, SEEK_SET, 0) < 0)
488 err_dump("_db_writedat: writew_lock error");
489 if ((db->datoff = lseek(db->datfd, offset, whence)) == -1)
490 err_dump("_db_writedat: lseek error");
491 db->datlen = strlen(data) + 1; /* datlen includes newline */
492 iov[0].iov_base = (char *) data;
493 iov[0].iov_len = db->datlen - 1;
494 iov[1].iov_base = &newline;
495 iov[1].iov_len = 1;
496 if (writev(db->datfd, &iov[0], 2) != db->datlen)
497 err_dump("_db_writedat: writev error of data record");
498 if (whence == SEEK_END)
499 if (un_lock(db->datfd, 0, SEEK_SET, 0) < 0)
500 err_dump("_db_writedat: un_lock error");
501 }
[472 491] | We call _db_writedat to write a data record. When we delete a record, we use _db_writedat to overwrite the record with blanks; _db_writedat doesn't need to lock the data file, because db_delete has write-locked the hash chain for this record. Thus, no other process could be reading or writing this particular data record. When we cover db_store later in this section, we'll encounter the case in which _db_writedat is appending to the data file and has to lock it. | | We seek to the location where we want to write the data record. The amount to write is the record size plus 1 byte for the terminating newline we add. | [492 501] | We set up the iovec array and call writev to write the data record and newline. We can't assume that the caller's buffer has room at the end for us to append the newline, so we write the newline from a separate buffer. If we are appending a record to the file, we release the lock we acquired earlier. |
502 /*
503 * Write an index record. _db_writedat is called before
504 * this function to set the datoff and datlen fields in the
505 * DB structure, which we need to write the index record.
506 */
507 static void
508 _db_writeidx(DB *db, const char *key,
509 off_t offset, int whence, off_t ptrval)
510 {
511 struct iovec iov[2];
512 char asciiptrlen[PTR_SZ + IDXLEN_SZ +1];
513 int len;
514 char *fmt;
515 if ((db->ptrval = ptrval) < 0 || ptrval > PTR_MAX)
516 err_quit("_db_writeidx: invalid ptr: %d", ptrval);
517 if (sizeof(off_t) == sizeof(long long))
518 fmt = "%s%c%lld%c%d\n";
519 else
520 fmt = "%s%c%ld%c%d\n";
521 sprintf(db->idxbuf, fmt, key, SEP, db->datoff, SEP, db->datlen);
522 if ((len = strlen(db->idxbuf)) < IDXLEN_MIN || len > IDXLEN_MAX)
523 err_dump("_db_writeidx: invalid length");
524 sprintf(asciiptrlen, "%*ld%*d", PTR_SZ, ptrval, IDXLEN_SZ, len);
525 /*
526 * If we're appending, we have to lock before doing the lseek
527 * and write to make the two an atomic operation. If we're
528 * overwriting an existing record, we don't have to lock.
529 */
530 if (whence == SEEK_END) /* we're appending */
531 if (writew_lock(db->idxfd, ((db->nhash+1)*PTR_SZ)+1,
532 SEEK_SET, 0) < 0)
533 err_dump("_db_writeidx: writew_lock error");
[502 524] | The _db_writeidx function is called to write an index record. After validating the next pointer in the chain, we create the index record and store the second half of it in idxbuf. We need the size of this portion of the index record to create the first half of the index record, which we store in the local variable asciiptrlen. | | Note that we select the format string passed to sprintf based on the size of the off_t data type. Even a 32-bit system can provide 64-bit file offsets, so we can't make any assumptions about the size of the off_t data type. | [525 533] | As with _db_writedat, this function deals with locking only when a new index record is being appended to the index file. When _db_dodelete calls this function, we're rewriting an existing index record. In this case, the caller has write-locked the hash chain, so no additional locking is required. |
534 /*
535 * Position the index file and record the offset.
536 */
537 if ((db->idxoff = lseek(db->idxfd, offset, whence)) == -1)
538 err_dump("_db_writeidx: lseek error");
539 iov[0].iov_base = asciiptrlen;
540 iov[0].iov_len = PTR_SZ + IDXLEN_SZ;
541 iov[1].iov_base = db->idxbuf;
542 iov[1].iov_len = len;
543 if (writev(db->idxfd, &iov[0], 2) != PTR_SZ + IDXLEN_SZ + len)
544 err_dump("_db_writeidx: writev error of index record");
545 if (whence == SEEK_END)
546 if (un_lock(db->idxfd, ((db->nhash+1)*PTR_SZ)+1,
547 SEEK_SET, 0) < 0)
548 err_dump("_db_writeidx: un_lock error");
549 }
550 /*
551 * Write a chain ptr field somewhere in the index file:
552 * the free list, the hash table, or in an index record.
553 */
554 static void
555 _db_writeptr(DB *db, off_t offset, off_t ptrval)
556 {
557 char asciiptr[PTR_SZ + 1];
558 if (ptrval < 0 || ptrval > PTR_MAX)
559 err_quit("_db_writeptr: invalid ptr: %d", ptrval);
560 sprintf(asciiptr, "%*ld", PTR_SZ, ptrval);
561 if (lseek(db->idxfd, offset, SEEK_SET) == -1)
562 err_dump("_db_writeptr: lseek error to ptr field");
563 if (write(db->idxfd, asciiptr, PTR_SZ) != PTR_SZ)
564 err_dump("_db_writeptr: write error of ptr field");
565 }
[534 549] | We seek to the location where we want to write the index record and save this offset in the idxoff field of the DB structure. Since we built the index record in two separate buffers, we use writev to store it in the index file. If we were appending to the file, we release the lock we acquired before seeking. This makes the seek and the write an atomic operation from the perspective of concurrently running processes appending new records to the same database. | [550 565] | _db_writeptr is used to write a chain pointer to the index file. We validate that the chain pointer is within bounds, then convert it to an ASCII string. We seek to the specified offset in the index file and write the pointer. |
566 /*
567 * Store a record in the database. Return 0 if OK, 1 if record
568 * exists and DB_INSERT specified, -1 on error.
569 */
570 int
571 db_store(DBHANDLE h, const char *key, const char *data, int flag)
572 {
573 DB *db = h;
574 int rc, keylen, datlen;
575 off_t ptrval;
576 if (flag != DB_INSERT && flag != DB_REPLACE &&
577 flag != DB_STORE) {
578 errno = EINVAL;
579 return(-1);
580 }
581 keylen = strlen(key);
582 datlen = strlen(data) + 1; /* +1 for newline at end */
583 if (datlen < DATLEN_MIN || datlen > DATLEN_MAX)
584 err_dump("db_store: invalid data length");
585 /*
586 * _db_find_and_lock calculates which hash table this new record
587 * goes into (db->chainoff), regardless of whether it already
588 * exists or not. The following calls to _db_writeptr change the
589 * hash table entry for this chain to point to the new record.
590 * The new record is added to the front of the hash chain.
591 */
592 if (_db_find_and_lock(db, key, 1) < 0) { /* record not found */
593 if (flag == DB_REPLACE) {
594 rc = -1;
595 db->cnt_storerr++;
596 errno = ENOENT; /* error, record does not exist */
597 goto doreturn;
598 }
[566 584] | We use db_store to add a record to the database. We first validate the flag value we are passed. Then we make sure that the length of the data record is valid. If it isn't, we drop core and exit. This is OK for an example, but if we were building a production-quality library, we'd return an error status instead, which would give the application a chance to recover. | [585 598] | We call _db_find_and_lock to see if the record already exists. It is OK if the record doesn't exist and either DB_INSERT or DB_STORE is specified, or if the record already exists and either DB_REPLACE or DB_STORE is specified. If we're replacing an existing record, that implies that the keys are identical but that the data records probably differ. Note that the final argument to _db_find_and_lock specifies that the hash chain must be write-locked, as we will probably be modifying this hash chain. |
599 /*
600 * _db_find_and_lock locked the hash chain for us; read
601 * the chain ptr to the first index record on hash chain.
602 */
603 ptrval = _db_readptr(db, db->chainoff);
604 if (_db_findfree(db, keylen, datlen) < 0) {
605 /*
606 * Can't find an empty record big enough. Append the
607 * new record to the ends of the index and data files.
608 */
609 _db_writedat(db, data, 0, SEEK_END);
610 _db_writeidx(db, key, 0, SEEK_END, ptrval);
611 /*
612 * db->idxoff was set by _db_writeidx. The new
613 * record goes to the front of the hash chain.
614 */
615 _db_writeptr(db, db->chainoff, db->idxoff);
616 db->cnt_stor1++;
617 } else {
618 /*
619 * Reuse an empty record. _db_findfree removed it from
620 * the free list and set both db->datoff and db->idxoff.
621 * Reused record goes to the front of the hash chain.
622 */
623 _db_writedat(db, data, db->datoff, SEEK_SET);
624 _db_writeidx(db, key, db->idxoff, SEEK_SET, ptrval);
625 _db_writeptr(db, db->chainoff, db->idxoff);
626 db->cnt_stor2++;
627 }
[599 603] | After we call _db_find_and_lock, the code divides into four cases. In the first two, no record was found, so we are adding a new record. We read the offset of the first entry on the hash list. | [604 616] | Case 1: we call _db_findfree to search the free list for a deleted record with the same size key and same size data. If no such record is found, we have to append the new record to the ends of the index and data files. We call _db_writedat to write the data part, _db_writeidx to write the index part, and _db_writeptr to place the new record on the front of the hash chain. We increment a count (cnt_stor1) of the number of times we executed this case to allow us to characterize the behavior of the database.
| [617 627] | Case 2: _db_findfree found an empty record with the correct sizes and removed it from the free list (we'll see the implementation of _db_findfree shortly). We write the data and index portions of the new record and add the record to the front of the hash chain as we did in case 1. The cnt_stor2 field counts how many times we've executed this case. |
628 } else { /* record found */
629 if (flag == DB_INSERT) {
630 rc = 1; /* error, record already in db */
631 db->cnt_storerr++;
632 goto doreturn;
633 }
634 /*
635 * We are replacing an existing record. We know the new
636 * key equals the existing key, but we need to check if
637 * the data records are the same size.
638 */
639 if (datlen != db->datlen) {
640 _db_dodelete(db); /* delete the existing record */
641 /*
642 * Reread the chain ptr in the hash table
643 * (it may change with the deletion).
644 */
645 ptrval = _db_readptr(db, db->chainoff);
646 /*
647 * Append new index and data records to end of files.
648 */
649 _db_writedat(db, data, 0, SEEK_END);
650 _db_writeidx(db, key, 0, SEEK_END, ptrval);
651 /*
652 * New record goes to the front of the hash chain.
653 */
654 _db_writeptr(db, db->chainoff, db->idxoff);
655 db->cnt_stor3++;
656 } else {
[628 633] | Now we reach the two cases in which a record with the same key already exists in the database. If the caller isn't replacing the record, we set the return code to indicate that a record exists, increment the count of the number of store errors, and jump to the end of the function, where we handle the common return logic. | [634 656] | Case 3: an existing record is being replaced, and the length of the new data record differs from the length of the existing one. We call _db_dodelete to delete the existing record. Recall that this places the deleted record at the head of the free list. Then we append the new record to the ends of the data and index files by calling _db_writedat and _db_writeidx. (There are other ways to handle this case. We could try to find a deleted record that has the correct data size.) The new record is added to the front of the hash chain by calling _db_writeptr. The cnt_stor3 counter in the DB structure records the number of times we've executed this case. |
657 /*
658 * Same size data, just replace data record.
659 */
660 _db_writedat(db, data, db->datoff, SEEK_SET);
661 db->cnt_stor4++;
662 }
663 }
664 rc = 0; /* OK */
665 doreturn: /* unlock hash chain locked by _db_find_and_lock */
666 if (un_lock(db->idxfd, db->chainoff, SEEK_SET, 1) < 0)
667 err_dump("db_store: un_lock error");
668 return(rc);
669 }
670 /*
671 * Try to find a free index record and accompanying data record
672 * of the correct sizes. We're only called by db_store.
673 */
674 static int
675 _db_findfree(DB *db, int keylen, int datlen)
676 {
677 int rc;
678 off_t offset, nextoffset, saveoffset;
679 /*
680 * Lock the free list.
681 */
682 if (writew_lock(db->idxfd, FREE_OFF, SEEK_SET, 1) < 0)
683 err_dump("_db_findfree: writew_lock error");
684 /*
685 * Read the free list pointer.
686 */
687 saveoffset = FREE_OFF;
688 offset = _db_readptr(db, saveoffset);
[657 663] | Case 4: An existing record is being replaced, and the length of the new data record equals the length of the existing data record. This is the easiest case; we simply rewrite the data record and increment the counter (cnt_stor4) for this case. | [664 669] | In the normal case, we set the return code to indicate success and fall through to the common return logic. We unlock the hash chain that was locked as a result of calling _db_find_and_lock and return to the caller. | [670 688] | The _db_findfree function tries to find a free index record and associated data record of the specified sizes. We need to write-lock the free list to avoid interfering with any other processes using the free list. After locking the free list, we get the pointer address at the head of the list. |
689 while (offset != 0) {
690 nextoffset = _db_readidx(db, offset);
691 if (strlen(db->idxbuf) == keylen && db->datlen == datlen)
692 break; /* found a match */
693 saveoffset = offset;
694 offset = nextoffset;
695 }
696 if (offset == 0) {
697 rc = -1; /* no match found */
698 } else {
699 /*
700 * Found a free record with matching sizes.
701 * The index record was read in by _db_readidx above,
702 * which sets db->ptrval. Also, saveoffset points to
703 * the chain ptr that pointed to this empty record on
704 * the free list. We set this chain ptr to db->ptrval,
705 * which removes the empty record from the free list.
706 */
707 _db_writeptr(db, saveoffset, db->ptrval);
708 rc = 0;
709 /*
710 * Notice also that _db_readidx set both db->idxoff
711 * and db->datoff. This is used by the caller, db_store,
712 * to write the new index record and data record.
713 */
714 }
715 /*
716 * Unlock the free list.
717 */
718 if (un_lock(db->idxfd, FREE_OFF, SEEK_SET, 1) < 0)
719 err_dump("_db_findfree: un_lock error");
720 return(rc);
721 }
[689 695] | The while loop in _db_findfree goes through the free list, looking for a record with matching key and data sizes. In this simple implementation, we reuse a deleted record only if the key length and data length equal the lengths for the new record being inserted. There are a variety of better ways to reuse this deleted space, in exchange for added complexity. | [696 714] | If we can't find an available record of the requested key and data sizes, we set the return code to indicate failure. Otherwise, we write the previous record's chain pointer to point to the next chain pointer value of the record we have found. This removes the record from the free list. | [715 721] | Once we are done with the free list, we release the write lock. Then we return the status to the caller. |
722 /*
723 * Rewind the index file for db_nextrec.
724 * Automatically called by db_open.
725 * Must be called before first db_nextrec.
726 */
727 void
728 db_rewind(DBHANDLE h)
729 {
730 DB *db = h;
731 off_t offset;
732 offset = (db->nhash + 1) * PTR_SZ; /* +1 for free list ptr */
733 /*
734 * We're just setting the file offset for this process
735 * to the start of the index records; no need to lock.
736 * +1 below for newline at end of hash table.
737 */
738 if ((db->idxoff = lseek(db->idxfd, offset+1, SEEK_SET)) == -1)
739 err_dump("db_rewind: lseek error");
740 }
741 /*
742 * Return the next sequential record.
743 * We just step our way through the index file, ignoring deleted
744 * records. db_rewind must be called before this function is
745 * called the first time.
746 */
747 char *
748 db_nextrec(DBHANDLE h, char *key)
749 {
750 DB *db = h;
751 char c;
752 char *ptr;
[722 740] | The db_rewind function is used to reset the database to "the beginning;" we set the file offset for the index file to point to the first record in the index file (immediately following the hash table). (Recall the structure of the index file from Figure 20.2.) | [741 752] | The db_nextrec function returns the next record in the database. The return value is a pointer to the data buffer. If the caller provides a non-null value for the key parameter, the corresponding key is copied to this address. The caller is responsible for allocating a buffer big enough to store the key. A buffer whose size is IDXLEN_MAX bytes is large enough to hold any key. | | Records are returned sequentially, in the order that they happen to be stored in the database file. Thus, the records are not sorted by key value. Also, because we do not follow the hash chains, we can come across records that have been deleted, but we will not return these to the caller. |
753 /*
754 * We read lock the free list so that we don't read
755 * a record in the middle of its being deleted.
756 */
757 if (readw_lock(db->idxfd, FREE_OFF, SEEK_SET, 1) < 0)
758 err_dump("db_nextrec: readw_lock error");
759 do {
760 /*
761 * Read next sequential index record.
762 */
763 if (_db_readidx(db, 0) < 0) {
764 ptr = NULL; /* end of index file, EOF */
765 goto doreturn;
766 }
767 /*
768 * Check if key is all blank (empty record).
769 */
770 ptr = db->idxbuf;
771 while ((c = *ptr++) != 0 && c == SPACE)
772 ; /* skip until null byte or nonblank */
773 } while (c == 0); /* loop until a nonblank key is found */
774 if (key != NULL)
775 strcpy(key, db->idxbuf); /* return key */
776 ptr = _db_readdat(db); /* return pointer to data buffer */
777 db->cnt_nextrec++;
778 doreturn:
779 if (un_lock(db->idxfd, FREE_OFF, SEEK_SET, 1) < 0)
780 err_dump("db_nextrec: un_lock error");
781 return(ptr);
782 }
[753 758] | We first need to read-lock the free list so that no other processes can remove a record while we are reading it. | [759 773] | We call _db_readidx to read the next record. We pass in an offset of 0 to tell _db_readidx to continue reading from the current offset. Since we are reading the index file sequentially, we can come across records that have been deleted. We want to return only valid records, so we skip any record whose key is all spaces (recall that _db_dodelete clears a key by setting it to all spaces). | [774 782] | When we find a valid key, we copy it to the caller's buffer if one was supplied. Then we read the data record and set the return value to point to the internal buffer containing the data record. We increment a statistics counter, unlock the free list, and return the pointer to the data record. |
The normal use of db_rewind and db_nextrec is in a loop of the form
db_rewind(db);
while ((ptr = db_nextrec(db, key)) != NULL) {
/* process record */
}
As we warned earlier, there is no order to the returned records; they are not in key order.
If the database is being modified while db_nextrec is called from a loop, the records returned by db_nextrec are simply a snapshot of a changing database at some point in time. db_nextrec always returns a "correct" record when it is called; that is, it won't return a record that was deleted. But it is possible for a record returned by db_nextrec to be deleted immediately after db_nextrec returns. Similarly, if a deleted record is reused right after db_nextrec skips over the deleted record, we won't see that new record unless we rewind the database and go through it again. If it's important to obtain an accurate "frozen" snapshot of the database using db_nextrec, there must be no insertions or deletions going on at the same time.
Look at the locking used by db_nextrec. We're not going through any hash chain, and we can't determine the hash chain that a record belongs on. Therefore, it is possible for an index record to be in the process of being deleted when db_nextrec is reading the record. To prevent this, db_nextrec read-locks the free list, thereby avoiding any interaction with _db_dodelete and _db_findfree.
Before we conclude our study of the db.c source file, we need to describe the locking when new index records or data records are appended to the end of the file. In cases 1 and 3, db_store calls both _db_writeidx and _db_writedat with a third argument of 0 and a fourth argument of SEEK_END. This fourth argument is the flag to these two functions, indicating that the new record is being appended to the file. The technique used by _db_writeidx is to write-lock the index file from the end of the hash chain to the end of file. This won't interfere with any other readers or writers of the database (since they will lock a hash chain), but it does prevent other callers of db_store from trying to append at the same time. The technique used by _db_writedat is to write-lock the entire data file. Again, this won't interfere with other readers or writers of the database (since they don't even try to lock the data file), but it does prevent other callers of db_store from trying to append to the data file at the same time. (See Exercise 20.3.)
|