diff options
author | dan miller | 2007-10-20 02:49:29 +0000 |
---|---|---|
committer | dan miller | 2007-10-20 02:49:29 +0000 |
commit | e36d23a85ebff914d74bb541558c2b6082b78edb (patch) | |
tree | 54b58fdf162e78af64055282a6035c8d2443389d /libraries/sqlite/unix/sqlite-3.5.1/www/fileformat.tcl | |
parent | * Fixed an issue whereby avatar chat distances were being calculated against ... (diff) | |
download | opensim-SC-e36d23a85ebff914d74bb541558c2b6082b78edb.zip opensim-SC-e36d23a85ebff914d74bb541558c2b6082b78edb.tar.gz opensim-SC-e36d23a85ebff914d74bb541558c2b6082b78edb.tar.bz2 opensim-SC-e36d23a85ebff914d74bb541558c2b6082b78edb.tar.xz |
sqlite source (unix build) added to libraries
Diffstat (limited to 'libraries/sqlite/unix/sqlite-3.5.1/www/fileformat.tcl')
-rw-r--r-- | libraries/sqlite/unix/sqlite-3.5.1/www/fileformat.tcl | 785 |
1 files changed, 785 insertions, 0 deletions
diff --git a/libraries/sqlite/unix/sqlite-3.5.1/www/fileformat.tcl b/libraries/sqlite/unix/sqlite-3.5.1/www/fileformat.tcl new file mode 100644 index 0000000..d143f08 --- /dev/null +++ b/libraries/sqlite/unix/sqlite-3.5.1/www/fileformat.tcl | |||
@@ -0,0 +1,785 @@ | |||
1 | # | ||
2 | # Run this script to generated a fileformat.html output file | ||
3 | # | ||
4 | set rcsid {$Id: fileformat.tcl,v 1.13 2004/10/10 17:24:55 drh Exp $} | ||
5 | source common.tcl | ||
6 | header {SQLite Database File Format (Version 2)} | ||
7 | puts { | ||
8 | <h2>SQLite 2.X Database File Format</h2> | ||
9 | |||
10 | <p> | ||
11 | This document describes the disk file format for SQLite versions 2.1 | ||
12 | through 2.8. SQLite version 3.0 and following uses a very different | ||
13 | format which is described separately. | ||
14 | </p> | ||
15 | |||
16 | <h3>1.0 Layers</h3> | ||
17 | |||
18 | <p> | ||
19 | SQLite is implemented in layers. | ||
20 | (See the <a href="arch.html">architecture description</a>.) | ||
21 | The format of database files is determined by three different | ||
22 | layers in the architecture. | ||
23 | </p> | ||
24 | |||
25 | <ul> | ||
26 | <li>The <b>schema</b> layer implemented by the VDBE.</li> | ||
27 | <li>The <b>b-tree</b> layer implemented by btree.c</li> | ||
28 | <li>The <b>pager</b> layer implemented by pager.c</li> | ||
29 | </ul> | ||
30 | |||
31 | <p> | ||
32 | We will describe each layer beginning with the bottom (pager) | ||
33 | layer and working upwards. | ||
34 | </p> | ||
35 | |||
36 | <h3>2.0 The Pager Layer</h3> | ||
37 | |||
38 | <p> | ||
39 | An SQLite database consists of | ||
40 | "pages" of data. Each page is 1024 bytes in size. | ||
41 | Pages are numbered beginning with 1. | ||
42 | A page number of 0 is used to indicate "no such page" in the | ||
43 | B-Tree and Schema layers. | ||
44 | </p> | ||
45 | |||
46 | <p> | ||
47 | The pager layer is responsible for implementing transactions | ||
48 | with atomic commit and rollback. It does this using a separate | ||
49 | journal file. Whenever a new transaction is started, a journal | ||
50 | file is created that records the original state of the database. | ||
51 | If the program terminates before completing the transaction, the next | ||
52 | process to open the database can use the journal file to restore | ||
53 | the database to its original state. | ||
54 | </p> | ||
55 | |||
56 | <p> | ||
57 | The journal file is located in the same directory as the database | ||
58 | file and has the same name as the database file but with the | ||
59 | characters "<tt>-journal</tt>" appended. | ||
60 | </p> | ||
61 | |||
62 | <p> | ||
63 | The pager layer does not impose any content restrictions on the | ||
64 | main database file. As far as the pager is concerned, each page | ||
65 | contains 1024 bytes of arbitrary data. But there is structure to | ||
66 | the journal file. | ||
67 | </p> | ||
68 | |||
69 | <p> | ||
70 | A journal file begins with 8 bytes as follows: | ||
71 | 0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63, and 0xd6. | ||
72 | Processes that are attempting to rollback a journal use these 8 bytes | ||
73 | as a sanity check to make sure the file they think is a journal really | ||
74 | is a valid journal. Prior version of SQLite used different journal | ||
75 | file formats. The magic numbers for these prior formats are different | ||
76 | so that if a new version of the library attempts to rollback a journal | ||
77 | created by an earlier version, it can detect that the journal uses | ||
78 | an obsolete format and make the necessary adjustments. This article | ||
79 | describes only the newest journal format - supported as of version | ||
80 | 2.8.0. | ||
81 | </p> | ||
82 | |||
83 | <p> | ||
84 | Following the 8 byte prefix is a three 4-byte integers that tell us | ||
85 | the number of pages that have been committed to the journal, | ||
86 | a magic number used for | ||
87 | sanity checking each page, and the | ||
88 | original size of the main database file before the transaction was | ||
89 | started. The number of committed pages is used to limit how far | ||
90 | into the journal to read. The use of the checksum magic number is | ||
91 | described below. | ||
92 | The original size of the database is used to restore the database | ||
93 | file back to its original size. | ||
94 | The size is expressed in pages (1024 bytes per page). | ||
95 | </p> | ||
96 | |||
97 | <p> | ||
98 | All three integers in the journal header and all other multi-byte | ||
99 | numbers used in the journal file are big-endian. | ||
100 | That means that the most significant byte | ||
101 | occurs first. That way, a journal file that is | ||
102 | originally created on one machine can be rolled back by another | ||
103 | machine that uses a different byte order. So, for example, a | ||
104 | transaction that failed to complete on your big-endian SparcStation | ||
105 | can still be rolled back on your little-endian Linux box. | ||
106 | </p> | ||
107 | |||
108 | <p> | ||
109 | After the 8-byte prefix and the three 4-byte integers, the | ||
110 | journal file consists of zero or more page records. Each page | ||
111 | record is a 4-byte (big-endian) page number followed by 1024 bytes | ||
112 | of data and a 4-byte checksum. | ||
113 | The data is the original content of the database page | ||
114 | before the transaction was started. So to roll back the transaction, | ||
115 | the data is simply written into the corresponding page of the | ||
116 | main database file. Pages can appear in the journal in any order, | ||
117 | but they are guaranteed to appear only once. All page numbers will be | ||
118 | between 1 and the maximum specified by the page size integer that | ||
119 | appeared at the beginning of the journal. | ||
120 | </p> | ||
121 | |||
122 | <p> | ||
123 | The so-called checksum at the end of each record is not really a | ||
124 | checksum - it is the sum of the page number and the magic number which | ||
125 | was the second integer in the journal header. The purpose of this | ||
126 | value is to try to detect journal corruption that might have occurred | ||
127 | because of a power loss or OS crash that occurred which the journal | ||
128 | file was being written to disk. It could have been the case that the | ||
129 | meta-data for the journal file, specifically the size of the file, had | ||
130 | been written to the disk so that when the machine reboots it appears that | ||
131 | file is large enough to hold the current record. But even though the | ||
132 | file size has changed, the data for the file might not have made it to | ||
133 | the disk surface at the time of the OS crash or power loss. This means | ||
134 | that after reboot, the end of the journal file will contain quasi-random | ||
135 | garbage data. The checksum is an attempt to detect such corruption. If | ||
136 | the checksum does not match, that page of the journal is not rolled back. | ||
137 | </p> | ||
138 | |||
139 | <p> | ||
140 | Here is a summary of the journal file format: | ||
141 | </p> | ||
142 | |||
143 | <ul> | ||
144 | <li>8 byte prefix: 0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63, 0xd6</li> | ||
145 | <li>4 byte number of records in journal</li> | ||
146 | <li>4 byte magic number used for page checksums</li> | ||
147 | <li>4 byte initial database page count</li> | ||
148 | <li>Zero or more instances of the following: | ||
149 | <ul> | ||
150 | <li>4 byte page number</li> | ||
151 | <li>1024 bytes of original data for the page</li> | ||
152 | <li>4 byte checksum</li> | ||
153 | </ul> | ||
154 | </li> | ||
155 | </ul> | ||
156 | |||
157 | <h3>3.0 The B-Tree Layer</h3> | ||
158 | |||
159 | <p> | ||
160 | The B-Tree layer builds on top of the pager layer to implement | ||
161 | one or more separate b-trees all in the same disk file. The | ||
162 | algorithms used are taken from Knuth's <i>The Art Of Computer | ||
163 | Programming.</i></p> | ||
164 | |||
165 | <p> | ||
166 | Page 1 of a database contains a header string used for sanity | ||
167 | checking, a few 32-bit words of configuration data, and a pointer | ||
168 | to the beginning of a list of unused pages in the database. | ||
169 | All other pages in the | ||
170 | database are either pages of a b-tree, overflow pages, or unused | ||
171 | pages on the freelist. | ||
172 | </p> | ||
173 | |||
174 | <p> | ||
175 | Each b-tree page contains zero or more database entries. | ||
176 | Each entry has an unique key of one or more bytes and data of | ||
177 | zero or more bytes. | ||
178 | Both the key and data are arbitrary byte sequences. The combination | ||
179 | of key and data are collectively known as "payload". The current | ||
180 | implementation limits the amount of payload in a single entry to | ||
181 | 1048576 bytes. This limit can be raised to 16777216 by adjusting | ||
182 | a single #define in the source code and recompiling. But most entries | ||
183 | contain less than a hundred bytes of payload so a megabyte limit seems | ||
184 | more than enough. | ||
185 | </p> | ||
186 | |||
187 | <p> | ||
188 | Up to 238 bytes of payload for an entry can be held directly on | ||
189 | a b-tree page. Any additional payload is contained on a linked list | ||
190 | of overflow pages. This limit on the amount of payload held directly | ||
191 | on b-tree pages guarantees that each b-tree page can hold at least | ||
192 | 4 entries. In practice, most entries are smaller than 238 bytes and | ||
193 | thus most pages can hold more than 4 entries. | ||
194 | </p> | ||
195 | |||
196 | <p> | ||
197 | A single database file can hold any number of separate, independent b-trees. | ||
198 | Each b-tree is identified by its root page, which never changes. | ||
199 | Child pages of the b-tree may change as entries are added and removed | ||
200 | and pages split and combine. But the root page always stays the same. | ||
201 | The b-tree itself does not record which pages are root pages and which | ||
202 | are not. That information is handled entirely at the schema layer. | ||
203 | </p> | ||
204 | |||
205 | <h4>3.1 B-Tree Page 1 Details</h4> | ||
206 | |||
207 | <p> | ||
208 | Page 1 begins with the following 48-byte string: | ||
209 | </p> | ||
210 | |||
211 | <blockquote><pre> | ||
212 | ** This file contains an SQLite 2.1 database ** | ||
213 | </pre></blockquote> | ||
214 | |||
215 | <p> | ||
216 | If you count the number of characters in the string above, you will | ||
217 | see that there are only 47. A '\000' terminator byte is added to | ||
218 | bring the total to 48. | ||
219 | </p> | ||
220 | |||
221 | <p> | ||
222 | A frequent question is why the string says version 2.1 when (as | ||
223 | of this writing) we are up to version 2.7.0 of SQLite and any | ||
224 | change to the second digit of the version is suppose to represent | ||
225 | a database format change. The answer to this is that the B-tree | ||
226 | layer has not changed any since version 2.1. There have been | ||
227 | database format changes since version 2.1 but those changes have | ||
228 | all been in the schema layer. Because the format of the b-tree | ||
229 | layer is unchanged since version 2.1.0, the header string still | ||
230 | says version 2.1. | ||
231 | </p> | ||
232 | |||
233 | <p> | ||
234 | After the format string is a 4-byte integer used to determine the | ||
235 | byte-order of the database. The integer has a value of | ||
236 | 0xdae37528. If this number is expressed as 0xda, 0xe3, 0x75, 0x28, then | ||
237 | the database is in a big-endian format and all 16 and 32-bit integers | ||
238 | elsewhere in the b-tree layer are also big-endian. If the number is | ||
239 | expressed as 0x28, 0x75, 0xe3, and 0xda, then the database is in a | ||
240 | little-endian format and all other multi-byte numbers in the b-tree | ||
241 | layer are also little-endian. | ||
242 | Prior to version 2.6.3, the SQLite engine was only able to read databases | ||
243 | that used the same byte order as the processor they were running on. | ||
244 | But beginning with 2.6.3, SQLite can read or write databases in any | ||
245 | byte order. | ||
246 | </p> | ||
247 | |||
248 | <p> | ||
249 | After the byte-order code are six 4-byte integers. Each integer is in the | ||
250 | byte order determined by the byte-order code. The first integer is the | ||
251 | page number for the first page of the freelist. If there are no unused | ||
252 | pages in the database, then this integer is 0. The second integer is | ||
253 | the number of unused pages in the database. The last 4 integers are | ||
254 | not used by the b-tree layer. These are the so-called "meta" values that | ||
255 | are passed up to the schema layer | ||
256 | and used there for configuration and format version information. | ||
257 | All bytes of page 1 past beyond the meta-value integers are unused | ||
258 | and are initialized to zero. | ||
259 | </p> | ||
260 | |||
261 | <p> | ||
262 | Here is a summary of the information contained on page 1 in the b-tree layer: | ||
263 | </p> | ||
264 | |||
265 | <ul> | ||
266 | <li>48 byte header string</li> | ||
267 | <li>4 byte integer used to determine the byte-order</li> | ||
268 | <li>4 byte integer which is the first page of the freelist</li> | ||
269 | <li>4 byte integer which is the number of pages on the freelist</li> | ||
270 | <li>36 bytes of meta-data arranged as nine 4-byte integers</li> | ||
271 | <li>928 bytes of unused space</li> | ||
272 | </ul> | ||
273 | |||
274 | <h4>3.2 Structure Of A Single B-Tree Page</h4> | ||
275 | |||
276 | <p> | ||
277 | Conceptually, a b-tree page contains N database entries and N+1 pointers | ||
278 | to other b-tree pages. | ||
279 | </p> | ||
280 | |||
281 | <blockquote> | ||
282 | <table border=1 cellspacing=0 cellpadding=5> | ||
283 | <tr> | ||
284 | <td align="center">Ptr<br>0</td> | ||
285 | <td align="center">Entry<br>0</td> | ||
286 | <td align="center">Ptr<br>1</td> | ||
287 | <td align="center">Entry<br>1</td> | ||
288 | <td align="center"><b>...</b></td> | ||
289 | <td align="center">Ptr<br>N-1</td> | ||
290 | <td align="center">Entry<br>N-1</td> | ||
291 | <td align="center">Ptr<br>N</td> | ||
292 | </tr> | ||
293 | </table> | ||
294 | </blockquote> | ||
295 | |||
296 | <p> | ||
297 | The entries are arranged in increasing order. That is, the key to | ||
298 | Entry 0 is less than the key to Entry 1, and the key to Entry 1 is | ||
299 | less than the key of Entry 2, and so forth. The pointers point to | ||
300 | pages containing additional entries that have keys in between the | ||
301 | entries on either side. So Ptr 0 points to another b-tree page that | ||
302 | contains entries that all have keys less than Key 0, and Ptr 1 | ||
303 | points to a b-tree pages where all entries have keys greater than Key 0 | ||
304 | but less than Key 1, and so forth. | ||
305 | </p> | ||
306 | |||
307 | <p> | ||
308 | Each b-tree page in SQLite consists of a header, zero or more "cells" | ||
309 | each holding a single entry and pointer, and zero or more "free blocks" | ||
310 | that represent unused space on the page. | ||
311 | </p> | ||
312 | |||
313 | <p> | ||
314 | The header on a b-tree page is the first 8 bytes of the page. | ||
315 | The header contains the value | ||
316 | of the right-most pointer (Ptr N) and the byte offset into the page | ||
317 | of the first cell and the first free block. The pointer is a 32-bit | ||
318 | value and the offsets are each 16-bit values. We have: | ||
319 | </p> | ||
320 | |||
321 | <blockquote> | ||
322 | <table border=1 cellspacing=0 cellpadding=5> | ||
323 | <tr> | ||
324 | <td align="center" width=30>0</td> | ||
325 | <td align="center" width=30>1</td> | ||
326 | <td align="center" width=30>2</td> | ||
327 | <td align="center" width=30>3</td> | ||
328 | <td align="center" width=30>4</td> | ||
329 | <td align="center" width=30>5</td> | ||
330 | <td align="center" width=30>6</td> | ||
331 | <td align="center" width=30>7</td> | ||
332 | </tr> | ||
333 | <tr> | ||
334 | <td align="center" colspan=4>Ptr N</td> | ||
335 | <td align="center" colspan=2>Cell 0</td> | ||
336 | <td align="center" colspan=2>Freeblock 0</td> | ||
337 | </tr> | ||
338 | </table> | ||
339 | </blockquote> | ||
340 | |||
341 | <p> | ||
342 | The 1016 bytes of a b-tree page that come after the header contain | ||
343 | cells and freeblocks. All 1016 bytes are covered by either a cell | ||
344 | or a freeblock. | ||
345 | </p> | ||
346 | |||
347 | <p> | ||
348 | The cells are connected in a linked list. Cell 0 contains Ptr 0 and | ||
349 | Entry 0. Bytes 4 and 5 of the header point to Cell 0. Cell 0 then | ||
350 | points to Cell 1 which contains Ptr 1 and Entry 1. And so forth. | ||
351 | Cells vary in size. Every cell has a 12-byte header and at least 4 | ||
352 | bytes of payload space. Space is allocated to payload in increments | ||
353 | of 4 bytes. Thus the minimum size of a cell is 16 bytes and up to | ||
354 | 63 cells can fit on a single page. The size of a cell is always a multiple | ||
355 | of 4 bytes. | ||
356 | A cell can have up to 238 bytes of payload space. If | ||
357 | the payload is more than 238 bytes, then an additional 4 byte page | ||
358 | number is appended to the cell which is the page number of the first | ||
359 | overflow page containing the additional payload. The maximum size | ||
360 | of a cell is thus 254 bytes, meaning that a least 4 cells can fit into | ||
361 | the 1016 bytes of space available on a b-tree page. | ||
362 | An average cell is usually around 52 to 100 bytes in size with about | ||
363 | 10 or 20 cells to a page. | ||
364 | </p> | ||
365 | |||
366 | <p> | ||
367 | The data layout of a cell looks like this: | ||
368 | </p> | ||
369 | |||
370 | <blockquote> | ||
371 | <table border=1 cellspacing=0 cellpadding=5> | ||
372 | <tr> | ||
373 | <td align="center" width=20>0</td> | ||
374 | <td align="center" width=20>1</td> | ||
375 | <td align="center" width=20>2</td> | ||
376 | <td align="center" width=20>3</td> | ||
377 | <td align="center" width=20>4</td> | ||
378 | <td align="center" width=20>5</td> | ||
379 | <td align="center" width=20>6</td> | ||
380 | <td align="center" width=20>7</td> | ||
381 | <td align="center" width=20>8</td> | ||
382 | <td align="center" width=20>9</td> | ||
383 | <td align="center" width=20>10</td> | ||
384 | <td align="center" width=20>11</td> | ||
385 | <td align="center" width=100>12 ... 249</td> | ||
386 | <td align="center" width=20>250</td> | ||
387 | <td align="center" width=20>251</td> | ||
388 | <td align="center" width=20>252</td> | ||
389 | <td align="center" width=20>253</td> | ||
390 | </tr> | ||
391 | <tr> | ||
392 | <td align="center" colspan=4>Ptr</td> | ||
393 | <td align="center" colspan=2>Keysize<br>(low)</td> | ||
394 | <td align="center" colspan=2>Next</td> | ||
395 | <td align="center" colspan=1>Ksz<br>(hi)</td> | ||
396 | <td align="center" colspan=1>Dsz<br>(hi)</td> | ||
397 | <td align="center" colspan=2>Datasize<br>(low)</td> | ||
398 | <td align="center" colspan=1>Payload</td> | ||
399 | <td align="center" colspan=4>Overflow<br>Pointer</td> | ||
400 | </tr> | ||
401 | </table> | ||
402 | </blockquote> | ||
403 | |||
404 | <p> | ||
405 | The first four bytes are the pointer. The size of the key is a 24-bit | ||
406 | where the upper 8 bits are taken from byte 8 and the lower 16 bits are | ||
407 | taken from bytes 4 and 5 (or bytes 5 and 4 on little-endian machines.) | ||
408 | The size of the data is another 24-bit value where the upper 8 bits | ||
409 | are taken from byte 9 and the lower 16 bits are taken from bytes 10 and | ||
410 | 11 or 11 and 10, depending on the byte order. Bytes 6 and 7 are the | ||
411 | offset to the next cell in the linked list of all cells on the current | ||
412 | page. This offset is 0 for the last cell on the page. | ||
413 | </p> | ||
414 | |||
415 | <p> | ||
416 | The payload itself can be any number of bytes between 1 and 1048576. | ||
417 | But space to hold the payload is allocated in 4-byte chunks up to | ||
418 | 238 bytes. If the entry contains more than 238 bytes of payload, then | ||
419 | additional payload data is stored on a linked list of overflow pages. | ||
420 | A 4 byte page number is appended to the cell that contains the first | ||
421 | page of this linked list. | ||
422 | </p> | ||
423 | |||
424 | <p> | ||
425 | Each overflow page begins with a 4-byte value which is the | ||
426 | page number of the next overflow page in the list. This value is | ||
427 | 0 for the last page in the list. The remaining | ||
428 | 1020 bytes of the overflow page are available for storing payload. | ||
429 | Note that a full page is allocated regardless of the number of overflow | ||
430 | bytes stored. Thus, if the total payload for an entry is 239 bytes, | ||
431 | the first 238 are stored in the cell and the overflow page stores just | ||
432 | one byte. | ||
433 | </p> | ||
434 | |||
435 | <p> | ||
436 | The structure of an overflow page looks like this: | ||
437 | </p> | ||
438 | |||
439 | <blockquote> | ||
440 | <table border=1 cellspacing=0 cellpadding=5> | ||
441 | <tr> | ||
442 | <td align="center" width=20>0</td> | ||
443 | <td align="center" width=20>1</td> | ||
444 | <td align="center" width=20>2</td> | ||
445 | <td align="center" width=20>3</td> | ||
446 | <td align="center" width=200>4 ... 1023</td> | ||
447 | </tr> | ||
448 | <tr> | ||
449 | <td align="center" colspan=4>Next Page</td> | ||
450 | <td align="center" colspan=1>Overflow Data</td> | ||
451 | </tr> | ||
452 | </table> | ||
453 | </blockquote> | ||
454 | |||
455 | <p> | ||
456 | All space on a b-tree page which is not used by the header or by cells | ||
457 | is filled by freeblocks. Freeblocks, like cells, are variable in size. | ||
458 | The size of a freeblock is at least 4 bytes and is always a multiple of | ||
459 | 4 bytes. | ||
460 | The first 4 bytes contain a header and the remaining bytes | ||
461 | are unused. The structure of the freeblock is as follows: | ||
462 | </p> | ||
463 | |||
464 | <blockquote> | ||
465 | <table border=1 cellspacing=0 cellpadding=5> | ||
466 | <tr> | ||
467 | <td align="center" width=20>0</td> | ||
468 | <td align="center" width=20>1</td> | ||
469 | <td align="center" width=20>2</td> | ||
470 | <td align="center" width=20>3</td> | ||
471 | <td align="center" width=200>4 ... 1015</td> | ||
472 | </tr> | ||
473 | <tr> | ||
474 | <td align="center" colspan=2>Size</td> | ||
475 | <td align="center" colspan=2>Next</td> | ||
476 | <td align="center" colspan=1>Unused</td> | ||
477 | </tr> | ||
478 | </table> | ||
479 | </blockquote> | ||
480 | |||
481 | <p> | ||
482 | Freeblocks are stored in a linked list in increasing order. That is | ||
483 | to say, the first freeblock occurs at a lower index into the page than | ||
484 | the second free block, and so forth. The first 2 bytes of the header | ||
485 | are an integer which is the total number of bytes in the freeblock. | ||
486 | The second 2 bytes are the index into the page of the next freeblock | ||
487 | in the list. The last freeblock has a Next value of 0. | ||
488 | </p> | ||
489 | |||
490 | <p> | ||
491 | When a new b-tree is created in a database, the root page of the b-tree | ||
492 | consist of a header and a single 1016 byte freeblock. As entries are | ||
493 | added, space is carved off of that freeblock and used to make cells. | ||
494 | When b-tree entries are deleted, the space used by their cells is converted | ||
495 | into freeblocks. Adjacent freeblocks are merged, but the page can still | ||
496 | become fragmented. The b-tree code will occasionally try to defragment | ||
497 | the page by moving all cells to the beginning and constructing a single | ||
498 | freeblock at the end to take up all remaining space. | ||
499 | </p> | ||
500 | |||
501 | <h4>3.3 The B-Tree Free Page List</h4> | ||
502 | |||
503 | <p> | ||
504 | When information is removed from an SQLite database such that one or | ||
505 | more pages are no longer needed, those pages are added to a list of | ||
506 | free pages so that they can be reused later when new information is | ||
507 | added. This subsection describes the structure of this freelist. | ||
508 | </p> | ||
509 | |||
510 | <p> | ||
511 | The 32-bit integer beginning at byte-offset 52 in page 1 of the database | ||
512 | contains the address of the first page in a linked list of free pages. | ||
513 | If there are no free pages available, this integer has a value of 0. | ||
514 | The 32-bit integer at byte-offset 56 in page 1 contains the number of | ||
515 | free pages on the freelist. | ||
516 | </p> | ||
517 | |||
518 | <p> | ||
519 | The freelist contains a trunk and many branches. The trunk of | ||
520 | the freelist is composed of overflow pages. That is to say, each page | ||
521 | contains a single 32-bit integer at byte offset 0 which | ||
522 | is the page number of the next page on the freelist trunk. | ||
523 | The payload area | ||
524 | of each trunk page is used to record pointers to branch pages. | ||
525 | The first 32-bit integer in the payload area of a trunk page | ||
526 | is the number of branch pages to follow (between 0 and 254) | ||
527 | and each subsequent 32-bit integer is a page number for a branch page. | ||
528 | The following diagram shows the structure of a trunk freelist page: | ||
529 | </p> | ||
530 | |||
531 | <blockquote> | ||
532 | <table border=1 cellspacing=0 cellpadding=5> | ||
533 | <tr> | ||
534 | <td align="center" width=20>0</td> | ||
535 | <td align="center" width=20>1</td> | ||
536 | <td align="center" width=20>2</td> | ||
537 | <td align="center" width=20>3</td> | ||
538 | <td align="center" width=20>4</td> | ||
539 | <td align="center" width=20>5</td> | ||
540 | <td align="center" width=20>6</td> | ||
541 | <td align="center" width=20>7</td> | ||
542 | <td align="center" width=200>8 ... 1023</td> | ||
543 | </tr> | ||
544 | <tr> | ||
545 | <td align="center" colspan=4>Next trunk page</td> | ||
546 | <td align="center" colspan=4># of branch pages</td> | ||
547 | <td align="center" colspan=1>Page numbers for branch pages</td> | ||
548 | </tr> | ||
549 | </table> | ||
550 | </blockquote> | ||
551 | |||
552 | <p> | ||
553 | It is important to note that only the pages on the trunk of the freelist | ||
554 | contain pointers to other pages. The branch pages contain no | ||
555 | data whatsoever. The fact that the branch pages are completely | ||
556 | blank allows for an important optimization in the paging layer. When | ||
557 | a branch page is removed from the freelist to be reused, it is not | ||
558 | necessary to write the original content of that page into the rollback | ||
559 | journal. The branch page contained no data to begin with, so there is | ||
560 | no need to restore the page in the event of a rollback. Similarly, | ||
561 | when a page is not longer needed and is added to the freelist as a branch | ||
562 | page, it is not necessary to write the content of that page | ||
563 | into the database file. | ||
564 | Again, the page contains no real data so it is not necessary to record the | ||
565 | content of that page. By reducing the amount of disk I/O required, | ||
566 | these two optimizations allow some database operations | ||
567 | to go four to six times faster than they would otherwise. | ||
568 | </p> | ||
569 | |||
570 | <h3>4.0 The Schema Layer</h3> | ||
571 | |||
572 | <p> | ||
573 | The schema layer implements an SQL database on top of one or more | ||
574 | b-trees and keeps track of the root page numbers for all b-trees. | ||
575 | Where the b-tree layer provides only unformatted data storage with | ||
576 | a unique key, the schema layer allows each entry to contain multiple | ||
577 | columns. The schema layer also allows indices and non-unique key values. | ||
578 | </p> | ||
579 | |||
580 | <p> | ||
581 | The schema layer implements two separate data storage abstractions: | ||
582 | tables and indices. Each table and each index uses its own b-tree | ||
583 | but they use the b-tree capabilities in different ways. For a table, | ||
584 | the b-tree key is a unique 4-byte integer and the b-tree data is the | ||
585 | content of the table row, encoded so that columns can be separately | ||
586 | extracted. For indices, the b-tree key varies in size depending on the | ||
587 | size of the fields being indexed and the b-tree data is empty. | ||
588 | </p> | ||
589 | |||
590 | <h4>4.1 SQL Table Implementation Details</h4> | ||
591 | |||
592 | <p>Each row of an SQL table is stored in a single b-tree entry. | ||
593 | The b-tree key is a 4-byte big-endian integer that is the ROWID | ||
594 | or INTEGER PRIMARY KEY for that table row. | ||
595 | The key is stored in a big-endian format so | ||
596 | that keys will sort in numerical order using memcmp() function.</p> | ||
597 | |||
598 | <p>The content of a table row is stored in the data portion of | ||
599 | the corresponding b-tree table. The content is encoded to allow | ||
600 | individual columns of the row to be extracted as necessary. Assuming | ||
601 | that the table has N columns, the content is encoded as N+1 offsets | ||
602 | followed by N column values, as follows: | ||
603 | </p> | ||
604 | |||
605 | <blockquote> | ||
606 | <table border=1 cellspacing=0 cellpadding=5> | ||
607 | <tr> | ||
608 | <td>offset 0</td> | ||
609 | <td>offset 1</td> | ||
610 | <td><b>...</b></td> | ||
611 | <td>offset N-1</td> | ||
612 | <td>offset N</td> | ||
613 | <td>value 0</td> | ||
614 | <td>value 1</td> | ||
615 | <td><b>...</b></td> | ||
616 | <td>value N-1</td> | ||
617 | </tr> | ||
618 | </table> | ||
619 | </blockquote> | ||
620 | |||
621 | <p> | ||
622 | The offsets can be either 8-bit, 16-bit, or 24-bit integers depending | ||
623 | on how much data is to be stored. If the total size of the content | ||
624 | is less than 256 bytes then 8-bit offsets are used. If the total size | ||
625 | of the b-tree data is less than 65536 then 16-bit offsets are used. | ||
626 | 24-bit offsets are used otherwise. Offsets are always little-endian, | ||
627 | which means that the least significant byte occurs first. | ||
628 | </p> | ||
629 | |||
630 | <p> | ||
631 | Data is stored as a nul-terminated string. Any empty string consists | ||
632 | of just the nul terminator. A NULL value is an empty string with no | ||
633 | nul-terminator. Thus a NULL value occupies zero bytes and an empty string | ||
634 | occupies 1 byte. | ||
635 | </p> | ||
636 | |||
637 | <p> | ||
638 | Column values are stored in the order that they appear in the CREATE TABLE | ||
639 | statement. The offsets at the beginning of the record contain the | ||
640 | byte index of the corresponding column value. Thus, Offset 0 contains | ||
641 | the byte index for Value 0, Offset 1 contains the byte offset | ||
642 | of Value 1, and so forth. The number of bytes in a column value can | ||
643 | always be found by subtracting offsets. This allows NULLs to be | ||
644 | recovered from the record unambiguously. | ||
645 | </p> | ||
646 | |||
647 | <p> | ||
648 | Most columns are stored in the b-tree data as described above. | ||
649 | The one exception is column that has type INTEGER PRIMARY KEY. | ||
650 | INTEGER PRIMARY KEY columns correspond to the 4-byte b-tree key. | ||
651 | When an SQL statement attempts to read the INTEGER PRIMARY KEY, | ||
652 | the 4-byte b-tree key is read rather than information out of the | ||
653 | b-tree data. But there is still an Offset associated with the | ||
654 | INTEGER PRIMARY KEY, just like any other column. But the Value | ||
655 | associated with that offset is always NULL. | ||
656 | </p> | ||
657 | |||
658 | <h4>4.2 SQL Index Implementation Details</h4> | ||
659 | |||
660 | <p> | ||
661 | SQL indices are implement using a b-tree in which the key is used | ||
662 | but the data is always empty. The purpose of an index is to map | ||
663 | one or more column values into the ROWID for the table entry that | ||
664 | contains those column values. | ||
665 | </p> | ||
666 | |||
667 | <p> | ||
668 | Each b-tree in an index consists of one or more column values followed | ||
669 | by a 4-byte ROWID. Each column value is nul-terminated (even NULL values) | ||
670 | and begins with a single character that indicates the datatype for that | ||
671 | column value. Only three datatypes are supported: NULL, Number, and | ||
672 | Text. NULL values are encoded as the character 'a' followed by the | ||
673 | nul terminator. Numbers are encoded as the character 'b' followed by | ||
674 | a string that has been crafted so that sorting the string using memcmp() | ||
675 | will sort the corresponding numbers in numerical order. (See the | ||
676 | sqliteRealToSortable() function in util.c of the SQLite sources for | ||
677 | additional information on this encoding.) Numbers are also nul-terminated. | ||
678 | Text values consists of the character 'c' followed by a copy of the | ||
679 | text string and a nul-terminator. These encoding rules result in | ||
680 | NULLs being sorted first, followed by numerical values in numerical | ||
681 | order, followed by text values in lexicographical order. | ||
682 | </p> | ||
683 | |||
684 | <h4>4.4 SQL Schema Storage And Root B-Tree Page Numbers</h4> | ||
685 | |||
686 | <p> | ||
687 | The database schema is stored in the database in a special tabled named | ||
688 | "sqlite_master" and which always has a root b-tree page number of 2. | ||
689 | This table contains the original CREATE TABLE, | ||
690 | CREATE INDEX, CREATE VIEW, and CREATE TRIGGER statements used to define | ||
691 | the database to begin with. Whenever an SQLite database is opened, | ||
692 | the sqlite_master table is scanned from beginning to end and | ||
693 | all the original CREATE statements are played back through the parser | ||
694 | in order to reconstruct an in-memory representation of the database | ||
695 | schema for use in subsequent command parsing. For each CREATE TABLE | ||
696 | and CREATE INDEX statement, the root page number for the corresponding | ||
697 | b-tree is also recorded in the sqlite_master table so that SQLite will | ||
698 | know where to look for the appropriate b-tree. | ||
699 | </p> | ||
700 | |||
701 | <p> | ||
702 | SQLite users can query the sqlite_master table just like any other table | ||
703 | in the database. But the sqlite_master table cannot be directly written. | ||
704 | The sqlite_master table is automatically updated in response to CREATE | ||
705 | and DROP statements but it cannot be changed using INSERT, UPDATE, or | ||
706 | DELETE statements as that would risk corrupting the database. | ||
707 | </p> | ||
708 | |||
709 | <p> | ||
710 | SQLite stores temporary tables and indices in a separate | ||
711 | file from the main database file. The temporary table database file | ||
712 | is the same structure as the main database file. The schema table | ||
713 | for the temporary tables is stored on page 2 just as in the main | ||
714 | database. But the schema table for the temporary database named | ||
715 | "sqlite_temp_master" instead of "sqlite_master". Other than the | ||
716 | name change, it works exactly the same. | ||
717 | </p> | ||
718 | |||
719 | <h4>4.4 Schema Version Numbering And Other Meta-Information</h4> | ||
720 | |||
721 | <p> | ||
722 | The nine 32-bit integers that are stored beginning at byte offset | ||
723 | 60 of Page 1 in the b-tree layer are passed up into the schema layer | ||
724 | and used for versioning and configuration information. The meaning | ||
725 | of the first four integers is shown below. The other five are currently | ||
726 | unused. | ||
727 | </p> | ||
728 | |||
729 | <ol> | ||
730 | <li>The schema version number</li> | ||
731 | <li>The format version number</li> | ||
732 | <li>The recommended pager cache size</li> | ||
733 | <li>The safety level</li> | ||
734 | </ol> | ||
735 | |||
736 | <p> | ||
737 | The first meta-value, the schema version number, is used to detect when | ||
738 | the schema of the database is changed by a CREATE or DROP statement. | ||
739 | Recall that when a database is first opened the sqlite_master table is | ||
740 | scanned and an internal representation of the tables, indices, views, | ||
741 | and triggers for the database is built in memory. This internal | ||
742 | representation is used for all subsequent SQL command parsing and | ||
743 | execution. But what if another process were to change the schema | ||
744 | by adding or removing a table, index, view, or trigger? If the original | ||
745 | process were to continue using the old schema, it could potentially | ||
746 | corrupt the database by writing to a table that no longer exists. | ||
747 | To avoid this problem, the schema version number is changed whenever | ||
748 | a CREATE or DROP statement is executed. Before each command is | ||
749 | executed, the current schema version number for the database file | ||
750 | is compared against the schema version number from when the sqlite_master | ||
751 | table was last read. If those numbers are different, the internal | ||
752 | schema representation is erased and the sqlite_master table is reread | ||
753 | to reconstruct the internal schema representation. | ||
754 | (Calls to sqlite_exec() generally return SQLITE_SCHEMA when this happens.) | ||
755 | </p> | ||
756 | |||
757 | <p> | ||
758 | The second meta-value is the schema format version number. This | ||
759 | number tells what version of the schema layer should be used to | ||
760 | interpret the file. There have been changes to the schema layer | ||
761 | over time and this number is used to detect when an older database | ||
762 | file is being processed by a newer version of the library. | ||
763 | As of this writing (SQLite version 2.7.0) the current format version | ||
764 | is "4". | ||
765 | </p> | ||
766 | |||
767 | <p> | ||
768 | The third meta-value is the recommended pager cache size as set | ||
769 | by the DEFAULT_CACHE_SIZE pragma. If the value is positive it | ||
770 | means that synchronous behavior is enable (via the DEFAULT_SYNCHRONOUS | ||
771 | pragma) and if negative it means that synchronous behavior is | ||
772 | disabled. | ||
773 | </p> | ||
774 | |||
775 | <p> | ||
776 | The fourth meta-value is safety level added in version 2.8.0. | ||
777 | A value of 1 corresponds to a SYNCHRONOUS setting of OFF. In other | ||
778 | words, SQLite does not pause to wait for journal data to reach the disk | ||
779 | surface before overwriting pages of the database. A value of 2 corresponds | ||
780 | to a SYNCHRONOUS setting of NORMAL. A value of 3 corresponds to a | ||
781 | SYNCHRONOUS setting of FULL. If the value is 0, that means it has not | ||
782 | been initialized so the default synchronous setting of NORMAL is used. | ||
783 | </p> | ||
784 | } | ||
785 | footer $rcsid | ||