; ACL2 Version 7.1 -- A Computational Logic for Applicative Common Lisp
; Copyright (C) 2015, Regents of the University of Texas

; This version of ACL2 is a descendent of ACL2 Version 1.9, Copyright
; (C) 1997 Computational Logic, Inc.  See the documentation topic NOTE-2-0.

; This program is free software; you can redistribute it and/or modify
; it under the terms of the LICENSE file distributed with ACL2.

; This program is distributed in the hope that it will be useful,
; but WITHOUT ANY WARRANTY; without even the implied warranty of
; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
; LICENSE for more details.

; Regarding authorship of ACL2 in general:

; Written by:  Matt Kaufmann and J Strother Moore
; email:       Kaufmann@cs.utexas.edu and Moore@cs.utexas.edu
; Department of Computer Science
; University of Texas at Austin
; Austin, TX 78712 U.S.A.

; serialize-raw.lisp -- a scheme for serializing ACL2 objects to disk.

; This file was initially developed and contributed by Jared Davis on behalf of
; Centaur Technology.  Comments referring to "I" or "my" are from Jared.  This
; file is now maintained by the ACL2 authors (see above).

(in-package "ACL2")

; INTRODUCTION
;
; We now develop a serialization scheme that allows ACL2 objects to be saved to
; disk using a compact, structure-shared, binary encoding.
;
; We configure ACL2's print-object$ function so that it writes objects with our
; encoding scheme when writing certificate files.  This allows large objects
; produced by make-event to be read and written efficiently.
;
; We extend the ACL2 readtable so that serialized objects can be read at any
; time, using the extended reader macros #Y[...] and #Z[...].  These macros are
; almost identical, but
;
;    #Y rebuilds the object entirely with CONS and does not restore any
;       fast alists, whereas
;
;    #Z uses HONS for the parts of the structure that were originally normed,
;       and rebuilds the hash tables for fast alists.
;
; We provide routines for reading and writing ACL2 objects as individual files,
; typically with a ".sao" extension for "Serialized ACL2 Object".  For
; bootstrapping reasons, these are introduced in hons.lisp and hons-raw.lisp
; instead of here in serialize-raw.lisp.

; Essay on Bad Objects and Serialize
;
; When we decode serialized objects, must we ensure that the object returned is
; a valid ACL2 object, i.e., not something that BAD-LISP-OBJECTP would reject?
;
; Matt and Jared think the answer is "no" for the reader macros.  Why?  These
; macros are just extensions of certain readtables like the *acl2-readtable*,
; which are used by READ to interpret input.  But there are already any number
; of other ways for READ to produce bad objects; for instance it might produce
; floating point numbers, symbols in a foreign package, vectors, structures,
; etc.  The "Remarks on *acl2-readtable*" in acl2.lisp has more discussion of
; these matters.  At any rate, whenever ACL2 is using READ, it already needs to
; be defending against bad objects, so it should be okay if the serialize
; reader macros generate bad objects.
;
; However, Jared thinks that the serialize-read-fn function *is* responsible
; for ensuring that only good objects are read, because it is a "new" way for
; input to enter the system.
;
; Well, the serialized object format is considerably more restrictive than the
; Common Lisp reader, and does not provide any way to encode floats, circular
; objects, etc.  Jared thinks the only bad objects that can be produced from
; serialized reading are symbols in unknown packages.
; So, in serialize-read-file we pass in a flag that makes sure we check whether
; packages are known.  We think this is sufficient to justify not checking
; BAD-LISP-OBJECTP explicitly.

; Essay on the Serialize Object Format
;
; Our scheme involves encoding ACL2 objects into a fairly simple, byte-based,
; binary format.
;
; There are actually three different formats of serialized objects, named V1,
; V2 and V3.  For compatibility with previously written files, we support
; reading from all three file formats.  But we always write files using the
; newest, V3 format.
;
; Why these different versions?
;
;   V1.  We originally developed our serialization scheme as a ttag-based
;        library in :dir :system, but this left us with no way to tightly
;        integrate it into ACL2's certificate printing/reading routines.
;
;   V2.  When we moved serialization into ACL2 proper, we noticed a few things
;        that we could improve, and tweaked the serialization scheme.  The new
;        scheme, V2, wasn't compatible with the community book
;        books/serialize, but we already had many files written in the old V1
;        format.
;
;   V3.  Later, we realized that it would be easy to restore fast alists within
;        serialized objects, and that this would allow the fast alists defined
;        in a book to still be fast after including the book.  The new format,
;        V3, added this feature, but again we had lots of V2 files that we
;        still wanted to be able to read.
;
; Eventually we should be able to drop support for the old versions, but the
; formats are all very similar so supporting them is not that much work.  We
; begin with the V1 format.  A sequence of objects is encoded as an OBJECT:
; first MAGIC, then LEN, then the indicated homogeneous sequences of objects,
; and then MAGIC.
;
;   OBJECT ::= MAGIC                      ; marker for sanity checking
;              LEN                        ; total number of objects
;              NATS RATS COMPLEXES CHARS  ; object data
;              STRS PACKAGES CONSES       ;
;              MAGIC                      ; marker for sanity checking
;
;   NATS      ::= LEN NAT*      ; number of nats, followed by that many nats
;   RATS      ::= LEN RAT*      ; number of rats, followed by that many rats
;   COMPLEXES ::= LEN COMPLEX*  ; etc.
;   CHARS     ::= LEN CHAR*
;   STRS      ::= LEN STR*
;   PACKAGES  ::= LEN PACKAGE*
;   CONSES    ::= LEN CONS*
;
;   RAT     ::= NAT NAT NAT     ; sign (0 or 1), numerator, denominator
;   COMPLEX ::= RAT RAT         ; real, imaginary parts
;   CHAR    ::= byte            ; the character code for this character
;   STR     ::= LEN CHAR*       ; length and then its characters
;   PACKAGE ::= STR LEN STR*    ; package name, number of symbols, symbol names
;   CONS    ::= NAT NAT         ; "index" of its car and cdr (see below)
;
;   LEN   ::= NAT               ; just to show when we're referring to a length
;
;   MAGIC ::= #xAC120BC7        ; also see the discussion below
;
;   NAT   ::= [see below]
;
; You can experiment with these formats, for example as follows in raw Lisp.
;
;   (let ((str (with-output-to-string (s) (ser-encode-to-stream 5/2 s))))
;     (loop for i from 0 to (1- (length str))
;           collect (char-code (char str i))))
;
;
; Magic Numbers.  The magic number, #xAC120BC7, is a 32-bit integer that sort
; of looks like "ACL2 OBCT", i.e., "ACL2 Object."  This use of magic numbers is
; probably silly, but may have some advantages:
;
;   (1) It lets us do a basic sanity check.
;
;   (2) When serialize is used to write out whole files, it helps to ensure the
;       file doesn't start with valid ASCII characters.  This *might* help
;       protect these files from tampering by programs that convert newline
;       characters in text files (e.g., FTP programs).
;
;   (3) It gives us the option of tweaking our encoding.
;       Today we use distinct magic numbers to identify V1, V2, and V3 files,
;       and in the future we could add additional encodings by adding other
;       magic numbers.
;
;
; Naturals.  Our representation of NAT is slightly involved since we need to
; support arbitrary-sized natural numbers.  We use a variable-length encoding
; where the most significant bit of each byte is 0 if this is the last piece of
; the number, or 1 if there are additional bytes, and the other 7 bits are data
; bits.  The bytes are kept in little-endian order.  For example:
;
;   0     is encoded as #x00            (continue bit: 0, data: 0)
;   2     is encoded as #x02            (continue bit: 0, data: 2)
;   ...
;   127   is encoded as #x7F            (continue bit: 0, data: 127)
;   128   is encoded as #x80 #x01       [(c: 1, d: 0) = 0] + 128*[(c: 0, d: 1) = 1]
;   129   is encoded as #x81 #x01       [(c: 1, d: 1) = 1] + 128*[(c: 0, d: 1) = 1]
;   519   is encoded as #x87 #x04       [(c: 1, d: 7) = 7] + 128*[(c: 0, d: 4) = 4]
;   16383 is encoded as #xFF #x7F
;             [(c: 1, d: 127) = 127] + 128 * [(c: 0, d: 127) = 127]
;   16384 is encoded as #x80 #x80 #x01
;             [(c: 1, d: 0) = 0] + 128 * [(c: 0, d: 0) = 0]
;                                + 128^2*[(c: 0, d: 1) = 1]
;   16387 is encoded as #x83 #x80 #x01
;             [(c: 1, d: 3) = 3] + 128 * [(c: 0, d: 0) = 0]
;                                + 128^2*[(c: 0, d: 1) = 1]
;   ...
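;
; For instance, here is one way to look at the bytes for a particular natural
; in raw Lisp (an added sketch in the spirit of the ser-encode-to-stream
; experiment above, assuming the ser-encode-nat macro defined later in this
; file):
;
;   (let ((str (with-output-to-string (s) (ser-encode-nat 519 s))))
;     (loop for i from 0 to (1- (length str))
;           collect (char-code (char str i))))
;   ==> (135 4)   ; i.e., #x87 #x04, matching the table above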
;
;
; Negative Integers.  Negative integers aren't mentioned in the file format
; because we encode them as rationals with denominator 1.  This only requires 2
; bytes of overhead (for the sign and denominator) beyond the magnitude of the
; integer, which seems acceptable since negative integers aren't especially
; frequent.
;
;
; Conses.  Every object in the file has an (implicit) index, determined by its
; position in the file.  The naturals are given the smallest indices 0, 1, 2,
; and so on.  Supposing there are N naturals, the rationals will have indices
; N, N+1, etc.  After that we have the complexes, then the characters, strings,
; symbols, and finally the conses.
;
; We encode conses using these indices.  For instance, suppose the first two
; natural numbers in our file are 37 and 55.  Since we start our indexing with
; the naturals, 37 will have index 0 and 55 will have index 1.  Then, we can
; encode the cons (37 . 55) by just writing down these two indices, e.g., #x00
; #x01.
;
; We insist that sub-trees of conses come first in the file, so as we are
; decoding the file, whenever we construct a cons we can make sure its indices
; refer to already-constructed objects.
;
; The object with the maximal index is "the object" that has been saved,
; and is returned by the #Y and #Z readers, or by the whole-file reader,
; serialize-read.
;
;
; The V2 Format.  The V2 format is almost the same as the V1 format, but with
; the following changes that allow us to restore the normedness of conses.
;
;   (1) The magic number changes to #xAC120BC8, so we know which format is
;       being used,
;
;   (2) We tweak the way indices are handled so that NIL and T are implicitly
;       given index 0 and 1, respectively (i.e., we do not actually write out
;       those symbols), which can sometimes slightly improve compression for
;       cons structures that have lots of NILs and Ts within them (since
;       without this tweak, NIL and T might have large indices that are
;       represented using many bytes).
;
;   (3) We change the way conses are represented so we can mark which conses
;       were normed.  Instead of recording a cons by writing down its car-index
;       and cdr-index verbatim, we now instead write down:
;
;          (car-index << 1) | (if honsp 1 0), cdr-index
;
;       Because of the way we encode naturals, this neatly only costs an extra
;       byte if the car-index happens to have an integer length that is a
;       multiple of 7.  (To see this, note that our encoding is essentially a
;       base-128 encoding, so we need an extra "digit" only when the top 7-bit
;       "digit" has its top bit set to 1.)
;
;   (4) We replace LEN: instead of the total number of objects, we record the
;       maximum index of the object to be read.  Usually this just means that
;       instead of LEN we record LEN - 1.  But it allows us to detect the
;       special case of NIL, where the object being encoded is not necessarily
;       at LEN - 1.
;
;
; The V3 format.  The V3 format is almost the same as the V2 format, but with
; the following changes that allow us to restore fast alists.
;
;   (1) The magic number changes to #xAC120BC9, and
;
;   (2) We extend the OBJECT format with an extra FALS field that comes
;       after the conses.  Note that LEN is unchanged and does not count
;       the FALS.
;
;       OBJECT ::= MAGIC                      ; marker for sanity checking
;                  LEN                        ; total number of objects
;                  NATS RATS COMPLEXES CHARS  ; object data
;                  STRS PACKAGES CONSES       ;
;                  FALS
;                  MAGIC                      ; marker for sanity checking
;
;       FALS ::= FAL* 0                       ; zero-terminated list
;
;       FAL  ::= NAT NAT                      ; index and hash-table-count
;
;   (3) We change the encoding of STR so that we can mark which strings were
;       normed.  In place of recording the string's LEN, we instead record:
;
;          (len << 1) | (if honsp 1 0)
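;
; To make the V2/V3 cons encoding concrete (an added illustration, not part of
; the original design notes): a normed cons whose car has index 70 and whose
; cdr has index 3 would be written as the two naturals
;
;    (70 << 1) | 1 = 141    and    3
;
; which, by the NAT encoding above, become the bytes #x8D #x01 #x03.  In raw
; Lisp, the first of these values is just (logior (ash 70 1) 1).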
; -----------------------------------------------------------------------------
;
;                                PRELIMINARIES
;
; -----------------------------------------------------------------------------

(defparameter *ser-verbose* nil)

(defmacro ser-time? (form)
  `(if *ser-verbose*
       (time ,form)
     ,form))

(defmacro ser-print? (msg &rest args)
  `(when *ser-verbose*
     (format t ,msg . ,args)))

; To make it easy to switch the kind of input/output stream being used, all of
; our stream reading/writing is done with the following macros.
;
; In previous versions of serialize we used binary streams and wrote/read from
; them with write/read-byte on most Lisps; on CCL we used memory-mapped files
; for better performance while reading.  But we had to switch to using ordinary
; character streams to get compatibility with the Common Lisp reader (where we
; use #Y and #Z), which reads character streams.

(defmacro ser-write-char (x stream)
  `(write-char (the character ,x) ,stream))

(defmacro ser-write-byte (x stream)
  `(ser-write-char (code-char (the (unsigned-byte 8) ,x)) ,stream))

(defmacro ser-read-char (stream)
; Note that Lisp's read-char causes an end-of-file error (HyperSpec says that
; it "is signaled") if EOF is reached, so we don't have to detect unexpected
; EOFs in our decoding routines.
  `(the character (read-char ,stream)))

(defmacro ser-read-byte (stream)
; How do we know that the char-code is 8 bits when reading a file stream?  We
; try to enforce this in acl2-set-character-encoding.  The magic number might
; save us if this changes, since its first byte is #xAC = 172 > 127 and will
; presumably be misread if we are using a file character encoding other than
; iso-8859-1.
  `(the (unsigned-byte 8) (char-code (ser-read-char ,stream))))

(defun ser-encode-magic (stream)
  ;; We only write V3 files now, so we write AC120BC9 instead of C8 or C7.
  (ser-write-byte #xAC stream)
  (ser-write-byte #x12 stream)
  (ser-write-byte #x0B stream)
  (ser-write-byte #xC9 stream))

(defun ser-decode-magic (stream)
  ;; Returns :V1, :V2, or :V3, or causes an error.
  (let* ((magic-1 (ser-read-byte stream))
         (magic-2 (ser-read-byte stream))
         (magic-3 (ser-read-byte stream))
         (magic-4 (ser-read-byte stream)))
    (declare (type (unsigned-byte 8) magic-1 magic-2 magic-3 magic-4))
    (let ((version (and (= magic-1 #xAC)
                        (= magic-2 #x12)
                        (= magic-3 #x0B)
                        (cond ((= magic-4 #xC7) :v1)
                              ((= magic-4 #xC8) :v2)
                              ((= magic-4 #xC9) :v3)
                              (t nil)))))
      (unless version
        (error "Invalid serialized object, magic number incorrect: ~X ~X ~X ~X"
               magic-1 magic-2 magic-3 magic-4))
      version)))
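; As a quick sanity check (an added sketch, using only the definitions above),
; encoding and then decoding the magic number in raw Lisp should identify the
; V3 format:
;
;   (with-input-from-string
;    (s (with-output-to-string (s) (ser-encode-magic s)))
;    (ser-decode-magic s))
;   ==> :V3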
; -----------------------------------------------------------------------------
;
;                       ENCODING AND DECODING NATURALS
;
; -----------------------------------------------------------------------------

; WHY DO WE USE 8-BIT BLOCKS?
;
; Originally I tried using 64-bit blocks.  I thought this would mean only 1/64
; of the bits would be "overhead" for continue-bits, and surely this would be
; better than using 8-bit blocks, where 1/8 of the bits would be continue-bit
; overhead.
;
; This thinking is totally wrong.  It ignores another important source of
; overhead: the unnecessary data-bits in the final block.  To make this very
; concrete, think about encoding the number 5.  We only "need" 3 bits.  In an
; 8-bit encoding, we use 8 bits so the overhead is 5/8 = 62%.  But in a 64-bit
; encoding we would need 64 bits for an overhead of 61/64 = 95%.  So the 8-bit
; encoding is much more efficient for small integers.
;
; Another potential disadvantage of 64-bit blocks may seem to be that 64-bit
; numbers are not fixnums in CCL (or perhaps any Lisp circa 2014 or for years
; to come).  However, that issue is probably actually quite minor; 8 8-bit
; fixnums quite possibly use more memory than one 64-bit bignum.
;
; Of course, there are cases where 64-bit blocks win.  For instance, 2^62
; nicely fits into a single 64-bit block, but requires 9 8-bit blocks (at 7
; data bits apiece), i.e., 72 bits.  But on some other larger numbers, 8-bit
; blocks can still be more efficient.  Take 2^64.  Here, we need either 2
; 64-bit blocks (at 63 data bits apiece) for 128 bits, or 10 8-bit blocks for
; 80 bits.  In short, the wider encoding only wins when the numbers are very
; long, or when there aren't very many unnecessary data bits in the final
; block.

; WHY ALL THESE OPTIMIZATIONS?
;
; The performance of natural number encoding/decoding is especially important
; because we have to (1) encode/decode two naturals for every cons, and (2)
; encode/decode naturals all over the place for string lengths, symbol name
; lengths, and the representation of any number.  These optimizations are a big
; deal: on one example benchmark, they improve CCL's decoding performance by
; almost 20% (as opposed to using just ser-encode-nat-large to encode
; naturals).

(defun ser-encode-nat-fixnum (n stream)
; Optimized encoder that assumes N is a fixnum.
  (declare (type fixnum n))
  (loop while (>= n 128)
        do
        (ser-write-byte (logior
                         ;; The 7 low data bits
                         (the fixnum (logand n #x7F))
                         ;; A continue bit
                         #x80)
                        stream)
        (setq n (the fixnum (ash n -7))))
  (ser-write-byte n stream))

(defun ser-encode-nat-large (n stream)
; Safe encoder that doesn't assume how large N is.
  (declare (type (integer 0 *) n))
  (loop until (typep n 'fixnum)
        do
        ;; Fixnums are at least (signed-byte 16) in Common Lisp, so we must
        ;; be in the large case, i.e., n > 128.
        (ser-write-byte (logior
                         ;; The 7 low data bits
                         (the fixnum (logand (the integer n) #x7F))
                         ;; A continue bit
                         #x80)
                        stream)
        (setq n (the integer (ash n -7))))
  (ser-encode-nat-fixnum n stream))

(defmacro ser-encode-nat (n stream)
; This is kind of silly, but it lets us avoid the function overhead of calling
; ser-encode-nat-large in the very common case that N is a fixnum.
  (when (eq stream 'n)
    (error "~s called with stream = N, which would cause capture!"
           'ser-encode-nat))
  `(let ((n ,n))
     (if (typep n 'fixnum)
         (ser-encode-nat-fixnum n ,stream)
       (ser-encode-nat-large n ,stream))))

(defun ser-decode-nat-large (shift value stream)
; Simple (but unoptimized) natural number decoder that doesn't assume anything
; is a fixnum.  Shift is 7 times the current block number we are reading, and
; represents how much we need to shift over the next 7 bits we read.  Value is
; the already-summed value of the previous blocks we have read.
  (declare (type integer value shift))
  (let ((x1 (ser-read-byte stream)))
    (declare (type fixnum x1))
    (loop while (>= x1 128)
          do
          (incf value (ash (- x1 128) shift))
          (incf shift 7)
          (setf x1 (ser-read-byte stream)))
    (incf value (ash x1 shift))
    value))

(defmacro ser-decode-nat-body (shift)
; See SER-DECODE-NAT; this is accounting for different fixnum sizes across
; Lisps by unrolling the loop with a recursive macro.  SHIFT is a constant that
; is being incremented by 7 on each "iteration".  An invariant that is
; important to the fixnum optimizations is that VALUE is always less than
; 2^SHIFT.
  (if (> (expt 2 (+ 7 shift)) most-positive-fixnum)
      ;; Can't unroll any further because we've reached the fixnum size, so
      ;; fall back to using the large decoder.
      `(ser-decode-nat-large ,shift value stream)
    `(progn
       (setq x1 (ser-read-byte stream))
       ;; Reusing X1 is kind of goofy, but seems to result in better code on
       ;; CCL.
       (cond ((< x1 128)
              ;; The returned VALUE + (X1 << SHIFT) is a fixnum since it is
              ;; less than 2^(7+SHIFT), which above we checked is a fixnum.
              (setq x1 (the fixnum (ash x1 ,shift)))
              (the fixnum (+ value x1)))
             (t
              ;; Else, we increment value by (x1 - 128) << SHIFT.  This is
              ;; still a fixnum because (x1 - 128) < 2^7, so the sum is under
              ;; 2^(7+SHIFT).  To see this let x2=x1-128, s=SHIFT, and v=value;
              ;; then since v<2^s and x2 <= 2^7-1, we have:
              ;; value + (x1-128)<