Date: 1/10/2001
To: SIGCOMM Executive Committee
From: Chris Edmondson-Yurkanan
Subj: Report on an experiment funded by SIGCOMM

Prototyping a Digital Archive of Historic Network Specifications

BACKGROUND:

My ultimate archival goal is to build digital archives for many historical network architectures that can be used as a resource for network students, researchers, and historians. These archives would contain the key specifications and design documents, and could also include some not-so-technical documents that provide context and perspective. These archives must be searchable.

EXPERIMENT:

The goal of this experiment was to explore the issues in creating a digital archive of scanned, OCRed, and formatted historical network specifications. For this experiment I used a subset of the materials sent to me in August 1999 for possible inclusion in the notebook that accompanied the SIGCOMM 1999 Tutorial: The Technical History of the Internet. In addition, I included a few of the documents that I have since started to collect.

The actual production was performed by Kata Carbone, who has worked on many SIGCOMM and ICNP print projects since the first annual SIGCOMM conference in Austin in 1983. Kata was also the ToN editorial assistant from 1995-1999. Kata spent 76+ hours on 430 original source pages at a cost to SIGCOMM of $1142.75. Most of these pages were scanned, OCRed, and proofed. A few had to be completely rekeyed by her. Figures and tables required either rekeying, redrawing, or the embedding of an image into the text. All proofing was performed by Kata.
The documents that are in this initial archive are:

1) 89 pages of materials on the NPL network architecture that Donald Davies sent me, including the NPL Proposal for a Digital Communication Network, the 1967 ACM "Gatlinburg" paper by Davies, Bartlett, Scantlebury, and Wilkinson, and a historical retrospective that Davies wrote in the 1980s,
2) 115 pages from several small Arpanet reports, and
3) 226 pages of several early TCP specifications.

MEASUREMENTS:

Original # of pages scanned: 430
Final # of pages (after OCRing and formatting): 379
Scan/Input Time: 465 minutes
OCR/Correct Time: 873 minutes
Proof/Correct Time: 3233 minutes (includes producing figures/tables)

I hesitate to give an average time per page, given the learning curve of this experiment and the wide variety and quality of the originals, but the average would be 10-11 minutes per original page.

OBSERVATIONS FROM THIS EXPERIMENT:

1) I started this project following the process that Lyman Chapin, Dave Oran and I used in producing the 25th Anniversary Issue of CCR. This process was to scan the originals, OCR them, correct the OCRing, reformat, and proof. If you look at the 25th anniversary issue, there is a consistent format in all the early papers (whereas some of the later papers were already available electronically, and thus minimal reformatting was done). All papers are easy to read online, as opposed to papers available only as images. Examples of paper images that are semi-hard to read (and also don't print that well) are the Cerf/Kahn paper (pdf image) at
www.worldcom.com/about_the_company/cerfs_up/technical_writings/protocol_paper/
and the Kleinrock status report (in jpeg format) at
www.cs.ucla.edu/~lk/LK/Bib/REPORT/RLEreport-1961.html
And of course, once a paper has gone through the above process, it is full-text searchable.

2) At the conclusion of this experiment, I now have mixed feelings about how best to present this historic material.
I actually prefer seeing a scanned image of Donald Davies' proposal typed on A4 paper rather than a modern formatted version in Word. Seeing the page's image provides some historical context, just as authors' rare manuscripts have for centuries. I certainly know how inspired I became about network history when I received Donald Davies' originals with rusty paperclips on A4 paper. Since network history is young, the real historic context may actually be in the markings on the original. The IENs that I requested from Bob Braden have a few pencil markings on them, but I don't know whether he sent me a xerox of his own copy, of Postel's copy, or of someone else's. Of course, if the quality of the scanned image is poor, then reading the newly formatted document in Word could be preferable.

3) I am concerned about the 70% of the time spent on correcting the OCRing, producing the figures/tables, and proofing. Plus, if one really wanted to attain 100% accuracy, an additional proofreading pass would be required.

4) So, for my next archival experiment, I am considering following the JSTOR model. The JSTOR project (www.jstor.org) is a digital archiving project that is creating searchable digital archives of older journals. While their motivation is different, I have learned quite a bit from their project, since they have already digitized over 5 million pages. JSTOR observed that they were only able to correct OCRed text to 99.5% accuracy, which for their purposes isn't of high enough quality for viewing. So JSTOR provides the image of each page for viewing and printing, but for searching they use the 99.5% proofed OCRed text, along with database entries of keywords and rekeyed abstracts. While I could not find any documents describing the ACM Digital Library's process in any detail, it appears that they are also using the above model.

Thanks for your support,
Chris

PS: I'll send you the URL for the documents listed below in a follow-up email.
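The JSTOR model described in observation 4 can be sketched in a few lines: the scanned page image is what the reader views and prints, while the imperfect OCR text and a few rekeyed keywords serve only as a search index. This is a minimal illustration in Python, not JSTOR's or ACM's actual implementation; the class names, document identifiers, file paths, and sample OCR text are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """One scanned page: the image is displayed, the OCR text is only searched."""
    document: str    # hypothetical document identifier
    number: int      # page number within the document
    image_path: str  # hypothetical path to the scanned page image
    ocr_text: str    # OCR output, possibly still containing errors


@dataclass
class Archive:
    pages: list = field(default_factory=list)
    keywords: dict = field(default_factory=dict)  # document -> set of rekeyed keywords

    def add_page(self, page):
        self.pages.append(page)

    def add_keywords(self, document, words):
        self.keywords.setdefault(document, set()).update(w.lower() for w in words)

    def search(self, term):
        """Return image paths of pages whose OCR text or document keywords match."""
        term = term.lower()
        hits = []
        for page in self.pages:
            in_text = term in page.ocr_text.lower()
            in_keywords = term in self.keywords.get(page.document, set())
            if in_text or in_keywords:
                hits.append(page.image_path)
        return hits


archive = Archive()
# "digita1" imitates an uncorrected OCR error (digit 1 instead of letter l).
archive.add_page(Page("npl-proposal", 1, "npl-proposal/p001.tif",
                      "Proposal for a digita1 communication network"))
archive.add_page(Page("ien-5", 1, "ien-5/p001.tif",
                      "Specification of Internet Transmission Control Program"))
archive.add_keywords("npl-proposal", ["digital", "packet", "switching"])

print(archive.search("network"))       # found in the OCR text
print(archive.search("digital"))       # OCR text is wrong, rekeyed keyword still matches
print(archive.search("transmission"))
```

The second search shows why rekeyed keywords and abstracts matter under this model: a page whose OCR text contains an error can still be found, and the reader is always handed the original image rather than the flawed text.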
APPENDIX A: Detailed list of the scanned materials, along with notes about the process

************ NPL MATERIAL (received from D. Davies) ************

Paper #: 01
Title: Historical note on the early development of packet switching
# of Pages (Original/Final): 05/05
Scan/Input Time: 08 min.
OCR/Correct Time: 67 min.
Proof/Correct Time: 22 min.

Paper #: 02
Title: Remote on-line data processing and its communication needs
# of Pages (Original/Final): 03/05
Scan/Input Time: 34 min.
OCR/Correct Time: *
Proof/Correct Time: 19 min.

Paper #: 03
Title: Further speculations on data transmission
# of Pages (Original/Final): 02/03
Scan/Input Time: 02 min.
OCR/Correct Time: 35 min.
Proof/Correct Time: 10 min.

Paper #: 04
Title: Proposal for the development of a national communication service for on-line data processing
# of Pages (Original/Final): 08/10
Scan/Input Time: 12 min.
OCR/Correct Time: 28 min.
Proof/Correct Time: 40 min.

Paper #: 05
Title: NPL Proposal for a digital communication network
# of Pages (Original/Final): 28/34
Scan/Input Time: 39 min.
OCR/Correct Time: 228 min.
Proof/Correct Time: 134 min.

Paper #: 06
Title: A digital communication network for computers giving rapid response at remote terminals (ACM Symposium)
# of Pages (Original/Final): 18/27
Scan/Input Time: 16 min.
OCR/Correct Time: 18 min.
Proof/Correct Time: 120 min.

Paper #: 07
Title: Report on visit of R.A. Scantlebury to the 1967 ACM symposium USA
# of Pages (Original/Final): 01/02
Scan/Input Time: 01 min.
OCR/Correct Time: 04 min.
Proof/Correct Time: 14 min.

Paper #: 08
Title: A data communication network for real-time computers
# of Pages (Original/Final): 10/15
Scan/Input Time: 21 min.
OCR/Correct Time: 51 min.
Proof/Correct Time: 48 min.

Paper #: 09
Title: CCITT meeting at Geneva, November 23 to 27
# of Pages (Original/Final): 03/03
Scan/Input Time: 03 min.
OCR/Correct Time: 01 min.
Proof/Correct Time: 20 min.
Paper #: 10
Title: Preparation for the NRD meeting of November 1970
# of Pages (Original/Final): 03/04
Scan/Input Time: 03 min.
OCR/Correct Time: 01 min.
Proof/Correct Time: 24 min.

Paper #: 11
Title: Report on CCITT meetings on new data networks 23-27 November 1970 at Geneva
# of Pages (Original/Final): 06/05
Scan/Input Time: 30 min.
OCR/Correct Time: *
Proof/Correct Time: 10 min.

Paper #: 12
Title: Private line digital data system
# of Pages (Original/Final): 01/02
Scan/Input Time: 01 min.
OCR/Correct Time: 05 min.
Proof/Correct Time: 15 min.

Paper #: 13
Title: Cover letter
# of Pages (Original/Final): 01/01
Scan/Input Time: --
OCR/Correct Time: --
Proof/Correct Time: --

TOTALS
# Pages: 89/116
Scan/Input Time: 170 min.
OCR/Correct Time: 438 min.
Proof/Correct Time: 476 min.
1084 minutes = 18 hours, 4 minutes

* Text was rekeyed.

Notes:
_ Underlined text was changed to bold face
_ Equations were set with italicized variables
_ Scanned tabular data was reset in table mode
_ Paper Nos. 2 & 11 were rekeyed due to poor original quality
_ Paper No. 5 was missing a line of text at the bottom of original page 21; section headings were tagged for table of contents generation; figures were floated within the text
_ Paper Nos. 6 & 8 were converted to single column from 2-column original layouts

**************** ARPANET MATERIAL (received from L. Chapin, C. Partridge) ****************

Paper #: 01
Title: ARPAnet/IMP Software History: Principal Milestones
# of Pages (Original/Final): 06/06
Scan/Input Time: 05 min.
OCR/Correct Time: 06 min.
Proof/Correct Time: 50 min.

Paper #: 02
Title: BBN Report # 2918: Network Design Issues
# of Pages (Original/Final): 28/18
Scan/Input Time: 34 min.
OCR/Correct Time: 34 min.
Proof/Correct Time: 135 min.

Paper #: 03
Title: BBN Report # 1822: Chapter 3: System Operation
# of Pages (Original/Final): 44/14
Scan/Input Time: 18 min.
OCR/Correct Time: 160 min.
Proof/Correct Time: 177 min.

TOTALS
# Pages: 78/38
Scan/Input Time: 57 min.
OCR/Correct Time: 200 min.
Proof/Correct Time: 362 min.
619 minutes = 10 hours, 19 minutes

* Text was rekeyed.

Notes:
_ Underlined text was changed to bold face
_ Equations were set with italicized variables
_ Scanned tabular data was reset in table mode

**************** ARPANET MATERIAL (received from L. Chapin) ****************

Paper #: 01
Title: NIC # 8246: Host/Host Protocol for the ARPA Network
# of Pages (Original/Final): 37/28
Scan/Input Time: 13 min.
OCR/Correct Time: 30 min.
Proof/Correct Time: 255 min.

TOTALS
# Pages: 37/28
Scan/Input Time: 13 min.
OCR/Correct Time: 30 min.
Proof/Correct Time: 255 min.
298 minutes = 4 hours, 58 minutes

Notes:
_ Underlined text was changed to bold face or italics
_ Scanned tabular data was reset in table mode
_ Section headings were tagged for table of contents generation

*************** TCP/IP MATERIAL (received from V. Cerf) ***************

Paper #: 01
Title: A Partial Specification of an International Transmission Protocol
# of Pages (Original/Final): 17/13
Scan/Input Time: 08 min.
OCR/Correct Time: 15 min.
Proof/Correct Time: 225 min.

TOTALS
# Pages: 17/13
Scan/Input Time: 08 min.
OCR/Correct Time: 15 min.
Proof/Correct Time: 225 min.
248 minutes = 4 hours, 8 minutes

Notes:
_ Underlined text was changed to bold face or italics
_ Scanned tabular data was reset in table mode
_ Figures were re-drawn
_ Figures 4 and 8 are nonexistent in the original (numbering skips from figure 3 to 5 and from 7 to 9)

************ IEN MATERIAL (three IENs from B. Braden) ************

Paper #: 01
Title: IEN #5: Specification of Internet Transmission Control Program: TCP (Version 2)
# of Pages (Original/Final): 105/83
Scan/Input Time: 67 min.
OCR/Correct Time: 45 min.
Proof/Correct Time: 1045 min.

Paper #: 02
Title: IEN #18: Proposed Revisions to the TCP
# of Pages (Original/Final): 10/10
Scan/Input Time: 120 min.
OCR/Correct Time: *
Proof/Correct Time: 30 min.
Paper #: 03
Title: IEN #21: Specification of Internetwork Transmission Control Program: TCP (Version 3)
# of Pages (Original/Final): 94/91
Scan/Input Time: 30 min.
OCR/Correct Time: 145 min.
Proof/Correct Time: 840 min.

TOTALS
# Pages: 209/184
Scan/Input Time: 217 min.
OCR/Correct Time: 190 min.
Proof/Correct Time: 1915 min.
2322 minutes = 38 hours, 42 minutes

Notes:
_ Underlined text was changed to bold face or italics
_ Scanned tabular data was reset in table mode
_ Section headings were tagged for table of contents generation
_ Paper No. 2 was rekeyed due to poor original quality
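The per-section TOTALS in this appendix can be cross-checked against the MEASUREMENTS summary with a short script. The following is a sketch in Python: the figures are copied from the TOTALS entries above, while the variable names and the layout of the table are illustrative choices only.

```python
# Section totals from Appendix A:
# (original pages, final pages, scan min., OCR min., proof min.)
sections = {
    "NPL (Davies)": (89, 116, 170, 438, 476),
    "ARPANET (Chapin, Partridge)": (78, 38, 57, 200, 362),
    "ARPANET (Chapin)": (37, 28, 13, 30, 255),
    "TCP/IP (Cerf)": (17, 13, 8, 15, 225),
    "IEN (Braden)": (209, 184, 217, 190, 1915),
}

# Sum each column across the five sections.
orig_pages, final_pages, scan, ocr, proof = (
    sum(column) for column in zip(*sections.values())
)
total_minutes = scan + ocr + proof

print(orig_pages, final_pages)          # 430 379
print(scan, ocr, proof, total_minutes)  # 465 873 3233 4571
print(f"{total_minutes / 60:.1f} hours")                          # 76.2 hours
print(f"{total_minutes / orig_pages:.1f} min per original page")  # 10.6
print(f"{proof / total_minutes:.0%} of time in Proof/Correct")    # 71%
```

The sums reproduce the page counts and stage times in the summary, the 76+ hours of production time, the 10-11 minutes per original page, and the roughly 70% share of time that went into proofing and correction.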