The Project Gutenberg eBook of The Project Gutenberg FAQ 2002
This ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this ebook or online. If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook.
Title: The Project Gutenberg FAQ 2002
Author: Jim Tinsley
Release date: October 1, 2005 [eBook #9109]
Most recently updated: January 2, 2021
Language: English
The Project Gutenberg FAQ 2002
by Jim Tinsley
Important: This file is posted to the Project Gutenberg archives not as a current guide, more as a historical reference. I hope that future FAQs will be posted, as the project evolves, but this one is of its time.
If you want the most up-to-date information from PG, please see the current version of the FAQ, from the Project Gutenberg site, or, at the time of posting, at:
Writing a FAQ for an organization of fanatical proofreaders has its ups and downs! I'd like to thank all those who corrected my facts and my typos, and especially the people who pointed out the lack of clarity in certain answers. The remaining errors and opacity are all mine.
Preface to the archive edition
Ironically, Project Gutenberg, which preserves the writings of others, doesn't have much written history itself. There are scraps of e-mails and guidelines, but many newsletters and other internal writings before 1996 have gone to the great bit-bucket in the sky.
The latter half of the '90s marked a graceful blooming of Project Gutenberg's growth. Three related technical factors contributed: the explosion in home PCs brought standardization, which made it easy for non-techies to install scanners, which, in response to the new demand, became plentiful and cheap. And, of course, these years saw the rise in popularity of the Internet, which has always been PG's main channel of communication and distribution.
However, while PG's production expanded geometrically, at Moore's Law rates, there were barriers to participation. Most volunteers had to find an eligible book, scan or type it, and proof the resulting text all by themselves. This was and is a fairly significant amount of work: 40 painstaking hours would be a typical commitment for one book.
Beyond that, simply learning the mechanics of producing e-texts could be a serious challenge for newcomers. Nearly all internal PG communication, except for the Newsletter, was by private e-mail, and instructions had to be repeated many times to individual new volunteers, all of whom showed up with great good will, but most of whom vanished after a week or two.
Michael Hart was unstinting in his editing of incoming texts and handling questions by e-mail, but any one person has only so many hours.
The Directors of Production at the time (Sue Asscher, Dianne Bean, John Bickers and David Price) served as contact points for advice and help, made enormous efforts of production themselves, and tried to share the scanned texts among new volunteers for proofing. They made a huge contribution to building community in PG.
Pietro Di Miceli set up a web site for the project in 1996, and with the popularization of the Web (as opposed to the Internet), this became a beacon for readers and new volunteers.
All of these people reached out to willing volunteers, drew them in, helped them, encouraged them. The Project and all of the readers of the books, now and in the future, owe these people a great debt. Without them, Project Gutenberg could not have achieved what it has. But still, for the most part, each volunteer worked alone.
In 1999, I wrote, in response to an offer to volunteer:
I think I can best answer your offer, and many others like it, by giving an extended description of what actually happens in the making of PG texts, and why it's often not easy to get started.
There is no agenda, no master list of tasks ready to be given to volunteers. This is often the hardest thing to get across to new volunteers. I know I waited quite a while after volunteering for someone to give me a job to do before I realized it.
Five steps are normally performed in the publishing of an e-text.
1. Someone, somewhere gets a public-domain copy of a text they want to contribute.
2. That volunteer confirms its PD status by sending TP&V to Michael, and getting copyright clearance.
3. Someone, usually the same volunteer, scans and corrects the text, or, if skilled in typing, types the book into an e-text.
4. Someone, often a different volunteer, second-proofs the e-text, removing the smaller errors.
5. The e-text is sent to Michael for posting.
There are three barriers which make it difficult for most people to contribute:
1. Getting a PD book.
2. People without scanners and typing skills have no way of turning a book into an e-text.
3. Even with a scanner, turning a book into an e-text is not easy or quick.
Since, generally, people who have a PD book don't just want to send it off to a stranger for scanning, the people who produce e-texts have to get over all three of these barriers. This is the bottleneck in production. It's relatively easy to get an e-text second-proofed; making it in the first place is the hardest part. You need to have a book, the means to turn it into an e-text and the time and will to do it.
After that comes second proofing. There are two problems here. One is that there may not be enough texts for all the people who want to second-proof; the other is that a lot of beginners just abandon texts given to them for second-proofing, which holds up the process and is discouraging for others. So a lot of volunteers do their own second-proofing or send their texts to established contacts with a track record of finishing the job, rather than making them available to newbies. The Directors of Production do serve as contact points, and at any given moment may have some texts for proofing, but they can only distribute the texts that have already been made.
With that explanation out of the way, I can better address your question of what you can do.
Second-proofing is an easy way to start, but material isn't just waiting for you. If you want to look for some, post your offer here and wait a week or so. If no takers by then, e-mail Michael and ask if there are any texts available; he may be able to refer you to a Director of Production who has something current. You may not get an e-text immediately, but you will get one. Of course, you can also look here for offers of e-texts ready to proof.
Your other option is to take on a book yourself. In your case, you already have a scanner, so you are equipped to become a producer. You need to find a PD book.
Getting PD books means finding and borrowing or buying them. You can do this through used bookshops, libraries or book sites on the Internet. I mention a few net sites in the FAQ in the link below. I get all my books through them, since they make it easy for me to find the books I want. Prices range from $5 up to (in my case) about $30.
The best advice I can offer here is: pick a book that you want to contribute, and a book you'll enjoy working with—you'll be living with it up close and personal for quite a while.
In March and April of 1999, Pietro created the PG Volunteers' WWWBoard and Greg Newby set up the mailing list gutvol-d, and, for the first time, volunteers who hadn't been introduced to each other by Michael or the Directors could meet online and communicate directly. A few FAQs and HOWTOs were written, covering the basics, the nitty-gritty of producing books. All of this activity made it much easier for people to get involved, and the Project experienced a new influx of interested volunteers. Improved OCR software was also a factor at this time: in response to the commoditization of scanners, there was rapid improvement in the quality of OCR, and better OCR made for easier production of e-texts. More work was shared out in co-operative proofing experiments.
It was in this new, expansive atmosphere, with ideas flooding in from enthusiasts newly energized by the project, that Charles Franks (Charlz) came up with the idea of a web site that would serve to distribute the work of proofing a book among many volunteers. But not only did he think of the concept; he went ahead and did it!
In April 2000, Charlz first requested comments on his idea in a post on the Volunteers' WWWBoard, and by the end of September, the first e-texts were queueing up on the production line.
On October 9th, Charlz wrote:
Number of pages proofed by date:
2nd   6
3rd   6
4th  20 <-- Newsletter
5th  27
6th  25
7th  29
8th  30
9th  45!! (and the day ain't over yet)
(The "Newsletter" is a reference to the site being mentioned in the PG Newsletter on October 4th, 2000).
Distributed Proofreaders, or DP, simply kept growing from there, as Charlz kept scanning and adding more books and features and proofers, and its simple organic growth produced 600 e-texts in two years. But when Charlz asked for more help on Slashdot, a popular technical news site, on November 8th, 2002, the response blew the roof off! The pages per day figure jumped from 1,000 to about 10,000 for a while, then settled down at its current 4,000. 4,000 pages, even given that each page is proofed twice, is a lot of pages. 2,000 produced pages per day is about five full books per day. DP has formed the backbone of PG's production ever since. Whatever the future of DP's production, its effect on shared knowledge and resources, and the communication and community it has built, ensure that Project Gutenberg will never be the same again.
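The throughput arithmetic in the paragraph above can be checked with a short sketch. The daily page count and the two proofing passes come from the text; the figure of 400 pages per book is only an assumption, chosen because it is the book length implied by "about five full books per day", and the function names are hypothetical.

```python
# Figures quoted in the text.
PAGES_PROOFED_PER_DAY = 4000  # DP's settled daily figure
PROOFING_PASSES = 2           # each page is proofed twice

# Assumption, not stated in the text: an average book length that
# makes 2,000 produced pages come to about five books.
ASSUMED_PAGES_PER_BOOK = 400


def produced_pages_per_day(pages_proofed: int, passes: int) -> int:
    """Pages that finish all proofing passes in one day."""
    return pages_proofed // passes


def books_per_day(produced_pages: int, pages_per_book: int) -> float:
    """Rough number of complete books those pages represent."""
    return produced_pages / pages_per_book


produced = produced_pages_per_day(PAGES_PROOFED_PER_DAY, PROOFING_PASSES)
print(produced)                                          # 2000
print(books_per_day(produced, ASSUMED_PAGES_PER_BOOK))   # 5.0
```

A different assumed book length changes only the last figure; the 2,000 produced pages per day follows directly from the numbers in the text.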
I began writing this FAQ in March 2002, and was essentially finished around December 2002. It sat around, with a few tweaks here and there in response to comments, until the start of September 2003.
Today, it is a useful guide to Project Gutenberg norms and practices. By the time you read it, it may be ancient history ("Hey, Grandad, did you REALLY scan things from paper? Why didn't you use your brain implant?" :-) But it is one record of How Things Were in Project Gutenberg during this time of change.
jim
September 7th, 2003.
Project Gutenberg FAQ 2002
I have a question not answered in this FAQ. How do I ask it?
If it's about how to produce a text, the Volunteers' Board at <> is generally the best place to ask.
If it's a question of active interest to the general body of volunteers, you can ask it on the gutvol-d mailing list. See <> for joining it.
For other questions, you should check our Contact Information page at <> and e-mail the appropriate person.
About Project Gutenberg:
G.1. What is Project Gutenberg?
G.2. Where did Project Gutenberg come from?
G.3. What has Project Gutenberg achieved?
G.4. Who runs Project Gutenberg?
G.5. How many people are in Project Gutenberg?
G.6. How can I contact Project Gutenberg?
G.7. How can I help Project Gutenberg?
G.8. How can I keep in touch with what Project Gutenberg is doing?
G.9. What is the relationship between Project Gutenberg, Projekt
Gutenberg-DE, Project Gutenberg of Australia, and Project Runeberg?
About Project Gutenberg publications:
G.10. Does Project Gutenberg publish only books?
G.11. What books does Project Gutenberg publish?
G.12. What other things does Project Gutenberg publish?
G.13. How does Project Gutenberg choose books to publish?
G.14. What languages does Project Gutenberg publish in?
G.15. Why don't you have any / many books about history, geography, science,
G.16. Why don't you have any books by Stephen King, Tom Clancy,
Tolkien, etc.?
G.17. Why is Project Gutenberg so set on using Plain Vanilla ASCII?
Readers' FAQ
About Finding eBooks:
R.1. How can I find an eBook I'm looking for?
R.2. Can I get a complete list of Project Gutenberg eBooks?
R.3. How can I download a PG text that hasn't been cataloged yet?
R.4. You don't have the eBook I'm looking for. Can you help me find it?
R.5. Where else can I go to get eBooks?
R.6. I see some eBooks in several places on the Net. Do different people really re-create the same eBooks?
About Using the Web Site:
R.7. Why couldn't I reach your site? (or: Why is your site slow?)
R.8. I get an error when I try to download a book.
R.9. I searched for a book I know is in Project Gutenberg, but got no
R.10. Can I copy your website, or your website materials?
R.11. Your site doesn't look right in my browser.
I clicked on a button, and nothing happened.
R.12. What does that thing about "Select FTP Site" mean?
R.13. What exactly is an FTP site anyway?
R.14. Can I become an FTP mirror?
R.15. Can I make a private FTP mirror for my school, library or
R.16. When I clicked on the file I want, nothing happened.
R.17. How many texts are downloaded through the web site?
R.18. What are the most popular books?
About Downloading and Using Project Gutenberg eBooks:
R.19. Should I download a ZIP or a TXT file?
R.20. I've got a ZIP file. What do I do with it?
R.21. I tried to unzip my file, but it said the file was corrupt, or damaged.
R.22. I see gibberish onscreen when I click on a book.
R.23. Can I download and read your books?
R.24. What am I allowed to do with the books I download?
R.25. Does Project Gutenberg know who downloads their books?
R.26. I've found some obvious typos in a Project Gutenberg text. How should I report them?
R.27. I've found some obvious typos in a Project Gutenberg text. Who should I report them to?
R.28. I've reported some typos. What will happen next?
R.29. I've got the text file, and I can read it, but it seems to be double-spaced or it has control characters like ^J or ^M at the end of every line.
R.30. When I print out the text file, each line runs over the edge of the page and looks bad.
R.31. I can read the text file, but a few characters appear as black squares, or gibberish.
R.32. Can I get a handheld device for reading PG texts? Which device should I get?
R.33. How can I read a PG eBook on my PDA (Palm, iPaq, Rocket . . .)?
About the Files:
R.34. What types of files are there, and how do I read them?
R.35. What do the filenames of the texts mean?
R.36. What is the difference within PG between an "edition" and a "version"?
R.37. What is the difference between an "etext" and an "eBook"?
R.38. What are the "Etext/Ebook numbers" on the texts?
R.39. What do the month and year on the text mean?
Copyright FAQ
C.1. What is copyright?
C.2. Does copyright differ from country to country? From state to state?
C.3. What are the copyright laws outside the U.S.?
C.4. Why does Project Gutenberg advise only on U.S. copyright issues?
C.5. I don't live in the U.S. Do these rules apply to me?
C.6. What is the public domain?
C.7. What can I do with a text that is in the public domain?
C.8. How does a book enter the public domain?
C.9. How does a copyright lapse?
C.10. What books are in the public domain?
C.11. My book says that it's "Copyright 1894". Is it in the public domain?
C.12. How can a copyright owner release a work into the public domain?
C.13. When is an author not the owner of a copyright on his or her works?
C.14. What does Project Gutenberg mean by "eligible"?
C.15. I have a manuscript from 1900. Is it eligible?
C.16. How come my paper book of Shakespeare says it's "Copyright 1988"?
C.17. What makes a "new copyright"?
C.18. I have a 1990 book that I know was originally written in 1840,
but the publisher is claiming a new copyright. What should I do?
C.19. I have a 1990 reprint of an 1831 original. Is it eligible?
C.20. I have a text that I know was based on a pre-1923 book, but I
don't have the title page. Can I submit it to PG?
C.21. How does Project Gutenberg "clear" books for copyright?
C.22. I want to produce a particular book. Will it be copyright cleared?
C.23. I have some extra material (images, introduction, preface, missing
chapter) that should go into an existing PG text. Do I have to
copyright-clear my edition before submitting it?
C.24. I see some Project Gutenberg eBooks that are copyrighted. What's
up with that?
C.25. What are "non-renewed" books?
C.26. How can I get Project Gutenberg to clear a non-renewed book?
Volunteers' FAQ
About the Basics:
V.1. How do I get started as a Project Gutenberg volunteer?
V.2. What experience do I need to produce or proof a text?
V.3. How do I produce a text?
V.4. Do I need any special equipment?
V.5. Do I need to be able to program?
V.6. I am a programmer, and I would like to help by programming.
V.7. What does a Gutenberg volunteer actually do?
V.8. Can I produce a book in my own language?
V.9. Does it have to be a book? Can I produce pieces from a magazine
or other periodical?
V.10. Do I have to produce in plain ASCII text?
V.11. Where do I sign up as a volunteer?
V.12. How do PG volunteers communicate, keep in touch, or co-ordinate work?
V.13. Where can I find a list of books that need proofing?
V.14. Is there a list of books that Project Gutenberg wants?
V.15. I have one book I'd like to contribute. Can I do just that without
signing up?
About production:
V.16. How does a text get produced?
V.17. How long must a text be to qualify for PG?
V.18. What books are eligible?
V.19. Are reprints or facsimiles eligible?
V.20. What is the difference between a reprint and a facsimile?
V.21. What is the difference between a reprint and a "new edition"?
V.22. What book should I work on?
V.23. I have a book in mind, but I don't have an eligible copy.
V.24. Where can I find an eligible book?
V.25. What is "TP&V"?
V.26. What is "Posting"?
V.27. I think I've found an eligible book that I'd like to work on.
What do I do next?
V.28. What books are currently being worked on?
V.29. How do I find out if my book is already on-line somewhere?
V.30. My book is not on the In-Progress list, and I can't find it on-line.
V.31. My book is on-line, but not in Project Gutenberg. What should I do?
V.32. My book is already on-line in Project Gutenberg, but my printed book
is different from the version already archived. Can I add my version?
V.33. I see a book that was being worked on three years ago. Is anyone still
working on it?
V.34. I've decided which book to produce. How do I tell PG
I'm working on it?
V.35. I have a two- or three-volume set. Should I submit them as one text,
or one text for each volume?
V.36. I have one physical book, with multiple works in it (like a
collection of plays). Should I submit each text separately?
V.37. How do I get copyright clearance?
V.38. I have a two- or three-volume set. Do I have to get a separate
clearance on each physical book?
V.39. I have one physical book, with multiple works in it (like a
collection of plays). Do I have to get a separate clearance
for each work?
V.40. Who will check up on my progress? When?
V.41. How long should it take me to complete a book?
V.42. I want/don't want my name published on my e-text
V.43. I'd like to put a copy of my finished e-text, or another
Gutenberg text, on my own web page.
V.44. I've scanned, edited and proofed my text. How do I find someone
to second-proof it?
V.45. I've gone over and over my text. I can't find any more errors,
and I'm sick of looking at it. What should I do now?
V.46. Where and how can I send my text for posting?
V.47. What is the "Credits Line"?
V.48. How soon after I send it will my text be posted?
V.49. I found a problem with my posted text. What do I do?
V.50. Someone has e-mailed me about my posted text, pointing out errors.
V.51. Someone has e-mailed me about my posted text, thanking me.
About Proofing:
V.52. What role does proofing play in Project Gutenberg?
V.53. What is Distributed Proofing?
V.54. What do I need to proof an e-text?
V.55. Do I need to have a paper copy of the book I'm proofing?
V.56. What's the difference between "first proof" and "second proof"?
V.57. What do I do with an e-text sent to me for proofing?
V.58. What kinds of errors will I have to correct?
V.59. How long does it take to proof an e-text?
V.60. Are there any special techniques for proofing?
V.61. What actually happens during a proof?
About Net searching:
V.62. I've found an eligible text elsewhere on the Net, but it's not
in the PG archives. Can I just submit it to PG?
V.63. I've found an eligible text elsewhere on the Net, but it's not
in the PG archives. Why should I submit it to PG?
V.64. I have already scanned or typed a book; it's on my web site.
How can I get it included in the Gutenberg archives?
V.65. I have already scanned or typed a book; it's on my web site.
The world can already access it. Why should I add it to the
Gutenberg archives?
V.66. I have already scanned or typed a book, but it's not in plain text
format. Can I submit it to PG?
About author-submitted eBooks:
V.67. I've written a book. Will PG publish it?
V.68. I have translated a classic book from one language to another.
Will PG publish my translation?
V.69. OK, this is one of the cases where PG will publish it.
What do I do next?
V.70. I hold the copyright on a book. Can I release it to the public domain?
V.71. I hold the copyright on a book. Do I have to release the book
into the public domain for Project Gutenberg to publish it?
V.72. I hold the copyright on a book, and would like Project Gutenberg
to publish it. Can I choose what rights to assign?
About what goes into the texts:
V.73. Why does PG format texts the way it does?
About the characters you use:
V.74. What characters can I use?
V.75. What is ASCII?
V.76. So what is ISO-8859? What is Codepage 437? What is Codepage 1252?
What is MacRoman?
V.77. What is Unicode?
V.78. What is Big-5?
V.79. What are "8-bit" and "7-bit" texts?
V.80. I have an English text with some quotations from a language that
needs accents—what should I do about the accents?
V.81. I have some Greek quotations in my book. How can I handle them?
V.82. I want to produce a book in a language like Spanish or French
with accented characters. What should I do?
About the formatting of a text file:
V.83. How long should I make my lines of text?
V.84. Why should I break lines at all? Why not make the text as one
line per paragraph, and let the reader wrap it?
V.85. Why use a CR/LF at end of line?
V.86. One space or two at the end of a sentence?
V.87. How do I indicate paragraphs?
V.88. Should I indent the start of every paragraph?
V.89. Are there any places where I should indent text?
V.90. Can I use tabs (the TAB key) to indent?
V.91. How should I treat dashes (hyphens) between words?
V.92. How should I treat dashes replacing letters?
V.93. What about hyphens at end of line?
V.94. What should I do with italics?
V.95. Yes, but I have a long passage of my book in italics! I can't
really CAPITALIZE or otherwise /mark/ all that text, can I?
V.96. Should I capitalize the first word in each chapter?
V.97. What is a Transcriber's Note? When should I add one?
V.98. Should I keep page numbers in the e-text?
V.99. In the exceptional cases where I keep page numbers, how should
I format them?
V.100. Should I keep Tables of Contents?
V.101. Should I keep Indexes and Glossaries?
V.102. How do I handle a break from one scene to another, where the
book uses blank lines, or a row of asterisks?
V.103. How should I treat footnotes?
V.104. My book leaves a space before punctuation like semicolons,
question marks, exclamation marks and quotes. Should I do
the same?
V.105. My book leaves a space in the middle of contracted words like
"do n't", "we 'll" and "he 's". Should I do the same?
V.106. How should I handle tables?
V.107. How should I format letters or journal entries?
V.108. What can I do with the British pound sign?
V.109. What can I do with the degree symbol?
V.110. How should I handle . . . ellipses?
V.111. How should I handle chapter and section headings?
V.112. My book has advertisements at the end. Should I keep them?
V.113. Can I keep Lists of Illustrations, even when producing a
plain text file?
V.114. Can I include the captions of Illustrations, even when producing
a plain text file?
V.115. Can I include images with my text file?
About formatting poetry:
V.116. I'm producing a book of poetry. How should I format it?
V.117. I'm producing a novel with some short quotations from poems.
About formatting plays:
V.118. How should I format Act and Scene headings?
V.119. How should I format stage directions?
V.120. How should I format blank verse?
About some typical formatting issues:
V.121. Sample 1: Typical formatting issues of a novel.
V.122. Sample 2: Typical formatting issues of non-fiction
V.123. Sample 3: Typical formatting issues of poetry
V.124. Sample 4: Typical formatting issues of plays
About problems with the printed books:
V.125. I found some distasteful or offensive passages in a book I'm
producing. Should I omit them?
V.126. Some paragraphs in my book, where a character is speaking,
have quotes at the start, but not at the end. Should I close
those quotes?
V.127. The spelling in my book is British English (colour, centre).
Should I change these to American spellings?
V.128. I'm nearly sure that some words in my printed book are typos.
Should I change them?
V.129. Having investigated what looks like a typo, I find it isn't.
Do I need to do anything?
V.130. Aarrgh! Some pages are missing! Do I have to abandon the book?
V.131. Some words are spelled inconsistently in my book (e.g. sometimes
"surprise", sometimes "surprize"). Should I make them consistent?
Word Processing FAQ
W.1. What's the difference between an editor and a word processor?
W.2. Should I use an editor or a word processor?
W.3. Which editor or word processor should I use?
W.4. How can I make my word processor easier to work with for plain text?
W.5. What is the difference between proportional and non-proportional
W.6. I can't get words in a table or poem to line up under each other.
About using MS-Word:
W.7. I've edited my book in Word - how do I save it as plain text?
W.8. Quotes look wrong when I save a Word document as plain text.
W.9. Dashes look wrong when I save a Word document as plain text.
W.10. I saved my Word document as HTML, but the HTML looks terrible.
Scanning FAQ
S.1. What is a scanner?
S.2. What types of scanners are there?
S.3. Which scanner should I get?
S.4. What is ADF?
S.5. Should I get ADF?
S.6. What's a "TWAIN driver" and why do I need one?
S.7. How do I scan a book?
S.8. My book won't open flat enough for a good scan, and I don't
want to cut the pages.
S.9. How long does it take to scan a book?
S.10. What scanner settings are best?
S.11. Can I use a digital camera in place of a scanner?
S.12. What is OCR?
S.13. What differences are there between OCR packages?
S.14. How accurate should OCR be?
S.15. Which OCR package should I get?
S.16. What types of mistakes do OCR packages typically make?
S.17. Why am I getting a lot of mistakes in my OCRed text?
S.18. I got an OCR package bundled with my scanner. Is it good enough
to use?
S.19. I want to include some images with a HTML version. How should I
scan them?
S.20. I want to include some images with a HTML version. What type of
image should I use?
S.21. Will PG store scanned page images of my book?
HTML FAQ
H.1. Can I submit a HTML version of my text?
H.2. Why should I make a HTML version?
H.3. Can I submit a HTML version without a plain ASCII version?
H.4. What are the PG rules for HTML texts?
H.5. Can I use Javascript or other scripting languages in my HTML?
H.6. Should I make my HTML edition all on one page, or split it into
multiple linked pages?
H.7. How can I check that I haven't made mistakes in coding my HTML?
H.8. Can I submit a HTML or other format of somebody else's text?
H.9. How big can the images be in a HTML file?
H.10. The images I've scanned are too big for inclusion in HTML.
What can I do about it?
H.11. Can I include decorative images I've made or found?
H.12. How can I make a plain text version from a HTML file?
H.13. How can I make a HTML version from my plain text file?
Programs and Programming FAQ
P.1. What useful programs are available for Project Gutenberg work?
P.2. What programs could I write to help with PG work?
Formats FAQ
F.1. What formats does Project Gutenberg publish?
F.2. What is, and how do I make or use various formats?
Volunteers' Voices - Volunteers talk about PG
Amy Zelmer
Ben Crowder
Col Choat
Gardner Buchanan
Jim Tinsley
John Mamoun
Ken Reeder
Lynn Hill
Sandra Laythorpe
Tony Adam
Tonya Allen
Walter Debeuf
Bookmarks - web pages commonly referred to in the FAQ
B.1. Project Gutenberg
B.2. Distributed Proofing Sites
B.3. Other On-Line eBook Pages
B.4. Lists of Suggested Books to Transcribe
B.5. Finding Paper Books On-Line
About Project Gutenberg:
G.1. What is Project Gutenberg?
Project Gutenberg is a volunteer effort to digitize, archive, and distribute cultural works.
G.2. Where did Project Gutenberg come from?
In 1971, Michael Hart was given $100,000,000 worth of computer time on a mainframe of the era. Trying to figure out how to put these very expensive hours to good use, he envisaged a time when there would be millions of connected computers, and typed in the Declaration of Independence (all in upper case; there was no lower case available!). His idea was that everybody who had access to a computer could have a copy of the text. Now, 31 years later, his copy of the Declaration of Independence (with lower-case added!) is still available to everyone on the Internet.
During the 70s, he added some more classic American texts, and through the 80s worked on the Bible and the collected works of Shakespeare. That edition of Shakespeare was never released, due to copyright law changes, but others followed.
Starting in 1991, Project Gutenberg began to take its current form, with many different texts and defined targets. The target for 1991 was one book a month. 1992's target was two books a month. This target doubled every year through 1996, when it hit 32 books a month.
Today, we have a target of 200 books a month.
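The doubling schedule described in this answer can be written as a small formula: the monthly target in a given year is 2 raised to the number of years since 1991. The sketch below assumes exactly that reading; the function name `monthly_target` and its parameters are hypothetical, and it deliberately covers only 1991 through 1996, since the text gives no rule for later years.

```python
def monthly_target(year: int, base_year: int = 1991, base_target: int = 1,
                   last_doubling_year: int = 1996) -> int:
    """Books-per-month target for a year in the doubling period.

    Assumes one book a month in the base year and a doubling each
    year after that, as described in the text.
    """
    if not base_year <= year <= last_doubling_year:
        raise ValueError("the text describes doubling only for 1991-1996")
    return base_target * 2 ** (year - base_year)


for year in range(1991, 1997):
    print(year, monthly_target(year))
```

Run as written, the loop prints 1, 2, 4, 8, 16 and 32 for the six years, matching the 1991, 1992 and 1996 figures quoted above.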
G.3. What has Project Gutenberg achieved?
Project Gutenberg is the original, and oldest, etext project on the
Internet, founded in 1971.
In mid-2002, we are not only still going, we have made over 5,000 eBooks available, with a current production target of 200 more each month.
We have many mirrors (copies) of our archives on all five continents.
G.4. Who runs Project Gutenberg?
The Project Gutenberg Literary Archive Foundation is a 501(c)(3) organization. Dr. Gregory B. Newby is our volunteer CEO. Professor Michael Hart is our Founder and Executive Director.
In terms of the day-to-day production of eBooks, our volunteers run themselves. :-) They produce books, and submit them when completed. Our Production Directors help with general volunteer issues. The Posting Team check submitted texts and shepherd them onto our servers. You can find current contact information for these people on the Contact Information page at <>.
G.5. How many people are in Project Gutenberg?
As of mid-2002, there are about 100 active producers, and 200 regular, active helpers doing tasks like proofing. Something like 1500 people receive our Newsletter.
G.6. How can I contact Project Gutenberg?
There are lots of ways to contact us, depending on what you want to talk about. The Contact Info page <> on the main web site lists them.
G.7. How can I help Project Gutenberg?
Donate money! We're an all-volunteer project, and we don't have much to spend, so even a little goes a long way. Our Donation page <> tells you how.
Produce a text! Turn an old book into an immortal etext.
The Volunteers' FAQ [V.1] tells you how.
G.8. How can I keep in touch with what Project Gutenberg is doing?
Subscribe to one of the Newsletters—weekly or monthly!
The page <> gives details of how to subscribe, unsubscribe and access the archives.
G.9. What is the relationship between Project Gutenberg, Projekt
Gutenberg-DE, Project Gutenberg of Australia, and Project Runeberg?
These are all entirely separate organizations. Projekt Gutenberg-DE and Project Gutenberg of Australia use the "Project Gutenberg" trademark with permission, and they operate within the copyright rules of their respective countries. Project Runeberg has no specific connection with Project Gutenberg.
About Project Gutenberg publications:
G.10. Does Project Gutenberg publish only books?
Project Gutenberg also publishes other cultural works like movies and music, but the bulk of our collection is books.
G.11. What books does Project Gutenberg publish?
Any books that we legally can, and that our volunteers want to work on.
We cannot publish any texts still in copyright without permission. This generally means that our texts are taken from books published pre-1923. (It's more complicated than that, as our Copyright FAQ explains, but 1923 is a good first rule-of-thumb for the U.S.A.)
So you won't find the latest bestsellers or modern computer books here. You will find the classic books from the start of the 20th century and previous centuries, from authors like Shakespeare, Poe, Dante, as well as well-loved favorites like the Sherlock Holmes stories by Sir Arthur Conan Doyle, the Tarzan and Mars books of Edgar Rice Burroughs, Alice's adventures in Wonderland as told by Lewis Carroll, and thousands of others.
These books are chosen by our volunteers. Simply, a volunteer decides that a certain book should be in the archives, obtains the book and does the work necessary to turn it into an e-text. If you're interested in volunteering, see the Volunteers' FAQ at [V.1] below.
G.12. What other things does Project Gutenberg publish?
We have published some music files, in MIDI and MUS formats. We have published the Human Genome. We have published pictures of the prehistoric cave paintings from the south of France. We have published some video files and some audio files, including a Janis Ian track and readings from public domain books.
G.13. How does Project Gutenberg choose books to publish?
Project Gutenberg, as such, does not choose books to publish. There is no central list of works that volunteers are asked to work on. Individual volunteers choose and produce books according to their own tastes and values, and the availability (or price!) of the book.
G.14. What languages does Project Gutenberg publish in?
Whatever languages we can! As above, this is decided by what languages our volunteers choose to work with.
G.15. Why don't you have any / many books about history, geography,
science, biography, etc.?
Why aren't there any / more PG books available in French, Spanish,
German, etc.?
If we can legally publish a book, and it isn't in the archives, it's because no volunteer has produced it yet. At the moment, we have a predominance of English language novels because that is what most people have chosen to work on.
We're always looking for new languages and topics, and always delighted to see people producing them. If we don't have enough of the types of books you would like to see, why don't you help us out by contributing one? If the people interested in a particular area don't contribute, we'll always be short in that area.
G.16. Why don't you have any books by Stephen King, Tom Clancy,
Tolkien, etc.?
Project Gutenberg can publish only books that are in the public domain [C.10] unless we have the permission of the copyright holder. Current bestsellers have not yet entered the public domain, and we're not likely to get permission from the authors to publish them.
G.17. Why is Project Gutenberg so set on using Plain Vanilla ASCII?
Don't misrepresent us: we support and publish many open formats, but, yes, we do want to have a plain text version of everything possible.
We're looking at our history, and we're planning for the long term, the very long term.
Today, Plain Vanilla ASCII can be read, written, copied and printed by just about every simple text editor on every computer in the world. This has been so for over thirty years, and is likely to be so for the foreseeable future. We've seen formats and extended character sets come and go; plain text stays with us. We can still read Shakespeare's First Folios, the original Gutenberg Bible, the Domesday Book, and even the Dead Sea Scrolls and the Rosetta Stone (though we may have trouble with the language!), but we can't read many files made in various formats on computer media just 20 years ago.
We're trying to build an archive that will last not only decades, but centuries.
The point of putting works in the PG archive is that they are copied to many, many public sites and individual computers all over the world. No single disaster can destroy them; no single government can suppress them. Long after we're all dead and gone, when the very concept of an ISP is as quaint as gas streetlamps, when HTML reads like Middle English, those texts will still be safe, copied, and available to our descendants.
The PG archive is so valuable, yet free and easily portable, that even if every current PG volunteer vanished overnight, people around the world would copy and preserve it.
If the ZIP format loses popularity, and is replaced by better compression, it will be easy to convert the zip formats automatically (and we post all plain-text files in unzipped format as well). If hard drives are replaced by optical memory, it will be easy to copy the files onto that. If even ASCII is superseded by Unicode or one of its descendants, it will be possible for our grandchildren to convert it automatically (and ASCII is included in Unicode anyway).
By contrast, many of us have files saved in proprietary formats from word-processors only 5 or 10 years old that are already impractical for us to read. Some of our files produced just a few years ago using non-ASCII character sets like Codepage 850 are already giving problems for some readers. Some eBook reader formats launched within the last few years are already obsolete. We have learned from that experience.
We also encourage other open formats based on plain text, like HTML and XML, and even occasionally not-so-open ones when simple formatting isn't enough, but plain text and ASCII is the only format and character set we're sure of in a rapidly-changing technological landscape.
Please see also the FAQ [F.1] "What formats does Project Gutenberg publish?" for more detailed discussion of formats.
Readers' FAQ
About Finding eBooks:
R.1. How can I find an eBook I'm looking for?
For PG books, the simplest way is to go to the home page at <>, type the Author or Title into the search form, press the "Search" button, and follow the choices.
As of late 2002, there is a full-text search available at <> where you can search not only for titles and authors, but any words or phrases you want to look up. For example, entering "Ample make this bed" and running an "entire books" search for all words leads you to Poems Of Emily Dickinson, Series Two.
R.2. Can I get a complete list of Project Gutenberg eBooks?
Yes. There are two main options:
GUTINDEX.ALL is the raw list of files posted. You will find it at: <>
PGWHOLE.TXT is the list of files cataloged. A Zipped version is: <>
When we post a book, the posting information contains title and author, eBook number, base filename and schedule year and month. This raw information goes into GUTINDEX.ALL.
After posting, our catalogers get to work and add more information: things like full title, subtitle, author birth and death dates, Library of Congress Classification, full filenames and sizes. When a book has been cataloged, it is entered onto the website database so that you can search for it. PGWHOLE.TXT is a summary of the books in the website database.
People who want to bypass the search on the website and find books themselves will probably want to use GUTINDEX.ALL, since it doesn't wait for the cataloging.
R.3. How can I download a PG text that hasn't been cataloged yet?
In short, just browse to:
choose the schedule year of the text (newly-posted texts will usually be in the latest year) and look down the list to find the filename you're looking for.
In general, you need to know:
a) the address of an FTP site
b) the schedule year of the text you want
c) the basename of the text you want.
The fastest and safest FTP site to use for this is <>, which is the first of our two primary posting sites (the other is <>). We post to these two sites, and then other sites copy from them at intervals, so with any FTP sites other than these two, the file may not be available immediately.
You can get the schedule year and basename of the text from its line in GUTINDEX.ALL. Let's take an example. The file
Mar 2004 The Herd Boy and His Hermit, by C. M. Yonge [#32][]5313
has been posted just a few hours ago as I write this. From the GUTINDEX entry, the schedule year is 2004, and the basename of the text is hrdbh.
We divide our texts into directories (folders) based on the schedule year, so this eBook will be in the directory for 2004, which will be named something ending in /etext04. All the directories are named etext plus the last two digits of the year. (Somebody's going to have to change that convention in about 87 years from now! :-) We currently have directories starting at 90, running through the 90s and then 00, 01, 02, 03, 04. All eBooks produced before 1991 are in the /etext90 directory, so if you're looking for
Dec 1971 Declaration of Independence [] 1
Aug 1989 The Bible, Both Testaments, King James Version [] 10
you should look in /etext90.
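The directory rule just described is simple enough to sketch in a few lines of Python. This is only an illustration of the convention as stated in this FAQ; the function name schedule_directory is made up for the example and is not part of any PG tool.

```python
def schedule_directory(year: int) -> str:
    """Map a schedule year to a directory name such as 'etext04'."""
    if year < 1991:
        # Everything produced before 1991 shares the first directory.
        return "etext90"
    # Otherwise: "etext" plus the last two digits of the year.
    return f"etext{year % 100:02d}"


for year in (1971, 1989, 1997, 2004):
    print(year, schedule_directory(year))
```

The two-digit suffix is exactly why the convention cannot survive past 2089 without a change.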
As it happens, ibiblio supports both HTTP (web) and FTP access to the text, so we can just browse to
and choose the 2004 directory from there.
If you want to automate this, you could also use the more direct address
The equivalent address for is
Either way, we see a long page of files, in alphabetical order. Scroll down to the "H"s and look for hrdbh. We see four files with this basename:
hrdbh10.txt hrdbh10h.htm
This means that both plain text and HTML formats are available, and you can choose to download them either zipped or uncompressed. For more detail about conventions for filenames, see the FAQ "What do the filenames of the texts mean?" [R.35]. The main thing you need to know is that any file beginning with hrdbh is some format or edition of this book.
Finally, all you have to do is click on the format you want to download.
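If you have saved a directory listing and would rather not scroll, the "any file beginning with the basename" rule can be applied with a short script. The sketch below is Python; the function name and the sample listing are assumptions made for illustration (hrmit10.txt is an invented filename, not a real posting).

```python
def files_for_basename(listing, basename):
    """Return the names in `listing` that start with `basename`, sorted."""
    prefix = basename.lower()
    return sorted(name for name in listing if name.lower().startswith(prefix))


# A made-up directory listing; only the hrdbh names come from the example above.
listing = ["lsusn10.txt", "hrdbh10h.htm", "hrmit10.txt", "hrdbh10.txt"]
print(files_for_basename(listing, "hrdbh"))
```

Matching is done on the prefix only, so every format and edition of the book is returned together.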
R.4. You don't have the eBook I'm looking for. Can you help me find it?
Sorry, no. We can suggest (see below) some other places to look for publicly accessible books on the Net, but we can't do the search for you.
R.5. Where else can I go to get eBooks?
The On-Line Books Page <> and the
Internet Public Library at <> are two sites that
specialize in creating a list of all books on-line from any source.
Searching them is a good place to start.
If you're looking for commercial books, like current textbooks or bestsellers, you're not likely to find them here, since recent books are not in the public domain. For these, you should look for commercial booksellers on the Net; any search engine will direct you to some if you enter search terms like "shop ebook".
R.6. I see some eBooks in several places on the Net. Do different people really re-create the same eBooks?
It does happen, but mostly by accident. Anyone experienced in eBook creation will first search the usual places to see whether anyone else has already transcribed the book they're interested in. If it has been transcribed, they will not duplicate the effort.
Etexts that are in the public domain very often float around the Net for years: stored in a gopher server here, posted to Usenet there, held on someone's local computer for a year or two and then reformatted as HTML and uploaded to a web site somewhere else. And this is good, because we want texts to be copied as widely as possible.
Public domain eBooks are fair game for anyone to copy, correct, mark up, package and post: that's what being in the public domain means.
Project Gutenberg eBooks are often quickly copied and reformatted, and posted on other sites like Blackmask at <>.
If you find an eBook in many different places, the odds are good that it came from one original source, and was copied around.
It does sometimes happen that people duplicate the transcription of books already made into text. Sometimes it's because they didn't find the version already made. Sometimes they have a different edition, and want to transcribe that. Mostly, though, we all try not to do more work than we have to.
About Using the Web Site:
R.7. Why couldn't I reach your site? (or: Why is your site slow?)
This isn't common, but it happens. Project Gutenberg is a very busy site, probably one of the busiest non-commercial sites on the Web, and sometimes the amount of traffic causes a slowdown.
There may also be a bottleneck somewhere else between you and the site. If at first you don't succeed, don't tell us, just try, try again. The correct address is either:
R.8. I get an error when I try to download a book.
We do not keep e-text files on this site. Instead, many FTP sites throughout the world hold the whole Project Gutenberg archive of texts. An FTP site is just a computer on the Internet that specializes in holding files for download and sending them to people on request. You can find a list of FTP sites that hold Gutenberg texts at <>.
When you're searching or browsing for titles and authors, you're on this Project Gutenberg site, but when you click on the book to download it, you are connected to an FTP site. At the time you click on the filename, your browser contacts an FTP site and tries to download the file from there. If you get an error, it could be because the FTP site is busy, or because there's a network traffic bottleneck between you and that FTP site, or because the text you're looking for is missing from that FTP site.
Usually, the easiest solution is to choose another FTP site to download your text from. Go to the Search page, choose a different FTP site, and search again for your text.
Tip: You should always try to choose the FTP site closest to you. Not only are you helping to minimize Net traffic by choosing a nearby site, but your file will download faster!
If all else fails, note the year and the filename of the book you want, choose an FTP site from this list and click on one of them. Then browse your way through the listings to the file you want.
For example, if you find "Lady Susan" by Jane Austen, you will see that it was published by Gutenberg in 1997, and its filename is lsusn10.txt, so browse to one of the FTP sites, choose the directory called etext97 and click (or right-click and Save, depending on your browser) on the file lsusn10.txt.
R.9. I searched for a book I know is in Project Gutenberg, but got no results.
First go to the Advanced Search page. Sometimes you may miss in searching because of alternative spellings, so try searching separately using just one word in Author or Title. Read the Search Tips.
If that fails, you can Browse through the site catalog. Let's say you're looking for "The Wandering Jew" by Eugene Sue.
Go to the PG Home page: <>
Once on this page, click on: "Browse" in "Browse by Author or Title"
You are then brought to a new page, asking you to select an "FTP site". Further details on how and why to choose an "FTP Site" are available on this page.
Select an FTP Site from the Selection List available at the bottom of the page, then click on "Select".
You get a new page. Click on "S", the initial for "Sue, Eugene".
You should now see a list of all of the Authors whose Last name starts with "S". Scroll down till you find the direct links to the Sue, Eugene works.
Click on the work you are interested in, then click on the file link found on the page you are brought to (Etext Card ID -3987- when selecting the work as above).
On this page, above the teaser, there are two working links:
· es12v10.txt - 2.95 MB
· - 1.10 MB
Click on the link of your choice in order to get the book.
If you can't find your text either way, the book has not been cataloged. The site catalog always lags behind the postings, since we need to collect extra information about the book and the author before it goes into the full catalog. If you know that the book has been posted recently, and maybe hasn't made it into the catalog yet, read the FAQ "How can I download a PG text that hasn't been cataloged yet?" [R.3].
If even this doesn't help, don't despair! We don't have it, but it may be elsewhere on the Web. Go to the major search engines and try there. You can also try looking in the Book Search section of The On-Line Books Page <> or the Internet Public Library <>, and if you have no luck with that, you might be able to find it listed as being In Progress somewhere on their Books In Progress and Requested page at <>.
R.10. Can I copy your website, or your website materials?
Keeping the PG site updated with the latest e-text releases is an ongoing job, and our experience is that people, however well-intentioned, do not keep copies up to date. We want there to be one clear source for people seeking the latest Project Gutenberg information, and we think that having a lot of out-of-date copies and partial copies scattered around the net would be a bad thing.
We welcome mirrors and copies of our e-texts, in new FTP sites [R.14], but the main web site itself is copyrighted and may not be copied.
R.11. Your site doesn't look right in my browser.
I clicked on a button, and nothing happened.
We take a lot of trouble to ensure that our website uses only valid, standard HTML, and we're not even slightly tempted to use glitzy features that look good in one browser but don't work in another, so we can promise you that our site is not the problem.
The site uses Cascading Style Sheets (CSS), a W3C standard since 1996. Some older browsers have a buggy implementation of CSS, and this can cause some things to appear off-kilter. If your browser is even older, or doesn't know about CSS at all (as in the case of Lynx, for example), it should have no problem.
If you actually clicked on a button, like the Search button or the Post button on the Volunteers' Web Board page, and nothing happened, you might be behind a proxy or web filter that doesn't like you making POST requests. If you have a web filter switched on, turn it off, reload the page and try again.
R.12. What does that thing about "Select FTP Site" mean?
Our texts are not actually held on the website. The website just holds an index; the files themselves are held on many sites throughout the world, called FTP sites. When you have found the book you're looking for, and you make that final click to get it, you're not actually talking to our website any more; you are transferred to the FTP site you selected. Some FTP sites are near you; some are far away. Some may be faster than others, even if they are about the same distance; some may have temporary technical problems.
You should usually select the FTP site nearest you. If you find you're having problems with that one, you can select another.
R.13. What exactly is an FTP site anyway?
FTP stands for File Transfer Protocol, one of the oldest and most reliable protocols of the internet. This is the method by which a file can be copied from one computer to another.
An FTP site, or FTP server, is a computer that holds files that people can upload and download. In the case of PG, the Posting Team upload our texts, when they're ready, to two main FTP servers, <> and <>, which serve as our master copies.
Other FTP sites around the world automatically download the files from these master sites, so they have a full set of PG publications for you to download. Because they only check for updates and new files at intervals, some FTP sites may be a day or two behind. Some FTP sites don't have space available for everything, so they may hold only the zipped versions of the files. But most FTP sites will have the entire PG collection. These are called FTP "mirrors", since they are a copy of the original.
Many FTP sites exist that offer a full PG mirror but are not on our FTP sites list. Commonly, these are in schools, where they serve the local students, but don't have enough bandwidth to offer downloads to worldwide users.
R.14. Can I become an FTP mirror?
Yes! We're always looking for more FTP mirrors.
If you manage an FTP site with a few GB of space, please check our Contact Information page <> and contact the appropriate person, who will make the arrangements for you. If space is a problem, you can consider holding only zipped copies of the texts. We can move you up or down the FTP site list as you want more or less traffic.
R.15. Can I make a private FTP mirror for my school, library or organization?
We like all FTP mirrors to be open to as many people as possible, but we know that not all schools have the resources to be a public mirror, so we welcome all mirrors.
And anyway, you don't even have to ask, because we don't control what happens to our texts once we post them!
R.16. When I clicked on the file I want, nothing happened.
When you select a file for download, your request goes to the FTP site you selected, not to our website. If the FTP site you selected is having problems, or if there is the Net version of a traffic jam between you and it, you may have problems downloading.
Select a different FTP site [R.12] and try again.
R.17. How many texts are downloaded through the web site?
We don't really do statistics, but in one particular month for which we did, we had a figure of about 800,000 searches completed. Since the final request for download goes to the FTP site selected and not to our website, we can't confirm that all of these were actually downloaded, but we expect that most people who have gone all the way through the search will finish the job.
In another month, we had about 1,000,000 downloads of files from our main FTP site. This does not count downloads from other FTP sites, of course. Why are there more downloads than searches? Because people who are already familiar with getting PG texts can skip the website search and download straight from the FTP sites.
R.18. What are the most popular books?
We very rarely do statistics, but on one occasion in late 1999 when we did, we found the top author searches to be:
1 shakespeare
2 poe
3 doyle
4 melville
5 dante
6 joyce
7 shaw
8 christie
9 conrad
10 porter
11 verne
12 hemingway
13 darwin
14 miller
15 woolf
16 zola
17 king
18 eliot
19 churchill
20 smith
21 twain
and the top individual books searched for to the point of downloading were:
1. Lady Susan, by Jane Austen
2. 1st PG Collection of Edgar Allan Poe
3. The Adventures of Sherlock Holmes, by Arthur Conan Doyle
4. Moby Dick, by Herman Melville
5. A Christmas Carol, by Dickens
6. The King James Bible
7. Twelve Stories and a Dream, by H.G. Wells
8. Stories by Modern American Authors
9. Lock and Key Library, Magic & Real Detectives
10. [Hans Christian] Andersen's Fairy Tales
11. The Legend of Sleepy Hollow, Washington Irving
These numbers vary a lot. When a movie based on a classic is released, downloads of that eBook go through the roof!
About Downloading and Using Project Gutenberg eBooks:
R.19. Should I download a ZIP or a TXT file?
If you know how to unzip a file, then downloading the zip is faster. For some non-text eBooks that contain multiple files, like HTML with included images, only a zip file may be available. For some other formats, like MP3 or MPEG, there may not be a zipped version available because the native format of the file is already compressed enough that zipping it doesn't save much.
R.20. I've got a ZIP file. What do I do with it?
Unzip it.
If you want a free program, you could try the open source Info-Zip
software available at
<> for Mac, MS-DOS,
Unix, Windows and just about everything else you might have.
If you want a commercial program, PKZIP from <> and WinZip from <> are among many popular shareware utilities that allow you to unzip files.
Mac-users using Stuffit Expander may like to set a preference (File / Preferences / Cross Platform) to "Convert text files to Macintosh format. . . When a file is known to contain text". This gets rid of strange characters (linefeeds), which are not wanted on a Mac, at the beginnings of lines. MacZip is another free program for Macs. Mac users can also try ZipIt or other shareware programs available from the Info-Mac archives, e.g. from <>.
R.21. I tried to unzip my file, but it said the file was corrupt, or damaged.
The chances are that it didn't download correctly. Try downloading it again. If you don't succeed the second time, try downloading the unzipped version.
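If you are comfortable running a script, you can also confirm that a download is damaged before fetching it again. The following is a minimal sketch using Python's standard zipfile module; the function name check_zip and the wording of its messages are assumptions made for this example.

```python
import sys
import zipfile


def check_zip(path):
    """Return 'ok', or a short description of what is wrong with the archive."""
    if not zipfile.is_zipfile(path):
        # Typical of a download that was cut short or saved as a web page.
        return "not a zip file"
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() reads every member and returns the first bad name, if any.
            bad = archive.testzip()
    except zipfile.BadZipFile:
        return "damaged archive"
    return "ok" if bad is None else f"damaged member: {bad}"


if __name__ == "__main__":
    # Usage: python check_zip.py file1.zip file2.zip ...
    for name in sys.argv[1:]:
        print(name, "->", check_zip(name))
```

Anything other than "ok" means the advice above applies: download the file again, or take the unzipped version.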
R.22. I see gibberish onscreen when I click on a book.
To save download time, our etexts are stored in zipped form as well as text form. Zipped files are smaller, and take less time to transfer to your computer, but you need a program to unzip them. If you try to view a zipped file directly, it looks like gibberish.
You can recognize zipped files easily because their filenames end in .zip.
If this happens, either make sure you're asking your browser to Save the file rather than display it (often, you right-click the file and choose Save) or else click on the version of the file that ends in .txt instead of .zip. You don't need a zip program to view .txt files.
Looking at a zip rather than a text file is by far the most common reason for this problem, but there are some others. If you're quite sure that you're not looking at a zip file, then it could be that the file you downloaded is in a character set that your viewer doesn't recognize, like Big-5 [V.78] for Chinese texts, or Unicode [V.77]. If this is the case, you will have to find a viewer that works on your computer for the specified character set. We may also have an ASCII version of the same text available for you; we do try to have ASCII versions for everything [G.17], but some languages, like Chinese, just cannot be sensibly expressed in ASCII.
If you can see most of the characters, enough to be able to make out the text, but there are regular gibberish characters, black squares, empty boxes or obviously missing characters scattered about through words, then you are probably looking at an "8-bit" text [V.79], with accented characters, and your viewer doesn't handle the character set. See the FAQ "I can read the text file, but a few characters appear as black squares, or gibberish" [R.31].
If there are a very few gibberish characters, black squares or obviously missing characters in the text, then it's likely that this was intended to be a 7-bit text, but a few 8-bit characters like the British pound symbol or accented letters slipped through.
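To tell these cases apart, it can help to count where the 8-bit characters actually are. The Python sketch below works on the raw bytes of a file, so it does not need to know the character set; the function name non_ascii_lines and the sample data are made up for illustration.

```python
def non_ascii_lines(data: bytes):
    """Return (line_number, count) pairs for lines containing bytes above 127."""
    report = []
    for number, line in enumerate(data.splitlines(), start=1):
        count = sum(1 for byte in line if byte > 127)
        if count:
            report.append((number, count))
    return report


# Sample bytes: an e-acute and a pound sign as a Latin-1 viewer would expect them.
sample = b"plain line\r\ncaf\xe9\r\n\xa3 5 reward\r\n"
for number, count in non_ascii_lines(sample):
    print(f"line {number}: {count} non-ASCII byte(s)")
```

A handful of hits in a long file suggests a 7-bit text with a few strays; hits on most lines suggest an 8-bit text that needs a viewer set to the right character set.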
R.23. Can I download and read your books?
Yes. That's what Project Gutenberg is all about: making texts available free to everyone!
R.24. What am I allowed to do with the books I download?
Most Project Gutenberg e-texts are in the public domain. You can do anything you like with these: you can re-post them on your site, print them, distribute them, translate them to other languages, convert them to other formats, or redistribute them in unchanged form. However, if you distribute versions under the Project Gutenberg trademark, we do impose some conditions, which are explained in the header and/or footer in each text.
Some Project Gutenberg e-texts have copyright restrictions. You can still download and read these, but you may not be allowed to reproduce, modify or distribute them. When browsing or searching on the site, you will see these copyright-restricted texts indicated in the listings. For fuller information about them, download the e-text and read the header or footer of the file, which will spell out the conditions in detail.
R.25. Does Project Gutenberg know who downloads their books?
No, and we don't want to!
Like any Internet transfer, our sites have to know the IP addresses that contact them; without that, no communication is possible. But we do not trace, hold or examine them beyond what is necessary to deal with any problems or maintain logs or statistics. We never identify IP addresses with people.
Further, we encourage people, sites, schools around the world to mirror, or copy, our texts to their sites. Once that happens, we have no control over them, and we never have any idea who or even how many people access them after that.
Even further, we encourage people to distribute the texts on disks, CDs, paper, and any other storage format they can find. We encourage them to convert the texts to other formats, and share them.
For most people reading this, anonymity is probably not an issue, but you may live in a place or time where reading Paine, or Voltaire, or the Bible, or the Koran, is considered suspicious or even subversive. We don't know who you are, and what we don't know, we can't tell.
Currently (mid-2002), by means of DRM (Digital Rights/Restrictions Management) many commercial publishers can make a list of exactly who is reading which of their eBooks. We don't know, and we don't want to know.
R.26. I've found some obvious typos in a Project Gutenberg text.
How should I report them?
The first thing to remember is that the people who actually make the corrections you suggest are very experienced, and are used to seeing lots of different types of errata reports. So the exact format of your report isn't really very important; just get the report to us in any clear form that we can understand.
Beyond that, here are some tips to avoid misunderstandings.
It's always helpful if you report the full title, etext number, year and filename of the text you are correcting. We have multiple editions and versions of some texts, like Homer's "Odyssey", and unless you tell us exactly what text you mean, we may have to spend some time searching and guessing.
Especially, please check and report the exact filename of the text. It is amazingly common for people to report problems with abcde10.txt, when abcde11.txt is already posted, and has these and other errors already fixed.
When there are only a few errors, it's usually easiest to cut and paste the line or lines where the error is into your e-mail, with your comment.
It can also be useful to give the line number of the place where the error is, and some people who check texts regularly do this. If this seems natural to you, do it; if it doesn't, don't.
An ideal report for a typical errata list might look like:
Title: The Odyssey, by Homer
Translated by Butcher & Lang
April, 1999 [Etext #1728]
File: dyssy08.txt
Line 884:
back Telemachus, who bas now resided there for a month.
"bas" should be "has"
Line 1491:
Ithaca yet stands. But I wouldask thee, friend, concerning
"would" and "ask" are run together here
Line 1563:
in his father's seat and the elders gave place to him
This is the end of a paragraph, and needs a period at end.
Line 15346-7:
'Hearken to me now, ye men of Ithaca, to the
will say. Through your own cowardice, my friends, have
I think there is something missing between "the" and "will"
But the following would get the job done as well:
In Homer's Odyssey, translated by Butcher and Lang, from /etext99,
file dyssy08.txt, I found the following errors:
Telemachus, who bas now resided
change "bas" to "has"
But I wouldask thee,
"would ask" run together
and the elders gave place to him
needs period
ye men of Ithaca, to the will say.
line missing between "the" and "will"?
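If you do want line numbers in a report like the first one but your editor doesn't show them, a few lines of Python can look them up from the text you pasted. This is only a sketch: the function name find_lines is made up, and the sample text stands in for the real file.

```python
def find_lines(text: str, snippet: str):
    """Return the 1-based numbers of every line that contains `snippet`."""
    return [
        number
        for number, line in enumerate(text.split("\n"), start=1)
        if snippet in line
    ]


# A three-line stand-in for the real etext.
sample = (
    "first line of the file\n"
    "back Telemachus, who bas now resided there for a month.\n"
    "last line of the file\n"
)
print(find_lines(sample, "who bas now"))
```

Searching for a short, distinctive phrase works best, since a phrase that spans a line break will not be found.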
Where there are more than a few changes, it may be easiest all round just to submit a corrected version of the file. However, if you do this, please do not re-wrap the paragraphs unless it is really necessary; we need to check your suggestions before reposting, and if the file is very different, it is difficult and time-consuming for us to find your real changes among all of the changes in the lines.
R.27. I've found some obvious typos in a Project Gutenberg text.
Who should I report them to?
The Posting Team, who post the books, also make the corrections, and ultimately, the corrections need to go to them.
Many producers put their e-mail addresses in their texts, specifically so that readers can contact them when errors are found. If you see that in your text, you should try to contact the producer first. This is especially true if the corrections aren't obvious, as in the case of missing words. The producer is likely to have the original book, and will probably be able to confirm your corrections without visiting a library. If the book needs the corrections, the producer can then notify the Posting Team.
If you get no response from the producer, or if there is no e-mail address listed, or if the corrections are small and obvious, you can send them to any or all of the Posting Team directly.
R.28. I've reported some typos. What will happen next?
This varies wildly. Sometimes, you may just get a response e-mail in a day or three saying thanks, and that we've fixed the typo. This is normal when you've just reported one or a few obvious typos.
Where there is some text missing, or the changes you suggest are otherwise not obvious, we may have to find someone with an eligible copy of the book to confirm the changes, and that might take time. Normally, you will get an e-mail explaining that within a week.
Sometimes, even though you've noticed only one or two small typos, one of the Posting Team who was looking at it may find many more, and decide that the whole text needs to be re-proofed. This may also take time.
If the text needs a lot of changes, we may post a new EDITION [R.35] of it, with a new filename: e.g. abcde10.txt may become abcde11.txt. In this case, you will receive a copy of the e-mail sent to the posted list announcing the new file. Our current rule of thumb is that we create a new edition when we make twelve significant changes, but we judge each on a case-by-case basis, and especially will usually not make a new edition if the original was posted recently.
R.29. I've got the text file, and I can read it, but it seems to be double-spaced or it has control characters like ^J or ^M at the end of every line.
This is most often seen on Mac or Linux. If you want to dig into why this effect happens, see the FAQ "Why use a CR/LF at end of line?" [V.85].
Perhaps viewing it in a different editor or viewer will help, but it's usually easiest just to globally replace all of the control characters (if you see them) with nothing, or to replace all double line-ends with single line-ends.
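If you would rather script the clean-up than do it in an editor, the following is a minimal Python sketch of the same idea. It assumes the stray control characters are the Carriage Returns of CR/LF pairs; the function name normalize_line_endings is ours, not part of any PG tool.

```python
def normalize_line_endings(data: bytes) -> bytes:
    """Convert CR/LF pairs and bare CRs to a single LF."""
    # Handle the CR/LF pair first, so that it does not turn into
    # two line-ends (which is what shows up as double spacing).
    data = data.replace(b"\r\n", b"\n")
    # Any CR still left is a bare one (old Mac style); make it an LF too.
    return data.replace(b"\r", b"\n")


if __name__ == "__main__":
    sample = b"First line\r\nSecond line\r\n\r\nNew paragraph\r\n"
    print(normalize_line_endings(sample))
```

To use it on a real file, read the file in binary mode, pass the bytes through the function, and write the result to a new file, keeping the original until you have checked the output.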
R.30. When I print out the text file, each line runs over the edge
of the page and looks bad.
If you have a file ending in .txt from Project Gutenberg, it is
usually formatted with about 70 characters per line, and with a
Carriage Return/Line Feed pair (also known as a "Hard Return" or a
"Paragraph Mark") at the end of every line.
This is the most widely accepted format for text files, but it's not ideal on all computers and all programs. 70 characters per line means that if you are using an unusually large or small font to print it, lines may wrap around or not reach across the page. The hard return means that on some systems, the lines may appear double-spaced.
Unfortunately, we can't advise you how best to format texts on all systems, mostly because we don't know every system! Here are a couple of tips you might try:
If your font is too big or too small, try setting the font to Courier size 10 or Times size 12. It may not be ideal, but it mostly works.
In a word processor, you may be able to remove the Hard Returns, but
beware! if you remove too many, the whole text will become one
paragraph. One common formula for removing the HRs goes like this:
1. First, all paragraphs and separate lines should be separated
by two HRs, so that you can see one blank line between them.
Where they aren't, as in the case of a table of contents or
lines of verse, add the extra HRs to make them so.
2. Replace All occurrences of two HRs with some nonsense character
or string that doesn't exist in the text, like ~$~.
3. Replace All remaining HRs with a space.
4. Replace your inserted string ~$~ with one HR.
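Steps 2 to 4 of this formula can also be sketched as a short Python function, for anyone more comfortable with a script than a word processor. This is only an illustration under some assumptions: the function name unwrap_paragraphs is ours, the marker ~$~ is the one suggested above, and any run of two or more line-ends is treated as one paragraph break. Step 1 (adding blank lines around verse or a table of contents) still has to be done by hand first.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines, leaving one line-end between paragraphs."""
    # Work with plain LF line-ends and ignore blank lines at either end.
    text = text.replace("\r\n", "\n").strip("\n")
    marker = "~$~"  # nonsense string that must not already be in the text
    if marker in text:
        raise ValueError("marker occurs in the text; choose another one")
    # Step 2: protect the paragraph breaks (two or more line-ends in a row).
    text = re.sub(r"\n{2,}", marker, text)
    # Step 3: every remaining line-end becomes a space.
    text = text.replace("\n", " ")
    # Step 4: put one line-end back where each paragraph break was.
    return text.replace(marker, "\n")


if __name__ == "__main__":
    wrapped = "It was a dark\nand stormy night.\n\nThe rain fell\nin torrents.\n"
    print(unwrap_paragraphs(wrapped))
```

As with the manual method, lines of verse that are not separated by blank lines will be run together, so check the result before printing.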
R.31. I can read the text file, but a few characters appear as black
squares, or gibberish.
The text is using some character set that your editor or viewer isn't.
For example, the text is using ISO-8859-1, and your viewer is using
Codepage 850—or vice versa. You can see the plain ASCII characters,
but non-ASCII characters like accented letters display as nonsense.
Look at the top of the file for a clue to the character set encoding: if it's there, it may help you to find which editor, or font, or viewer you should be using.
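To see the effect for yourself, here is a small Python sketch that decodes the same bytes under several candidate character sets. The function name preview_decodings and the sample word are our own illustration; the codec names "ascii", "iso-8859-1" and "cp850" are the standard Python names for those encodings.

```python
def preview_decodings(raw, encodings=("ascii", "iso-8859-1", "cp850")):
    """Return {encoding name: decoded text, or None if decoding fails}."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # These bytes are not valid in this character set.
            results[name] = None
    return results


if __name__ == "__main__":
    # "cafe" with an acute accent on the e, stored as ISO-8859-1 bytes.
    raw = "caf\u00e9".encode("iso-8859-1")
    for name, text in preview_decodings(raw).items():
        print(name, "->", text)
```

The plain ASCII letters come out the same under both 8-bit character sets; only the accented letter changes, which is exactly the symptom described above.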
R.32. Can I get a handheld device for reading PG texts? Which device should I get?
To read eBooks on a handheld, you need three things: the eBook content itself (which you can get from PG and other sites), a device (which I will sometimes call a PDA, even though technically, the RocketBook isn't a PDA) and the reader software that runs on the PDA.
In mid-2002, there are three main families of handheld devices people use for reading eBooks: Palms, Pocket PCs and RocketBooks (or their successor, REB1100s). In general, it is possible to use any of these in combination with any common type of personal computer.
Palms are very common, especially when you count not just the Palm itself, but PalmOS-based devices from other manufacturers, like the Franklin eBookman, the Handspring Visor and the Sony Clie.
Because of the number of makers of PalmOS-based devices, you can buy them with lots of combinations of features: color screen, audio, different memory sizes. Of course, Palms have other applications besides eBook reading. Palms are the smallest and most portable of the three classes, and tend to have the best battery life for travelling, but they also have the smallest screen. Just about all reader software will run on Palms, except the Microsoft Reader, which runs only on Pocket PCs, but you don't need the Microsoft Reader for Project Gutenberg eBooks.
In Pocket PCs, the Compaq iPaq is by far the most common in mid-2002.
More expensive and bulkier than a Palm, it does have a bigger screen.
Like the Palms, it can perform many functions besides reading eBooks.
Only Pocket PCs can support the Microsoft Reader, but this is not
necessary for reading Project Gutenberg eBooks.
The RocketBook, and its successor the Gemstar REB1100, are quite different from the others. These were built specifically for reading eBooks, and do not have additional functions. They are not, technically, PDAs. Their screens are bigger, and excellent for reading, but do not offer color. They also don't offer a choice of readers; the dedicated reader is built in to the device. Both of them require the eBooks you load to be formatted for their reader, and files made for them usually have the extension .rb for RocketBook. The REB1100 does not come with the RocketLibrarian, which is the program you run on your PC to turn an etext into a RocketBook file, but people are still making .rb files, and the RocketLibrarian is still available and popular among an enthusiastic group of Rocket users. (The REB1200 is entirely different from the REB1100, and, as far as we know, PG etexts cannot easily be transferred to it.)
In summary, the Rocket/REB1100 is a dedicated reader, with a good screen, but limited to what it does.
Palms are relatively cheap and common, with a wide range of options, and the capacity to function as PDAs as well. They can run all common readers except the Microsoft one.
The iPaq has a good color screen, but is
bulkier than a Palm, and can run lots of readers, including the
Microsoft one, but not all Palm readers are available for Pocket PC.
Like Palms, the iPaq can do other jobs besides displaying eBooks.
Different people make different choices among these for reading their eBooks, and they all work well; it's a matter of personal taste.
R.33. How can I read a PG eBook on my PDA (Palm, iPaq, Rocket . . .)?
To read a book on your PDA, you need to get the file into a format that your reader software understands. Each PDA reader program will work only with a specific format of file. Some will read several formats, but, in general, it's a jungle of competing options.
Unless you use a Rocket or REB1100, you will need to install at least one reader program, and many veteran readers install two or three to deal with different formats. There are many of them available. In a recent internal poll of Gutenberg volunteers who use PDAs, C Spot Run, Mobipocket, PalmReader and Plucker were our favored choices for reader programs.
Further, the process may be different depending on which reader software you're using. Each format that a reader understands has one or more converter programs that run on your PC, and turn the plain text file into that format. So in general, you have to:
1. Download the PG text
2. Edit the text for the layout the converter wants (often HTML).
3. Use the converter to create a file of the format the reader wants.
4. Transfer the converted file to your PDA.
If all this sounds too complicated, remember that many people take and convert PG texts into many formats, and offer them for download from their sites. Of course, there is no guarantee that someone will have converted the particular eBook you want, but there are lots of options. Try Blackmask, which lists thousands of texts already converted for Mobipocket, iSilo, RocketBook and the Microsoft Reader.
There are many other sites that serve pre-converted PG texts.
MemoWare is also a useful resource for converted eBooks, and has lots of information, including an excellent map of the readers and formats jungle.
Tecriture hosts a service that downloads and converts PG texts on the fly, and delivers them straight to you.
If you're "rolling your own", you'll probably need to convert our plain texts to HTML at some point, because a lot of converters require HTML as input, and this is a common theme in readers' explanations of how they get texts onto their PDAs. Don't panic! You don't have to be an HTML wizard to do this; in fact, you don't need to know anything about HTML at all! Usually, it's just a matter of removing some line ends and Saving As HTML. You won't get a lot of fancy markup, or images out of thin air, but you will get the book.
One of the main things you usually have to do in making HTML is unwrap the lines. If you're making your HTML manually, this is usually done by replacing two paragraph marks with some nonsense marker like @@Z@@, replacing all single paragraph marks with a space, and replacing the nonsense marker with a paragraph mark. After unwrapping, the text can just be Saved As HTML.
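For readers who prefer a script to a word processor, the unwrapping and the "Save As HTML" step can be approximated in a few lines of Python. This is a rough sketch and not how any of the converters mentioned here actually work: the function name text_to_html is ours, it splits on blank lines rather than using a nonsense marker, and it produces only bare paragraphs with no other markup.

```python
import html
import re


def text_to_html(text: str, title: str = "Untitled") -> str:
    """Turn hard-wrapped plain text into a very simple HTML page."""
    text = text.replace("\r\n", "\n").strip("\n")
    # A run of two or more line-ends marks the end of a paragraph.
    paragraphs = re.split(r"\n{2,}", text)
    body = []
    for para in paragraphs:
        # Unwrap: join the lines of one paragraph with single spaces.
        joined = " ".join(line.strip() for line in para.split("\n"))
        # Escape &, < and > so they display as text rather than markup.
        body.append("<p>" + html.escape(joined, quote=False) + "</p>")
    return (
        "<html><head><title>" + html.escape(title) + "</title></head>\n"
        "<body>\n" + "\n".join(body) + "\n</body></html>\n"
    )


if __name__ == "__main__":
    sample = "Call me\nIshmael.\n\nSome years ago,\nnever mind how long.\n"
    print(text_to_html(sample, "Sample"))
```

Like the manual method, this runs lines of verse together unless they are separated by blank lines, so it suits prose better than poetry.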
There are some applications that specifically assist with auto-converting text into HTML:
GutenMark was specifically written for the purpose, and knows enough about PG conventions to do a very good job.
InterParse is a Windows-based generic text parser that is very easy and intuitive to use.
The World Wide Web Consortium lists some other options.
If you're using a RocketBook or REB1100, you don't have either the choices or the confusion to deal with. One of our volunteers who uses a RocketBook offered this recipe for getting a PG text onto a RocketBook:
On converting to Rocket:
1. Download text file.
2. Using your utility for showing formatting, enter your word
processing program's edit mode.
3. Replace all double paragraph marks with some nonsense sequence
that can't possibly actually be there, such as @@Z@@.
4. Replace all single paragraph marks with one single space
5. Replace your nonsense sequence with one paragraph mark.
6. Convert all your double spaces to single spaces. Repeat this
until you get "0" for how many replacements were made.
7. Save in HTML.
8. Go into your Rocket Librarian. Use "import file using Rocket
Librarian." Go and pick up the file, which will be automatically
converted to .rb in this process.
This sounds long, but it usually takes me under three minutes except for a very long text. I've never taken longer than five minutes. You can just go in and pick up the text file with Rocket Librarian, but what you get onscreen doing this looks very odd. Steps 2-7 are not essential, and if I'm in a hurry to read something once I might skip them, but if it's something I know I want to keep I use them.
This formula is not ideal for poetry or blank verse: if you want to keep the original line breaks, you should avoid removing the paragraph marks.
Another volunteer, who reads on Mobipocket, offered this suggestion:
I use the MobiPocket Publisher, which is available free. It wants to take an HTML file as input, so the first thing I have to do is convert my PG text to HTML.
I usually do this by running GutenMark. I can also do it in Microsoft Word using the following sequence:
Edit / Replace / Special and choose Paragraph Mark twice (or, from replace, you can type in ^p^p to get two Paragraph Marks) and replace with @@@@. Replace All. This saves off real paragraph ends by marking them with a nonsense sequence.
Now Replace one Paragraph Mark (^p) with a space. Replace All. This removes the line-ends.
Finally, replace @@@@ with one Paragraph Mark. Replace All. This brings back the Paragraph Ends.
Now I can Save As HTML.
GutenMark does a better job of converting to HTML than my simple Word formula, since it recognizes standard PG features, and sometimes Mobipocket doesn't like the HTML produced from Word: it complains of a missing file, or doesn't recognize quotation marks.
Having got my HTML file, I open Mobipocket Publisher, choose "Project Gutenberg", Add the File I created, and just Publish it to MobiPocket .PRC format. Then I pick it up on my iPaq the next time I sync. The whole process takes two or three minutes, and the results, since I discovered GutenMark, are good.
I recently came across InterParse 4. It doesn't have the built-in knowledge of GutenMark, so the results aren't as good, but it's really easy to use, and you can see the effect of your changes onscreen as you do it. For most PG books, all you have to do is just Open the text file and choose Options / Remove all CRLFs (Except at Paragraph End), then Convert / Text to HTML and Save As the HTML filename you want. Quick and painless.
About the Files:
R.34. What types of files are there, and how do I read them?
The vast majority of our files are plain text. You can read these with any editor or text viewer or browser. Some are HTML. You can read these with any browser.
For a full listing of other file types as of mid-2002, and how to read them, please see the Formats FAQ [F.2].
R.35. What do the filenames of the texts mean?
PG files are named for the text, the edition, and the format type.
As of February, 2002, all PG files are named in "8.3" format: that is, up to eight characters, a dot, and three more characters.
The first five characters in the filename are simply a unique name for that text, for example, "Ulysses" by Joyce begins with "ulyss".
If the text has been posted as both a 7-bit and 8-bit text, then the
first character of the filename will be a 7 or an 8, to indicate that.
For example, we have both 7crmp10 and 8crmp10 for Dostoevsky's
Crime and Punishment.
The 6th and 7th characters of the name are the edition number, 01 through 99. We normally start at edition 10 (1.0); numbers lower than that indicate that we think the text needs some more work; numbers higher than that mean that someone has corrected the original edition 10.
The 8th character of the filename, if it exists, indicates either the version or the format of the file. When we get a different version of the text based on a different source, we give it an a, b, c, as for example if the text is from a different translation. Where we have posted a text in a different format, we also add an eighth character: "h" for HTML, "x" for XML, "r" for RTF, "t" for TeX, "u" for Unicode are established formats. There have been some experimental postings with "l" for LIT, and "p" for either PRC or PDB.
So, for example:
7crmp10 is our first edition of Crime and Punishment in plain ASCII
8sidd10 is our first edition of Siddhartha, as an 8-bit text
dyssy10b is our first edition of our third translation of Homer's
Odyssey, in plain ASCII
jsbys11 is our second edition of Jo's Boys, in plain ASCII
vbgle10h is our HTML format of our first edition of Darwin's
Voyage of the Beagle
7ldv110 is our 7-bit ASCII version of the first volume of the
Notebooks of Leonardo da Vinci
To make it worse, we don't always stick to these rules, for example:
1ddc810 is our first edition of the first book of Dante's
Divina Commedia in Italian, as an 8-bit text
80day10 is our first edition of Verne's Around the World in 80 days,
in plain 7-bit ASCII in English.
emma10 is our first edition of Jane Austen's "Emma"—with a
4-character basename instead of 5.
Some series have special, non-standard names. Shakespeare is named with a digit representing the overall source (First Folio, etc), then "ws", then a series number, so for example 0ws2610, 1ws2610 and 2ws2610 are all versions of "Hamlet". The Tom Swift series is named with a two-digit prefix denoting the series number, then "tom", so for example 01tom10 is "Tom Swift and his Motor-Cycle".
And what should we do with a text from a different source that is formatted as HTML? For example, if dyssy10b is the name of the third translation, what should the HTML version be named? dyssy10bh is obvious, but it uses 9 characters.
The problem, of course, is that we are trying to fit a lot of information into an 8-character filename, and as the collection grows, and the number of formats and versions increases, we come across more pressure on filenames, so while the filename is a good guide to the contents, it's not definitive.
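As an illustration of the naming scheme only, the sketch below pulls a filename apart according to the rules described in this answer. It is not a PG tool: the names parse_pg_name, PGName and the field names are our own, and because the rules are not always followed, the leading 7 or 8 is reported only as a hint (80day10, for instance, would be misread).

```python
import re
from typing import NamedTuple, Optional

# Eighth-character format letters listed in this answer.
FORMAT_LETTERS = {
    "h": "HTML", "x": "XML", "r": "RTF", "t": "TeX", "u": "Unicode",
    "l": "LIT (experimental)", "p": "PRC or PDB (experimental)",
}

# Up to five basename characters, two edition digits, one optional letter.
NAME_PATTERN = re.compile(
    r"^(?P<base>[0-9a-z]{1,5})(?P<edition>[0-9]{2})(?P<suffix>[a-z]?)$"
)


class PGName(NamedTuple):
    basename: str
    edition: int              # 10 means edition 1.0, 11 means 1.1, ...
    bits_hint: Optional[int]  # 7 or 8 if the name starts that way; a hint only
    suffix_meaning: str


def parse_pg_name(filename: str) -> Optional[PGName]:
    """Split a name such as 'dyssy10b.txt'; return None if it does not fit."""
    stem = filename.lower().split(".", 1)[0]
    match = NAME_PATTERN.match(stem)
    if match is None:
        return None
    base = match.group("base")
    suffix = match.group("suffix")
    bits_hint = int(base[0]) if base[0] in "78" else None
    if not suffix:
        meaning = "plain text"
    elif suffix in FORMAT_LETTERS:
        meaning = FORMAT_LETTERS[suffix] + " format"
    else:
        meaning = "version " + suffix
    return PGName(base, int(match.group("edition")), bits_hint, meaning)


if __name__ == "__main__":
    for name in ("7crmp10.txt", "dyssy10b.txt", "vbgle10h", "emma10", "dyssy10bh"):
        print(name, "->", parse_pg_name(name))
```

A name such as dyssy10bh does not fit the pattern and yields None, which mirrors the 9-character problem described above.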
R.36. What is the difference within PG between an "edition" and a "version"?
We give the name "edition" to a corrected file made from an existing PG text. For example, if someone points out some typos in our file of "War and Peace", we will fix them, and, if enough are found to warrant a "new edition", then instead of just replacing the file wrnpc10.txt, we may make a new file wrnpc11.txt, and leave the original alone. A new edition is always filed under the same year and etext number as the original; it's just an update.
We give the name "version" to a completely independent e-text made from the same original book, but a different source. For example, Homer's Odyssey was translated by many different people, but they all worked from the same book. The translations by Lang, Butler, Pope and Chapman are very different, but they all come from the same root.
Thus, these are all "versions" of Homer's Odyssey. We give them all the same basename (dyssy), and each gets a new number, but we keep the original basename, and add a letter to the filename to indicate that they are "versions" of the same original book:
dyssy10.txt   Butler's Translation
dyssy10a.txt  Butcher & Lang's Translation
dyssy10b.txt  Pope's Translation
The differences don't have to be as extreme as this for us to create a new version. "Clotelle"/"Clotel", for example, was a book published multiple times in English by William Wells Brown, and each time, he changed the text. We preserve three different texts of the same book as different versions: clotl10, clotl10a and clotl10b.
R.37. What is the difference between an "etext" and an "eBook"?
If there is any, it seems to be in the eye of the Marketing Department! Michael Hart started the whole thing, and coined the word "Etext". The term "eBook" is gaining in popularity, even for texts that are not full books, so we've started using that more now.
R.38. What are the "Etext/Ebook numbers" on the texts?
These are simply a series of numbers. We give one to each etext as it is posted, so the earliest etexts have low numbers and later etexts have higher numbers. Etext number 1 is the Declaration of Independence, the first text that Michael Hart typed in to the mainframe that he was using in 1971.
A few numbers are reserved for books that we hope to have in the PG archive someday; for example, 1984 is reserved for Orwell's classic.
When we improve a text by making some corrections, we call it a new EDITION, and it keeps the same etext number, but when we post a different VERSION of the same text, from a different paper book (like different translations of Homer's Odyssey), each new version gets a new etext number.
R.39. What do the month and year on the text mean?
Project Gutenberg sets a production target for itself. The idea is that we try to produce X texts in a month, and we date the texts according to what month of our schedule they appear in. For example, if our target for September 2000 was 50 texts, and we actually produced 55, then the last five would be dated October 2000, and we'd get a head-start on the month. At the time of writing, in July 2002, that target is the publication of 200 books per month. However, our actual production has far outpaced our targets, with the result that the "head-start" has accumulated so much that we are currently releasing books scheduled for March, 2004!
The fact that we're so far ahead of schedule makes this quite confusing for newcomers. If it bothers you, just don't think about it! But at least it's better than being behind schedule. We didn't always produce so many books. In the September 1994 newsletter, Michael Hart wrote:
As always, I am terrified of the prospect of doubling our output to 16 Etexts per month for next year, we really need your help!!!
That was when the Project's target was 8 Etexts per month. Today, our target is heading towards 8 eBooks per day!
Copyright FAQ
C.1. What is copyright?
Copyright is a limited monopoly granted to the author of a work. It gives the author the exclusive right, among other things, to make copies of the work, hence the name.
C.2. Does copyright differ from country to country? From state to state?
Copyright laws are constantly changing all over the world. Each country has its own copyright laws, some within the framework of international treaties, some not. Within the U.S., copyright laws are federal, and do not vary from state to state.
C.3. What are the copyright laws outside the U.S.?
Sorry, we can't advise on copyright law outside the U.S. We can point you to online resources that try to summarize the various copyright regimes, but we can't guarantee that these are accurate. Even when they are accurate, it is very hard to express some of the subtleties of copyright law in a summary; for example, the question of what constitutes "publication" for copyright purposes is sometimes unclear.
C.4. Why does Project Gutenberg advise only on U.S. copyright issues?
The Project Gutenberg Literary Archive Foundation is registered in the U.S. as a 501(c)(3) organization, and our two posting servers are situated in the U.S., so we are subject to U.S. copyright law, and only to U.S. copyright law.
Because copyright laws are so tangled and different between countries, not only in the broad sweep but also in the detail, and because Project Gutenberg is subject only to U.S. copyright law, we just don't have the expertise, time or resources to research and advise on the law in other countries.
C.5. I don't live in the U.S. Do these rules apply to me?
Your country's copyright laws are different from those in the U.S., and understanding and dealing with them is up to you. If you have a book that is in the public domain in your country, but not in the U.S., it is perfectly legal for you to publish it personally there, but we can't.
Similarly, it may be legal for us to publish it here, but not for you to publish it, or perhaps even copy it, where you are.
There are organizations in other countries operating in more liberal copyright regimes that may be able to publish texts that we cannot. For example, Project Gutenberg of Australia can accept many works not eligible in the U.S.
C.6. What is the public domain?
The public domain is the set of cultural works that are free of copyright, and belong to everyone equally.
C.7. What can I do with a text that is in the public domain?
Anything you want! You can copy it, publish it, change its format, distribute it for free or for money. You can translate it to other languages (and claim a copyright on your translation), write a play based on it (if it's a novel), or a novelization (if it's a play). You can take one of the characters from the novel and write a comic strip about him or her, or write a screenplay and sell that to make a movie.
You don't need to ask permission from anyone to do any of this. When a text is in the public domain, it belongs as much to you as to anyone.
(However, when some character or part of the work is also trademarked, as in the case of Tarzan, it may not be possible to release new works with that trademark, since trademark does not expire in the same way as copyright. If you propose to base new works on public domain material, you should investigate possible trademark issues first.)
C.8. How does a book enter the public domain?
A book, or other copyrightable work, enters the public domain when its copyright lapses or when the copyright owner releases it to the public domain.
U.S. Government documents can never be copyrighted in the first place; they are "born" into the public domain.
There are certain other exceptional cases: for example, if a substantial number of copies were printed and distributed in the U.S. before March, 1989 without a copyright notice, and the work is of entirely American authorship, or was first published in the United States, the work is in the public domain in the U.S.
C.9. How does a copyright lapse?
Copyrights are issued for limited periods. When that period is up, the book enters the public domain.
Copyrights can lapse in other ways. Some books published without a copyright notice, for example, have fallen into the public domain.
C.10. What books are in the public domain?
Any book published anywhere before 1923 is in the public domain in the U.S. This is the rule we use most.
U.S. Government publications are in the public domain. This is the rule under which we have published, for example, presidential inauguration speeches.
Books can be released into the public domain by the owners of their copyrights.
Some books published without a copyright notice in the U.S. prior to
March 1st, 1989 are in the public domain.
Some books published before 1964, and whose copyright was not renewed, are in the public domain.
If you want to rely on anything except the 1923 rule, things can get complicated, and the rules do change with time. Please refer to our Public Domain and Copyright How-To for more detailed information.
C.11. My book says that it's "Copyright 1894". Is it in the public domain?
Yes. Its copyright date is 1894, which is before 1923, so its copyright has lapsed.
C.12. How can a copyright owner release a work into the public domain?
A simple written statement, which may be placed into the work as released, is sufficient. When a copyright holder places a book into the public domain and wants PG to publish it, all we need is a letter [V.70] saying that they are or were the holder of the copyright, and that they have released it into the public domain.
C.13. When is an author not the owner of a copyright on his or her works?
An author may sell, assign, license, bequeath or otherwise transfer his or her copyright to another party, such as a publisher or heir.
C.14. What does Project Gutenberg mean by "eligible"?
A book is eligible for inclusion in the archives if we can legally publish it.
We can legally publish any material that is in the public domain in the U.S. [C.10], or for which we have the permission of the copyright holder.
C.15. I have a manuscript from 1900. Is it eligible?
Maybe not.
Works that were created but not "published" before 1978 will not enter the public domain before the end of 2002. This gets complicated, and it's not too common. If you have such a case, ask about it.
A borderline example is the classic "Seven Pillars of Wisdom" by T. E. Lawrence, which was actually printed and privately distributed, but not "published", in 1922. We haven't been able to confirm any pre-1923 "publication" for this.
C.16. How come my paper book of Shakespeare says it's "Copyright 1988"?
Shakespeare was published long enough ago to be indisputably in the public domain everywhere, so how can a Shakespeare text be copyrighted?
There are two possibilities:
1. The author or publisher has changed or edited the text enough to qualify as a "new edition", which gets a "new copyright".
2. The publisher has added extra material, such as an introduction, critical essays, footnotes, or an index. This extra material is new, and the publisher owns the copyright on it.
The problem with these practices is that a publisher, having added this copyrighted material, or edited the text even in a minor way, may simply put a copyright notice on the whole book, even though the main part of it (the text itself) is in the public domain! And as time goes on, the number of original surviving books that can be proved to be in the public domain grows smaller and smaller; and meanwhile publishers are cranking out more and more editions that have copyright notices. Eventually it becomes harder and harder to prove that a particular book is in the public domain, since there are few pre-1923 copies available as evidence.
Among the most important things PG does is preventing this creeping perpetuation of copyright by proving, once and for all, that a particular edition of a particular book is in the public domain, so that it can never be locked up again as the private property of some publisher. We do this by filing a copy of the TP&V, the title page where the copyright notice must be placed, so that if anyone ever challenges the work's public domain status, we can point to a proven public domain copy.
C.17. What makes a "new copyright"?
1. New edition
When a text is in the public domain, anyone, from you to the world's biggest publisher, can edit it and republish the edited version. When the edits are substantial enough, the edited work is deemed a "new edition", and gets a new copyright, dating from the time the new edition was created.
How substantial must the edits be to qualify as a "new edition"? That is for a court to decide in any particular case. Changing some punctuation or Americanizing British spelling would not qualify a work for a new edition. Theorizing something about Shakespeare and rewriting lots of lines in "Hamlet" to emphasize your point would make a new edition. In between those extremes is a grey area, where each new edition would have to be considered on a case-by-case basis.
A special case, that isn't quite a new edition, is when someone "marks up" a public domain text in, for example, HTML. Where this happens, the text is in the public domain, but the markup is copyrighted. We've already seen that when an editor adds footnotes to a public domain text, he owns copyright on the footnotes but not on the text: similarly, when he adds markup to the text, he owns copyright on the markup.
2. Translation
Translation is a common and justified special case of a new edition. When someone translates a public domain work from one language to another, they get a new copyright on the translation (but not on the original, of course, which stays in the public domain so that lots more people can use it).
C.18. I have a 1990 book that I know was originally written in 1840, but the publisher is claiming a new copyright. What should I do?
From a practical point of view, there's not much you can do about it. It's a Catch-22 situation: in order to prove that the new printing should be in the public domain, you need a provably public domain copy to compare against the allegedly copyrighted edition, and if you have that, you don't need the modern edition anyway.
C.19. I have a 1990 reprint of an 1831 original. Is it eligible?
Yes, as long as we can show that it is a reprint, which usually means that it has to say that it's a reprint somewhere on the TP&V.
However, we need to be very careful in a case like this. Commonly, the book itself is eligible, but introductions, indexes, footnotes, glossaries, commentaries and other such extras may have been added by the modern publisher, so you should not include them except where you can prove that they are part of the reprinted material.
C.20. I have a text that I know was based on a pre-1923 book, but I don't have the title page. Can I submit it to PG?
Unfortunately, no.
What you "know" isn't proof that we could take into court if we were challenged about it in 20 years, and the whole problem of "new copyright" [C.17] makes it effectively impossible to tell for sure what is and isn't copyrighted anyway, without reliable evidence like the title page.
You need to find a matching paper edition for proof. See the FAQ "I've found an eligible text elsewhere on the Net, but it's not in the PG archives. Can I just submit it to PG?" [V.62]
C.21. How does Project Gutenberg "clear" books for copyright?
Usually, we just look at the TP&V. If it was published before 1923, or says it is a reprint of a pre-1923 edition, that's all we have to do.
In other cases, we may look up library publication data to prove, say, that a book published in the U.S. without a copyright notice was indeed published in the years when a copyright notice was required. Or we may simply see that a particular text was published by the U.S. Government.
The bottom line is the question: if someone comes to us claiming to hold the copyright on a text, do we have proof to show that they're wrong?
Whatever proof or search we have to do, we then file it, either on paper or electronically, so that the proof will be available in 20 or 50 years' time, or whenever the challenge is made.
C.22. I want to produce a particular book. Will it be copyright cleared?
If it was published before 1923, you will have no problem with its clearance. If you're relying on one of the other rules, it may just be too much work to try and prove its public domain status.
C.23. I have some extra material (images, introduction, preface, missing chapter) that should go into an existing PG text. Do I have to copyright-clear my edition before submitting it?
Yes. Otherwise we would have no proof that the extra material you're adding isn't copyrighted by someone. It's quite common for modern publishers to add introductions or illustrations to a public-domain novel, and we need the same standard of proof for these additions that we do for the main text.
This doesn't apply to an occasional word or two that was omitted by mistake when the text was first typed. For example, you don't need to clear another edition just to restore the words "thus perfected the" and "eliminating all" to the sentence:
And while we Country, we were also sorts of tediums, disputable possibilities, and deadlocks from the game.
while fixing typos.
C.24. I see some Project Gutenberg eBooks that are copyrighted. What's up with that?
Authors or publishers may grant Project Gutenberg an unlimited license to republish their works. In this kind of case, the copyright holders still retain their rights, but grant permission for us to share these eBooks with the world.
These copyrighted PG publications can still be copied, but the permissions granted are spelled out in their headers, and usually forbid anyone to republish them commercially.
C.25. What are "non-renewed" books?
Works published before 1964 needed to have their copyrights renewed in their 28th year, or they'd enter into the public domain. Some books originally published outside of the US by non-Americans are exempt from this requirement, under GATT. Some works from before 1964 were automatically renewed.
C.26. How can I get Project Gutenberg to clear a non-renewed book?
As of mid-2002, you probably can't. Because of all of the checks we need to do to ensure that the book wasn't renewed, or wasn't one of the exceptions that was automatically renewed, we just don't have the time to do it. But we're working on it. Right now, we're processing copyright renewal records with the aim of making them searchable.
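As a rough illustration of how the year-based rules in this section sort a book, here is a minimal Python sketch. The function name and the category wording are made up for this example, and it deliberately ignores the harder cases mentioned above (GATT exemptions, automatic renewals, missing copyright notices, U.S. Government publications), so treat it as a first-pass triage only, not as a clearance decision.

```python
def clearance_category(year_published: int) -> str:
    """First-pass triage by publication year, using the 2002-era rules
    described in this FAQ. Hypothetical helper, not a PG tool."""
    if year_published < 1923:
        # "Pre-'23 is free": the TP&V is normally all the proof needed.
        return "pre-1923: clearable from the TP&V"
    if year_published < 1964:
        # Renewal was required; proving non-renewal takes a records search.
        return "1923-1963: clearable only if proven non-renewed"
    return "1964 or later: not covered by these rules"


for year in (1896, 1922, 1923, 1950, 1990):
    print(year, "->", clearance_category(year))
```

Anything that does not land in the first category needs real evidence on file, which is why a first book is best chosen from pre-1923 printings.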
Volunteers' FAQ
About the Basics:
V.1. How do I get started as a Project Gutenberg volunteer?
What you actually need to do to produce a PG text can be stated very simply:
1. Borrow or buy an eligible book.
2. Send us a copy of the front and back of the title page.
3. Turn the book into electronic text.
4. Send it to us.
That's it! All the rest of the producing parts of the FAQ are about the details of how different people approach these steps.
Different people find their own ways into PG work, and once in, find their own niches. If you have your own ideas, don't let anything here stop you from pursuing them.
Some people just read the FAQs, go up to their attic, pull an eligible book off the shelf, send TP&V [V.25] in, and start typing or scanning. Next time we hear from them is when they send in [V.46] the completed eBook for posting. It can be as simple as that.
Some people just download existing PG texts, re-proof them very carefully and send in corrections.
Some people find regular collaborators through gutvol-d or the Volunteers' Board or the distributed proofing sites, earn a reputation as reliable proofers, and continue working as proofers.
Most people start small, and after a little experience of distributed proofreading or other proofing, begin their PG career as producers.
If you're a typist, cheer now, because you can ignore all the complicated paraphernalia of computer interfaces, and scanners, and the quality of OCR software and the mistakes it makes. You can just sit down at the keyboard with your eligible [V.18] book.
If you're not a typist, start thinking about scanners. It may be a while before you're ready to start scanning for yourself, but it's never too early to find out about them.
As soon as you have a solid grasp of how to turn a book into an etext, please start thinking about how you're going to become a producer. While proofing work is valuable, PG can only add books when someone makes the effort to actually make etexts from them, and the people who run distributed and co-operative proofing projects have to do a lot of work before and after the proofing step; we want to spread that around as widely as possible. Project Gutenberg needs more producers!
Whatever you do, don't just hang around expecting someone to offer you a task to undertake. There is no "head office" where overworked staff occasionally need interns to do filing and odd-jobs. There are maybe 200 fairly regular contributors to PG, producers and significant proofers. We almost never meet each other in person. We have jobs, and families, and other interests. We work for PG when we can, and when we want to. In many ways, you could look at us as 200 unrelated people, each doing our own etext project, using Project Gutenberg as an umbrella group that sets loose standards, files copyright proofs and provides secure placement for the finished texts. Since we each have our own self-assigned single-person tasks, there isn't too much room to delegate some of that work to a beginner. By all means, volunteer for some tasks (on the Volunteers' Board, or in gutvol-d), but you should think in terms of defining your own tasks, and making your own contribution.
Absolutely everyone (scanners, typists, proofers) should first spend some time working on a distributed or co-operative proofing project. This will allow you to get a feel for what happens in making an etext from paper pages without committing you to more than a few hours' work.
This is not in any way an institutional requirement, since we don't have any institutional requirements, but it is very good advice. Many volunteers start eagerly, wanting to do lots of PG work, and then drop out because they took on too much, too fast, without understanding the nature of the work. Don't let that happen to you. Take it in small chunks.
Check out these distributed proofing sites:
Charles Franks: <>
JC Byers: <>
Dewayne Cushman: <>
and spend a few hours over a couple of weeks just processing some pages for real.
While you're doing that, you should also join a couple of PG mailing lists [V.12]: gutvol-d and either the weekly or monthly Newsletter list. Reading these will start to get you connected to what's going on. Browse the Volunteers' Board; there may be some offers going, and there's a lot of experience captured in some of those "back-issues", so don't confine yourself to the front page.
Inform yourself on e-text issues generally, not just within Project Gutenberg. Explore The On-Line Books Page and the IPL [R.5] and from them find other eBooks available on-line.
Have a look at our In-Progress List and some lists of suggestions from others [B.4].
Look at sites like Blackmask <> and Pluckerbooks <> and Memoware <> and Bookshare <> to learn how our work is being used as a basis and copied and converted and amplified in many other projects.
Above all, READ a few Project Gutenberg eBooks! You don't have to read them in full; you don't need to spend weeks poring over Dostoyevsky or studying Shakespeare. Just download a few and skim them: you'll absorb what a PG text should be quite painlessly, and maybe you'll get caught up in the story! If you're looking for light reading, and can't think of something that you specifically want, how about these all-time favorites:
The Gift of the Magi, by O. Henry.
The Lady, or the Tiger?, by Frank R. Stockton
A Christmas Carol, by Charles Dickens
Alice in Wonderland, by Lewis Carroll
Anne of Green Gables, by Lucy Maud Montgomery
The Marvelous Land of Oz, by L. Frank Baum
A Princess of Mars, by Edgar Rice Burroughs
Heidi, by Johanna Spyri
A Connecticut Yankee in King Arthur's Court, by Mark Twain
Black Beauty, by Anna Sewell
Tarzan of the Apes, by Edgar Rice Burroughs
Tom Swift and his Motor-Cycle, by Victor Appleton
Rebecca Of Sunnybrook Farm, by Kate Douglas Wiggin
Little Lord Fauntleroy, by Frances Hodgson Burnett
Aesop's Fables
Grimms' Fairy Tales
The Art of War, by Sun Tzu
Dracula, by Bram Stoker
Swiss Family Robinson, by Johann David Wyss
The War of the Worlds, by H.G. Wells
If you have a taste for detectives and mysteries, there's
The Adventures of Sherlock Holmes, by Arthur Conan Doyle
Monsieur Lecoq, by Emile Gaboriau
The Mysterious Affair at Styles, by Agatha Christie
Arsene Lupin, by Edgar Jepson & Maurice Leblanc
Edgar Allan Poe's "The Gold-Bug" and
"The Murders in the Rue Morgue" in The Works of Edgar Allan Poe V. 1
For the excessive buckling of various swashes, see:
The Prisoner of Zenda, by Anthony Hope
The Man in the Iron Mask, by Dumas, Pere
The Three Musketeers, by Alexandre Dumas
Treasure Island, by Robert Louis Stevenson
The Scarlet Pimpernel, by Baroness Orczy
Effen youse got a hankerin' for a Western, there's:
Riders of the Purple Sage, by Zane Grey
The Virginian, Horseman Of The Plains, by Owen Wister
Back to God's Country, by James Oliver Curwood
Selected Stories by Bret Harte
Jean of the Lazy A, by B. M. Bower
Or if you prefer your fiction more domesticated, there's:
Little Women, by Louisa May Alcott
Pride and Prejudice, by Jane Austen
The Warden, by Anthony Trollope
The Heir of Redclyffe, by Charlotte M Yonge
Mother, by Kathleen Norris
For something to raise a smile, you can rely on:
The Devil's Dictionary, by Ambrose Bierce
The Wallet of Kai Lung, by Ernest Bramah
The Importance of Being Earnest, by Oscar Wilde
Three Men in a Boat, by Jerome K. Jerome
Piccadilly Jim, by P. G. Wodehouse
If poetry is your thing, you have lots to choose from:
Shakespeare's Sonnets
Project Gutenberg's Book of English Verse
The Home Book of Verse, edited by Burton Stevenson
The Complete Poems of Henry Wadsworth Longfellow
Leaves of Grass, by Walt Whitman
Now, that's just a handful from our over 5,000 eBooks, so don't tell me you can't find anything to read! If you do have ideas of your own, download GUTINDEX.ALL or PGWHOLE.TXT and browse through the whole list, or Browse by Author on the website at <>.
Download a few. Read them on your PC, or reformat them and print them out, or convert them for your PDA. Get used to working with and formatting text. Look at the formatting decisions that earlier volunteers have made. They're not entirely consistent (different people make different choices, different books require different methods, and PG conventions have shifted slightly over the last 10 years), but they're all perfectly readable and convertible today.
If you find typos [R.26] in any of them, tell us! That's also a part of being a Gutenberg volunteer. Our eBooks improve with time!
If you're thinking of making the best use of your time looking for errors in posted texts, a good start would be to download 40 or 50 texts, and run a spelling checker and gutcheck [P.1] on them all, spending only 5 or 10 minutes on each. Having had a quick look at all of them, concentrate on the ones that seem to have most problems: where automated checkers see 10 problems, a careful human will usually be able to pick up 20.
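The triage described above can be sketched in a few lines of Python. This assumes you have already counted, by whatever means, how many problems your spelling checker and gutcheck flagged in each text; the function name, file names and counts below are invented for illustration, and the sketch does not run either checker for you.

```python
def rank_for_proofing(flag_counts: dict[str, int], top: int = 5) -> list[str]:
    """Return up to `top` file names, most flagged problems first."""
    ranked = sorted(flag_counts, key=lambda name: flag_counts[name], reverse=True)
    return ranked[:top]


# Invented counts from a quick pass over a handful of downloaded texts.
counts = {"text_a.txt": 3, "text_b.txt": 41, "text_c.txt": 0, "text_d.txt": 17}

for name in rank_for_proofing(counts, top=2):
    print(name, counts[name])
```

The texts at the head of the list are the ones most likely to repay a careful human read.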
Getting Productive
OK, so you've seen what etexts should look like, you know what we do, and proofing hasn't scared you off. It's time to step up and become a producer. If you're not a typist and you don't have a scanner, take a detour down to the Scanning FAQ [S.1] now, and come back when your scanner is set up. If you're a typist or you've already got a scanner, read on . . .
Get a book. Just do it, OK?
Ya gotta start somewhere, right? And finding an eligible book is definitely somewhere.
Finding an eligible book is a threshold for many beginning volunteers: it's the first major step on the way to producing. For a lot of people, it's also the toughest barrier they have to cross. Fortunately, the barrier is only psychological, and can be crossed in a few minutes.
It's an unfamiliar process, and one that a lot of beginners feel some anxiety about. Don't. It's quite straightforward: it's just buying a book. You've done that, haven't you? Don't over-think it, don't worry about whether you're making the "right" choice, don't spend months comparing lists and choosing. Just do it. Once you've got your first, you'll wonder what all the fuss was about. Thanks to the wonders of the internet, your book can be on its way to you in an hour if you have $20 to spend.
Typists blessed with a good local library don't even have to buy their books; they can just borrow one and type it up! (You may be able to scan a library book, but get some experience with scanning first, and avoid damage!)
Let's deal with the decisions and other issues of picking one.
For your first book, don't try getting fancy with copyright issues. Choose one that was published before 1923, and you're in the clear for U.S. and PG copyright purposes. You can read the dates just as well as we can. With books printed before 1923, there are no hidden catches: "Pre-'23 is free". Just read the TP&V [V.25] of the book, and see that it was printed before 1923, and you have no problems. Of course, reprints [V.19] of books copyrighted pre-1923 (and various other cases) are also clear, but if you have any concerns, just stick to pre-'23 editions.
Which book?
The answer to this question is different for everyone, but see how much you agree with the following statements:
"I have a favorite book, and I'd really like to produce that."
Well, hey, this is no problem! You already know what you want.
Go check out whether the book is already on-line [V.29].
"I'd like to work on an important book, but I don't know which."
Well, everybody's definition of "important" is different, but some people have put their various ideas forward already; you can see whether you agree with them! The InProg List contains some, with the notation "Suggested book to transcribe" beside them. Steve Harris keeps a list of unproduced possibles. John Mark Ockerbloom's "Books Requested" page lists titles that people have asked for. [B.4] Your problem if you fall into this category is that other people probably wanted to produce "important" books too, and lots are already done.
"I just want an easy, trouble-free book to start with."
Your first book doesn't have to be War and Peace (we've already got that anyway!). Here's a tip: try looking for children's or what we would nowadays call "Young Adult" books. These are typically short, and may have large print, which makes life much easier if you're scanning. They age well: children's stories from a century or more ago are still readable and interesting to children today. We have many children's and YA eBooks: not just the classics like Grimm and Andersen and Heidi and Oz and Peter Pan and William Tell, but lesser-known but still enchanting stories like The Counterpane Fairy, or Lang's Fairy books. There are series, like the Motor Girls, or the (Country) Twins series, or the Bobbsey Twins. There is lots and lots of material here for you to start with, and these books are relatively plentiful, since they were made to take the kind of treatment children dish out, and many of them have been in school libraries or attics for years.
Whatever your choice, pick a book that you'll like; you'll be living with it up close and personal for a while. Light reading, adventure fiction, and books aimed at younger readers are safe first choices for most people. If you admire 19th Century scientists or scholars, and want to immortalize their work, great! But don't feel that you have to dive in at the deep end just because someone else wants you to.
Getting your book: a practical exercise
The Search
At this point, you've got a list of books: maybe just one, maybe several by an author or two, maybe just a genre like "Children's Books" with some specific ideas. Maybe your mind is still wide-open.
Before used booksellers had the Net, finding a particular old book was a daunting job. Booksellers had informal networks among themselves and exchanged catalogs so that each would know something about what was available elsewhere, but, for a buyer, finding a particular book was still hit-and-miss. Now, however, a number of large sites provide a service to booksellers, where they can list their inventories for people to search from anywhere.
So now we go hunt for them on the Net. No, you don't have to buy them on the Net: you can rummage in booksales and garage sales and used bookstores, and that's its own kind of fun, though on a physical hunt, what you need is to bring a long list of "already done" books with you. But even if you never buy over the Net, it's a vast source of information about what books are available, which are plentiful, and which are cheap. It gives you some experience of what to expect when you do your in-person browsing.
Here's a story of a typical Net-hunt. And you can follow along with it at home. :-) Your results, and the sites you end up at, will be different from mine, but even if you don't end up buying a book on this hunt, you'll get some experience of what's involved. C'mon, do it with me, and see if you can find a better bargain!
I'm starting with two lists, and I'll follow up whatever seems promising. I'd like to spend about $20; might go to $30. Definitely not interested in $50 and up. I'm keeping in mind that I'll have to add a bit for delivery: usually up to $10 within the U.S., but can get expensive if you're in Perth, and ordering from a bookstore in Munich.
I'm also avoiding anything that might be tricky to clear on this search, and confining myself to books printed before 1923.
Of course, by the time you read this, some of these books may already have been produced, so if you're actually thinking of buying any, check carefully first!
My first shortlist consists of books that caught my eye from David Price's In-Progress List, Steve Harris's site, and The On-Line Books Requested page [B.4], and it reads:
Louisa May Alcott: The Inheritance
E. W. Hornung: Irralie's Bushranger
E. W. Hornung: Stingaree
A. A. Milne: The Dover Road
A. A. Milne: Once on a Time
Samuel Richardson: Pamela
Oscar Wilde: The Critic as Artist
As well as following along with my list, you should try finding two or three books of your own, from those sites or from your own preferences, and search for them in the same ways that I do.
Everyone has their own searching technique and their own favorite sites to search. For this session, I'm opening up three copies of my browser: one for Alibris <>, one for Abebooks <>, and one for the Catalog of the Library of Congress <>. I'll do my initial searches on Alibris and Abebooks, and keep the LoC site handy for reference.
In Alibris, I head straight for the Advanced Search page, since they allow searching by date, and I immediately put "before 1923" into every search, which avoids having to scan through modern reprints. In Abebooks, I choose "Hardcover" in their advanced search, which is not quite as good a filter, but does at least screen out recent paperback editions.
In each of the sites, I just enter the author's surname and one word from the title of each book, and look at the search results.
Louisa May Alcott's "Inheritance" looks like it's going to be tough. I don't find it in either of my two bookstores. On doing a little checking with modern bookstores, I find it was her first novel, written when she was 17, and as far as I can see, not published during her life: apparently only recently published, since the LoC site has nothing prior to 1997. A disappointing start to my search. I understand why it's very desirable to get it online, but this one's going to be very tough to clear, and I'm staying away from it.
E. W. Hornung's "Irralie's Bushranger" is also elusive: it doesn't show up at either of my sites, so I check out the LoC to confirm I have the title right, and yes, there it is: "Irralie's Bushranger, a story of Australian adventure, 1896." So I widen my search by visiting <> and searching many of the sites there. Still no luck. If I were particularly eager to get this book, there are several things I might do at this point: I might register a "want" with one of the sites, asking to be notified when a copy is listed, I might use the OCLC WorldCat search (which Abebooks calls "Find it at a local library") where I can locate libraries that have copies, or I might even contact some individual booksellers and make a request that they look for it. Some booksellers actually specialize in looking for hard-to-find books; but of course I expect I'd have to pay a bit more for it when they do find it, and given my success with the rest of my list, and my price bracket, there seems no need to go that far today.
Hornung's "Stingaree", by contrast, seems to be everywhere, in several editions, and cheap. It must have been a bestseller in its day, which is not surprising from the author of "Raffles". 1902, 1905, 1909 editions abound. The cheapest are 1910 and 1907 editions for $4.95 and $5.00 from booksellers listed at Abebooks.
Milne's "Dover Road" is available from both sites. There seems to have been a Putnam's printing in 1922 of "Three Plays: The Dover Road. The Truth About Blayds. The Great Broxopp." of which lots of copies survive. There also seem to be later printings which would qualify as reprints if I were desperate, but the 1922 edition is priced from $12.00 to $50.00, so I'll take the 1922 $12.00 copy from Abebooks. As a bonus, I don't see the other two plays listed as being online anywhere, so I'll get three texts (and short ones, too: 279 pages for all three) for the price and effort of one.
Milne's "Once on a Time" is a bit less common, but once again a Putnam's printing of 1922 keeps it in the race. There are a couple of booksellers in England selling for 15 pounds (which just about makes my $20 threshold) and 20 pounds, and an ex-library copy going for $25.
There are lots of eligible copies of "Pamela" available, ranging from a fourth edition at a mere $4,999 (no, thanks!) to a 1921 printing at $6.60 at Alibris. I'll take that one, please.
Wilde's "Critic as Artist" is fairly widely available. A 1905 edition of "Intentions: the Decay of Lying; Pen Pencil and Poison; the Critic as Artist; the Truth of Masks" is available at Alibris for $8.80 (and other copies of the same edition there and on Abebooks in the $20-$30 range), and Abebooks lists a London 1919 edition at $12.50. There are several copies listed in both places as "undated" and "reprints". I'm avoiding these, since while it's quite likely that they might be clearable, I'm not taking risks on this search.
My second list isn't a list, just a vague category: children's books that are easy to do.
I go to Alibris' Advanced Search, and enter "Child's" in the title, and pre-1923 in the date, and, excluding titles already on-line, immediately get:
A Child's History of France $13.20
A Child's Story of the Bible $5.50
First Lessons in Botany or The Child's Book of Flowers $13.20
The Child's Book of American Biography $11.00
The Child's First Bible $8.80
The Child's Music World $8.80
and so on through quite a list.
OK. That's a good start. But my choice so far is unimaginative. I need better search terms. So I go to main search engines with the terms "children's antiquarian books" and find a half-dozen or so sites that specialize in them. I can browse around there, though it's slower going without searches to focus my results. I find <>, specializing in children's books. Wading through the miles and miles of Alcotts and Barries and Burnetts, which are mostly already online, I think, I find a couple of authors from them who must have been popular, because they seem to have published lots of books before 1923: Angela Brazil and Dorothy Canfield. (I only got as far as the "C"s!)
I could of course stop here and buy some, but today I want to see what else is out there.
Back at Alibris and Abebooks, armed with my authors to search by, I turn up 4 pre-1923 books under $20 for Angela Brazil:
A Terrible Tomboy
The Youngest Girl in the Fifth
A Fourth Form Friendship
A Pair of Schoolgirls
and several between $20 and $30.
Dorothy Canfield immediately yields multiple copies of:
The Brimming Cup
Home Fires in France
Hillsboro People
Understood Betsy
Rough Hewn
The Real Motive
and others, and I haven't even got to $20 yet, nor to the letter "D".
A browse through the Ebay Collectible and Antiquarian Books section also throws up a respectable list of eligibles. I won't even bother counting that.
In 20 minutes, I have found five of the seven on my search list. In less than an hour after that, I found over 16 eligible children's books, all under or around $20 and all available online.
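The filtering I did by eye during this hunt (printed before 1923, within budget, cheapest first) is simple enough to write down. Here is a minimal Python sketch using some of the prices quoted above; the `Listing` class and `shortlist` function are invented for this example, the prices exclude delivery, and no bookseller site offers an interface like this as far as this FAQ knows.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    title: str
    year: int      # printing date as given by the bookseller
    price: float   # U.S. dollars, before delivery


# Figures taken from the hunt described above.
LISTINGS = [
    Listing("Stingaree", 1910, 4.95),
    Listing("Stingaree", 1907, 5.00),
    Listing("Three Plays (includes The Dover Road)", 1922, 12.00),
    Listing("Three Plays (includes The Dover Road)", 1922, 50.00),
    Listing("Pamela", 1921, 6.60),
    Listing("Intentions (includes The Critic as Artist)", 1905, 8.80),
    Listing("Intentions (includes The Critic as Artist)", 1919, 12.50),
]


def shortlist(listings, budget=20.00, cutoff_year=1923):
    """Keep printings dated before cutoff_year and priced within budget,
    cheapest first."""
    keep = [item for item in listings
            if item.year < cutoff_year and item.price <= budget]
    return sorted(keep, key=lambda item: item.price)


for item in shortlist(LISTINGS):
    print(f"${item.price:6.2f}  {item.year}  {item.title}")
```

Undated copies and unconfirmed reprints are left out of the data on purpose, for the same reason I skipped them in the search.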
Before committing to one, though, I would double-check that the book hasn't been transcribed online, and isn't In Progress.
Double-checking your selection
If you're concerned that the book you have chosen duplicates another that might be in progress, and want to double-check, you can e-mail the Posting Team asking them to check whether any recent clearances have come in for that title.
Duplications do happen (there's no way of avoiding them when different people are making independent decisions), but they are rare.
Dealing with used booksellers
As a class, used booksellers are very pleasant people: remarkably friendly, knowledgeable and helpful, even to people buying on a typical Gutenberger's budget.
Some of them are not, however, models of ideal data organization when it comes to Internet listings. There are lots of one- or two-person operations dealing with an inventory of many thousands of books, and having located your book online, you should check that it's still available.
You can place an order through the site and wait for the confirmation, or you can simply call the bookseller. Not all booksellers' contact details are listed, so it's not always an option, but when you do phone you're likely to be speaking immediately to someone who can tell you for sure whether the book is still there, can pull the book off the shelf and answer questions about it, and can take your credit card details on the spot and dispatch the book immediately.
Copyright Clearance
As soon as your book arrives, send us the information needed for Copyright Clearance first. Even if your book is a true-blue, no-questions-asked pre-1923 edition, we should know about it as soon as possible so that it can go onto the In-Progress list for others to see that someone has started on it.
Wait for the confirmation e-mail before starting any serious work. Some people have thought that "Copyright 1923" plus some wishful thinking would be good enough, and, unfortunately, it isn't. Some people have gone ahead and produced the whole book before sending in the clearance, only to be disappointed, all their work wasted.
Books published in 1922 or earlier are clearable, but some people, ever optimists, overlook that little "1927" in small print on the verso. Sometimes there is no copyright date on the front, and other optimists assume that these books are OK. They may be; they may not be. Don't get caught in the copyright trap.
As soon as you have what you think might be an eligible book, do not start on it. Do not ask another volunteer's opinion. Just send in the TP&V and wait for the confirmation e-mail to find out for sure.
Even when your TP&V clearly says "Copyright 1901", send it in. We need to get it into the clearance files so that we can register it as being In-Progress.
If you're a typist, there's not much more you need to know from this point: you can just get on with the job, with maybe a few tips from the FAQ. In fact, if you're a typist, you might wonder why the rest of us make such a fuss about scanners, and settings, and OCR. Take pity on us! We just can't produce the way you can. Smile indulgently, ignore all the scanner jargon, and submit your completed text while we're still saying bad words about the guttering on a greyscale image of page 372. :-)
If you are using a scanner to copy a book for the first time, be patient with yourself. Some people start off with too high expectations of what they can achieve. Believe it or not, scanning does work effectively; it just doesn't work perfectly. And often, you need a little practice before your scans work right with your OCR. The Scanning FAQ [S.1] has lots of specific tips you can try. Start by scanning a double-page about a third of the way through the book. Scan in Black and White and in Greyscale, at 300 dpi and 400 dpi. Try 600 dpi if it seems like a good idea. Put it through your OCR and see what comes out. Move your scanner so that you can be comfortable while placing the book and turning pages. Allow yourself an hour to experiment with different settings, and different pages. Put the sample images included with the Scanning FAQ through your OCR and see how the output compares to the text produced by other packages. That first hour finding out about how your setup works will be the most valuable hour of scanning you will ever do.
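If you like to be methodical about that first hour, the trial scans suggested above amount to a small grid of settings. The following Python sketch just enumerates the combinations so you can tick them off; the names are invented for this example, and it does not drive any scanner or OCR package.

```python
from itertools import product

# Settings named in the paragraph above; add 600 to the list if it
# seems like a good idea for your book.
MODES = ["black-and-white", "greyscale"]
RESOLUTIONS_DPI = [300, 400]


def trial_settings(modes=MODES, resolutions=RESOLUTIONS_DPI):
    """Every (mode, dpi) combination to try on the same test spread."""
    return list(product(modes, resolutions))


for mode, dpi in trial_settings():
    print(f"Scan the test double-page in {mode} at {dpi} dpi, then run OCR.")
```

Scanning the same double-page each time keeps the comparison fair: only the settings change between trials.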
Having figured out what settings you want to use for this book, make sure you implement the best speed you can. Usually this means telling the scanner to scan only as much area as the book covers. This is quite important, since the scanner will by default scan its whole area, and you don't need all that; it just wastes time and makes your images bigger.
You may also be able to set your OCR or scanner software to auto-scan pages with some preset delay, like 5 seconds. This also speeds things up, because the scanner isn't waiting for you to hit the keyboard, and you have both hands free at all times to turn the page and replace the book. It takes a few pages to get into the rhythm; if you miss a page-turn, don't worry: you can get it on the next scan.
Using a reasonably modern but quite ordinary home/office type flatbed scanner, you should be able to scan 200 pages an hour [S.9] of a typical book, at good quality. 400 pages an hour is not unheard-of. Now, it may fairly be said that scanning offers all the fun of ironing, without the sense of adventure :-), but if you have got your settings right, you will probably be able to do the whole job in less than two hours. And now you're really on the road!
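To see where the "less than two hours" figure comes from, here is the arithmetic as a tiny Python sketch. The function name is made up, and the rates are the rough figures quoted above, not measurements of any particular scanner.

```python
def scan_hours(pages: int, pages_per_hour: float = 200) -> float:
    """Estimated scanning time in hours at a steady rate."""
    if pages_per_hour <= 0:
        raise ValueError("pages_per_hour must be positive")
    return pages / pages_per_hour


# The 279-page Milne "Three Plays" volume from the book hunt above:
print(f"{scan_hours(279):.2f} hours at 200 pages an hour")
print(f"{scan_hours(279, 400):.2f} hours at 400 pages an hour")
```

At 200 pages an hour, any book under 400 pages fits inside the two-hour estimate; longer books simply scale up in proportion.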
V.2. What experience do I need to produce or proof a text?
For producing, you will have to be able to type pretty well, or have a scanner.
For proofing someone else's text, when you don't have a copy of the book in front of you, you should be reasonably familiar with the language used in the book, and the styles of the time: Chaucer's English was quite different from ours, and even 19th Century novelists write some phrases unfamiliar to us today.
That's it. You don't need experience in publishing, editing, or computers.
V.3. How do I produce a text?
There are acres of words in this FAQ about that, but it all boils down to 4 simple steps:
1. Get an eligible book—pre-1923, or one of the exceptions. Pull
it from your attic, borrow it from a library or a friend, buy it
in your local bookstore, in a flea-market or on-line. We don't
care which.
2. Send us a copy of the front and back of the title page so we
can file proof of copyright clearance.
3. Copy the text from the book into a computer text file. We don't
care whether you type it, scan it, voice-dictate it, or think of
some totally new way to do it. Just get it into a file.
4. Send us the computer text file.
That's all there is to it!
V.4. Do I need any special equipment?
You need the use of a computer of some kind, and Internet access is usual, though we have had some volunteers contribute texts on floppy disks.
If you intend to scan books, you will need a scanner, but if you're just typing or proofing you won't.
V.5. Do I need to be able to program?
Absolutely not! Very little of Project Gutenberg's work involves programming, and it is never necessary to any part of volunteering.
V.6. I am a programmer, and I would like to help by programming.
What can I do?
At the risk of sounding facetious, the very best thing you can do is figure out ways that more programming can help Project Gutenberg!
A lot of programmers work on PG books, and anything easy has probably already been done. The challenge for programmers who want to write something that will help to produce etexts is not in writing the code; it's in identifying ways that programs can help.
Please see the FAQ "What programs could I write to help with PG work?" [P.2] for some ideas in this direction. Whatever you do, don't just hang around waiting for someone to ask you to write something, because that's not going to happen. Think up a project, ask volunteers if they would use it, and dig in! Better still, produce a few etexts yourself, using the existing tools, and get a feel for the kinds of problems that new software could help with.
Apart from text production, we do develop some programs to help with posting work, but as of mid-2002, we have nothing like an ongoing programming project which people can join.
V.7. What does a Gutenberg volunteer actually do?
We buy or borrow eligible books, scan, type, and proofread. There are a few other activities, but they consume only a very small fraction of volunteer time.
V.8. Can I produce a book in my own language?
Yes! We want to encourage people to produce books in all languages, and we cheer when we can add a new language to the list.
V.9. Does it have to be a book? Can I produce pieces from a magazine
or other periodical?
Magazines, newspapers, and other publications are just fine. For copyright clearance, they work just the same way as a book.
You do need to check the length of your piece [V.17]; we don't want a zillion separate one- or two-page files. If the piece you have in mind isn't long enough, you can add other pieces to it, or even most or all of the magazine. If the work was serialized over multiple issues, you can join them together for your PG text, but you do have to copyright clear every issue of the magazine from which you copy material.
If you have lots of old periodicals, you could even take one piece from several, and make a new text which is a "theme" anthology of those pieces. You can give it an appropriate title: "Civil War Commentaries from X magazine 1892-1898."
V.10. Do I have to produce in plain ASCII text?
Certainly not if it doesn't make sense. To take an extreme example, if you're working in Japanese or Arabic, or creating audio files, there is no point in trying to reproduce that in ASCII!
Where the text can largely be expressed in ASCII, we do want to post an ASCII version, even if it is somewhat degraded compared to the original. However, we will post your file in as many open formats as you want to create, so that your original work is available for those who have the software to read it.
V.11. Where do I sign up as a volunteer?
You don't. We have no formal sign-up process, no list of volunteers, no roll-call. If you produce a PG eBook, or help to produce one, you are a volunteer.
V.12. How do PG volunteers communicate, keep in touch, or co-ordinate work?
We are very scattered geographically: U.S., Australia, Brazil, Taiwan, Germany, South Africa, Italy, India, England, and all over the world, so we can't really meet for coffee on Thursdays. :-)
Most co-operation and co-ordination goes on by private e-mail. This is efficient for volunteers who have worked with each other before, since they know each other's interests and skills, but not so easy for beginners to break in on, since they don't.
The Volunteers' Web Board at <> is a publicly accessible forum for volunteers or potential volunteers to post any question or information about how to create a PG eBook.
There are a few Project Gutenberg mailing lists. Information about joining them is available on the main site, at <>.
The Project Gutenberg Weekly and Monthly Newsletters, gweekly and gmonthly, are one-way announcements, which allow PG to communicate with non-volunteers who are interested in the eBooks we produce, but they also contain notes and requests for assistance from volunteers.
The Volunteers' Discussion Mailing list, gutvol-d, is an e-mail discussion forum for subscribers about any Gutenberg topic.
The Volunteers' List, gutvol-l, is for private announcements for active volunteers.
The Programmers' List, gutvol-p, is for discussion of programming topics.
There are some other, specialized, closed lists for people who do specific work within PG:
The "Posted" List, posted, is for people who perform indexing on our texts. An e-mail is sent to this list every time we post a text (see the FAQ "How does a text get produced?" [V.16] section 5: Notification) and the members of the list use it to update their catalogs.
The Whitewashers' List, pgww, is for Posting Team internal messages.
The Heroic Helpers List, hhelpers, is for people who can devote some fairly regular time to doing odd jobs.
V.13. Where can I find a list of books that need proofing?
There is no central list of this kind. There are distributed proofing projects, currently at
Charles Franks: <>
JC Byers: <>
Dewayne Cushman: <>
where you can proof parts of a book. This is advisable when you're just starting out because it gives you some feel for what the work is like.
You can also look up existing, posted texts from the archives and proof them. Just as there always seems to be one more bug in any given program, there always seems to be one more typo in any given text! Download a few, and scan quickly for problems by doing a spellcheck or other automated check; if you can find any problems quickly, then there are likely others to be discovered by a careful proofing.
V.14. Is there a list of books that Project Gutenberg wants?
No. Project Gutenberg, as such, does not "want" any specific books. Individual volunteers choose what books to produce. Nobody gives orders to volunteers about what they should work on. Nobody has an official "hit-list" of books to add to the archives.
Of course, individual volunteers and non-volunteers have their preferences, and may suggest books to transcribe, and such suggested lists pop up every so often, and are often useful to people looking for ideas.
There are usually some suggestions in David Price's InProgress list. The On-Line Books Page has a section where people can list requests, and Steve Harris has a site devoted to lists of books not yet in Gutenberg or elsewhere. Treat all of these lists with some caution, since someone may have started or even finished one of their suggestions since they were last updated.
PG Books In Progress <>
On-Line Requested List <>
Steve Harris' "To-do"s <>
V.15. I have one book I'd like to contribute. Can I do just that without
signing up?
Well, since there is no formal sign-up, of course you can! A lot of texts have been contributed by people who just wanted to immortalize one favorite book. Many of them had already created the eBook before they even heard of Project Gutenberg, and we're always delighted to add these to the archive!
About production:
V.16. How does a text get produced?
As stated back in the Basics section, all you need to do is:
Borrow or buy an eligible book.
Send us a copy of the front and back of the title page.
Turn the book into electronic text.
Send it to us.
That's all you actually need to know in order to be a producer. But if you're interested in the details of how other people actually do this, and want to know what else happens behind the scenes, here's a full, blow-by-blow account.
1. Finding an eligible book
Volunteers find eligible books [V.18] in all sorts of ways. Some lucky people have them in their bookshelves, or their attic. A lot of people have a good library nearby, where they can find books, or request them on interlibrary loan. Some people are big eBay fans; others like to hunt for bargains on specialist booksites. And of course lots of volunteers enjoy rummaging through actual used bookstores, or local markets, or yard sales.
Even if you're not going to take on a book yourself right now, search for some on the Net and find out about how to get a copy. Next time you pass an antiquarian bookstore, or a book market, drop in and browse. Ask your local library about interlibrary loans. Eligible books aren't hard to find once you know where to look.
2. Copyright Clearance
New volunteers sometimes find it hard to understand why this is so important, and why, in particular, Project Gutenberg is so careful about it. At base, it's simple: by keeping a filed copy of the TP&V [V.25] of every book we produce, we can at any time protect our publications against claims from publishers that they "own" the work, and thus we can keep them available to the public.
The copyright laws can be difficult to understand, and sometimes it may take serious research to prove that a particular edition is actually in the public domain. If you're not legally-inclined, just keep repeating "Pre-'23 is free" if you're in the U.S.A. and stick to books published before 1923. If you do want to delve deeper, read our Copyright Rules page at <> and then go on to reading the Library of Congress Copyright Office official papers at <>. If you're in another country, find out about your own copyright laws.
Volunteers send in the TP&V from the book for us to inspect. This not only gives us the proof to file, it also lets us know that someone is really working on the text so that we can list it as being In Progress for the information of others who might be interested.
3. Scanning, typing, proofing and editing
This makes up the bulk of PG's effort, and is discussed at great length elsewhere in this FAQ. There are many, many ways to create an etext from a paper book, and different people use different methods, but it all boils down to making a text file. For a typical book, it will probably take 40 hours of a volunteer's time. All that happens here is that somebody makes the effort to transcribe one paper book into a file that can be shared around the world and for all time.
4. Posting
[Note: this information is quite specific to the process we go through now. It is quite likely to change as we improve the automation of the tasks.]
Posting is done by the Posting Team. The basic job is to receive the text from the producer, check that it has been copyright cleared, check that it conforms to Project Gutenberg standards, check it for correctness (which can be anything from XML validity to simple spelling), add the Project Gutenberg header and copy the text to the two PG servers.
In a simple case, where everything goes right, this can take as little as fifteen minutes. In a complicated case, where we have to convert formats, or there are a lot of errors in the text, or there are problems with the copyright clearance, it can take hours or even days while we wait for responses, or do a lot of editing, or find conversion tools.
Michael Hart used to do this work entirely alone, but in September 2001, he created the Posting Team to handle the load. (The Posting Team are nicknamed the "Whitewashers" in honor of Tom Sawyer's victims. :-)
Transferring the file
You send the text to us [V.46] either by Web, by FTP (with a username and password that any of the Posting Team can give you privately), or by e-mail.
If you're FTPing, you should e-mail one or more of us as well, to let us know what you've uploaded.
One problem is files that don't transfer correctly. Especially by e-mail, some files get damaged on the way. It's better to ZIP the file before sending, if possible, to prevent some common problems with text files. The use of compression formats other than Zip can also create problems. Members of the Posting Team work on multiple platforms (DOS, Windows, Linux, Solaris), and zipping and unzipping programs are commonly available for all of these. Other compression methods, like Stuffit or bzip2, are not so readily available, and may give us trouble.
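If you would rather zip from a script than from a desktop program, here is a minimal Python sketch using the standard zipfile module. The function name zip_text_file and the file name in the usage note are made up for illustration; this is not a PG tool, just one way to get a plain Zip archive.

```python
import zipfile
from pathlib import Path


def zip_text_file(text_path: str) -> Path:
    """Pack one text file into a .zip beside it and return the archive path."""
    source = Path(text_path)
    archive = source.with_suffix(".zip")
    # ZIP_DEFLATED is the ordinary Zip compression that common unzip tools read.
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # Store only the bare file name, so it unpacks without directories.
        zf.write(source, arcname=source.name)
    return archive
```

Calling zip_text_file("mybook.txt") (a hypothetical file name) writes mybook.zip in the same directory, containing just mybook.txt.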
We log in via ssh to beryl, which is the Unix system on which we work when posting, the same one that you FTPed the file to, unzip the file and glance at the top of it.
Checking Clearance.
We then check it for copyright clearance. The one and only absolute rule that we NEVER bend, no matter what, is that we WILL NOT post a file that doesn't have a clearance. If it ain't in the clearance files, it don't get posted.
Most regulars know that they should include their clearance line in the e-mail submitting the text, but not everybody does, and not everybody remembers every time. This can be frustrating, when clearance is not included and not obvious.
When Michael gives you your clearance on a book, he sends you back an e-mail that has just one line, something like this:
The Works Of Homer [Iliad/Odyssey] Tr. George Chapman Jim Tinsley 06/14/01 ok
He saves these lines in files that we posters can access. We regard this information as private, so we don't publish the details of who has cleared what.
When we get the text, we check whether the submitter has cleared it. If there is a clearance line in the e-mail notifying us about the text, there's no problem. If we can find the title of the text under the submitter's name in the clearance files, there's no problem. Unfortunately, sometimes we can't find it. There are two usual reasons: either the text submitted is part of the work cleared (for example, submitting one play from a collection), or the text hasn't been cleared yet. If the clearance isn't straightforward, we can go back and forth and round and round in e-mails for a while.
This is why it's a good idea to paste the clearance line into your e-mail.
If the title of the text you're sending isn't the same as the title of the text cleared, BE SURE to paste in the clearance line AND explain that the text you're sending is PART of the cleared book. Please also list the titles of the other parts; it really does cause confusion and delay when this is not clear.
Checking and Editing
Sometimes, people send in a book in a non-text format like Word Perfect or Microsoft Word, or send a text with unwrapped lines. In that case, we try to get the submitter to fix them, but if they can't, we have to convert the file to straight text before starting.
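For the unwrapped-lines case, a rough Python sketch of rewrapping is shown here, using the standard textwrap module. It assumes paragraphs are separated by blank lines and wraps at 70 columns; both the function name rewrap and the width are assumptions for illustration, not a PG rule or a PG tool. It will flatten poetry, tables and indented quotations, so those need handling by hand.

```python
import textwrap


def rewrap(text: str, width: int = 70) -> str:
    """Rewrap each blank-line-separated paragraph to at most `width` columns."""
    paragraphs = text.split("\n\n")
    wrapped = []
    for paragraph in paragraphs:
        # Collapse existing line breaks and runs of spaces, then refill.
        flat = " ".join(paragraph.split())
        wrapped.append(textwrap.fill(flat, width=width))
    return "\n\n".join(wrapped)
```

The words and the paragraph breaks are kept; only the line breaks inside each paragraph move.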
Some producers, particularly inexperienced ones, want to add non-standard annotations and mark-up and symbols to the text. This can get ticklish; we don't want to discourage them, but we need to keep texts reasonably standard. Usually, we can work something out. Maybe the book should be added in both text and HTML, for example.
Assuming that it's a plain text file, we next run gutcheck and a quick spellcheck on the file. This will tell immediately if it adheres to PG standards and if there is any serious problem with it.
If the file looks clean, we may skim it, looking for potential problems or formatting issues. For clean texts, the only things we usually need to change are unindented quotations or inconsistent chapter headings (a lot of people seem to mix "CHAPTER III" with "Chapter 14" and have irregular numbers of blank lines) or spacing and a few 8-bit characters. Occasionally, we have to rewrap a text. We also look out for included publishers' trademarks, which we normally prefer to remove (trademarks are NOT subject to copyright expiration: Macmillan(TM), the publishing house, is still around and trading), unnecessary or downright odd indentation or centering, stray page numbers, and prefaces or introductions or appendices that may not be in the public domain. If the file has lots of 8-bit characters, we probably need to make a separate 7-bit version, and post both.
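A producer can catch a few of these things before sending. The Python sketch below is NOT gutcheck and does not imitate its output; it is a small stand-in that looks for three of the problems mentioned above: over-long lines, 8-bit characters, and a mix of Roman and Arabic chapter numbering. The name quick_checks, the 80-column limit and the heading patterns are all assumptions chosen for illustration.

```python
import re

# Hypothetical patterns: "CHAPTER III" style versus "Chapter 14" style.
ROMAN = re.compile(r"^\s*chapter\s+[ivxlcdm]+\b", re.IGNORECASE)
ARABIC = re.compile(r"^\s*chapter\s+\d+\b", re.IGNORECASE)


def quick_checks(text: str, max_width: int = 80) -> dict:
    """Return line numbers (1-based) and counts for a few common problems."""
    lines = text.splitlines()
    report = {
        # Lines longer than max_width characters.
        "long_lines": [n for n, line in enumerate(lines, 1)
                       if len(line) > max_width],
        # Lines containing any character outside 7-bit ASCII.
        "eight_bit_lines": [n for n, line in enumerate(lines, 1)
                            if any(ord(ch) > 127 for ch in line)],
        "roman_headings": sum(1 for line in lines if ROMAN.match(line)),
        "arabic_headings": sum(1 for line in lines if ARABIC.match(line)),
    }
    report["mixed_chapter_numbering"] = (
        report["roman_headings"] > 0 and report["arabic_headings"] > 0
    )
    return report
```

The Roman pattern is deliberately crude and will also match a heading like "Chapter Mix", so treat the counts as a prompt to look, not as a verdict.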
If the gutcheck and spellcheck don't look clean, or if conversion is required, we may spend a lot more than 15 minutes on it. In a bad case, we may have to get the file re-proofed.
If you are conscious that you're doing something non-standard, and really mean it to stay, say so in your e-mail. (For example, I recently posted a text containing a family-tree representation that had lines over 80 characters. Now, I would have left that one alone anyway, but it helped that the submitter drew my attention to it in the e-mail.) If it's too non-standard, the poster may not allow it to stay, but at least you can discuss it. When a text needs a lot of non-standard formatting or markup, you really need to ask yourself whether you shouldn't be submitting it in HTML, with all the bells and whistles, and settle for something more normal in the text variant.
Mostly, errors are obvious, and there are at least some obvious errors in most texts. When errors are completely obvious, we just fix them without feedback to the producer unless you have specifically asked for feedback in your e-mail.
We're getting more HTML formats now, which is great, but incoming HTML often needs a lot of work, because people who are not experienced with HTML often make mistakes. The W3C <> is the official standard for valid HTML, but, for the average volunteer, it's awkward to use. However, if you're submitting an HTML format, please use Tidy, which you can get from <>, to check your text before sending it.
Header and Footer
We add the PG header and footer. If there is a header and footer already there, we strip them off first, since recent changes in the header mean that a lot of people send files with headers that are out of date. We have written programs to help with this.
We get the number for the text from a program on beryl called "ticket" that Brett Fishburne wrote, that dispenses the next number. That way, if two or three of us are posting at the same time, we won't all grab the same number. We create a 5-letter base filename, checking that it hasn't been used before, and finally zip up the file.
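To show the general idea of handing out numbers so that two posters never get the same one, here is a Python sketch. It is not the "ticket" program and makes no claim about how that program works; the function name take_next_number, the directory of marker files and the ".taken" suffix are all invented. The technique is to claim a number by creating a marker file with exclusive-create, which fails if someone else created it first.

```python
import os


def take_next_number(ticket_dir: str, start: int = 1) -> int:
    """Claim and return the lowest number >= start that nobody has taken."""
    os.makedirs(ticket_dir, exist_ok=True)
    number = start
    while True:
        marker = os.path.join(ticket_dir, f"{number}.taken")
        try:
            # O_CREAT | O_EXCL: creation fails if the marker already exists,
            # so only one caller can ever claim a given number.
            fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            number += 1  # Someone already holds this one; try the next.
            continue
        os.close(fd)
        return number
```

Each call leaves a marker file behind, so a number is never reissued even if the text is later withdrawn.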
We now transfer the .ZIP and .TXT files to the two PG servers. (This is usually the point at which we realize that we forgot to make a change we noticed while checking. Aaaargh!)
5. Notification
At this point, the book is posted, but nobody knows about it! We need to do something about that. . . .
We compose an e-mail to the "posted" e-mail list, cc: the producer, with the line that is to go into GUTINDEX.ALL, the master list of PG files.
The "posted" list has only a few subscribers. These are the people who index and create links to PG texts, and include both PG volunteers and the maintainers of other sites that link to PG texts.
They also commonly download the texts to get more information for their indexes, and tell us if there is anything wrong with the files.
This e-mail is simply the official notification to all these people and the producer that the file has been posted. Here's a sample of such an e-mail:
To: "Posted Etexts for Project Gutenberg" <>
Subject: [posted] Posted (#5301, Duncan) !
From: "Jim Tinsley" <>
Date: Tue, 25 Jun 2002 06:21:27 -0400 (EDT)
Mar 2004 The Imperialist, by Sara Jeannette Duncan [SJD#4][]5301
There may also be some remarks, if the text is in any way non-standard, or if files other than plain text were posted with it.
From this e-mail, you can, if you want to see any corrections made, immediately download the posted file and compare it to your version. Since the notification is made after the file has been copied to the servers, it should be there waiting for you.
To find out how to download a book that has just been posted, see the
FAQ "How can I download a PG text that hasn't been cataloged yet?" [R.3]
6. Indexing
From the "posted" list, the posting line is added to GUTINDEX.ALL and our indexers begin the cataloging process, which is much more thorough, for the website. This includes work like finding authors' dates of birth & death, getting the Library of Congress classification, and the other information that makes up the website searchable index. That process takes extra time, which is why the website searchable catalog must always lag behind the actual titles posted.
7. Corrections
It's remarkable how many people who went over and over the text to the point of hating it suddenly see problems with it when they download it a couple of days after it's posted! Something psychological there, I expect. Anyhow, if you do download your text and see problems with it, don't worry, just e-mail whoever posted it, or any other member of the Posting Team. No, you're not stupid, or if you are, you're in good company, because we've all done it! There's no big deal about replacing the posted file with a corrected copy immediately.
Over time, other readers may submit corrections. If you find an error in a PG etext, see the FAQ "I've found some obvious typos in a Project Gutenberg text. How should I report them?" [R.26]
When the corrections are small, as most are, we will just make the change to the existing text. If there are a lot of changes, we may post a new edition [R.35] with a new edition number; e.g. if the file abcde10 was corrected, we may post abcde11. We never make a new edition when we get corrections immediately after posting.
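The abcde10 to abcde11 step can be sketched in a few lines of Python. This assumes the name is always a 5-character base followed by a 2-digit edition number, which is inferred from the example above and the "5-letter base filename" mentioned under Header and Footer; the function name next_edition_name is invented for illustration.

```python
import re


def next_edition_name(name: str) -> str:
    """Return the file name for the next edition, e.g. abcde10 -> abcde11."""
    # Assumed layout: 5-character base, then a 2-digit edition number.
    match = re.fullmatch(r"(\w{5})(\d{2})", name)
    if match is None:
        raise ValueError(f"not a 5-character base plus 2 digits: {name!r}")
    base, edition = match.group(1), int(match.group(2))
    if edition >= 99:
        raise ValueError("edition number would no longer fit in 2 digits")
    return f"{base}{edition + 1:02d}"
```

For the example in the text, next_edition_name("abcde10") returns "abcde11".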
V.17. How long must a text be to qualify for PG?
The rule of thumb is that we try not to post texts shorter than 25K, or about 350 lines of 70 characters. This rules out, for example, a lot of individual short poems. If you are interested in contributing this type of material, consider making a collection of similar texts: poems by the same author, or magazine articles on the same subject. We have made a few exceptions, like Martin Luther King's "I have a dream" speech, but very few.
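The two figures agree with each other: 350 lines of 70 characters is 24,500 characters, or a little under 25K before line endings are counted. If you want to check a file against the rule of thumb, here is a small Python sketch. It reads "25K" as 25,000 bytes, which is an assumption, and the function name size_report is invented; since the rule is only a rule of thumb, treat the answer as a guide.

```python
def size_report(text: str, min_bytes: int = 25_000,
                chars_per_line: int = 70) -> dict:
    """Report the size of a text against the 25K rule of thumb."""
    size = len(text.encode("utf-8"))
    return {
        "bytes": size,
        # Rough equivalent in full 70-character lines.
        "approx_lines": size // chars_per_line,
        "long_enough": size >= min_bytes,
    }
```

A single short poem will come back with long_enough set to False; a collection of them may well pass.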
V.18. What books are eligible?
A book is "eligible" for posting if we can legally publish it. This is the case if:
1. it is in the public domain in the U.S.A., or
2. the copyright holder has granted unlimited
non-exclusive distribution rights to PG.
V.19. Are reprints or facsimiles eligible?
A reprint or facsimile of a book that would be eligible is itself eligible.
For example, if a book published in 1995 is a reprint of a book published in 1900, then it is eligible. However, the onus is on us to prove that it is a reprint, and if it doesn't say on the TP&V that it is a reprint, confirming its eligibility may be impractical.
V.20. What is the difference between a reprint and a facsimile?
A facsimile retains the page layout and formatting of the original. A reprint keeps the same words, but may lay the pages out differently. For our copyright purposes, there is no difference: we can use either.
V.21. What is the difference between a reprint and a "new edition"?
A reprint contains only the words and pictures that were printed in the original. A new edition is in some way changed; it has different text, or pictures. It may be abridged, or expanded. It may have material added or changed, using other versions of the book.
A new edition gets a new copyright, and has to be cleared based on its own copyright date and status, not the date of the original printing of the title. See also the FAQ "How come my paper book of Shakespeare says it's 'Copyright 1988'?" [C.16] for an example.
Please note that we are talking here about a new edition of the printed book, not a new (corrected) edition number for Project Gutenberg naming purposes.
V.22. What book should I work on?
Nobody in Gutenberg is going to set assignments for you. You decide what book to process. Just pick one that no-one else has already done, or is working on. It's also sensible to pick one that you'll like, since you'll be living with it for a while. On a practical note, it's probably better to start with a short book or even a short story, since a long book can take quite a while to produce.
Start by thinking of books written before 1923. Pick a book you like, and check it out. If it's already done or still in copyright, try other books by the same author.
Visit the Project Gutenberg site and download a full list of Gutenberg books in GUTINDEX.ALL. Have a look at the List of Books In Progress and Complete [B.1]. Look for authors you like, and see what books by them aren't yet available.
Check out your old books. Maybe you have an eligible edition that would be of great help to the project.
Try your library. They may have some eligible editions (books we can prove to be in the public domain) and you will certainly come away with ideas. Ask your librarian. Librarians are keen to help on projects like this.
Browse second-hand bookshops in your area. There are lots of treasures to be picked up very cheaply.
Search for literature pages and bookshops on the Internet.
If all else fails, you can always ask on the Volunteers' Board or try the gutvol-d mailing list [V.12] for ideas. Others may know of books that people are especially looking for, or projects already started where you could help out.
V.23. I have a book in mind, but I don't have an eligible copy.
First, determine whether there are any eligible copies of the book, by finding out the date it was published, possibly from the Catalog of the Library of Congress [B.5] and checking the Public Domain and Copyright Rules [B.1]. If there is a public domain edition, the next problem is to find one to work with.
V.24. Where can I find an eligible book?
The most commonly used outlets are used bookstores, garage sales, library sales, charity shops and any other place that sells old books.
The Internet is a wonderful medium for finding used and antiquarian books: used bookstores all over the world have found ways of co-operating and listing their inventories on the Net, so that whether you live in Los Angeles, Moscow or Perth, you can still find that book you're looking for in a shop in a laneway of Amsterdam. Most on-line listings will quote the publication year of the book, so you can check that it's pre-1923.
Two such sites that allow second-hand booksellers to list their inventory are:
Advanced Book Exchange <>
Alibris <>
The book search page at [B.5] has a list of many such Net bookshops, or you can simply visit any search engine and search for Used or Antiquarian Bookshops. You can often buy eligible books through these sites very cheaply.
If you still can't find the book you need, post a message on the Volunteers' Board or to the gutvol-d mailing list; maybe someone else can find it for you.
Sometimes, it may be possible for you to work from a later edition, so long as somebody who has an eligible edition can check it to make sure that no changes have been made. Sometimes, you may be able to find a modern reprint; reprints may be eligible, as long as they say they are reprints of an edition that would be eligible.
If you can type, or can scan without damaging the book, you can borrow books long enough to produce them. Even if your local library doesn't have the books you want, they may well be able to get them for you on inter-library loan. Ask your librarian about it.
V.25. What is "TP&V"?
This is an abbreviation for "Title Page and Verso", and means a paper or image copy of the front and back of the title page.
Even if the back is blank, we need to have an image of it for the files, to show that it is blank, so that if, in ten years' time, somebody queries our right to publish, we can show that we haven't just lost it.
Publishers print copyright information, like title, author, copyright year and owner, and whether the book was a reprint, on the TP&V, and by filing this, we can prove that the book we produced was in the public domain.
Sending us the TP&V is the One True Way to getting PG copyright clearance [V.37].
V.26. What is "Posting"?
Posting is the final stage in the production process, where the file is given a number and official PG header, and copied onto our FTP servers for distribution. See section 4 of the FAQ "How does a text get produced?" [V.16] for a blow-by-blow account.
V.27. I think I've found an eligible book that I'd like to work on.
What do I do next?
Make sure nobody else is working on it, and that it's not already online somewhere.
V.28. What books are currently being worked on?
Check out David Price's In Progress List (a.k.a. "the InProg List") online at <>. David gets the information from Copyright Clearances that have been done, and organizes it into a list. It can never be 100% up to date, since clearances come in all the time, but it's the best online facility we have, and it's much more clearly presented than the original clearance files.
V.29. How do I find out if my book is already on-line somewhere?
There's no foolproof method; some student somewhere could have scanned it and put it on her college web page without announcing it anywhere. However, there are some regular places to check.
It may sound obvious, but you should always look in the PG archives first. Download GUTINDEX.ALL and keep it handy. Search the InProg List [B.1].
The two other main places to search for your book are the Internet Public Library <> and the On-Line Books Page <>. These projects specialize in indexing books that people make available on-line.
If you still don't see your book on-line anywhere, hit your favorite search engine, and give it the title, author's last name, and preferably a few uncommon words from the first page of the book. Sometimes one of those solo efforts shows up in a general search.
V.30. My book is not on the In-Progress list, and I can't find it on-line.
Is it safe to go ahead and buy it?
Probably. It could have been cleared, but not included in the InProg list yet. If the amount of money to buy it is a consideration, you can e-mail any of the members of the Posting Team, and ask them to check the latest clearances for you. Even this isn't foolproof; another volunteer could be placing their order at the same time you're placing yours. Such duplications do happen, but they are very rare.
V.31. My book is on-line, but not in Project Gutenberg. What should I do?
If the on-line file is from the same edition as the one you have (e.g. not a different translation) then you may be able to submit that file, perhaps slightly edited, to Gutenberg using the clearance from your paper copy. See "I've found an eligible text elsewhere on the Net, but it's not in the PG archives. Can I just submit it to PG?" [V.62] for how to do that.
And of course, you can always still make your own version for PG. It's surprising how often even very similar paper editions have small differences that can be interesting or significant.
V.32. My book is already on-line in Project Gutenberg, but my printed book is different from the version already archived. Can I add my version?
Yes! In fact, assuming that the version already there is in the public domain, you can piggyback on the work already done by what is called "comparative retyping". For example, let's say that you have a later edition than the existing file; you can just take the existing file, edit it to match your paper version, and submit it as a new file. Of course, you must have Copyright Cleared [V.37] your paper version as well.
V.33. I see a book that was being worked on three years ago. Is anyone still working on it?
Maybe, maybe not. Some people abandon books, some people who are regular producers clear them and put them at the bottom of the pile, perhaps for years (though they will get round to them sometime), and some people just simply take two or three years to produce a book.
Once, we put names and contact details on the public InProg list, but for privacy and spam-prevention reasons, we've taken them off. However, the Posting Team have access to the master list of cleared files, and will send a message on your behalf to the person who originally cleared the book, asking if the project is still active, or if the producer wants help.
So if you really want to check this situation out, e-mail one of the
Posting Team.
V.34. I've decided which book to produce. How do I tell PG
I'm working on it?
As soon as you get Copyright Clearance [V.37], your book is entered in the "cleared" files. David Price will take these, and add your entry in his next release of the In Progress List.
V.35. I have a two- or three-volume set. Should I submit them as one text, or one text for each volume?
Quite a lot of 18th and 19th Century books, even straightforward novels, were published as multipart sets. When you have such a set, you should usually submit one text for each volume, and a "complete" text with the contents of all volumes together.
People who do this often complete and submit one volume at a time, until they've finished, and then contribute the "complete" file.
V.36. I have one physical book, with multiple works in it (like a collection of plays). Should I submit each text separately?
If the works are clearly separate, stand-alone texts, and are long enough [V.17] to warrant inclusion on their own in the archives, then yes, you should, and you may also submit a "complete" version as well, if it seems appropriate. This most commonly happens in a collection of plays, though essays and other works may also fit the criteria. Collections of poetry rarely do, since most poems are too short to submit as stand-alone texts.
Sometimes the book includes a preface or introduction or glossary covering all the works in it. In this case, you can decide whether to include these with each of the parts, or save them for the "complete" version.
V.37. How do I get copyright clearance?
Basically we need to see images of the front and back of the title page of the book, which is where copyright information is usually shown. This is called "TP&V", for "Title Page and Verso" [V.25].
To Submit Online:
As of late 2002, we have a new automated upload procedure using a web page. This is by far the fastest and easiest way to get clearance. You need scanned images (PNG, JPEG, TIFF, GIF) of the two pages, of good enough resolution that the text can be read clearly, though the files don't need to be huge.
Just go to <> and follow the instructions.
There are two other, older ways to submit a text for clearance.
To submit by paper mail, photocopy the front and back of the title page, even if the back is blank, write your e-mail address on it, and send the photocopies to:
This is called Title Page & Verso, or TP&V for short, and is needed for copyright research. A colored envelope is best, to make sure your letter is easily recognized as TP&V.
E-mail Michael when you send them, so he knows they're on the way. It's a good idea to check back with him by e-mail after a week or so if you haven't heard from him.
About this, Michael says: "Please include always your e-mail name and address, and mark the envelope with some distinctive mark and or color. Colored envelopes fine. Just something so I can find it easily, the mail here is slow and deep, like snow. Please send a note to: <> for more info."
To submit by e-mail, scan the front and back of the title page, even if the back is blank, and e-mail the images to Greg Newby <> as TIFF, JPEG or GIF in medium resolution. Make sure that the print is legible before you send.
Whichever method you use, you should expect to get an e-mail back after about a week, with one line containing the Author, Title, your name and date with the word "OK" at the end. This means that your text has been cleared.
A Clearance Line looks something like:
The Works Of Homer [Iliad/Odyssey] Tr. George Chapman Jim Tinsley 06/14/01 ok
If you don't get any response, e-mail to check that your TP&V was received OK. If the word at the end of the line is not "OK", then your text is not eligible, and a comment will probably be appended explaining why it is not eligible.
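If you keep your clearance lines in a file and want to check them mechanically, a few lines of Python will do. This is only a sketch: the function name is made up for the example, and it assumes the status is always the last whitespace-separated word on the line, as in the sample above.

```python
def is_cleared(clearance_line: str) -> bool:
    """Return True if the last word of a clearance line is 'ok' (any case)."""
    words = clearance_line.split()
    # An empty line has no status word, so treat it as not cleared.
    return bool(words) and words[-1].lower() == "ok"


sample = "The Works Of Homer [Iliad/Odyssey] Tr. George Chapman Jim Tinsley 06/14/01 ok"
print(is_cleared(sample))
```

A line that ends in anything else should be read by a human, since the appended comment explains why the text was not cleared.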
Don't start work on your book until you get that OK! It's very sickening to do all that work, and then find out that your text can't legally be put on-line!
V.38. I have a two- or three-volume set. Do I have to get a separate clearance on each physical book?
Yes. Some multi-volume works, notably reference books and translations, were published in a series, and it may be that the first volume is 1922, but the others are 1923 or later, so we have to clear each individually.
V.39. I have one physical book, with multiple works in it (like a collection of plays). Do I have to get a separate clearance for each work?
No. Since they were all printed together, one TP&V will suffice for all, but . . .
You should list each separate title included, if you intend to submit each title separately (see the FAQ "I have one physical book, with multiple works in it (like a collection of plays). Should I submit each text separately?" [V.36]). If, say, you clear a "Collected Plays of Sheridan", and later submit an eBook of "The School for Scandal", we will have trouble finding your clearance unless we have made a note that "School for Scandal" is part of the contents of "Collected Plays".
In a case like this, you should include, on your paper or e-mail, something like:
George Bernard Shaw. Plays Unpleasant. 1905.
Preface to Unpleasant Plays
Widower's Houses
The Philanderer
Mrs. Warren's Profession
You only need to do this when you are going to submit each part separately, which is commonly the case with plays, and sometimes essays, stories and novellas. Taking a different example, the "Collected Poems of Emily Dickinson", we would not need to list the contents, since we wouldn't publish each poem separately.
There is one exceptional case: if your book was printed after 1923, but contains stories or plays some of which are stated to be reprints of pre-1923 editions, you should give as much detail as possible about what you intend to submit.
V.40. Who will check up on my progress? When?
Nobody. There are no schedules or timetables. You're welcome to contact other volunteers [V.12] with comments or questions, though.
V.41. How long should it take me to complete a book?
Most books get done in between one and three months, but this varies wildly. It depends on the amount of time you can afford to give it, the length of the book and, if you're not typing, the quality of the scan: if the book scans badly, you need to put more time into proofing.
Some very productive volunteers manage to turn out an e-text a week; some books can take a year or more.
Scanning itself doesn't take too long. Even if it takes you as much as two minutes per page to scan, you will still complete a 300 page book in 10 hours, and you will probably be scanning much faster than that [S.9]. The problem is that the text generated by the scanner and your OCR package is usually faulty. There are many cute scanner errors, mistaking b for h, or e for c, so that "heard" is scanned as "beard" or "ear" as "car". Makes the story more interesting sometimes!
So now you need to do a first proof of the e-text. Read it carefully, correct scanning mistakes, and make sure that you haven't left out pages or got them in the wrong order. Unless your scan was exceptionally good, this is the time-burner in the process.
When you've done the first proof, you can either do a second proof yourself, or send it to another volunteer for second proofing.
If you're a typist, of course, you can skip right over the messy scanning and scan-correction process. Yay typists!!
V.42. I want/don't want my name published on my e-text
No problem. When you send the e-text for posting, mention exactly what, if anything, you want the Credits Line [V.47] to say.
V.43. I'd like to put a copy of my finished e-text, or another Gutenberg text, on my own web page.
Great! PG encourages the widest possible distribution of e-texts. We like to publish everything in plain text, which is the most accessible format, since everybody can read plain text. But once it's available in plain text, it's open to you or anyone else to convert it to other formats like HTML for further distribution.
If you are reposting a text, though, please be careful to check that your posting complies with the conditions spelled out in the header, especially for copyrighted works.
V.44. I've scanned, edited and proofed my text. How do I find someone to second-proof it?
You can post a request on the Volunteers' Board, or on the gutvol-d Mailing List. You will probably get some offers there. In a difficult case, you might ask Michael Hart to add it to the "Requests for Assistance" section of the next Newsletter.
In general, the best way to handle it is to make a co-operative proofing project out of it. This is like a miniature version of the distributed proofreading sites, without the page images.
There are always people looking for proofing work, but many beginners take on more than they can handle, and don't finish the job, and this can be very disappointing if you give the whole thing to one volunteer who then vanishes without trace. You can minimize the risk of this by splitting the book into chunks of about 20-30 pages each, or one chapter if that's around the right size. Write explicit instructions about what you want them to do when they spot a suspected error, like fix it or mark it with an asterisk. (Marking is probably safer with beginners who don't have the book or an image of the page to refer to.) Give the first chapter to the first person who responds, the second to the second, and so on. As you hand out the chapters, let the proofers know that if they're not returned within three or five days, you'll assume they've quit. Three days is more than enough time for 20 pages. If someone returns a chapter, you can give them another. If someone doesn't get back to you within the time set, assume they're not going to, and recycle that chapter to someone else. No hard feelings, no problem. This process of "co-operative proofing" ensures that beginning proofers don't choke on the work, and that one vanishing volunteer doesn't hold up the whole project.
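If you like to keep track of this kind of thing with a script, the splitting and hand-out can be sketched roughly as follows. The function names, the default of 25 pages per chunk, and the proofer names are all invented for illustration; any bookkeeping method that works for you is fine.

```python
def split_into_chunks(pages, chunk_size=25):
    """Split a list of pages into consecutive chunks of at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]


def assign_chunks(chunks, proofers):
    """Give the first chunk to the first responder, the second to the second, and so on.

    Chunks beyond the number of proofers stay unassigned (None) until someone
    returns a chunk and can be given another.
    """
    assignments = {}
    for index, _chunk in enumerate(chunks):
        assignments[index] = proofers[index] if index < len(proofers) else None
    return assignments


pages = list(range(1, 101))              # a hypothetical 100-page book
chunks = split_into_chunks(pages, 25)    # four chunks of 25 pages
print(assign_chunks(chunks, ["Ann", "Bob"]))
```

When a proofer misses the deadline, you would simply set that chunk back to unassigned and hand it to the next person who asks.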
V.45. I've gone over and over my text. I can't find any more errors, and I'm sick of looking at it. What should I do now?
We all know that feeling! Particularly with your first book, you've probably gone through a patch when you thought you'd never finish, and when you do, you can't stand the idea of looking at it again. Heh. Cheer up: the first twenty texts are the worst! :-) And you'll feel a lot better when you see your text available for everyone to read.
You have three choices:
You can send it for posting as it is. [V.46]
You can put it aside for a week or so, and come back to it with fresh eyes.
You can ask in any of the standard ways [V.12] for someone else to second-proof it for you. This has a lot to recommend it; it gets other sets of eyes looking at the text, it relieves the pressure that you may feel, it may rekindle your enthusiasm for the text, it allows you to "meet" other volunteers, and possibly form partnerships for future PG collaboration. Above all, it gives new proofers a chance to get their feet wet, and this is good for them, and good for PG. You are not only contributing a text, you're helping to train and encourage the next generation of producers.
V.46. Where and how can I send my text for posting?
As of late 2002, we have a new automated upload procedure using a web page. This has a lot of good things going for it, because we keep a record of what's uploaded, you get an e-mailed copy of the notification, you don't have to fiddle with FTP, and we can make up the header automatically from the information you enter, which saves time and prevents keying errors.
As always, it's better to ZIP your file first, because it'll take less time to transfer.
Just go to <>, fill in the form, specify the file to upload, and hit "Send" at the bottom.
And you're done!
If, for some reason, you can't use this page, there are two backup options: you can e-mail it, or you can upload it by FTP. Whichever you use, it is always best to ZIP the file first if you can.
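Any ZIP program will do. If you happen to have Python installed, one possible way to make the ZIP is sketched below, using the standard zipfile module. The function name and the file names are only examples.

```python
import zipfile
from pathlib import Path


def zip_text_file(text_path: str) -> Path:
    """Create a ZIP next to text_path containing just that file; return the ZIP path."""
    source = Path(text_path)
    target = source.with_suffix(".zip")
    # ZIP_DEFLATED compresses the data; the default (ZIP_STORED) only wraps it.
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # arcname keeps only the file name, not the full directory path.
        archive.write(source, arcname=source.name)
    return target
```

Calling zip_text_file("hamlet.txt") would write hamlet.zip in the same directory, ready to upload or attach.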
If you are comfortable with sending files by FTP, this is better than e-mail. First, you will need a username and password, which you can get by e-mailing any of the Posting Team.
If you already know how to use command-line FTP, here's how to do it:
Log in to using the username and password supplied and change to the work directory by typing "cd work". Change to binary mode with the "bin" command and "put" your file.
Summary instructions:
login: yourlogin
password: yourpassword
cd work
bin
put yourfile.ext
Here is a sample session:
Connected to
220-Access from unknown@ logged.
220 FTP Server
User ( xxxxxxxx
331 Password required for xxxxxxxx.
Password: xxxxxxxx
230 User xxxxxxxx logged in.
ftp> cd work
250 CWD command successful.
ftp> bin
200 Type set to I.
ftp> put MYFILE.ZIP
200 PORT command successful.
150 Opening BINARY mode data connection for MYFILE.ZIP.
226 Transfer complete.
ftp: 172313 bytes sent in 17.34Seconds 9.94Kbytes/sec.
ftp> quit
When you are in the work directory, you will not be able to list files, but they do exist and they are there.
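For those who would rather script the upload than type the commands, the same steps can be sketched with Python's standard ftplib module. This is an illustration only: the function name is invented, and the host name and login in the commented usage are placeholders, not the real server details, which you get from the Posting Team.

```python
from ftplib import FTP


def upload_to_work(ftp, local_path, remote_name):
    """Change to the 'work' directory and upload one file in binary mode.

    `ftp` is an already logged-in ftplib.FTP object (or anything offering
    the same cwd and storbinary methods).
    """
    ftp.cwd("work")  # same as typing: cd work
    with open(local_path, "rb") as handle:
        # storbinary sets binary transfer (TYPE I) itself, like "bin" then "put".
        ftp.storbinary(f"STOR {remote_name}", handle)


# Hypothetical usage; host and credentials are placeholders:
# with FTP("ftp.example.org") as ftp:
#     ftp.login("yourlogin", "yourpassword")
#     upload_to_work(ftp, "MYFILE.ZIP", "MYFILE.ZIP")
```

Since you cannot list the work directory, the script has no way to confirm the upload by looking; it relies on storbinary raising an error if the transfer fails.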
When you have uploaded your file, e-mail a note to any or all of the Posting Team, including your:
1. filename
2. credits line as you want it on your text
3. clearance line you received [V.37]
An ideal note might be:
Subject: Beryl upload for posting: Hamlet
I have uploaded to beryl:
Hamlet, by William Shakespeare
File is:
Credits line is: Produced by John Doe <>
Clearance was given as:
Hamlet William Shakespeare John Doe 05/03/02 ok
If you'd rather send it by e-mail, send the e-mail, including the Credits Line and Clearance Line as in the sample above, to any or all of the Posting Team, with your text as an attachment. Again, ZIPped is better, since it avoids certain damage that can happen to a plain text e-mail along the way.
Do not add the Project Gutenberg header or footer to your file, unless we specifically asked you to. If you do add it, we'll just have to strip it off again, since we add headers automatically when posting. There are times, perhaps when you're working in an unusual non-editable format, when we may give you a header and ask you to add it, but this is rare.
Please read section "4: Posting" of the FAQ "How does a text get produced?" [V.16] for more detail about what happens in posting. Especially, if you want to draw some peculiarities of this text to the Posting Team's attention, or want feedback on any minor edits done during posting, you should say so in the e-mail you send.
Don't assume that we know anything when you send the e-mail. We don't know what you want us to put on the Credits Line. We don't know that this is an unusual text, and needs some kind of special reformatting. We don't know that the text should be split into two volumes before posting. We don't know that you would really like us to check it closely before posting. You have to tell us, exactly and precisely, what you want on the Credits Line. If the text needs some specific work, you have to tell us exactly what that is. And please do that in your e-mail, not in the text itself. Remember that we could be dealing with five or ten other texts at the same time, and even if the poster you discussed it with two weeks ago is the same one who posts the book, he may not remember.
V.47. What is the "Credits Line"?
The Credits Line is a line that the Posting Team can insert into each PG text naming the producer or producers of a particular text.
You should decide what you want on the Credits Line of your text; it's really not up to us.
Most credits lines are something like:
Produced by John Doe <>.
If you don't want to be mentioned by name at all, just say, in your e-mail:
Please omit the Credits Line for this text. I want to contribute
it anonymously.
If you do want to be mentioned, please give the exact wording you want us to use. Some people want their name only; they don't want us to include their e-mail addresses. Others want to make their e-mail addresses public so that readers can contact them with comments. That is entirely up to you, but you do need to tell us. If you do want to include your e-mail, remember that having it permanently on the net is a spam-magnet, and we can't effectively remove or change it later.
Occasionally, a Credits Line can spill onto more than one line, for example:
This text was converted to HTML by Jane Roe <>
from an original ASCII text scanned by Jack Went
and proofed by Jill Hill
V.48. How soon after I send it will my text be posted?
First read the "Posting" section of the FAQ "How does a book get produced?" [V.16] to understand the process.
You should expect some response within three or four days. We try to get to all submissions within that time. In most cases, that response will be simply the official notification that it has been posted. If there is a query on your text, for example if we can't find the copyright clearance or if we have trouble converting or correcting your text, we will probably e-mail you back directly with questions.
If you don't hear from us within four days, send a follow-up e-mail; it could be that your original note never got to us, or just fell through the cracks.
If your file happens to arrive while one of us is logged in and working, it could get posted within the hour. Some frequent contributors who know our habits know just how to time their uploads!
V.49. I found a problem with my posted text. What do I do?
Most postings go smoothly, but problems can happen. Sometimes, one of the servers is down. Sometimes a file gets corrupted for some unknown reason. Sometimes, let's face it, we screw up.
Usually, one of the indexers will tell us about it, but if you catch it first, e-mail whoever sent out your notification e-mail and explain the problem. Don't worry; your original file will be quite safe, since we keep these long after posting them.
V.50. Someone has e-mailed me about my posted text, pointing out errors.
Since you're the original producer, you're in the best position to decide whether these are real errors. If they're right about it, tell the Posting Team and we'll correct the text.
V.51. Someone has e-mailed me about my posted text, thanking me.
Nice feeling, isn't it? :-)
About Proofing
V.52. What role does proofing play in Project Gutenberg?
A very big one!
Typists' work doesn't usually need many corrections, but unfortunately, scanners and OCR packages are far from perfect, and scanned text varies from "almost-right" down to "maybe I should consider typing instead of scanning". Proofing is the process that turns a scan into a readable e-text.
Proofing a typist's work is straightforward; you just read it, and keep an eye out for mistakes. Typists typically have few mistakes in their texts, but the errors that they do make tend to be hard to spot. Proofing OCRed text has its quirks, and you can expect many, many errors to correct.
The only thing that all proofers agree on is to differ in their methods. Some people scan and almost complete the proofing process within their OCR package, others do no editing at all within their OCR. Some spell-check first, others spell-check last. Some work through in one pass, doggedly line by line, others make several light passes. Some start at the end and work backwards! Some proofers mark all queries with special characters like asterisks (*) in the text, most just make all the obvious changes and mark only the dubious ones. Some people always send their texts out for proofing, others prefer to do it all themselves.
So this guide is not prescriptive; this is not the "only way" to do it. The only rule is that, at the end of the process, your e-text should be as error-free as you can make it, and should conform to Gutenberg's editing standards, which are mostly just common sense guidelines to make readable text.
The aim of this FAQ is to give you an understanding of what text looks like when it comes fresh off the scanner, and an overview of the whole process by which it becomes a publishable e-text.
V.53. What is Distributed Proofing?
It has always been common for volunteers to share proofing work among themselves: you take the first five chapters, I'll take the next, and so on.
When you're just starting as a PG volunteer, you should go to one of the Distributed Proofing sites [B.4] and do some work there to get a grounding in the basics and a feel for whether you would like to continue working in PG. In distributed proofing, you get a very short section, as little as a page of text at a time, and usually an image file of the page as it was scanned. You then make the text match the image. This is a great start, since all you have to do is read, compare and correct. However, other work also needs to be done, and will normally be done by the project managers of these sites. The samples below give you an idea of the whole process, and also some ideas of what proofing a whole book from start to finish is like.
V.54. What do I need to proof an e-text?
You actually need only two things: the e-text itself and a text editor or word-processor that can handle book-sized files and save them as text.
Nearly all word processors and text editors in current use will work. Volunteers use many common programs, including WordPerfect, Microsoft Word, WordPad, DOS EDIT, vi, Brief, Crisp, EditPad, MetaPad, emacs, AbiWord, and the word processors from Open Office and AppleWorks; all of these are in actual use by volunteers today. Since all of them contain the necessary basic functions, the best program is the one you're most comfortable with.
Be cautious with recent, powerful word-processors that "auto-correct" text, or use "smart quotes" or any other such automatic retyping or formatting feature, since they can Do Bad Things to your e-text without your consent! When using any such package, it is best to switch off any feature that makes changes without asking you.
Two utilities which may come in useful are a spell-checker and a version difference checker. These may be built into your word processor, or you may have them as separate packages.
A spell-checker is like a chain-saw: a powerful tool, but one to be used very carefully. It is very easy to say "Yes" to the wrong change, and make a really bad mess of the text. Spell-checkers have problems with proper names, foreign words, archaic usages, and dialects. Incautious use can leave you with a text such as that immortalized in the
Owed two a Spell in Chequer.
Eye half a spell in chequer,
It cane with my Pea Sea.
It plane lee marques four my revue
Miss steaks eye can knot sea.
Every e-text should pass through a spell-checker at some point, but the human half of the partnership needs a very light hand on the confirmations of change!
A difference checker, such as FC or COMP for MS-DOS, diff for Unix or ExamDiff <> for Windows, may also come in handy. A difference checker compares two versions of the text, and points out the changes. This is important when you've sent a text out for proofing, and you get it back with changes. Rather than re-reading the whole text, you can use a difference checker to highlight the changes so that you can verify them against the printed text. As a proofer, you can use it to compare the original text with what you're sending back to ensure that you've only changed what you meant to change.
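If you have none of those programs to hand, Python's standard difflib module can do a similar job. The sketch below is just one way of using it; the function name and the two sample texts are made up for the example.

```python
import difflib


def show_changes(original_text: str, proofed_text: str) -> list[str]:
    """Return unified-diff lines describing what changed between two versions."""
    return list(
        difflib.unified_diff(
            original_text.splitlines(),
            proofed_text.splitlines(),
            fromfile="original",
            tofile="proofed",
            lineterm="",
        )
    )


before = "Only tbe eat beard him.\nHe went home."
after = "Only the cat heard him.\nHe went home."
for line in show_changes(before, after):
    print(line)
```

Lines starting with "-" come from the original and lines starting with "+" from the proofed version, so each pair can be checked against the printed book.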
V.55. Do I need to have a paper copy of the book I'm proofing?
No. Your job as proofer is to ensure that the e-text you're working on is readable in itself, and contains no obvious errors. Where you think there might be an error, but you're not sure, you mark the spot in the e-text, and let the volunteer who has the paper book look it up.
V.56. What's the difference between "first proof" and "second proof"?
These are fuzzy terms used to indicate how accurate the e-text is, and what type of work is needed to improve it. Quite commonly, the same volunteer who scans the book proofs the whole thing in one or two passes. Sometimes, given a good scan, the text can be sent out for "first proof" with little or no preparatory fixing-up. Often, the volunteer who scanned it makes quite a lot of corrections, then sends the text out for "second proof".
A text is ready for first proofing when it's obvious that there are plenty of errors, but it's possible to figure out, in almost every case, what the correct text should be without needing to refer to the book.
The objective of first proofing is to eliminate all the obvious errors, so that if you speed-read through the text, you probably won't notice any.
Second proofing involves taking a text that has been first-proofed and correcting all the remaining, more subtle errors. Often, some simple errors such as incorrect spacing and quotes may be left for second proofing. Texts that have been typed instead of scanned will always be of at least second-proof quality.
V.57. What do I do with an e-text sent to me for proofing?
First, establish reasonable expectations. A typical book takes 10-15 hours of concentrated effort, and when you first start, you're climbing a learning curve. For your first session, decide to mark out a chapter or two (something like 500 to 1,000 lines) and work only on that. If you get through 1,000 lines in your first sitting, you have done extremely well! It's a good idea to send this first 1,000 lines or so back immediately. The volunteer who sent you the e-text will comment on it, and let you know about any style guidelines you may have breached or common errors you may have missed. Most beginning proofers do make mistakes, so don't worry about it; it's easier to correct these in 1,000 lines than to go back over them in 15,000 lines!
You will usually receive the e-text as an attachment to your e-mail. It's better to send e-texts as attachments than to paste them as text into the body of the e-mail, to make sure that the text isn't changed by different e-mail clients. It's better to send e-mailed attachments as ZIP files [R.20], since e-mails sent as text can be damaged along the way. But whether you receive a TXT file or a ZIP file that you have to open, you should save the .TXT file to your hard disk and open it with your editor.
It may be that the text you see appears double-spaced (every second line is blank) or that all the text is on one incredibly long line. This is a familiar effect when moving between a DOS/Windows computer and a Mac or Unix system, but it can happen between any two editors. It is caused by the use of different characters to mark the end of a line. If you have this problem, ask whoever sent you the text to re-send it, telling them what kind of computer and editor you have.
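If you are comfortable with a little scripting and would rather repair the file yourself, the three usual end-of-line conventions (CR LF on DOS/Windows, CR on older Macs, LF on Unix) can be converted to one. This is a sketch of that idea, not a replacement for asking the sender; the function name is invented, and it will not repair files damaged in other ways.

```python
def normalize_line_endings(raw: bytes, newline: bytes = b"\r\n") -> bytes:
    """Convert CRLF, lone CR, and lone LF line endings to one chosen ending."""
    # Collapse everything to LF first so that CRLF is not converted twice.
    unified = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", newline)
```

Read the file in binary mode, pass the bytes through the function with the ending your own system expects, and write the result to a new file so the original stays untouched.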
Now you make any changes that obviously need to be made, and mark any places where the text looks wrong, but you're not sure what the right text should be. You can usually use asterisks (*) to mark these dubious spots, but you might use other characters if the text already contains asterisks. When in doubt, mark them all, and let the volunteer with the text sort them out!
It is usually best not to make global changes to line lengths by reformatting lots of paragraphs, since the person who sent you the e-text may want to use a difference checker when you return it, and changed line-lengths throughout mean that every line will be different.
When working on a long text, or when making a lot of changes, it may be wise to save several versions of the text with different filenames at different stages so that if something goes badly wrong, you can revert to the last good version. This applies especially to saving the text just before performing a spell-check.
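Saving a copy under a new name in your editor is all this needs. For those who prefer a script, here is one possible sketch; the function name and the labelling scheme are just assumptions for the example.

```python
import shutil
from pathlib import Path


def save_version(text_path: str, label: str) -> Path:
    """Copy text_path to a sibling file such as 'book.before-spellcheck.txt'."""
    source = Path(text_path)
    backup = source.with_name(f"{source.stem}.{label}{source.suffix}")
    if backup.exists():
        # Refuse to overwrite an earlier saved version.
        raise FileExistsError(f"{backup} already exists; pick another label")
    shutil.copy2(source, backup)  # copy2 also preserves the file's timestamps
    return backup
```

For instance, save_version("book.txt", "before-spellcheck") leaves book.txt alone and writes book.before-spellcheck.txt beside it.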
When you're finished with the e-text, make sure you save it as a plain text file (.TXT) and send it back by zipping it if you can, and attaching it to an e-mail.
V.58. What kinds of errors will I have to correct?
Each text has its own peculiarities, but there are a number of well-known scanning errors you will be dealing with all the time.
Punctuation is always a problem. Periods, commas and semi-colons are often confused, as are colons and semi-colons. There are also usually a number of extra or missing spaces in the e-text.
The problem of quotes can assume nightmarish proportions in a text which contains a lot of dialog, particularly when single and double quotes are nested.
The numeral 1, the lower-case letter l, the exclamation mark ! and the capital I are routinely confused, and often, single or double quotes may be mistaken for one of these.
Lower-case m is often mistaken for rn or ni.
The letters h and b and e and c are commonly mis-read, and these are probably the hardest of all to catch, since ear/car, eat/cat, he/be, hear/bear, heard/beard are all common words which no spell-checker will flag as problems.
For example:
" Hello1' caIled jirnmy breczily. 11Anyone home ? "
There seemed to he no-oneabout. Only tbe eat beard him."
should read:
"Hello!" called Jimmy breezily, "Anyone home?"
There seemed to be no-one about. Only the cat heard him.
As well as scanner errors, which affect one letter at a time, you have to keep an eye out for editing mistakes by the volunteer who scanned the text or by previous proofers. These are typically cases where a whole line, paragraph or page has been omitted or misplaced. They show up as sentences that don't make sense, or paragraphs that don't follow from the previous one.
This means that you have to keep reading the flow of the text, so that you can spot context errors as well as typos.
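Some of the letter-level errors described above can be hunted with simple pattern matching. The sketch below shows the idea in Python; the patterns and their names are rough heuristics invented for this example, not a complete list, and they cannot catch real-word errors like "he" for "be" or any missing lines.

```python
import re

# Each pattern is a rough heuristic, not a complete list of scanning errors.
SUSPECT_PATTERNS = {
    "digit inside a word": re.compile(r"[A-Za-z]\d|\d[A-Za-z]"),
    "capital inside a word": re.compile(r"\b[a-z]+[A-Z][a-z]*\b"),
    "space before punctuation": re.compile(r"\s[?!;:,.]"),
    "tb where th is likely": re.compile(r"\b[Tt]b[a-z]"),
}


def flag_suspects(line: str) -> list[str]:
    """Return the names of every heuristic that matches the line."""
    return [name for name, pattern in SUSPECT_PATTERNS.items() if pattern.search(line)]


print(flag_suspects("There seemed to he no-one about. Only tbe eat beard him."))
```

Note that the sample line still contains "he", "eat" and "beard", none of which any pattern flags; only careful reading finds those.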
V.59. How long does it take to proof an e-text?
This depends on how long the e-text is, how clean the text is when you start, and how thorough you're being, as well as how much time per day you can give it and how fast you can proof.
On a first proof, it can take a very long time to get the e-text to a readable condition if it scanned badly. As a beginner, you would be unlikely to be given such a difficult text to work with. First proofs are usually done by the same person who did the scanning, and are only given out in the context of established scanning/proofing teams.
You might expect to proof anywhere between 500 and 2,000 lines per hour during a second proof. A short novel or novella might have as few as 6,000 or 7,000 lines; War and Peace weighs in at about 54,000 lines. Most novels run to 10,000 to 15,000 lines. So you might spend anything between 5 and 30 hours second-proofing a standard book, with 10 to 15 hours being typical.
For an average novel, a week or two for second proofing is good going. A month is reasonable.
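The 5 and 30 hour figures are just the line counts divided by the fastest and slowest proofing rates. If you want to run the same sum for your own book, a small sketch follows; the function name is invented, and the default rates are simply the 500 and 2,000 lines per hour quoted above.

```python
def second_proof_hours(total_lines: int, slow_rate: int = 500, fast_rate: int = 2000):
    """Return (best_case_hours, worst_case_hours) for second-proofing a text."""
    return total_lines / fast_rate, total_lines / slow_rate


for lines in (10_000, 15_000, 54_000):
    best, worst = second_proof_hours(lines)
    print(f"{lines:>6} lines: {best:.1f} to {worst:.1f} hours")
```

A 10,000-line novel at the fastest rate gives the 5-hour figure, and a 15,000-line novel at the slowest rate gives the 30-hour figure.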
Proofing an e-text is a significant amount of work, and you may find it psychologically more comfortable to take on a chunk at a time (say 1,000 lines per session) and send that proofed section back, rather than wait until the whole job is done before sending anything back. This helps to avoid the fairly common case where you keep falling behind where you expect to be until you dread the thought of getting back to the text, and finally just abandon it.
If you find after a while that you just don't want to continue, please tell the person who sent you the text that you're not going ahead with it. It's very frustrating for the volunteer who scanned the book, and who wants to get it posted, to wait for two or three months, only to have to start all over again with another proofer.
V.60. Are there any special techniques for proofing?
The classic way to proof is to open the text in your editor or word processor, and just start reading carefully.
This method has received a major boost since editors and word processors have added a feature of showing squiggly red underlines under words not in their dictionary. While this is very useful, you still need to read carefully, since not all errors produce misspelled words. The classic, and very common, example of this is scanning "he" for "be". These visual spellchecks also commonly do not check words beginning with capitals. Capitalized words are commonly names not in the dictionary, and when checking of capitalized words is switched off, they will not query "Tbe". Other errors that a spellchecker doesn't look for include missing spaces, mismatched quotes and misplaced punctuation. For these, you can try gutcheck [P.1]. And of course, no automatic check will find omitted lines or words. Worse, spellcheckers will query words not in their dictionary that might be quite correct, and this can be quite troublesome when dealing with older texts or dialect.
Still, if your concentration is up to the job, scrolling through a text with non-dictionary words underlined in red is a fast and effective way of giving a text the final once-over.
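To give a feel for the kind of check a spellchecker misses, here is a small Python sketch that looks for paragraphs with an odd number of double quotes. It is an illustration only and is not how gutcheck works; the function name is invented, and it assumes paragraphs are separated by a blank line.

```python
def paragraphs_with_odd_quotes(text: str) -> list[int]:
    """Return 1-based numbers of paragraphs with an odd number of double quotes."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return [
        number
        for number, paragraph in enumerate(paragraphs, start=1)
        if paragraph.count('"') % 2 == 1
    ]


sample_text = '"Hello!" called Jimmy.\n\n"Anyone home?\n\nNo answer.'
print(paragraphs_with_odd_quotes(sample_text))
```

Treat every hit as a query rather than an error: a speech that runs over several paragraphs legitimately leaves its opening quote unclosed until the last one.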
Volunteers have also used other techniques for proofing. Some people can't sit at their screen and read for hours; many people don't want to.
Some people just use the good old-fashioned method of printing out the text to be proofed, and blue-pencilling the mistakes.
It is becoming fairly common now for people to load the text onto their PDA, and read it from that. Mistakes found can be bookmarked or jotted down and fixed when they go back to their PC.
Getting your computer to read the text aloud is a very effective way of achieving high accuracy. Modern PCs have audio capabilities built in, and it is possible to find free or cheap shareware "read-aloud" text-to-speech packages for just about everything. Some PDAs are also capable of doing text-to-speech.
The first time you try text-to-speech, it will probably sound and feel a little strange, but you will quickly learn to hear errors in words. This can be very effective, but you should have given the text at least a light proofing before you begin; it is hard to deal with a high number of errors using a text-to-speech method.
When proofing by a speech program, you either set your text-to-speech program to pronounce all punctuation, or, if that is not possible, you make a special version of your text to feed it, first doing a global replace of "," with " comma ", ";" with " semi-colon ", and so on. Mark a block of 500 to 1,000 lines for reading aloud, and set the reading speed to whatever is comfortable for you. Then you sit down with the original book in front of you, and listen. When you hear an error, mark the place in the text with a light pencil. Stopping the reading at every error, editing the text and restarting is possible, but it breaks the flow, and ends up taking longer. When the reading is done, go to your keyboard and correct the errors found.
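The global replace described above can be sketched in a few lines of Python. Only the comma and semi-colon entries come from the text; the colon and question-mark entries, the SPOKEN table and the function name speak_punctuation are assumptions added for illustration. Run it on a copy of the text, never on the file you intend to submit.

```python
# Spoken names for punctuation. The first two follow the FAQ; the rest are assumed.
SPOKEN = {
    ",": " comma ",
    ";": " semi-colon ",
    ":": " colon ",
    "?": " question mark ",
}

def speak_punctuation(text, spoken=SPOKEN):
    """Return a copy of text with punctuation spelled out for a text-to-speech program."""
    for mark, word in spoken.items():
        text = text.replace(mark, word)
    # Collapse the doubled spaces (and line breaks) that the replacements leave behind.
    return " ".join(text.split())

print(speak_punctuation("sweet wine, and for the third time with water;"))
```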
V.61. What actually happens during a proof?
Stage One—The original Scan
We start with a scanned e-text, in this case a paragraph from The Odyssey. The paragraph used as an example here has been "enhanced" with more errors than in the real scanned text, so that you can see samples of many problems all in one place.
We begin by looking at the original OCRed text, of which our sample section reads:
1There Periniedes and Eurylochus held the victims, but l
drew my sharp sword from my thigh, and dug a pit, as it were
a cubit in length and breadth, and about it poured a drink-
offering to all the dead, first with mead and thereafter with
sweet wine, and for the third time with water, And 1 sprink-
ODYSSEY X, 24-56.
ODYSS.EY XI, %4-56. 173
lef white incal thereon, and entreated with many prayers
strengthless beads of the dead, and prornised that on my
return to Ithaea 1 would offer in my halls a barren heifer,
the best 1 had, and fil the pyre with treasure, and apart unto
Teiresias alone sacrifice a black rarn without spot, the fairest
of my flock. But when 1 bad hesought the tribes of the
d with vows and prayers, 1 took the sheep and cut their
s over the trench. and the dark blood flowed forth,
he spirits of the dead that he departed gathered
from out of Erebus.
It's clear that we should tidy up the page headings and numbers that have been scanned in with the main text, and that we should separate the paragraphs and remove the spaces inserted by the scan at the start of some lines. We also need to restore some of the text that got lost in the scan. Since there isn't much of it, we just type it in. Having done this, we get to . . .
Stage Two—First pass through the scanned text
At this point, we have a complete text. All of the words are actually there, and we have eliminated page breaks and other extraneous artifacts of scanning. Again, mileage varies: some people like to preserve page breaks and numbering until much later, to make it easy to refer back from the e-text to the book.
Our job in this phase is to fix all of the obvious scanning errors and double-check that we really do have all the text. Our aim here is to create an e-text that is ready for First Proof. In fact, since it's fairly clear what all the words are, this text could be considered ready for first proof.
1There Periniedes and Eurylochus held the victims, but l drew my sharp sword from my thigh, and dug a pit, as it were a cubit in length and breadth, and about it poured a drink- offering to all the dead, first with mead and there after with sweet wine, and for the third time with water. And 1 sprink- led white incal thereon, and entreated with many prayers the strengthless beads of the dead, and prornised that on my return to Ithaea 1 would offer in my halls a barren heifer, the best 1 had, and fill the pyre with treasure, and apart unto Teiresias alone sacrifice a black rarn without spot, the fairest of my flock. But when 1 bad besought the tribes of the dead with vows and prayers, 1 took the sheep and cut their throats over the trench. and the dark blood flowed forth, and lo, the spirits of the dead that he departed gathered them from out of Erebus.
Now we convert those numeral 1s to capital Is or to quotes, where appropriate; we straighten up the quotes; and we deal with other obvious scanning errors, which brings us to . . .
Stage Three—The First Proof
At this point, we could hand over the text to an experienced proofer who doesn't have a copy of the book. This would be called a "first proof". An e-text is at first proof stage when there are still plenty of errors, but in each case it's pretty obvious what the correct word is. The excerpt now looks like normal text.
Unfortunately, in stage two above, we accidentally deleted a line.
'There Periniedes and Eurylochus held the victims, but l drew my sharp sword from my thigh, and dug a pit, as it were a cubit in length and breadth, and about it poured a drink- offering to all the dead, first with mead and there after with sweet wine, and for the third time with water. And I sprink- led white incal thereon, and entreated with many prayers the strengthless beads of the dead, and prornised that on my return to Ithaea I would offer in my halls a barren heifer, Teiresias alone sacrifice a black rarn without spot, the fairest of my flock. But when I bad besought the tribes of the dead with vows and prayers, I took the sheep and cut their throats over the trench, and the dark blood flowed forth, and lo, the spirits of the dead that he departed gathered them from out of Erebus.
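The conversion of stray numeral 1s into capital Is and opening quotes, done at the end of Stage Two, is the kind of step that can be partly automated. The following Python sketch is one hedged way to do it; the function name fix_numeral_ones and both patterns are assumptions, and a genuine number such as "chapter 1 of" would be changed wrongly, so every substitution still needs a human eye.

```python
import re

def fix_numeral_ones(text):
    """Guess at OCR'd 1s that should be a quote mark or the pronoun I (illustrative only)."""
    # "1There" -> "'There": a 1 fused to the front of a capitalized word
    # was probably an opening quote.
    text = re.sub(r"(?<!\w)1(?=[A-Z][a-z])", "'", text)
    # " 1 " -> " I ": a 1 standing alone between spaces was probably the pronoun.
    text = re.sub(r"(?<=\s)1(?=\s)", "I", text)
    return text

print(fix_numeral_ones("1There he stood. And 1 took 12 sheep, the best 1 had"))
```

Note that the lower-case "l" in "but l drew" is left alone here; in the walkthrough that one is caught by the first proofer.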
Stage Four—Corrections from First Proof
We receive the first proof back from the proofer, and find that it has been mostly corrected.
The corrections made were "l/I", "there after/thereafter", "prornised/promised", "bad/had", and "rarn/ram".
We have also wrapped the lines, at 60 characters in this case, though it is commonly as much as 70 characters per line. Sentences which look wrong, but where it isn't clear what the right text should be, have been marked with asterisks (*).
'There Periniedes and Eurylochus held the victims, but I drew my sharp sword from my thigh, and dug a pit, as it were a cubit in length and breadth, and about it poured a drink-offering to all the dead, first with mead and thereafter with sweet wine, and for the third time with water. And I sprinkled white incal * thereon, and entreated with many prayers the strengthless beads of the dead, and promised that on my return to Ithaea I would offer in my halls a barren heifer, * Teiresias alone sacrifice a black ram without spot, the fairest of my flock. But when I had besought the tribes of the dead with vows and prayers, I took the sheep and cut their throats over the trench, and the dark blood flowed forth, and lo, the spirits of the dead that he departed gathered them from out of Erebus.
We look up the text where the first proofer has asterisked it, and make the corrections.
The text is now ready for second proofing. An e-text is ready for second proofing when you can skim through the text without noticing that there are errors.
We can either do a second proof ourselves, or send it out for second proofing.
Second proofing involves a very careful reading of the text, looking for small errors. In some ways, it's much harder than first proofing, since it's very easy to let your eyes run on auto-pilot and in doing so, miss subtle errors.
Having performed the second proof, which caught errors like "beads/heads", "Ithaea/Ithaca", "Periniedes/Perimedes" and "he/be", we now have our final e-text.
'There Perimedes and Eurylochus held the victims, but I drew my sharp sword from my thigh, and dug a pit, as it were a cubit in length and breadth, and about it poured a drink-offering to all the dead, first with mead and thereafter with sweet wine, and for the third time with water. And I sprinkled white meal thereon, and entreated with many prayers the strengthless heads of the dead, and promised that on my return to Ithaca I would offer in my halls a barren heifer, the best I had, and fill the pyre with treasure, and apart unto Teiresias alone sacrifice a black ram without spot, the fairest of my flock. But when I had besought the tribes of the dead with vows and prayers, I took the sheep and cut their throats over the trench, and the dark blood flowed forth, and lo, the spirits of the dead that be departed gathered them from out of Erebus.
Hooray! At long last we have an e-text to post, which can be downloaded, read and enjoyed by anyone in the world from now on.
About Net searching:
V.62. I've found an eligible text elsewhere on the Net, but it's not in the PG archives. Can I just submit it to PG?
You can submit it, but you can't "just" submit it.
We wish we could give a permanent home to all the etexts that people have produced and placed on the Net, but without proof of their public domain [C.10] status, we can't.
We need to be able to prove that the eBooks we publish are in the public domain, so, in order to use one of the many texts that are just floating around the Net, you need to find a matching paper edition that we can prove is eligible [V.18].
(By the way, please be sure that it isn't already in the PG archive. A lot of texts circulating on the Net originated at PG, and people quite often submit them back to us.)
Before you get into this, you should check whether the text you have found is likely to be in the public domain in the U.S. A quick way to verify this is to hit the Library of Congress Catalog site at <> and search for the title or author. If you find no publications before 1923, then you should probably move on; the Library of Congress doesn't list every book, and in particular doesn't list all books published outside the U.S., but, if there isn't a pre-1923 copy there, it may be difficult to follow up on. If you're not dissuaded, do a search on the Net for used book shops that might have pre-1923 copies.
Sometimes, with a text on the Net, you know who typed it; it's on someone's website, or the transcriber is named in the text. Sometimes, the text has just been floating around Usenet or old gopher sites for years, with no attribution.
The first thing to remember is that we would like to give credit to the original transcriber if they want it, and if we can identify them.
The next thing to consider is that the original transcriber may well have an eligible copy of the book, and may be able to provide TP&V [V.25] for it.
So, if you can locate the original transcriber, it makes sense to e-mail them, explain what you propose to do, and ask them whether they can help with copyright clearance and whether they would like to be credited in the PG edition. Often, you will get no response, or a response but no prospect of material that will help with clearance, but sometimes you will get lucky.
If the transcriber can't help with TP&V, it's up to you to find a matching paper edition of the same book. This may not be as hard as it sounds. Libraries can help, and may get editions for you on interlibrary loan.
This is an ideal way for students, academics and librarians to contribute texts to PG, since you probably have access to a good library with stocks of old books to find matching paper editions.
If you find a matching paper edition, you then need to compare the etext you found with the book. Legally, what we're trying to prove here is that we have done "due diligence": that we have done our best to prove that the etext is indeed a copy of a public domain work.
The minimum "due diligence" we can perform is to compare the first and last pages of each chapter (or every 20 pages where the book is not neatly divided into chapters of about that size). You should list all of the differences between the book and the etext that you find on those pages. It is to be expected that there will be some minor differences of punctuation, spacing and spelling, and even perhaps of wording. Minor differences are OK, but we do need to list them, to prove that we did the comparison. When you have your lists, you can send in the TP&V as normal, accompanied by your lists, for clearance.
Many texts floating round without attribution, and indeed many with attribution, could do with a thorough checking, and another option you have is "comparative retyping", where you go through the whole etext, proofing it carefully against the cleared paper book, and changing everything that is different in the etext to match the paper edition. If you do this, you don't need to produce a list of differences, since there won't be any by the time you've finished; you can just submit it as a normal text, and it may well be a lot cleaner! However, if you do take this path, please do a very thorough job on the proofing and comparison.
If the etext you find has been marked up, in HTML for example, you should remove all HTML for the PG edition, because, even though the text itself has been proved to be in the public domain, the original transcribers may hold copyright on the HTML markup, even if you can't find them. If you do want to make an HTML edition of it for PG, strip out all of the original markup and then re-add your own markup.
If you do find the producer and he or she wants to be identified, youmay submit a double credits line like:
Transcribed by Sally Wright <>
Produced for PG by You <>
V.63. I've found an eligible text elsewhere on the Net, but it's not in the PG archives. Why should I submit it to PG?
The first reason is file safety.
Yes, we accept that the file is already available to everyone today, but it may not be safe in the long term. We've seen college students who put books on their personal site, and then lose that site when they graduate. We've seen individuals who transcribe several books, and later lose interest, or move, or die, and the work they've done is lost. We've seen small projects with a few volunteers who produce and post books for a few years, but then break up or run out of funds to maintain their site. We've seen large institutions drop their collections as part of a cost-cutting exercise. We've even seen organizations lock public domain works up behind licenses, requiring users to commit to registration and a "no copying" agreement before downloading them.
Whenever a set of etexts is published and distributed by only one person or organization, there is a danger that their etexts will disappear from the Net sometime. We want all etexts to be spread as widely as possible, copied as much as possible, so that no one event or loss, or whim of a sponsor, can obliterate them.
We think that the PG collection is, for that reason, the safest place to put a text for its long-term survival. There are copies of the PG archives all over the world, on public servers and private CDs. PG publications are widely converted, collected and read on PDAs. Other text projects copy works from PG.
The PG archive is so valuable, yet free and easily portable, that even if every current PG volunteer vanished overnight, people around the world would copy and preserve it. Even if PG itself decided to withdraw all our texts, we couldn't do it, because so many people have made copies.
The second reason is legal safety.
Unlike some other projects and individual efforts, PG retains documentary proof of the public domain status of its texts. This is more valuable than it might appear at first glance.
Publishers often claim a new copyright [C.17] on works that they republish, and as time goes on, it becomes harder and harder to prove that a particular book is in the public domain. Walk into your local bookstore and check out how many works by Shakespeare, Poe, Dickens, and Twain have copyright notices on them! People who want to translate these, or create derivative works like screenplays or lyrics or films must first prove that they are basing their work on a public domain edition, but the creeping copyright practices of commercial publishers make that difficult.
Here's a practical example: we were approached by a film student who wanted to make a short piece based on characters from James Joyce's "Ulysses". But before he could do that, he needed to confirm that the material on which he was basing his movie was in the public domain, and all the editions he could find were copyrighted. However, because PG had already established the public domain status of Ulysses, we could point him to our established PD version, and even tell him where to find a paper copy published in 1922. Without that evidence, he could not have made his project.
V.64. I have already scanned or typed a book; it's on my web site. How can I get it included in the Gutenberg archives?
Great! We get these a lot, but it's always nice to see another!
You need to send us the TP&V [V.25] so that we can prove that your edition is in the public domain. If you don't have the TP&V, you will need to find a matching paper book with eligible TP&V for us to be able to use it.
V.65. I have already scanned or typed a book; it's on my web site. The world can already access it. Why should I add it to the Gutenberg archives?
The Project Gutenberg archives are widely copied and searched, and much safer and more permanent than any individual website can possibly be. We aim to keep this collection together over not just years, but centuries. You took the trouble to transcribe this book. We can relate; that's what we do, as well. We know you want this work to survive you and your ISP, and we believe we can do that. And it's not as if you have to take it off your website when we make a copy; you're just using your candle to light another!
If you want to let readers know that your site has other related material, you can put that information in the Credits Line [V.47]. Taking a real-world example, you could ask us to add this to the Credits line for a C. M. Yonge text:
A web page for Charlotte M. Yonge will be found at
V.66. I have already scanned or typed a book, but it's not in plain text format. Can I submit it to PG?
Yes, of course. We'll be happy to discuss format options with you, and we're quite experienced in converting between multiple formats and deciding which formats work best and will have the longest life. All you need is to get us a copy of your TP&V [V.25].
About author-submitted eBooks:
V.67. I've written a book. Will PG publish it?
PG gets submissions from young people, for example, who just want to get a story they wrote published in PG. We wish them well with their writing, but that's not really why we're here.
If you are a published author, or perhaps an academic who wants to put a textbook into the archives, it's quite likely that we will publish it.
V.68. I have translated a classic book from one language to another. Will PG publish my translation?
Yes, if we can.
The book that you translated needs to be in the public domain, and we will need the same proof of eligibility that we would use if you were contributing the book in its original language.
For example, if you were translating Hesse's Siddhartha (published pre-1923 in German, but no pre-1923 English translation available), we would need to copyright clear [V.25] the original German edition from which you worked; it needs to be a pre-1923 or otherwise public domain edition. (We actually did this one, thanks to the hard work and scholarship of some volunteers.)
V.69. OK, this is one of the cases where PG will publish it. What do I do next?
You need to decide about copyright issues. Do you want to release your work to the public domain, or do you want to retain copyright? If you want to retain copyright, what terms do you want to release it under? The next few questions deal with those issues.
Having decided that you want PG to publish it, and decided what restrictions (if any) you want to place on further distribution, you just need to write the appropriate letter and send the text to us. [V.46]
V.70. I hold the copyright on a book. Can I release it to the public domain?
You can. All you need to do is put a statement into the released version of the text saying that you have.
If you want to release it into the public domain and distribute it through Project Gutenberg, you should send us a letter to that effect.
To: Michael S. Hart
Founder, Project Gutenberg
405 West Elm Street
Urbana IL, 61801-3231, USA
Dear Project Gutenberg:
I am the sole copyright holder for the book, "Wallaby Happiness." It gives me pleasure to release this work into the public domain, and I invite Project Gutenberg to publish this public domain edition.
Gregory B. Newby
Once you have released it into the public domain, neither we nor anyone else needs your permission to publish it, but for us to be sure that it is a public domain version, we do need a signed letter.
V.71. I hold the copyright on a book. Do I have to release the book into the public domain for Project Gutenberg to publish it?
Absolutely not! For example, many contributors of copyrighted material want to share it with the world, but do not want it commercially republished by other companies.
You can grant Project Gutenberg perpetual, non-exclusive, world-wide rights to distribute your book on a royalty-free basis by sending a letter to Michael Hart. Your letter may be brief, but must be signed, and must include the name of the book and the assertion that you are the copyright holder or the agent for the copyright holder.
If you want some related information, like a link to your website, included in the text, we will be happy to oblige.
Once we have posted a text, many people will copy it. We have no effective mechanism for "recalling" texts that we have posted, so please be sure, before you commit to this, that you intend to follow through with it, because there is no way to change your mind later.
Here is a sample letter, including the address to send it to:
To: Michael S. Hart
Founder, Project Gutenberg
405 West Elm Street
Urbana IL, 61801-3231, USA
Dear Project Gutenberg:
I am the sole copyright holder for the book, "Wallaby Happiness." It gives me pleasure to grant Project Gutenberg perpetual, worldwide, non-exclusive rights to distribute this book in electronic form through Project Gutenberg Web sites, CDs or other current and future formats. No royalties are due for these rights.
Gregory B. Newby
V.72. I hold the copyright on a book, and would like Project Gutenberg to publish it. Can I choose what rights to assign?
For PG to be in a position to copy it, we do need perpetual, worldwide, non-exclusive, royalty-free rights to distribute the book in electronic form. What rights you choose to assign to readers after that is a decision for you to make.
The Creative Commons site <> may give you some ideas of what practical use you can make of your copyright to see that the work is used in the ways you intended.
About what goes into the texts:
V.73. Why does PG format texts the way it does?
PG texts are formatted as plain ASCII, with 60-70 characters per line, with a hard return [CR/LF] at end of line, and some people ask "Why do it this way? You could omit the hard returns and let the reader's word processor or Reader software wrap the lines." "You could use '8-bit' accented characters for non-English characters." "You could use ' - ' instead of '--' for an em-dash." And so on, through a different choice we could make for every formatting feature. And the answer, of course, is that we could do it differently, and sometimes we do, but mostly we keep to one consistent style.
We'll be discussing each of the formatting decisions below, not only giving the summary PG answer, but also discussing the plusses and minuses of each, and the possible options.
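As a concrete picture of that layout, here is a minimal Python sketch that rewraps one paragraph to a maximum line length and ends every line with CR/LF. It uses the standard textwrap module; the function name pg_wrap and the default width of 70 are assumptions for illustration, and text that should not be rewrapped (poetry, tables, indented quotations) would need separate handling.

```python
import textwrap

def pg_wrap(paragraph, width=70):
    """Rewrap one paragraph to at most `width` characters per line, with CR/LF line ends."""
    # Collapse existing line breaks and runs of spaces before rewrapping.
    flat = " ".join(paragraph.split())
    lines = textwrap.wrap(flat, width=width)
    return "\r\n".join(lines) + "\r\n"

sample = ("PG texts are formatted as plain ASCII, with 60-70 characters per line, "
          "with a hard return at end of line.")
print(pg_wrap(sample, width=60))
```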
Like any question beginning "Why does/doesn't PG . . . ?", the answer is "Because that's what the volunteers and readers want!". These conventions have been worked out over the years, largely by Michael Hart, our founder and chief volunteer, in conjunction with all of us volunteers, as the result of feedback from readers.
We are guided throughout by the principle that we want to produce texts in the simplest format that will adequately express the content. Quoting Michael Hart (1994):
Etext as developed and distributed by Project Gutenberg since 1971 was
never intended to be a copy of a paper or a parchment [remember, first
Project Gutenberg Etext was typed in from parchment replicas of the US
Declaration of Independence].
The major purposes of Project Gutenberg have always been:
1. to encourage the creation and distribution of electronic texts for the general audience.
2. to provide these Etexts in a manner available to everyone in terms
of price and accessibility [i.e. no special hardware or software],
and no price tag attached to the Etexts themselves.
3. to make the Etexts as readily usable as possible, with no forms or
other paperwork required, and as easily readable to the human eyes
as to computer programs, and in fact, more readable than paper.
There is sometimes a conflict between "simplest format" and "adequately express the content"; further, different people have different views on what is "simple" or "adequate". You, the producer of the text, have spent the time and effort to make the eBook available to the world, you have thought more about it than anyone else, and we respect your informed judgment. However, please make sure that your judgment has been informed, by studying the precedents and reasons behind our guidelines.
Where a simple, standard PG-ASCII layout does not, in your view, "adequately express the content", you should think of making your text in another open format, perhaps HTML or XML or TeX, that allows you to use more characters, more formatting options, and images. We are always happy to accept these kinds of files. In these cases, you should also provide a standard PG-ASCII version, even if you feel it is unacceptably degraded, for those who cannot use your preferred format.
Just ten years ago, presentation as plain ASCII was not only a universal standard, it was effectively the only way that most people could view the books. The first version of the HTML specification had been drafted, but was unknown among the general public. XML did not exist. SGML was (as it still is) the province of specialists. Specialized eBook readers and PDAs had not yet appeared.
In 2002, plain vanilla ASCII is still readable everywhere, but people also want to convert our texts into other formats for more convenient loading on readers and web sites. We therefore have to keep in mind that our works will be processed by automatic conversion programs, none of which is perfect, and we have evolved some "defensive formatting" practices, which, while retaining the universality of plain text, also supply clues to automatic converters about how they should treat the layout. These do help to keep converters from making at least the worst mistakes. The most significant "defensive formatting" practices are indenting unwrappable text like quotations, and using underscores rather than CAPITALS for italics. Different volunteers have different priorities: at one extreme, some people want to make the best plain text they can, giving no weight to conversion issues; at the other, some people emphasize the cues that will allow automatic reformatters to convert the texts well, even if that causes some ugliness in the plain text. Most of us operate somewhere between, making the choices we feel are best depending on the context. Getting a text on-line is the important thing; which choices you make in doing so is a matter of detail.
About the characters you use:
V.74. What characters can I use?
a) You should use plain ASCII for straight English texts.
b) When producing a text partly or completely in a language that requires accents, you should use the appropriate ISO-8859 character set for the language, and specify which you are using, and also provide a 7-bit plain ASCII version with the accents stripped.
c) When producing a text in a language that doesn't use one of the ISO-8859 character sets, you should use the encoding most commonly used for that language. [e.g. Chinese—Big 5]
d) When producing a text containing more characters than can be found in any one of the ISO-8859 character sets, you should use Unicode.
You should use plain ASCII wherever possible; that is, the letters and numbers and punctuation available on a standard U.S. keyboard, without accented letters. The immediate and major exception to this is when you are typing a text written in a language like French or German that requires accents.
There is a problem with using non-ASCII characters. They do not display consistently on all computers; in fact, they do not even display consistently on the same computer! On my computer, for example, what looks like an e-acute in this editor just shows as a black box in another editor, or even using a different font in the same editor. And this is by no means confined to some theoretical minority; we have to deal with it all the time when posting texts.
Further, standards are changing: ten years ago, the character set Codepage 850 [MS-DOS] was very common; now it's rare except in some texts that have survived those ten years.
We want to preserve these texts over centuries, not just decades, and at the moment there is no single clear standard that we can use across all texts. Unicode may perhaps be a future standard, but, right now, it's not something that people use every day, and it's not supported by a lot of common software.
ASCII, while limited, is supported by almost all computers everywhere, so we make a point of always supplying an ASCII version where possible, even if the ASCII version is degraded when compared to the 8-bit original. When we get a text in, say, German, we post two versions of it: one with accents and one without.
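For anyone producing the "accents stripped" 7-bit version today, the following Python sketch shows one possible approach using the standard unicodedata module. It is not a PG tool; the function names strip_accents and to_ascii_version are assumptions. Letters with no unaccented decomposition (the German sharp s, for example) come out as "?" here, so those still need a deliberate substitution by hand.

```python
import unicodedata

def strip_accents(text):
    """Remove combining accents, leaving the base letters."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def to_ascii_version(text):
    """Return a 7-bit version; anything still outside ASCII becomes '?'."""
    return strip_accents(text).encode("ascii", errors="replace").decode("ascii")

# "\u00e9" is e-acute and "\u00e0" is a-grave, written as escapes to keep this listing ASCII.
print(to_ascii_version("caf\u00e9 d\u00e9j\u00e0"))
```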
V.75. What is ASCII?
Don't get scared by the computer jargon; ASCII (pronounced ASS-key) is just a name for the set of unaccented letters, numbers and other symbols on a standard U.S. keyboard.
ASCII (American Standard Code for Information Interchange) is a set of common characters, including just about everything that you can type in on an English-language keyboard. It includes the letters A-Z, a-z, space, numbers, punctuation and some basic symbols. Every character in this document is an ASCII character, and each character is identified with a number from 0 through 127 internally in the computer.
Just about every computer in the world can show ASCII characters correctly, which makes it ideal for PG's purpose of providing texts that can be read by anyone, anywhere, but ASCII does not include accented characters, Greek letters, Arabic script and other non-English characters, which causes some problems when we produce texts that need non-ASCII characters.
V.76. So what is ISO-8859? What is Codepage 437? What is Codepage 1252? What is MacRoman?
Today's computers mostly work on the basis of dealing with one "byte" at a time. A byte is a unit of storage that can contain any number from 0 through 255 (256 values in all). It's very convenient for computers to associate one character with each of these numbers, so that we can have up to 256 "letters" viewable from the values stored in one byte. The first 128 values, zero through 127, are defined by ASCII; so, for example, in ASCII, the number 65 represents a capital "A", 97 represents a lowercase "a", 49 stands for the digit "1", 45 for the hyphen "-", and so on.
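The character-to-number pairs quoted above are easy to confirm for yourself. This short Python sketch uses the built-in ord function; the helper name ascii_codes is an assumption for illustration.

```python
def ascii_codes(chars):
    """Map each character to the number that represents it."""
    return {ch: ord(ch) for ch in chars}

codes = ascii_codes("Aa1-")
print(codes)

# Every character of a plain ASCII string has a code below 128.
print(all(ord(ch) < 128 for ch in "Project Gutenberg"))
```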
ASCII doesn't define characters for the values 128 through 255, and in early days computer manufacturers used these values to hold non-ASCII characters like accented letters and box-drawing lines. Of course, 128 wasn't nearly enough values to hold all of the characters that people needed to use for different languages, so they made the character sets switchable, so that a PC in France could use a different set of accented letters from a PC in Poland. Microsoft's version of this was called Codepages. Each Codepage held a different set of non-ASCII characters. Codepage 437, and later Codepage 850, were commonly used for English and some major Western European languages on MS-DOS.
MacRoman was Apple's first codepage, containing most of the accented letters in Latin-derived languages, and MacRoman is still in common use on Apple Macs today.
Later, the International Standards Organization ISO got around to looking at the problem, and defined ISO-8859-1, ISO-8859-2 and so on, as the standards for different language groups. These sets all define the characters 160 through 255 as accented letters and other symbols, and define the 32 characters from 128 through 159 as control characters.
Since Microsoft Windows has no use for the control characters 128 through 159, Windows fonts commonly use Codepage 1252, which has ASCII in the first 128 characters, ISO-8859-1 in characters 160 through 255, and other symbols in the characters 128 through 159. Just to make an already chaotic system worse, all characters can be defined differently in different fonts!
Of course, most of these codepages are incompatible with each other.For example, the byte value 232 shows as a lower-case "e" with a graveaccent in ISO-8859-1 and CP1252, a capital letter "E" with diaeresisin MacRoman, a Latin capital letter "Thorn" in CP850, a Cyrilliclower-case "Sha" in ISO-8859-5, a Greek capital letter "Phi" in CP437,and so on. So if you view a text intended for one of these charactersets with a program that assumes a different character set, you seegibberish.
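The byte-232 example can be reproduced with a short Python sketch. The codec names are Python's own spellings of these character sets ("mac_roman" for MacRoman, "iso8859_5" for ISO-8859-5); the function name describe_byte is invented for this illustration.

```python
import unicodedata

# Python's codec names for the character sets discussed above.
CODECS = ["latin-1", "cp1252", "mac_roman", "cp850", "iso8859_5", "cp437"]

def describe_byte(value, codecs=CODECS):
    """Return {codec: Unicode name of the character that byte decodes to}."""
    raw = bytes([value])
    return {codec: unicodedata.name(raw.decode(codec)) for codec in codecs}

names = describe_byte(232)
for codec, name in names.items():
    print(f"{codec:10} {name}")
```

One byte, six codecs, five different characters: that is the "gibberish" problem in miniature.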
The good news, for mostly-English texts at least, is that ISO-8859-1, Codepage 1252 and Unicode agree on the numerical values of the accented characters and symbols to be represented by the values 160 through 255. And everybody accepts ASCII: a pure ASCII file is valid ISO-8859-anything, valid Codepage-anything, and valid Unicode UTF-8.
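Both claims are easy to test. The sketch below, again in Python and with an invented helper name, decodes the same ASCII bytes under several codecs and then checks whether a codec maps each byte from 160 through 255 to the Unicode character with the same number.

```python
# Pure ASCII bytes decode to the same text under every one of these codecs.
ascii_bytes = b"Project Gutenberg"
decoded = {codec: ascii_bytes.decode(codec)
           for codec in ["ascii", "latin-1", "cp1252", "utf-8"]}

def agrees_with_unicode(codec):
    """True if bytes 160-255 decode to the same-numbered Unicode code points."""
    return all(bytes([n]).decode(codec) == chr(n) for n in range(160, 256))

print(len(set(decoded.values())) == 1)
print(agrees_with_unicode("latin-1"), agrees_with_unicode("cp1252"))
print(agrees_with_unicode("cp850"))
```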
For more detail about the mappings between Unicode and other formats, you can view Unicode<-->ISO-8859 mappings at <>, Unicode<-->Windows mappings at <> and Unicode<-->Apple mappings at <>.
If you're not confused enough by now, please read the excellent guide to the whole "alphabet soup" problem at <>.
V.77. What is Unicode?
Recognizing that no single set of 256 characters can hold all of the symbols necessary for true multi-lingual texts, ISO 10646 was created. This defined the Universal Character Set (UCS) using 31 bits, which has the potential for a staggering 2 billion characters.
The Unicode Consortium is a group of computer industry companies who agree on the Unicode standard. Unicode accepts the ISO 10646 standards, and adds some restrictions and implementation processes. It plans for a modest million or so characters; however, this is enough for all living and extinct languages, and imaginable future ones too.
Using 4 bytes for each character is wasteful, though, when most characters need only one or two, and there are programming problems with implementing 4-byte characters, so Unicode provides Transformation Formats (UTF) which allow the characters to be encoded using fewer bytes where possible. UTF-8 and UTF-16 are common.
UTF-8, which is the most practical of these from the PG point of view, allows ASCII to be encoded normally, and usually uses two or three bytes for other non-ASCII characters.
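To see the "one byte for ASCII, two or three for most others" rule in action, here is a small Python sketch. The sample characters are my own choice for illustration.

```python
def utf8_length(ch):
    """Number of bytes UTF-8 needs for one character."""
    return len(ch.encode("utf-8"))

# Sample characters, written as escapes so this listing stays plain ASCII.
samples = {
    "A": "Latin capital A (plain ASCII)",
    "\u00e9": "Latin small e with acute accent",
    "\u03b4": "Greek small delta",
    "\u0905": "Devanagari letter A (used for Sanskrit)",
    "\u1681": "Ogham letter beith",
}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X}  {utf8_length(ch)} byte(s)  {label}")
```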
Because of the extra work needed to support this extra space, and the fact that most people work mostly in one or maybe two languages, Unicode is being adopted only slowly, and most computer programs in 2002 do not fully support it. But when you need to mix Arabic, Greek, Ogham and Sanskrit in one text, it's the only possible answer!
For more about this, go straight to the source at <>.
V.78. What is Big-5?
Big 5 is an encoding of a set of 13,000+ traditional Chinese characters.
V.79. What are "8-bit" and "7-bit" texts?
For practical purposes, 7-bit texts are plain ASCII; 8-bit texts have accented letters.
This comes from computer jargon. You can represent the 128 characters of ASCII using 7 bits (binary digits), but to represent the 256 characters needed for the various codepages and ISO-8859 standards, like accented letters, you need 8 bits. Hence, we call a text that uses non-ASCII characters in a character set like Codepage 850 or ISO-8859-1 an "8-bit" text.
When we post a text as both 8-bit and 7-bit, as we do when ASCII is not enough to render the text acceptably, we name the file with an "8" or a "7" at the start. So, for example, Crime and Punishment by Dostoevsky is named 8crmp10 for the 8-bit version with accents, and 7crmp10 for the 7-bit version without accents.
See also FAQ [R.35]: "What do the filenames of the texts mean?"
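If you are not sure whether a file you have is 7-bit or 8-bit, a check along the following lines will tell you. This is only a sketch in Python; the function name is invented here, and it is not a PG tool.

```python
def is_seven_bit(data: bytes) -> bool:
    """True if every byte is in the ASCII range 0 through 127."""
    return all(b < 128 for b in data)

plain = b"cafe"
accented = "caf\u00e9".encode("latin-1")  # "cafe" with an acute accent, ISO-8859-1

print(is_seven_bit(plain))     # plain ASCII, so a 7-bit text
print(is_seven_bit(accented))  # contains byte 233, so an 8-bit text
```

For a real file you would read it in binary mode and pass the bytes to the same function.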
V.80. I have an English text with some quotations from a language that needs accents—what should I do about the accents?
If stripping the accents would unacceptably degrade the book, then submit two versions, one "8-bit" with the accents included and one "7-bit" plain ASCII, and we will post both.
This is a hard choice. What constitutes "unacceptable degradation"?
Clearly this is a decision that all of us in PG have to make. It's a very common problem, and different people have different views. For that matter, different print publishers have different views; you will see the words "debris", "facade" and "cafe" printed with and without accents in different books, and even in different editions of the same book.
We don't want to post two versions when we don't have to. It doubles the posting work, doubles the disk space needed, potentially confuses downloaders, doubles the maintenance when we need to correct the text. On the other hand, we don't want to degrade the text.
There is no clear line, no definitive answer to what level of degradation is acceptable. Most producers feel that there is no point in making a separate version when dealing only with a few foreign words thrown in among the English, but when, for example, some significant dialog between the characters is in French or Spanish, it's harder to say that stripping the accents is acceptable. You, the producer, need to decide this on a case-by-case basis. If you're not sure, discuss it with one of the Directors of Production or one of the Posting Team.
If you have made the text with accents, you can choose to make your own 7-bit version and send it to us, or just send the 8-bit version and we'll make the 7-bit version from it. Some people prefer to make their own 7-bit editions; some don't. Whether you use a Microsoft Codepage, one of the ISO standards or MacRoman doesn't matter; we can convert any of them for you.
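If you do want to make your own 7-bit edition, one possible approach is sketched below in Python. This is an assumption about how you might do it, not a description of the conversion the Posting Team actually uses, and the function name strip_accents is made up. It handles ordinary accented letters; characters with no unaccented base letter come out as "?" and need attention by hand.

```python
import unicodedata

def strip_accents(text):
    """Reduce accented letters to their base letters and return ASCII-only text."""
    # NFKD splits an accented letter into base letter + combining accent mark.
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Anything still outside ASCII becomes "?" so that it is easy to find.
    return no_marks.encode("ascii", "replace").decode("ascii")

print(strip_accents("d\u00e9bris, fa\u00e7ade, caf\u00e9"))
```

Always compare the result against the 8-bit text afterwards; an automatic pass like this cannot judge whether the degradation is acceptable.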
V.81. I have some Greek quotations in my book. How can I handle them?
There is no way to show Greek letters in ASCII. You have three options:
You can just replace the Greek words with [Greek] to indicate to the reader that you have omitted it.
You can "transliterate" the Greek to ASCII. Greek letters do have a correspondence to plain "Latin" letters; for example, the Greek letter "delta" can be represented by the letter "d". There is a simple PG guide to transliteration at <>. This practice has had a long and honorable history: words like "amphora" and "hubris", for example, are straight transliteration from the Greek. This is usually the best option.
If there is enough Greek to warrant it, and no other accented characters, you may be able to use the ISO-8859-7 character set, and submit both 7-bit and 8-bit versions [V.79]. ISO-8859-7 is for modern rather than classical Greek, but, if necessary, you will surely be able to express the Greek fully in Unicode. However accurate your Greek, that still leaves the issue of what to do with the 7-bit ASCII version, where transliteration is probably still your best bet.
V.82. I want to produce a book in a language like Spanish or French with accented characters. What should I do?
Use the appropriate ISO-8859 Character set [V.76] for your 8-bit version.
About the formatting of a text file:
This section of the FAQ goes into great detail about all kinds of formatting questions. However, looked at from a higher level, the only real issue is that we want to render texts clearly, with formatting that reflects the original, so that readers of the plain text format can read them easily, and people converting them to other formats can do so reliably. When you come across a case that is not covered by the detailed guidelines below, keep this ultimate aim in mind, and make the best decision you can. Don't get hung up for hours or days over a question of formatting. If you want advice, look at how other people have handled the same situation in previous texts, or ask other volunteers for their ideas.
V.83. How long should I make my lines of text?
For normal prose, such as you find in a novel, your lines should mostly be 60 to 70 characters long, not shorter than 55, not longer than 75 except where it can't be helped. Never, ever longer than 80, except where you're trying to render a non-text structure, like a family tree.
For poetry, make the text look as much like the book as possible. This also applies to some plays where the lines are clearly intended to be broken at specific points, whether blank verse or not.
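A quick way to find over-long lines in a prose text is sketched below in Python. The function name and the two severity labels are invented for this example; the 75 and 80 limits come from the answer above. Short lines are not flagged, since the last line of any paragraph is legitimately short.

```python
def check_line_lengths(text, soft_max=75, hard_max=80):
    """Return (line_number, length, severity) for each over-long line."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        length = len(line)
        if length > hard_max:
            problems.append((number, length, "over 80"))
        elif length > soft_max:
            problems.append((number, length, "over 75"))
    return problems

sample = "x" * 70 + "\n" + "y" * 78 + "\n" + "z" * 85
for number, length, severity in check_line_lengths(sample):
    print(f"line {number}: {length} characters ({severity})")
```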
V.84. Why should I break lines at all? Why not make the text as one line per paragraph, and let the reader wrap it?
We could either use 70-character lines and let readers unwrap them if they want to, or use infinite-length lines and let readers wrap them if they want to. We choose to wrap the lines so that they are readable on even the simplest of text editors and viewers.
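If your text has ended up with one long line per paragraph, it can be wrapped back to PG-style lines. The following is a minimal sketch using Python's standard textwrap module; the function name wrap_paragraphs is invented, and it assumes paragraphs are separated by one blank line. It is meant for straight prose only, since it would wreck poetry and tables.

```python
import textwrap

def wrap_paragraphs(text, width=70):
    """Wrap each blank-line-separated paragraph to at most `width` characters."""
    paragraphs = text.split("\n\n")
    wrapped = [textwrap.fill(" ".join(p.split()), width=width)
               for p in paragraphs]
    return "\n\n".join(wrapped)

long_text = ("word " * 40).strip() + "\n\n" + "Second paragraph."
print(wrap_paragraphs(long_text))
```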
V.85. Why use a CR/LF at end of line?
CR/LF can lead to double-spacing, notably on Mac and Unix, but at least there is a CR in there for Mac users, and there is an LF for *nix users.
If you don't know or care what this is about, please skip blithely on.
There are three differing standards for how to represent the end of a line of text. In brief, Apple Macs use the CR character. Unix and its variants use the LF character. Microsoft systems, from MS-DOS through Windows, use both together.
If you want the history behind these:
CR stands for Carriage Return, and comes from the old typewriter / teletype idea of a command to move the print head from the right of the page back to the left when it reaches the end;
LF stands for Line Feed, and comes from the old typewriter / teletype idea of a command to move the print head down a line;
CR/LF together indicate moving down a line and back to the left of the page.
The history is not relevant to today's computers in principle, but in practice they all use one of these legacy conventions, and there's nothing we can do about it but pick one.
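Converting a file from any of the three conventions to CR/LF is straightforward. Here is a sketch in Python that works on raw bytes; the function name to_crlf is made up for this illustration.

```python
def to_crlf(data: bytes) -> bytes:
    """Normalize CR, LF, or CR/LF line ends so that all become CR/LF."""
    # First reduce everything to bare LF, so no line end is converted twice.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", b"\r\n")

mixed = b"Mac line\rUnix line\nDOS line\r\nlast line"
print(to_crlf(mixed))
```

The two-step approach matters: replacing every LF with CR/LF directly would turn an existing CR/LF into CR/CR/LF.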
V.86. One space or two at the end of a sentence?
Whichever you prefer, but if using two spaces, please use them only at the end of a sentence, not after abbreviations like "Dr." and "per cent.", and not after non-sentence-ending punctuation like the question-mark in the sentence: "Must you go? when the night is yet so black!"
Many people have strong views on either side of the "one space or two?" question, and we're not about to try and argue with them. Use whichever is most natural for you.
However, if using two, you take responsibility for deciding where the sentence ends. You can't just place two spaces after every period, question-mark and exclamation mark, since periods are also used for abbreviations and ellipses, and question-marks and exclamation-marks don't always end sentences.
V.87. How do I indicate paragraphs?
Just leave a blank line before each paragraph.
V.88. Should I indent the start of every paragraph?
Printers do this when publishing paper books because they do not leave blank lines in the text, but there is no need for indenting in our eBooks.
V.89. Are there any places where I should indent text?
Yes. You should always make poetry look like the original, and that may mean indenting some lines, for example:
  I was a child and she was a child,
    In a kingdom by the sea;
  But we loved with a love that was more than love--
    I and my Annabel Lee;
Even when poetry doesn't have indented lines, it is a good idea to indent quotations embedded in prose. Remember, others will be converting your text later (to HTML, to PDA reader formats, to formats that don't even exist yet) and much of this conversion will be done automatically, by computer programs. It is very hard for a program to know when it can and can't re-wrap lines to fit a screen size unless it has a clear signal that this line should not be wrapped. This is one of the biggest problems with auto-converting PG texts.
Just about all formatting programs "know" that lines that are indented shouldn't be wrapped, so by indenting lines just a space or two, you can prevent
  I think that I shall never see
  A poem lovely as a tree.
from turning into
I think that I shall never see A poem lovely as a tree.
in some future reader's eBook.
You don't really need to do this in texts where the whole book is poetry or blank verse, since these will probably be recognized as whole books that shouldn't be rewrapped, but when there are a few lines of quotation amid an acre of straight prose, a few spaces will be a life-saver. Even in the original plain text version, the extra spaces serve to set the quotation off from the main text.
You shouldn't get carried away and indent things 20 spaces for this reason, though. Anything up to four spaces is reasonable; more is excessive. If you're indenting many short verses in this way, keep your number of spaces for indentation consistent throughout the book.
There are some other times when you may judge it best to indent, where text is indented in the paper book, like newspaper headlines or pictures of handwritten notes.
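To show why the indent works as a signal, here is a toy re-wrapper in Python. It is only a sketch of the kind of logic a converter might use, not any particular program; the function name is invented. Unindented lines are treated as prose and re-wrapped, while indented and blank lines are passed through untouched.

```python
import textwrap

def rewrap_keeping_indents(text, width=40):
    """Re-wrap runs of unindented lines; leave indented and blank lines alone."""
    out, buffer = [], []

    def flush():
        # Emit the prose collected so far as one re-wrapped paragraph.
        if buffer:
            out.append(textwrap.fill(" ".join(buffer), width=width))
            buffer.clear()

    for line in text.splitlines():
        if line.startswith(" ") or not line.strip():
            flush()
            out.append(line)  # indented or blank: keep exactly as written
        else:
            buffer.append(line.strip())
    flush()
    return "\n".join(out)

sample = ("Some prose that\nis wrapped oddly\n"
          "  I think that I shall never see\n"
          "  A poem lovely as a tree.\n"
          "more prose")
print(rewrap_keeping_indents(sample))
```

Without the two leading spaces, the verse would be swept into the paragraph and run together.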
V.90. Can I use tabs (the TAB key) to indent?
The problem with tab characters is that they act differently in different applications. Typically a tab will move the text to the next tab stop, which might be four spaces on your PC, but 20, or none, on someone else's. The effects are unpredictable.
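You can imitate different viewers' tab settings with Python's built-in expandtabs method. This is just an illustration; the helper name tab_renderings and the chosen tab sizes are assumptions for the example.

```python
def tab_renderings(line, sizes=(0, 4, 8, 20)):
    """Show how one tab-indented line looks under different tab-stop widths."""
    return {size: line.expandtabs(size) for size in sizes}

line = "\tIn a kingdom by the sea;"
for size, shown in tab_renderings(line).items():
    print(f"tab stop {size:2}: {shown!r}")
```

The same file gives four different indents, which is why spaces are the safe choice.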
V.91. How should I treat dashes (hyphens) between words?
In typography, there are four standard types of dashes: the hyphen, the en-dash, the em-dash, and the three-em-dash.
Originally, printers called these the "em-dash" because it was the same width as the capital letter M in whichever font they were using, the "en-dash" because it was the same width as the capital letter N, and the "three-em-dash" because it was as long as three capital Ms.
The hyphen is used for hyphenated words, like "en-dash" itself, or "to-day" or "drawing-room". For this, you just press the single dash or hyphen key on your keyboard.
In typography, the en-dash is a little longer than the hyphen, and is typically used for duration, where you could substitute the word "to". For example, if you were printing "1830-1874", or "9:00-5:30", you would use an en-dash instead of a hyphen. The en-dash is also sometimes used as hyphenation between words that are already hyphenated, for example, "bed-room-sitting-room" might use an en-dash as its central dash to emphasize that it is a different type of separator from the plain hyphens before "room". However, there is no ASCII character for an en-dash, and we use the hyphen in these cases. (HTML and some character sets do provide separate entities for en-dash and em-dash.)
The em-dash is shown in print as a longer dash, and for PG purposes, you should render it as two hyphens with no spaces around them.
You use the em-dash as a kind of parenthesis--as I am doing here--or to indicate a break in thought or subject within a sentence. There is no ASCII equivalent of the em-dash; there is no key on your keyboard that you can press to get one. For PG texts, we represent the em-dash as two dashes with no space between or around them--like this.
The em-dash can also be used at the end of a sentence or speech to indicate that the speaker stopped or trailed off. For example:
"When I saw you with Emily, I thought you were-- I thought she was--"
In a case like this, there may be a space following the em-dash, and the context may demand that there should be a space following the em-dash, not because of the em-dash as such, but to make the break between the statements or sentences clear.
These two hyphens represent one character, so you should never break them at line end, with one hyphen at the end of the first line and the other at the start of the second. If you have an em-dash near line end, you can break the line either before or after the em-dash, but never in the middle.
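A split em-dash is easy to miss by eye, so a mechanical check can help. The Python sketch below, with an invented function name, reports any line that ends in a single hyphen where the following line begins with a single hyphen.

```python
def find_split_dashes(text):
    """Return 1-based numbers of lines whose trailing hyphen is half of an
    em-dash that continues at the start of the next line."""
    lines = text.splitlines()
    hits = []
    for i in range(len(lines) - 1):
        end = lines[i].rstrip()
        start = lines[i + 1].lstrip()
        ends_single = end.endswith("-") and not end.endswith("--")
        starts_single = start.startswith("-") and not start.startswith("--")
        if ends_single and starts_single:
            hits.append(i + 1)
    return hits

bad = "I thought you were-\n-I thought she was"
good = "I thought you were--\nI thought she was"
print(find_split_dashes(bad), find_split_dashes(good))
```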
The fourth type of dash, the three-em-dash, is used to represent a missing word, or an undetermined number of missing letters. You will often see it in a sentence like:
Dr. P------ was known for his honesty.
Dr. ------ was known for his honesty.
where there is a convention that the character's name has been redacted. Logically, we should represent the three-em-dash as six dashes, but you may reduce that to four. Whichever you choose, do use it consistently in the text you're producing.
Unlike the em-dash, you should leave a space in such cases wherever a space would have been before the letters were replaced by dashes.
Here's a summary table of the dashes:
Name            ASCII     Used for
Hyphen          -         Hyphenated Words
En-dash         -         Durations, like "3:00-5:30"
Em-dash         --        Break in sentence or parenthetical comment
Three-em-dash   ------    Indicating a word that was edited out.
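If you are starting from a text that already contains typographic dash characters (for example, from a word processor), the table above can be applied mechanically. The following Python sketch is one way to do it; the table and function names are invented, and the dash characters are written as Unicode escapes so that the listing itself stays plain ASCII.

```python
# Map typographic dashes to the PG plain-text renderings in the table above.
DASH_TABLE = str.maketrans({
    "\u2010": "-",       # typographic hyphen
    "\u2013": "-",       # en-dash
    "\u2014": "--",      # em-dash
    "\u2e3b": "------",  # three-em-dash
})

def ascii_dashes(text):
    """Replace typographic dashes with their hyphen-based equivalents."""
    return text.translate(DASH_TABLE)

print(ascii_dashes("1830\u20131874"))
print(ascii_dashes("a kind of parenthesis\u2014like this"))
print(ascii_dashes("Dr. P\u2e3b was known for his honesty."))
```

If you prefer four hyphens for the three-em-dash, change the last table entry, and keep it consistent throughout the text.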
V.92. How should I treat dashes replacing letters?
If the dashes obviously represent individual letters, use the same number of hyphens. Otherwise, you can use a three-em-dash (see above: 6 or 4 hyphens) in such places.
A common convention when a character in a novel is using bad language, or when reference is given to a character whose full name is not being used, is to replace the letters with dashes. For example,
"That D---l, Mr. C------s will regret his hasty actions!"
In this case, it is clear that "D---l" is meant to represent "Devil" and that there is a character whose name begins with "C" and ends in "s" whose name is not spelled out in full. Where the book makes it clear how many letters are represented by hyphens, just use that number of hyphens.
Where the number of letters omitted is not clear, you can decide how long you want to make your extended dash. Typographers often use the "three-em-dash" for this, so called because it is as wide as three capital Ms. Logically, since we represent an em-dash by two hyphens, we might represent a three-em-dash as six, but if you feel that six hyphens is too long, you can choose a shorter length, like four, but if you do, keep it consistent within your text:
It was in the town of S----, walking on M---- Street, that
Sowerby came upon Dr. T---- taking the morning air.
V.93. What about hyphens at end of line?
Remove the hyphens from single words that were wrapped by the printer at line-end on the paper copy. Where two words are joined with a hyphen, you can leave the hyphen at end of the text line.
Books are usually printed with words broken at end of line to make the right side of the text perfectly even. You should remove all such hyphens. For example, in the sentence:
Mary's mouth tightened as she saw the marks on the car-
pet, and her hands balled into fists.
you should remove the hyphen from "carpet".
Words which are strung together and hyphenated by the author pose a different question. It is perfectly OK from the point of view of a reader of the plain text version for such a hyphen to occur at end of line, for example:
Now that the guns were silent, convoys brought badly-
needed medical supplies and food.
However, be aware that if somebody later rewraps the text for use in a different format like HTML, it is possible that they will introduce a space where it should not be:
Now that the guns were silent, convoys brought badly- needed medical supplies and food.
so there is still a small disadvantage to having a hyphen at line-end.
Sometimes it's not entirely clear whether the hyphen is there because it has to be, or just because it happens to fall at the end of the line:
Daisy rushed to the door, but there were no letters for her to-
day, and she retreated sadly.
Sometimes "today" is written as "to-day", especially in older works. So which is this? Should we remove the hyphen or not? In this case, the best thing to do is search the rest of the text for the same word, and see whether it is consistently hyphenated or not in other places.
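The "search the rest of the text" step can be done with a short script. The Python sketch below counts how often a word appears hyphenated and unhyphenated in the middle of a line; the function name is made up. Occurrences broken across a line end are deliberately not counted, since those are the doubtful cases you are trying to settle.

```python
import re

def hyphenation_votes(text, left, right):
    """Count mid-line uses of 'left-right' and 'leftright' (case-insensitive)."""
    hyphenated = re.findall(
        rf"\b{re.escape(left)}-{re.escape(right)}\b", text, re.IGNORECASE)
    joined = re.findall(
        rf"\b{re.escape(left + right)}\b", text, re.IGNORECASE)
    return {"hyphenated": len(hyphenated), "joined": len(joined)}

sample = ("She came to-day.\n"
          "No letters for her to-\n"
          "day, alas. To-day is fine. Not today.")
print(hyphenation_votes(sample, "to", "day"))
```

If the hyphenated form wins clearly elsewhere in the book, keep the hyphen when you rejoin the broken word.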
V.94. What should I do with italics?
There are three different ways volunteers currently render italics: like THIS, like _this_ and like /this/. Pick one, and use it consistently in your text.
There are really two questions here: "How should I render italics?" and "When should I render italics?"
The original PG standard for italics was to render emphasis italics as CAPITALS, using underscores for an italicized I, and do nothing for non-emphasis italics like foreign words and names of ships, and this is still the most common usage. For reading a plain-text file in a plain text editor, it is still arguably the most reader-friendly usage as well.
It has two drawbacks:
1. if you do want to preserve italics for non-emphasis words, you may end up with a very ugly text where there are too many capitals.
2. it is impossible to convert CAPITALS reliably back into italics, since the original text might have had a capital letter, or even been all capitals in the first place. This is especially true of automatic conversion for people who want to read PG texts on eBook readers.
To overcome these problems, many volunteers now use underscores or /slants/ to render italics. These allow you to preserve all italics without creating an ugly plain-text, and to remove the ambiguity of CAPITALS. Underscores are more popular than slants, but some people feel that underscores should properly be reserved for underlined text. Since printers tend to avoid underlines, however, there aren't many books where this causes a real conflict.
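To illustrate why underscores convert cleanly where CAPITALS cannot, here is a tiny Python sketch of the kind of substitution an HTML converter might make. It is an assumption about how such a converter could work, not a description of any actual PG tool, and the function name is invented.

```python
import re

def underscores_to_html(text):
    """Turn _marked_ spans into <i>...</i>; everything else is left alone."""
    return re.sub(r"_([^_]+)_", r"<i>\1</i>", text)

line = "He sailed on the _Beagle_ and was _very_ ill. STOP!"
print(underscores_to_html(line))
```

Note that "STOP!" is untouched: the program has no way to tell whether capitals stood for italics or were capitals in the book, which is exactly the ambiguity described above. A stray unpaired underscore would confuse this sketch, so check the output.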
V.95. Yes, but I have a long passage of my book in italics! I can't really CAPITALIZE or otherwise /mark/ all that text, can I?
No, you really can't. On the other hand, if the author intended that section to stand out, you don't want to ignore that information and withhold it from future readers.
What you can do is format it differently from the rest of the text. For example, if you're averaging a 68-character line throughout normal paragraphs, you could reasonably use shorter lines, like 58 characters, for the italicized section. Going a step further, you could shorten the lines and indent them a space or two as well. This will give a clear signal to future readers and converters that this section is to be treated specially.
V.96. Should I capitalize the first word in each chapter?
Capitalization of the first word is often used in printed material to emphasize the break at the start of a section or chapter on the paper, but it is not necessary in an eBook, and leads to the same kind of ambiguity as does the capitalization of italics, and for far less reason.
If you feel you really must capitalize the first word, we probably won't stop you, but if so, please do it consistently throughout the book, not just in one or two places, so that a future reader can be certain that these capitalized words were a chapter-head convention, and not otherwise intended for emphasis.
V.97. What is a Transcriber's Note? When should I add one?
A Transcriber's Note is a small section you can add to a text you produce to give the reader some information about changes you made to the book when rendering it into text.
A Transcriber's Note is not the same as a footnote: a footnote is part of the text you have transcribed; a Transcriber's Note is a note that you add to the text, explaining something you have done or omitted. If there is a Transcriber's Note, it may be at the top or the end of the text, and it should be clearly marked so that a reader cannot confuse it with the main text or an introduction.
The main thing is to ensure that a reader cannot confuse text that you have added with text that was in the original book.
Transcriber's Notes are rarely needed, but if, for example, you found misprints in the text, or things that might look like misprints even though they're not, you may note them here, if it seems relevant. If there is an image in the book that is important to the content, you may describe it in a note. If there was unusual typography that you had to represent in some uncommon way, you might well explain that here.
You don't need to add a Transcriber's Note just for common conversions like italics, and you should not use such a note to add your own comments or views about the text or the author. It's just there to let the reader know what decision you have made about rendering the text.
Here are some examples of Transcribers' Notes:
Transcriber's Note:
The irregular inclusion or omission of commas between repeated words ("well, well"; "there there", etc.) in this etext is reproduced faithfully from the 1914 edition . . .
Transcriber's Note:
Inserted music notation is represented like [MUSIC--2 bars, melody] or
[MUSIC--4-part, 8 bars]
[Transcriber's Note: This letter was handwritten in the original.]
Transcriber's Note:
The spelling "Freindship" is thus in the original book.
Transcriber's Note: Some words which appear to be typos are printed thus in the original book. A list of these possible misprints follows:
If there is an image that is important to the content you may describe it at the point in the text where it appears, for example:
[Transcriber's Note: Here there is a map of three islands just West of and parallel to a coastline running SW to NE, with a big X marked on the North of the middle island. A spur of land extends from the mainland, sheltering the islands from the north-east.]
Transcriber's Notes that apply to the whole text should be placed at the start or end of the text; the choice is yours. Notes that pertain to a specific point in the text, like the map example above, should be placed at the point in the text where they are relevant, but not interrupting a paragraph except where it cannot be avoided.
V.98. Should I keep page numbers in the e-text?
No. But there are exceptional cases . . .
In general, the page numbers of the original book are irrelevant when making a reader's edition for PG; they are annoying and intrusive for anyone trying to read it, and if you did keep them, they would probably be removed by anyone converting it. Get rid of them!
But there are a few books where page numbers are appropriate. Non-fiction books that use page numbers as internal cross-references are the prime example; if, on page 204, the text reads
"Our studies of plants (see pp. 141-145) show that this is true."
and this kind of cross-reference is frequent throughout the text, then it is probably best to keep the page numbers, since it is otherwise very difficult to honor the author's intent.
In the more common case where cross-references exist, but are not frequent, and not essential to the text, you have several choices: leave the cross-references in, meaningless though the page numbers are; remove the cross-references; change the cross-references to something relevant (like "Start of Chapter 12" instead of "pages 141-145"); or, if you can make it work in context, insert references in the text for the cross-references to point to, like [Reference: Plants] and then reformat the cross-reference like "Our studies of plants (see [Reference: Plants]) show that this is true."
There are a few other cases, where the text you create is likely to be the subject of study or reference, in which it may also be desirable to retain page numbering.
When there are pages at the end of the book with notes referring to page numbers, the simplest answer is to change the page number references to chapter numbers, and add a quote from the page referred to if it's not already in the book's end-notes. That way, a reader can search for the phrase.
V.99. In the exceptional cases where I keep page numbers, how should I format them?
Within brackets of your choice, with one space either side, simply added to the text at the exact point of the page break. Unless there is some [142] special reason, you shouldn't insert a line break or new paragraph when indicating a page number; just insert it in the text, as I did with "142" above.
You should use whichever of round brackets, (143) square brackets, [144] or curly brackets {145} is not used (or least used) within the main text itself, and then use it consistently. Try to make sure that your page numbers cannot be confused with anything else.
Don't run your[146]page[147]numbers right up against words with spaces omitted; this just makes the text hard to read. Use spaces before and after.
Where the page break is at the start of a chapter or headed section,you can put it on a line of its own, for example:
Where a paragraph begins on a new page, you should put the page number at the start of the paragraph, as:
[149] With the extinction of the dinosaurs . . .
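One practical benefit of the "one space either side" rule is that somebody who wants a clean reading copy can remove the page numbers mechanically. The Python sketch below shows the idea for square brackets; the pattern and function name are invented for this illustration, and it assumes square brackets holding only digits are used for nothing but page numbers in the text (which is exactly why you should pick a bracket style the book doesn't otherwise use).

```python
import re

# A marker like "[142] " either at the start of a line or after one space.
PAGE_MARK = re.compile(r"(^|[ ])\[\d+\] ", re.MULTILINE)

def strip_page_marks(text):
    """Remove page markers, leaving a single space between the words."""
    return PAGE_MARK.sub(r"\1", text)

print(strip_page_marks("Unless there is some [142] special reason"))
print(strip_page_marks("[149] With the extinction of the dinosaurs"))
```

Markers jammed against words with no spaces would not match this pattern, which is one more reason to avoid them.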
V.100. Should I keep Tables of Contents?
Yes, but just keep the contents themselves, and not the page numbers for each chapter or section, except where you have kept the page numbers in the whole text. When you have removed the page numbers from the book, it doesn't make much sense to leave them in the TOC.
Here, for example, is a typical TOC. In the original text, each chapter had a page number beside it:
   1  When the Duchess was Dead
   2  Lady Mary Palliser
   3  Francis Oliphant Tregear
   4  It is Impossible
   5  Major Tifto
   6  Conservative Convictions
   8  He is a Gentleman
   9  'In Media Res'
  10  Why not like Romeo if I Feel like Romeo?
  11  Cruel
  12  At Richmond
Note that I have indented the lines here, to give a sign to automatic converters that these lines should not be wrapped into one paragraph.
V.101. Should I keep Indexes and Glossaries?
If you are working from a pre-1923 publication, then yes.
If you are working from a modern reprint, you must be careful not to take any of the text that might have been added by the modern publisher. If you have any doubt about whether the index or glossary was part of the original printing, you should leave it out. Often with reprints, under your Clearance Line [V.37], you may see an instruction not to use indexes. In such cases, or if there is any doubt at all, don't.
V.102. How do I handle a break from one scene to another, where the book uses blank lines, or a row of asterisks?
Use a blank line, followed by a line of 3 or 5 spaced asterisks or dashes, followed by another blank line.
In a printed book, where the point of view switches from one character to another, or some other break in the narrative is made without a new chapter or headed section, the publisher will often denote the break just by a couple of blank lines. This gives the reader a cue to notice that the point of view has switched, and avoids confusion.
However, a printed book cannot be edited or changed, while an eBook will be edited and converted over its lifetime, and it is likely that if you denote this break just by a couple of blank lines, as in the book, your break may be lost. For example, in automated conversion to a PDA reader format, it is common to merge multiple blank lines into one.
In making a PG e-text, you may indicate this break by a couple of additional blank lines, but, if your text is later converted into another format such as HTML, the extra blank lines may get lost in the editing or rendering. Or the person doing the conversion may simply think that the extra blank line was a mistake, and remove it. To guard against this, you should add an unambiguous visual break such as a line of spaced asterisks:
* * * * *
The exact layout of your break is not really important, and you can use whatever format you prefer. Blank line followed by five spaced asterisks followed by another blank. Or you could use two blank lines, and dashes instead of asterisks. Just make sure that future readers can be in no doubt that you intended to indicate a break that was really in the original printed text.
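To see how a scene break gets lost, here is a Python sketch that imitates a converter merging runs of blank lines into one. The function is a stand-in written for this example, not any real conversion program.

```python
import re

def merge_blank_lines(text):
    """Mimic a converter that collapses any run of blank lines into one."""
    return re.sub(r"\n{3,}", "\n\n", text)

blank_only = "End of scene.\n\n\n\nNew scene."
marked = "End of scene.\n\n* * * * *\n\nNew scene."

print(repr(merge_blank_lines(blank_only)))  # now looks like a paragraph break
print(repr(merge_blank_lines(marked)))      # the asterisk line survives
```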
V.103. How should I treat footnotes?
In a printed text, the most common treatment for footnotes is to putthem at the end of the page to which they refer. Sometimes, editorsgather them all at the end of the book. Footnotes are a realformatting problem for an eBook without defined physical pages; thereis no agreement between readers about which is the best way to renderthem.
There are three basic ways of rendering footnotes in an e-text:
You can insert them right into the text, in brackets, at the point in the paragraph where they occur, with or without an indication that they were originally footnotes. This is only reasonable in a text with very short footnotes.
You can insert them after the paragraph to which they refer, either contiguous with the paragraph or as a new "paragraph" of their own, as I am doing with this one. If the text contains any footnotes longer than a line, [1] you should not try to just append them to the paragraph; you should make a new "paragraph" of them, with a blank line before and after.
[1] Some footnotes can go on not only for several lines, but for several pages!
You can gather all footnotes at the end of the e-text, or at the end of the chapter to which they refer.
Of these three, gathering all footnotes to the end of the chapter or the end of the whole text is probably the friendliest option, since it preserves the original intention of allowing the reader to continue reading the main text without interruption. However, it may involve some renumbering and general note-keeping on your part, and may not be needed where there are only a few short footnotes. You can see an ideal example of this kind of footnote marking in our edition of Darwin's "The Voyage of the Beagle", file vbgle10.txt from 1997, Etext number 944.
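The renumbering and note-keeping can be sketched in Python. This is only an illustration under my own assumptions: the function gather_footnotes and its input layout (one text and one list of notes per printed page, with markers like [1] restarting on each page) are hypothetical, not a PG tool.

```python
import re

def gather_footnotes(pages):
    """pages is a list of (text, notes) pairs; notes[k-1] belongs to the
    marker [k] on that page. Returns the body with globally renumbered
    markers, plus the collected notes for the end of the text."""
    body, endnotes = [], []
    counter = 0
    for text, notes in pages:
        offset = counter  # notes already used on earlier pages

        def renumber(match, offset=offset):
            return "[%d]" % (offset + int(match.group(1)))

        body.append(re.sub(r"\[(\d+)\]", renumber, text))
        for note in notes:
            counter += 1
            endnotes.append("[%d] %s" % (counter, note))
    return "\n\n".join(body), "\n\n".join(endnotes)

pages = [
    ("First claim [1] and second [2].", ["Note A.", "Note B."]),
    ("Later claim [1].", ["Note C."]),
]
body, notes = gather_footnotes(pages)
print(body)
print(notes)
```

Note that this sketch renumbers every bracketed number it finds, so a text that uses [12] for something other than a footnote would need checking by hand.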
V.104. My book leaves a space before punctuation like semicolons, question marks, exclamation marks and quotes. Should I do the same?
If you look closely at these "spaces", you will see that they are not as wide as a normal space; they tend to be half to three-quarters as wide. These don't actually represent spaces as such; they were just a convention used by typesetters to make the text feel less cramped, and they did not express any specific intent on the part of the author.
OCR software tends to see them as full spaces, and one of the jobs you typically have to do when editing a text that has been OCRed is to remove them.
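If you would rather not remove them all by hand, the job can be sketched as a one-line regular expression in Python. The function name is my own, and the sketch handles only semicolons, question marks and exclamation marks; treat it as a starting point and check the result by eye.

```python
import re

def remove_space_before_punctuation(line: str) -> str:
    """Delete spaces or tabs that sit directly before ; ? or !"""
    return re.sub(r"[ \t]+([;?!])", r"\1", line)

print(remove_space_before_punctuation("Hello ! How are you to-day ?"))
```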
In some texts, this also happens following an opening quote, so your OCR might read a sentence as:
" Hello ! How are you to-day ? "
which you should correct to:
"Hello! How are you to-day?"
Samples of this can be seen in the images used for the FAQ "Why am I getting a lot of mistakes in my OCRed text?" [S.17]
V.105. My book leaves a space in the middle of contracted words like "do n't", "we 'll" and "he 's". Should I do the same?
Unlike the pseudo-spaces before punctuation, these really were intended as spaces indicating the break between words. That is, where we would nowadays contract two words into one, the author or editor has made the contraction, but left them as two separate words.
Since this effect was intended, it is usual to leave the spaces in. Some people who really do n't like this style of spelling do remove them, but generally volunteers want to preserve the text as printed.
V.106. How should I handle tables?
Just line up the information neatly in columns. If you use a non-proportional font [W.5] you will be able to do this reliably. You can also use the dash character "-", the underscore "_" and the pipe character "|" to make borders if you really need to, but it's usually better to omit them. It is, though, often good to indent your table a little, to set it off from the main text, and to avoid the danger of having it automatically wrapped by some converter later. For example, from "The Albert N'Yanza, Great Basin of the Nile" by Sir Samuel White Baker:
                             TABLE No. 1.
    Table for Increased Reading of Thermometer, using 0 degrees 80 as the
    Result of Observations for its Error.
    Month.          1861.    1862.    1863.    1864.    1865.
    January. . .       --    0'143    0'314    0'487    0'659
    February . .       --     '157     '328     '501     '673
    March . . .     0'000     '172     '344     '516     '688
    April . . .      '014     '186     '358     '530     '702
    May . . . .      '028     '200     '372     '544     '716
    June . . . .     '043     '214     '387     '559     '730
    July . . . .     '057     '228     '401     '573     '744
    August . . .     '071     '243     '415     '587     '758
    September. .     '086     '257     '430     '602     '772
    October . .      '100     '271     '444     '616     '786
    November . .     '114     '285     '458     '630    0'800
    December . .    0'129    0'300    0'473    0'645       --
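If you have the table as a list of rows, lining it up can be automated. Here is a small Python sketch; the function format_table and its indent and gap parameters are my own assumptions for illustration. It left-aligns the first column, right-aligns the rest, and indents the whole table.

```python
def format_table(rows, indent=2, gap=3):
    """Return the rows as aligned, indented plain-text lines."""
    # Width of each column is the width of its longest cell.
    widths = [max(len(row[c]) for row in rows) for c in range(len(rows[0]))]
    lines = []
    for row in rows:
        cells = [row[0].ljust(widths[0])]
        cells += [cell.rjust(w) for cell, w in zip(row[1:], widths[1:])]
        lines.append(" " * indent + (" " * gap).join(cells))
    return "\n".join(lines)

rows = [
    ["Month.", "1861.", "1862."],
    ["March", "0'000", "'172"],
    ["April", "'014", "'186"],
]
print(format_table(rows))
```

This only works when viewed in a non-proportional font, which is the same condition that applies to lining up a table by hand.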
V.107. How should I format letters or journal entries?
Make them look like they are in the printed book. If the signature is indented in the book, indent it in the letter. For example:
No consideration would induce me to
change my resolve in this matter, but I am
willing to engage your services as my agent
for a fee of 100 pounds.
                              "H. Middleton"
When a letter appears in the middle of lots of prose, using shorter lines for the letter is an effective way of making the letter stand out, without resorting to indenting the whole thing.
When the book is largely composed of letters or entries, as happens in an epistolary novel or the publication of somebody's letters or journal, you might reasonably leave two or three (but whichever you choose, keep it consistent throughout the book!) blank lines between entries to give the reader a visual cue that the next is not just a new paragraph, but a new entry, for example:
10 pm.—I have visited him again and found him sitting in a corner brooding. When I came in he threw himself on his knees before me and implored me to let him have a cat, that his salvation depended upon it.
I was firm, however, and told him that he could not have it, whereupon he went without a word, and sat down, gnawing his fingers, in the corner where I had found him. I shall see him in the morning early.


20 July.--Visited Renfield very early, before attendant went his rounds. Found him up and humming a tune. He was spreading out his sugar, which he had saved, in the window, and was manifestly beginning his fly catching again, and beginning it cheerfully and with a good grace.
I looked around for his birds, and not seeing them, asked him where they were. He replied, without turning round, that they had all flown away. There were a few feathers about the room and on his pillow a drop of blood. I said nothing, but went and told the keeper to report to me if there were anything odd about him during the day.
11 am.—The attendant has just been to see me to say that Renfield has been very sick and has disgorged a whole lot of feathers. "My belief is, doctor," he said, "that he has eaten his birds, and that he just took and ate them raw!"
11 pm.—I gave Renfield a strong opiate tonight, enough to make even him sleep, and took away his pocketbook to look at it. The thought that has been buzzing about my brain lately is complete, and the theory proved.
This is different from the case mentioned in the FAQ [V.102] "How do I handle a break from one scene to another, where the book uses blank lines, or a row of asterisks?". In that case, we added a row of asterisks because future reformatting or conversion could cause confusion about the scene break that was explicitly signalled by the blank lines on paper. In this case, each new letter or journal entry cannot be mistaken by a careful reader, so we don't need asterisks or dashes to signal that; we're just adding a bit of extra space to make it more readable.
V.108. What can I do with the British pound sign?
The British pound sign cannot be expressed in ASCII, but is very common in the works of English novelists. It evolved as a stylized version of the letter L (from the Latin "libra"), and it's entirely appropriate to represent it as such, in either of these two forms:
The horse cost L8 12s. 6d.
The horse cost 8l. 12s. 6d.
This works particularly well where an amount is expressed in pounds, shillings and pence (librae, solidi, denarii).
Where there is a simple number of pounds, you may prefer just to use the word:
She was a handsome widow with 500 pounds a year.
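The three renderings above can be sketched as one small Python function. The name pound_to_ascii and its style argument are my own invention for illustration; the sketch only handles a pound sign written directly before a number.

```python
import re

POUND = "\u00a3"  # the pound sign, written as an escape to keep this source ASCII

def pound_to_ascii(text: str, style: str = "prefix") -> str:
    """Render pound amounts as 'L8', '8l.' or '8 pounds'."""
    def render(match):
        amount = match.group(1)
        if style == "prefix":
            return "L" + amount
        if style == "suffix":
            return amount + "l."
        return amount + " pounds"
    return re.sub(POUND + r"\s?(\d[\d,]*)", render, text)

print(pound_to_ascii("The horse cost " + POUND + "8 12s. 6d."))
print(pound_to_ascii("The horse cost " + POUND + "8 12s. 6d.", "suffix"))
print(pound_to_ascii("a handsome widow with " + POUND + "500 a year.", "word"))
```

Whichever style you pick, use it throughout the book.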
V.109. What can I do with the degree symbol?
Just type out the word "degrees" or the abbreviation "deg."; for example:
By the time we reached Cairo it was 115 degrees in the shade.
Geographical degrees are more awkward, but should be handled the same way:
It was at 30 deg. 15' E, 14 deg. 45' N.
In general, any symbol can be represented in words.
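As a hedged sketch of how this might be checked in Python: the first function below spells out the degree sign, and the second lists any characters still outside ASCII so that you can decide how to word each one. Both function names are my own assumptions, not part of any PG tool.

```python
import re

def spell_out_degrees(text: str, word: str = "degrees") -> str:
    """Replace a degree sign that follows a digit with ' degrees' (or ' deg.')."""
    return re.sub(r"(\d)\s*\u00b0", r"\1 " + word, text)

def remaining_non_ascii(text: str):
    """Return (line number, character) pairs for anything still outside ASCII."""
    return [(n, ch)
            for n, line in enumerate(text.splitlines(), 1)
            for ch in line if ord(ch) > 127]

print(spell_out_degrees("it was 115\u00b0 in the shade."))
print(spell_out_degrees("30\u00b0 15' E, 14\u00b0 45' N.", "deg."))
print(remaining_non_ascii("plain\ncaf\u00e9"))
```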
V.110. How should I handle . . . ellipses?
Just as I did above . . . and here! Leave one space before and after each dot. Do not break an ellipsis over the end of a line. In principle, an ellipsis is one symbol, like an em-dash, and should not be broken at line end.
A special case arises when an ellipsis follows a sentence instead of being in the middle. . . . In this case, put the period after the last letter of the sentence, as you normally would, then follow the usual format for ellipses. You end up with four dots, with spaces everywhere except before the first.
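If your OCR or typing has produced unspaced dots, the two cases above can be sketched in Python like this. The function name space_ellipses is my own, and the sketch assumes ellipses arrive as exactly three or four consecutive dots; longer runs of dots, and ellipses at the very end of a line, are best fixed by hand.

```python
import re

def space_ellipses(text: str) -> str:
    """Rewrite '...' as ' . . . ' and a sentence-ending '....' as '. . . . '."""
    # Four dots: period plus ellipsis, so no space before the first dot.
    text = re.sub(r"[ ]*\.{4}[ ]*", ". . . . ", text)
    # Exactly three dots: one space before and after each dot.
    text = re.sub(r"[ ]*(?<!\.)\.{3}(?!\.)[ ]*", " . . . ", text)
    return text

print(space_ellipses("Just as I did above...and here!"))
print(space_ellipses("being in the middle.... In this case"))
```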
V.111. How should I handle chapter and section headings?
For a standard novel, you can choose either four blank lines before the chapter heading and two lines after, or three lines before and one line after, but whichever you use, do try to keep it consistent throughout.
Normally, you should move chapter headings to the left rather than try to imitate the centering that is used in some books.
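Keeping the spacing consistent is easy to get wrong by hand, so here is a minimal Python sketch of it. Everything named here is an assumption of mine: the HEADING pattern (a line beginning "CHAPTER" or "Chapter") will not suit every book, and the function handles only one-line headings.

```python
import re

HEADING = re.compile(r"^(CHAPTER|Chapter)\b")  # assumed heading pattern

def space_headings(lines, before=4, after=2):
    """Left-align each heading and give it a fixed number of blank lines
    before and after."""
    out = []
    i, n = 0, len(lines)
    while i < n:
        stripped = lines[i].strip()
        if HEADING.match(stripped):
            # Drop whatever blank lines were already in front of the heading.
            while out and out[-1].strip() == "":
                out.pop()
            if out:
                out.extend([""] * before)
            out.append(stripped)
            # Skip the blank lines that followed the heading in the input.
            i += 1
            while i < n and lines[i].strip() == "":
                i += 1
            if i < n:
                out.extend([""] * after)
            continue
        out.append(lines[i])
        i += 1
    return out

sample = ["End of one.", "", "      CHAPTER II", "", "Start of two."]
print("\n".join(space_headings(sample)))
```

Calling space_headings(sample, before=3, after=1) gives the other spacing mentioned above.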
V.112. My book has advertisements at the end. Should I keep them?
Most people seem to think "no", and "no" is the safe choice, but opinions vary.
The typical arguments are: "The ads are not part of the author's intent, so you should remove them." vs. "They give a flavor of the original book, so you should keep them". This latter is particularly cogent when the ads are for other books by the same author.
Decide which of these statements best fits your own views in the case you're looking at; after that, it's up to you!
V.113. Can I keep Lists of Illustrations, even when producing a plain text file?
Yes. As in the case of the Table of Contents, there is no point in including page numbers when your text doesn't have them, but the list of illustrations itself may go in.
V.114. Can I include the captions of Illustrations, even when producing a plain text file?
You can format them as short paragraphs of their own, in brackets, with the word Illustration: followed by the caption, something like:
[Frontispiece: A Flash of Light]
[Illustration: Goldsmith at Trinity College]
Don't interrupt a paragraph to insert one, unless the reader really needs to know that the original illustration was in the middle of the paragraph; place the note between paragraphs instead.
V.115. Can I include images with my text file?
Yes, as I have done with the zipped version of the plain-text format of this FAQ, but in general it makes much more sense, if you want to include images, to make an HTML version of the book and include them there, where they are anchored into the text in a predictable way, and leave them out of the text version. But there are exceptional cases, such as this one: I included images with this plain-text FAQ because I wanted you to be able to experiment with them using your own OCR package.
If you do include images with plain text, they will be included with the ZIP file, but will not be downloadable separately with the plain text file. For example, if your file gets named abcde10.txt, and you include images pic1.gif, pic2.gif and pic3.gif, then the zip file will include all four files, but only the zip file and abcde10.txt will be posted, so the images will be available only within the zip file. So, even if you are including images, don't assume that the reader will be able to see them.
If you do include images with plain text, be sure to mention them by filename in a note at the appropriate places in the text file; otherwise readers may not even realize they're there. For example:
[Illustration: Goldsmith at Trinity College--see goldtrin.gif]
If you do include images with a text file, don't make them too big. Readers downloading zip files of plain text expect them to be relatively small; don't burden them with huge downloads they don't want. Use the same kind of rules and processing that you would for an HTML file, or better still, include the images only with the HTML version.
About formatting poetry:
V.116. I'm producing a book of poetry. How should I format it?
Make it look like the original.
The only formatting change that you might consider is to limit the amount of centering. Often, in a poetry book, the title of a poem may be centered, when the body of the verse isn't. This can work on paper, particularly when the page is narrow, but "centering" the title on a 70-column line can mean that the title ends up far to the right of the body of the poem, which looks untidy. And even if you center the title correctly over the body of this poem, the next poem may have longer lines, and so its title may not have the same center as the first poem, and the title of one will be off-center with the title of the next!
If you have this kind of formatting in your book, you should consider moving all of the poem titles to the left margin rather than try to keep compensating for different line centers. It's more consistent, and easier to read, if you just left-align all titles. To see a not-quite-successful attempt at centering the titles over the poems, take a look at the Poems of Emily Dickinson.
In that case, it would have been better to left-align the numbers and titles. Centering isn't really an effective formatting choice in etexts.
V.117. I'm producing a novel with some short quotations from poems. How should I format them?
As nearly as possible like they look in the book, with the exception that you should indent the whole verse anywhere between 1 and 4 spaces from the left. This is to give a signal to automatic conversion programs that these lines should not be wrapped.
For an example of a novel with many differently formatted quotations embedded, see the "a" version of Clotel, file clotl10a.txt, Etext number 2046, from the year 2000.
Some of these quotations touch the left-hand column; today, we would think it better to insert at least one space before every line.
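Adding that leading space to every line of a quoted verse, without disturbing the relative indentation of the lines, can be sketched with Python's standard textwrap module. The wrapper name indent_verse and its default of two spaces are my own choices for illustration.

```python
import textwrap

def indent_verse(verse: str, spaces: int = 2) -> str:
    """Indent every non-blank line of the verse by the same amount,
    keeping any indentation the lines already have."""
    return textwrap.indent(verse, " " * spaces)

verse = ("Now hatred is by far the longest pleasure;\n"
         "  Men love in haste, but they detest at leisure.")
print(indent_verse(verse))
```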
About formatting plays:
V.118. How should I format Act and Scene headings?
Pretty much like chapter headings. You can use 4 blank lines between acts, and 3 blank lines between scenes, or 3 between acts and 2 between scenes. If your book has "END OF ACT/SCENE" footers, leave them in the etext.
You may center act/scene headers and footers if they are centered in the book, but it's usually best to left-align them, for the same reasons it's usually best to left-align poem titles in poetry.
V.119. How should I format stage directions?
Generally, in brackets.
In printed texts, it is common to show stage directions as italics inside brackets. You don't have the option of italics in plain text, and you shouldn't need to use underscores or /slants/, and certainly not CAPITALS, to indicate italics for stage directions. Normal text within the brackets is all you need. It will be immediately clear to a reader that bracketed text consists of stage directions.
[Square brackets] are most common for stage directions, but (round) or {curly} brackets will work too, if there's a reason why they are preferable in the case of your text. Just make sure that you use the same kind of brackets consistently and only for stage directions; don't use round brackets for stage directions if characters' speeches also contain text in round brackets.
Some printed plays follow the convention of not closing brackets when the direction is at the end of a speech or scene. For example:
[Exeunt.
Where the book doesn't close the bracket in a case like this, you shouldn't either.
V.120. How should I format blank verse?
Just like normal verse in poetry. Make it look like the printed book. Left-align it, and make one line of etext the same length as one line of print.
Sometimes in blank verse, a speech may start mid-line, and the print reflects that by leaving a space on the left, and starting mid-way. In a case like that, do the same in the etext.
About some typical formatting issues:
V.121. Sample 1: Typical formatting issues of a novel.
Look at the image novel.tif. It shows a page of a novel, with several typical formatting decisions to be made.
We note that there is no end-quote on the first paragraph, but that's OK, since the second paragraph is a continuation by the same speaker, so the first paragraph doesn't need a closequote. There is also an italicized "I", which will end up with underscores, but there is nothing else to give us any difficulty.
In the second paragraph, we have an ellipsis, an italicized French word with an accented letter, the British pound symbol, and an italicized "Here".
The ellipsis is simple.
Let's assume we're making this into a 7-bit text, so we're going to convert the non-ASCII character a-circumflex and the pound sign. The a-circumflex just goes to an "a", but we have several choices we can make about the pound sign.
The italicized "Here" is clearly for emphasis, so we will mark that up. The word "flaneur" is italicized because it is not English, but possibly also for emphasis . . . if the sentence had read "The Major is a fool", with the word "fool" italicized, it would clearly be emphasis. As it stands, we don't know whether emphasis is intended. This doesn't matter if we are just using underscores or /slants/ to render italics, but if we use CAPITALS, we're going to have to impose our best guess on one side or the other.
The third paragraph shows some vaguely familiar squiggles: Greek letters! We hit the PG transliteration guide and spell it out . . . rough-breathing upsilon = hu; beta = b; rho = r; iota = i; final sigma = s. So the Greek word transliterates as "hubris". Since hubris is a familiar word, we don't need to make a fuss about it, though we may italicize it.
We then have a note, which we will format a little differently from the main text to help it stand out, and a new chapter heading.
We should certainly indent the second line of the Byron quotation to preserve its original form, but we have the option whether or not to indent the first line a little to signal to any future automatic converter that this is not to be rewrapped.
In the first paragraph of the new chapter, we need to get rid of the hyphenation of "Wentworth" at line-end and fix the two em-dashes.
In the second paragraph of the new chapter, we have a long dash between "d" and "l", clearly meant to denote "devil", so we will fill it in with three dashes, and we see a three-em-dash after "Lord H", so we can use six, or possibly four, dashes for that.
Finally, we have a table, a list of money values against names.
Depending on the standards we've chosen to use throughout the book, we could render these details in a variety of ways. For illustration, here are two acceptable possibilities:
"I shall go down to Wokingham", said Middleton, "a few days before the election, and the Major will stay here. I understand that there will be no other candidate, and I shall take the seat.
"The Major is a . . . flaneur. He has no interest beyond his own advancement. I can buy him for a hundred pounds. Here is his answer."
Wallace wondered at the hubris of his friend, and examined the note Middleton thrust upon him.
"Sir, No consideration would induce me to
change my resolve in this matter, but I am
willing to engage your services as my agent
for a fee of 100 pounds. H. Middleton"
  Now hatred is by far the longest pleasure;
    Men love in haste, but they detest at leisure.
On hearing of Middleton's visit, Mr. Wentworth began his preparations. Meeting with Thomas Lake and Riley at the back of the tap-room of The Bull--where the landlord saw to it that they remained undisturbed--he laid out their plan of campaign.
"That d---l Middleton shall not have the seat," he raved, "not for Lord H------; no, nor for a hundred Lords! We shall see to it that every man's hand is turned against him when he arrives."
Lake unfolded a paper from his vest-pocket and smoothed it on the table. "Here are the expenses we should undertake."
    Doran          L13 10s.
    Titwell        L 8  7s. 6d.
    St. Charles    L25
* * * * *
"I shall go down to Wokingham", said Middleton, "a few days before the election, and the Major will stay here. I understand that there will be no other candidate, and I shall take the seat.
"The Major is a . . . flaneur. He has no interest beyond his own advancement. I can buy him for L100. HERE is his answer."
Wallace wondered at the hubris of his friend, and examined the note Middleton thrust upon him.
"Sir, No consideration would induce me to change my resolve in this matter, but I am willing to engage your services as my agent for a fee of L100. H. Middleton"
Now hatred is by far the longest pleasure;
  Men love in haste, but they detest at leisure.
                                      ---- Byron
On hearing of Middleton's visit, Mr. Wentworth began his preparations. Meeting with Thomas Lake and Riley at the back of the tap-room of The Bull--where the landlord saw to it that they remained undisturbed--he laid out their plan of campaign.
"That d---l Middleton shall not have the seat," he raved, "not for Lord H----; no, nor for a hundred Lords! We shall see to it that every man's hand is turned against him when he arrives."
Lake unfolded a paper from his vest-pocket and smoothed it on the table. "Here are the expenses we should undertake."
    Doran          13l. 10s.
    Titwell         8l.  7s. 6d.
    St. Charles    25l.
V.122. Sample 2: Typical formatting issues of non-fiction
While non-fiction is not in principle any more difficult to format than fiction, many non-fiction books have lots of features like illustrations, tables, section sub-headings and footnotes, that require some extra work on the part of the producer. If the illustrations are essential, you should consider adding an HTML format file to allow you to present them.
See the page image nonfic.tif. This presents many formatting changes: the centered title will go to the left; the italicized chapter contents will become regular text, and the em-dashes will become "--"; the degree symbol needs to be replaced with ASCII "deg.", and of course we need to render the table readably. After all that, we have to deal with the footnote.
Here is a reasonable rendering of this page:
Strait of Magellan--Port Famine--Ascent of Mount Tarn--
Forests--Edible Fungus--Zoology--Great Sea-weed--
Leave Tierra del Fuego--Climate--Fruit-trees and
Productions of the Southern Coasts--Height of Snow-line
on the Cordillera--Descent of Glaciers to the Sea--
Icebergs formed--Transportal of Boulders--Climate
and Productions of the Antarctic Islands--Preservation
of Frozen Carcasses--Recapitulation.
An equable climate, evidently due to the large area of sea compared with the land, seems to extend over the greater part of the southern hemisphere; and, as a consequence, the vegetation partakes of a semi-tropical character. Tree-ferns thrive luxuriantly in Van Diemen's Land (lat. 45 degrees), and I measured one trunk no less than six feet in circumference. An arborescent fern was found by Forster in New Zealand in 46 degrees, where orchideous plants are parasitical on the trees. In the Auckland Islands, ferns, according to Dr. Dieffenbach [82] have trunks so thick and high that they may be almost called tree-ferns; and in these islands, and even as far south as lat. 55 degrees in the Macquarrie Islands, parrots abound.
On the Height of the Snow-line, and on the Descent of
the Glaciers in South America.
[For the detailed authorities for the following table,
I must refer to the former edition:]
                                   Height in feet
  Latitude                         of Snow-line      Observer
  Equatorial region; mean result   15,748            Humboldt.
  Bolivia, lat. 16 to 18 deg. S.   17,000            Pentland.
  Central Chile, lat. 33 deg. S.   14,500 - 15,000   Gillies, and
                                                     the Author.
  Chiloe, lat. 41 to 43 deg. S.    6,000             Officers of the
                                                     Beagle and the
                                                     Author.
  Tierra del Fuego, 54 deg. S.     3,500 - 4,000     King.
In Eyre's Sound, in the latitude of Paris, there are immense glaciers, and yet the loftiest neighbouring mountain is only 6200 feet high. Some of the icebergs were loaded with blocks of no inconsiderable size, of granite and other rocks, different from the clay-slate of the surrounding mountains. The glacier furthest from the pole, surveyed during the voyages of the Adventure and Beagle, is in lat. 46 degrees 50 minutes, in the Gulf of Penas. It is 15 miles long, and in one part 7 broad and descends to the sea-coast. But even a few miles northward of this glacier, in Laguna de San Rafael, some Spanish missionaries encountered "many icebergs, some great, some small, and others middle-sized," in a narrow arm of the sea, on the 22nd of the month corresponding with our June, and in a latitude corresponding with that of the Lake of Geneva!
In this case, I made some decisions. I made the lines in the contents at the top a bit shorter than usual, to help them stand out. I decided to use the full word "degrees" rather than "deg." where I could, but not in the table, where I shortened the entries as much as possible while preserving the sense. Since I was using the full word "degrees", I decided to go the whole hog and use the word "minutes" for the minutes symbol as well (though the minutes symbol, a single quote, is in the ASCII set), since it seemed to make the text more readable than using the word degrees with the minutes symbol. I also made a choice about the table layout.
You might prefer different choices in some of these cases, and, as in our example of fiction above, there was more than one way to do it. However, this is a reasonable rendering.
What happened to the footnote? And how did it become [82] rather than the [1] of the original? In this case, I decided to put all footnotes at the end of the whole text, and renumber them accordingly. So the footnote on this page became number 82 in the overall text, and down at the end of the whole text, I would put:
[82] See the German Translation of this Journal; and for the other facts, Mr. Brown's Appendix to Flinders's Voyage.
I could also have transcribed this as:
. . . Forster in New Zealand in 46 degrees, where orchideous plants are parasitical on the trees. In the Auckland Islands, ferns, according to Dr. Dieffenbach [*] have trunks so thick and high that they may be almost called tree-ferns; and in these islands, and even as far south as lat. 55 degrees in the Macquarrie Islands, parrots abound.
[*] See the German Translation of this Journal; and for the other facts, Mr. Brown's Appendix to Flinders's Voyage.
if I chose to put each footnote with its own paragraph.
V.123. Sample 3: Typical formatting issues of poetry
Poetry is easy to format: just be sure to use a non-proportional font, and make it look as much like the text as possible. To avoid ragged-looking centering, left-align titles.
In a whole book of poetry, there is no need to leave an indentation before every line; unlike a verse lost in fields of prose, there is little danger that someone will wrap it by mistake.
Look at the image poetry.tif. On this page, we have an enlarged first letter to start each poem, and capitals following; we can remove all that. The titles are centered, so we will move them left.
There are line-numbers at every fifth line, and these are common in poetry, especially where footnotes reference lines. We will keep these out on the right-hand margin.
The third poem obviously intends the centering of its last lines in each verse as a feature, so we will keep that as best we can.
The resulting etext looks like:
Mistress Mary

Mistress Mary, quite contrary,
How does your garden grow?
With cockle-shells, and silver bells,
And pretty maids all in a row.

I met a traveller from an antique land
Who said: Two vast and trunkless legs of stone
Stand in the desert. . . . Near them, on the sand,
Half sunk, a shattered visage lies, whose frown,
And wrinkled lip, and sneer of cold command,              5
Tell that its sculptor well those passions read
Which yet survive, stamped on these lifeless things,
The hand that mocked them, and the heart that fed:
And on the pedestal these words appear:
'My name is Ozymandias, king of kings:                   10
Look on my works, ye Mighty, and despair!'
Nothing beside remains. Round the decay
Of that colossal wreck, boundless and bare
The lone and level sands stretch far away.

9 these words appear: in some editions : this legend clear.

The Rosary.

The hours I spent with thee, dear heart,
Are as a string of pearls to me;
I count them over, every one apart,
             My rosary.

Each hour a pearl, each pearl a prayer,                   5
To still a heart in absence wrung;
I tell each bead unto the end--and there
           A cross is hung.

Oh, memories that bless--and burn!
Oh, barren gain--and bitter loss!                        10
I kiss each bead, and strive at last to learn
          To kiss the cross,
          To kiss the cross.
V.124. Sample 4: Typical formatting issues of plays
Look at the image play.tif. Stage directions are indicated by italics and square brackets. We don't have to do much special work with this: lose the italics, but keep the square brackets. The setting for scene I, act II is also italicized, but without square brackets. If we wanted to emphasize this, we could use shorter lines or add square brackets, but it probably isn't necessary here. We're using 4 blank lines between acts and 3 between scenes, so we mark these accordingly. We leave one blank line between speeches. And following these simple conventions, we get:
JACK. There's a sensible, intellectual girl! the only girl I ever cared for in my life. [ALGERNON is laughing immoderately.] What on earth are you so amused at?
ALGERNON. Oh, I'm a little anxious about poor Bunbury, that is all.
JACK. If you don't take care, your friend Bunbury will get you into a serious scrape some day.
ALGERNON. I love scrapes. They are the only things that are never serious.
JACK. Oh, that's nonsense, Algy. You never talk anything but nonsense.
ALGERNON. Nobody ever does.
[JACK looks indignantly at him, and leaves the room. ALGERNON lights a cigarette, reads his shirt-cuff, and smiles.]
Garden at the Manor House. A flight of grey stone steps leads up to the house. The garden, an old-fashioned one, full of roses. Time of year, July. Basket chairs, and a table covered with books, are set under a large yew-tree.
[MISS PRISM discovered seated at the table. CECILY is at the back watering flowers.]
MISS PRISM. [Calling.] Cecily, Cecily! Surely such a utilitarian occupation as the watering of flowers is rather Moulton's duty than yours? Especially at a moment when intellectual pleasures await you. Your German grammar is on the table. Pray open it at page fifteen. We will repeat yesterday's lesson.
About problems with the printed books:
V.125. I found some distasteful or offensive passages in a book I'm producing. Should I omit them?
Please don't. Readers understand that books are works of their time and place, reflecting the opinions and prejudices of the people who wrote them, and the people they observed. We shouldn't try to pretend those prejudices out of existence. It may be, in a century or two, that our descendants are repulsed by our prejudices.
It is perfectly normal, for all kinds of reasons, not to want to produce a particular book, but producing one while deliberately removing passages is censorship, and is unfair to our readers.
If you find it too disturbing to handle the content, you can of course abandon the book, or pass it along to some other volunteer.
V.126. Some paragraphs in my book, where a character is speaking, have quotes at the start, but not at the end. Should I close those quotes?
Probably not.
When one character is making a speech that spans more than one paragraph, it is usual not to close the quotes until the speech is finished. This avoids confusion about whether the next paragraph is the same speaker or another: once a character has started speaking, there are no closequotes until the speech is finished. However, there are openquotes at the start of each new paragraph during the speech. This makes the quotes unbalanced, but it isn't a misprint; it's deliberate.
If this is not the case, if the same character is not continuing the speech in the next paragraph, then you may have found a typo in the book. [R.26]
V.127. The spelling in my book is British English (colour, centre). Should I change these to American spellings?
Stay true to the edition you have. And this applies the other way, as well: if you have an American edition of a work by an English author, please leave the spelling as it is.
V.128. I'm nearly sure that some words in my printed book are typos. Should I change them?
The first thing to be aware of is that typos in books are not as rare as most people think. You may never have noticed typos in your normal reading, but under the kind of scrutiny that a book gets while being produced for PG, they often do become noticeable. It's quite common to find anything up to ten typos in a book.
Before you decide it's a typo, though, check that the same word doesn't occur elsewhere in the book with the same spelling. Often, the words or spelling used by pre-20th Century authors may just not be familiar to you.
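Checking whether a spelling occurs elsewhere is easy to do with your editor's Find command, but if you prefer a list, a small script can count every word for you. The Python sketch below is just one way to do it; the function name word_counts is invented for this example, and it treats any run of letters and apostrophes as a word.

```python
import re
from collections import Counter


def word_counts(text):
    """Count how often each word occurs, ignoring case."""
    # Assumption: a "word" is any run of ASCII letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text.lower())
    return Counter(words)


book = "He was surprized. Surprized again, and not surprised."
counts = word_counts(book)
print(counts["surprized"], counts["surprised"])
```

A spelling that turns up several times is more likely to be the author's or printer's habit than a one-off slip.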
When you find something that you believe to be a typo, you have four options: pretend you didn't see it :-), change the typo and add a transcriber's note [V.97], change the typo without a transcriber's note, or leave the typo as it is and add a transcriber's note. If you are adding a note, do it at the top or bottom of the file; don't try to work it into the text, and don't use the [sic] convention, since the reader won't know whether the [sic] was added by you or an earlier publisher.
In general, it's safest to leave the typo in place and add a note at the end of the file, listing the words you believe to be typos; that is the least contaminating and intrusive method. When adding the note, you don't need to leave a mark in the main text. You can just say something like:
[Transcriber's Note: "haw" near the end of chapter 15 appears to be a misprint for "hawk".]
The danger in making changes is that you may be wrong, and we really don't want to corrupt the text. This is particularly so in some old books where archaic usages, now obsolete, may look downright wrong to modern eyes. Sometimes, though, a typo is just so blindingly obvious that it warrants immediate replacement. Even in these cases, conscientious people will sometimes add a note, something like:
[Transcriber's Note: in chapter 12, I have changed "he stood on the tock", to "he stood on the rock".]
V.129. Having investigated what looks like a typo, I find it isn't. Do I need to do anything?
Often in PG work, you come across an odd word or usage. Might be a typo; might not. You check it out, and find that it is deliberate: perhaps a word from local dialect that just happens to resemble a different word, perhaps the author is using an odd word or spelling to make a point with the language. Especially if it's an isolated incident, and especially if it's not obvious, you can add a transcriber's note to the end noting that the word is thus in your edition, and that it is probably right. This may prevent some well-intentioned converter from changing it.
It's rare that you will need to do this; you may encounter such a case only once in a hundred PG books, but it is an option.
V.130. Aarrgh! Some pages are missing! Do I have to abandon the book?
No. It happens more often than you might think, and we're quite used to dealing with it.
Finish the book, and ask other volunteers to help by finding another copy of the book to fill in the missing section. For something like this, you can try asking on the WebBoard or gutvol-d [V.12], or ask Michael Hart to put a note in the Newsletter asking for assistance. We can post the book incomplete, and put a Transcriber's Note [V.97] in the header asking any future reader who has a copy to fill in the gap.
V.131. Some words are spelled inconsistently in my book (e.g. sometimes "surprise", sometimes "surprize"). Should I make them consistent?
English spelling didn't really standardize until the start of the 20th Century (and even then it fractured; e.g. "standardize" vs. "standardise") and the further back you go, the more inconsistent it becomes. Shakespeare, for example, signed his own name with several different spellings.
Where your printed edition genuinely uses alternate spellings of the same word, you should preserve them.
Word Processor FAQ
W.1. What's the difference between an editor and a word processor?
An editor shows you the characters you type, exactly as you type them. It puts new-line characters in when you hit the Enter key, and only when you hit the Enter key. Its ultimate aim is to give you exact control of plain text. EDIT in DOS, Notepad in Windows, vi and emacs in *nix, Tex-Edit Plus and BBEdit Lite on Mac, are all editors.
A word processor, in addition to entering the characters, also lets you change the font, the size of individual words, and whether they are italic or bold. It doesn't generally want individual line-ends put in on each line; it just rewraps the text as you change it. Its ultimate aim is to print your document on paper with full formatting facilities. WordPerfect for MS-DOS and Windows, MS-Word for Windows and Mac, AbiWord for Windows and Linux, and Nisus Writer for Mac are all word processors.
W.2. Should I use an editor or a word processor?
For dealing with plain text, which is what PG is about, you might expect a text editor to have the edge, since the formatting features of word processors can get in the way of making a clean text.
However, if you use a word processor, and you ignore all of the layout and formatting that have to do with fonts and paper, it will work equally well. There are a few common problems associated with word processors mentioned below.
W.3. Which editor or word processor should I use?
The one you like best!
Any of them will do the job. Even the most primitive editors of 1971 will do the job. The most feature-bloated word processor of tomorrow will do the job. No editor or word processor affects in the slightest the "quality" of the text produced.
For PG purposes, therefore, the only difference between them all is how easy you find them to use, and what facilities they have for helping you, and those are decisions that only you can make.
If you already have a favorite editor or word processor, stick to it. If you don't, there's a huge selection available for you to consider, on any type of computer.
Sometimes, using a word processor, you may encounter some problems in saving your book as plain text. You have to figure out how to get it right just once, and then use that same method thereafter. If you have problems with this, ask other volunteers or one of the Posting Team for help.
W.4. How can I make my word processor easier to work with for plain text?
First, switch off everything called "Smart [something]" or "Automatic". Modern word processors commonly offer lots of typical typing support features: "Smart Quotes", "Auto Correct", automatically capitalizing the first word in each sentence, anything like that. By all means, leave on any informative highlighting of misspelled words or other errors that it offers, but switch off any feature that changes what you type without asking you. Older books contain text that doesn't sit comfortably with modern rules, and we don't want your word processor deciding what Chaucer really wrote!
Now, choose a non-proportional font, and apply it to the whole document. It's important to work in a non-proportional font, because you may have to line words up underneath each other and it is not possible to do this consistently in proportional fonts like Times or Arial.
If you work in Courier, size 10, 11 or 12, and your word processor is set for a normal page size, about 7 inches across excluding margins, then what you see in your WP is a pretty good approximation to how the text will look in PG plain text format. One formula, suggested by John Mamoun in the Volunteers' Voices section, is to Select All the text, choose Courier New font, 10 point size, and set the margins at 5.5 inches, then Save As "Text with layout".
W.5. What is the difference between proportional and non-proportional fonts?
A non-proportional, or "monospaced", or "typewriter" font, is one where all of the letters take up exactly the same amount of space on screen: a capital "W", a lower-case "i" and a space are all equally wide. The Courier family of fonts is commonly used for this.
A proportional font is one where each letter takes up just the amount of space it needs, so that a capital "W" is much wider than a small "i".
Unfortunately, the different sizes of the letters in different proportional fonts mean that it's not possible to line up letters consistently: a "W" may be equivalent to three "i"s in one proportional font, and to four "i"s in another. This means, for example, that it is not possible to use a proportional font to format plain text tables or poetry correctly: lining up the spaces and words using one proportional font will cause it to look skewed using another.
You should always look at PG texts in a non-proportional font, even if you prefer to work mostly using a proportional font, because readers and automatic converter programs will assume that you meant your text to be viewed using a non-proportional font.
W.6. I can't get words in a table or poem to line up under each other.
You are using a proportional font. You should always use a non-proportional font like Courier for PG work. Change the font of the entire document to Courier and try again.
About using Microsoft Word:
PG volunteers use many different word-processors, but Microsoft Word is the one we hear most queries and problems about.
W.7. I've edited my book in Word—how do I save it as plain text?
First, make sure that all text is using Courier or Courier New and is at the same point size (usually 10-12). Move your right margin so that you see roughly the right number of characters per line (usually 65-70). Then choose File / Save As and then choose the format "Text Only with Line Breaks". Save your file with the extension ".txt" to distinguish it from your Word format file.
After saving, open your text file using Notepad or some other simple text editor and look at the results. You should see a typical PG layout of the text: lines up to 70 characters long, a blank line between paragraphs and no indentation at the start of each paragraph. If so, you're done.
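If you would rather not check the layout by eye, a short script can do a first pass. This Python sketch is an illustration only, not an official PG checker: the function name check_layout and the default limit of 70 are assumptions taken from the description above, and it will also report poetry or tables that are indented on purpose.

```python
def check_layout(text, max_len=70):
    """Return (line number, problem) pairs for a plain-text file."""
    problems = []
    previous_blank = True  # treat the start of the file like a paragraph break
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > max_len:
            problems.append((number, "line longer than %d characters" % max_len))
        # The first line after a blank line starts a paragraph.
        if previous_blank and line.startswith((" ", "\t")):
            problems.append((number, "paragraph starts with indentation"))
        previous_blank = (line.strip() == "")
    return problems


sample = "Short line.\n\n  Indented start.\n" + "x" * 71
for number, problem in check_layout(sample):
    print(number, problem)
```

To run it on your own book, read the saved ".txt" file into a string and pass that to check_layout.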
W.8. Quotes look wrong when I save a Word document as plain text.
You may have left "Smart Quotes" on in Word options. This tells Word to use left- and right-slanted quote marks at the beginning and end of a quote instead of the plain ASCII straight quotes. When you save a document that contains these angled quotes as plain text, they come out as non-ASCII characters that look wrong on most editors and viewers. The solution is to turn off Smart Quotes in Word and/or replace the ones it has already created.
W.9. Dashes look wrong when I save a Word document as plain text.
When Word recognizes an em-dash as such, it may try to use a special character for it. This may appear as a black square, an empty box, or a funny accented letter when you Save As text and look at it in a different editor.
You can usually do a Find and Replace on this character either in Word or in another editor after Saving As text to change it to two dashes.
For those interested, the "funny character" is character 151 (97H), and is specific to Codepage 1252 [V.76].
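If Find and Replace is awkward in your editor, the same clean-up for both the quotes and the dashes can be scripted. The Python sketch below assumes the file was saved in Codepage 1252, as described above; the function name and the replacement table are made up for this example and cover only the curly quotes and the em-dash.

```python
# Codepage 1252 punctuation and the plain ASCII we want instead.
REPLACEMENTS = {
    "\u2018": "'",   # left single quote (byte 91H in Codepage 1252)
    "\u2019": "'",   # right single quote (92H)
    "\u201c": '"',   # left double quote (93H)
    "\u201d": '"',   # right double quote (94H)
    "\u2014": "--",  # em-dash (97H, character 151)
}


def to_plain_ascii_punctuation(raw_bytes):
    """Decode Codepage 1252 bytes and straighten quotes and dashes."""
    # Assumption: the file really is Codepage 1252; a few byte values
    # are undefined there and would make decode() raise an error.
    text = raw_bytes.decode("cp1252")
    for fancy, plain in REPLACEMENTS.items():
        text = text.replace(fancy, plain)
    return text


print(to_plain_ascii_punctuation(b"\x93Wait\x97no,\x94 she said."))
# prints: "Wait--no," she said.
```

Look over the result afterwards, since other non-ASCII characters, such as accented letters, are left untouched.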
W.10. I saved my Word document as HTML, but the HTML looks terrible.
Yes. Word is not unique in having this problem, but HTML saved from Word is the case we hear most about. Microsoft themselves offer a free plug-in to Word that saves the file in "Compact HTML", which is a bit better. You can fix it by hand, or you can use Tidy, a handy utility, which will do some clean-up on the HTML. If you're working with HTML, you really need a copy of Tidy anyway, because it's such a great way to do a check on the correctness of your HTML.
Tidy is also embedded in some Windows GUI tools, like Tidy-GUI, HTML-Kit and NoteTab.
Scanning FAQ
S.1. What is a scanner?
A scanner is a machine that makes an image, a picture of the page that is fed to it, and sends that image to your computer. It only makes an image, like a camera does; it doesn't turn that image into text.
S.2. What types of scanners are there?
The most common type of scanner, the kind you're likely to find in your local computer store, is a flatbed scanner. It has a glass bed usually a bit bigger than Letter paper size (or A4 if you live in Europe! :-) and most of the common models are optimized for typical office correspondence. One of these may cost anything from under $100 to $400, depending on its features, or you can pick them up cheaper second-hand. You use this by placing the paper or book face-down flat onto the glass, and scanning from there. This is the kind of scanner most commonly used by PG volunteers.
Some stores will call sheetfed scanners a different category. These are flatbed scanners with Automatic Document Feed (ADF), but they are fundamentally the same machine, and the ADF sheetfeeder unit may often be bought as an accessory to the flatbed scanner. Recently, a few sheetfed scanners have appeared that are very small, without a full flatbed, just a narrow strip that the paper rolls through. Avoid these for PG work; you often need to be able to scan the book flat.
Hand scanners, as their name implies, are much smaller, and typically very cheap, or even thrown in free. You use these by holding them in your hand and running them along the text like a brush. These are really not intended for PG work; you need a very steady hand movement to get them to scan a page of text into a readable image, and they shouldn't be considered as an option for a 400-page book; scanning and OCR is tough enough without that!
You can think of production scanners as industrial-strength flatbed scanners. The basic mechanisms are the same, but a production scanner will certainly have ADF (sheetfeeder), more features and speed, and be rated for very high volume scanning. Production scanners are used by publishers, businesses with high-volume paper processing needs, and print shops. This last is useful, because you may be able to get some scanning done by a print shop. It can't hurt to ask. If you're thinking about buying one of these babies (and who among us hasn't? :-), be sure you have $2000 or more to spend.
Drum scanners are mostly used by publishers for professional, high-quality artwork. The paper is placed on the surface of a drum that rotates past a fixed scanning head. The drum can be very large. Because the sensors don't have to move, the electronics and optics can be of higher quality, and produce very accurate, high-definition images. They are exactly what you would want for making professional quality scans of old movie posters, but they're expensive, and not very useful for scanning War and Peace to OCR.
Planetary scanners are a different breed to all the others. They are really not scanners at all, but a very high-end digital camera on a stand. You place the book face-up with the pages open, with the camera looking straight down on it. It takes a picture, and passes it on to the connected computer. Planetary scanners are ideal for old, fragile, valuable books that can't be exposed to the stress of normal scanning. They typically come supplied with specialized software, sometimes even their own dedicated computer, and they are very, very expensive: $20,000+.
S.3. Which scanner should I get?
For most people, the answer is simple. Unless you have a lot of money and are sure you will be scanning a lot of books, you should get a normal, consumer-or-office type flatbed scanner, with or without an ADF sheetfeeder.
Having decided that, you're faced with the question of which scanner to buy. More good news! The market in scanners is very competitive, and there are many top-line vendors all watching each others' features like hawks, eager to deliver the highest-spec machine they can. There are only a couple of critical factors in this decision; most of it is about getting the best buy.
For PG work, you really need an optical resolution no less than 300 by 300 dpi (dots per inch), and 600 by 600 is very desirable. Obviously, more is better, but it would be very rare to need more than 600 dpi for PG work. Pay no attention to the "interpolated" or "enhanced" resolution, where the software "guesses" what dots should fill in the gaps; you're only interested in the optical resolution. The good news is that it's very difficult to find modern scanners with a maximum optical resolution of less than 600 dpi, but if you're buying second-hand, you should check this out first.
You will also need a scanning surface on the glass big enough to place your book with two facing pages flat. Again, the good news is that it's very hard to find a flatbed whose scanning surface is too small for PG work, since these scanners tend to be designed to handle office paper, which is about the right size. Most flatbed scanners have scanning surfaces of about 8.5" by 11.5", and this is standard for PG work. If you're working on books with very large pages, you may need to resign yourself to scanning one page at a time, but buying a scanner with a big flatbed for these rare occasions will be much more expensive.
You must make sure that you get a scanner that will connect correctly to your computer. There are currently (mid-2002) three main types of connections commonly available: SCSI, USB, and parallel.
SCSI (Small Computer Systems Interface) is the highest-quality option, but it means that you need a SCSI card in your computer, and be willing to figure out how to install it. If you're already a SCSI enthusiast, you don't need to read further; if you're not, I suggest you avoid it unless you enjoy tinkering. Production scanners mostly require SCSI.
Parallel-port connections used to be common, as a cheaper, easier alternative to SCSI. Since the introduction of USB they have become rarer, but you will still see them for sale second-hand. These plug into your printer port, and don't require any further engineering skills.
Most new scanners hook up using a USB (Universal Serial Bus) interface, which is a no-muss, no-fuss "plug-in and go" option, but be sure, if you have an old PC, that it actually has a USB port and that your operating system supports it; some older Windows PCs and Macs may not. If your PC doesn't support USB, you should probably look at Parallel-port scanners.
By the time you read this FAQ, FireWire and USB 2.0 interfaces may also be common. For your purposes, these are like more advanced versions of USB. Just make sure that your computer has the right support to match the scanner.
If you're buying second-hand (and used scanners can be very cheap), make absolutely sure that you're getting the original software that came with the scanner, and that that software will work with your current operating system on your PC.
Having ensured that your choice of scanners passes these tests, you're now free to indulge your tastes for any extras you like. Color is nice, but rarely used, since we mostly transcribe older books that have no color printing. Higher resolutions are comforting to have, both since you may occasionally find them useful and because it shows that the optics are of higher quality than you actually need for your PG scans.
If you are nervous about your choice of scanner, or how easy it is to get one working, feel free to contact other PG volunteers for their opinions, as described in the FAQ "How do PG volunteers communicate?" [V.12].
S.4. What is ADF?
ADF stands for Automatic Document Feed, and it's just a jargon term for a sheetfeeder, where you put in a stack of pages to be scanned and go away while that's happening instead of putting in each page manually.
S.5. Should I get ADF?
That depends. Yes, ADF is a great idea, and can be a huge work-saver, and if you have the cash to spend, it may well be worth it. But ADF has a dirty little secret: like any other gizmo with moving parts, it occasionally jams. The sheetfeeders built into these low-cost machines are aimed at handling typical office paper straight from the laser printer: large, smooth, good quality, with perfectly-cut, perfectly-aligned edges. In your PG work, you will be dealing with hundred-year-old pages of various thicknesses and textures, usually much smaller than the sheetfeeder was designed to work with. And you will have to have cut the pages, and may leave ragged edges in doing so.
Under these conditions, you may find that paper often jams in your sheetfeeder, and it defeats the purpose if you have to stand over the scanner while it works, or if you end up having to lift the cover and use your scanner as an ordinary flatbed, or, worse, if your paper gets scrunched up as if a dog had been playing with it.
And of course, in order to feed the pages through, you will have to cut them out of the book, destroying it. (It may be possible, with the help of a bookbinder, to have the pages professionally cut, and later re-bound.)
With ADF, you probably won't actually scan much faster than scanning flat, but you won't have to keep turning over the pages during that time.
So when you're making that choice, think carefully. If money isn't a problem, or you do expect to be working with cut sheets, then go ahead and get a sheetfeeder; it's great when it works! But don't be disappointed when it doesn't work all the time.
S.6. What's a "TWAIN driver" and why do I need one?
A TWAIN driver is a piece of software that installs onto your Windows PC or Mac and controls your scanner from there. With any modern scanner, there will be a TWAIN driver included in its software package. Once installed, you shouldn't have to think about it again, or even know it's there.
A modern OCR package will usually find your TWAIN driver and use it to control the scanner. This is very handy. There may also be a small scanning package with your TWAIN driver, which will provide a screen where you can make fine adjustments to scanner settings, and start scans. You probably won't need this, since your OCR package will probably do it for you, but it may be useful for semi-manual control of the scanner.
Unix-based systems like Linux use SANE rather than TWAIN drivers.
S.7. How do I scan a book?
This depends on whether you have cut the pages out, or whether you are working with an intact book.
If you have cut the pages out, and you have an ADF, then you will obviously feed them through that.
If you don't have an ADF, there usually isn't much point in cutting the pages. Most modern OCR will recognize a "dual-page" or "two-up" scan, and, if yours does, then that's normally the best option. Scanning the uncut book, open and flat, is the most common scanning method used in PG.
Take the book and place it open, flat on the scanner glass. To fit both pages on the glass, you may need to position it lengthways, at 90 degrees to its natural angle. Most OCR software will recognize that the image has been rotated through a right-angle, and will correct it when it reads the text.
A common problem with scanning an opened book is "guttering", which happens when the spine of the book is not pressed flat enough, and the inside of each page, where it meets the spine, is curved against the glass. There's more about this, and an example, scan3, in the FAQ [S.17] "Why am I getting a lot of mistakes in my OCRed text?". To avoid guttering, make sure that the spine is held down throughout the scan. (Some people put a weight on the spine to hold the spine down on each scan; others just press their hand against it.)
Another common problem is light scattering, when too much light gets into the scanner. The scanner head detects light, and you want the only internal light source to be from the scanner itself, not ambient room light or sunlight. Scanners have covers, that are intended to be closed while scanning, for a controlled light level, but when you're scanning a book held open and flat, you can't close the cover fully. In a bad case, this can lead to a condition of the scan like overexposure of film, and you can see an example in scan4 of the FAQ [S.17] "Why am I getting a lot of mistakes in my OCRed text?". If this happens, just make sure that your room is dim while you scan; don't have a ray of bright sunlight bouncing around the inside of the scanner!
Occasionally, when scanning cut pages with very thin paper, you may get a shadow of the text on the other side showing through. If this happens, you can try covering the inside of the scanner lid, which is normally white, with a piece of black paper.
Many modern OCR packages will control the scanner automatically, and you may be able to set your OCR so that it does an automatic timed scan every, say, 30 seconds. This is a great timesaver, since you don't have to go back and forth between the scanner and the screen. Just set your timer, hold down the book for the scan, take the book up, turn the page, put it down again, and wait for the next scan to start. Set the timer for whatever interval you are comfortable with. Highly recommended, if your OCR or scanning package can do it.
By default, most scanners will always scan the entire area of the flatbed, but usually, your book will occupy only about half of it. Look for a setting on your OCR or scanning package which allows you to reduce the area that the head scans. Just scan enough to get the image of your pages. This makes the time for each scan and subsequent OCR recognition shorter, and in a really good case can cut your total scanning and OCR time in half.
Scanning all pages together is usually fastest, but you may prefer to scan each double-page, then correct it in your OCR package's editor, then scan the next. This is a more leisurely approach favored by some volunteers.
S.8. My book won't open flat enough for a good scan, and I don't want to cut the pages.
Well, then, you have a difficult choice to make, but you do still have several options:
You can accept a poor-quality scan, and spend a lot of time fixing up the guttering on the margins.
You can bite the bullet, and cut the pages.
You can type the book, or find a typist who will work on it for you.
You can find a print shop or bookbinder who will cut the pages professionally, and re-bind the book when you're done. You may even replace it with a fresh new binding that will give the book a new lease of life.
Take your choice.
Most books will open flat enough for an adequate scan, though you may have to put stress on the spine to do it.
If you have a really precious book, and you can't find a typist, you might consider the options of a digital camera [S.11] or finding someone with a planetary scanner [S.2] to scan it for you.
Michael Hart said: "I would give up every book I own, including my first edition of the OED, my Civil War edition of the Merriam Webster's Unabridged, etc., etc., etc., so everyone could use it any time they wanted rather than that only I or my friends could use it . . . and obviously I could use it too."
Fortunately, it rarely comes to that.
S.9. How long does it take to scan a book?
Putting the book flat on the glass means that you scan two pages at a time. A reasonable modern scanner will scan the area of two typical pages at 400dpi in anywhere from 20 to 40 seconds; let's call it 30 seconds for two pages. That's four pages a minute, or 240 pages an hour. You could reasonably get through a 400 page book in two hours, even allowing for an occasional break or glitch.
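If your own scanner is faster or slower than this, the same arithmetic is easy to redo. The Python sketch below just restates the sum above; the function names are made up for this example, and it counts pure scanning time only, with no allowance for breaks, page turning delays or glitches.

```python
def pages_per_hour(seconds_per_scan, pages_per_scan=2):
    """Pages scanned in an hour of non-stop scanning."""
    return 3600 / seconds_per_scan * pages_per_scan


def hours_for_book(total_pages, seconds_per_scan, pages_per_scan=2):
    """Pure scanning time for a whole book, in hours."""
    return total_pages / pages_per_hour(seconds_per_scan, pages_per_scan)


print(pages_per_hour(30))       # 240.0 pages an hour at 30 seconds per scan
print(hours_for_book(400, 30))  # about 1.67 hours for a 400 page book
```

Time a few scans on your own machine and plug in the real number of seconds to get an estimate for your book.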
Of course, you should also allow time for scanning a few trial pages with different settings before you start, to decide which settings to use. Ten minutes spent here can save you hours of proofreading time.
There are two big tips that can save you a lot of scanning time:
If your OCR or scanner control package has a timer setting, that automatically keeps scanning without user intervention, you can forget about the screen and just keep turning the pages as needed.
You should set your scanner just to scan the area the book covers on the glass. By default, your software will probably scan the full area of the glass, and usually, your book won't need that. By scanning only what you need, you may typically save anything from 20% to 70% of the time taken to scan the full area. If your book is small enough to open flat across the scanner instead of "down" the side, 400 pages an hour is not out of the question with this trick.
S.10. What scanner settings are best?
For a given book, scanner, PC and OCR software, there must be some "ideal" scanner settings, but if you change any of these components, the ideal scanner settings will change with them. Some OCR packages recognize greyscale better than black and white; some don't like greyscale at all. Some books have small print needing higher resolution; some are speckled so that higher resolution leads to more errors.
Obviously, the best settings also depend on the individual book, and some books will require you to get downright creative with the settings, but most PG books are scanned in Black and White or greyscale, somewhere between 300dpi and 600dpi.
This decision is a trade-off between speed and accuracy, and an illustration of the difference between principle and practice. In principle, a true-color, 9600dpi scan is a much better rendering of the page than a B&W 400dpi scan. In practice, all that extra information doesn't usually help the OCR make better distinctions between letters, and the larger and more detailed the scan, the longer it takes to make the scan, the more disk space the image file takes, and the more processing time and memory the OCR package needs to recognize it.
A further paradox emerges when considering higher vs. lower resolutions: depending on the paper and ink quality, you may see more errors start to appear on very high resolution scans. These are caused by small imperfections in the paper or ink spots that show up on the high-res scan, and that the OCR tries to interpret as letters or punctuation.
So, in summary, bigger is better, but only up to a point.
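To get a feel for how quickly the image size grows, you can work out the raw, uncompressed size of a scan. The Python sketch below is a rough illustration only: it assumes 1 bit per pixel for Black and White, 8 for greyscale and 24 for true color, uses the 8.5" by 11.5" scanning surface mentioned in [S.3], and ignores file headers, row padding and compression, so real image files will differ. The function name raw_scan_bytes is made up for this example.

```python
# Assumed bits stored per pixel for each scanning mode.
BITS_PER_PIXEL = {"bw": 1, "grey": 8, "color": 24}


def raw_scan_bytes(width_in, height_in, dpi, mode):
    """Approximate uncompressed size in bytes of one scanned image."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * BITS_PER_PIXEL[mode] // 8


print(raw_scan_bytes(8.5, 11.5, 400, "bw"))      # 1955000, under 2MB
print(raw_scan_bytes(8.5, 11.5, 9600, "color"))  # 27025920000, about 27GB
```

On these assumptions, the 9600dpi true-color scan holds nearly fourteen thousand times as much data as the 400dpi B&W one, for a page the OCR may read no better.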
Brightness is a setting often neglected, that can make quite a big difference to your results. Look at the scanned image: if you see lots of dark patches, make your scan lighter; if your letters appear thin and faded, make your scan darker.
See the FAQ [S.17] "Why am I getting a lot of mistakes in my OCRed text?" for some typical scans and results.
S.11. Can I use a digital camera in place of a scanner?
Digital cameras are getting better resolution all the time, and some volunteers have experimented with making a kind of home-made planetary scanner from a digital camera and a stand. So far, the results don't quite match a dedicated scanner, but as digital cameras improve, this may become a common option. One problem, which planetary scanners use specialized software to correct, is that the natural curve of the pages near the middle of the book tends to give a foreshortened aspect to the letters there, which can cause problems for OCR software, like guttering.
Whatever the current problems, the prospect of using digital cameras is exciting, because it will mean that non-typists will be able to produce old books borrowed from libraries without worrying about scan quality vs. damage to the spine.
S.12. What is OCR?
OCR stands for Optical Character Recognition. This is very important software that looks at the picture of the page that your scanner has supplied, and turns it into text.
When the scanner delivers the image of the page, that image is only a picture. You can't, for example, search for text in it, or edit the text to add a blank line. Your editor or word processor can't work with it. The OCR program does the job of "reading" and "typing" the image for you. OCR packages call this "reading" or "recognizing".
S.13. What differences are there between OCR packages?
One word: huge. All OCR packages do the same job, but they do it in different ways, with different features, and with different levels of accuracy. OCR can save you a lot of time, or cost you a lot of time. It's really worth putting some effort into making sure you get the right OCR package, and, once you have it, into understanding how to use it. It'll save you time in the long run.
S.14. How accurate should OCR be?
OCR packages commonly say that they are "99%+" accurate, or something like that. Let's analyze what that actually means: say there are 1,000 characters (letters) on each page, then with 99.9% accuracy, you would expect to have to make 1 correction per page. With 99% accuracy, that would be up to 10 corrections per page. And in a 400-page book, this all adds up.
But there's a "Your Mileage May Vary" clause built into that. Typically, the manufacturers test their OCR on fresh, laser-printed or press-printed copy with perfect scans, and this is fair, since they are aiming their products primarily at businesses that process these kinds of materials. You are not dealing with fresh print; you're dealing with old books, yellowed, spotted, marked, imperfectly printed in the first place, and possibly using unfamiliar fonts. And it's unlikely that you will have the patience to get a perfect scan on every page. The result is that the accuracy of OCR for typical PG work doesn't match the accuracy on images of perfect, fresh paper.
Apart from the scan quality, OCR also has to contend with different fonts and sizes for the letters.
However, if you're getting more than 10 errors per page, you should look at some examples of OCR in the FAQ [S.17] "Why am I getting a lot of mistakes in my OCRed text?".
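If you'd like to play with the arithmetic at the start of this answer, here is a minimal sketch in Python. The function name and its parameters are made up for illustration, and the 1,000 characters per page is the same round figure assumed above, not a measurement of any real book.

```python
def expected_corrections(chars_per_page, accuracy_percent, pages=1):
    """Estimate how many characters you would expect to have to correct."""
    # The error rate is whatever is left over after the claimed accuracy.
    error_rate = 1 - accuracy_percent / 100
    return round(chars_per_page * error_rate * pages)


if __name__ == "__main__":
    for accuracy in (99.9, 99.0):
        per_page = expected_corrections(1000, accuracy)
        per_book = expected_corrections(1000, accuracy, pages=400)
        print(f"{accuracy}% accurate: about {per_page} per page, "
              f"about {per_book} in a 400-page book")
```

Try it with your own page size and the accuracy you actually see; the point is only that a small-looking percentage turns into a lot of corrections over a whole book.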
S.15. Which OCR package should I get?
The accuracy of OCR software has improved enormously in the last few years, and OCR technology looks likely to keep improving even faster than software in general. Further, there is competition in this area, and products leapfrog each other with new versions regularly. The brands most commonly mentioned by PG volunteers (mid-2002) are Abbyy, OmniPage and TextBridge [P.1], and trial versions of all three have been available for download over the Web, and may still be when you read this. [Warning: these are big downloads, 40MB or more.]
Most common OCR packages will offer two main working options: to scan a page and view/edit the resulting text on the spot before saving, and to scan a whole batch of pages together and view/edit them all later. Some people like to fix up one page at a time; others prefer to get all of the OCR work done at once, then get the whole text into their editor. Most OCR software will cater for both, and if this is important to you, you should check that the OCR you're buying supports the way you want to work.
If you intend to work in a language other than English, make sure that the OCR you buy supports the characters in your language.
Some OCR software has a "training" or "learning" mode. Using this mode, it scans and "reads" or "recognizes" a page, then you correct that page, and the OCR "learns" from its mistakes and tries to do better on the letters it misread when it recognizes the next page. If you're dealing with a very rare font, this can make a difference to your OCR quality, but modern OCR packages come with enough inbuilt font knowledge for most languages, and you probably won't need this.
If possible, try a couple of OCR packages before you decide. If you want opinions on specific versions, contact other PG volunteers and ask for their opinions, as described in the FAQ "How do PG volunteers communicate?" [V.12].
S.16. What types of mistakes do OCR packages typically make?
Each text has its own peculiarities, but there are a number of well-known scanning errors you will be dealing with all the time.
Punctuation is always a problem. Periods, commas and semi-colons are often confused, as are colons and semi-colons. There are also usually a number of extra or missing spaces in the e-text.
The problem of quotes can assume nightmarish proportions in a text which contains a lot of dialog, particularly when single and double quotes are nested.
The numeral 1, the lower-case letter l, the exclamation mark ! and the capital I are routinely confused, and often, single or double quotes may be mistaken for one of these.
Lower-case m is often mistaken for rn or ni.
The letters h and b and e and c are commonly mis-read, and these are probably the hardest of all to catch, since ear/car, eat/cat, he/be, hear/bear, heard/beard are all common words which no spell-checker will flag as problems.
For example:
" Hello1' caIled jirnmy breczily. 11Anyone home ? "
There seemed to he no-oneabout. Only tbe eat beard him."
should read:
"Hello!" called Jimmy breezily, "Anyone home?"
There seemed to be no-one about. Only the cat heard him.
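Some of the confusions in the example above can be caught mechanically. The following is a rough sketch in Python, not a tool that PG uses: the function name and both rules are my own assumptions, picked to match the 1/l/I/! confusions described in this answer. It flags words that mix letters with digits, and words with a capital letter straight after a lower-case one.

```python
import re

# Hypothetical rules for illustration only.
HAS_LETTER = re.compile(r"[A-Za-z]")
HAS_DIGIT = re.compile(r"\d")
INNER_CAPITAL = re.compile(r"[a-z][A-Z]")  # e.g. the "aI" in "caIled"


def suspicious_tokens(text):
    """Return the words in text that look like typical OCR misreads."""
    flagged = []
    for token in text.split():
        # Drop surrounding punctuation so that "Hello1'" is tested as "Hello1".
        word = token.strip("\"'.,;:!?")
        if not word:
            continue
        mixed = HAS_LETTER.search(word) and HAS_DIGIT.search(word)
        if mixed or INNER_CAPITAL.search(word):
            flagged.append(word)
    return flagged


if __name__ == "__main__":
    sample = "\" Hello1' caIled jirnmy breczily. 11Anyone home ? \""
    print(suspicious_tokens(sample))
```

Note what it misses: "jirnmy" and "breczily" need a spell-checker, and "he", "eat" and "beard" in the second line of the example are real words that no rule of this kind will notice. It will also complain about legitimate words like "1st", so treat the output as a list of places to look, not a list of errors.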
S.17. Why am I getting a lot of mistakes in my OCRed text?
If you're new to OCR, you may have come with the idea that OCR is almost perfect, and just makes a few mistakes now and then. No. It's slightly amazing that OCR works at all, and when it does, it isn't perfect.
You might reasonably expect to average anything up to 10 errors per page for typical PG work; if you're seeing more, then there is a problem with
a) your printed book
b) your scan, or
c) your OCR package
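If you want to know where you stand against that figure of 10 errors per page, one rough way is to correct a page by hand and then count how much you had to change. Here is a sketch in Python using the standard difflib module; the function name is made up, and the count is only approximate, since difflib does not promise the smallest possible set of changes.

```python
from difflib import SequenceMatcher


def count_corrections(ocr_text, corrected_text):
    """Rough count of characters changed, added or removed during correction."""
    # autojunk=False stops difflib from ignoring very common characters
    # in long pages.
    matcher = SequenceMatcher(None, ocr_text, corrected_text, autojunk=False)
    total = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Count the longer side of each changed stretch.
            total += max(i2 - i1, j2 - j1)
    return total


if __name__ == "__main__":
    raw = "Only tbe eat beard him."
    fixed = "Only the cat heard him."
    print(count_corrections(raw, fixed))
```

Run it on the raw and corrected text of a few sample pages before you commit to a whole book; if the numbers are consistently well above 10, read on.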
Problems with the printed book fall into three categories: bad printing, age, and unusual fonts. Bad printing consists of problems like too much or too little ink on the press at the time the book was printed, and irregularities in the print where the metal type was damaged. Age causes yellowing, even browning, of the paper, and faded print. Unusual fonts may be hard for OCR to recognize, and very tightly-spaced print may make adjacent letters seem to touch, which confuses OCR software.
There are many ways for you to have problems with your scan. Obviously, if your scanner is defective or the glass is dirty, you will notice it immediately, but there are many mistakes you can make that will result in a poor-quality image, and cause later problems for your OCR.
You may not be able to control the quality of the paper you have to work with, but there is a lot you can do about the quality of your scan.
The two mistakes that people inexperienced with scanners most commonly make are not holding the spine down firmly enough to get a flat image of the paper, and not setting the brightness correctly, or letting too much light get in. In your early scans, watch out for these problems.
First, if you haven't already, read the FAQ "How do I scan a book?" [S.7] and check that you're following the basic recommendations there.
Now let's look at some samples, and see the kinds of problems you might encounter.
A disclaimer about these samples: specific OCR packages are named, but you should not take these as a fair and comprehensive comparative review of the software. The object of this exercise is to show typical scanning conditions and problems, and the resulting OCR output. OCR packages have quite a range of variance within themselves, may work better on some texts than others, may improve with "training" or different settings, and I have even seen the same OCR package produce different text from the same image with the same settings! Further, since OCR quality is improving rapidly, and packages leapfrog each other in quality, the next version of a particular brand may be vastly better than any of the software mentioned here. Of particular interest in this context is the leap in quality between OmniPage 10 and OmniPage 11.
* * * * *
Scan 1—A perfect Scan
Scan1 is as near to a perfect scan as you can expect in PG work. It comes from "The Founder of New France" by Charles W. Colby. It is only a 300 dpi image, but given the quality of the print and of the scan, 300 dpi is all we need. Ironically, it comes from Gardner Buchanan, who complains about the age and infirmity of his scanner in his description of how he produces a text. The moral is that you don't have to have the latest equipment to get good results!
The actual scan is in the image file scan1-3.tif
It doesn't really need any comment, and all of the packages except gocr rendered it perfectly. Note the fake "space" before the semicolon: if you look closely at the image, you will see why the OCR packages mistook it for a full space, as discussed in the FAQ [V.104] "My book leaves a space before punctuation like semicolons, question marks, exclamation marks and quotes. Should I do the same?"
Champlain was now definitely committed to the task of gaining for France a foothold in North America. This was to be his steady purpose, whether fortune frowned or smiled. At times circumstances seemed favourable ; at other times they were most disheartening. Hence, if we are to understand his life and character, we must consider, however briefly, the conditions under which he worked.
gocr 0.3.6 converted this as:
Champtain was now definitely committed to the task of gaining for France a foothotd in _orth America. This was to be his steady purpose, whether fortune frowned or smiled. At times circumstances seemed favourable ., at other times they were most disheartening. _ence, if we are to understand his life and character, we must consider, however brieRy, the conditions under which he worked.
* * * * *
Scan 2—A Typical Scan
Scan2 is a paragraph from Baroness Orczy's "Castles in the Air". Notice the ink-splotch above the capital "I" in the first line, which will give our OCR some problems. The page is also unevenly inked elsewhere, and I have scanned it with the brightness level a bit too high.
I have made two separate scans, one at 300dpi and one at 400dpi, both Black and White, named scan2-3.tif and scan2-4.tif respectively. The page was cleanly cut, and carefully placed straight onto the scanner glass with the cover down. The original print is somewhere between the size of Times New Roman 10 and 11, with capital letters about 2.2 millimeters high, but better and more clearly spaced. These scans are fairly typical for PG work. Because of the relatively large letters, and the reasonable scan, there isn't much difference between the text produced from the 300 dpi scan and the 400 dpi scan.
I actually cut this book to get the pages out so that I could feed it through my ADF, but the paper is so thick and textured that it sticks together, and jams when feeding through. The thick, absorbent paper, combined with the uneven inking, means that, no matter how good the scan, any OCR has to contend with the irregular edges of letters, which are clearly visible even at 300dpi.
Here is the output for these scans from some OCR software packages. I changed just one thing: Abbyy recognized the em-dashes as such, and output them as a special character in Codepage 1252 for em-dashes, which isn't available in ASCII, so I converted that to the PG standard 2 dashes.
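The conversion just described can be done by hand in an editor, but if you prefer to script it, here is a minimal sketch in Python. The function name is hypothetical, and it assumes that the OCR package saved its text as Codepage 1252, where the em-dash is the single byte 0x97.

```python
def to_pg_dashes(raw_bytes):
    """Decode Codepage 1252 text and turn each em-dash into two hyphens."""
    text = raw_bytes.decode("cp1252")
    # U+2014 is the em-dash; "--" is the plain-ASCII substitute.
    return text.replace("\u2014", "--")


if __name__ == "__main__":
    sample = b"five thousand francs\x97a goodly sum in those days, Sir\x97was"
    converted = to_pg_dashes(sample)
    print(converted)
    # This raises UnicodeEncodeError if any non-ASCII character is left over.
    converted.encode("ascii")
```

The final encode step is a cheap check that nothing else outside ASCII, such as curly quotes, is still hiding in the text.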
Abbyy FineReader 6:
Yes, indeed, I was on the track of M. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I had also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain %vas seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs--a goodly sum in those days, Sir--was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for many a day.
Yes, indeed, Twas on the track of M. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I had also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs--a goodly sum in those days, Sir--was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for many a day.
gocr 0.3.6:
__e_, indeed, f___as on_the track of h_. hristide Fournier, 3nd of one of the most im__ant hau1s of enem)_ goods ___hich had e__er been made in France. h?ot onl3_ that. I had a1so before me one of the most brUtish crimînat_s it h__4 e___er been m31 misfortune to co_me acro__3. A bu113_, a tiend oí cruelt__. In very truth m3_ fertiIe brain ___as s_e_1_::_g __-ith planS for e__entua113_ _ay:ng that abominab1e ru_iin b.__ t1_e hee1s . hanginig __ou1d be a n_erciful pun- i;__,i__gnt íor such a miscreanf. yes, in_i__ee3, fj_1e thou3and francî-a b_ood13_ sum in those days, _ir-vas practica1l3
a3_ured me. _ut o___er and above n_ere lucre there was the certaint_v that in a few_ da3_s' ti_e I shou1d see the lib_ht of gratitude shininb_ out of a pair _f _usLtrous btue e3_e3_, and a ___inning smi1e chasing a__ay the Ioo_ of _ear and of sorrow from the s__eetest iace T had Seen fof man)_ a day.
Yes, indeed, f___as on the track of h__. Ariseide Fournier, and of one of the most important hau1s _f enemy goods ___hich had ever been made in France. NoEUR on1y that. I had also before me one of the most brutish crimina1s it h_ad ever been my misfo__tune to come acros__. A bu11y, a fiend of crue1ty. _n very truth my fertib brain vas seeî3:i_g __ith plans for e__entua11p 1aying _at abom_in_ ab1e ru_an by the heels. hanging _____ou1d _ a merciful pun- iï_h_ment for such a miscreant. Yes, indeed, five thou__and f_ancs-a b_ood1y sum in those days, _ir-_vas practica1ly a3îured me. But over and above mere _ucre th.ere was th_e certainty that in a few days' ti_e _ shou1d see the 1i__t of gratjtude shining out of a pair o_, _userous b1ue b . e__es, and a __inning smi1e chasing away the l_k of _,ear and of sorrow from the s___,eetest face _ ad .een o many a day. . .
Recognita Standard 3.2.7AK:
~'es, indeed, ~w-as on the track of ltT. Aristide Fournier, and of one of the most important hauls of enemy goods "=hich had ever been made in France. ~Tot only that. I ha~i also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully-, a fiend of cruelty. In very truth my fertiIe brain was s; ething w-ith plans for eventually iaying that abominable ruffian by the heels : hanging ~-ould be a merciful pun- ishment for such a miscreant. ires, indeed, five thousand franes-a goodly sum in those days, Sir-was practically as~ured me. But over and above mere lucre there was thP certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous btue ey·es, and a winning smile chasing away the hk of fear and of sorrow from the sweetest face I had seen for many a day.
Yes, indeed, l~was on the track of h~i. Aristide Fournier, and of one of the most important hauls of enemy goods w~hich had ever been made in France. lVot only that. I had also before mP one of the most brutish criminals it had ever been my misfortune to come acrass. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for ez~entually laying that abomin_ able ruffian by the heels : hanging ~~.-ould be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand f:ancs-a goodly sum in those days, Sir-was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should~ see the Iight of gratitude shining out of a pair of iEustrous blue eyes, and a w inning smile chasing away the Iook of fear and of sorrow from the s"-eetest face ~ had seen ~'or rr~any a day.
OmniPage Pro 10:
Yes, indeed, twas on the track of 11T. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I ha(i also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs-a goodly sum in those days, Sir-was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for many a day.
Yes, indeed, fwas on the track of h-I. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I had also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs-a goodly sum in those days, Sir-was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for many a day.
OmniPage Pro 11:
Yes, indeed, twas on the track of AT. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I had also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs-a goodly sum in those days, Sir-was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for many a day.
Yes, indeed, fwas on the track of h-I. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I had also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs-a goodly sum in those days, Sir-was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for many a day.
Textbridge Millennium Pro:
Yes, indeed, rwas on the track of M. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I hail also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs-a goodly sum in those days, Sir-was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for many a day. - - -
Yes, indeed, f was on the track of M. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I had also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs-a goodly sum in those days, Sir-was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for manyaday. -
* * * * *
Scan 3—Guttering and Smaller Print
Scan3 is a paragraph from "The Egoist" by George Meredith. It was scanned in a dim room, with the scanner cover open and the book held open, flat against the scanner glass. However, the spine was not pressed firmly enough against the glass, and as a result you can see that the words on the left-hand edge (which were near the spine) appear to be slanted, a bit distorted, and not well lit. This problem is familiar to people who scan for PG: everybody gets distracted sometimes, and fails to keep enough pressure on the spine. As you see from the results below, it caused problems for all of the OCR packages on the words affected. If you find this kind of "guttering" regularly in your own scans, where the characters near the spine are not being recognized correctly by your OCR, you need to make sure that your book is down as flat as possible before making a scan. Because of the smaller size and the guttering problem, the 400dpi scan made for better quality text in this case.
Here's the output from the sample OCR:
Abbyy FineReader 6:
NEITHER Clara nor Vernon appeared at the mid-day table, n Middleton talked with Miss Dale on classical matters, like a good-natured giant giving a child the jump from stone to stone across a brawling mountain ford, so that an uncdified audience might really suppose, upon seeing her over the difficulty, she had done something for herself. Sir \Villoughby was proud of her, and therefore anxious to soltlo her business while he was in the humour to lose her. He hoped to finish it by shooting a word or two at Vernon before dinner. Clara's petition to be set free, released from him, had vaguely frightened even more than it offended hia nrido.
NEITHER Clara nor Vernon appeared at the mid-day table. Dr. Middleton talked with Miss Bale on classical matters, like a good-natured giant giving a child the jump from stone to stone across a brawling mountain ford, so that an unedified audience might really suppose, upon seeing her over the difficulty, she had done something for herself. Sir "VVilloughby was proud of her, and therefore anxious to settle her business while he was in the humour to lose her. He hoped to finish it by shooting a word or two at Vernon before dinner. Clara's petition to be set free, released from him, had vaguely frightened even more than it offended his pride.
gocr 0.3.6:
__,,,____,_ Cl,_I._c nor Vernon a__e_Ped _t tl_le _id_da_ tab1e_ _, ii(__etoiI f,,_lk(;cl with _MiSs _ale _U_1d_ abS8iG_l I_i_t_t_l.__ i,i,;, .,, _(u-i,L_t_ii.e(l 6iiLIblt 6'7_V. ill_ _ C 'll . tf e__Ul__b rU_l gt(),ii_, tu _fj(),I(, ,_uruSS.,__ T__ Illl_ g UlOUUt_lU o_ 8O .t ' t_ail u,,_,_ifj(;il ;,_i((ic,IGG l_i_' lt re_ y 8UE)OB'_ U_Oll 8eelll6 lttr _,__i. t_ic (li__icu1ty, SIIe t1_d iluI_e 8ol_eth_ng_ fo_ be_.Self. _i__ _ji___()_i___lIl)y w,,s prui_il of heT_ and k__eTefope an_iouS to ((.__u l___i. i)i__, ii,ess wIlile he Wa8 in the hU_ouT to luse Iier_ j__ l_())((l t() tiiIish it b_ ShOOtiltg a WOTd o__ t_O &t Verno_ _o__(),__ (li,iIci. Cl__T_'S _eti_tio_ tO be Set fTee_.Te1ea8ecl fro_ )ii))),, lIL_Ll v_b__uely f_.ighteUe eVen OTe kba lt OfEe_ded hi_ pi_i..(l_u- . _ , , —.___ ,- - -__-
________ Cl__i.a nop Vernon appeared &t t'h_e _id_day t__le_ D_. _id(lle_oi_ t_lked with Miss ale ,on _Ssi__l __i tt_r_'_ iij_e _ 6ood-n___tLi_.ed 6iai_t 6_i_ing & Ghild the ___np '.on_ _tune to _tone aGro_S a braWlin( __ inOU__taiß foPd So t2_at a__ u__p,(_ified ___idiei_Ge ni62it real y 8uppO.8e upon _seeii_6 l_e_ o______ the difhculty_ she had done _o_neth_n6 fop ber_elf_ i _viljoli____k)y w__s proud of heT, and the_efo_e an_iouS to ___.tle li__i. i)u__inesS Whike he W_S î_ the hum'ou_ to_ lose her_ __e l_op(d to finish it by 8hooting a wopd o tWo ak Verno__ _ eforR _(in_icr_ Clara's petition to _ Set free, releaSed fro )ii__, h_d va6uely frigbte_ed eve_ _ore tban it o_e_ded hiD pi.icle. -. - - - - - '
Recognita Standard 3.2.7AK:
~rFr~rrmx Clara nor Vernon apneared at the mid-da~'table.
Dr. bLidrlleton talkc;d Miss Dale vn elassieal matters,
like a ~n~a-mZtured giant gi.ving a child th© jucnp frvm
stonc to stone across a brawling mounta,in ford, so that au
uiicilificd .ruciicucc mil;·ht really suppasc, upon seeixig hor
·n~er thc ciillicul.ty, she had clouo something for herself. Sir
~Villcm;;lrlry wvs proua of her, and therefors angiaus to
sct.tla lrur tn~sincss while he was in the humoar to lose her.
lle lu,hcot to iinish it by shooting a word ar two at Vernon
bol'ore ~linncr. Clara's petition to bo set froe, released £rom
JGGnt., hvd vagucly frighteued even more than it offended hia
NEITfi~R Clara nor Vernon appeareci at the xnid-day table. Dr. Middleton talked with Miss Dalo on classics,l rnatters', like a good-natured giant giving a child the jtimp from stone to stone across a brawling mountain ford, so that an unedified audience might really suppose, upon ~ seeing her over the difficulty, she had done something for herself. Sir yillon ;hby was proud of her, and therefore anxiotis to scttle luer business while he w~as in the hurxiour to lose her: He hoped to finish it by shooting a word or two at Vernon before dinner. Clara's petition to be set free, released from jcLm, had vaguely frighteued even more than it offended his pride.
OmniPage Pro 10:
NF r~rn,Px Clara nor Vernon appeared at the mid-dap table.
Dr. Middleton talked with Miss Dale on classical matter,
like .t good-natured giant giving a child the jump from
stone to stone across a brawling mountain ford, so that an
uneVified audience might really suppose, upon seeing her
over the difficulty, she had done something for herself. Sir
jV;llo,r;;lrl>y was proud of her, and therefore anxious to
set.tlo lror Uusiness while he was in the humour to lose her.
Ile. lropcol to finish it by shooting a word or two at Vernon
bol'ore dinner. Clara's petition to beset free, released from
)zinc, had vaguely frightened even more than it offended his
NEITHER Clara nor Vernon appeared at the mid-day table.
Dr. Middleton talked with Miss Bale on classical matters',
like a good-natured giant giving a child the jump from
stone to stone across a brawling mountain ford, so that an
unedified audience might really suppose, upon ~ seeing her
over the difficulty, she had done something for herself. Sir
yillou ;hby was proud of her, and therefore anxious to
settle her business while he was in the humour to lose her.
He hoped to finish it by shooting a word or two at Vernon
before dinner. Clam's petition to be set free, released from
him, had vaguely frightened even more than it offended his
OmniPage Pro 11:
NF f,rnMR Clara nor Vernon appeared at the mid-day table.
Dr. Middleton talked with Miss Dale on classical matters,
like .t good-natared giant giving a child the jump from
stone to stone across a brawling mountain ford, so that an
une(lifie(l audience might really suppose, upon seeing her
over the difficulty, she had done something for herself. Sir
jVillon;hl)y was proud of her, and therefore anxious to
setale leer business while he was in the humour to lose her.
lle hoped to finish it by shooting a word or two at Vernon
bofore dinner. Clara's petition to beset free, released from
)lint, had vaguely frightened even more than it offended his
-.2 ..1_ - ____
NEITHER Clara nor Vernon appeared at the mid-day table. Dr. Middleton talked with Miss Dale on classical matters', like a good-natured giant giving a child the jump from stone to stone across a brawling mountain ford, so that an unedified audience might really suppose, upon,seeing her over the difficulty, she had done something for herself. Sir Willoughby was proud of her, and therefore anxious to settle her business while he was in the huniour to lose her. Il"e hoped to finish it by shooting a word or two at Vernon before dinner. Clara's petition to be set free, released from hint, had vaguely frightened even more than it offended his pride. - -
TextBridge Millennium Pro:
NErr'!'~~ Clara nor Vernon appeared at the table. pr. ~1id(lIeto11 talked with Miss Dale on classical matters, like a good-natured giant giving a child the jump from stone to stone across a brawling mountain ford, so that au ~1edifi~ tLU(llCIlCC might really suppose, upon seeing her over the (hjiheulty, she had done something for herself. Sir wiflouighby was proud of her, and therefore anxious to settle her business while he was in the humour to lose her. lie ho1)ed to finish it by shooting a word or two at Vernon before dinner. Clara's petition to be set free, released from him, had vaguely frightened even more than it offended his prú~t~.
NEITHER Clara nor Vernon appeared at the mid-day table. Pr. Middleton talked with Miss Dale on classical matters, like a good-natured giant giving a child the jump from stone to stone across a brawling mountain ford, so that an une(lified audience might really suppose, upon - seeing her over the difficulty, she had done something for herself. Sir Willoughby was proud of her, and therefore anxious to settle hier l)uSifleSS while he was in the humour to lose her. lie hoped to finish it by shooting a word or two at Vernon before dinner. Clara's petition to be set free, released from hirn~, had vaguely frightened even more than it offended his pri(le.
* * * * *
Scan 4—A Really Bad Case!
Scan4 is a paragraph from Pope's translation of Homer's "Odyssey". This is a very, very tough one. It was obviously a cheap printing to begin with, using thin, poor-quality paper in a page size of 6" by 4.5", with capital letters about 1.5 mm high, a little bigger than Times New Roman size 8. Text this small really needs a higher-resolution scan. The book was falling apart when I got it, the ink was fading and flaking, and there was no point in even thinking about trying to scan it flat, so I cut the pages. To add an extra challenge, I scanned the sample with the cover open in a medium-lit room for the 300 and 400dpi scans, but closed the cover for the 600dpi to show the best quality I could possibly get. (I was pleased to note that Abbyy, while recognizing the page in the 300dpi and 400dpi images, flashed up a suggestion that I should lower the brightness of the scan.)
This particular book was one I sporadically tried to produce, without success, on an older scanner and a bundled OCR program over a period of two years, back in 98/99. Eventually, in 2000, it was the first book processed through Charles Franks' Distributed Proofreaders site. The initial text produced by the OCR was very poor, but the human volunteers made up for it! Thanks, guys! Today, just two years later, with a better scanner and better OCR, I could have done it myself, as you will see from the best of the results of the 600dpi scans. That's how much things have improved recently.
A separate point to note here is that you can see the "three-quarter space" effect before the exclamation mark and semi-colon that was discussed in [V.104].
The results of the OCR are:
Abbyy FineReader 6:
" Ah me ! on what inhospitable coast,
On new region is Ulysses toss'd ;
Possess'd by wild barbarians fierce in arms ;
Or men. whose bosom tender pity warms ?
What sounds are these that gather from the shores ?
The voice of nymphs that haunt the sylvan bowers,
The fair-hair'd Pryads of the shady wood ;
Or azure daughters of the silver flood ;
Or human voir-e? but issuing1 from the shades,
AVhv cease I straight to learn what sound invades?"
" Ah me ! on what inhospitable coast,
On what new region is Ulysses toss'd ;
Possess'd by wild barbarians fierce in arms ;
Or men, whose bosom tender pity warms '?
"What sounds are these that gather from the shores ?
The voice of nymphs that haunt the sylvan bowers,
The fair-hair'd Dryads of the shady wood ;
Or azure daughters of the silver flood ;
Or human voice? but issuing from the shades,
Why cease I straight to learn what sound invades?"
" Ah me ! on what inhospitable coast,
On what new region is Ulysses toss'd ;
Possess'd by wild barbarians fierce in arms ;
Or men, whose bosom tender pity warms ?
"What sounds are these that gather from the shores ?
The voice of nymphs that haunt the sylvan bowers,
The fair-hair'd*Dryads of the slrady wood ;
Or azure daughters of the silver flood ;
Or human voice? but issuing from the shades,
Why cease I straight to learn what sound invades?"
gocr 0.3.6:
[The 300 and 400 dpi scans produced nothing recognizable.
The result of the 600 dpi scan is below.]
'' _hh i_3e ! o_1 ___l_at_ i__l__sl__ it_nble CoaSt_ On ___l_,__ _)e_v i_e_io__ i__ ___ _._____ses toss'd ; _(3s3gs3_d l3.__ ___iiíi l3_3__b___i_c_i3_ fie_Ce in il__S- _ Or i11pn, __-i)c3se l_osonl te_1de_ _it____ __ai_n3__ ? ___l_at __o__i1ds Qre tlipse tliat g__tl_p_r fE_oi33 the shoTes ? '_ilie __oi__e of i)____ E1)l3l3s tl3nT 1i_n__nt the s__l__inn bo_Ye_5_ 3'l_e fni___i____ir'd _____-ads of' il_e sli__d__ i___oOd _ Op az(_pe da_____litc__s of _tlie sil __?r t1ood ; Or l___i31_nn ___)i___? l3__t i3____ii_6 fi_oi11 tlie __hiade__ _ __'!3.__ _ea___e _ s_rai__li.t to l_ar_i1- i_—li__t so_nd- in__ad_S___''
Recognita Standard 3.2.7AK:
.: lh nt"'. on w-hat inlu,;y:t, I,:e co;;~t, On ~cli^t ne~- re~ion i.. 1= 1-.-:.:e~ tm:'d ; Possea'd 1n- wil~l L;,rba~:c, .~ fierce in arm~ ; Or u.~u. w-Ln.e bossum tender pit~- warna'? ~l-u:lt .<,:~;;::;3s are tll~ce that ~atl:er from the shnre~ ? 'I'l.e -;;o'.re :,; nwtthil: tW ,t l:aa;nt the s~-l:c 1llJOR'er5, 'lhe :a,:~-h ~;r'd~It.wa~i~ ot' tl:e ~Il;;dv vood; Or az.lre dau~~l.ts~: oY tl:c ·:iv-~~r floo;:3 ; C?r humnn ~-<:i: e'? l,~:tt i~~; from tl:c· ~had~~, 11-lts- cea~e I ctrai rlit to learn ~s-l:, t socud incades %"
" ~h me ! ou "-Mat iuMospita~le coast,
On ~i-lmt ne~c reyion is L 1~-~ses to~s'd ;
Pos:e;s'd 1"~ w-iMl lrvrbaria:ns fiet~ce in arms ;
Or m~ n, "-hose hosom tender pit~- warm5 ?
~~~hat ~ounds are tlmse tMat ~;atMer from t:he shores ?
~t'I~e ~-oi~~e of n~-Inhhs t.hat liaunt the s~-l~~a n howers
Tlie fair-hnir'd D~ vads ot tl:e shad~- "-ood ;
Or aznre dau~liters of tMe sil~-~r fiood ;
Or lmman ~-oi:~e'? but iauin~ frotn the shades, a
lVly cea.~e I straibht to learn "-Mat souud in~ad°s?"
" Ah me ! on what inhospitable coast
On ~~-hat new r e~ion is L;1 ~-sses toss'd ~
Possess'd 1J~- "-ilil I:OII'uai'la ils fierce in arms_ ·
Or men, whose hosom tender pit~l ~varn~s ?
~'G'l~at somnds are these tliat ~atl~er from the shores ?
~I'Iie v oice of n~-mpl~S that ~munt the sy Ivan bowers,
Tlie fair -hair'd D~~~-ads of tl~e slmdy wood ;
Or azure daylltcrs of tlle silver flood ;
Or lm:nan voice? uut issL~ing from the shades,
~~'lm cea~e I strai~ht to Iearn ~~-lmt so~nd inv ades ?"
OmniPage Pro 10:
,. lh in- ' on "-hat inh-slit al.:e coast,
On "M.^t new reion is 1=1;-a:e~ to-s'd ;
P"::e:~'d hw "ild Larba.:an~ fierce in arms ;
Or inn. "-hnse bo.,om tender pity warms
What <m-,n ds are thFSe that gather from the shores?
'1-l.e vo,e o2 u~vnhit: thm hn,,-,nt The sylvan bowers,
The is ;r-ha;r'd h.-;-ads of the liz-Ay iNood
Or azure dau_ht;- of tl:c o=1 cr flooj ;
Or hnnmn wire? l,11t i—rii:g from the shadP3,
Al-ly cease I straiAlit to learn what sound invades?"
'Wh me ! on what inhospitable coast,
On what new region is L fusses toss'd ;
Possess'd br wild barbaric ns fierce in arms ;
Or men, whose bosom tender pith- warms
AN-hat sounds are these that gather from the shores ?
The voice of nymphs that Haunt the sylvan bowers,
The fair-hair'd IWvads of the shady -wood ;
Or azure daughters of the silver flood ;
Or human voice? bat iauina from the shades,
Why cease I straight to learn what sound invades?"
" Ah me! on what inhospitable coast,
On what new region is Ll ysses toss'd ;
Possess'd bv -wild barbarians fierce in arms ;
Or men, whose bosom tender pity warnis ?
AVlia± sounds are these that gatller from the shores
The voice of nYI11pliS that haunt the -sylvan bowers,
The fair -hair'd D.-yads of the shady wood ;
Or azure daughters of the silver flood ;
Or human voice? lout issuing from the shades,
Why cease I straight to learn what sound invades?"
OmniPage Pro 11:
.` lh in-' on what inhospital,le co-st,
On xclznt near region is t 1:-sse~ toss'(: ;
Possess'd bY Mild barbarians fierce in aims ;
Or inn. whose boson tender pity warms
What <m-,n ds are tlipse that gather from the shores ?
'_I-I.e 1-o=,- of nv:npii? that haunt the sylvan bowers,
She ra;r-ha;r'd 1):, ads of the shad- wood ;
Or az.ire dau_lit~- of tl:e silo-:-r flood ;
Or human voice? l,,tt i?snina from the shadpq,
Al-lry cease I straiAit to learn shat sound invades?"
''' :Ah me ! on what inhospitable coast,
On iyhat new region is Ulysses toss'd ;
Possess'd br wild barbarimis fierce in arms ;
Or men, whose bosom tender pity warms
AN-hat sounds are tliese that gather from the shores ?
The voice of nymphs that haunt the sylvan bowers,
The fair-hair'd D~ yads of the shady -wood
Or azure dau.L-hters of the silver flood ;
Or human voice? but issuing from the shades,
Why cease I straight to learn what sound invades?"
" Ah me! on what inhospitable coast,
On what new region is Ulysses toss'd ;
Possess'd by -wild barbarians fierce in arms ;
Or n1en, whose bosom tender pity warnis ?
AVliat sounds are these that gather from the shores
The voice of nyniplis that haunt the sylvan bowers,
The fair-hair'd Dryads of the shady Wood ;
Or azure daughters of the silver flood ;
Or human voice? but issuing from the shades,
Why cease I straight to learn what sound invades?"
TextBridge Millennium Pro:
no on what inhe~ptaEie coast, On what new realun is hivs,e' to5sd ,s~s Ä-~d liv wild lie il)~m.ihI fir see in al-rn~ Or u~,-n. w'linse bo,uuiu tender pity warnls Wl at ~ are t1ie~e that ~atler from the shores ? 'n.e a oro of imvntpirs tint he~nt the sad van bowers, 'flie tah'-ha~r'd D~vahs ct the shady wood 1)1' az Ire dauul~t ~ of tl,e shvr flood Or liunian vi i 'I ? h'tt is- eng from the shades, \VIiv cea-~e I straight to learn w hat sound invades 1"
Ah me on what inhospitable coast,
On what new region is U vases toss'd
Possess'd by wild barbarians fierce in arms
Or men, whose bosom tender pity warms ~
What sounds are these that gather from the shores?
The voi'e of nymphs that haunt the sylvan bowers,
The fair-baird Prvads of tl~e shady wood
Or azure daughters of the silver flood
Or human vuiae? but issuing fi'om the shades,
Why cease I straigl~t to learn what sound invades?"
Ah me on what inhospitable coast,
On what new region is Ulysses toss'd
Possess'd by wild barbarians fierce in arms
Or men, whose bosom tender pity warms?
What sounds are these that gather from the shores?
rfhe voice of nymphs that haunt the sylvan bowers,
The fair-hair'd Dtyads of the shady wood;
Or azure daughters of 'the silver flood
Or human voice? but issuing from the shades,
Why cease I straigl~t to learn what sOund invades?"
What can we conclude from this?
Small mistakes in scanning, like letting too much light in, getting your scanner settings wrong for the page, or not pressing the paper flat enough, can make a major difference to the final quality of the text that you will have to correct.
Sometimes, no matter what you do with your scanner, problems with the paper or the print will make it difficult for your OCR package to give good output.
Generally, bigger is better within the range 300dpi-600dpi, but you only need higher resolution with more difficult material.
Different OCR packages will produce widely differing texts from the same images. Given a really good image, most OCR software will work acceptably, but when you have lower quality material to work with, the gap between OCR packages shows clearly.
S.18. I got an OCR package bundled with my scanner. Is it good enough to use?
That depends on how well your package performs on the actual scans that you do, and how much you value your time vs. money. Most scanners are bundled with OCR software, but these OCR packages are often older or "brain-damaged" versions, with their functionality deliberately lowered. It's unlikely that you'll get a current-version, top-of-the-line OCR package thrown in for free.
You may have to pay extra for better OCR, but it means that you spend less time making corrections. The question is how much better you want your OCR to be.
Save the images from the FAQ "Why am I getting a lot of mistakes in my OCRed text?" [S.17] and try processing them with the OCR you have. Compare the quality of the text produced with the quality of the samples. This should give you some idea of how your OCR compares to others.
Try a few pages from your book with your OCR. How many mistakes do you see on each page? Do you find that acceptable?
S.19. I want to include some images with a HTML version. How should I scan them?
We don't often see color prints in our books, but if you do have one, then scan it in color. Otherwise, try both greyscale and B&W, and see which gives you the best image.
It's usually better to scan images in a higher resolution than you're going to use, and then use an image manipulation package to reduce them [H.10] to a size appropriate for your HTML file. An initial scan at 600dpi is often good. Image manipulation programs will also allow you to "clean up" the pictures, by increasing contrast, despeckling, or other filtering.
S.20. I want to include some images with a HTML version. What type of image should I use?
GIF, JPEG and PNG images are supported by current browsers, and you should stick with those unless you have a specific reason not to.
GIF and PNG tend to be more efficient (they provide better quality at a given file size) for simple line-drawings; JPEG is usually better for photographic images.
S.21. Will PG store scanned page images of my book?
No. Or, at least, not yet.
The idea has been kicked around a bit. There's no question of replacing etexts with page images, but many volunteers who have already scanned the book anyway like the idea of saving page images as well, both for general information and as a means of checking future correction suggestions against the original. Some volunteers already keep their page images, stored for possible future use.
Working some back-of-the-napkin figures: a page of text might take up 1KB of space on a computer as plain text or HTML or XML. The same page might take 70KB if stored as a black-and-white image, of just enough quality to serve as a reliable guide to making corrections. Pages with pictures, or stored with enough resolution to allow some future researcher to write a paper on the changing shape of serifs in the 18th and 19th centuries, would start at around 350KB per page, and go up from there.
A 300 page book thus becomes
    about 300KB as plain text (and around 150K zipped)
    about 20,000KB as minimal-quality images
    about 100,000KB as high-quality images
and with the images, we won't save much space on the zipping, because they're already compressed.
On a normal "56K" modem, getting about 4KB / second, it would take:
    75 seconds to download the text file (40 for the Zip)
    80 minutes to download the minimal images
    over 5 hours to download the high-res images.
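For anyone who wants to redo the napkin sums with their own numbers, here is a minimal sketch in Python. The per-page sizes and the 4KB per second modem speed are the rough estimates from the answer above; the function names are made up for this illustration. Note that the figures in the text are rounded (300 pages at 70KB is strictly 21,000KB, not 20,000KB).

```python
# Rough per-page sizes in KB, taken from the estimates in the text.
KB_PER_PAGE = {
    "plain text": 1,
    "minimal-quality images": 70,
    "high-quality images": 350,
}
MODEM_KB_PER_SEC = 4  # a "56K" modem, as assumed in the text


def book_size_kb(pages, kb_per_page):
    """Estimated size of a whole book in KB."""
    return pages * kb_per_page


def download_seconds(size_kb, kb_per_sec=MODEM_KB_PER_SEC):
    """Estimated download time in seconds at a steady transfer rate."""
    return size_kb / kb_per_sec


if __name__ == "__main__":
    for label, per_page in KB_PER_PAGE.items():
        size = book_size_kb(300, per_page)
        minutes = download_seconds(size) / 60
        print(f"{label}: {size} KB, about {minutes:.1f} minutes")
```

Change the page count or the transfer rate to see how quickly page images become practical on a faster connection.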
Someday, the disk and bandwidth capacities that we will take for granted will be such that uploading images, when we have them, will be quite natural, just for the few people who will want them. But we're not quite there yet.
Late flash! As of late 2002, the Internet Archive is providing space to volunteers for storing page images. To see the images, and find out more, go to <>
H.1. Can I submit a HTML version of my text?
H.2. Why should I make a HTML version?
Well, you can make one just because you want to, but for some texts there is a special reason to.
If you want to preserve the pictures that accompany the text, making a HTML version means that you can specify where and how those images appear.
If there is particular meaningful information in the layout of the text that can't be expressed in ASCII, like special characters or complex tables or fonts, HTML may offer an open format alternative.
H.3. Can I submit a HTML version without a plain ASCII version?
You can submit it, but the Posting Team will then consider whether we should also make an ASCII, or perhaps ISO-8859 or Unicode version of it. We really do want our texts to be viewable by everybody, under all circumstances, and we do not want to start posting texts that are in any way inaccessible to anyone.
See also the FAQ [G.17] "Why is PG so set on using Plain Vanilla
H.4. What are the PG rules for HTML texts?
1. The only absolute rule is that the HTML should be valid according to one of the W3C HTML standards.
You can verify that your HTML is valid at the W3C's HTML Validator at <>
For a more convenient and friendly, though less official, check of the correctness of your HTML, you should use Dave Raggett's Tidy program at <>, which not only points out any messiness in your HTML code, but also has some neat modes to clean it up and standardize the formatting.
After that, we have some requirements and recommendations. Compliance with the requirements might be waived if there is a really good reason to make an exception in a particular case.
2. Requirement: File names and extensions
If you want your text to work within 8.3 filename conventions, you may use .htm as the extension for your HTML files; otherwise, use .html as the extension. If you are working to 8.3 conventions, all of your images as well as your HTML files should have 8.3-compliant filenames.
All file names and extensions should be in lower-case throughout. Yes, we know this is not strictly necessary, but we don't want to have to correct every file that comes with "image.gif" referenced in the HTML accompanied by a file IMAGE.GIF.
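If you have a lot of files, a few lines of script can check the names for you. The following is only a sketch, not a PG tool: the function names are made up, and the set of characters allowed in an 8.3 name (lower-case letters, digits, underscore and hyphen) is an assumption chosen for this example.

```python
import re

# Up to 8 characters, a dot, then up to 3 characters.
# The permitted character set is an assumption made for this sketch.
_EIGHT_DOT_THREE = re.compile(r"[a-z0-9_-]{1,8}\.[a-z0-9]{1,3}")


def is_lower_case(name):
    """True if the file name contains no upper-case letters."""
    return name == name.lower()


def is_8_3(name):
    """True if the name is lower-case and fits the 8.3 pattern above."""
    return _EIGHT_DOT_THREE.fullmatch(name) is not None


if __name__ == "__main__":
    for name in ["image.gif", "IMAGE.GIF", "htmstep1.htm", "frontispiece.jpeg"]:
        print(name, is_lower_case(name), is_8_3(name))
```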
3. Requirement: HTML and plain-text
Project Gutenberg does publish well-formatted, standards compliant HTML. However, we insist that a plain text version be available for all HTML documents we publish (even if images or formatting are absent), except when ASCII can't reasonably be used at all, for example with Arabic, or mathematical texts.
4. Requirement: Archive format for posting
If the HTML book contains more than one file (including images), create a ZIP (preferable) or TAR archive containing all of the files in the book. The ZIP file may, if you wish, unzip to a subdirectory named for the book. For example, a book called 'The Humour of Mark Twain' might unzip in a directory called 'mthumor'. Make sure directory names contain only alphabetic and numeric characters, no spaces, and are 8 characters or less, even if you're not sticking to 8.3 conventions for filenames.
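Any ZIP program will do for this, but if you prefer to script it, here is a rough sketch using Python's standard zipfile module. The function names and the sample file names are hypothetical; the directory-name check simply restates the rule above (letters and digits only, 8 characters or less).

```python
import io
import zipfile


def is_valid_dirname(name):
    """Letters and digits only, no spaces, 8 characters or less."""
    return name.isascii() and name.isalnum() and len(name) <= 8


def build_zip(dirname, files):
    """Return the bytes of a ZIP that unzips into the directory `dirname`.

    `files` maps each file name to its contents as bytes.
    """
    if not is_valid_dirname(dirname):
        raise ValueError(f"unsuitable directory name: {dirname!r}")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(f"{dirname}/{name}", data)
    return buffer.getvalue()


if __name__ == "__main__":
    data = build_zip("mthumor", {"mthumor.htm": b"<html></html>", "cover.gif": b"GIF89a"})
    with open("mthumor.zip", "wb") as out:
        out.write(data)
```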
5. Recommendation: Simplicity
Make your HTML as simple as possible. HTML is an evolving standard, and one that may be completely obsolete in the long term. Use of advanced features may just mean that your version will be obsolete or unreadable that much faster.
6. Recommendation: Images
Images included with your HTML should be in a format that Web browsers can read: GIF, JPEG or PNG. Images should be edited for high quality in a reasonably small file size. Make the best decision you can concerning the image size and placement in the text. Every image included must be linked into (referenced by) the HTML.
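One way to check the last point, that every image you include is actually referenced, is to collect the src of each img tag and compare that with your list of image files. This is only an illustrative sketch using Python's standard html.parser; the class and function names are made up, and it looks only at img tags, not at images linked in other ways.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = set()

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for key, value in attrs:
                if key == "src" and value:
                    self.sources.add(value)


def unreferenced_images(html_text, image_files):
    """Return the image files that no <img> tag in the HTML refers to."""
    collector = ImgCollector()
    collector.feed(html_text)
    collector.close()
    return sorted(set(image_files) - collector.sources)


if __name__ == "__main__":
    page = '<p><img src="fig1.gif" alt="Figure 1"></p>'
    print(unreferenced_images(page, ["fig1.gif", "fig2.png"]))
```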
7. Recommendation: Line lengths
If it is reasonable to do so, try to wrap paragraphs of text at around the normal PG margin of 70 characters. Ideally, your HTML should be as near as possible identical to your text version except for the HTML tags and entities. People who open your HTML won't all be using browsers, people will need to make corrections, not all editors can handle very long lines, and even with editors that can handle long lines, it's easier to work with short lines.
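Most editors can rewrap paragraphs for you. If yours can't, a sketch like the following, using Python's standard textwrap module, will do it. The function name is made up, and it assumes that paragraphs are separated by blank lines. Be careful: it rewraps everything, so don't run it over verse or tables whose line breaks matter.

```python
import textwrap


def wrap_paragraphs(text, width=70):
    """Rewrap each blank-line-separated paragraph to at most `width` characters."""
    paragraphs = text.split("\n\n")
    wrapped = [
        # Collapse existing line breaks and runs of spaces, then refill.
        textwrap.fill(" ".join(paragraph.split()), width=width)
        for paragraph in paragraphs
    ]
    return "\n\n".join(wrapped)


if __name__ == "__main__":
    sample = "<p>" + "This is a long paragraph. " * 10 + "</p>"
    print(wrap_paragraphs(sample))
```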
Apart from these rules and recommendations, we also have a rule about the PG header, but that will normally be handled by the Posting Team. Where your HTML is all in one file, the header text will be inserted within PRE tags in that file. Where the HTML is split into multiple pages, the header will be put into a separate file named index.htm or index.html, and will link to the first page of your HTML.
H.5. Can I use Javascript or other scripting languages in my HTML?
No. We don't want our readers to have to worry about any potential for malicious or just plain buggy code.
H.6. Should I make my HTML edition all on one page, or split it into multiple linked pages?
For a typical novel, one page or HTML file is appropriate, but when that single HTML file gets up around 2 megabytes in size, it may be worth considering a split because of the difficulty of loading it in some browsers.
In some other cases, where the content requires different styles on different pages, or different pages need different character sets, or the page, with images, just gets too heavy, you may need to split the HTML even if the HTML itself isn't technically too big.
When we post a HTML eBook containing multiple files, whether they contain text or images, we post them only in zipped format, so if you don't have images, and want your text to be directly accessible, you should stick to one file where possible.
H.7. How can I check that I haven't made mistakes in coding my HTML?
There are two kinds of mistakes you can make in coding HTML: you can produce invalid HTML, or you can produce HTML that doesn't do what you want.
Checking for invalid HTML is straightforward. The W3C site <> will formally validate your file and point out any mistakes, and this is the official standard. However, it is not always convenient to use, especially when you're in a cycle of fix-and-retest. For this, you should try the program Tidy <>, which runs on your computer, tells you about errors, and has other useful functions as well. Tidy is available for just about every operating system, and there are several Windows utilities that include Tidy. The links on the main Tidy page will lead you to the right version for you. Tidy is fast and friendly, compared to validation over the web, but it is not the last word. The W3C Validator may find formal errors, such as DOCTYPE mismatches with HTML tags or entities, that Tidy may not. The best solution is to complete your HTML tests using Tidy, and then, when Tidy finds nothing further to gripe about, submit it to <> for the official seal of approval. Please run these checks before submitting your HTML; we can generally fix it for you, but it may take us a lot of work.
Producing HTML that actually does what you want is equally important. If you've converted the eBook from text, you may have created inconsistencies, or closed an italics tag in the wrong place, or used the wrong tag at some points. The only way to check this is by reading through the HTML in a browser.
H.8. Can I submit a HTML or other format of somebody else's text?
This question has several complications. First, you must understand that it is quite possible, even likely, that your HTML file will eventually be overwritten by better information.
The value of a HTML file, as opposed to a plain text file, lies in its ability to capture elements of the original that have been lost in the plain text. A plain text file, using extended character sets like ISO-8859 [V.76] or Unicode [V.77] and underscores for italics, can capture all of the author's intent in almost all cases. Sometimes, images and other important features of the original cannot be captured in plain text alone, but can be captured in HTML, or other markup.
When Michael Hart stopped posting books, in September 2001, we had HTML formats of about 1.6% of all our eBooks. At the end of 2002, that has risen to nearly 11% of all our eBooks. If you have a clearable copy of an existing posted book, with extra features not included in the original plain text, we would encourage you to make a new edition, or version, or format, correcting any errors in the original, and adding any new information not included there.
If, on the other hand, you just want to make a "blind format change" (making your best guess at what the HTML, or other format, layout should be for a book you've never seen, based on the original producer's work), your best bet is to get in touch with the original producer, and ask whether they can supply more material for you to work with. Otherwise, you are at best just rearranging information rather than contributing something new.
A blind format conversion can be done in anything from 2 minutes [R.33] to an hour. It just doesn't make sense for us to keep posting these files when they contain nothing new, and especially when two people may want to convert the same text. It is likely that, at some time in the next couple of years, we will start on a large-scale conversion project, to add some form of markup to all of the existing text files for ease of serving, and having a mish-mash of existing markup styles to deal with at that point won't help either.
H.9. How big can the images be in a HTML file?
The images should be as big as necessary, and no bigger.
Sorry, but there is no clear number to give here. Web page designers sweat blood to save an extra 20K on a page; so should you. If you're an experienced HTML maker, you know this stuff; if you're not, take it as a guideline that you should generally aim to keep your images in the 30K to 50K size range, with occasional forays into 70-80K territory. That's generally big enough for a clear picture, unless you're reproducing fine artwork.
H.10. The images I've scanned are too big for inclusion in HTML. What can I do about it?
This is a common problem, where images from the book occupy a full or half page. Your images should be of an appropriate size for downloading, and 2 megabytes of high-quality scan per image is not really an appropriate size for most PG texts!
You should reduce the size, and maybe the quality, of the original scan for simple viewing purposes. There is lots of image-manipulation software to do this. For Windows, you might look at the freeware Irfanview, and for both *nix and Windows there is ImageMagick [P.1]. Look for the words "resize" and "resample" in the Help.
Apart from simple converters, which do enough for this purpose, you can also manipulate the images in full imaging creation and editing packages like Paint Shop Pro, Adobe Photoshop and The Gimp [P.1].
Different image encoding methods can make a huge difference to the filesize. Any of the packages mentioned above can encode images as GIF, JPEG or PNG, and, particularly for black and white line drawings, these can encode to very different sizes. So, for example, a 60K JPEG may save as a 30K GIF, because the GIF encoding works better for that particular image. Try your images out, and see what works.
When manipulating images, always work from your original. Don't convert your original to a JPEG, and then shrink that and convert it to a GIF. Depending on the format, images may lose definition as they are converted (search for "lossy compression" in your favorite search engine to find out more about this), and they certainly lose definition as they are resized, and you end up with the "imperfect copy of an imperfect copy of an . . ." effect. When you're experimenting, take your original, resize and Save As GIF, then go back to your original, resize and Save As JPG, and so on.
You can also use an image optimizer. These are specialist software programs that try to make image files smaller without sacrificing resolution or detail.
H.11. Can I include decorative images I've made or found?
Please include only the images you got from the book. If you want to make an edition of the book for your own web site, you can of course use whatever you like there, but for PG purposes, we want the book, the whole book, and nothing but the book.
H.12. How can I make a plain text version from a HTML file?
You can edit out the HTML by hand, of course, but there are several easier ways to convert.
You can view the HTML in a browser, Select All text, and just Copy and Paste into your editor. This is easiest, but doesn't handle formatting like tables very well.
You can use the Lynx [P.1] browser to convert your text with the command
    lynx -dump myfile.html > myfile.txt
Bruce Guthrie's HTMSTRIP for MS-DOS [P.1] is very configurable.
<> has a list of other HTML to plain text converters.
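If none of those tools is to hand, you can get a rough conversion with a few lines of script. What follows is a bare-bones sketch using Python's standard html.parser, with made-up names; it is not how Lynx or HTMSTRIP work. It drops the tags, skips the head, starts a new line at paragraphs, headings and line breaks, and makes no attempt at tables, italics or indentation, so expect to tidy the result by hand.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "br", "h1", "h2", "h3", "h4", "h5", "h6", "div", "li", "tr"}
SKIP_TAGS = {"head", "script", "style"}


class TextExtractor(HTMLParser):
    """Keeps the text of a page and drops the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities become characters
        self.parts = []
        self._skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skipping += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skipping = max(0, self._skipping - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skipping:
            self.parts.append(data)


def html_to_text(html_text):
    """Return the visible text, with at most one blank line between blocks."""
    extractor = TextExtractor()
    extractor.feed(html_text)
    extractor.close()
    lines = [" ".join(line.split()) for line in "".join(extractor.parts).splitlines()]
    kept = []
    for line in lines:
        if line or (kept and kept[-1]):
            kept.append(line)
    return "\n".join(kept).strip()
```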
H.13. How can I make a HTML version from my plain text file?
This is not a course in HTML, but, for most books, you don't really need a course in HTML. Making a HTML format of most books is very easy, and doesn't take long, once you have mastered basic HTML. Let's assume you have your completed PG plain text file ready, and walk through the steps commonly needed to make a HTML version. We'll do this by successive approximation, doing the major things first, and then dealing more and more with the detail.
There are lots of specialized HTML editors out there, but you don't actually need any of them. The same editor that you used to create your text will also create your HTML. HTML is just text, with two types of special instructions added: tags and entities.
A tag is an instruction to the browser, usually to display something with specific rules. Tags are shown within angled brackets: for example, <p> is the instruction to start a new paragraph.
An entity is a named special character that might not be available in your character set. Entities are shown starting with an ampersand "&" and ending with a semi-colon ";" : for example, &mdash; is the representation of an em-dash.
I'm marking up a made-up short text as I write these steps, loosely based on the sample page from question [V.121]. You can see the changes made at each stage by looking at the files
    htmstep0.txt (text before starting)
    htmstep1.htm (after adding the HTML header and footer)
    htmstep2.htm (after adding paragraph marks)
    htmstep3.htm (after marking main headings)
    htmstep4.htm (after adding special line breaks and indents)
    htmstep5.htm (after adding italics and bold)
    htmstep6.htm (after adding accents and non-ASCII characters)
    htmstep7.htm (after adding an image)
    htmstep8.htm (showing some extra techniques)
Before you start, make sure that you can see these files both in your browser and in your editor. In your editor, you should see the HTML codes; in your browser, you should see the text as it is intended to be viewed.
Note for people who already know HTML: yes, this example omits lots of possible ways to do things, and lots of refinements. You already know how to do what you want to do, so skip onwards, and give the beginners room to learn in peace! :-)
Step 1. Add the HTML header and footer information
Add the following lines at the top of your text file:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
    <title>The Project Gutenberg eBook of My Book, by A. N. Author</title>
    </head>
    <body>
Let's explain these one by one:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
says that your file is HTML 4.01 Transitional, which is the latest version, allowing the widest range of tags and entities.
<html>
denotes the start of the HTML
<head>
denotes the start of the HTML header information.
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
says that the characters are text, using ISO-8859-1 encoding. If you need to use a different character set, you should change ISO-8859-1 to whatever you intend to use. ISO-8859-1 is good for lots of PG books in English that use French or German words.
<title>The Project Gutenberg eBook of My Book, by A. N. Author</title>
You should obviously change this to the actual title and author of the book you're producing. The
</head>
denotes the end of the HTML header information and
<body>
denotes the start of the actual text itself - the body of the book.
At the very end of the file, you should append these two lines
    </body>
    </html>
these denote the end of the body of the book, and the end of the HTML.
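You can paste the header and footer in with your editor, but if you are converting several texts, a small script saves retyping. Here is a minimal sketch in Python; the function name is made up, the header lines are the ones from this step, and the body text is passed through untouched.

```python
import html

HEADER_TEMPLATE = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n'
    "<html>\n"
    "<head>\n"
    '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">\n'
    "<title>{title}</title>\n"
    "</head>\n"
    "<body>\n"
)
FOOTER = "</body>\n</html>\n"


def wrap_in_html(body_text, title):
    """Put the Step 1 header before the text and the footer after it."""
    # Escape &, < and > in the title so it cannot be mistaken for markup.
    header = HEADER_TEMPLATE.format(title=html.escape(title, quote=False))
    return header + body_text.rstrip("\n") + "\n" + FOOTER


if __name__ == "__main__":
    print(wrap_in_html("Some text.", "The Project Gutenberg eBook of My Book, by A. N. Author"))
```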
At this point, you actually have a valid HTML file! OK, if you view it with a browser, it doesn't look anything like the way it's supposed to, but it is HTML. Save it with a name like MYFILE1.HTM or STEP1.HTM and get a copy of Tidy for your DOS, Unix, Mac or Windows system from <>. Run Tidy on your file, telling it just to look for errors (tidy -e if running from a command-line; if you're using a GUI version, there should be a menu option or tickbox for showing errors only). Tidy should tell you that there are no errors. Yay!
If it does say that there are errors, deal with them now, before you continue. Make sure, at each step, that you have cleaned up any errors; it's a lot easier now than later. Also, when you've finished each step, save your file with a number in its name, so that if you run into problems later and get confused, you can, at worst, drop back to the correct version at the end of the previous step.
The most likely error you might have at this point relates to the characters "<", ">", or "&". These are the characters used by HTML to indicate tags and entities. If these characters are used in the text of your file (and ampersand is likely to be), you should replace them with entities, so that HTML will know that they are to be displayed as characters, not interpreted as commands.
Replace
    & with &amp;
    < with &lt;
    > with &gt;
There is an example of this in the file htmstep1.htm
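If your editor's search-and-replace feels risky for this (the ampersand must be replaced first, or you will mangle the entities you have just added), Python's standard html.escape function does the three replacements in the right order. A minimal sketch, with a made-up function name; run it on the plain text before you add any tags, because it would escape the tags too.

```python
import html


def escape_text(text):
    """Replace &, < and > with their entities, leaving quotes alone."""
    return html.escape(text, quote=False)


if __name__ == "__main__":
    print(escape_text("Smith & Sons <London>"))
```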
Step 2. Add paragraph marks.
For novels and general prose, paragraphs are the main logical and display unit. Paragraphs are marked in HTML with the sign <p> at the start, and </p> at the end. You don't actually need the </p> at the end, but adding these is a good habit to get into. You do, very much, need the <p> at the start.
The line-lengths within a <p> </p> pair are irrelevant; the browser in which the text is viewed will ignore extra spaces and line-ends, and will wrap text to fit the screen. This is bad for poetry and tables, but we will discuss those later. For this step, all you need to know is that you can leave your text exactly as it is, and just add the paragraph marks.
Put a <p> at the start of the line before the first letter of every paragraph, and a </p> just after the last letter or punctuation of every paragraph. If you can do macros in your editor, this will just take a minute; otherwise, it may be rather boring, but at least it is simple. For this step, put the paragraph marks around everything that has a blank line after it, even poetry or chapter titles. We'll come back and change that later.
Now save your text as something like MYFILE2.HTM or STEP2.HTM.
Again, run Tidy to check for errors, and fix them before continuing.
If you now look at the file htmstep2.htm in your browser, you will see that it is starting to take shape. Look at it in your editor, and you will see the paragraph marks.
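If your editor doesn't do macros, the paragraph marks of this step can be added by a short script instead. This sketch assumes, as PG plain texts normally do, that paragraphs are separated by blank lines; the function name is made up. Apply it to the body text only, not to the HTML header and footer from Step 1.

```python
import re


def add_paragraph_marks(text):
    """Wrap every blank-line-separated block in <p> and </p>."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(f"<p>{block.strip()}</p>" for block in blocks if block.strip())


if __name__ == "__main__":
    sample = "CHAPTER I\n\nIt was a dark and\nstormy night.\n\nThe rain fell.\n"
    print(add_paragraph_marks(sample))
```

The line breaks inside each paragraph are left alone, so the HTML stays as close as possible to the text version.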
Step 3. Add marks for headings.
We want to indicate to the reader that certain lines are for chapter or other headings. HTML provides the tags <h1>, <h2>, and so on for this. <h1> is for the biggest heading, and usually, you will reserve this for the title, and use <h2> for chapter headings. If you find these too big, you could choose <h2> for main headings, and <h3> for chapters. Whenever you use one of these header tags, you must close it with its equivalent end tag. So a chapter heading might look like:
<h2>Chapter XI</h2>
Since there won't be many headers, and most headers are only on one line, this is usually not hard. Look at the file htmstep3.htm to see how our sample is improving, and if you're working along with me, don't forget to save your file under a new name and check it.
In our example, we have marked some lines with paragraph marks where we now want to put headings, so we will change those <p>s into <h2>s, since we don't need or want to mark a line as both.
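In a long book with regular chapter headings, the change from <p> to <h2> can be made with a regular expression. The sketch below assumes that headings look like "Chapter XI" or "CHAPTER 11", alone in their own paragraph; that pattern and the function name are assumptions for this example, and your book may well number its chapters differently, so check the result.

```python
import re

# Matches a paragraph consisting only of "Chapter" plus a roman or arabic number.
_CHAPTER_PARAGRAPH = re.compile(
    r"<p>\s*(CHAPTER\s+[IVXLCDM0-9]+\.?)\s*</p>", re.IGNORECASE
)


def mark_chapter_headings(html_text):
    """Turn <p>Chapter N</p> paragraphs into <h2>Chapter N</h2> headings."""
    return _CHAPTER_PARAGRAPH.sub(r"<h2>\1</h2>", html_text)


if __name__ == "__main__":
    print(mark_chapter_headings("<p>Chapter XI</p>\n\n<p>It was late.</p>"))
```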
Step 4. Line up verse, tables of contents, and other lists.
The HTML tag <br> tells the browser to force a line break without starting a new paragraph. We use this when we don't want text all wrapped together, but not separated with blank lines either, for example in verse and tables of contents.
In our sample, we add the <br> tag to the end of each line in the table of contents and the end of each line of the verse. If we were working on a whole book of poetry, the same principle would apply, but we'd be using the <br> tag a lot more.
Where we want to indent a line of poetry, we can use "&nbsp;" at the start of the line. Normally, however many spaces you leave between words, HTML condenses them to one space, so normal indentation doesn't work. But the "non-breaking space" entity will cause the browser to show one space for each character, so that you can indent as much as you need.
The file htmstep4.htm shows the effect: this is now an entirely readable HTML text!
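For a whole book of poetry, adding <br> and the non-breaking spaces by hand gets tedious. Here is a rough sketch of the same step as a script; the function name is made up, and it assumes the stanza has already had its &, < and > characters replaced in Step 1. Each leading space becomes one non-breaking space entity, and every line gets a <br> at the end.

```python
def mark_up_verse(stanza):
    """Add <br> to each line and turn leading spaces into &nbsp; entities."""
    marked = []
    for line in stanza.splitlines():
        stripped = line.lstrip(" ")
        indent = len(line) - len(stripped)
        marked.append("&nbsp;" * indent + stripped + "<br>")
    return "\n".join(marked)


if __name__ == "__main__":
    print(mark_up_verse("Ah me! on what inhospitable coast,\n  On what new region is Ulysses toss'd;"))
```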
Step 5. Add back in italics and bold.
The HTML tag <i> tells the browser to start displaying italics, and the </i> tells it to stop. Similarly, the <b> tag tells it to display bold, and </b> marks the end of the bold text. See htmstep5.htm for the changes.
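If your plain text marks italics with underscores, _like this_, a regular expression can turn each pair into <i> tags. This is a sketch under that assumption only, with a made-up function name; it will go wrong if the text uses underscores for anything else, or if an italic passage runs across a line break, so read through the result afterwards.

```python
import re

# A pair of underscores on one line, with at least one character between them.
_UNDERSCORED = re.compile(r"_([^_\n]+)_")


def underscores_to_italics(text):
    """Replace _word_ with <i>word</i>."""
    return _UNDERSCORED.sub(r"<i>\1</i>", text)


if __name__ == "__main__":
    print(underscores_to_italics("It was a _very_ good book."))
```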
Step 6. Restore accents and special characters.
Since we declared our HTML file to use ISO-8859-1 back at the start, we can use any of the common accented characters for Western European languages, but we may also use HTML entities. For example, for the "a circumflex" in "flaneur", we can use either the ISO-8859 character directly, or the HTML entity name "&acirc;" or number "&#226;".
There is a trade-off between characters and entities: entities do not limit you to any particular character set, but characters are directly readable when looking at the HTML source.
Within entities, there is also a trade-off between entity names and numbers: older browsers may not recognize some of the entity names, but the entities do make the text work in multiple character sets. Which you choose is entirely up to you, but it's best to be consistent; if you like entities, use them everywhere. Entities can be represented by their names (for example, &mdash;) or by their number, derived from their ISO-10646 (see Unicode) number (for example, &#8212;).
There are other special character entities you may choose, to replace the ASCII equivalents in the main text. Here are some of the common ones:
We've already seen
&amp;    &#38;    ampersand      replaces "&"
&lt;     &#60;    less than      replaces "<"
&gt;     &#62;    greater than   replaces ">"
&nbsp;   &#160;   space          replaces a space when you want to indent
and these are also very useful for many PG texts:
&mdash;  &#8212;  em-dash        replaces "--"
&deg;    &#176;   degree         replaces "deg." or "degrees"
&pound;  &#163;   British pound  replaces "L" or "l" or "pounds"
There are many others. <> has a fuller list. Please note that you don't have to use these entities in your HTML; if you're happy with the text reading "500 pounds", there is no need to make that "&pound;500".
I've made a couple of entity changes in htmstep6.htm.
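To see the choices side by side, here is a sketch with an invented sentence. Both paragraphs should display identically in a browser; the first uses entity names and the second uses the equivalent numbers:

```html
<p>The fl&acirc;neur paid &pound;500 &mdash; in cash &mdash; on a day of 30&deg; heat.</p>
<p>The fl&#226;neur paid &#163;500 &#8212; in cash &#8212; on a day of 30&#176; heat.</p>
```

Typing the accented character itself would give the same display, as long as the file really is saved in the character set it declares.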
Step 7. Link Images into the text.
First, you need to have your image ready. You should already have resized your image to the size you want it to be viewed at. You should also have saved it as a GIF, JPG, or PNG image, since those are the formats most supported by current browsers.
If your image is named front.gif, and it is a picture of the frontispiece of the book, you should add the line
<img src="front.gif" alt="Frontispiece">
to your HTML at the place where you want it displayed.
The "alt" text gives a label to the image, and is displayed if the image can't be shown, or in the case of a browser for visually impaired people.
You don't have to add images with your HTML file, unless you want to. In many older books, there are no images at all to be added.
My final HTML text is now in htmstep7.htm. You need to have the image front.gif in the same directory in order to see it. When your HTML text is posted, the images will be zipped with it, so that future readers can see them.
Step 8. Over to you!
This is enough to make a reasonable HTML format of most PG texts, but it doesn't begin to cover everything that can be done in HTML. If you've gone this far, I recommend the W3C's tutorials:
which cover the ground we've just crossed, and go a bit further.
Here are a few more things you might want to know, but don't go nuts adding tags just because you can! Use them only when you really need them. The file htmstep8.htm shows some of these techniques. Personally, I think that this is a bit overdone, and I prefer the effect of htmstep7, with left-aligned chapter headings, but that's a matter of taste.
Once you're used to the basic HTML needed for most PG eBooks, you'll probably be able to convert one in under an hour.
How do I force more space between specific paragraphs?
Insert a blank paragraph like this: <p>&nbsp;</p> or use an extra <br> tag.
How do I make text, or image, or headings centered?
Put the <center> and </center> tags around what you want centered, like: <center><h2>Chapter 12</h2></center>
How do I make some text bigger or smaller?
Put the <big> and </big>, or <small> and </small> tags around it.
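For example (an invented sentence, purely to show where the tags go):

```html
<p>The notice read <big>SALE</big> in large type, with
<small>terms and conditions apply</small> underneath.</p>
```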
How do I lay out tabular information?
The simplest way to do it is with the <PRE> and </PRE> tags. These will cause whatever is within them to be displayed as plain text, just as it was in the original, so that spaces separate the entries just as they did in the text version. You can also use this for poetry, though you usually won't need to. It's not entirely satisfactory, but it will work.
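A small sketch of the <PRE> approach, using a made-up price list rather than anything from the sample files. The columns line up only because of the spaces typed between them:

```html
<pre>
Item            Quantity    Price
Candles               12    1s. 6d.
Lamp oil               2    3s. 0d.
Writing paper        100    2s. 3d.
</pre>
```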
Making a full HTML table requires you to use the <table>, <tr> (table row), and <td> (table data) tags, among others, and a full exposition of tables is beyond the scope of this FAQ.
Briefly, you start a table with the <table> tag:
<table>
For each row you want in the table, you open and close a table row <tr> tag, like:
<tr>
</tr>
and then for each cell within a row, you specify a <td> tag and the contents of that cell:
<table>
<tr>
<td>This is the Top Left cell</td>
<td>This is the Top Right cell</td>
</tr>
<tr>
<td>This is the Bottom Left cell</td>
<td>This is the Bottom Right cell</td>
</tr>
</table>
This only scratches the surface of tables. However, there are many guides available on the Web, and they're easy to find, once you know which tags you're looking for. A brief discussion of tables is provided by the W3C as part of the HTML 4.01 spec at <> and the tutorial at <> also shows how to make HTML tables.
Step 9. Some common problems
When you're just starting to code HTML, it may seem that errors are coming at you from all sides. Tidy may spew out a stream of complaints that you don't recognize or understand. If it's any consolation, this is normal!
Just take the error list one line at a time, starting at the top. Often, one actual mistake, like not closing a tag, may cause many errors, since an unclosed tag can cause many subsequent tags to be reported as errors.
Common errors include:
1. Simple typos in tags, like <h2Chapter 3</h2> instead of
<h2>Chapter 3</h2>
2. Unclosed tags, like forgetting to add the </h2> in the
sample above, or forgetting the slash in the closing
tag so that you type <i>italics<i> instead of
<i>italics</i>
3. Not nesting tags correctly. Get used to thinking of tags
as brackets; the first one opened should be the last one
closed. For example, you should type:
<center><p>This is centered.</p></center>
instead of
<p><center>This is centered.</p></center>
One option for making an HTML version is to use GutenMark <> to create the basic HTML straight from your text, and then edit the resulting HTML to add the features you want. If you're having a lot of problems with your main conversion, this is worth a try.
Programs and programmers FAQ
P.1. What useful programs are available for Project Gutenberg work?
These suggestions came largely from a poll of volunteers in June, 2002. The programs listed are a summary of the programs we actually use. There are many other programs out there that can do the same jobs, so don't limit your search just to these.
1. OCR
Abbyy <>
OmniPage <>
TextBridge <>
These are the three main commercial packages that volunteers bought specifically for the purpose. In a few cases, people had got older versions of these bundled with their scanners.
Clara OCR <>
Gocr <>
These are Free Software packages. Some people who responded to the survey had tried them, but nobody had actually used them to produce a text.
DocMorph — a free, web-based OCR <>
This one is interesting: you can just submit your image through a web page, and the service will return OCRed text. However, the process of submission, waiting for your text, and then cutting and pasting into your document is slow.
Other volunteers use various OCR software that came bundled with their scanner.
2. Editing
The main answers, given by more than one person, were:
AbiWord <>
Microsoft Word
Windows WordPad
Word Perfect
Other editors mentioned included:
Crisp for Windows <>
EditPad <>
Editplus for Windows <>
Foxpro 2.6 for DOS
Metapad <>
Windows Notepad
Programs recommended by Apple Macintosh users included:
BBEdit Lite <>
Microsoft Word
Nisus Writer <>
Text-Edit Plus <>
TextSpresso <>
Add/Strip <>
3. Checking and proofing
For spelling, most people just use the spellchecker built into their editor or word-processor. The *nix users running emacs or vi tended to use variants of the standard Unix spell command, such as ispell or aspell. Mac users have the free spelling checker Excalibur, available from <>.
Gutcheck <> was used for format checking, and a few people had written some checking procedures of their own.
4. Working with HTML
In the survey, most volunteers preferred to handcraft their HTML using their normal editor. Those using a word processor edited the HTML as text, rather than composing a word processor file and then Saving As HTML. There was remarkable unanimity on this.
Specific HTML editors that were mentioned for occasional use were:
Adobe PageMill (no longer available)
Mozilla Composer <>
HTMLKit <>
HTMLPad <>
However, not all HTML work is about editing, and the following packages were honorably mentioned for other functions. Especially important is Tidy, which is pretty much necessary for all but the most experienced people for quick HTML checking. <> has the original, and links to versions of Tidy for Windows (Tidy-GUI) and just about all other platforms.
Converts Project Gutenberg texts to HTML and TeX.
HTMSTRIP by Bruce Guthrie:
MS-DOS. Converts HTML to text
Lynx (lynx -dump):
Converts HTML to text
Dave Raggett's HTML Tidy:
Checks HTML for correctness, reformats and fixes
W3C html2txt (web-based):
Converts HTML to plain text.
W3C Validator (web-based):
The Last Word on the correctness of HTML.
A very neat utility for getting web pages
5. Working with images.
There are two main applications of images in PG: images to be used within texts, like illustrations in HTML, and the management of page images for scanning. These packages are used by volunteers variously for both of those purposes. Their typical use within PG is indicated. "Advanced image processing" packages will permit you to edit and restore damaged images, but for PG work, we mostly just need to manage, convert, resize and crop them.
ACDSEE for Windows
For image reviewing
Adobe Photoshop
For advanced image processing
ImageMagick for *nix, Mac and Windows
Resizing and format conversion
Irfanview for Windows
Image viewing, conversion, cropping and resizing
The Gimp
For advanced image processing
Picture Publisher
For advanced image processing
VuePrint Pro
For viewing images
Proofreaders' Toolkit (PRTK)
For splitting batches of image files into individual pages
P.2. What programs could I write to help with PG work?
Look at the programs listed above in [P.1]. Can you write a better version of any of them? Improving OCR and editors constitutes a major challenge, unless you're a world-class expert, but checking and reformatting texts is an area not addressed by large scale programs, and you might contribute there.
Formats FAQ
F.1. What formats does Project Gutenberg publish?
In principle, there's no format that we won't publish, but, in practice, we prefer formats that are open and editable.
An open format is one whose structure is publicly defined and documented, and not burdened with patent or trade secret or copy-protection (a.k.a. "DRM") restrictions. Anyone can write a reader or creator for an open format, and in 500 years' time, anyone interested will still be able to write a program to display the file. Closed formats, by contrast, will almost certainly be unreadable in just a few decades, when the companies now promoting them disappear, or lose interest, or decide to stop supporting them because they want to sell a replacement.
Being able to edit the file is also important. We make corrections to our editions constantly, and it is important to us that we should be able to update our files easily. If adding one word to a sentence involves a complete re-marking of the whole text and a complete rebuild of the file, we have to ask ourselves whether this format is really necessary for this text. Further, the people who re-use our texts should also be allowed to copy and reformat them freely, and non-editable formats restrict their ability to do this in various ways.
F.2. What is, and how do I make or use:
[Note: Character sets and formats are both listed here. Character sets refer to the characters you can use; formats describe how those characters are put together. For non-text formats such as music files, there is no exact equivalent to a character set.]
ASCII (Character Set)
ASCII (American Standard Code for Information Interchange) is a set of common characters, including just about everything that you can type in on an English-language keyboard. It includes the letters A-Z, a-z, space, numbers, punctuation and some basic symbols. Every character in this document is an ASCII character, and each character is identified with a number from 0 through 127 internally in the computer.
You can view or edit ASCII text using just about every text editor or viewer in the world.
Big-5 (Character Set)
Big-5 is a set of 13,494 traditional Chinese characters. You will need to use an editor or viewer that supports the character set.
Codepage 437, 850, 1252, etc. (Character Sets)
These codepages are Microsoft-specific character sets which allow the display of accented characters and other symbols. To view a text that uses one of these, you will have to use a Microsoft application that supports them. Many of the fonts supplied with Word for Windows will display and edit CP-1252 correctly. For Codepages 437 and 850, you may have to open a Command Prompt and use a DOS editor like EDIT. A search from <> should bring up information about the codepage you're interested in, or you can read the excellent overview at <>. For Unix users, iconv and recode provide translation facilities from one character set to another, and support many or all of the MS codepages.
DVI stands for DeVice Independent, and is commonly used to store text and instructions for displaying it involving complex mathematical symbols and expressions, though it can be used for any content. Given a DVI file, you need a viewer to render it on the specific device you're using. Specifically, DVI is used as the standard output format for TeX, discussed below.
HTML/HTM (Format)
HyperText Markup Language defines the standard format of web pages. You should be able to view these with any web browser, and edit them with any text editor or a specialized HTML editor. <> is the definitive reference.
ISO-8859/ISO-Latin (Character Sets)
ISO-8859 is a series of character sets used to represent the accented characters most commonly used in European languages. There's ISO-8859-1, ISO-8859-2, and so on. ISO-Latin is just another name for the same thing. You can read the overview at <>
LIT (Format for PDA-based eBooks)
This is a proprietary, closed format for files that can be displayed only by the Microsoft Reader. Search <> for more information. It is not possible to edit or correct files in this format; it is not possible to export files from this format; they have to be made in another format and converted.
MacRoman (Character Set)
MacRoman is an 8-bit Apple Mac-specific character set which allows the display of accented characters and other symbols. To view a text that uses MacRoman, you will have to use an application that supports it, and there are few outside the Apple fold. However, iconv and recode are programs that convert between many character sets, and MacRoman is supported by both.
MID/MIDI (Format for music)
Musical Instrument Digital Interface is a music description language, encompassing not only file formats but definitions of interfaces. A MIDI file contains instructions for sending messages to a musical instrument to recreate the sounds. <> has much more on this.
MP3 (Format for any audio file)
MPEG-1 Audio Layer 3 was defined by the Moving Picture Experts Group as a means for encoding sounds. Many, many MP3 players exist for all platforms, and can be found easily with a Net search. The official home page of the MPEG is <> and copies of the specification can be purchased from the ISO at <>
MPEG/MPG (Format for moving pictures)
The Moving Picture Experts Group have released a series of formats for encoding video and audio. MPEG (pronounced EM-peg) formats are published and widely used. The official home page of the MPEG is <> but you will find information about MPEG formats, and software to play MPEG files, all over the Net. You can also purchase specifications through <>
MUS (Format for music)
MUS from Coda Music <> is a proprietary, closed format for editing and replaying sheet music. However, we do post music files in this format because of its many features. We hope to be able to post these also in more open standards at some point in the future, but at the moment, there is no open format with similar capabilities. You can find out more about this at <>
PDB (Format for PDA-based eBooks)
The Palm Data Base format can actually be used for purposes other than eBooks, and there are many possible variants of formats for Palm-based readers all using the extension PDB on PCs, and they're not all entirely compatible. Some of them are proprietary, and it may not be possible to edit them directly, or export files from these formats; they have to be made in another format and converted. Some can be converted back to text. The most common, though, is the "Palm-DOC" format, which is an open format and can be edited on the Palm itself.
PDF (Format for eBooks)
Portable Document Format is a format for storing texts, containing any fonts or graphics. It is copyrighted by Adobe, <> but is well and publicly documented. It is sometimes referred to as a kind of compiled Postscript (see PS below). It is viewable using the Adobe Acrobat Reader. It is not possible to edit files in this format.
PRC (Format for PDA-based eBooks)
This is a proprietary format for files that can be displayed only by the MobiPocket Reader. See <> for more information. It is not possible to edit or correct files in this format; it is not possible to export files from this format; they have to be made in another format and converted.
PS (Format for text and graphics)
Postscript is technically a programming language, not just a format. It has conditional statements, procedures and program flow control. However, it is commonly referred to as a format. Adobe <> holds copyright on the Postscript specifications (there have been three "levels" published) but Postscript is well and publicly documented and has wide support, not only in printing, but in screen display as well. Apart from Adobe's official version, you can also render Postscript files with Ghostscript, a Free Software package. Postscript can be edited directly, but any complex editing may present difficulties.
RTF (Format for text)
Rich Text Format was originally a Microsoft specification, but it is an open format that is used by many word processors to exchange text and format information in an application-independent way. Nearly all current word processors will read and edit an RTF file, and, like HTML, it can also be edited as plain text.
TXT is a generic extension used for any plain text file, regardless of the character set. Thus, while most of our .TXT files contain ASCII, some contain ISO-8859 or Big-5 or Unicode.
TeX (Format for typesetting, printing and viewing)
TeX (pronounced "tech"; the "X" is actually the Greek letter chi) is a public domain format created by Donald Knuth for typesetting, though it can also be used for normal printing and viewing. TeX consists mostly of the plain text, with instructions for how it is to be displayed. This is compiled into DVI format (see above) which can be rendered onto any device, like a printer or screen, by a program that is aware of the device's capabilities. The Comprehensive TeX Archive Network <> is the best place to start looking for TeX-related programs for your platform.
Unicode/UTF-8, UTF-16, UTF-32 (Character Set)
Unicode is intended to be a single character set that can handle all of the characters in all of the languages that ever were, or ever will be. It accords with the ISO-10646 standard for the characters, but, in addition, imposes rules of implementation. UTF-8, UTF-16, UTF-32 and their variants are ways of expressing Unicode using different rules for transforming bytes into characters. Unicode is steadily gaining ground, with at least some support in every major operating system, but we're nowhere near the point where everyone can just open a text based on Unicode and read and edit it. Check <> for more.
XML (Format for . . . well, just about anything :-)
eXtensible Markup Language looks a bit like HTML, but whereas tags such as <p> have a standard meaning in HTML, XML allows anyone to define their own set of tags and meanings using a Document Type Definition (DTD) file. Add a CSS (Cascading Style Sheets) file to that, and you have the ability to display the text according to predefined rules. In principle, this seems to make it ideal for the storage and processing of etexts, since a suitable DTD and CSS, together with the right programs, should make it possible to produce any format of eBook automatically from an XML original. Some PG volunteers have looked at, and are looking at, ways to convert the entire archive using a satisfactory DTD; however, meantime we aren't actually producing much XML, since most volunteers aren't working with it, and nobody wants to start producing many XML texts until we have agreed on a DTD. <> is the definitive source for more information about XML.
Volunteers' Voices
In this section, we asked volunteers to talk about their practical experiences with Project Gutenberg, how they joined, why they give up their hours to work for Free Etexts, how they get down to the nitty-gritty of producing texts.
Some people chose an interview format for their responses, with pre-set questions; others just wrote.
Amy Zelmer
I stumbled across Project Gutenberg a couple of years ago--can't remember just what I was looking for on the web but the idea of PG intrigued me. I was also looking for something to get me reading materials which I wouldn't ordinarily read, so didn't particularly want to find a book in which I was interested--and the whole process of finding a book, finding out if it was already "in progress" and then checking out copyright clearance seemed just a little daunting from what I was able to gather from the info on the web.
Furthermore, I live in a small regional city in Australia, so the possibility of finding something in either the local library or in a second-hand bookshop was next to nil.
Fortunately I also found Sue Asscher's name and figured that I'd ask a fellow Aussie how to get started. Sue seems to have an inexhaustible stock of books waiting to be entered -- and got me started on Thomas Huxley's "Essays and Lectures". I've now done five other books and am currently working on Darwin's "The Power of Movement in Plants"--quite a variety, but it's at least met my goal of reading something different.
Fortunately Sue was also patient about answering my beginner's questions about formatting dilemmas and has been able to co-ordinate other aspects of the process, like getting scans of diagrams and final proof-reading. That means all I have to do is put in the text.
I'm a reasonably good typist -- and the practice with PG is certainly improving both my speed and accuracy! (That's meant as a word of encouragement to others.) I generally type for about 20 minutes at a time, then take a break; both my concentration and desire to prevent RSI (repetitive strain injury or occupational overuse syndrome) mean that it's better to do shorter sessions more frequently than to carry on for too long a time. I generally use Microsoft Word 2001 for Macintosh for the first entry and spell check, then save the material in "text only" and do a final read through, removing page numbers and correcting errors which the spell-checker missed as I go.
I've also done some data input for another ebook collection. However, they separate the text and send out small batches of pages to many volunteers. I find that rather frustrating since it's impossible to see how your piece fits until the whole thing is finally posted.
I've done some scanning, OCR and proof-reading of material, but generally find the close proof-reading which is required very frustrating. To each his own method.
Ben Crowder
I've been a book lover ever since the day I learned to read. Several years ago I discovered Project Gutenberg while surfing the net and was delighted to find so many good books freely available. I downloaded all the etexts I was interested in and read quite a few of them. After a few years, I decided to get more involved, so I started proofing with Distributed Proofreaders. I liked that a lot -- I was a newspaper editor in high school for two years -- but I felt an itch to try to produce etexts on my own. I didn't have a scanner, however, so the only solution I could see at the time was to find a book and start typing it in by hand. I'm a relatively fast typist and I figured it wouldn't take that long.
So, I went to my university library, found a pre-1923 edition of G.K. Chesterton's The Ball and the Cross (Chesterton is one of my favorite writers), and began typing. It took much longer than I expected -- certainly over 30 hours, perhaps even close to 50. When I finished, I came across a page on the PG site that mentioned there should be two spaces between sentences. I looked at the etext I'd just typed in and realized in horror that I'd used single spaces the whole way through. :) [1] I had been *sure* that PG used single spaces, convinced that I'd read it in one of the PG docs, which had taken a little while to get used to since I normally use two spaces. But all the PG etexts I checked had two spaces between sentences, so I began the monotonous task of adding an extra space between each sentence (and being very careful not to add spaces in where they shouldn't be). Several hours later the book was finally done. I'd gotten copyright clearance before I started, so I soon submitted it and within a few days I saw those lovely words in my inbox, "Posted (#5265, Chesterton)".
[1] Ben was right both times: people have posted advocating both one space and two. Either would have been accepted!—jt
Since then, I've been addicted to producing etexts. Languages interest me greatly, so I found an Old Icelandic primer that someone had scanned in, OCRed the images using DocMorph (it didn't take as long as I thought it would, and the output was decent enough to work with), and realized I would have a problem entering in the foreign characters (o's with hooks underneath, etc.). Thank heavens for Unicode. Vim (my editor of choice) has fairly good Unicode support and it didn't take long to make a list of the Unicode codes for the Icelandic characters.
As noted, I use Vim for all my editing. I can rewrap lines to 65 characters by typing "gq", I can use regular expressions for search and replaces (*very* handy), I can edit in Unicode when I need to, and I can speed things up greatly by making keyboard mappings for repetitive tasks. (On one text I was working on, I had to add a blank line between each paragraph. Each was numbered, but the blank lines had somehow been taken out before I got the text, so I started going through and adding them in by hand. The file was 30,000 lines long, however, and I quickly realized it would take a *long* time. I then noted which keys I was pressing to add the blank line between each paragraph, mapped them to <F9>, and held the key down while Vim zipped through the rest of the file. It sped it up by a factor of over a hundred.)
My university library is well-stocked and has lots of old books, so I usually rely on it when I need to get TP&V's for texts I'm not typing in myself. I still don't have a scanner, so I either find already-existing texts on the Internet and reformat them for Project Gutenberg (after getting permission, of course), or find page images on the net and OCR them myself, or type the books in by hand. Typing in by hand takes a long time and so I prefer the first two methods.
Volunteering with Project Gutenberg has been extremely satisfying. The people are wonderful to work with, the work is fun, and it feels very good to know that one is making a difference in the world.
Col Choat
How I got started
People sometimes ask me how I got started in preparing etexts for Project Gutenberg, and while they probably ARE interested in my story, often they are really more interested in finding out whether it is something that they might want to get involved with. Jim Tinsley, a colleague at PG, recently prepared a "questionnaire" as a way of stimulating existing volunteers to document their PG experiences. Answering the questionnaire seems as good a way as any to answer the question, "how did you get started".
I think it was probably from a newspaper or a computer magazine. I can't really recall, now.
Initially, I visited the site to search for books I was interested in, to see if they had been posted at PG. That was quite a straightforward process. I downloaded a few texts and either read them at my computer or, occasionally, printed them out to read later.
When I became interested in volunteering, I visited the site to get some information about how to go about it. I found it a bit daunting, really. There was a lot of information but it was difficult for me to get it sorted out in my mind. There were copyright issues, editing rules, and procedures for lodging etexts. There was a question and answer page and some background and information for those wanting to subscribe to the PG mailing lists. In the end, I just sent an e-mail to Michael Hart, whose e-mail address was listed on the site, and said "what can I do?" I notice that volunteers still sometimes do that.
I decided to prepare an etext from a book I had in my home library, titled "UNDER THE NORTHERN LIGHTS". It is a series of short stories about the Canadian North by Alan Sullivan. I had a small "hand" scanner at home, which I hadn't used much before. I didn't know any better, so I would scan in about ten pages and save them as "tif" files. Then I would use the OCR (Optical Character Recognition) software supplied with the scanner to convert the image to text for subsequent editing. I recently purchased an A4 scanner with state-of-the-art OCR software and I can't believe how I persevered with that hand scanner for so long.
I tried to apply the editing rules outlined on the PG site, though they weren't as prescriptive as I would have liked. I wanted certainty, as I felt that I didn't know enough to apply my own editing rules. I didn't have a good text editor, either, so I probably made the job more difficult than it needed to be. More about the "tools of the trade" later, though.
When I submitted the title pages of the book to PG for copyright clearance it was rejected because the book was published in 1926. I don't know what I was thinking about when I chose it. It must have just LOOKED old enough. I had scanned and proofed about half of it, so I just abandoned it and looked for something else. Interestingly, Australians, and residents in other countries with similar copyright laws, can now read it as it is in the public domain in Australia and is now on the Project Gutenberg of Australia site. I was able to finish it and post it at PG, after all.
I think that one of the most valuable things I did was to join thevolunteer discussion group. I found that I didn't need to take part,but could just take note of all the different issues raised by othervolunteers. Some days there was no activity by the group, but then ahot topic would be raised (e.g. whether some books, such as Mein Kampfby Adolf Hitler, should not be accepted by PG, even if eligible) andthere would be plenty of comments. I realised also that I could askfor help on specific questions regarding preparation of texts andreceive prompt informative answers. Once, when I thought that I wassending to ONE of the members of the group an e-mail with a largeattachment, I was quickly made aware that EVERYONE had received it.Some weren't amused, but I am a quick learner—I didn't do it again.
Subscribing to the weekly newsletter is also worthwhile. There is alink on the main page of the PG web site to allow people to subscribeto the mailing list and discussion group. I also found a few peoplewho I began to e-mail privately, outside the discussion group. Thathelped a lot, too. Perhaps there is merit in instigating a mentorscheme, whereby a new volunteer can refer to another more experiencedone for help, guidance and encouragement. I would be interested intaking part in that.
As I mentioned earlier, my first attempt was abortive (initially, at least). However, as I had realised that there was not much Australian content on PG, I decided to go in that direction. Then I found that there were many eligible Australian titles already on the internet, mostly in HTML format. These can only be read using a web browser, so I decided that it would be worthwhile to download them, convert them to text files, compare them with a book of the same title which was eligible for PG copyright approval, and then have them posted at PG. I had learned my lesson, so from then on I always got the approval BEFORE I started work on the conversion.
I prepared a number of etexts using this method and quickly increased the amount of Australian content at PG. However, I still wanted to create an etext from a book. My sister had given me, as a gift, "Australia's Greatest Books" by Geoffrey Dutton, which reviewed approximately one hundred books, and I decided to work my way through them. I had already converted a number from HTML, as outlined above, so the first on the list to be scanned turned out to be the journal of Charles Sturt, who explored south-eastern Australia between 1828 and 1831. I was quite pleased with myself when the two volumes were finally posted at PG.
The simple answer is "because it is FUN". It is easy to make up justifications, but since there is no necessity to do it, it must be because I enjoy it. I get a sense of achievement that the work I do will be "out there" for a long time. We haven't begun to realise where technology will lead us. The books I prepare will be able to be read by people anywhere on earth, and even beyond, by astronauts travelling to Mars. "Send up THE ODYSSEY will you Scottie, I have always meant to read it."
I have had some unexpected pleasures, too. I have "met" some wonderfully generous and interesting people and I have read some wonderful books that I would not have taken the trouble to read if I weren't preparing them for PG.
I started out thinking that I would stick to books with an Australian flavour. But I can't help myself. If I see something that I am interested in, and it is already on the internet, but not at PG, I have to do it. I have submitted etexts of James Joyce's "Ulysses", and works by D. H. Lawrence and Norman Douglas. I also have a long list of books I would like to scan in myself, not all of which are about Australia, one day.
I think I have covered that already. I like the sense of achievement, the fun of reading the book, and the thought that it will be available to many people who would not otherwise have access to it, possibly in a form which has not yet been invented.
Sometimes the going is not easy. Occasionally I get impatient with the length of time it is taking and sometimes I get bored with the subject matter. I recently purchased a new scanner with excellent OCR software, which converts the page image to text, and that has given me a new lease of life because less proofing is required. I sometimes remind myself that I don't have to do it, then I find that I want to anyway.
Local libraries have a surprising amount of eligible material. The main difficulty is finding books with a publication date of 1922 or earlier, for PG in the US anyway. I have found a number of "facsimile" editions which are direct reprints of the original, and these are acceptable. I also look around second-hand bookshops. I recently found a battered copy of "A short history of Australia" published in about 1910, and bought it for $A1.50. For books eligible for posting at the PG Australian site, cheap paperbacks are readily available. I am working on one now, and have ripped all the pages out of it to make it easier to scan. It only cost a few dollars. There are also a number of sites on the internet which list second-hand books for sale.
This section might as well cover all of the "tools of the trade". I have noticed that volunteers have many favourite tools, and from what I can make out most will do the job. The list below covers what I have settled on. I should note that I work in the Windows environment, and tools are readily available for all the things I need to do.
I recently purchased a Canon A4 flatbed scanner without a document feeder for under $A200. It has a hinged lid for scanning books and comes bundled with image enhancing software and OCR software for converting image to text.
OCR (Optical Character Recognition) Software
'Omnipage Version 9' came bundled with the scanner. I find that I don't need any of the other software which came with the scanner; Omnipage does it all for me. I can scan, proof, spellcheck and save the output to a text file with very little effort.
I use Editplus, which is available as shareware on the internet. It enables me to read in the file produced by the Omnipage OCR software and reformat it to a line length suitable for PG texts (about 70 characters). It also allows one to display guide lines vertically on the page to help with checking for "long" lines. I have loaded James Joyce's "Ulysses" into Editplus and it handled it, so I presume that it will handle files of any size. As with everything one wants to do at PG, there is always someone more than willing to help with problems encountered, just by posing questions to the volunteer discussion group.
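The reformatting and long-line check described above can be approximated outside an editor. What follows is only a sketch in Python, not the Editplus feature itself; the function names `reflow` and `long_lines` are made up for this illustration, and it assumes that paragraphs are separated by a blank line.

```python
import textwrap


def reflow(text, width=70):
    """Re-wrap each blank-line-separated paragraph to at most `width` columns."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    wrapped = [
        textwrap.fill(
            " ".join(p.split()),      # collapse the old line breaks first
            width=width,
            break_long_words=False,   # never split a word to force the width
            break_on_hyphens=False,
        )
        for p in paragraphs
    ]
    return "\n\n".join(wrapped) + "\n"


def long_lines(text, limit=70):
    """Return (line_number, length) for every line longer than `limit`."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), 1)
        if len(line) > limit
    ]
```

Running `long_lines` on the reflowed text plays the part of the vertical guide line: any pair it returns is a line that still needs attention by hand.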
FTP (File Transfer Protocol) Software
Some volunteers e-mail their submissions to PG as an attachment to an e-mail. However, it is also possible to place them at the PG site for processing, using FTP. Microsoft Windows Explorer has an FTP facility which can handle this and that suits me. I know that there are many others, and SmartFTP is an excellent freeware product for those who need Windows-based FTP software.
Other Tools
I use Microsoft Word to convert HTML files to text files. Firstly, I cut and paste the HTML document into Word, then I convert any italics to upper case, since italics are not supported in plain text files; then I save the document as a text file. Then I use Editplus, mentioned above, to reformat the line length. Sometimes it is necessary to add an extra "carriage return" at the end of each paragraph, to comply with the preferred style for PG texts. This can be done from within Word or Editplus by replacing characters. New volunteers may need to ask for information about this process.
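For readers who would rather script this conversion than use Word, here is a rough sketch in Python using only the standard library. It is not the Word procedure described above: the class name `ItalicsToCaps` and the function `html_to_text` are invented for the example, and it assumes the HTML marks paragraphs with `<p>` and italics with `<i>` or `<em>`. Real pages will usually need more care.

```python
from html.parser import HTMLParser


class ItalicsToCaps(HTMLParser):
    """Collect paragraph text, turning italic runs into UPPER CASE."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.italic_depth = 0
        self.paragraphs = []
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in ("i", "em"):
            self.italic_depth += 1
        elif tag == "p":
            self.end_paragraph()

    def handle_endtag(self, tag):
        if tag in ("i", "em") and self.italic_depth > 0:
            self.italic_depth -= 1
        elif tag == "p":
            self.end_paragraph()

    def handle_data(self, data):
        self.current.append(data.upper() if self.italic_depth else data)

    def end_paragraph(self):
        # Collapse whitespace and store the paragraph if it is not empty.
        text = " ".join("".join(self.current).split())
        if text:
            self.paragraphs.append(text)
        self.current = []


def html_to_text(html):
    """Return plain text with one blank line between paragraphs."""
    parser = ItalicsToCaps()
    parser.feed(html)
    parser.close()
    parser.end_paragraph()  # flush any text after the last </p>
    return "\n\n".join(parser.paragraphs) + "\n"
```

The blank line between paragraphs corresponds to the extra "carriage return" mentioned above; the line length still has to be reformatted afterwards.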
I have tried a few different methods. I don't have a notebook computer or etext reader so I must either read it on a PC or print it out. There is a spellchecker with Editplus, which allows one to add new words, so I use that to begin with. I also use GUTCHECK, a program developed by Jim Tinsley, which picks up many errors. One would need to contact him via PG, if one wanted a copy. I travel by train to work, so I often make a printout and read that for the final proof, or co-opt my wife if it is something I can interest her in. I have a checklist, which I have developed over time, that I use to ensure that I have covered all that I need to, but then I AM one for lists.
I think I have covered most of my methods already. I sometimes find that "dashes" within sentences need attention. I like to show them as "—" so I try to be consistent and not let them slip through as " - ". I think we at PG could get together a more or less prescriptive list of editing rules for new volunteers to follow. Once they gained experience they could change them if they wanted to. I do like to place an end marker ("THE END") at the end of my progressing work, so that I don't inadvertently lose any of it, and I make several rotating backups of the file I am working on. I have "lost" computer files once or twice over the years and don't want to get that sick feeling in my stomach EVER again.
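One possible way to keep "several rotating backups" is sketched below in Python. This is an illustration only, not the method actually used here: the function name `rotate_backups`, the `.bak1`, `.bak2`, ... naming scheme and the default of three copies are all assumptions made for the example.

```python
import shutil
from pathlib import Path


def rotate_backups(path, keep=3):
    """Copy `path` to `<name>.bak1`, shifting older backups up by one.

    After the call, .bak1 is the newest copy and .bak<keep> the oldest;
    the previous oldest copy is overwritten.
    """
    path = Path(path)
    # Shift existing backups: .bak2 -> .bak3, then .bak1 -> .bak2, and so on.
    for n in range(keep - 1, 0, -1):
        older = path.with_name(f"{path.name}.bak{n}")
        if older.exists():
            older.replace(path.with_name(f"{path.name}.bak{n + 1}"))
    shutil.copy2(path, path.with_name(f"{path.name}.bak1"))
```

Calling `rotate_backups("mybook.txt")` at the end of each working session would leave the last three saved states side by side with the working file.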
As I said earlier, I do have a checklist, and it could help if PG (that includes me, as PG is "us") provided a downloadable list of things which need to be done to get an etext posted, e.g. copyright approval, scanning, editing, proofing, placing relevant information at the beginning of the etext, etc. All the information is there already, it just needs bringing together into one document.
Obviously it depends on the number of pages, efficiency of the scanner and the number of hours one puts in. The two volumes of Sturt mentioned above probably took me six months, but I was doing many other things in the meantime. To scan in and edit, say, "The Prophet" by Kahlil Gibran would only take a fraction of that time as it is quite thin and easy to read. If one were concerned about getting an idea of the time it would take to complete an etext, I would suggest that he/she do a little casual proofing at the "Distributed Proofreaders" site first, to get an idea of what is involved.
I generally work alone; however, my wife will proof sometimes. She has become interested in the book that I am working on at present and is waiting for me to supply her with more pages. When I was getting started, a new volunteer agreed to proof something for me (she approached me) but then she never did any of it and didn't even e-mail me to advise that she had changed her mind. Editing and proofing is not for everybody and one needs to find out if one likes doing it. However, courtesy costs nothing.
All of the above at different times. I am not an avid television watcher and would rather do some "work" (or should I say "pleasure") for PG much of the time.
Because I have converted many books from work already on the internet, I have covered quite a range, though I haven't actually scanned and proofed too many books. Those that I have done have been Australian historical works. But I have rounded up books on philosophy, aboriginal legends, and several novels. Since many internet sites come and go, I am interested in "grabbing" etexts and posting them at PG in case the site disappears from the internet. It has become a pastime in itself. I recently discovered "South Wind" by Norman Douglas, a book which caused quite a sensation when it was first published because it portrayed a bohemian lifestyle. Ironically, I used to have the book in my home library, but dispensed with it when I needed space. Now it is at PG and I can get it whenever I want it.
The democratic, helpful, friendly approach of all the people involved is one of the things I like best. I have "met" so many wonderful people, without having to "live" with them, if you know what I mean. Not long after I started associating with PG, Michael Hart posted an e-mail to the volunteer discussion group, advising of the death of a long-time volunteer. It seemed like she had been one of the "family".
One really needs to be indifferent to praise and the prospect of reward to start volunteering for PG. There is certainly no money in it. However, one quickly finds that there is a community of people out there with a common interest, and with the same outlook and the same interest in doing a job well, without tangible reward. There is no lack of praise though, and one soon finds that one is not indifferent to it.
There isn't much that I don't like. Nothing worth mentioning, anyway.
There are a few things; however, since I don't know all the reasons for some things being done the way they are, and because everything is done by volunteers anyway, I wouldn't like to canvass them here. To have produced nearly 5,000 etexts over more than 30 years is testament to the fact that most things are being done "right".
I would spend some time with him/her and work through some of the issues. I know that I would have benefited from that approach. I would gradually introduce him/her to the different issues which need to be addressed, find out exactly what his/her expectations were, and try to help in fulfilling them.
Much the same as it is now, I hope. After all, the goal will continue to be to provide "fine literature digitally re-published". Though I expect that, like other organisations, it will continue to evolve in response to new challenges and opportunities. Ten years ago, who would have thought that there would be 5,000 etexts posted; that there would be volunteers operating an online proofreading site; and that there would be a volunteer writing free software to read PG etexts? The rapid growth of PG over the last few years will present many challenges for the future.
Writing of etext readers, I am reminded that I recently joked to a volunteer that I wanted him to write software for reading etexts, whereby a hologram would appear on the inside of my eyelids so that I could read etexts with my eyes closed. Who knows, it might be possible. However, whatever advances in technology occur over the next ten years, one thing is certain: the work of all the volunteers to date will ensure that there is an amazing library of ebooks available covering creative works by some of the greatest minds who have ever lived. Future readers of PG ebooks will have been given a wonderful gift by the many volunteers who have contributed to PG over the decades.
Project Gutenberg of Australia
On the wall in a colleague's office was pinned a piece of paper on which was written a quotation. I don't recall now what it was and the colleague has been gone for some time and has taken the paper with him. However, under the quotation the author was acknowledged as "Prince Machiavelli". I had a vague idea that the quote actually came from "The Prince" by Nicolo Machiavelli, and wondered how I could satisfy my curiosity. Then I remembered reading about Project Gutenberg and decided to see if the book was posted on the PG site, though I didn't really expect that it would be. Needless to say, the etext WAS there and I was able to download it and read it in its entirety, due to the time spent by John Bickers and Bonnie Sala (their names appear at the beginning of the etext) in preparing it for PG. Interestingly, there were other works by Machiavelli there, which I hope to get back to one day.
Later, when I e-mailed PG and expressed an interest in volunteering I was, because I said that I was Australian, referred to Sue Asscher, the Australian Production Director for PG. Sue asked me to proofread "A Vindication of the Rights of Woman" by Mary Wollstonecraft. Also, about this time, a journalist had contacted Sue with regard to a story being prepared for PG. He wanted to contact some volunteers to ask why they were interested in PG. Sue referred the journalist to me, with my permission of course, and one of his first questions was "Is there much Australian content on PG?" After I had checked the PG etext list I could only reply "not much".
So I decided to start creating etexts by Australian authors, for PG. Sue Asscher pointed out that there were many eligible Australian works already in the public domain as etexts, so I started rounding up etexts and matching them with books which had been published before 1923, so that they could be posted at PG. Then I started creating etexts myself, for works I could not find already on the internet. My sister had given me, many years ago, a book by Geoffrey Dutton titled "Australia's Greatest Books", so I decided to start working my way through the eligible titles from the list of about one hundred books reviewed by Dutton. I had already found a number of them on the internet and some were already at PG. But there were still a "few" to be done. There still ARE a few to be done, if anyone is interested in helping.
Then Sue Asscher again had a hand in setting the direction I would take by asking me to proof an etext of "Animal Farm" by George Orwell, whose work had recently entered the public domain in Australia. We didn't know where we would post it, as it is not in the public domain in the US, but I agreed to proof it as I had read it many years ago and enjoyed it.
About this time, I also decided to make up a personal web site. As I was a software developer, people were always asking me about the internet and web sites, in the mistaken belief that I knew ALL about computers. I decided to get an idea of how web page design and web site management worked by creating a site that listed all of the "Australian" content at PG. When I couldn't find anywhere to put the Orwell, which I had recently proofed, I decided to create a page on my site for etexts in the public domain in Australia, so that Australians, and internet users in other countries with similar copyright laws, could read and/or download them.
Michael Hart, the founder of PG, was quick to interest me in creating an "official" PG site in Australia. After registering a business name, getting a domain name and finding a sponsor to host the site, Project Gutenberg of Australia was up and running.
It all happened very quickly, and as with many things which happen in one's life, it all seems to have come about by serendipity. Even the site's motto "A treasure-trove of literature" was stumbled upon by chance when I looked up, in connection with another unrelated matter, the word "treasure-trove" in a dictionary, to ascertain if the word was hyphenated. Imagine my surprise to find treasure-trove defined as "treasure found hidden with no evidence of ownership". That EXACTLY defined the literature found on PG.
My own association with PG resulted from the culmination of a life-long interest in books and literature and an equally strong interest in computers. Every volunteer brings his/her own particular interests and skills to PG and that, together with the democratic approach taken by the small executive team, is what makes PG the strong, co-operative organisation that it is. My interests and skills, and a generous dose of serendipity, led to the creation of Project Gutenberg of Australia.
I discovered Project Gutenberg in 1996 and immediately wanted to help because I love books and wanted everyone to have access to all the wonderful books that, even today with Internet searching, are difficult to find or very expensive when you do locate them.
I began by proofing a few works but what I really wanted to do was share my Balzac collection with other fans. I discovered Balzac in the 1970s and recall my frustrations in trying to find more than a dozen stories of the over one hundred Balzac wrote. It was over a decade before my husband discovered a complete set at a used bookstore while on vacation. Unfortunately, not everyone is so lucky.
With the first few stories I typed for Project Gutenberg I worried about everything: should I correct a type-setting error, leave it, footnote it, etc. This took a long time and involved a lot of correspondence. Now, my idea is to make the text as readable as possible. For me that means correcting type-setting errors I notice. Others prefer to leave them intact. In the end, I don't believe the readers care. I have found them generally to be very grateful to have found some treasure they had been seeking. In some cases of an author's more obscure works, they didn't even know the book existed, a rare find indeed for them.
It is so satisfying to receive an e-mail from someone thanking you for all your hard work. Most readers don't take the time to write but true fans often do and they make it all worthwhile. I have even met people in this way who went on to become Project Gutenberg volunteers themselves because they wanted to give something back to the Project from which they had received so many pleasurable hours.
Gardner Buchanan
First of all, there is the issue of what texts I choose to do. For me, this is fairly simple. I'm a bit of a small-time book collector already, and have a personal theme: "Canadian English Literature" and "Canadian English-Language History". I have no trouble whatsoever in coming up with submissible editions of works that fit this theme somehow. Nevertheless there are specific authors and works that I'm not having luck with, so I'm still making the rounds of the used book shops regularly and picking up all sorts of stuff.
Eligible volumes have typically cost me $10.00-$150.00 for a collectable edition, or $0.50-$15.00 for a recent paperback edition or garage-sale item. I paid $0.50 for an eligible, but not very collectible, copy of Glengarry School Days by Ralph Connor at a garage sale. As it turns out someone has beaten me to it: it has been in the collection since 2001. Sometimes if I'm contemplating picking up a more expensive book that I don't already have a personal interest in, I'll go back and double-check The Online Books page to see if someone has already submitted the book.
Another way I obtain texts is from the Early Canadiana Online archive. They host page images of quite a large collection of old books written in or about Canada, or written by Canadians. The page images are reasonably well suited to OCR.
I tend to produce E-texts two different ways. One way is to submit page images to Charles Franks, who runs Distributed Proofers, and let him worry about bulk-OCR'ing. I then manage the distributed proofing, which is a fairly low-effort business. The other way is to scan, OCR and proof all by myself. I'm currently averaging two of my own projects to every Distributed Proofer one.
I have a very slow parallel-port scanner, a UMAX Astra 2000P. It sucks mightily. I'd rate it a 2 out of 5 if it weren't acting up: it creates a black bar across the page, part way along, so I have to scan books a certain way around to avoid having the bar land in the text. As it sits now, it's in 0.5-1 territory. It is glacially slow at the best of times, and due to being a parallel port model, locks up my whole computer during the scan.
Nevertheless, it is completely adequate to my needs for PG work. I've scanned more than a dozen books on it, and it's done yeoman service, despite its warts. Scanners like this one can be picked up used for $30, and are worth the money.
The way I work when I'm producing a book myself is scanning and proofing page by page. I do the scans two-pages-up, then OCR, proof and copy the pages to a working document, before going on to scan the next pair of pages.
My scanner came with two OCR "packages": Omnipage something-or-other, which I was never able to install, and Recognita Standard 3.2.7. I use Recognita, and for the 300dpi scans I do, it is adequately fast and accurate. It is a no-frills package, and DOES make many mistakes, but it is entirely useable for my purposes. I rate it 2 of 5.
I've used the Abbyy FineReader 5.0 try & buy. This is a magnificent OCR system. It handles huge batches and is fast and astoundingly accurate. I rate it 5 out of 5. Unfortunately it costs about $million to patriate a web-bought item into Canada, and while priced at a very reasonable US$100.00, it would cost me about CAN$600 after exchange-rate, brokerage fees, shipping, more fees, taxes, service charges and more taxes (on the fees).
I could buy Omnipage off-the-shelf here, but frankly if I can't get Abbyy, I'll stick with Recognita.
As I scan each page, I paste it into Windows-95 Wordpad. Sometimes I also do some proofing in Wordpad, but mainly I proof, fix quotes, M-dashes and paragraph breaks in the OCR program before copying to Wordpad. I like to keep the page boundaries intact, and I mark them in my Wordpad document like this:
kjdk ldjd ll;llkj dklj dklj
kjdk ljd llllkj klj dklj
page 354
kjdk ldjd lll;;llkj dklj dklj
kjdk ldd lll;;llkj dklj dklj
kjdk ldjd ll;llkj dklj dklj
kjdk ljd llllkj klj dklj
page 355
kjdk ldd lll;;llkj dklj dklj
kjdk ldjd ll;llkj dklj dklj
kjdk ldd lll;;llkj dklj dklj
kjdk ljd llllkj klj dklj
:
:
At this point I also fix up hyphenated words that straddle page-boundaries. I note paragraphs that start in a new page and mark them with <p>, and I note indented or block-quoted sections and mark these with <in>..</in>. This helps when I go back to format it since I can easily see where the special cases are.
Wordpad handles large documents reasonably well and will grok UNIX files (i.e. <LF> only, not <CR><LF>). For this it rates 3.
When the whole text is assembled, whether by myself or by Distributed Proofers, I use about the same process for formatting and final proofing.
I use MS-Word 95 to do a spellcheck. This I rate 3 out of 5. I do a select-all, and set the language appropriately - for me, usually UK rather than American English. I wish I had a Canadian English dictionary for Word 95, but have not needed one badly enough to actually look. Word has a pretty good spell checker and the custom dictionaries are easy to muck around with. I use a custom dictionary for any big project - I have one for Chronicles of Canada, and a different one for all the John Richardson books I've done.
At this point in my personal process, I abandon Windows and go over to
I use vi (rated 9 out of 5) to do a number of hacks. I search for and fix up hyphenations that were broken (peer- less) and such like. I also search for and fix some OCR special case errors like 'you'->'yon' and 'be'->'he'. This latter sometimes requires a while, just to step through all the "be"s and "he"s to see if they're right.
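The same two searches can be scripted. The sketch below is in Python rather than vi, purely as an illustration: the function names and the `SUSPECTS` list are made up for the example, and because a real "he" is far more common than a misread "be", every hit still has to be checked by eye against the book.

```python
import re

# Words that OCR sometimes produces in place of "you" and "be".
# This short list is an assumption for the example; extend it as needed.
SUSPECTS = ("yon", "he")


def find_broken_hyphens(text):
    """Return (line_number, fragment) for patterns like 'peer- less'."""
    hits = []
    for number, line in enumerate(text.splitlines(), 1):
        for match in re.finditer(r"\w+- \w+", line):
            hits.append((number, match.group(0)))
    return hits


def rejoin_broken_hyphens(text):
    """Join 'peer- less' into 'peerless'. Review the hits first: a phrase
    such as 'two- and three-page' would be joined wrongly."""
    return re.sub(r"(\w)- (\w)", r"\1\2", text)


def find_suspects(text, words=SUSPECTS):
    """Return (line_number, word, line) for each whole-word suspect."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE
    )
    hits = []
    for number, line in enumerate(text.splitlines(), 1):
        for match in pattern.finditer(line):
            hits.append((number, match.group(1).lower(), line.strip()))
    return hits
```

Stepping through the output of `find_suspects` is the scripted counterpart of stepping through each match in the editor.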
Still in vi, I next use some incantations to run the UNIX 'fmt' command on each paragraph to get it reformatted. I use:
fmt -55 60
Fmt gets a 3 out of 5 for what I need it for. It double spaces after sentences, which, although it is probably the right thing to do, is not the PG convention (for me at least). It also adds a space when joining lines with an M-dash. I go back and fix both of these using vi. I take into account the <in></in> tags and manually format accordingly at this point.
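The two clean-ups after fmt could be done with a pair of substitutions. This is a sketch in Python, not the vi commands actually used; the function name `tidy_after_fmt` is invented, and it assumes the M-dash is typed as two hyphens ("--") in the working text.

```python
import re


def tidy_after_fmt(text):
    """Undo two side effects of re-wrapping with fmt."""
    # One space, not two, after sentence-ending punctuation
    # (optionally followed by a closing quote or bracket).
    text = re.sub(r"([.!?][\"')\]]?) {2,}(?=\S)", r"\1 ", text)
    # Remove a space added on either side of a two-hyphen dash
    # that sits between two words.
    text = re.sub(r"(?<=\w) ?-- ?(?=\w)", "--", text)
    return text
```

Dashes that sit next to quotation marks or other punctuation are left alone by this pattern and would still need fixing by hand.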
As I reformat, I give the text its final proofing. I'll have the original text in-hand at this point, and will use the page markers (remember them?) to figure out where I am. As I reformat, I delete the page markers and other markup. When I'm finished this step, the book is almost done.
Next, I use Gutcheck 0.2 (5 of 5, for intended purpose - way to go Jim!) to check for all the things it checks for. At this point I usually get something like 50 hits, of which 30 are real. I'm then back in vi, and fix up all those problems. Finally, I'm done.
As I go along, I tend to keep various versions of the document. I'm at version 27 of 'The Imperialist' right now. Each scanning, editing, spell checking or whatever type of session gets a new version: imperialist_12.txt, imperialist_13.txt,… At various times I might find it useful to use 'wc', 'grep' and 'diff' to figure out what is going on, where a word appears or whether I deleted something I didn't mean to.
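Where 'wc' and 'diff' are not to hand, the same questions can be asked with Python's standard library. This is a sketch only, with made-up function names; it works on the text of two versions that have already been read into strings.

```python
import difflib


def word_count(text):
    """Rough equivalent of 'wc -w': count whitespace-separated words."""
    return len(text.split())


def removed_lines(old_text, new_text):
    """Lines of the older version that are missing or altered in the newer one."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff prefixes lines found only in the first sequence with "- ".
    return [line[2:] for line in diff if line.startswith("- ")]
```

A sudden drop in `word_count` between two versions, or an unexpected entry from `removed_lines`, is the cue to look for something deleted by accident.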
I mentioned above that I sometimes work from page images that I obtain from the web. There are several archives around that hold eligible materials as page images that you can easily download and OCR. I personally have worked mainly with the Early Canadiana Online archive.
After a bit of poking around with the web interface to this collection, I have been able to work out how the individual pages are numbered and organized. I have written some shell scripts that I can use to fetch all the pages of a volume and convert them from GIF to TIFF format. Harvesting a 200 page book takes a few hours.
Once I have all the pages, I have to do some work with an image editor to get them ready for OCR. I use Corel PhotoPaint 7 to crop each image to just the text area and to remove the black bands at the sides due to the spine or whatever. The page images are often made from microfiche, and dust marks are common as well. These I can sometimes edit out with PhotoPaint.
Because some of the page images, or certain sections thereof, can be completely unreadable, I often find myself either tracking down a modern edition or visiting a local university library to find a copy of the book to look up a few paragraphs or passages that are not readable in the images. Even having to do this, I find that the capture of images from the archive is still a big time saver, and allows me access to an edition that would otherwise be totally inaccessible.
Having gathered the images and prepared them for OCR, I next submit them to Charles at Distributed Proofers, or handle them myself, using the same process as if I were scanning them.
I've done several books using Charles Franks' most excellent Distributed Proofers web application. I tend to choose DP when I don't have the personal time to read and proof a volume myself, or when the poor quality of the text defies the ability of my (not very good) OCR package.
When scanning for DP, I still scan images two-up. I then have a collection of shell scripts that cut the page images in half to produce single-page TIFF files. I then use a manual procedure with Corel PhotoPaint 7 - if required - to fix up skewed pages or ones with black margins. For the most part, page images that I scan myself are registered exactly enough in my scan area that the page images don't need to be edited.
Page images that I've harvested from a web archive do have to be fixed up before they can be used by DP.
Charles, I believe, prefers that as a project manager I would deal with my own OCR. He has, however, been kind enough to run several batches of page images through his OCR setup for me to good effect. I believe he uses Abbyy Finereader, and my procedure for submitting pages to Charles is to run a subset of the pages I intend to send him through a demo copy of Finereader to make sure that the results are vaguely acceptable. If everything looks good, off it goes.
When the project has run its course with DP, I download the completed text and proceed to format and re-proof it, for the most part, as if I'd scanned and OCR'd it myself.
Jim Tinsley
How I (eventually) got started.
Five years ago, I was the most clueless newbie ever to try volunteering for PG. If you're feeling lost about how to help PG, you can be sure that you're not alone! And if I can write PG's first complete FAQ after my bad start, you can surely do better! :-)
Back in 1997, the web site existed, but there were no FAQs, no Volunteers' Board, no gutvol-d, no Distributed Proofing sites. I started by making a donation and e-mailing Michael, suggesting that I could help out with small jobs, or programming. I didn't get any, and I had no idea what, if anything, I could usefully do by myself.
I looked up the in-progress list at the time, and e-mailed a few people who were listed as working on books, offering to help. None of them were still working on the books. (We no longer show people's e-mail addresses on the InProg list.) I still had no idea how to get eligible books, no scanner, and no idea how to approach producing an etext.
I subscribed to the monthly Newsletter, and just read it for a year. In a "Project Gutenberg Needs YOU" edition, Dianne Bean, the U.S. Director of Production at the time, was given as a contact. I e-mailed her, and finally things started happening.
She sent me a short piece to second-proof, and explained that I should just fix whatever needed fixing. I returned it, and she introduced me to Bill Brewer, who was, at the time, scanning Wisters like they were going out of style. He and I formed a scanning/proofing team for a while.
How I began producing, and my problems with scanning and OCR.
I had some ideas for books I wanted to produce, but I couldn't find them locally, so I turned to the Internet, and discovered how easy it is to find and buy used books on-line.
I bought a HP flatbed scanner. It came with freebie OCR software, "PrecisionScan", with images and OCR all in the same interface.
I scanned my first book, which fortunately had large, clear text, and the OCR made a reasonable job of it, according to my standards at the time, which were that getting any text at all without typing was a form of magic :-)
I now know that I could have made a better job of it if I had pressed the spine down hard, either closed the top to keep out ambient light or darkened the room, and made each scan a bit more exact. I'm much better at flatbed scanning now.
My PrecisionScan software did recognize two facing pages, and dealt with them correctly, though IIRC it put some garbage characters between the pages that I had to remove by hand.
It did require a lot of editing, though, and recently I've gone back over my original text and found lots of mistakes, partly because of the scan and partly because of my inexperience.
Throughout the editing, I kept having to make formatting decisions in a vacuum, reinventing wheels and applying rules from a HowTo. Now, having read and formatted and proofed and produced so many texts, I just know how to format a text without thinking, and just reading or even skimming a few texts before producing my own would have given me a lot of background and saved a lot of time. I had proofed several books, but never thought to look closely at formatting decisions.
That text took me a month of working most evenings, and a lot ofsticktoitiveness. I can really appreciate the effort that a volunteerhas to put in to produce their first text by casting my mind back tothat month. I think it's the not-quite-knowing-what-you're-doingthat's the worst part. I remember being soooo relieved when I sent itoff for second proofing.
The guy who took it for second proofing didn't get back to me for amonth, and then said that he wasn't going to do it. This wasdisappointing. I sent it to another guy for proofing. He came backafter a few weeks asking some questions. I answered them. After a fewmore weeks, I followed up with another e-mail. No answer. A few weeksafter that, I gave up, and just submitted the file for posting.
The next book I produced didn't have such nice, clear, large type, andthe scan was what I would today call abysmal. I'd guess that I retypeda quarter of the book. The less said about that one, the better.
My third book just would not OCR sensibly. The print was very small
and faint, and the OCR produced gibberish. Even with my low standards,
I couldn't kid myself that this was working. I tried 400dpi, 600dpi.
No dice. I might get 10 complete words on a page.
It was at this point that I bought TextBridge. I really had no ideaabout the difference between the freebie OCR programs they give awaywith scanners and a genuine commercial product, but I was trying indesperation to get something different that would read this image.
TextBridge was an eye-opener for me. It still didn't make a good job of the bad images, but it made a decent shot at maybe half of them. Having bought it, I also tried it on the two books I had worked so hard at before, and it gave hugely improved results. The book that had only been about 75% OCRed became 100%, but with some errors. I cursed the time I had wasted making up for the deficiencies of my freebie package.
Since then, I've kept upgrading my TextBridge (I think I started onversion 8, now on Millennium) and bought OmniPage and Abbyy as well. Imostly use Abbyy 6 now.
Last time I looked, there were downloadable trials of Abbyy,
TextBridge, and OmniPage. Big downloads though.
Last year, I got a new Epson Perfection 1640 scanner to replace my oldHP Scanjet. I never had any complaint about the Scanjet itself—itserved me well—but the new Epson is faster, has higher resolution,and ADF.
Even better, I now know how to scan. I know how to process 200+ pages an hour while scanning the book flat, two pages at a time. I know how to adjust the settings to scan only the area covered by the book. I try different settings for each new book to see what works.
So much for scanning and OCR. I was a very slow learner in thisarea.
How I prepare a text now.
I was never quite so bad on the proofing end of things. As an editor, I use Brief in DOS and Crisp (a Brief clone) on Windows. (I mostly use vi on *nix, but I do very little-to-no PG work on *nix apart from an occasional scripting thing that I can do in one line of Perl, but would be annoying on MS.)
Now, I'm all for tolerance and equality and respect for the faiths ofother people, :-) but I gotta say that for someone who has used apowerful editor, editing with Word or any standard Windows editor islike scratching your nose with a rake.
When I first get the text off the OCR, I have many pages with breaks between them, and usually no line-spacing between paragraphs, but each paragraph indented.
I whip out Crisp, and run a macro to search and destroy all page-breaks and page-numbers and blank lines between, and then another to put line breaks between paragraphs and unindent them. Since I watch this process carefully to avoid messing up quotations, it takes me maybe 15 minutes.
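For readers without Crisp, the two macro passes can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not the actual macros: it assumes page breaks arrive as form feeds, that a line holding nothing but one to four digits is a page number, and that every new paragraph starts with an indented line. The function names are made up for illustration.

```python
import re

def strip_page_furniture(text):
    """Drop form feeds, bare page-number lines and blank lines."""
    kept = []
    # Assumption: the OCR marks page breaks with form-feed characters.
    for line in text.replace("\f", "\n").splitlines():
        stripped = line.strip()
        if not stripped or re.fullmatch(r"\d{1,4}", stripped):
            continue  # blank line, or a line holding only a page number
        kept.append(line.rstrip())
    return kept

def separate_paragraphs(lines):
    """Start a new paragraph at each indented line, then unindent it."""
    out = []
    for line in lines:
        if line[:1] in (" ", "\t") and out:
            out.append("")  # blank line before each new paragraph
        out.append(line.strip())
    return "\n".join(out) + "\n"

raw = "   First para starts\nand continues\n\f12\n\nhere.\n   Second para.\n"
print(separate_paragraphs(strip_page_furniture(raw)))
```

A pass like this cannot tell a page number from a year printed on a line of its own, or an indented quotation from a new paragraph, which is why the process still needs watching.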
Now I have a basically formatted text. The line-lengths are usually too short, and there are hyphenated words at line-ends, some of which I will need to rejoin and some of which I must leave hyphenated. Another macro fixes up the hyphenation. At each hyphen, I just decide whether to rejoin or not. Say 20 minutes, max. Then I rewrap. Another 15 minutes.
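The hyphenation pass might look something like the sketch below. It is not the Crisp macro; `keep_hyphen` is a hypothetical callback standing in for the human decision made at each line-end hyphen, and the small word list in the usage lines is invented for the example.

```python
import re

def rejoin_hyphens(paragraph, keep_hyphen):
    """Join words split by a single hyphen at a line end.

    keep_hyphen(left, right) stands in for the human decision: return True
    to keep the hyphen ("well-known"), False to drop it ("something").
    """
    def decide(match):
        left, right = match.group(1), match.group(2)
        joiner = "-" if keep_hyphen(left, right) else ""
        return left + joiner + right

    # A single hyphen directly before the newline; a double dash at a
    # line end is left alone because the hyphen must follow a word.
    return re.sub(r"(\w+)-\n(\w+)", decide, paragraph)

# Hypothetical decision rule: keep the hyphen only for listed compounds.
compounds = {"well-known"}
keep = lambda left, right: f"{left}-{right}" in compounds

text = "He was a well-\nknown man and some-\nthing of a wit."
print(rejoin_hyphens(text, keep))
```

The joined word ends up on the upper line, so the lines come out ragged; that does not matter when the text is rewrapped afterwards.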
So in maybe an hour I have a proofable text, and the really nice partabout it is that I've had a flying tour of the text three times, soI've already noticed any peculiarities.
If I've noticed any unusual features like letters or poems that needspecial treatment, I do it at this point.
To prepare the text for proofing, I just flick through it in Crisp with spellquery on, in US or UK English as needed. This puts a red line under queried words, just as Word does. I spend maybe 5 or 10 seconds per 50-line screenful. I don't expect to catch them all; this is just a quick pass to thin 'em out. I may also catch some formatting issues, but I'm not looking for them.
Now I proofread.
I've tried lots of ways of proofreading. Often it's just sitting at the screen. Sometimes I print out the text or parts of it, and mark errata with a pen. Occasionally, I get the computer to read the text to me, and I follow along in the book, noting any errors. (This is good when you want very high accuracy - do a replace of ":" with "colon", "," with "comma" and so forth before you start the reader.) Recently, I've tried reading the text on a PDA, and bookmarking the problems.
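The punctuation replacement for the read-aloud method is easy to script. A minimal sketch: only "colon" and "comma" come from the text above; the semicolon and quote entries are my own guesses at what "and so forth" might cover, and the function name is invented.

```python
# "," and ":" are from the text; the other two entries are assumptions.
SPOKEN_NAMES = {",": "comma", ":": "colon", ";": "semicolon", '"': "quote"}

def speak_punctuation(text, names=SPOKEN_NAMES):
    """Replace punctuation marks with words so a text reader says them aloud."""
    for mark, name in names.items():
        text = text.replace(mark, f" {name} ")
    # Collapse the extra spaces (and line breaks) left by the replacements.
    return " ".join(text.split())

print(speak_punctuation("Well, yes: go."))
```

This should be run on a copy of the file kept only for listening, never on the text being corrected.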
Whatever way I do it, it takes time. I'm better at it now than I was,but I still tend to miss things like he/be.
Some people swear by particular fonts for proofreading, saying thatfont X shows "1"/"l" differences more clearly than font Y. I just useArial or Verdana for printouts and Courier or Fixedsys on screen; thespecial fonts don't seem to make a difference to me.
So I've finished proofing and made my corrections. Now I leave it sit for a few days. I need to get my mind off it, so that I won't miss the same errors I missed before.
When I come back to it, I'm looking at what software people would calla Release Candidate, and something changes in my head . . . I'mthinking of it in a different mode, not as a work-in-progress, but asa potential finished project. This makes me much more critical, andless willing to accept mistakes.
Usually there are dash-problems to fix up (emdashes as " - " instead of "--") and other minor stuff like that. I do global searches for " -" and "- " and "...".
I do a quick skim through it, sampling paragraphs here and there as a test of its quality. I make any formatting adjustments, like chapter line spacing or indenting letters, that I might notice.
Then I run gutcheck. Gutcheck is a little program I wrote / write / will-write over the years that complains about common problems in a PG text . . . bad line-lengths, common typos, numbers within words (like the "1" in "wor1d"), unbalanced quotations, spaced or unspaced punctuation, non-ASCII characters. I fix the problems that Gutcheck points out.
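To give a feel for the kind of complaint involved, here is a toy checker in Python. It is emphatically not gutcheck and shares no code or rules with it; it only illustrates a handful of the checks named above, and the 75-character limit, the messages and the function name are all assumptions of mine.

```python
import re

def check_text(text, max_len=75):
    """Return sorted (line_number, message) pairs for a few common problems."""
    problems = []
    para_start, quotes = None, 0
    # The extra "" closes the last paragraph.
    for n, line in enumerate(text.splitlines() + [""], start=1):
        if not line.strip():
            # End of paragraph: an odd number of quote marks is suspicious.
            if para_start is not None and quotes % 2:
                problems.append((para_start, "unbalanced double quotes in paragraph"))
            para_start, quotes = None, 0
            continue
        if para_start is None:
            para_start = n
        quotes += line.count('"')
        if len(line) > max_len:
            problems.append((n, "long line"))
        if re.search(r"[A-Za-z]\d[A-Za-z]", line):
            problems.append((n, "digit inside a word"))
        if re.search(r"\s[,;:!?]", line):
            problems.append((n, "space before punctuation"))
        if re.search(r"[^\x00-\x7f]", line):
            problems.append((n, "non-ASCII character"))
    return sorted(problems)

sample = 'The wor1d is round !\n"Unclosed quote here\n\nA clean line.\n'
for line_number, message in check_text(sample):
    print(line_number, message)
```

Every report is only a query for a human to look at. A quotation that legitimately runs over several paragraphs, for instance, will be flagged as unbalanced.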
Again, I switch spellquery on in Crisp, and skim through, more slowlythan the first time. This time, I'm looking for anything thatshouldn't be in a PG text.
I run gutcheck again, just to be sure.
And off it goes!
The Posting Team
For a couple of years, I churned out a text regularly every two months,spending about 40 hours on each, and took on some occasional proofing,but after I became moderator of the Volunteers' Board, people startedreferring texts to me for checking or reformatting. This took up moreand more of my available PG time, and my own production slowedaccordingly.
It was in response to these requests that I wrote gutcheck, whichembodies all the standard non-spelling checks I would run on a file.Gutcheck allowed me to spend less time on each text, but still feelreasonably sure that there was nothing glaringly wrong with it.
When Michael formed the Posting Team last year, I volunteered, and itwas a natural progression for me, since I was already used to doing alot of last-minute work on texts.
I found posting to be disorienting and confusing at first; peoplebombard you with half-scraps of information about books to be posted;some texts need serious work; some texts haven't been cleared, andneed to be referred back; some people want special treatment fortheir texts, which may conflict either with my views or with PGprecedents, or both; there are lots of questions. But like everyother new job, it just takes time to learn the ropes.
The actual process of posting now takes very little time: I can gothrough the necessary steps in 3-5 minutes. But posters are the lastline of defense against errors, and even the most careful volunteersmake them (and yes, we do too!). It takes a minimum of 15 minutes torun standard checks on a perfectly clean file, and it can take severalhours to fix up a file that needs help. On average, it takes me aboutan hour to do my reasonable best for every text submitted.
Apart from posting proper, there are a lot of queries to be answered,many of which I hope I've dealt with in this FAQ, "special cases"that eat as much time as I'm willing to give them, corrections to bemade to existing texts, and interminable debates about whether PGshould do this or that.
Now that the learning curve is past, the problem with posting isthat it generates a lot of e-mail and discussion, and eats a lotof time, and is a 7-day-a-week commitment. Having posted over athousand texts, I'm now particularly interested in ways to improvetext quality.
John Mamoun
How to create an e-text efficiently or automatically is an interesting logistical problem. Here is my procedure, which I recently used to make an e-text in about a week, with maybe 6 man-hours of work on my part:
I take the book, and use an x-acto blade to cut out all of the pages. I then feed the pages into an HP 4C scanner with an automatic document feeder accessory attachment that I got from e-bay for $200. I feed it up to 50 pages at a time, and it automatically scans them in.
I work the scanner using software called scan2000 (30-day shareware trial period, $50 to register). This program automatically works with the scanner to save each image as a CCITT4 standard format TIFF file. Most importantly, it automatically numbers each page, starting with an initial value you specify (typically 001.tif) and increasing the number of the file name by an increment you specify (typically by 2 pages, since you scan double-sided pages; you scan the odds first, then flip the pages over and scan the evens, but you want the page numbers in order, right?). So the scanner outputs, say, 001.tif, 003.tif, 005.tif, etc., then you flip the pages over and re-feed them into the scanner; the even pages are saved as 002.tif, 004.tif, etc., after you tell the program to begin the first of the even page files with 002.tif.
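The numbering scheme is simple enough to show in a few lines of Python. This is only an illustration of the start-value-plus-increment idea, not of how scan2000 works inside; the function name and defaults are invented.

```python
def page_filenames(count, start=1, step=2, ext="tif"):
    """Names for `count` scans, numbered from `start` in steps of `step`."""
    return [f"{start + i * step:03d}.{ext}" for i in range(count)]

odds = page_filenames(3, start=1)   # first pass through the feeder
evens = page_filenames(3, start=2)  # second pass, after flipping the stack
print(odds)
print(evens)
print(sorted(odds + evens))  # zero-padded names sort into page order
```

One thing the sketch glosses over: if flipping the stack means the even sides feed through in reverse order, the second pass would have to count downward instead. The procedure above does not say which way it goes.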
So now I have a bunch of consecutively numbered CCITT4 TIFF files. At this point, I could use a freeware program called cc42 (search for it) to combine all of the sequentially numbered CCITT4 TIF files into a single PDF file with the pages in order.
Or, if making e-texts, not PDF files, I OCR the pages and save them as corresponding pages like 001.txt, 002.txt, etc. I also use Paint Shop Pro (shareware 30 day trial) to batch-convert the TIFF files into GIF file format. I can then upload the GIF files and the correspondingly numbered text files to the Distributed Proofreaders page to have them rapidly proofread by numerous proofreaders, who finish the task at a rate of 50-100 pages a day per book, very roughly speaking. When done, I then download the text files as a single text file combining all of the files. The upload function on the DP site is tedious, requiring one to upload each file one-by-one, but I spoke to the webmaster recently, and he said there are, with special arrangements, ways to FTP them or even mail them to him on CD.
Now, hard returns. It was once a grave problem to fix hard returns so that the text came out at 65 characters per line. Then I got a freeware program called Clipcase. With Clipcase, you select a body of text (about 20 pages or so; any more, and the program crashes) in your word processor, copy the text to the clipboard, then load up Clipcase, paste the text into the Clipcase window, then process the text.
When this happens, all of the hard carriage returns within the text are eliminated, EXCEPT for returns between paragraphs. Then, you select the text, copy it, and paste it into any word processor to process it. I use Microsoft Word. After pasting all of the text into it, I select all of the text, choose Courier New font, 10 point size, and set the margins at 5.5 inches. With this setup, when the text is saved as "Text with layout," the resultant text is 65 characters per line, every line. Setting hard returns is automatic.
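The same two steps, stripping hard returns inside paragraphs and then rewrapping to a fixed width, can also be sketched with Python's standard textwrap module. This swaps in a different tool for the Clipcase-plus-Word route described here and is not a description of either program; it assumes paragraphs are separated by blank lines, and the function names are invented.

```python
import re
import textwrap

def unwrap(text):
    """Remove hard returns inside paragraphs; return one string per paragraph."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # split()/join also collapses runs of spaces, e.g. two spaces after a period.
    return [" ".join(p.split()) for p in paragraphs]

def rewrap(text, width=65):
    """Rewrap every paragraph so that no line is longer than `width`."""
    wrapped = (textwrap.fill(p, width=width) for p in unwrap(text))
    return "\n\n".join(wrapped) + "\n"

sample = "one two\nthree\n\nfour\nfive\n"
print(rewrap(sample, width=9))
```

Unlike the Word setup, this gives lines of at most 65 characters rather than a fixed layout, and it would flatten poetry or tables, so passages like that need to be protected first.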
Then I spell-check the text, and also skim through it to look for typos and "categories" of errors that tend to occur repeatedly within the text. One common error is having a single dash instead of two dashes, for example:
He lingered-slowly. as opposed to: He lingered--slowly.
Another common error is a space between a period, exclamation mark or other punctuation mark, and the letter that came before it, such as:
Hey ! instead of Hey!
or " Hey, " instead of "Hey,"
I then use the "Find/Replace" command within Microsoft Word to efficiently get rid of these. For example, I might tell it to look for ^w", where ^w means "a white space" and " is a quote. This looks for white spaces before quotes. "^w looks for white spaces after quotes. ^w! means a white space before an exclamation mark. I can also have it look for "any letter"-"any letter," so that it finds single dashes between letters, and then I can decide if I want to replace these with double dashes. By using these kinds of find/replace tricks, it becomes easier to remove typos.
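Outside Word, the same searches can be written as regular expressions. The sketch below mirrors the four searches just described and only reports hits with a little context; like Word's Find, it leaves the decision to the person. The labels, the context width and the function name are my own.

```python
import re

# Each pattern mirrors one of the Word searches described above.
SEARCHES = {
    'white space before a quote (^w")': r'\s"',
    'white space after a quote ("^w)': r'"\s',
    "white space before an exclamation mark (^w!)": r"\s!",
    "single dash between letters": r"(?<=[A-Za-z])-(?=[A-Za-z])",
}

def find_hits(text, context=10):
    """Return (label, snippet) pairs for every match of every search."""
    hits = []
    for label, pattern in SEARCHES.items():
        for match in re.finditer(pattern, text):
            start = max(0, match.start() - context)
            hits.append((label, text[start:match.end() + context]))
    return hits

for label, snippet in find_hits("Hey ! He lingered-slowly."):
    print(f"{label}: {snippet!r}")
```

The quote searches will also match every perfectly good opening or closing quote that sits next to a space, and the dash search will match ordinary hyphenated words, so these are lists to step through rather than things to replace blindly.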
When done, I save as "text with line breaks" and it is done.
That's basically my procedure. 1 week turnaround time and 6 man-hours on my part for a 190k text file…
Ken Reeder
The Story of My Life (as pertains to PG) by Ken Reeder
June, 2002
I am currently finishing up my fourth etext, with two more etexts inprocess, another seven books sitting on the shelf waiting, and a lotof additional books that I would like to do when those are done.
Sixteen months ago I was blissfully unaware of PG and of the world ofonline books. A couple of things seemed to come together to lead to myinvolvement with PG. I spent some time helping one of my sons, for aschool project, in an unsuccessful search for an online Englishtranslation of Pliny's Historia Naturalis. About a year before that Ihad been tinkering, for no particular reason, with trying to type oneof my favorite older sci-fi books into a text file. And I had beenthinking, occasionally over the course of a few years, about a seriesof books to which I was avidly devoted when I was about twelve orfourteen years old, which was widely available then but is relativelyscarce now. It was a web search on the name of that author, JosephAltsheler, which happened to lead me to some couple-year-old messageson the PG volunteers' bulletin board.
I poked around the PG web site a little and thought, hey, I think Icould be interested in this. Only a few months before I had, for noparticular reason, picked up a clearance-model parallel flatbedscanner (for which I paid $36, including shipping). The scannerpackage included some OCR software, so I already had the basics neededto scan a book to produce an etext.
So I rummaged around on the PG web site a good bit more, and lurked onthe volunteers' board, and figured out that I could find the booksthat I wanted on Ebay or ABEbooks, and bought a couple of books for$10 or $15 each. I scanned a chapter or two and tried out the OCR,which worked very well. (The OCR software that came with my scanner isTextBridge Pro, which it turns out is one of the more highly-regardedOCR packages, so I was just lucky in that respect because I had noclue. I could see that the OCR software was clearly much better thansome DOS software that I had used at work about 15 years ago.)
What appealed to me was that, firstly, it seemed like this was a worthwhile thing to do, with a big plus being that you can do the work from your own home, in your pajamas if you want, in whatever time you can spare. And I thought that, being a detail-oriented software-developer geek kind of guy, I would kind of enjoy it and also be pretty good at it - actually, I've always had an aptitude for proof-reading.
So I went ahead and mailed in a couple TP&V for copyright clearance,and set out to actually produce my first etext, a 348-page book whichI completed in about 10 weeks, start to finish.
For a book with nice clear, good-sized print, I figure that itaverages out to about 7 or 8 minutes per page to go through mycomplete production process. Some of the books that I am working on,with smaller or less-perfect print (and/or other complications) take alittle (or a lot) longer.
I feel that I've got my process pretty well set by now. I've puttogether several little home-made utility programs, written in FoxPro,which assist me. (I've put in some effort to try to adapt some ofthese for possible use by others, but the problems are that it takes alot more work to polish software to the point that I feel comfortableletting somebody else pound on it, and the scope of what I think thesoftware ought to do gets bigger every time I work on it, and it's notnearly as enjoyable - for somebody who develops software at work everyday - as producing etexts.)
My complete production process, with rough time breakdown, is asfollows:
1. Scan the book, 2 pages at a time, about 1 minute per scan (30
seconds per page). (I do not cut the pages out of the book, I
just lay it flat on the scanner and press down on the spine.)
2. Run the BMP file through TextBridge Pro, about 30 seconds per
page. (Again, when working with clear, good-sized print.) I
save the output as text with no line breaks.
3. Run a little FoxPro utility that I wrote that massages and
formats the file a little bit.
4. Do my first-pass proof-read, about 2 minutes per page, combining
the pages into chapters.
5. Run another little FoxPro utility, which checks for some things
that I might have missed during proof-reading.
6. Use MS Word to perform a spelling and grammar check, another 30
to 60 seconds per page.
7. Run another little FoxPro utility (number 3), which inserts
line breaks, then run another one (number 4) which does some more
exception-checking.
8. Do my second-pass proof-read, about 2 minutes per page.
9. Combine the chapters into one big file. Run a couple more
little FoxPro utilities (numbers 5 and 6) which do some final
formatting, checking and analysis.
10. Send the file to Jim Tinsley, who will graciously run it
through his GUTCHECK program which scans for a lot of common
errors.
11. Call it an etext and send it in for posting.
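As a rough cross-check of these figures, the timed steps in the list can be added up in a few lines of Python. Only the per-page times and the 348-page, 7-to-8-minute figures come from the text; treating the untimed steps (the FoxPro utilities, combining files) as the remainder is my own reading.

```python
# (low, high) minutes per page, taken from the timed steps in the list above.
STEP_MINUTES = {
    "scan": (0.5, 0.5),
    "OCR": (0.5, 0.5),
    "first proof-read": (2.0, 2.0),
    "spelling and grammar check": (0.5, 1.0),
    "second proof-read": (2.0, 2.0),
}

def per_page_range(steps):
    """Sum the low and high estimates over all timed steps."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return low, high

def book_hours(pages, minutes_per_page):
    """Total hours for a book at a given overall rate."""
    return pages * minutes_per_page / 60

print(per_page_range(STEP_MINUTES))
print(round(book_hours(348, 7), 1), round(book_hours(348, 8), 1))
```

The timed steps come to about 5.5 to 6 minutes a page, which leaves a minute or two per page for everything else within the 7 or 8 minute overall figure. At that overall rate the 348-page first book works out to roughly 40 to 46 hours of work.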
My primary goal is to produce a quality etext - I don't particularly care about trying to speed things up. I mean, I don't want to needlessly waste a lot of time, but I look at this as a hobby and I enjoy working on it, so I don't get out my stop watch to see if I can get 20 pages done faster today than yesterday. (When I go out running, then I'm concerned about whether I'm faster today than yesterday.) I generally put in maybe 5 hours a week on PG - actually, it's often easier for me to fit in some PG work on weekday evenings than on the weekend. And it is definitely gratifying when the etext is done and not only does it get posted on PG, but then links and copies pop up in different places like the "Online Books Page" and other sites.
I have not encountered any real stumbling blocks so far. There were afew things that took some time to figure out. For example, when myfirst etext was ready, I was pretty sure that it was expected that Iwould put the PG header on myself, but I looked all over the web siteand could not find a "master" copy. (Actually, I think the master,such as it was/is, is available on Lyris, but I was not subscribing toLyris then.) So I just pulled the header from a very-recently postedetext, but then after I sent the etext in it was posted with adifferent header anyway. (Nowadays, my understanding is that the PG"staff" prefers to put the header on.) I also spent some timeresearching 8-bit code pages, but I expect that the new big-FAQ willprovide easy access to all the answers that I had to hunt down then.There's a lot of good information buried in past messages on thevolunteers' board, but no good way to search out information on aparticular topic.
So far I've been able to fill all my book needs without spending muchmoney. I find my books through ABEbooks, or from Ebay, plus I'vegotten a few at Ohio Book Store downtown on Main Street. I've rarelypaid as much as $20 for a book, even including shipping. There's onebook that I've purchased (but not yet started work on) which costs$1000 or more for the original edition, but which is also available inpaperback reprints for about $10. There are some other books in myfuture plans which look like they will be more expensive, but we'llworry about that when the time comes.
My wife still cannot understand why I spend my time scanning books,whereas my kids (and, I guess, most other people I know) seem to thinkit's a little eccentric but basically acceptable behavior. Personally,I definitely enjoy producing etexts and hope to keep doing so for along time. My thanks to Michael Hart, Jim Tinsley, Greg Newby, anduntold others who devote so much effort to nurture the project andgrease the skids for the rest of us. Long live Project Gutenberg.
Lynn Hill
I have been involved with PG since 1994, when I first began readingtexts on-line during slow times at the office where I worked. (I oncegot into trouble with a co-worker when she found me "processing"Little Women instead of the week's payroll report.) I was surprised tofind, even then, such a wide variety of material in the PG archives. Ifound myself re-reading favorite books from my childhood, anddelighting in finding "new" ones—Little Lord Fauntleroy, The SecretGarden, Heidi, the Oz stories. They were not at all like the sugaryold films I had seen on television. They were funny, heartwarming, andutterly charming. After some years as a reader of the texts, I foundmyself thinking, "I'd like to try this."
When I first checked out the web page for volunteers, I felt overwhelmed. There were all sorts of FAQs, but when I read them, I was baffled by all the information about file types, fonts, and other details. I didn't even know where to get books, let alone what to do about jagged right edges or indented lines. It was frustrating: I had all this enthusiasm but didn't know where to apply it. I dawdled for some months, then came back and turned to the PG Volunteers' message board for help.
Help came from many sources. I found someone who needed a fileproofread, so I offered to read it. This worked out well, and I evenfound a couple of typos in it. I proofed some more files for thisperson, and then some for other people on the board.
After a while, I was ready to try a whole book — and from Dianne Beancame my first PG book, "The Golden Slipper" by Anna Katharine Green.When I opened the box, a stale smell floated out, and then I found achunky book with the ugliest green cover I've ever seen on anything.The date was 1915, and the book was starting to crumble all around theedges. My first reaction was "Who would ever want to read this???" Butsince I had promised to do it, I dutifully started scanning andreading as I went along. The book was a collection of mystery/suspensestories about a teenage crime-stopper named Violet Strange. (I alwaysfelt as if Scooby Doo and his friends might turn up at any moment.) AsI read, I began to like Violet, and to notice how different her worldseemed from ours. By the time I reached the end of the book, I feltproud of myself for "saving" some good stories for the future, andready to try another book.
My suggestion to new PG'ers is to jump in and not be shy aboutvolunteering. PG is a big group of great people who care, but they donot know you are out there until you say something. Once you speak up,they will do anything short of triple backflips to help you.
There are many ways new folks can join in, from scavenging old booksat yard sales all the way up to proofing files or scanning and typingin whole books. When you send in your first copy of title page andverso, be patient — it takes time for your copyright research to bedone. This is a great time to do proofing on-line at one of thedistributed proofreading web sites.
I get my books from library sales, yard sales, friends I met on the PGVolunteer board, and even from elderly neighbors who wanted to lend mefavorite books they have saved. When you want old books, telleverybody you know. They may come up with a lot of eligible books youwouldn't have expected.
When you find an old book, my second piece of advice is not to be toohasty in deciding whether you want to read it or not. Old books aredated, naturally, but they can show you things about life in the pastwhich you can't pick up from an A&E documentary. I am especiallyinterested in the way women and children are portrayed in these oldbooks—every woman is not necessarily a lady, and every child is not asweet little angel. (If you haven't read Little Lord Fauntleroy, youare missing a lot of laughs.) These insights and ideas can keep yougoing through a lot of long dark winter evenings, and they're handy tothink over when you hit the occasional dull chapter or scene.
My hardest text to do was See America First, by Orville Heistand. Theauthor invites readers to join him on a trip from Ohio toMassachusetts, in which he visits several landmarks and historicalsites and entertains you all the way with obscure poetry, proverbs,and little moral lectures about each rock and robin he encounters. Itold my husband, Chris, that the author's (literally) rambling stylewas driving me crazy. Chris proofread some chapters for me, thencommented, "Boy, you never see anybody these days have such a fun timegoing nowhere!"
By now, I've done nine complete texts, and have boxes of other booksto do. I have found that children's books are my favorites, but I willtry anything if it is clear enough to read. I don't work on PG everyday, or even every week if I get too busy with other things, but Ikeep coming back. I find PG projects to be very relaxing, a way to usemy computer and writing/proofing skills, and also a refreshing changefrom my daily work. It's also a great excuse and motivation to readlots of books!
Sandra Laythorpe
I first learned about Project Gutenberg from a computer magazine, so I searched for it on the Internet, and found all these classic books I had wanted to read for years, and they were free! At that time, I read a paperback copy of The Heir of Redclyffe by Charlotte M Yonge. I thought it was a wonderful book - indeed I still think it is the best novel to come out of the nineteenth century. After reading the 'How To' files on the Gutenberg site, I thought maybe I could produce Miss Yonge's books with the equipment I had. I wrote to Michael Hart and asked him, and got a very positive reply and lots of information from him.
I jumped in the deep end! I bought a very old copy of The Heir ofRedclyffe, sent the photocopies of the title pages to Michael, and satdown at the computer, learned to use my OCR facilities, and got onwith it, learning by my mistakes. The Instruction files told me mostof what I needed to know, and Michael gave me an introduction to DavidPrice, an experienced Gutenberger, who would be able to help me. Hehas been invaluable in explaining things; I don't think I could haveproduced my first attempt without his guiding hand.
I buy my books off the Internet, or from local dealers. Most of MissYonge's work is still available from second-hand bookshops, and I amhappily living in a location where they are not too scarce. I haveGutenberg colleagues, now, helping with CMY, and I post books to themsnail-mail, if they can't buy them in their own countries.
I use the PrimaPage OCR program; it was on the disc which came with my Primax Colorado Direct scanner, and I do the work on my PC. Before I start, I open my scanner program, and adjust the settings to take black and white photos, and the brightness to about minus 35 or 40. This is crucial, as I won't even be able to see the page until I get it right. When I first began, it took many adjustments to get it right. There should be as few mistakes as possible on the OCR result. If the photograph is too light, the OCR reads words wrongly. If the photograph is too dark, there are shadows which create black patches on the pages. If I can't get rid of these black patches, I have to tear the pages out of the book and do them one at a time. Important: don't buy first editions!
I use the scanner to take a photograph of two pages. The photograph appears on the screen. Then I close the photograph, which my computer calls 'untitl1'. Next I open my OCR program, and search for file 'untitl1', and open that. Then I ask the program to clean it, and then I click onto the button that 'reads' the photograph and converts it from pixels into letters = Optical Character Recognition!
When I get the OCR result (which takes only a few seconds), I save the 'read' text file into my own documents, numbering the file the same as the number of the page of the book. I have created a folder called 'Gutenberg', and I save it in there in a text-only format. So I go to my Gutenberg folder, open this new file, and visually correct the mistakes. I save the finished page, create a Chapter 1 file, and save it and subsequent pages that I have prepared, to build up the whole book. After I have proofed the OCR result, I paste the finished text into a Microsoft Word document, setting the font at Courier New size 10. This sets the lines at the right length for Gutenberg. When I have finished the whole book in Word, I save it as text-with-line-breaks, to get the final text file, which I send to be posted on the Gutenberg site. I proof my work two or three times, depending on the quality of the OCR result, and do a final spelling check with MS Word. I don't ask other people to proof my texts, because Miss Yonge's idiosyncrasies are liable to get edited out, unless the proofer has the book to hand.
It took me 6 months to prepare my first text, The Heir of Redclyffe,but I can do 10 pages an hour now.
In my Gutenberg folder, I have other useful files for reference,mostly downloaded Gutenberg Instructions files. So if I need to findsomething out, I can look in these files—it is much easier thansearching on the Internet. If I need to know something I can't find inthese files, I may ask a question on the Volunteers WWW Board,although I try not to, because the answers are nearly always in thefiles.
I try to process 2 sheets of 16 octavo pages a day, taking about 3 or4 hours. I do my housework & gardening in the morning, then settledown to an afternoon's happy Gutenberging :-).
When I became semi-retired, I wanted to do some voluntary work on the Internet. Coincidentally I began reading the works of Charlotte M Yonge, and discovered that most of her works are out of print now. I felt that they deserved a much wider audience, so I decided that my voluntary job would be to do just that. Miss Yonge lived in a village only a couple of miles away from me, so I had a local interest, too. On my web page, you will find out a little about her, and Otterbourne, the village she lived in all her life, and find links to other web sites about her.
I discovered the Charlotte M Yonge Fellowship, and am now in contact with other people who appreciate her work, including academics who write clever things about her. Her books are about families, their interactions with each other, and how they, in Christian terms, grow in grace. I don't think there is another writer who can write so well about families. She was a Tractarian, a Christian who, in the nineteenth century, believed that people could be influenced for good by what they read. For this reason, 20th century people found her characters too moralistic, and her prose too turgid. I think her novels are delightful, her characters lovable, and her prose minutely descriptive. It was said about her that she was 'able to make goodness exciting'. This is a rare talent, perhaps only found in other Christian writers like John Bunyan or Charles Kingsley.
Through the Gutenberg site, Miss Yonge's works are more easily available than ever. She originally wrote for upper and middle class young women. Even though I live a century and a half later, I can recognise her characters in their 'descendants' who live around me, but I sometimes wonder what Chinese, African, or even modern American readers think of her, their own backgrounds so different from the English Victorians.
I enjoy making Gutenberg texts; the work is simple, once you know how to. I would prefer, however, to see them presented in HTML. The modern ebooks all need to be in HTML format to present nicely on their tiny pages. I believe Gutenberg is going to publish HTML files, and I would like to learn how to do it. Eventually, I think Gutenberg files will be available in a format that will work on all PCs, handhelds, palms, and ebooks; but I don't know what that format is yet, and I don't think standards have even been worked out among the ebook publishers.
Finally, yes, I do find mistakes in my published texts. When I have finished all 200+ of Miss Yonge's books, I am going to go through them all for the second time, and remove the mistakes. So, my work is cut out for many years to come. . . .
Suzanne Shell
Over the past several years, I visited the Project Gutenberg website occasionally, looked at what was involved in making a significant contribution to the effort, and left after downloading a few books—PG was a project that would need to wait until I retired.
In the summer and fall of 2002, I was doing research on e-books (sources, devices, costs) for my library, and ran across Distributed Proofreaders. I discovered at about this time, and also followed a link from there to Distributed Proofreaders. Serendipity! After backing away a few times, I took the plunge and registered on November 5, then began proofing. The however-many-pages-I-wanted-to-proof commitment was just right for letting me get a feel for the process, and to start me thinking of the ways I could exploit all this free labor to get the books I wanted into PG.
I was feeling quite virtuous about proofing my 10-20 pages per day, when I visited the site on November 8, and NONE of the books I was working on were available. Also there was this perfectly absurd number listed for the number of proofers having proofed at least one page (it had roughly quadrupled). I KNEW the site had been hacked. Actually the site had been slashdotted. The DP discussion forums were so active, it was hard to find time to read all the messages, questions, suggestions, and complaints; these rapidly led to new documentation and more detailed proofing guidelines. Books moved through the site so rapidly that they brought out the "hard stuff" from the bottom of the to-do stack, and were STILL desperate for content. I was a relative "veteran" after just a few days, and helped out a little by answering questions, but I was still a beginner. I had some PG dreams that DP could make reality, but I needed to learn the ropes first.
Some of my ambitions revolved around professional goals—there are some public domain titles, which, if available in electronic form, would be extremely useful to my library's patrons. There are also some standard reference books and indexes—Granger's Index to Poetry is one example—that have pre-1923 editions that could still be important resources. In order to learn what I needed to know about providing content, though, I decided to start with something less overwhelming (wanting to read it on my e-book reader was just a coincidence). I went to my bookshelves and pulled out my P. G. Wodehouse reprints. I downloaded and read the scanning and submitting FAQ from the DP site, requested and received clearance for the first book (Uneasy Money) in late December, and got to work mastering my scanner. I tried Omnipage Pro first, but decided that ABBYY Finereader Pro did a significantly better job of the OCR. I offered to be a "behind the scenes" manager for the book while it worked its way through the site, but was made an official "Project Manager" instead. Although the first frenzy following the Slashdot invasion had calmed down, DP was still feeling a need for more content and more hands to manage projects.
On January 5, Uneasy Money started proofing; it went through 2 rounds of proofing in less than 20 hours. I felt like a hick marveling at a traffic light changing colors, but I sat at my PC and watched the page count go down. By this time, I had also scanned and OCR'd a couple more Wodehouse reprints and a short book of poetry. I was hooked! Juliet Sutherland and the other admins had recruited some experienced DP'ers to help train new post-processors in the job of preparing final PG texts. I was handed over to one of them. After several projects, I "graduated" and was given permission to upload my own projects. My intent was to do 3 or 4 projects a month, no more than I could handle post-processing by myself. I planned to process an occasional reference book in addition to all the Wodehouse I could get my hands on. So much for plans…
One ongoing concern of many Distributed Proofreaders was how to train new volunteers in the DP style of proofreading. (It is somewhat idiosyncratic because of the distributed nature of the process.) We were still coping with the aftereffects of the massive influx of slashdotters—quantity benefited, but quality suffered. Super7, one of the highest volume proofreaders, suggested setting aside a project without complex formatting for "Beginners" and asking that the second round proofers (all of whom should be veterans) send feedback and encouragement to the newcomers. This was tried successfully, and with a couple of variations. Since I had been planning to start running a variety of genre fiction through the site, I then volunteered to manage these as beginners' projects for as long as the supply held out. All of a sudden, starting in February 2003, the amount of time I needed to spend locating, scanning, OCR'ing and managing books increased drastically, and the amount of time I could devote to post-processing decreased. Luckily, "veterans" stepped in to answer newcomers' questions, and to serve as "Mentors" in the second round of proofing. Recently, others have provided "beginners' projects", to help keep up with the demand of a steadily increasing flow of new volunteers. These projects are also useful for helping new post-processors learn the job.
I still have some ambitious projects planned: Granger's Index to Poetry, the unabridged edition of The Golden Bough, Curtis' The North American Indian, and the Book Review Digest (volumes for 1905-1921). A couple of volumes are already waiting to be proofed, others are waiting to be scanned on the PG tabloid scanner. But, in the meantime, there are 23 new Wodehouse books in PG thanks to Distributed Proofreaders, not to mention such remnants of early 20th century popular culture as The Sheik.
I believe that a major accomplishment of Distributed Proofreaders has been the creation of a way to provide on-the-job training for PG volunteers. Steady improvement in the quantity and quality of training techniques and documentation, enhancements to the user-friendliness of the site, and ready access to the collective experience and advice of a wide range of volunteers in the Forums have resulted in a growing core of active and experienced volunteers in all the facets of e-book production. I'm sure that I could not have progressed from a total newbie to a regular PG contributor within a 5-month period without this support structure. Regular communication and collaboration with book-lovers from around the world has enriched my life. The fact that it is easier to get leave from my job than from DP is perhaps beside the point…
Tony Adam
How did you learn about PG?
It's been so long, I don't really remember! I probably read about it on a library listserv (I'm a librarian), and since making old texts accessible has always been a concern of mine, I jumped right in.
What was your first contact like?
Great! Mike Hart has always been easy to deal with via e-mail, although we've never talked. He and the "crew du jour" directed me to the FAQ and I took it from there.
What was the first PG job you did? How did it go?
My first job might have been Henry James' Turn of the Screw (I just found a note from September 1993 on copyright clearance for it). Since in a former incarnation I was editorial assistant for the Henry James Review, I thought that would be a good start. I've always typed the files (I'm a fast typist), and I think we had few problems along the way.
How did you develop your PG experience from there?
Helter-skelter, much like my reading habits. I work at a historically black university, so getting 19th C African-American works posted is a central concern. I've done Clotelle (the first African-American novel) and the autobiography of Henry O. Flipper, the West Point cadet, and I'm always looking for something new in that area. Somewhere along the way I got sidetracked into essays by Whittier and other U.S. poets, and I've collaborated on early American historical documents and Sir Walter Scott with a fellow PGer up in Ohio and Chinese documents with another contact in Japan. A couple of years ago, I saw that someone in San Francisco needed help with the Shakespeare Apocrypha, and that has occupied my time on and off since. It's always something!
Can you tell us about the first text you produced?
I think it was The Turn of the Screw, which was a good starting point—not too long, a good read, etc. Just plugging away at the text a few pages a day made the process go quickly.
Why do you spend your hours contributing to PG?
I love the idea of making all of this print knowledge available to anyone anywhere. Working in a library that has suffered budget problems over the years opened my eyes to the need for acquisition of as much free stuff as possible for our students and faculty. Besides, in a perverse way, it's fun!
Do you specialize in any particular kind of work? of texts?
I've probably focused more on plays, historical documents, and 19th C U.S. works than anything else.
What do you like about making a PG text?
Having a project come to fruition—finally seeing an almost forgotten text come to life again.
What do you dislike about making a PG text?
The work can be tedious at times, depending on the author. But sometimes you have to plow through to get something significant processed. For example, we probably should have more philosophers represented, but what a horrible thing it would be to scan Kant!
Where do you get your eligible books?
Mostly from my library's collection, although I finally purchased my own copy of the Shakespeare Apocrypha (it's very hard to find, which makes it very suitable for posting). I've interlibrary loaned some items, but that's also been unusual.
Do you type or scan? What Scanner / OCR / Editor / WP do you prefer?
I still type everything—it's easier when working with a play, I've discovered. But I'm purchasing a scanner in the very near future and will do more with that.
How do you check your text? Any special tools? spellchecker? Do you print it out and read it? Put it on your PDA and read it? Have a voice synthesis program read it aloud to you from your PC?
I usually run it through the spellchecker, although depending on the work, I read it line by line a second time.
Do you have any tips'n'tricks or special routines you go through when preparing a text?
The best thing to do is put yourself on a schedule—do a set amount of pages every day, and you'll be surprised how quickly you get to the end. I also make a pencil mark in the book at a stopping point and even read back a paragraph to double check what I last entered.
How long does it take you to make a text?
Depends on my work schedule, other assignments, time of year, etc. A play might take a couple of weeks, but a Walter Scott novel could take six months. I think my record is probably one day for an essay, but that's unusual.
Do you work alone, or do you share the work of each text? Does anyone regularly help you proof the text?
I've worked alone and on teams, depending on the text. No one regularly helps to proof the text, but occasionally someone else does.
Do you do some PG work regularly, or drift in and out as opportunity permits, or when you feel like it?
I consider myself a regular, as time permits. In other words, I haven't dropped out of the picture, but sometimes I might not enter anything for up to a month.
How many different kinds of work, or different books, have you done?
Not sure how many different books I've done, but it's been a wide variety: James' and Scott's novels, Whittier's essays, a whole collection of early American documents (mostly New Netherlands), Shakespeare (accepted canon and the apocryphal works), some odd works (The Psychology of Beauty comes to mind)—the list goes on and on. I've even forgotten that I've done some titles!
What do you like about the PG process?
That it's open-ended—if I think I have something that should be posted, I don't have to jump through hoops and ladders to get permission (other than copyright clearance).
What do you dislike about the PG process?
Can't think of anything offhand.
Is there anything you'd like to see PG doing differently?
I know it's a bone of contention, but we probably need to explore moving away from ASCII.
If one of your friends approached you to ask advice about how to get started contributing to PG, what would you tell them?
Start with something fun, that's close to your heart, and keep plugging away a little bit at a time.
What do you expect Project Gutenberg to be like in 5 years? 10 years?
We'll probably be a whole lot bigger (texts and personnel), with a different look to the texts. Maybe we'll even have more audio versions of texts, using some of the new software that's coming out.
Tonya Allen
I discovered Project Gutenberg in about 1997. After several years of enjoying PG's texts, in June of 2002 I decided it was time to start contributing. Via the PG web site I learned that the easiest way to do this would be to help out with proofreading via Charles Franks' Distributed Proofreaders web site. The day I signed on I proofed nine whole pages of a children's book called Curly and Floppy Twistytail and felt very proud to be contributing.
At that time, there were probably only about 40 active volunteers on the site each day. Often I proofed an entire book almost all by myself over the course of a week or so. Things moved at a leisurely pace; guidelines were few and simple; and I had fun reading old books and discovering new authors.
After a few months a request was made for volunteers to post-process texts in French. I volunteered to help with this, and that was how I became a post-processor (PPer). Shortly afterwards, the web page listing texts available for post-processing and sign-out was unveiled. I remember several times checking and being disappointed because there was nothing currently available (hard to imagine now when there are always at least 40 texts waiting).
One day in November, I picked out a likely-looking text from the proofing page, and settled down for an hour of reading. As I recall, it was The Greek View of Life, a sizeable text of which only a few pages had been proofed so far, and which I thought would last for several days at least. At about that time, someone emailed me to say that DP had been "/.ed." "What does that mean?" I replied. I soon found out.
I had been proofing away peacefully for a while when suddenly instead of the next page, I got a page about twenty pages further on. The same thing happened again and again, and suddenly all the pages were gone; the whole text had been completed. DP had indeed been slashdotted.
Since then, a lot of amazing things have happened. The number of active volunteers per day has increased almost 1000%. The number of texts that go through the site has increased exponentially. All kinds of proofing and processing tools have been developed. I now spend most of my time checking texts that others have PPed, and submitting them to PG, at an average rate of one to four per day—quite a leap from nine pages of Curly and Floppy Twistytail. And I'm looking forward to everything that lies ahead as DP continues to evolve.
Walter Debeuf
Quite by chance I became aware of PG when I was surfing and looking for interesting sites. I vaguely knew the name because I had heard of the Project a long time ago. After reading the "History and Philosophy of PG", I immediately became wildly enthusiastic about it. This was what I had been looking for for years, a meaningful use of my PC, and because I am a fervent lover of good literature, I didn't hesitate to contact the founders of the Project. I made a suggestion that I should work on French and Dutch e-texts. The very same day I received an answer from PG in which they told me they were very pleased with my contribution but that I had to keep in mind that all books must be free of copyright and published before 1923.
This wasn't so great. . . . After I browsed in the "Help And FAQ" of the PG site, I read that I didn't have to worry about all that, because they are willing to do all the clearance!
On my own bookshelf I found an old book of Jules Renard, "Poil de Carotte". It seemed old enough to me, but I couldn't find any copyright notations. So, I mailed to Mr Hart all the information I found on the title page and the verso, and asked him what he thought about it. The next day I received his answer, he wrote: "We still have to prove this edition was pre-1923, so I am forwarding to our authority on such copyright research." This authority is Ms. Dianne Bean who mailed me a few days later very pleasantly that I could start typing, because the copyright issues had been resolved. She asked me to send a "TP&V" (a photocopy of the title page and verso) of the book to Mr. Hart, because they need that for legal reasons.
But something wasn't very clear to me concerning the format I had to use. In the "FAQ" they spoke about "plain vanilla ASCII", something I had never heard about in my life! In "How to Volunteer, PG Volunteers' Board" Mr. Jim Tinsley answered all kinds of questions about all kinds of problems people have when they start volunteering. So I did the same and sent him my question. I received an extensive answer about all kinds of formats in the "ISO 8859 Alphabet Soup" and he recommended that I use "Codepage 1252", which is very common in Windows. Here are the addresses which Jim sent to me:
"If you are interested in the differences, I recommend the excellent web page
in the excellent reference site"
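For readers who wonder what the difference between these formats looks like in practice, here is a minimal sketch in Python. It is not part of Jim's answer or of any PG tool: the helper name `encodable` and the sample strings are made up for illustration, and it only assumes Python's standard `ascii`, `iso-8859-1` and `cp1252` codecs.

```python
def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of text can be stored in encoding.

    Hypothetical helper for illustration only; not part of any PG tool.
    """
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


title = "Pêcheur d'Islande"  # an accented letter needs more than 7-bit ASCII
word = "cœur"                # the oe ligature is in Codepage 1252 but not in ISO 8859-1

for enc in ("ascii", "iso-8859-1", "cp1252"):
    print(enc, encodable(title, enc), encodable(word, enc))

# The accented letter is stored as the same single byte in both 8-bit sets.
print("ê".encode("iso-8859-1"), "ê".encode("cp1252"))
```

The sketch suggests why an 8-bit character set is a natural choice for typing a French text: the accented letters do not fit in plain ASCII, and Codepage 1252 additionally covers a few characters, such as the oe ligature, that ISO 8859-1 lacks.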
I chose a French book, first because I had it already on my bookshelf, and secondly because I wanted to perfect my knowledge of the French language and typing seemed the right way to do it. When copying an author's text, you are very close to it. You also have to pay full attention to the spelling of the words. Gradually you come under the spell of the story and you forget that you are typing . . . Nevertheless, it is hard work, especially when it is not your native language, and therefore you shouldn't try to rush it. At first I started with two or three pages a day, which means that you would need about two months typing for an average book. But good typists can do it more quickly.
I can only applaud the aim of PG, to make as many books as possible available on the net, without cost, for everyone in the whole world. I love to co-operate with it.
In the meantime there are thousands and thousands of books in the PG-collection, and that makes it a little difficult to find other examples which are free of copyright, because they must be from before 1923. Since I've got the "PG-bug" it's a challenge for me to find suitable copies, and I look for them high and low. I can buy a few books for a song and I take them home as a trophy, looking forward to the work which is waiting for me . . .
In libraries you can find old publications which you can find nowhere else.
It's amazing how fascinating old books can be and how much you can learn from them. For the moment I'm working on "Pecheur d'Islande" by Pierre Loti, in which I get acquainted with an old tradition of fishermen, very interesting. Without PG I would probably never have read this. There must be still a lot of little treasures in some old and dusty attics, waiting to be born again by the magic touch of a PG-volunteer.
If you do it, no compensation or payment is waiting, but . . . doing something disinterested and unselfish gives you a good feeling.
B.1. Project Gutenberg:
Home Page and Search
Contact Information
Donations
List of FTP sites
Web Browse to texts
Mailing Lists
Volunteers' Board
Copyright Rules
Books In Progress (The InProg List)
Greek Transliteration
Music
(Complete list of posted eBooks)
B.2. Distributed Proofing Sites:
Charles Franks
JC Byers
Dewayne Cushman
B.3. Other On-Line eBook Pages:
The On-Line Books Page
/In Progress List
Internet Public Library
B.4. Lists of Suggested Books to Transcribe:
PG Books In Progress
On-Line Requested List
Steve Harris' "To-do"s
B.5. Finding Paper Books On-Line:
Advanced Book Exchange
Alibris
Trussel BookSearch
Library of Congress Catalog
B.6. Character Sets
Overviews
ISO-8859
Microsoft & Other Codepages
Unicode
Updated editions will replace the previous one—the old editions will be renamed.
Creating the works from print editions not protected by U.S. copyright law means that no one owns a United States copyright in these works, so the Foundation (and you!) can copy and distribute it in the United States without permission and without paying copyright royalties. Special rules, set forth in the General Terms of Use part of this license, apply to copying and distributing Project Gutenberg™ electronic works to protect the PROJECT GUTENBERG™ concept and trademark. Project Gutenberg is a registered trademark, and may not be used if you charge for an eBook, except by following the terms of the trademark license, including paying royalties for use of the Project Gutenberg trademark. If you do not charge anything for copies of this eBook, complying with the trademark license is very easy. You may use this eBook for nearly any purpose such as creation of derivative works, reports, performances and research. Project Gutenberg eBooks may be modified and printed and given away—you may do practically ANYTHING in the United States with eBooks not protected by U.S. copyright law. Redistribution is subject to the trademark license, especially commercial redistribution.
To protect the Project Gutenberg™ mission of promoting the free distribution of electronic works, by using or distributing this work (or any other work associated in any way with the phrase “Project Gutenberg”), you agree to comply with all the terms of the Full Project Gutenberg™ License available with this file or online
Section 1. General Terms of Use and Redistributing Project Gutenberg™ electronic works
1.A. By reading or using any part of this Project Gutenberg™ electronic work, you indicate that you have read, understand, agree to and accept all the terms of this license and intellectual property (trademark/copyright) agreement. If you do not agree to abide by all the terms of this agreement, you must cease using and return or destroy all copies of Project Gutenberg™ electronic works in your possession. If you paid a fee for obtaining a copy of or access to a Project Gutenberg™ electronic work and you do not agree to be bound by the terms of this agreement, you may obtain a refund from the person or entity to whom you paid the fee as set forth in paragraph 1.E.8.
1.B. “Project Gutenberg” is a registered trademark. It may only be used on or associated in any way with an electronic work by people who agree to be bound by the terms of this agreement. There are a few things that you can do with most Project Gutenberg™ electronic works even without complying with the full terms of this agreement. See paragraph 1.C below. There are a lot of things you can do with Project Gutenberg™ electronic works if you follow the terms of this agreement and help preserve free future access to Project Gutenberg™ electronic works. See paragraph 1.E below.
1.C. The Project Gutenberg Literary Archive Foundation (“the Foundation” or PGLAF), owns a compilation copyright in the collection of Project Gutenberg™ electronic works. Nearly all the individual works in the collection are in the public domain in the United States. If an individual work is unprotected by copyright law in the United States and you are located in the United States, we do not claim a right to prevent you from copying, distributing, performing, displaying or creating derivative works based on the work as long as all references to Project Gutenberg are removed. Of course, we hope that you will support the Project Gutenberg™ mission of promoting free access to electronic works by freely sharing Project Gutenberg™ works in compliance with the terms of this agreement for keeping the Project Gutenberg™ name associated with the work. You can easily comply with the terms of this agreement by keeping this work in the same format with its attached full Project Gutenberg™ License when you share it without charge with others.
1.D. The copyright laws of the place where you are located also govern what you can do with this work. Copyright laws in most countries are in a constant state of change. If you are outside the United States, check the laws of your country in addition to the terms of this agreement before downloading, copying, displaying, performing, distributing or creating derivative works based on this work or any other Project Gutenberg™ work. The Foundation makes no representations concerning the copyright status of any work in any country other than the United States.
1.E. Unless you have removed all references to Project Gutenberg:
1.E.1. The following sentence, with active links to, or other immediate access to, the full Project Gutenberg™ License must appear prominently whenever any copy of a Project Gutenberg™ work (any work on which the phrase “Project Gutenberg” appears, or with which the phrase “Project Gutenberg” is associated) is accessed, displayed, performed, viewed, copied or distributed:
This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook.
1.E.2. If an individual Project Gutenberg™ electronic work is derived from texts not protected by U.S. copyright law (does not contain a notice indicating that it is posted with permission of the copyright holder), the work can be copied and distributed to anyone in the United States without paying any fees or charges. If you are redistributing or providing access to a work with the phrase “Project Gutenberg” associated with or appearing on the work, you must comply either with the requirements of paragraphs 1.E.1 through 1.E.7 or obtain permission for the use of the work and the Project Gutenberg™ trademark as set forth in paragraphs 1.E.8 or 1.E.9.
1.E.3. If an individual Project Gutenberg™ electronic work is posted with the permission of the copyright holder, your use and distribution must comply with both paragraphs 1.E.1 through 1.E.7 and any additional terms imposed by the copyright holder. Additional terms will be linked to the Project Gutenberg™ License for all works posted with the permission of the copyright holder found at the beginning of this work.
1.E.4. Do not unlink or detach or remove the full Project Gutenberg™ License terms from this work, or any files containing a part of this work or any other work associated with Project Gutenberg™.
1.E.5. Do not copy, display, perform, distribute or redistribute this electronic work, or any part of this electronic work, without prominently displaying the sentence set forth in paragraph 1.E.1 with active links or immediate access to the full terms of the Project Gutenberg™ License.
1.E.6. You may convert to and distribute this work in any binary, compressed, marked up, nonproprietary or proprietary form, including any word processing or hypertext form. However, if you provide access to or distribute copies of a Project Gutenberg™ work in a format other than “Plain Vanilla ASCII” or other format used in the official version posted on the official Project Gutenberg™ website, you must, at no additional cost, fee or expense to the user, provide a copy, a means of exporting a copy, or a means of obtaining a copy upon request, of the work in its original “Plain Vanilla ASCII” or other form. Any alternate format must include the full Project Gutenberg™ License as specified in paragraph 1.E.1.
1.E.7. Do not charge a fee for access to, viewing, displaying, performing, copying or distributing any Project Gutenberg™ works unless you comply with paragraph 1.E.8 or 1.E.9.
1.E.8. You may charge a reasonable fee for copies of or providing access to or distributing Project Gutenberg™ electronic works provided that:
• You pay a royalty fee of 20% of the gross profits you derive from the use of Project Gutenberg™ works calculated using the method you already use to calculate your applicable taxes. The fee is owed to the owner of the Project Gutenberg™ trademark, but he has agreed to donate royalties under this paragraph to the Project Gutenberg Literary Archive Foundation. Royalty payments must be paid within 60 days following each date on which you prepare (or are legally required to prepare) your periodic tax returns. Royalty payments should be clearly marked as such and sent to the Project Gutenberg Literary Archive Foundation at the address specified in Section 4, “Information about donations to the Project Gutenberg Literary Archive Foundation.”
• You provide a full refund of any money paid by a user who notifies you in writing (or by e-mail) within 30 days of receipt that s/he does not agree to the terms of the full Project Gutenberg™ License. You must require such a user to return or destroy all copies of the works possessed in a physical medium and discontinue all use of and all access to other copies of Project Gutenberg™ works.
• You provide, in accordance with paragraph 1.F.3, a full refund of any money paid for a work or a replacement copy, if a defect in the electronic work is discovered and reported to you within 90 days of receipt of the work.
• You comply with all other terms of this agreement for free distribution of Project Gutenberg™ works.
1.E.9. If you wish to charge a fee or distribute a Project Gutenberg™ electronic work or group of works on different terms than are set forth in this agreement, you must obtain permission in writing from the Project Gutenberg Literary Archive Foundation, the manager of the Project Gutenberg™ trademark. Contact the Foundation as set forth in Section 3 below.
1.F.1. Project Gutenberg volunteers and employees expend considerable effort to identify, do copyright research on, transcribe and proofread works not protected by U.S. copyright law in creating the Project Gutenberg™ collection. Despite these efforts, Project Gutenberg™ electronic works, and the medium on which they may be stored, may contain “Defects,” such as, but not limited to, incomplete, inaccurate or corrupt data, transcription errors, a copyright or other intellectual property infringement, a defective or damaged disk or other medium, a computer virus, or computer codes that damage or cannot be read by your equipment.
1.F.2. LIMITED WARRANTY, DISCLAIMER OF DAMAGES - Except for the “Right of Replacement or Refund” described in paragraph 1.F.3, the Project Gutenberg Literary Archive Foundation, the owner of the Project Gutenberg™ trademark, and any other party distributing a Project Gutenberg™ electronic work under this agreement, disclaim all liability to you for damages, costs and expenses, including legal fees. YOU AGREE THAT YOU HAVE NO REMEDIES FOR NEGLIGENCE, STRICT LIABILITY, BREACH OF WARRANTY OR BREACH OF CONTRACT EXCEPT THOSE PROVIDED IN PARAGRAPH 1.F.3. YOU AGREE THAT THE FOUNDATION, THE TRADEMARK OWNER, AND ANY DISTRIBUTOR UNDER THIS AGREEMENT WILL NOT BE LIABLE TO YOU FOR ACTUAL, DIRECT, INDIRECT, CONSEQUENTIAL, PUNITIVE OR INCIDENTAL DAMAGES EVEN IF YOU GIVE NOTICE OF THE POSSIBILITY OF SUCH DAMAGE.
1.F.3. LIMITED RIGHT OF REPLACEMENT OR REFUND - If you discover a defect in this electronic work within 90 days of receiving it, you can receive a refund of the money (if any) you paid for it by sending a written explanation to the person you received the work from. If you received the work on a physical medium, you must return the medium with your written explanation. The person or entity that provided you with the defective work may elect to provide a replacement copy in lieu of a refund. If you received the work electronically, the person or entity providing it to you may choose to give you a second opportunity to receive the work electronically in lieu of a refund. If the second copy is also defective, you may demand a refund in writing without further opportunities to fix the problem.
1.F.4. Except for the limited right of replacement or refund set forth in paragraph 1.F.3, this work is provided to you ‘AS-IS’, WITH NO OTHER WARRANTIES OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PURPOSE.
1.F.5. Some states do not allow disclaimers of certain implied warranties or the exclusion or limitation of certain types of damages. If any disclaimer or limitation set forth in this agreement violates the law of the state applicable to this agreement, the agreement shall be interpreted to make the maximum disclaimer or limitation permitted by the applicable state law. The invalidity or unenforceability of any provision of this agreement shall not void the remaining provisions.
1.F.6. INDEMNITY - You agree to indemnify and hold the Foundation, the trademark owner, any agent or employee of the Foundation, anyone providing copies of Project Gutenberg™ electronic works in accordance with this agreement, and any volunteers associated with the production, promotion and distribution of Project Gutenberg™ electronic works, harmless from all liability, costs and expenses, including legal fees, that arise directly or indirectly from any of the following which you do or cause to occur: (a) distribution of this or any Project Gutenberg™ work, (b) alteration, modification, or additions or deletions to any Project Gutenberg™ work, and (c) any Defect you cause.
Section 2. Information about the Mission of Project Gutenberg™
Project Gutenberg™ is synonymous with the free distribution of electronic works in formats readable by the widest variety of computers including obsolete, old, middle-aged and new computers. It exists because of the efforts of hundreds of volunteers and donations from people in all walks of life.
Volunteers and financial support to provide volunteers with the assistance they need are critical to reaching Project Gutenberg™’s goals and ensuring that the Project Gutenberg™ collection will remain freely available for generations to come. In 2001, the Project Gutenberg Literary Archive Foundation was created to provide a secure and permanent future for Project Gutenberg™ and future generations. To learn more about the Project Gutenberg Literary Archive Foundation and how your efforts and donations can help, see Sections 3 and 4 and the Foundation information page at
Section 3. Information about the Project Gutenberg Literary Archive Foundation
The Project Gutenberg Literary Archive Foundation is a non-profit 501(c)(3) educational corporation organized under the laws of the state of Mississippi and granted tax exempt status by the Internal Revenue Service. The Foundation’s EIN or federal tax identification number is 64-6221541. Contributions to the Project Gutenberg Literary Archive Foundation are tax deductible to the full extent permitted by U.S. federal laws and your state’s laws.
The Foundation’s business office is located at 809 North 1500 West, Salt Lake City, UT 84116, (801) 596-1887. Email contact links and up to date contact information can be found at the Foundation’s website and official page at
Section 4. Information about Donations to the Project Gutenberg Literary Archive Foundation
Project Gutenberg™ depends upon and cannot survive without widespread public support and donations to carry out its mission of increasing the number of public domain and licensed works that can be freely distributed in machine-readable form accessible by the widest array of equipment including outdated equipment. Many small donations ($1 to $5,000) are particularly important to maintaining tax exempt status with the IRS.
The Foundation is committed to complying with the laws regulating charities and charitable donations in all 50 states of the United States. Compliance requirements are not uniform and it takes a considerable effort, much paperwork and many fees to meet and keep up with these requirements. We do not solicit donations in locations where we have not received written confirmation of compliance. To SEND DONATIONS or determine the status of compliance for any particular state visit
While we cannot and do not solicit contributions from states where we have not met the solicitation requirements, we know of no prohibition against accepting unsolicited donations from donors in such states who approach us with offers to donate.
International donations are gratefully accepted, but we cannot make any statements concerning tax treatment of donations received from outside the United States. U.S. laws alone swamp our small staff.
Please check the Project Gutenberg web pages for current donation methods and addresses. Donations are accepted in a number of other ways including checks, online payments and credit card donations. To donate, please visit:
Section 5. General Information About Project Gutenberg™ electronic works
Professor Michael S. Hart was the originator of the Project Gutenberg™ concept of a library of electronic works that could be freely shared with anyone. For forty years, he produced and distributed Project Gutenberg™ eBooks with only a loose network of volunteer support.
Project Gutenberg™ eBooks are often created from several printed editions, all of which are confirmed as not protected by copyright in the U.S. unless a copyright notice is included. Thus, we do not necessarily keep eBooks in compliance with any particular paper edition.
Most people start at our website which has the main PG search facility:
This website includes information about Project Gutenberg™, including how to make donations to the Project Gutenberg Literary Archive Foundation, how to help produce our new eBooks, and how to subscribe to our email newsletter to hear about new eBooks.