Scans
by now i’ve put a fair amount of time into hobbyist book digitization. here i’ll compile my reasons for doing it, a guide to doing it yourself, and links and notes for the books i’ve done it to.
origin of a habit
i once had a prefigurative vision of myself in old age, ravaged by tsundoku, lost in stacks of books which would consume all my funds while mostly never being read. naturally i resolved to avert this dread fate. tsundoku arises from a simple cause, i reasoned: when one feels the impulse of interest in a book, one obtains it with the purest intention to read. but mere possession brings with it enough satisfaction that the impulse to read slackens prematurely, and one listlessly discards the volume in favor of the next prize. the solution i devised was to make possession immaterial to me: switch to reading pirated ebooks rather than paper copies. not only would downloading an epub provide a less addictive pleasure than holding an object in my hands, but if i could just get over the learning curve, i’d save a sizable sum.
also, there was the problem of my decaying focus: by age thirteen i realized it was increasingly difficult for me to keep my eyes on a text for very long without my mind wandering. i’m still not sure whether to think of that as adhd or a vision problem or my being a victim of internet-induced mass attention death. i took to audiobooks to compensate for the issue, since the narrator’s insistent voice refused me any chance for distraction. when no audiobook exists, i can use text-to-speech on an ebook for a similar effect, reading the text at the same time as i listen.
as in any fable of fate defied, the result of my efforts was a whole new set of inconveniences. having grown so used to immediately receiving a pdf of any book i want, i no longer know how to accept that something is not available. if i can’t find something from the usual sources online, i’m overcome with anger, and will go to almost any length to obtain it. if i can get it from the library, i digitize it, and if i don’t know the language, i learn to translate it (can’t do the latter as often, of course).
my digitization workflow
several times now i’ve had someone ask me how to digitize a book properly. my method relies on basic command-line navigation skills, and the tools are all things that run on linux. i haven’t timed myself, but roughly it takes a couple of hours to scan a couple of books, and a couple more hours later to turn the scans into a useful format.
- to start with, i use a flatbed scanner—there are fancier setups that aim to be easier on a book’s spine, but in my experience the strain is negligible and easily justified by the digital immortality that the book will gain by the ordeal.
- scan the pages in grayscale mode (not b/w), except where color is necessary. output should be tiff format, at whatever high-resolution setting you have. conventional wisdom says you need 300dpi or more to get good ocr results, and erring on the higher side shouldn’t hurt, since you’ll compress everything later for the pdf output (there’s a quick way to check a scan’s actual resolution after this list). the actual scanning is tedious, and tiring in the same way driving is: it’s not stimulating, but you need to pay a little attention to make sure you’re getting every spread and they’re not coming out wrong. the scan software should have a preview of your results, but i can’t recommend a specific tool because i just use what runs on the scanner workstations at my library and then copy it to a usb key to take home. i pretend i’m back in boarding school and the book is a younger boy who hasn’t been paying his dues. dunk his head in the toilet, hold it down, pull up for air, repeat. you get into the rhythm of it, and i fend off boredom by listening on my earbuds to whatever i digitized last week.
- post-process the tiffs with scantailor advanced. you can compile it yourself, or i think get a build from a ppa on ubuntu or from nixpkgs (haven’t double-checked this). it’s pretty straightforward to use and halfway automates most of the steps: fixing image orientation, splitting spreads into pages, deskewing pages, selecting the content region of the page, positioning the content on the output pages, and generating the final output. the main things you have to intervene in by hand are content selection, positioning, identifying regions you don’t want converted to black and white, and maybe manually cleaning up stray marks or annotations. the virtue of letting scantailor convert grayscale to b/w instead of the scanner itself is that it can tell undesirable shadows from desirable text and gives you precise control over the line between black and white.
- you need to write a few metadata files. i don’t put a lot of detail into them. create a `metadata.yaml` for the epub version in the project directory like this:

  ~~~
  ---
  title:
  - type: main
    text: "Title"
  creator:
  - role: author
    text: "Author"
  ...
  ~~~

  in the scantailor output directory (i make this a subfolder of the one dedicated to the book project), make an equivalent file `metadata` for the pdf:

  ~~~
  Author: "Author"
  Title: "Title"
  ~~~

  you’ll also need a dummy `bookmarks` file to generate the pdf. you can make it a proper table of contents later; for now just do this:

  ~~~
  "Cover" 1
  ~~~
- `cd` into your scantailor output directory. to generate markdown and pdf output, you’ll need `tesseract` for ocr (probably `tesseract-ocr` in your distro repos), `hocr-combine` from hocr-tools (install python from distro repos, then `pip install hocr-tools`), `pandoc` (probably in distro repos), imagemagick (ditto), and `pdfbeads` (install ruby from distro repos, then `gem install pdfbeads`). might have forgotten some dependencies; there’s a consolidated install sketch after this list. i use a script called `bind`, which looks like this:

  ~~~
  #!/bin/bash
  # run ocr over every page image, then stitch the results into a markdown file
  # and a pdf one directory up.
  COUNT=1
  TOTAL=$(ls -1 *.tif | wc -l)
  for f in *.tif; do
      echo "OCRing $f ($COUNT of $TOTAL)"
      # tesseract writes hocr output next to the tiff, e.g. page.tif -> page.hocr
      tesseract -l eng "$f" "$(basename "$f" .tif)" hocr
      COUNT=$((COUNT + 1))
  done
  # hocr-combine and pandoc expect .html extensions
  rename -v 's/\.hocr$/.html/' *.hocr
  hocr-combine *.html | pandoc -f html-native_divs-native_spans -t markdown+smart -o ../book.md
  pdfbeads -C bookmarks -M metadata > ../book.pdf
  ~~~
change “eng” on the tesseract line to the relevant language code if the book isn’t in english.
- open `book.pdf` and use the page numbers in it to write a better `bookmarks` file (there’s an example of one after this list), then run the `pdfbeads` command from the above script again to make a final pdf.
- the markdown file `book.md` is the corpus you can use to generate an epub. you’ll want to edit it to fix typos and remove page numbers (regex helps for this; see the sketch after this list), add chapter headings, and if necessary insert images. this step is kinda optional, depending on the marginal return. i’m often satisfied with only a pdf for nonfiction books, or i put the minimum of effort into the epub that i need to make it readable by text-to-speech. when you’re satisfied, do `pandoc book.md -o book.epub --metadata-file=metadata.yaml`.
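a few addenda for the steps above, in rough workflow order. first, a quick way to double-check that a scan actually came out at the resolution you asked for, using `identify` from imagemagick (already on the dependency list); `page_0001.tif` is just a stand-in for whatever your scanner names its files:

~~~
# print the pixel geometry, embedded density, and units of a scanned page;
# you want the resolution to read 300x300 or more, in PixelsPerInch
identify -verbose page_0001.tif | grep -iE 'geometry|resolution|units'
~~~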
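next, the dependency list from the `bind` step, collapsed into commands. this is a sketch assuming a debian/ubuntu-ish system with `apt`; package names, and whether `pip` and `gem` need sudo or a user-level install, will vary by distro:

~~~
# ocr engine, converter, image tools, and the perl-style rename the bind script uses
sudo apt install tesseract-ocr pandoc imagemagick rename python3-pip ruby-full
# add tesseract language packs if the book isn't in english, e.g. french:
#   sudo apt install tesseract-ocr-fra   (tesseract --list-langs shows what's installed)
pip install hocr-tools   # provides hocr-combine
gem install pdfbeads
~~~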
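here’s an example of a filled-out `bookmarks` file, once you’ve read the page numbers off the draft `book.pdf`. i’m assuming entries just repeat the `"Title" page` pattern of the dummy file above; the chapter titles and numbers here are obviously made up:

~~~
# overwrite the dummy bookmarks file with a real table of contents
cat > bookmarks <<'EOF'
"Cover" 1
"Introduction" 7
"Chapter One" 15
"Chapter Two" 43
"Index" 210
EOF
~~~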
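finally, one way to strip page numbers out of `book.md`, assuming they end up as lines containing nothing but a bare number (worth checking before you delete anything):

~~~
# preview the lines that would be removed
grep -nE '^[0-9]+$' book.md | head
# then delete them in place (gnu sed)
sed -i -E '/^[0-9]+$/d' book.md
~~~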
books that i have scanned
many of the things i’ve scanned i don’t read, or not immediately, so i can’t always say whether something is worth reading. i’ll add notes on them here as i get them uploaded to libgen, but for now you can see them in my book library.