Scanned book to compact PDF

Although scanned images tend to be quite large, books consist mostly of text, so a monochrome representation is still legible. Unfortunately, it can be tricky to get a small PDF from your input file. Here is one easy way to do it:

# input.ps stands for your input file; the directory data must already exist
gs -sDEVICE=pnggray -dNOPAUSE -dBATCH -dQUIET -r120 -sOutputFile=data/m%d.png input.ps

Yes, there is pngmono, but this device has issues with rescaled text. Now the folder data contains one PNG file per page. If you change into this directory, the following command (assuming you have ImageMagick installed) creates a second, smaller set of PNG files:

for i in m*.png; do
  ID=$(echo "$i" | sed 's/^m//;s/\..*//')
  convert "$i" -monochrome "s$ID.png"; echo "$ID" '\includegraphics{./s'"$ID"'.png}'
done | sort -n | cut -d" " -f 2

This command prints, in page order, the \includegraphics lines that you have to paste into the body of a LaTeX document that loads the graphicx package.
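As a minimal sketch, the whole LaTeX file can also be written straight from the shell. The function name make_tex, the preamble (article class, graphicx, geometry with zero margins) and the scaling options are assumptions for illustration, not the template originally used here. The function expects the s*.png files produced above and prints one full-page image per page, sorted by page number.

```shell
#!/bin/sh
# Sketch: print a complete LaTeX document with one scanned page per PDF page.
# The preamble is an assumed minimal template, not the original one.
make_tex() {
  dir=$1
  # Quoted heredoc delimiter: backslashes are written out literally.
  cat <<'EOF'
\documentclass{article}
\usepackage{graphicx}
\usepackage[margin=0pt]{geometry}
\pagestyle{empty}
\begin{document}
EOF
  # Pair each s<N>.png with its page number N, then sort numerically
  # so that s10.png comes after s2.png.
  for f in "$dir"/s*.png; do
    [ -e "$f" ] || continue          # skip the unexpanded pattern if no files match
    n=${f##*/s}; n=${n%.png}         # strip directory, leading "s" and extension
    printf '%s %s\n' "$n" "$f"
  done | sort -n | while read -r n f; do
    printf '\\noindent\\includegraphics[width=\\textwidth,height=\\textheight,keepaspectratio]{%s}\\newpage\n' "$f"
  done
  printf '%s\n' '\end{document}'
}
```

Called from inside the data directory as make_tex . > book.tex (book.tex is just an example name), this yields a file that can be passed to pdflatex.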


Compiling this document with pdflatex gives you your final file. I know that this method is unusual (it combines ghostscript, imagemagick and pdflatex), but on most machines these tools should be available without further installation. I just tried this on a 640-page file. Here are the file sizes for comparison:

Method (Format)       File size
original file (PS)    767 MB
CUPS export (PDF)     106 MB
pdfsizeopt (PDF)       72 MB
this method (PDF)      22 MB
