Scanned book to compact PDF – Life on Numbers

Although scanned images tend to be quite large, books consist mostly of text, so monochrome representations remain legible. Unfortunately, it can be tricky to get a small PDF from your input file. Here is one easy way to do it:

mkdir -p data
gs -sDEVICE=pnggray -dNOPAUSE -dBATCH -dQUIET -r120 -sOutputFile=data/m%d.png inputfile.ps

Yes, there is pngmono, but this device has issues with rescaled text. Now the folder data contains one PNG file per page. If you enter this directory, the following command (assuming you have ImageMagick installed) creates a second, smaller set of PNG files:

for i in m*.png
do
  ID=$(echo "$i" | sed 's/^m//;s/\..*//')   # page number, e.g. m12.png -> 12
  convert "$i" -monochrome "s$ID.png"       # black-and-white copy, saved as s<N>.png
  printf '%s \\includegraphics{./s%s.png}\n' "$ID" "$ID"
done | sort -n | cut -d" " -f 2

This command prints the lines you will have to paste into this LaTeX template, in place of the example \includegraphics lines:

\documentclass[a4paper,10pt]{scrartcl}
\usepackage[utf8]{inputenc}
\usepackage{graphics}
\usepackage{geometry}
\geometry{top=0cm,bottom=0cm,left=0cm,right=0cm,nohead,nofoot}
\begin{document}
\pdfcompresslevel=9
\begin{center}
 \includegraphics{./s1.png}
 \includegraphics{./s2.png}
 \includegraphics{./s3.png}
 \includegraphics{./s4.png}
 \includegraphics{./s5.png}
\end{center}
\end{document}
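
If you would rather not paste the lines by hand, the template can also be generated. The following is only a sketch: the function name make_tex is made up for this example, and it assumes the s<N>.png files from the previous step already exist in the given directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool): print a complete LaTeX file
# that includes every s<N>.png found in a directory, in page order.
make_tex() {
  dir=${1:-.}   # directory holding the s<N>.png files (default: current one)
  cat <<'EOF'
\documentclass[a4paper,10pt]{scrartcl}
\usepackage[utf8]{inputenc}
\usepackage{graphics}
\usepackage{geometry}
\geometry{top=0cm,bottom=0cm,left=0cm,right=0cm,nohead,nofoot}
\begin{document}
\pdfcompresslevel=9
\begin{center}
EOF
  # Collect the page numbers, sort them numerically, emit one line per page.
  for f in "$dir"/s*.png; do
    [ -e "$f" ] || continue   # skip the unexpanded pattern if nothing matches
    n=${f##*/s}               # drop the directory part and the leading "s"
    echo "${n%.png}"          # drop the extension
  done | sort -n | while read -r n; do
    printf ' \\includegraphics{./s%s.png}\n' "$n"
  done
  cat <<'EOF'
\end{center}
\end{document}
EOF
}
```

Inside the image directory, something like make_tex > book.tex (book.tex being an arbitrary file name) should then give you a file that is ready for pdflatex. The generated paths are relative (./s1.png and so on), so pdflatex has to be run from that same directory.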

Compiling this document with pdflatex gives you your final file. I know that this method is unusual (using Ghostscript, ImageMagick, and pdflatex), but these tools should be available on most machines without further installation. I just tried this on a 640-page file. Here are the reference file sizes:

Method (Format)	File size
original file (PS)	767 MB
CUPS export (PDF)	106 MB
pdfsizeopt (PDF)	72 MB
this method (PDF)	22 MB
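
To put these numbers in proportion, here is a small sketch that expresses each size as a percentage of the original PS file. The sizes are copied from the table; the function name size_report and the short row labels are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print each file size as a percentage of the first row.
size_report() {
  awk 'NR == 1 { base = $2 }                       # first row is the reference
       { printf "%s %.1f%%\n", $1, 100 * $2 / base }' <<'EOF'
original 767
cups_export 106
pdfsizeopt 72
this_method 22
EOF
}

size_report
```

The last row works out to roughly 3% of the original size, or about a 35-fold reduction.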
