hpr1939 :: Collating Pages with pdftk

I describe how to collate the pages of two separate PDF files using pdftk.

This episode, hosted by Jon Kulp on Thursday, 2016-01-07, is flagged as Clean and is released under a CC-BY-SA license.
Tags: pdftk, scripting, productivity, scanning, document management, pdf. Comments: 4.

Duration: 00:15:34

Part of the series: general.

I'm moving into my new office at work, and among the many things I had to move are file boxes full of old class notes from graduate school. The academic hoarder in me doesn't want to recycle them (I might need these things again!), so I'm scanning them instead.

I've inherited an excellent scanner/copier with a feeder that lets you scan stacks of pages with one click. This works great for single-sided documents, but most of my handwritten notes are double-sided. I scan one side, then turn the stack over and scan the other side, and I end up with two PDFs for a single stack of pages—one with the front pages and the other with back pages in reverse order. The difficulty is to collate the pages of those two files so that the front and back sides appear in a single PDF in the correct order. Sounds like a job for a shell script!

The script takes two CLI arguments. The first argument is the PDF containing front pages, and the second is the PDF of the back pages.
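Since everything depends on those two arguments, it can help to check them before doing any work. Here is a minimal sketch, assuming you want the script to stop early when an argument is missing or unreadable. The function name check_args is hypothetical and is not part of the script shown later.

```shell
#!/bin/bash
# check_args is a hypothetical helper, not part of the original script.
# It succeeds only when exactly two readable files are given.
check_args() {
    if [ "$#" -ne 2 ]; then
        echo "usage: $0 front.pdf back.pdf" >&2
        return 1
    fi
    local f
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "cannot read: $f" >&2
            return 1
        fi
    done
    return 0
}

# At the top of a script this would be called as:
#   check_args "$@" || exit 1
```

The check returns a status instead of exiting, so the caller decides what to do on failure.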

The first job is to take the back sides and reverse their page order, because they were scanned last page first. This is very easy with pdftk:

pdftk back.pdf cat end-1 output backfix.pdf
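Reversing the pages only lines things up if both scans have the same number of pages. If the feeder pulls two sheets at once on one pass, the fronts and backs will be paired wrongly from that point on. The following is a rough sketch of a check, relying on the NumberOfPages line that pdftk's dump_data operation reports. The helper names page_count and same_length are my own inventions for illustration.

```shell
#!/bin/bash
# page_count and same_length are hypothetical helper names.

# Print the number of pages in a PDF, taken from pdftk's dump_data report.
page_count() {
    pdftk "$1" dump_data 2>/dev/null | awk '/^NumberOfPages:/ {print $2}'
}

# Succeed only if both files report the same, non-empty page count.
same_length() {
    local a b
    a=$(page_count "$1")
    b=$(page_count "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Possible use before collating:
#   same_length front.pdf back.pdf || echo "page counts differ" >&2
```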

Now that the pages are all in the correct order it's time to collate them. We're going to use the burst function of the PDF toolkit to explode each of the two PDFs into separate pages. After that, we recombine the separate pages in the correct order. The trick is finding a way to do this efficiently. In concept, it's not hard to collate pages in whatever order you want after they've been burst. You simply keep giving pdftk CLI arguments for all of the files you want to combine and then output them as a single file. However, if you have 40 or 50 pages, it's extremely tedious to provide that many CLI args one at a time. This must be automated!

The way I figured out how to do this was to make sure the burst command would output files that sort into the correct order automatically when listed with the ls command inside the working directory. The burst command numbers the output files automatically, but you can specify the filename format if you want to. I chose a format that begins the filename with the page number, zero-padded to at least three digits (001, 002, etc.), followed by an underscore and either the word "front" for the front pages or "reverse" for the back pages.
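To see why the zero padding matters, here is a small sketch that builds names in the same %03d pattern with the shell's printf and sorts them the way a directory listing would. No PDFs are involved, and the function name page_name is hypothetical.

```shell
#!/bin/bash
# page_name is a hypothetical helper that mimics the burst naming pattern.
page_name() {
    printf '%03d_%s.pdf\n' "$1" "$2"
}

# Names for a three-sheet stack, generated out of order and then sorted.
for n in 3 1 2; do
    page_name "$n" reverse
    page_name "$n" front
done | LC_ALL=C sort

# Without padding, page 10 would sort ahead of page 2:
printf '%s\n' 2_front.pdf 10_front.pdf | LC_ALL=C sort
```

The first listing comes out as 001_front.pdf, 001_reverse.pdf, 002_front.pdf and so on, because "front" sorts before "reverse" within each page number. The second listing puts 10_front.pdf first, which is the ordering problem the padding avoids.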

So here are the burst commands:

pdftk front.pdf burst output %03d_front.pdf
pdftk backfix.pdf burst output %03d_reverse.pdf

At this point a bunch of new files appear, looking something like this:

001_front.pdf
001_reverse.pdf
002_front.pdf
002_reverse.pdf
003_front.pdf
003_reverse.pdf
...

Notice how the front and back pages all appear in the correct order? Now, instead of typing in the filename for every page, we can use the output of the ls command, filtering out any files not beginning with numbers.

pdftk $(ls | grep '^[0-9]') cat output collated.pdf
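As an aside, the shell can produce the same sorted list on its own, because a glob pattern expands in sorted order. This sketch is a variation and not what the episode's script does. It lists the burst files in a directory using a glob instead of ls and grep. The helper name list_pages is hypothetical.

```shell
#!/bin/bash
# list_pages is a hypothetical helper: print the burst files found in a
# directory, one per line, in the order the shell expands the glob.
list_pages() {
    local dir=$1
    (
        cd "$dir" || exit 1
        # Matches only names that start with a digit and end in .pdf.
        # If nothing matches, the pattern itself is printed unchanged.
        printf '%s\n' [0-9]*.pdf
    )
}

# In the collating step the same glob could be given to pdftk directly:
#   pdftk [0-9]*.pdf cat output collated.pdf
```

One advantage of the glob is that filenames containing spaces stay intact, which is not true of the unquoted output of ls.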

And it's done. The entire script looks like this:

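#!/bin/bash

# Requires: pdftk
# Arguments: $1 = PDF of front pages, $2 = PDF of back pages (in reverse order)

# Resolve both inputs to absolute paths so they still work after the cd below.
front=$(readlink -f "$1")
back=$(readlink -f "$2")
basedir=$(dirname "$front")
stem=$(basename "$back" .pdf)
backfix="$stem"-fixed.pdf
# Name the result after the front file, replacing "front" with "Combined".
new=$(basename "$front" .pdf | sed -e 's/[Ff]ront/Combined/')

cd "$basedir" || exit 1
# Reverse the back pages, which were scanned last page first.
pdftk "$back" cat end-1 output "$backfix" &> /dev/null
# Burst both files into single pages: NNN_front.pdf and NNN_reverse.pdf.
pdftk "$front" burst output %03d_front.pdf &> /dev/null
pdftk "$backfix" burst output %03d_reverse.pdf &> /dev/null
# Recombine every file whose name starts with a digit, in ls order.
pdftk $(ls | grep '^[0-9]') cat output "$new".pdf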
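For comparison, newer releases of pdftk document a shuffle operation that interleaves page ranges directly, so the collation can be done without bursting. This is an alternative to the script above, not the method used in this episode. Check your version's man page to confirm that shuffle is available. The wrapper name collate_shuffle is hypothetical.

```shell
#!/bin/bash
# collate_shuffle is a hypothetical wrapper name; "shuffle" is the pdftk
# operation being demonstrated.
collate_shuffle() {
    local front=$1 back=$2 out=$3
    # A and B are handles for the two input files. The range "A" means all
    # of A in order, and "Bend-1" means all of B from last page to first.
    # shuffle takes one page from each range in turn.
    pdftk A="$front" B="$back" shuffle A Bend-1 output "$out"
}

# Possible use:
#   collate_shuffle front.pdf back.pdf collated.pdf
```

This leaves no intermediate files behind, though the burst approach has the advantage of letting you inspect or replace individual pages before recombining.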

Links

pdftk man page: https://www.pdflabs.com/docs/pdftk-man-page/
My earlier HPR episode about pdftk: hpr1760

Comments

Comment #1 posted on 2016-01-10 14:05:52 by Ken Fallon

Thanks

The exact right show at the exact right time.

I am looking at doing something similar for jpg scanning.

Now if Dave would only get off his donkey and send us in the show on how to scan via cups, we'd be finished.

Ken.

Comment #2 posted on 2016-01-10 16:53:44 by Dave Morriss

Very nice

Great show Jon. I like how pdftk can do so many things to PDFs.

You'd expect me to comment on the script, so I tend to use 'find' instead of 'ls' in such cases. It has some powerful regular expression capabilities and is less error-prone than 'ls' in my experience. It's more complex to get right though.

Ken: I don't think CUPS can scan. In my show 1864 I described how I'd set up SANE to do scanning on my Raspberry Pi connected to my HP Inkjet/Scanner. That was for one-at-a-time scanning though, not bulk stuff.

Interesting donkey-related fact: Thursday January 14th is The Festival of the Ass. I have it in my calendar.

Comment #3 posted on 2016-01-11 12:08:04 by el Mussol

where is Dave

As possibly the only HPR listener with donkeys, I would like to clarify that if Mr Morriss is sat on a donkey somewhere, it's not on one of ours.

pault

Comment #4 posted on 2016-01-14 13:54:10 by Dave Morriss

Donkeys

It's my impression that Ken's donkey reference was euphemistic. So no actual donkeys were harmed ...

HPR

Hacker Public Radio

https://HackerPublicRadio.org
