That was (semi-)easy: Creating a book index from PDF page proofs

When reading (printed) books, looking something up in the index every once in a while is so routine that I never thought twice about how much work actually goes into a good index. Well, now I know! I just put together an index for my own book (I’ll post more on that book soon, so be patient).

Before I started, I enjoyed reading comments from people who clearly said that they’d never do it again (for example here). They also mentioned that creating an index can be a very lengthy process. As it turns out, mine took me roughly three days, and as long as one has a good set of PDF page proofs, it’s actually not that hard. It just meant I had to plow (yet again) through the entire text. For the curious, here is my process:

To be able to use this method you must have:

  • A PDF file of the book with accurate page numbers. Page proofs are typically a good starting point.
  • The full version of Adobe Acrobat or the freely available PDF-XChange Viewer (see link below). Any PDF viewer will do as long as it has the required capabilities (described next).

This method requires that you read through the entire text and highlight (in the PDF) every single word or term that is supposed to go into the index – wherever it appears. However, before you get going on highlighting anything, in Acrobat (or your PDF reader of preference), go to Edit > Preferences > Commenting and select “Copy selected text into highlight…”. This is very important because we will extract the text contained in the markup using the method below. It will also serve as the sortable base for our index.

Now go through your PDF book and use the text highlighter to accurately highlight any term (a word, a sentence,…) that you want to have extracted for the index (use one highlight per term). You can even add notes to self in the highlights – for example, if you want to add something for an entire section but don’t want to highlight the entire section. Once you are done, save everything.

Before you export the terms, it is a good idea to go through them in Acrobat’s comment review pane and edit them to conform to the index standards. You want the comment contents in a form where you can simply sort them alphabetically later, do some minor editing and formatting and then be done with it.

Next, we will use the JavaScript engine in the PDF viewer to extract our highlights, complete with page numbers, as a tab-separated file, which in turn can be opened in Excel. Hit Ctrl-J to open the JavaScript console and paste the following script into it:
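As a rough illustration of what such an extraction script can look like, here is a minimal sketch. It is not necessarily the script I used, and it makes a few assumptions: the helper name highlightsToTsv and the attachment name index-terms.txt are made up for this example, the highlight text is assumed to sit in each annotation's contents property (which is what the "Copy selected text into highlight…" preference takes care of), and the first PDF page is assumed to be book page 1. If your proofs start with unnumbered front matter, you would need to adjust the page offset.

```javascript
// Sketch only: helper and file names are hypothetical.
// Turns annotation-like objects into tab-separated "term<TAB>page" rows.
function highlightsToTsv(annots) {
  var lines = [];
  for (var i = 0; i < annots.length; i++) {
    var a = annots[i];
    // Keep only highlight annotations that actually carry text.
    if (a.type !== "Highlight" || !a.contents) continue;
    // Collapse tabs and line breaks so each highlight stays on one row.
    var term = String(a.contents).replace(/\s+/g, " ").replace(/^ | $/g, "");
    // Acrobat counts pages from 0; +1 assumes PDF page 1 is book page 1.
    lines.push(term + "\t" + (a.page + 1));
  }
  return lines.join("\r\n");
}

// This part only runs inside the viewer's JavaScript console,
// where `app` exists and `this` is the open document.
if (typeof app !== "undefined" && typeof this.getAnnots === "function") {
  this.syncAnnotScan(); // make sure all annotations have been scanned
  var tsv = highlightsToTsv(this.getAnnots() || []);
  // Store the rows as a file attachment, then save it to a temp file and open it.
  this.createDataObject("index-terms.txt", tsv);
  this.exportDataObject({ cName: "index-terms.txt", nLaunch: 2 });
}
```

Which application opens the exported file depends on the file extension and on your system's file associations, so you may have to open the tab-separated file from within Excel yourself.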

If you do this in Adobe Acrobat, select the entire script in the console and hit Ctrl-Enter to execute it. In other viewers, you might be able to just push a Run button.

The last line of this code will open Excel, which then displays all of your highlights, neatly arranged with terms in column 1 and page numbers in column 2. Now you can sort them alphabetically, do a final edit and then load them into your word processing application (save them as a comma-separated (CSV) file, for example).
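If you would rather not do the sorting by hand in Excel, the same step can be scripted. The following is a minimal sketch under my own assumptions, not part of the process described above: the function name buildIndex is made up, the input is assumed to be the tab-separated "term, tab, page" rows from the export, and it is meant to run in a modern JavaScript runtime such as Node.js. It groups the rows by term, drops duplicate page numbers, and sorts terms alphabetically without regard to case.

```javascript
// Sketch only: collapse "term<TAB>page" rows into one sorted line per term.
function buildIndex(tsv) {
  var pagesByTerm = Object.create(null); // no inherited keys such as "constructor"
  var rows = tsv.split(/\r?\n/);
  for (var i = 0; i < rows.length; i++) {
    var parts = rows[i].split("\t");
    var page = parseInt(parts[1], 10);
    // Skip rows that lack a term or a numeric page.
    if (parts.length < 2 || !parts[0] || isNaN(page)) continue;
    var term = parts[0];
    if (!pagesByTerm[term]) pagesByTerm[term] = [];
    // Record each page only once per term.
    if (pagesByTerm[term].indexOf(page) === -1) pagesByTerm[term].push(page);
  }
  // Alphabetical order, ignoring case.
  var terms = Object.keys(pagesByTerm).sort(function (a, b) {
    return a.toLowerCase().localeCompare(b.toLowerCase());
  });
  var lines = [];
  for (var j = 0; j < terms.length; j++) {
    var pages = pagesByTerm[terms[j]].sort(function (a, b) { return a - b; });
    lines.push(terms[j] + ", " + pages.join(", "));
  }
  return lines.join("\n");
}
```

Terms that differ only in capitalization stay separate entries here, so it still pays to clean up the highlights in the comment pane before exporting.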

UPDATE: Starting with Acrobat X, Actions can be used to do this instead of raw JavaScript code. Try the following instead: