To help with parsing PDF documents I found the pure-Python PyPDF2 package, which provides a simple interface for manipulating PDF documents. Using this package, it was relatively simple to put together the code to rotate each page and crop it to the area of interest. The following Python function does this by creating two reader objects, using one for the top of each page and the other for the bottom.
import PyPDF2 as pdf

def convert_to_landscape(fn, out_fn, verbose=False):
    # Two readers over the same file: each yields its own page objects,
    # so the top and bottom crops of a page can be set independently.
    reader = pdf.PdfFileReader(fn)
    reader2 = pdf.PdfFileReader(fn)
    writer = pdf.PdfFileWriter()
    number_pages = reader.getNumPages()
    for pn in range(number_pages):
        if verbose: print(f'Onto page {pn}')
        page = reader.getPage(pn)
        page2 = reader2.getPage(pn)
        b = page.cropBox
        ar = b[3]/b[2]  # height / width for a box starting at the origin
        crop_height = int(b[2]/ar)  # height of a landscape-shaped crop
        # Top of the page (612 pt is the US Letter width)
        page.rotateCounterClockwise(90)
        page.cropBox = pdf.generic.RectangleObject([0, b[3]-crop_height, 612, b[3]])
        writer.addPage(page)
        # Bottom of the page
        page2.rotateCounterClockwise(90)
        page2.cropBox = pdf.generic.RectangleObject([0, 0, 612, crop_height])
        writer.addPage(page2)
    if verbose: print("Done processing pages and now writing them to the file")
    with open(out_fn, "wb") as io:
        writer.write(io)
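To make the crop arithmetic in the function above concrete, here is a minimal sketch that computes the two crop rectangles without touching PyPDF2 at all. The helper name landscape_crop_boxes is hypothetical (it is not part of PyPDF2 or of the function above), and it assumes the page's crop box starts at the origin. It also uses the page width in place of the hard-coded 612, which is a small generalisation of the original.

```python
def landscape_crop_boxes(width, height):
    """Return (top_box, bottom_box) as (x0, y0, x1, y1) tuples in points.

    Hypothetical helper mirroring the arithmetic in convert_to_landscape;
    assumes the crop box has its lower-left corner at (0, 0).
    """
    aspect_ratio = height / width            # e.g. 792 / 612 for US Letter
    crop_height = int(width / aspect_ratio)  # height of a landscape-shaped crop
    top_box = (0, height - crop_height, width, height)
    bottom_box = (0, 0, width, crop_height)
    return top_box, bottom_box


# US Letter page: 612 x 792 points
top, bottom = landscape_crop_boxes(612, 792)
print(top)     # (0, 320, 612, 792)
print(bottom)  # (0, 0, 612, 472)
```

For a US Letter page the two crops overlap between y = 320 and y = 472, so no content in the middle of the page is lost between the two output pages.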
For convenience I created a simple Python package called landscape_pdf and registered it on PyPI, so it can be installed directly using pip install landscape_pdf. I also submitted a pull request to the zotrm package so that it can be used to transform PDF documents automatically when synchronising them from Zotero.