Extracting page sizes from PDF in Python
I want to read a PDF and get a list of its pages and the size of each page. I don't need to manipulate it in any way, just read it.
Currently trying out pyPdf and it does everything I need except a way to get page sizes. I understand that I will probably have to iterate through the pages, as page sizes can vary within a PDF document. Is there another library/method I can use?
I tried using PIL, some online recipes even have d=Image(imagefilename) usage, but it NEVER reads any of my PDFs - it reads everything else I throw at it - even some things I didn't know PIL could do.
Any guidance appreciated. I'm on Windows 7 64-bit, Python 2.5 (because I also do GAE stuff), but I'm happy to do it on Linux or a more modern Python.
This can be done with PyPDF2:
>>> from PyPDF2 import PdfReader
>>> reader = PdfReader('example.pdf')
>>> box = reader.pages[0].mediabox
>>> box
RectangleObject([0, 0, 612, 792])
>>> box.width
Decimal('612')
>>> box.height
Decimal('792')
(Formerly known as pyPdf.)
With pdfrw:
>>> from pdfrw import PdfReader
>>> pdf = PdfReader('example.pdf')
>>> pdf.pages[0].MediaBox
['0', '0', '595.2756', '841.8898']
Lengths are given in points (1 pt = 1/72 inch). The format is [x0, y0, x1, y1]
(thanks, mara004!).
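Since pdfrw hands the box back as strings and PyPDF2 as Decimals, a small helper can normalise either into a plain width and height. This is only a sketch: the names box_size_points and points_to_mm are my own and not part of any of these libraries, and it assumes the box is a four-element [x0, y0, x1, y1] sequence as described above.

```python
from typing import Sequence, Tuple, Union

Number = Union[str, int, float]

POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def box_size_points(box: Sequence[Number]) -> Tuple[float, float]:
    """Return (width, height) in points for a box given as [x0, y0, x1, y1]."""
    # float() accepts the strings pdfrw returns as well as Decimal or int values
    x0, y0, x1, y1 = (float(v) for v in box)
    # abs() guards against boxes whose corners are listed in reverse order
    return abs(x1 - x0), abs(y1 - y0)


def points_to_mm(points: float) -> float:
    """Convert a length in PDF points (1/72 inch) to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH


# The MediaBox from the pdfrw example above
width_pt, height_pt = box_size_points(['0', '0', '595.2756', '841.8898'])
print(round(points_to_mm(width_pt), 1), round(points_to_mm(height_pt), 1))
```

For the pdfrw box above this comes out at roughly 210 x 297 mm, i.e. A4.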
Update 2021-07-22: the original answer was not always correct, so I have updated it.
With PyMuPDF:
>>> import fitz
>>> doc = fitz.open("example.pdf")
>>> page = doc[0]
>>> print(page.rect.width, page.rect.height)
842.0 595.0
>>> print(page.mediabox.width, page.mediabox.height)
595.0 842.0
The return values of mediabox and rect are of type Rect, which has the attributes "width" and "height". One difference between the two is that mediabox is the same as /MediaBox in the document and does not change if the page is rotated, whereas rect is affected by rotation. For more information about the different boxes in PyMuPDF, see the glossary in its documentation.
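If you are using a library that only gives you the unrotated MediaBox plus the /Rotate value, the relationship between the two sizes can be sketched in plain Python. The function name displayed_size is hypothetical, and this assumes the rotation is a multiple of 90 degrees, which is what the PDF format expects for /Rotate.

```python
def displayed_size(width, height, rotation=0):
    """Return (width, height) as displayed, given the unrotated box size.

    rotation is the page's /Rotate value in degrees; it is assumed to be
    a multiple of 90 (negative values are accepted as well).
    """
    if rotation % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    # An odd number of quarter turns swaps width and height
    if (rotation // 90) % 2:
        return height, width
    return width, height


# The page from the PyMuPDF example: portrait MediaBox, rotated by 90 degrees
print(displayed_size(595.0, 842.0, 90))
```

With a rotation of 0 or 180 the size is returned unchanged, so the function can be applied to every page regardless of whether it is rotated.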
For pdfminer on Python 3.x (pdfminer.six); I did not try it on Python 2.7:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage

with open(pdfPath, 'rb') as f:  # pdfPath is the path to your PDF file
    parser = PDFParser(f)
    doc = PDFDocument(parser)
    pageSizesList = []
    for page in PDFPage.create_pages(doc):
        print(page.mediabox)  # the media box, i.e. the page size as a list of 4 numbers: x0 y0 x1 y1
        pageSizesList.append(page.mediabox)  # pageSizesList ends up holding one box per page
With pikepdf:
import pikepdf
# open the file and select the first page
pdf = pikepdf.Pdf.open("/path/to/file.pdf")
page = pdf.pages[0]
if '/CropBox' in page:
    # use CropBox if defined since that's what the PDF viewer would usually display
    relevant_box = page.CropBox
elif '/MediaBox' in page:
    relevant_box = page.MediaBox
else:
    # fall back to ANSI A (US Letter) if neither CropBox nor MediaBox are defined
    # unlikely, but possible
    relevant_box = [0, 0, 612, 792]
# actually there could also be a viewer preference ViewArea or ViewClip in
# pdf.Root.ViewerPreferences defining which box to use, but most PDF readers
# disregard this option anyway
# check whether the page defines a UserUnit
userunit = 1
if '/UserUnit' in page:
    userunit = float(page.UserUnit)
# convert the box coordinates to float and multiply with the UserUnit
relevant_box = [float(x)*userunit for x in relevant_box]
# obtain the dimensions of the box
width = abs(relevant_box[2] - relevant_box[0])
height = abs(relevant_box[3] - relevant_box[1])
rotation = 0
if '/Rotate' in page:
    rotation = page.Rotate
# if the page is rotated clockwise or counter-clockwise, swap width and height
# (pdf rotation modifies the coordinate system, so the box always refers to
# the non-rotated page)
if (rotation // 90) % 2 != 0:
    width, height = height, width
# now you have width and height in points
# 1 point is equivalent to 1/72in (1in -> 2.54cm)
Another way is to use popplerqt4:
import popplerqt4
doc = popplerqt4.Poppler.Document.load('/path/to/my.pdf')
qsizedoc = doc.page(0).pageSize()
h = qsizedoc.height()  # given in pt, 1pt = 1/72 in
w = qsizedoc.width()
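Working code for Python 3.9 with older PyPDF2 releases (before 3.0), which still use the legacy names PdfFileReader, mediaBox, getWidth() and getHeight():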
from PyPDF2 import PdfFileReader
reader = PdfFileReader('C:\\MyFolder\\111.pdf')
box = reader.pages[0].mediaBox
print(box.getWidth())
print(box.getHeight())
For all pages:
from PyPDF2 import PdfFileReader
reader = PdfFileReader('C:\\MyFolder\\111.pdf')
for i, p in enumerate(reader.pages):
    box = p.mediaBox
    print(f"i:{i} page:{i+1} Width = {box.getWidth()} Height = {box.getHeight()}")
input("Press Enter to continue...")