Extracting page sizes from PDF in Python

2023-03-11 09:19 问答作者：

I want to read a PDF and get a list of its pages and the size of each page. I don't need to manipulate it in any way, just read it.

Currently trying out pyPdf and it does everything I need except a way to get page sizes. Understanding that I will probably have to iterate through, as page sizes can vary in a pdf document. Is there another libray/method I can us开发者_如何学编程e?

I tried using PIL, some online recipes even have d=Image(imagefilename) usage, but it NEVER reads any of my PDFs - it reads everything else I throw at it - even some things I didn't know PIL could do.

Any guidance appreciated - I'm on windows 7 64, python25 (because I also do GAE stuff), but I'm happy to do it in Linux or more modern pythiis.

This can be done with PyPDF2:

>>> from PyPDF2 import PdfReader
>>> reader = PdfReader('example.pdf')
>>> box = reader.pages[0].mediabox
>>> box
RectangleObject([0, 0, 612, 792])
>>> box.width
Decimal('612')
>>> box.height
Decimal('792')

(Formerly known as pyPdf.)

With pdfrw:

>>> from pdfrw import PdfReader
>>> pdf = PdfReader('example.pdf')
>>> pdf.pages[0].MediaBox
['0', '0', '595.2756', '841.8898']

Lengths are given in points (1 pt = 1/72 inch). The format is [x0, y0, x1, y1] (thanks, mara004!).

Update in 2021-07-22: original answer was not always correct, so I update my answer.

With PyMuPDF:

>>> import fitz
>>> doc = fitz.open("example.pdf")
>>> page = doc[0]
>>> print(page.rect.width, page.rect.height)
842.0 595.0
>>> print(page.mediabox.width, page.mediabox.height)
595.0 842.0

Return values of mediabox and rect are of type Rect, which has attributes "width" and "height". One difference between mediabox and rect is that mediabox is the same as /MediaBox in document and does not change if page is rotated. However, rect is affected by rotation. For more information about different boxes in PyMuPDF, you can read glossary.

for pdfminer python 3.x (pdfminer.six) (did not try on python 2.7):

parser = PDFParser(open(pdfPath, 'rb'))
doc = PDFDocument(parser)
pageSizesList = []
for page in PDFPage.create_pages(doc):
    print(page.mediabox) # <- the media box that is the page size as list of 4 integers x0 y0 x1 y1
    pageSizesList.append(page.mediabox) # <- appending sizes to this list. eventually the pageSizesList will contain list of list corresponding to sizes of each page

With pikepdf:

import pikepdf

# open the file and select the first page
pdf = pikepdf.Pdf.open("/path/to/file.pdf")
page = pdf.pages[0]

if '/CropBox' in page:
    # use CropBox if defined since that's what the PDF viewer would usually display
    relevant_box = page.CropBox
elif '/MediaBox' in page:
    relevant_box = page.MediaBox
else:
    # fall back to ANSI A (US Letter) if neither CropBox nor MediaBox are defined
    # unlikely, but possible
    relevant_box = [0, 0, 612, 792]

# actually there could also be a viewer preference ViewArea or ViewClip in
# pdf.Root.ViewerPreferences defining which box to use, but most PDF readers 
# disregard this option anyway

# check whether the page defines a UserUnit
userunit = 1
if '/UserUnit' in page:
    userunit = float(page.UserUnit)

# convert the box coordinates to float and multiply with the UserUnit
relevant_box = [float(x)*userunit for x in relevant_box]

# obtain the dimensions of the box
width  = abs(relevant_box[2] - relevant_box[0])
height = abs(relevant_box[3] - relevant_box[1])

rotation = 0
if '/Rotate' in page:
    rotation = page.Rotate

# if the page is rotated clockwise or counter-clockwise, swap width and height
# (pdf rotation modifies the coordinate system, so the box always refers to 
# the non-rotated page)
if (rotation // 90) % 2 != 0:
    width, height = height, width

# now you have width and height in points
# 1 point is equivalent to 1/72in (1in -> 2.54cm)

Another way is to use popplerqt4

doc = popplerqt4.Poppler.Document.load('/path/to/my.pdf')
qsizedoc = doc.page(0).pageSize()
h = qsizedoc.height() # given in pt,  1pt = 1/72 in
w = qsizedoc.width()

Right code for Python 3.9 and library PyPDF2:

from PyPDF2 import PdfFileReader

reader = PdfFileReader('C:\\MyFolder\\111.pdf')
box = reader.pages[0].mediaBox
print(box.getWidth())
print(box.getHeight())

For all pages:

from PyPDF2 import PdfFileReader

reader = PdfFileReader('C:\\MyFolder\\111.pdf')

i = 0
for p in reader.pages:
    box = p.mediaBox
    print(f"i:{i}   page:{i+1}   Width = {box.getWidth()}   Height = {box.getHeight()}")
    i=i+1
    
input("Press Enter to continue...")

继续阅读：pdf python

Extracting page sizes from PDF in Python

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？