Why does my code not correctly split every page in a scanned PDF?

2023-03-16 11:40 Q&A author:

Update: Thanks to stardt, whose script works! The pdf is one page taken from another file. I tried the script on that other file, and it also correctly splits each pdf page, but the order of the resulting pages is sometimes right and sometimes wrong. For example, on pages 25-28 of the pdf file, the printed page numbers are 14, 15, 17, and 16. Why does this happen? The entire pdf can be downloaded from http://download304.mediafire.com/u6ewhjt77lzg/bgf8uzvxatckycn/3.pdf

Original: I have a scanned pdf in which two paper pages sit side by side on each pdf page. I would like to split each pdf page into two, with the original left half becoming the earlier of the two new pdf pages.

Here is my Python script named un2up inspired by Gilles:

#!/usr/bin/env python
import copy, sys
from pyPdf import PdfFileWriter, PdfFileReader
input = PdfFileReader(sys.stdin)
output = PdfFileWriter()
for p in [input.getPage(i) for i in range(0,input.getNumPages())]:
    q = copy.copy(p)
    (w, h) = p.mediaBox.upperRight

    p.mediaBox.upperLeft = (0, h/2)
    p.mediaBox.upperRight = (w, h/2)
    p.mediaBox.lowerRight = (w, 0)
    p.mediaBox.lowerLeft = (0, 0)

    q.mediaBox.upperLeft = (0, h)
    q.mediaBox.upperRight = (w, h)
    q.mediaBox.lowerRight = (w, h/2)
    q.mediaBox.lowerLeft = (0, h/2)

    output.addPage(q)
    output.addPage(p)
output.write(sys.stdout)

I ran the script on a pdf in a terminal with the command un2up < page.pdf > out.pdf, but the output out.pdf is not split correctly.

I also checked the values of w and h, which come from p.mediaBox.upperRight. They are 514 and 1224, which does not look right given the page's actual aspect ratio.

The file can be downloaded from http://download851.mediafire.com/bdr4sv7v5nzg/raci13ct5w4c86j/page.pdf.

Your code assumes that p.mediaBox.lowerLeft is (0, 0), but it is actually (0, 497).

This works for the file you provided:

#!/usr/bin/env python
import copy, sys
from pyPdf import PdfFileWriter, PdfFileReader
input = PdfFileReader(sys.stdin)
output = PdfFileWriter()
for i in range(input.getNumPages()):
    p = input.getPage(i)
    q = copy.copy(p)

    bl = p.mediaBox.lowerLeft
    ur = p.mediaBox.upperRight

    print >> sys.stderr, 'splitting page',i
    print >> sys.stderr, '\tlowerLeft:',p.mediaBox.lowerLeft
    print >> sys.stderr, '\tupperRight:',p.mediaBox.upperRight

    p.mediaBox.upperRight = (ur[0], (bl[1]+ur[1])/2)
    p.mediaBox.lowerLeft = bl

    q.mediaBox.upperRight = ur
    q.mediaBox.lowerLeft = (bl[0], (bl[1]+ur[1])/2)
    if i%2==0:
        output.addPage(q)
        output.addPage(p)
    else:
        output.addPage(p)
        output.addPage(q)

output.write(sys.stdout)
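
To see why the cut has to be placed relative to lowerLeft, the arithmetic can be checked without pyPdf. The following is only a sketch: the helper name split_box_vertically is hypothetical, and the box values are the ones reported in this thread for page.pdf (lowerLeft (0, 497), upperRight (514, 1224)).

```python
def split_box_vertically(lower_left, upper_right):
    """Return (bottom_half, top_half), each as (lower_left, upper_right)."""
    x0, y0 = lower_left
    x1, y1 = upper_right
    mid = (y0 + y1) / 2.0  # midpoint of the box, not of the distance from 0
    bottom = ((x0, y0), (x1, mid))
    top = ((x0, mid), (x1, y1))
    return bottom, top


# Values reported in this thread for page.pdf.
lower_left = (0, 497)
upper_right = (514, 1224)

bottom, top = split_box_vertically(lower_left, upper_right)

# Heights of the two halves when the cut is at the true midpoint.
bottom_height = bottom[1][1] - bottom[0][1]
top_height = top[1][1] - top[0][1]

# How much of the scanned area lies below a cut placed at h/2,
# which is where the original script cuts.
naive_cut = upper_right[1] / 2.0
naive_bottom_height = naive_cut - lower_left[1]

print(bottom_height, top_height, naive_bottom_height)
```

With the true midpoint both halves get the same height, while a cut at h/2 leaves only a thin strip of the scan in the lower half, which matches the uneven split described in the question.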

@stardt's code was quite useful, but I had trouble splitting a batch of pdf files with different orientations. Here's a more general function that works regardless of the page orientation:

import copy
import math
import pyPdf

def split_pages(src, dst):
    src_f = file(src, 'r+b')
    dst_f = file(dst, 'w+b')

    input = pyPdf.PdfFileReader(src_f)
    output = pyPdf.PdfFileWriter()

    for i in range(input.getNumPages()):
        p = input.getPage(i)
        q = copy.copy(p)
        q.mediaBox = copy.copy(p.mediaBox)

        x1, x2 = p.mediaBox.lowerLeft
        x3, x4 = p.mediaBox.upperRight

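        x1, x2 = math.floor(x1), math.floor(x2)
        x3, x4 = math.floor(x3), math.floor(x4)
        # Midpoints of the box; lowerLeft is not always (0, 0)
        x5, x6 = math.floor((x1+x3)/2), math.floor((x2+x4)/2)

        # Compare width and height rather than the raw upperRight values
        if x3 - x1 > x4 - x2: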
            # horizontal
            p.mediaBox.upperRight = (x5, x4)
            p.mediaBox.lowerLeft = (x1, x2)

            q.mediaBox.upperRight = (x3, x4)
            q.mediaBox.lowerLeft = (x5, x2)
        else:
            # vertical
            p.mediaBox.upperRight = (x3, x4)
            p.mediaBox.lowerLeft = (x1, x6)

            q.mediaBox.upperRight = (x3, x6)
            q.mediaBox.lowerLeft = (x1, x2)

        output.addPage(p)
        output.addPage(q)

    output.write(dst_f)
    src_f.close()
    dst_f.close()

I'd like to add that you have to make sure the mediaBox objects are not shared between the copies p and q. This can easily happen if you read from p.mediaBox before taking the copy.

In that case, writing to e.g. p.mediaBox.upperRight may also modify q.mediaBox, and vice versa.

@moraes' solution takes care of this by explicitly copying the mediaBox.
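
The sharing is ordinary shallow-copy behaviour and can be reproduced without pyPdf. In the sketch below, Page and Box are hypothetical stand-ins, not pyPdf classes; they only mimic an object that holds a reference to a mutable box.

```python
import copy


class Box(object):
    """Hypothetical stand-in for a mutable rectangle such as a media box."""
    def __init__(self, lower_left, upper_right):
        self.lower_left = lower_left
        self.upper_right = upper_right


class Page(object):
    """Hypothetical stand-in for a page that owns a Box."""
    def __init__(self, media_box):
        self.media_box = media_box


p = Page(Box((0, 0), (100, 200)))

# A shallow copy duplicates the page object but not the box it refers to.
q = copy.copy(p)
shared = q.media_box is p.media_box

p.media_box.upper_right = (100, 100)
leaked = q.media_box.upper_right  # q sees the change made through p

# Giving q its own box breaks the link between the two pages.
q.media_box = copy.copy(p.media_box)
p.media_box.upper_right = (50, 100)
independent = q.media_box.upper_right  # unchanged by the second write

print(shared, leaked, independent)
```

The explicit second copy plays the same role as the line q.mediaBox = copy.copy(p.mediaBox) in @moraes' function.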
