
PDF text search and split library

I am looking for a server-side PDF library (or command-line tool) which can:

  • split a multi-page PDF file into individual PDF files, based on a search result of the PDF file content

Examples:

  • Search for a "Page ???" pattern in the text and split the big PDF into 001.pdf, 002.pdf, ... ???.pdf

A server program will scan the PDF, look for the search pattern, collect the page(s) which match the pattern, and save them as files on disk.

Integration with PHP / Ruby would be nice. A command-line tool is also acceptable. It will be a server-side (Linux or Win32) batch processing tool, so anything that needs a GUI or a login will not work. i18n support would be nice but is not required. Thanks~


My company, Atalasoft, has just released some PDF manipulation tools that run on .NET. There is a text extraction class that you can use to find the text and decide how to split your document, and a very high-level document class that makes the splitting trivial. Suppose you have a Stream to your source PDF and a List<int>, in increasing order, that gives the zero-based starting page of each split. Then the code to generate your split files looks like this:

public void SplitPdf(Stream stm, List<int> pageStarts, string outputDirectory)
{
    PdfDocument mainDoc = new PdfDocument(stm);
    int lastPage = mainDoc.Pages.Count - 1;

    for (int i = 0; i < pageStarts.Count; i++) {
        // Each split runs from its start page up to the page before the next start.
        int startPage = pageStarts[i];
        int endPage = (i < pageStarts.Count - 1) ?
            pageStarts[i + 1] - 1 :
            lastPage;
        if (startPage > endPage) throw new ArgumentException("list is not ordered properly", "pageStarts");
        PdfDocument splitDoc = new PdfDocument();
        for (int j = startPage; j <= endPage; j++)
            splitDoc.Pages.Add(mainDoc.Pages[j]);

        // Output files are named 001.pdf, 002.pdf, ...
        string outputPath = Path.Combine(outputDirectory,
                                         string.Format("{0:D3}.pdf", i + 1));
        splitDoc.Save(outputPath);
    }
}

If you generalize this into a page range struct:

public struct PageRange {
    public int StartPage;
    public int EndPage;
}

where StartPage and EndPage inclusively describe a range of pages, then the code is simpler:

public void SplitPdf(Stream stm, List<PageRange> ranges, string outputDirectory)
{
    PdfDocument mainDoc = new PdfDocument(stm);

    int outputDocCount = 1;
    foreach (PageRange range in ranges) {
        int startPage = Math.Min(range.StartPage, range.EndPage); // assume not in order
        int endPage = Math.Max(range.StartPage, range.EndPage);
        PdfDocument splitDoc = new PdfDocument();
        for (int i=startPage; i <= endPage; i++)
            splitDoc.Pages.Add(mainDoc.Pages[i]);
        string outputPath = Path.Combine(outputDirectory, 
                                         string.Format("{0:D3}.pdf", outputDocCount));
        splitDoc.Save(outputPath);
        outputDocCount++;
    }
}
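
The two variants above take different inputs: a list of start pages versus a list of inclusive ranges. As a language-neutral illustration (plain Python, not the Atalasoft API), converting the first form into the second could look roughly like the sketch below. The function name starts_to_ranges is hypothetical, and page indices are assumed to be zero-based, as in the C# code.

```python
def starts_to_ranges(page_starts, page_count):
    """Turn ascending zero-based start pages into inclusive (start, end) ranges."""
    ranges = []
    for i, start in enumerate(page_starts):
        # Each range ends just before the next start, or at the last page.
        if i + 1 < len(page_starts):
            end = page_starts[i + 1] - 1
        else:
            end = page_count - 1
        if start > end:
            raise ValueError("page_starts is not ordered properly")
        ranges.append((start, end))
    return ranges
```

For an 8-page document with splits starting at pages 0, 3 and 5, this yields the ranges (0, 2), (3, 4) and (5, 7).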


PDFBox is a Java library but it does have some command line tools as well:

http://pdfbox.apache.org/

PDFBox can extract text and also rebuild or split PDFs.


Use pdfminer to extract the text, then do multi-line pattern matching on it in Python.
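
A minimal sketch of the pattern-matching half, assuming the text of each page has already been extracted (for example with pdfminer) into a list of strings, one per page. The function name find_section_starts and the default pattern are illustrative only and are not part of pdfminer.

```python
import re

def find_section_starts(page_texts, pattern=r"^Page\s+(\d+)\s*$"):
    """Return zero-based indices of pages whose text matches the pattern."""
    # MULTILINE lets ^ and $ match at the start and end of every line
    # of a page, not only at the start and end of the whole page text.
    regex = re.compile(pattern, re.MULTILINE)
    return [i for i, text in enumerate(page_texts) if regex.search(text)]
```

With the default pattern, a page matches only when "Page N" stands on a line of its own, so a passing mention such as "see Page 3 for details" is ignored. A pattern that has to span several lines would additionally need re.DOTALL or explicit \n in the expression.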


You can use pdfsam to split your file into single pages, then use pdftotext (from foolabs.com) to turn each page into text, and use Ruby (or grep) to find the strings. That gives you the matching page ranges, and you can return the previously generated page files for them.
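
pdftotext separates pages in its output with a form feed character, so one run over the whole document is enough to find out which pages match. The sketch below assumes the pdftotext output has already been read into a string. The function name matching_page_files is hypothetical, and so is the NNN.pdf naming of the per-page files, which depends on how the split step was configured.

```python
import re

def matching_page_files(pdftotext_output, pattern):
    """Map 1-based page numbers that match the pattern to NNN.pdf names."""
    pages = pdftotext_output.split("\f")
    # The output usually ends with a form feed, which leaves an empty tail.
    if pages and pages[-1] == "":
        pages.pop()
    regex = re.compile(pattern)
    return ["%03d.pdf" % number
            for number, text in enumerate(pages, start=1)
            if regex.search(text)]
```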
