reading and counting words from pdf document

2023-02-22 08:42 问答作者：

i have been working on this text extraction project of various file extensions, but i am having the most pain with pdf and powerpoint,here is the code for pdf any one here know how to read text from existing pdf documents using any tool or library tcpdf , xpdf or fpdfi because i havent seen any exact solution for reading text from pdf or ppt,but please no zend solutions

function pdf2txt($filename){

    $data = getFileData($filename);

    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
    foreach($a_obj as $obj){

        $a_filter = getDataArray($obj,"<<",">>");
        if (is_array($a_filter)){
            $j++;
            $a_chunks[$j]["filter"] = $a_filter[0];

            $a_data = getDataArray($obj,"stream\r\n","endstream");
            if (is_array($a_data)){
                $a_chunks[$j]["data"] = substr($a_data[0],strlen("stream\r\n"),strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
            }
        }
    }

    // decode the chunks
    foreach($a_chunks as $chunk){

        // look at each chunk and decide how to decode it - by looking at the contents of the filter
        $a_filter = split("/",$chunk["filter"]);

        if ($chunk["data"]!=""){
            // look at the filter to find out which encoding has been used          
            if (substr($chunk["filter"],"FlateDecode")!==false){
                $data =@ gzuncompress($chunk["data"]);
                if (trim($data)!=""){
                    $result_data .= ps2txt($data);
                } else {

                    //$result_data .= "x";
                }
            }
        }
    }

    return $result_data;

}


// Function    : ps2txt()
// Arguments   : $ps_data - postscript data you want to convert to plain text
// Description : Does a very basic parse of postscript data to
//             :  return the plain text
// Author      : Jonathan Beckett, 2005-05-02
function ps2txt($ps_data){
    $result = "";
    $a_data = get开发者_如何学运维DataArray($ps_data,"[","]");
    if (is_array($a_data)){
        foreach ($a_data as $ps_text){
            $a_text = getDataArray($ps_text,"(",")");
            if (is_array($a_text)){
                foreach ($a_text as $text){
                    $result .= substr($text,1,strlen($text)-2);
                }
            }
        }
    } else {
        // the data may just be in raw format (outside of [] tags)
        $a_text = getDataArray($ps_data,"(",")");
        if (is_array($a_text)){
            foreach ($a_text as $text){
                $result .= substr($text,1,strlen($text)-2);
            }
        }
    }
    return $result;
}


// Function    : getFileData()
// Arguments   : $filename - filename you want to load
// Description : Reads data from a file into a variable
//               and passes that data back
// Author      : Jonathan Beckett, 2005-05-02
function getFileData($filename){
    $handle = fopen($filename,"rb");
    $data = fread($handle, filesize($filename));
    fclose($handle);
    return $data;
}


// Function    : getDataArray()
// Arguments   : $data       - data you want to chop up
//               $start_word - delimiting characters at start of each chunk
//               $end_word   - delimiting characters at end of each chunk
// Description : Loop through an array of data and put all chunks
//               between start_word and end_word in an array
// Author      : Jonathan Beckett, 2005-05-02
function getDataArray($data,$start_word,$end_word){

    $start = 0;
    $end = 0;
    unset($a_result);

    while ($start!==false && $end!==false){
        $start = strpos($data,$start_word,$end);
        if ($start!==false){
            $end = strpos($data,$end_word,$start);
            if ($end!==false){
                // data is between start and end
                $a_result[] = substr($data,$start,$end-$start+strlen($end_word));
            }
        }
    }
    return $a_result;
}
this one is for powerpoint i found here some where but that isnt working also
function parsePPT($filename) {
// This approach uses detection of the string "chr(0f).Hex_value.chr(0x00).chr(0x00).chr(0x00)" to find text strings, which are then terminated by another NUL chr(0x00). [1] Get text between delimiters [2] 
    $fileHandle = fopen($filename, "r");
    $line = @fread($fileHandle, filesize($filename));
    $lines = explode(chr(0x0f),$line);
    $outtext = '';

    foreach($lines as $thisline) {
        if (strpos($thisline, chr(0x00).chr(0x00).chr(0x00)) == 1) {
            $text_line = substr($thisline, 4);
            $end_pos   = strpos($text_line, chr(0x00));
            $text_line = substr($text_line, 0, $end_pos);
            $text_line = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$text_line);
            if(substr($text_line,0,20)!="Click to edit Master")
            if (strlen($text_line) > 1) {
                $outtext.= substr($text_line, 0, $end_pos)."\n<br>";
            }
        }
    }
return $outtext;
}

Why are you trying to reinvent the wheel? You could either resort to using ie. xpdf or a similar tool to extract the text data inside the PDF, and afterwards process the plain text file resulting from that operation. The same approach could be used for virtually any file format that contains text (ie. first convert to a plain text version, then process that)...

Indexing PDF Documents with Zend_Search_Lucene could be an interesting read if you opt for that solution.

继续阅读：pdf php powerpoint

reading and counting words from pdf document

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？