开发者

reading and counting words from pdf document

i have been working on this text extraction project of various file extensions, but i am having the most pain with pdf and powerpoint,here is the code for pdf any one here know how to read text from existing pdf documents using any tool or library tcpdf , xpdf or fpdfi because i havent seen any exact solution for reading text from pdf or ppt,but please no zend solutions

function pdf2txt($filename){

    $data = getFileData($filename);

    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
    foreach($a_obj as $obj){

        $a_filter = getDataArray($obj,"<<",">>");
        if (is_array($a_filter)){
            $j++;
            $a_chunks[$j]["filter"] = $a_filter[0];

            $a_data = getDataArray($obj,"stream\r\n","endstream");
            if (is_array($a_data)){
                $a_chunks[$j]["data"] = substr($a_data[0],strlen("stream\r\n"),strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
            }
        }
    }

    // decode the chunks
    foreach($a_chunks as $chunk){

        // look at each chunk and decide how to decode it - by looking at the contents of the filter
        $a_filter = split("/",$chunk["filter"]);

        if ($chunk["data"]!=""){
            // look at the filter to find out which encoding has been used          
            if (substr($chunk["filter"],"FlateDecode")!==false){
                $data =@ gzuncompress($chunk["data"]);
                if (trim($data)!=""){
                    $result_data .= ps2txt($data);
                } else {

                    //$result_data .= "x";
                }
            }
        }
    }

    return $result_data;

}


// Function    : ps2txt()
// Arguments   : $ps_data - postscript data you want to convert to plain text
// Description : Does a very basic parse of postscript data to
//             :  return the plain text
// Author      : Jonathan Beckett, 2005-05-02
function ps2txt($ps_data){
    $result = "";
    $a_data = get开发者_如何学运维DataArray($ps_data,"[","]");
    if (is_array($a_data)){
        foreach ($a_data as $ps_text){
            $a_text = getDataArray($ps_text,"(",")");
            if (is_array($a_text)){
                foreach ($a_text as $text){
                    $result .= substr($text,1,strlen($text)-2);
                }
            }
        }
    } else {
        // the data may just be in raw format (outside of [] tags)
        $a_text = getDataArray($ps_data,"(",")");
        if (is_array($a_text)){
            foreach ($a_text as $text){
                $result .= substr($text,1,strlen($text)-2);
            }
        }
    }
    return $result;
}


// Function    : getFileData()
// Arguments   : $filename - filename you want to load
// Description : Reads data from a file into a variable
//               and passes that data back
// Author      : Jonathan Beckett, 2005-05-02
function getFileData($filename){
    $handle = fopen($filename,"rb");
    $data = fread($handle, filesize($filename));
    fclose($handle);
    return $data;
}


// Function    : getDataArray()
// Arguments   : $data       - data you want to chop up
//               $start_word - delimiting characters at start of each chunk
//               $end_word   - delimiting characters at end of each chunk
// Description : Loop through an array of data and put all chunks
//               between start_word and end_word in an array
// Author      : Jonathan Beckett, 2005-05-02
function getDataArray($data,$start_word,$end_word){

    $start = 0;
    $end = 0;
    unset($a_result);

    while ($start!==false && $end!==false){
        $start = strpos($data,$start_word,$end);
        if ($start!==false){
            $end = strpos($data,$end_word,$start);
            if ($end!==false){
                // data is between start and end
                $a_result[] = substr($data,$start,$end-$start+strlen($end_word));
            }
        }
    }
    return $a_result;
}
this one is for powerpoint i found here some where but that isnt working also
function parsePPT($filename) {
// This approach uses detection of the string "chr(0f).Hex_value.chr(0x00).chr(0x00).chr(0x00)" to find text strings, which are then terminated by another NUL chr(0x00). [1] Get text between delimiters [2] 
    $fileHandle = fopen($filename, "r");
    $line = @fread($fileHandle, filesize($filename));
    $lines = explode(chr(0x0f),$line);
    $outtext = '';

    foreach($lines as $thisline) {
        if (strpos($thisline, chr(0x00).chr(0x00).chr(0x00)) == 1) {
            $text_line = substr($thisline, 4);
            $end_pos   = strpos($text_line, chr(0x00));
            $text_line = substr($text_line, 0, $end_pos);
            $text_line = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$text_line);
            if(substr($text_line,0,20)!="Click to edit Master")
            if (strlen($text_line) > 1) {
                $outtext.= substr($text_line, 0, $end_pos)."\n<br>";
            }
        }
    }
return $outtext;
}


Why are you trying to reinvent the wheel? You could either resort to using ie. xpdf or a similar tool to extract the text data inside the PDF, and afterwards process the plain text file resulting from that operation. The same approach could be used for virtually any file format that contains text (ie. first convert to a plain text version, then process that)...

Indexing PDF Documents with Zend_Search_Lucene could be an interesting read if you opt for that solution.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜