Regex issue while parsing a .pdf file using CAM::PDF

2023-03-06 18:25

Unmatched [ in regex; marked by <-- HERE in m/ <-- HERE / at ./pdf_parse.pl line 37.

I'm parsing a .pdf file word by word (in order to build a dictionary from it). Line 37 is:

if(grep(!/$word/,@line_rd)){
}

The word at which the parser script stops working is in a different font (inside the PDF I'm parsing). Is that the culprit here?

Does CAM::PDF recognize words in different fonts? What should I do to prevent this error?

You need to quote $word in the regular expression if it can contain regex metacharacters (such as [ or even .). Wrapping it in \Q...\E escapes them, so the word is matched literally. Try:

if (grep(!/\Q$word\E/, @line_rd)) {
  ...
}
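To see the difference, here is a minimal sketch that reproduces the error and then applies the fix. The sample word and lines are made up for illustration and only stand in for whatever the PDF parser actually returns. It uses a positive match rather than the negated one from the question, purely to keep the output easy to read.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for text extracted from the PDF.
my $word    = '[note';
my @line_rd = ('see [note 1]', 'plain text');

# Unquoted: the "[" opens a character class that is never closed,
# so compiling the pattern dies at run time. eval traps the error.
my $ok = eval { my @hits = grep { /$word/ } @line_rd; 1 };
print "unquoted pattern failed: $@" unless $ok;

# Quoted: \Q...\E (the same escaping quotemeta() performs) makes
# every metacharacter in $word match literally.
my @matches = grep { /\Q$word\E/ } @line_rd;
print "matched: $_\n" for @matches;
```

The same escaping also protects against quieter problems, such as a "." in a word matching any character instead of a literal dot.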

If you want to make a dictionary of all the words, use a hash:

my %allwords;
...
  # each time you have a new word incoming from the parser:
  $allwords{$word}++;

At the end, the %allwords hash will contain the distinct words as keys, and the word count as values. You could e.g. print it using:

for my $w (sort keys %allwords) {
  print "Word $w: count: $allwords{$w}\n";
}
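Putting the counting and printing together, a self-contained sketch might look like the following. The @lines array is hypothetical input, and splitting on whitespace is an assumption about how words are separated; the real script would feed in whatever the PDF parser yields.

```perl
use strict;
use warnings;

# Hypothetical lines of text standing in for the parser's output.
my @lines = ('the cat [sat] on the mat', 'the end.');

my %allwords;
for my $line (@lines) {
    # Assumption: words are separated by whitespace.
    for my $word (split ' ', $line) {
        $allwords{$word}++;    # hash keys are plain strings, so "[" is harmless here
    }
}

# Print each distinct word with its count, in sorted order.
for my $word (sort keys %allwords) {
    print "Word $word: count: $allwords{$word}\n";
}
```

Because the words are used as hash keys rather than interpolated into a pattern, no regex quoting is needed for the dictionary itself.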
