Extracting "((Adj|Noun)+|((Adj|Noun)(Noun-Prep)?)(Adj|Noun))Noun" from Text (Justeson & Katz, 1995)
Is it possible to extract ((Adj|Noun)+|((Adj|Noun)(Noun-Prep)?)(Adj|Noun))Noun, as proposed by Justeson and Katz (1995), using the R package openNLP?
That is, I would like to use this linguistic filtering to extract candidate noun phrases.
I do not fully understand what the pattern means.
Could you explain it, or show how to code the filtering rule in R?
Many thanks.
Maybe we can start the sample code from:
library("openNLP")
acq <- "This paper describes a novel optical thread plug
gauge (OTPG) for internal thread inspection using machine
vision. The OTPG is composed of a rigid industrial
endoscope, a charge-coupled device camera, and a two
degree-of-freedom motion control unit. A sequence of
partial wall images of an internal thread are retrieved and
reconstructed into a 2D unwrapped image. Then, a digital
image processing and classification procedure is used to
normalize, segment, and determine the quality of the
internal thread."
acqTag <- tagPOS(acq)
acqTagSplit = strsplit(acqTag," ")
I was told to open a new question for this. The original question is here.
Install the packages with:
install.packages("openNLP")
install.packages("openNLPmodels.en")
After that, you can run the code above. It POS-tags every word in the text and returns the original text with each word labeled as a noun, verb, etc. In this example, the result is as follows:
acqTagSplit = strsplit(acqTag," ")
> acqTag
[1] "This/DT paper/NN describes/VBZ a/DT novel/NN optical/JJ thread/NN plug/NN gauge/NN (OTPG)/NN for/IN internal/JJ thread/NN inspection/NN using/VBG machine/NN vision./NN The/DT OTPG/NNP is/VBZ composed/VBN of/IN a/DT rigid/JJ industrial/JJ endoscope,/NNS a/DT charge-coupled/JJ device/NN camera,/VBD and/CC a/DT two/CD degree-of-freedom/NN motion/NN control/NN unit./NN A/DT sequence/NN of/IN partial/JJ wall/NN images/NNS of/IN an/DT internal/JJ thread/NN are/VBP retrieved/VBN and/CC reconstructed/VBN into/IN a/DT 2D/JJ unwrapped/JJ image./NN Then,/IN a/DT digital/JJ image/NN processing/NN and/CC classification/NN procedure/NN is/VBZ used/VBN to/TO normalize,/JJ segment,/NN and/CC determine/VB the/DT quality/NN of/IN the/DT internal/JJ thread./NN"
Each word is followed by its POS tag, separated by a slash. To separate the tags from the words, you can first split the text into tokens, as you did in your example:
acqTagSplit = strsplit(acqTag," ")
acqTagSplit
[[1]]
[1] "This/DT" "paper/NN" "describes/VBZ"
[4] "a/DT" "novel/NN" "optical/JJ"
[7] "thread/NN" "plug/NN" "gauge/NN"
[10] "(OTPG)/NN" "for/IN" "internal/JJ"
[13] "thread/NN" "inspection/NN" "using/VBG"
[16] "machine/NN" "vision./NN" "The/DT"
[19] "OTPG/NNP" "is/VBZ" "composed/VBN"
[22] "of/IN" "a/DT" "rigid/JJ"
[25] "industrial/JJ" "endoscope,/NNS" "a/DT"
[28] "charge-coupled/JJ" "device/NN" "camera,/VBD"
[31] "and/CC" "a/DT" "two/CD"
[34] "degree-of-freedom/NN" "motion/NN" "control/NN"
[37] "unit./NN" "A/DT" "sequence/NN"
[40] "of/IN" "partial/JJ" "wall/NN"
[43] "images/NNS" "of/IN" "an/DT"
[46] "internal/JJ" "thread/NN" "are/VBP"
[49] "retrieved/VBN" "and/CC" "reconstructed/VBN"
[52] "into/IN" "a/DT" "2D/JJ"
[55] "unwrapped/JJ" "image./NN" "Then,/IN"
[58] "a/DT" "digital/JJ" "image/NN"
[61] "processing/NN" "and/CC" "classification/NN"
[64] "procedure/NN" "is/VBZ" "used/VBN"
[67] "to/TO" "normalize,/JJ" "segment,/NN"
[70] "and/CC" "determine/VB" "the/DT"
[73] "quality/NN" "of/IN" "the/DT"
[76] "internal/JJ" "thread./NN"
Then split each token into the word and its POS tag:
strsplit(acqTagSplit[[1]], "/")
You get a list with one element per token; each element holds the word first and then its tag. See:
str(strsplit(acqTagSplit[[1]], "/"))
List of 77
$ : chr [1:2] "This" "DT"
$ : chr [1:2] "paper" "NN"
$ : chr [1:2] "describes" "VBZ"
$ : chr [1:2] "a" "DT"
$ : chr [1:2] "novel" "NN"
$ : chr [1:2] "optical" "JJ"
$ : chr [1:2] "thread" "NN"
$ : chr [1:2] "plug" "NN"
$ : chr [1:2] "gauge" "NN"
$ : chr [1:2] "(OTPG)" "NN"
$ : chr [1:2] "for" "IN"
$ : chr [1:2] "internal" "JJ"
$ : chr [1:2] "thread" "NN"
$ : chr [1:2] "inspection" "NN"
$ : chr [1:2] "using" "VBG"
$ : chr [1:2] "machine" "NN"
$ : chr [1:2] "vision." "NN"
$ : chr [1:2] "The" "DT"
$ : chr [1:2] "OTPG" "NNP"
$ : chr [1:2] "is" "VBZ"
$ : chr [1:2] "composed" "VBN"
$ : chr [1:2] "of" "IN"
$ : chr [1:2] "a" "DT"
$ : chr [1:2] "rigid" "JJ"
$ : chr [1:2] "industrial" "JJ"
$ : chr [1:2] "endoscope," "NNS"
$ : chr [1:2] "a" "DT"
$ : chr [1:2] "charge-coupled" "JJ"
$ : chr [1:2] "device" "NN"
$ : chr [1:2] "camera," "VBD"
$ : chr [1:2] "and" "CC"
$ : chr [1:2] "a" "DT"
$ : chr [1:2] "two" "CD"
$ : chr [1:2] "degree-of-freedom" "NN"
$ : chr [1:2] "motion" "NN"
$ : chr [1:2] "control" "NN"
$ : chr [1:2] "unit." "NN"
$ : chr [1:2] "A" "DT"
$ : chr [1:2] "sequence" "NN"
$ : chr [1:2] "of" "IN"
$ : chr [1:2] "partial" "JJ"
$ : chr [1:2] "wall" "NN"
$ : chr [1:2] "images" "NNS"
$ : chr [1:2] "of" "IN"
$ : chr [1:2] "an" "DT"
$ : chr [1:2] "internal" "JJ"
$ : chr [1:2] "thread" "NN"
$ : chr [1:2] "are" "VBP"
$ : chr [1:2] "retrieved" "VBN"
$ : chr [1:2] "and" "CC"
$ : chr [1:2] "reconstructed" "VBN"
$ : chr [1:2] "into" "IN"
$ : chr [1:2] "a" "DT"
$ : chr [1:2] "2D" "JJ"
$ : chr [1:2] "unwrapped" "JJ"
$ : chr [1:2] "image." "NN"
$ : chr [1:2] "Then," "IN"
$ : chr [1:2] "a" "DT"
$ : chr [1:2] "digital" "JJ"
$ : chr [1:2] "image" "NN"
$ : chr [1:2] "processing" "NN"
$ : chr [1:2] "and" "CC"
$ : chr [1:2] "classification" "NN"
$ : chr [1:2] "procedure" "NN"
$ : chr [1:2] "is" "VBZ"
$ : chr [1:2] "used" "VBN"
$ : chr [1:2] "to" "TO"
$ : chr [1:2] "normalize," "JJ"
$ : chr [1:2] "segment," "NN"
$ : chr [1:2] "and" "CC"
$ : chr [1:2] "determine" "VB"
$ : chr [1:2] "the" "DT"
$ : chr [1:2] "quality" "NN"
$ : chr [1:2] "of" "IN"
$ : chr [1:2] "the" "DT"
$ : chr [1:2] "internal" "JJ"
$ : chr [1:2] "thread." "NN"
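For later filtering it is usually easier to keep the words and the tags in two parallel vectors than in a list of pairs. Below is a minimal sketch in base R. The `tokens` vector is a hand-copied excerpt of the output above, plus one made-up token ("and/or/CC") that is not in the original text; it is only there to show why cutting at the last slash is safer than splitting on every slash.

```r
# A few tokens in the word/TAG format shown above.
# "and/or/CC" is a hypothetical token, added only to illustrate
# a word that itself contains a slash.
tokens <- c("This/DT", "paper/NN", "describes/VBZ", "and/or/CC")

# Everything before the last slash is the word ...
words <- sub("/[^/]*$", "", tokens)
# ... and everything after the last slash is the tag.
tags <- sub("^.*/", "", tokens)

# One row per token, in text order.
tagged <- data.frame(word = words, tag = tags, stringsAsFactors = FALSE)
print(tagged)
```

With `strsplit(tokens, "/")` the last token would come back in three pieces instead of two, so the regular-expression version is a bit more robust.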
It seems like you need to understand the regular expression ((Adj|Noun)+|((Adj|Noun)(Noun-Prep)?)(Adj|Noun))Noun, convert it to a DFA (deterministic finite automaton), and follow the DFA in R.
Here you have a description of a regular language through a regular expression. Unlike the common usage of regular expressions in text processing, the "symbols" are not simple characters, but adjectives, nouns and noun prepositions. Once you understand the theory (automata theory), you will be able to easily implement the DFA in R (or whatever programming language you choose).
The problem is not R; the problem is that you don't understand the theory.
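One way to try the idea above without building the automaton by hand is to encode every token as a single character and let R's own regex engine do the matching. The sketch below is only an illustration under several assumptions of mine: tags starting with JJ count as Adj ("A"), tags starting with NN count as Noun ("N"), the tag IN counts as a preposition ("P"), "Noun-Prep" is read as a noun followed by a preposition ("NP"), and every other tag becomes "x". The function name `extract_candidates` is made up. The pattern is copied from the question as written; as far as I recall, the form usually quoted from the paper has `*` after the (Adj|Noun) groups, so swap in that pattern if it is what you need. The input is a short excerpt of the tagged output above, with the parenthesised abbreviation and the trailing punctuation removed by hand.

```r
# Hypothetical helper: find token sequences whose tags match a pattern
# written over the one-letter alphabet A (adjective), N (noun), P (preposition).
extract_candidates <- function(tagged, pattern) {
  tokens <- strsplit(tagged, " ", fixed = TRUE)[[1]]
  words  <- sub("/[^/]*$", "", tokens)   # text before the last slash
  tags   <- sub("^.*/", "", tokens)      # text after the last slash

  # Assumed mapping from Penn Treebank tags to one symbol per token.
  sym <- rep("x", length(tags))
  sym[grepl("^JJ", tags)] <- "A"
  sym[grepl("^NN", tags)] <- "N"
  sym[tags == "IN"]       <- "P"
  symString <- paste(sym, collapse = "")

  # One character per token, so a match position is also a token index.
  m      <- gregexpr(pattern, symString, perl = TRUE)[[1]]
  starts <- as.integer(m)
  lens   <- attr(m, "match.length")
  if (starts[1] == -1) return(character(0))

  vapply(seq_along(starts), function(i) {
    idx <- seq(starts[i], length.out = lens[i])
    paste(words[idx], collapse = " ")
  }, character(1))
}

tagged <- paste("a/DT novel/NN optical/JJ thread/NN plug/NN gauge/NN",
                "for/IN internal/JJ thread/NN inspection/NN",
                "using/VBG machine/NN vision/NN")

# The filter exactly as written in the question, in the A/N/P alphabet.
pattern <- "((A|N)+|((A|N)(NP)?)(A|N))N"

candidates <- extract_candidates(tagged, pattern)
print(candidates)
# [1] "novel optical thread plug gauge" "internal thread inspection"
# [3] "machine vision"
```

Note that `gregexpr` returns non-overlapping matches scanned from left to right, so a shorter candidate nested inside a longer one is not reported separately. "novel" shows up in the first candidate only because the tagger labeled it NN in the output above.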