Perl: How can I split these texts to extract the required info? [closed]
EDITED/Shortened VERSION
I have two texts, which come from two files that I have to loop through (you can ignore my variables). Here is a sample from each:
Tagged:
5.4_CD Passive_NNP Processes_NNP of_IN Membrane_NNP Transport_NNP 85_CD We_PRP have_VBP examined_VBN membrane_NN structure_NN and_CC how_WRB it_PRP is_VBZ used_VBN to_TO perform_VB one_CD membrane_NN function_NN :_: the_DT binding_JJ of_IN one_CD cell_NN to_TO another_DT ._.
Desired output:
5.4 Passive Processes of Membrane Transport 85 We have examined membrane stru....
Parsed:
Parsing [sent. 1 len. 31]:
nsubj(85-7, Processes-3)
nn(Transport-6, Membrane-5)
prep_of(Processes-3, Transport-6)
nsubj(examined-10, We-8)
nsubjpass(used-17, it-15)
xsubj(perform-19, it-15)
conj_and(examined-10, used-17)
xcomp(used-17, perform-19)
dobj(perform-19, function-22)
prep_of(binding-25, cell-28) <- refer to this for examples below
Desired output:
- the sent. number (i.e. sent. 1)
- the grammar function (i.e. prep_of)
- the first dependency word (i.e. binding)
- the second dependency word (i.e. cell)
QUESTION
How can I split/substitute these to get my desired output, so that each extracted word keeps a word boundary at its beginning and end (=~ \bword\b should apply)?
THANKS a lot for taking your time to read this! Any advice is appreciated!
Well, I have difficulty understanding even your revised question. I skipped your earlier questions because I could not tell what you wanted, so I thought I would show a better way to explain the problem. You would be well advised to skip the background material and just break the problem down into:
@subsentences = ("5.4_CD Passive_NNP Processes_NNP","85_CD We_PRP have_VBP examined_VBN membrane_NN");
foreach my $sub (@subsentences) {
@final = split(/_\S+/,$sub);
print join(",",@final)."\n";
}
Expected output: ("5.4", "Passive", "Processes") and ("85", "We", "have", "examined").
The sad thing is, I cannot even tell if my guess about what you might mean in this ONE example is correct (might you have meant @subsentence = qw(5.4_CD Passive_NNP Processes_NNP) instead? or something else?). Repeat for each example. Assuming I guessed correctly, the regex you want in this example is:
@finalsentence = split(/_\S+(?:\s+|$)/,$subsentences[$j]);
Or the equally valid(?)
@finalsentence = grep(s/_\S+//||1,split(/\s+/,$subsentences[$j]));
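To compare the two expressions above on the same input, here is a minimal sketch. The helper names strip_by_split and strip_per_token are hypothetical (they are not from the question), and the second helper uses map on a copy of each token rather than the grep trick, so the input is left unmodified. It assumes that no word itself contains an underscore; a token such as foo_bar_NN would lose everything after the first underscore.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only.

# Split on each "_TAG" together with the whitespace (or end of string)
# that follows it, so the remaining fields are the bare words.
sub strip_by_split {
    my ($text) = @_;
    return split /_\S+(?:\s+|$)/, $text;
}

# Split into whitespace-separated tokens first, then remove the "_TAG"
# suffix from a copy of each token.
sub strip_per_token {
    my ($text) = @_;
    return map { (my $word = $_) =~ s/_\S+//; $word } split ' ', $text;
}

my $tagged = '5.4_CD Passive_NNP Processes_NNP';
print join(',', strip_by_split($tagged)),  "\n";   # 5.4,Passive,Processes
print join(',', strip_per_token($tagged)), "\n";   # 5.4,Passive,Processes
```

Both helpers return a list of bare words, which can then be joined with spaces to rebuild the untagged sentence.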
I think we have discovered that the question he actually wanted to ask was:
@subs = qw(5.4_CD Passive_NNP Processes_NNP);
Expected output: qw(5.4 Passive Processes)
If my revised understanding is correct, the following will do what you want:
@subs = qw(5.4_CD Passive_NNP Processes_NNP);
@final = @subs;
s/_\S+// for @final;   # strip the _TAG suffix from each element in place
print join(",",@final)."\n";
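The code above only covers the tagged text. For the parsed dependency lines in the question, a similar approach might work; the following is a rough sketch, assuming every line looks exactly like the samples shown (a "Parsing [sent. N len. M]:" header followed by relation(word-index, word-index) lines). The helper name parse_dependencies is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one array ref per dependency line, holding
# [sentence number, grammar function, first word, second word].
sub parse_dependencies {
    my (@lines) = @_;
    my ($sent, @rows);
    for my $line (@lines) {
        if ($line =~ /^Parsing \[sent\. (\d+) len\. \d+\]:/) {
            $sent = $1;    # the following lines belong to this sentence
        }
        elsif ($line =~ /^(\w+)\((\S+)-\d+, (\S+)-\d+\)/) {
            # $1 = grammar function, $2 and $3 = words without their "-index"
            push @rows, [ $sent, $1, $2, $3 ];
        }
    }
    return @rows;
}

my @rows = parse_dependencies(
    'Parsing [sent. 1 len. 31]:',
    'prep_of(binding-25, cell-28)',
);
print join(',', @$_), "\n" for @rows;   # 1,prep_of,binding,cell
```

Because the numeric suffix is removed, each extracted word can later be matched in other text with something like /\b\Q$word\E\b/, which seems to be what the question means by keeping a word boundary.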