Stanford Core NLP - understanding coreference resolution
I'm having some trouble understanding the changes made to the coref resolver in the last version of the Stanford NLP tools. As an example, below is a sentence and the corresponding CorefChainAnnotation:
The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.
{1=[1 1, 1 2], 5=[1 3], 7=[1 4], 9=[1 5]}
I am not sure I understand the meaning of these numbers. Looking at the source doesn't really help either.
Thank you
I've been working with the coreference dependency graph, and I started from the other answer to this question. After a while, though, I realized that the algorithm in that answer is not quite correct: its output is not even close to what my modified version produces.
For anyone else who lands here, this is the algorithm I ended up with. It also filters out self references, because every representative mention also mentions itself and many mentions refer only to themselves.
Map<Integer, CorefChain> coref = document.get(CorefChainAnnotation.class);

for (Map.Entry<Integer, CorefChain> entry : coref.entrySet()) {
    CorefChain c = entry.getValue();

    // Skip singleton chains: they only contain a self reference, which isn't that useful.
    if (c.getCorefMentions().size() <= 1)
        continue;

    CorefMention cm = c.getRepresentativeMention();
    String clust = "";
    // sentNum, startIndex and endIndex are 1-based; endIndex is exclusive.
    List<CoreLabel> tks = document.get(SentencesAnnotation.class).get(cm.sentNum-1).get(TokensAnnotation.class);
    for (int i = cm.startIndex-1; i < cm.endIndex-1; i++)
        clust += tks.get(i).get(TextAnnotation.class) + " ";
    clust = clust.trim();
    System.out.println("representative mention: \"" + clust + "\" is mentioned by:");

    for (CorefMention m : c.getCorefMentions()) {
        String clust2 = "";
        tks = document.get(SentencesAnnotation.class).get(m.sentNum-1).get(TokensAnnotation.class);
        for (int i = m.startIndex-1; i < m.endIndex-1; i++)
            clust2 += tks.get(i).get(TextAnnotation.class) + " ";
        clust2 = clust2.trim();
        // Don't need the self mention.
        if (clust.equals(clust2))
            continue;

        System.out.println("\t" + clust2);
    }
}
And the final output for your example sentence is the following:
representative mention: "a basic unit of matter" is mentioned by:
    The atom
    it
Usually "The atom" ends up being the representative mention, but surprisingly in this case it doesn't. Here is another example with a slightly more accurate result. The sentence
The Revolutionary War occurred during the 1700s and it was the first war in the United States.
produces the following output:
representative mention: "The Revolutionary War" is mentioned by:
    it
    the first war in the United States
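One caveat about the listing above: it drops the self mention by comparing text, so a separate mention with the same wording as the representative one would be dropped as well. Below is a minimal sketch of a variant that compares positions instead. It does not use CoreNLP: the Mention class is a hypothetical stand-in for CorefMention that keeps only the sentNum, startIndex and endIndex fields used above, and the token indexes for the Revolutionary War sentence were counted by hand for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SelfMentionFilterSketch {
    // Hypothetical stand-in for CorefMention: 1-based sentence and token indexes, endIndex exclusive.
    static final class Mention {
        final int sentNum, startIndex, endIndex;
        final String text;

        Mention(int sentNum, int startIndex, int endIndex, String text) {
            this.sentNum = sentNum;
            this.startIndex = startIndex;
            this.endIndex = endIndex;
            this.text = text;
        }

        boolean samePosition(Mention other) {
            return sentNum == other.sentNum
                    && startIndex == other.startIndex
                    && endIndex == other.endIndex;
        }
    }

    // Returns the texts of all mentions except the representative one itself.
    static List<String> otherMentions(Mention representative, List<Mention> chain) {
        List<String> result = new ArrayList<>();
        if (chain.size() <= 1) {
            return result; // singleton chain: nothing but a self reference
        }
        for (Mention m : chain) {
            if (!m.samePosition(representative)) {
                result.add(m.text);
            }
        }
        return result;
    }

    // Hand-built chain for the Revolutionary War sentence (illustrative indexes).
    static List<Mention> sampleChain() {
        return Arrays.asList(
                new Mention(1, 1, 4, "The Revolutionary War"),
                new Mention(1, 9, 10, "it"),
                new Mention(1, 11, 18, "the first war in the United States"));
    }

    public static void main(String[] args) {
        List<Mention> chain = sampleChain();
        Mention representative = chain.get(0);
        System.out.println("representative mention: \"" + representative.text + "\" is mentioned by:");
        for (String text : otherMentions(representative, chain)) {
            System.out.println("\t" + text);
        }
    }
}
```

With real annotations the same comparison could be made directly on the CorefMention fields instead of on the assembled strings.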
The first number is a cluster id (a cluster groups the tokens that stand for the same entity); see the source code of SieveCoreferenceSystem#coref(Document). The pairs of numbers are the output of CorefChain#toString():
public String toString() {
    return position.toString();
}
where position is a set of position pairs, one for each mention of the entity (to get the mentions themselves, use CorefChain.getCorefMentions()). Here is a complete example (in Groovy) that attempts to map the positions back to text:
class Example {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        props.put("dcoref.score", true);
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        String aText = "The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.";
        Annotation document = new Annotation(aText);
        pipeline.annotate(document);
        Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
        println aText
        for (Map.Entry<Integer, CorefChain> entry : graph) {
            CorefChain c = entry.getValue();
            println "ClusterId: " + entry.getKey();
            CorefMention cm = c.getRepresentativeMention();
            // Caution: startIndex and endIndex are 1-based token indexes within the sentence,
            // not character offsets, so subSequence on the raw text picks the wrong characters.
            println "Representative Mention: " + aText.subSequence(cm.startIndex, cm.endIndex);
            List<CorefMention> cms = c.getCorefMentions();
            println "Mentions: ";
            cms.each { it ->
                print aText.subSequence(it.startIndex, it.endIndex) + "|";
            }
        }
    }
}
Output (I do not understand where 's' comes from):
The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.
ClusterId: 1
Representative Mention: he
Mentions: he|atom |s|
ClusterId: 6
Representative Mention: basic unit
Mentions: basic unit |
ClusterId: 8
Representative Mention: unit
Mentions: unit |
ClusterId: 10
Representative Mention: it
Mentions: it |
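The stray 's' (and the "he" and "atom " fragments) can be reproduced without CoreNLP: startIndex and endIndex count tokens in the sentence (1-based, end exclusive), but subSequence reads them as character offsets. The sketch below is plain Java and makes two assumptions that are not in the original code: the token list is written out by hand rather than produced by the tokenizer, and the spans (1, 3), (4, 9) and (10, 11) are inferred from the output above.

```java
import java.util.Arrays;
import java.util.List;

class TokenIndexDemo {
    static final String TEXT = "The atom is a basic unit of matter, it consists of a dense "
            + "central nucleus surrounded by a cloud of negatively charged electrons.";

    // Hand-written tokens for the start of the sentence (assumed, not tokenizer output).
    static final List<String> TOKENS = Arrays.asList(
            "The", "atom", "is", "a", "basic", "unit", "of", "matter", ",", "it");

    // What the Groovy example does: the indexes are used as character offsets.
    static String byCharOffsets(String text, int startIndex, int endIndex) {
        return text.subSequence(startIndex, endIndex).toString();
    }

    // What the indexes mean: 1-based token positions, endIndex exclusive.
    static String byTokenIndexes(List<String> tokens, int startIndex, int endIndex) {
        return String.join(" ", tokens.subList(startIndex - 1, endIndex - 1));
    }

    public static void main(String[] args) {
        int[][] spans = { {1, 3}, {4, 9}, {10, 11} };
        for (int[] span : spans) {
            System.out.println("(" + span[0] + ", " + span[1] + ")"
                    + " as characters: \"" + byCharOffsets(TEXT, span[0], span[1]) + "\""
                    + " as tokens: \"" + byTokenIndexes(TOKENS, span[0], span[1]) + "\"");
        }
    }
}
```

Read as characters, the three spans give "he", "atom " and "s", exactly the fragments printed for cluster 1. Read as token positions, they give "The atom", "a basic unit of matter" and "it". The 's' is therefore the eleventh character of the text (the "s" of "is"), picked because "it" is the tenth token.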
These are the results from a recent version of the annotator:
- [1, 1] 1 The atom
- [1, 2] 1 a basic unit of matter
- [1, 3] 1 it
- [1, 6] 6 negatively charged electrons
- [1, 5] 5 a cloud of negatively charged electrons
The fields are as follows:
[Sentence number, 'id'] Cluster_no Text_Associated
Mentions that belong to the same cluster refer to the same entity.
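To make the grouping concrete, here is a small sketch in plain Java that collects the five rows listed above by cluster number. The Row class and its field names are hypothetical, introduced only for illustration. In particular, the listing does not say what 'id' identifies, so the sketch stores it without interpreting it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ClusterGroupingSketch {
    // Hypothetical holder for one row: [sentence, id] cluster text
    static final class Row {
        final int sentence, id, cluster;
        final String text;

        Row(int sentence, int id, int cluster, String text) {
            this.sentence = sentence;
            this.id = id;
            this.cluster = cluster;
            this.text = text;
        }
    }

    // The five rows from the listing above, in the same order.
    static List<Row> sampleRows() {
        return Arrays.asList(
                new Row(1, 1, 1, "The atom"),
                new Row(1, 2, 1, "a basic unit of matter"),
                new Row(1, 3, 1, "it"),
                new Row(1, 6, 6, "negatively charged electrons"),
                new Row(1, 5, 5, "a cloud of negatively charged electrons"));
    }

    // Groups mention texts by cluster number; TreeMap keeps clusters in ascending order.
    static Map<Integer, List<String>> groupByCluster(List<Row> rows) {
        Map<Integer, List<String>> clusters = new TreeMap<>();
        for (Row row : rows) {
            clusters.computeIfAbsent(row.cluster, k -> new ArrayList<>()).add(row.text);
        }
        return clusters;
    }

    public static void main(String[] args) {
        for (Map.Entry<Integer, List<String>> e : groupByCluster(sampleRows()).entrySet()) {
            System.out.println("cluster " + e.getKey() + ": " + e.getValue());
        }
    }
}
```

Cluster 1 ends up with three mentions ("The atom", "a basic unit of matter", "it"), while clusters 5 and 6 each hold a single mention, that is, a phrase with no other coreferent in the sentence.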