Using the Stanford CoreNLP API

1. Generating annotations

The backbone of the CoreNLP package is formed by two classes:

  1. Annotation
  2. Annotator

An Annotation is the data structure that holds the results of annotators. Annotations are basically maps, from keys to bits of annotation, such as the parse, the part-of-speech tags, or named entity tags.

Annotators are more like functions, except that they operate over Annotations rather than plain Objects; an Annotator can tokenize, parse, or NER-tag sentences. Annotators and Annotations are integrated by AnnotationPipelines, which create sequences of generic Annotators. StanfordCoreNLP extends AnnotationPipeline and customizes it with the NLP Annotators.
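
To make the relationship concrete, here is a minimal sketch of assembling Annotators into an AnnotationPipeline by hand, without going through StanfordCoreNLP (constructor arguments vary somewhat across CoreNLP versions, so treat the exact signatures as assumptions):

import edu.stanford.nlp.pipeline.*;

// build a pipeline by hand: tokenizer -> sentence splitter -> POS tagger
AnnotationPipeline pipeline = new AnnotationPipeline();
pipeline.addAnnotator(new TokenizerAnnotator(false, "en"));  // tokenize
pipeline.addAnnotator(new WordsToSentencesAnnotator(false)); // ssplit
pipeline.addAnnotator(new POSTaggerAnnotator(false));        // pos

// the Annotation starts out holding only the raw text;
// each Annotator adds its own results to it
Annotation annotation = new Annotation("Annotators operate over Annotations.");
pipeline.annotate(annotation);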

The currently supported Annotators and the Annotations they generate are summarized on the Annotators page; a few examples:

  1. tokenize (TokenizerAnnotator class): tokenizes the text, splitting it into roughly word-like units;
  2. pos (POSTaggerAnnotator class): labels each token with its part-of-speech tag;
  3. parse (ParserAnnotator class): provides full syntactic analysis, using both constituent and dependency representations. The constituent-based output is stored in TreeAnnotation.

A Stanford CoreNLP object is created with StanfordCoreNLP(Properties props). This constructor builds a pipeline from the annotators listed in the "annotators" property.

import edu.stanford.nlp.pipeline.*;
import java.util.*;

public class BasicPipelineExample {
  public static void main(String[] args) {
    // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // read some text in the text variable
    String text = "...";

    // create an empty Annotation just with the given text
    Annotation document = new Annotation(text);

    // run all Annotators on this text
    pipeline.annotate(document);
  }
}

You can pass more configuration to CoreNLP by creating a Properties object with additional entries. A few properties are overall properties, such as "annotators", but most apply to a single annotator and are written as annotator.property. Note that the value of a property is always a String. In the documentation of individual annotators, types are given as "boolean", "file, classpath, or URL" or "List<String>"; this means that the String value will be parsed as a value of that type. Since values in a Properties object must be Strings, a convenient way to set multiple properties at once is PropertiesUtils.asProperties(String ...), as shown below:

// build pipeline
StanfordCoreNLP pipeline = new StanfordCoreNLP(
    PropertiesUtils.asProperties(
        "annotators", "tokenize,ssplit,pos,lemma,parse,natlog",
        "ssplit.isOneSentence", "true",
        "parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz",
        "tokenize.language", "en"));

// read some text in the text variable
String text = ... // Add your text here!
Annotation document = new Annotation(text);

// run all Annotators on this text
pipeline.annotate(document);

2. Interpreting the output

The output of the Annotators is accessed through the CoreMap and CoreLabel data structures.

// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

for (CoreMap sentence : sentences) {
  // traversing the words in the current sentence
  // a CoreLabel is a CoreMap with additional token-specific methods
  for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
    // this is the text of the token
    String word = token.get(TextAnnotation.class);
    // this is the POS tag of the token
    String pos = token.get(PartOfSpeechAnnotation.class);
    // this is the NER label of the token
    String ne = token.get(NamedEntityTagAnnotation.class);
  }

  // this is the parse tree of the current sentence
  Tree tree = sentence.get(TreeAnnotation.class);

  // this is the Stanford dependency graph of the current sentence
  SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
}

// This is the coreference link graph
// Each chain stores a set of mentions that link to each other,
// along with a method for getting the most representative mention
// Both sentence and token offsets start at 1!
Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);

Printing the word, POS tag, and NER label of each token, followed by the parse tree and the dependency graph, gives output like the following:

this is a simple text

this DT O
is VBZ O
a DT O
simple JJ O
text NN O

Parse tree:
(ROOT (S (NP (DT this)) (VP (VBZ is) (NP (DT a) (JJ simple) (NN text)))))

Dependency parse:
-> text/NN (root)
  -> this/DT (nsubj)
  -> is/VBZ (cop)
  -> a/DT (det)
  -> simple/JJ (amod)

The difference between constituent-based and dependency-based representations deserves its own discussion and will be covered separately.

3. Using the Chinese models

Compared to English, processing Chinese text is slightly more involved: the pipeline needs to be given a configuration file.

First you need stanford-corenlp-3.8.0-models-chinese.jar, which can be declared in Maven:

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.8.0</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.8.0</version>
    <classifier>models</classifier>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.8.0</version>
    <classifier>models-chinese</classifier>
</dependency>

The Chinese models package ships with a default configuration file, StanfordCoreNLP-chinese.properties, inside stanford-corenlp-3.8.0-models-chinese.jar.
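
As an illustration, it contains settings along these lines (an abridged sketch; the exact entries and model paths should be verified against the file inside the 3.8.0 jar):

annotators = tokenize, ssplit, pos, lemma, ner, parse, coref

# Chinese is segmented into words rather than tokenized on whitespace
tokenize.language = zh
segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese

# Chinese-specific models for tagging, NER, and parsing
pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger
ner.model = edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
parse.model = edu/stanford/nlp/models/srparser/chineseSR.ser.gz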

It mainly specifies the steps of the pipeline and the locations of the corresponding model files. In practice you may not need every step, or you may want to use different models, so you can write a custom configuration file and load it from code:

public void runAllAnnotators() {
  StanfordCoreNLP pipeline = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties");
  String text2 = "我愛北京天安門";
  Annotation document = new Annotation(text2);
  pipeline.annotate(document);
  parserOutput(document);
}
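
The parserOutput helper called above is not defined in the original snippet. A minimal sketch of what it might look like, printing each token's word, POS tag, and NER label followed by the parse tree and dependency graph (the method name and exact output formatting are assumptions), is:

// imports needed at the top of the file
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

public void parserOutput(Annotation document) {
  for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
    // one line per token: word, POS tag, NER label
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
      System.out.println(token.word() + " " + token.tag() + " " + token.ner());
    }
    // the constituency tree of the sentence
    Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
    System.out.println(tree);
    // the dependency graph of the sentence
    SemanticGraph deps =
        sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
    System.out.println(deps);
  }
}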

Running this produces the following output:

我愛 VV O
北京 NR GPE
天安門 NR FACILITY

Parse tree:
(ROOT (IP (VP (VV 我愛) (NP (NP (NR 北京)) (NP (NR 天安門))))))

Dependency parse:
-> 我愛/VV (root)
  -> 天安門/NR (dobj)
  -> 北京/NR (compound:nn)
