Using the Stanford CoreNLP API

1. Generating annotations

The backbone of the CoreNLP package is formed by two classes:

  1. Annotation
  2. Annotator

An Annotation is the data structure that holds the results of annotators. Annotations are basically maps, from keys to bits of annotation, such as the parse, the part-of-speech tags, or named entity tags.

Annotators are more like functions, except that they operate over Annotations rather than plain Objects; an Annotator can tokenize, parse, or NER-tag sentences. Annotators and Annotations are integrated by AnnotationPipelines, which create sequences of generic Annotators. StanfordCoreNLP extends AnnotationPipeline and customizes it with the NLP Annotators.
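
To make the relationship concrete, here is a minimal sketch of assembling Annotators into an AnnotationPipeline by hand, without going through StanfordCoreNLP (constructor arguments vary somewhat across CoreNLP versions, so treat the exact signatures as assumptions):

import edu.stanford.nlp.pipeline.*;

// build a pipeline by hand: tokenizer -> sentence splitter -> POS tagger
AnnotationPipeline pipeline = new AnnotationPipeline();
pipeline.addAnnotator(new TokenizerAnnotator(false, "en"));  // tokenize
pipeline.addAnnotator(new WordsToSentencesAnnotator(false)); // ssplit
pipeline.addAnnotator(new POSTaggerAnnotator(false));        // pos

// the Annotation starts out holding only the raw text;
// each Annotator adds its own results to it
Annotation annotation = new Annotation("Annotators operate over Annotations.");
pipeline.annotate(annotation);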

The currently supported Annotators and the Annotations they generate are summarized on the Annotators page; a few examples:

  1. tokenize (TokenizerAnnotator class): tokenizes the text, splitting it into roughly word-like units;
  2. pos (POSTaggerAnnotator class): labels each token with its part-of-speech tag;
  3. parse (ParserAnnotator class): provides full syntactic analysis, using both constituent and dependency representations. The constituent-based output is stored in TreeAnnotation.

A Stanford CoreNLP object is created with StanfordCoreNLP(Properties props). This constructor builds a pipeline from the annotators listed in the "annotators" property.

import edu.stanford.nlp.pipeline.*;
import java.util.*;

public class BasicPipelineExample {
  public static void main(String[] args) {
    // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // read some text in the text variable
    String text = "...";

    // create an empty Annotation just with the given text
    Annotation document = new Annotation(text);

    // run all Annotators on this text
    pipeline.annotate(document);
  }
}

You can pass more configuration to CoreNLP by creating a Properties object with additional entries. A few properties are overall properties, such as "annotators", but most apply to a single annotator and are written as annotator.property. Note that the value of a property is always a String. In the documentation of individual annotators, types are given as "boolean", "file, classpath, or URL" or "List<String>"; this means that the String value will be parsed as a value of that type. Since values in a Properties object must be Strings, a convenient way to set multiple properties at once is PropertiesUtils.asProperties(String ...), as shown below:

// build pipeline
StanfordCoreNLP pipeline = new StanfordCoreNLP(
    PropertiesUtils.asProperties(
        "annotators", "tokenize,ssplit,pos,lemma,parse,natlog",
        "ssplit.isOneSentence", "true",
        "parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz",
        "tokenize.language", "en"));

// read some text in the text variable
String text = ... // Add your text here!
Annotation document = new Annotation(text);

// run all Annotators on this text
pipeline.annotate(document);

2. Interpreting the output

The output of the Annotators is accessed through the CoreMap and CoreLabel data structures.

// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

for (CoreMap sentence : sentences) {
  // traversing the words in the current sentence
  // a CoreLabel is a CoreMap with additional token-specific methods
  for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
    // this is the text of the token
    String word = token.get(TextAnnotation.class);
    // this is the POS tag of the token
    String pos = token.get(PartOfSpeechAnnotation.class);
    // this is the NER label of the token
    String ne = token.get(NamedEntityTagAnnotation.class);
  }

  // this is the parse tree of the current sentence
  Tree tree = sentence.get(TreeAnnotation.class);

  // this is the Stanford dependency graph of the current sentence
  SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
}

// This is the coreference link graph
// Each chain stores a set of mentions that link to each other,
// along with a method for getting the most representative mention
// Both sentence and token offsets start at 1!
Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);

Printing the word, POS tag, and NER label of each token, followed by the parse tree and the dependency graph, gives output like the following:

this is a simple text

this DT O
is VBZ O
a DT O
simple JJ O
text NN O

Parse tree:
(ROOT (S (NP (DT this)) (VP (VBZ is) (NP (DT a) (JJ simple) (NN text)))))

Dependency parse:
-> text/NN (root)
  -> this/DT (nsubj)
  -> is/VBZ (cop)
  -> a/DT (det)
  -> simple/JJ (amod)

The difference between constituent-based and dependency-based representations deserves its own discussion and will be covered separately.

3. Using the Chinese models

Compared to English, processing Chinese text is slightly more involved: the pipeline needs to be given a configuration file.

First you need stanford-corenlp-3.8.0-models-chinese.jar, which can be declared in Maven:

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.8.0</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.8.0</version>
    <classifier>models</classifier>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.8.0</version>
    <classifier>models-chinese</classifier>
</dependency>

The Chinese models package ships with a default configuration file, StanfordCoreNLP-chinese.properties, inside stanford-corenlp-3.8.0-models-chinese.jar.
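
As an illustration, it contains settings along these lines (an abridged sketch; the exact entries and model paths should be verified against the file inside the 3.8.0 jar):

annotators = tokenize, ssplit, pos, lemma, ner, parse, coref

# Chinese is segmented into words rather than tokenized on whitespace
tokenize.language = zh
segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese

# Chinese-specific models for tagging, NER, and parsing
pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger
ner.model = edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
parse.model = edu/stanford/nlp/models/srparser/chineseSR.ser.gz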

It mainly specifies the steps of the pipeline and the locations of the corresponding model files. In practice you may not need every step, or you may want to use different models, so you can write a custom configuration file and load it from code:

public void runAllAnnotators() {
  StanfordCoreNLP pipeline = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties");
  String text2 = "我愛北京天安門";
  Annotation document = new Annotation(text2);
  pipeline.annotate(document);
  parserOutput(document);
}
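
The parserOutput helper called above is not defined in the original snippet. A minimal sketch of what it might look like, printing each token's word, POS tag, and NER label followed by the parse tree and dependency graph (the method name and exact output formatting are assumptions), is:

// imports needed at the top of the file
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

public void parserOutput(Annotation document) {
  for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
    // one line per token: word, POS tag, NER label
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
      System.out.println(token.word() + " " + token.tag() + " " + token.ner());
    }
    // the constituency tree of the sentence
    Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
    System.out.println(tree);
    // the dependency graph of the sentence
    SemanticGraph deps =
        sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
    System.out.println(deps);
  }
}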

Running this produces the following output:

我愛 VV O
北京 NR GPE
天安門 NR FACILITY

Parse tree:
(ROOT (IP (VP (VV 我愛) (NP (NP (NR 北京)) (NP (NR 天安門))))))

Dependency parse:
-> 我愛/VV (root)
  -> 天安門/NR (dobj)
  -> 北京/NR (compound:nn)
