Tokenizer Example in Apache OpenNLP
In this tutorial, we shall look at a Tokenizer example in Apache OpenNLP, along with a brief explanation of the tokenization process.
What is tokenization?
Tokenization is the process of segmenting a string into smaller parts (sub-strings) called tokens. These tokens are usually words, punctuation marks, sequences of digits, and the like.
An example is shown in the following table.
Input to Tokenizer | John is 26 years old. |
Output of Tokenizer | [John, is, 26, years, old, .] |
Tokenization in OpenNLP
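The Tokenizer API in OpenNLP provides the following three ways of tokenization: the TokenizerME class loaded with a token model, WhitespaceTokenizer, and SimpleTokenizer.
Note: The OpenNLP version used is 1.7.2.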
Please observe the differences in the output from these three ways of tokenization in the examples provided below.
TokenizerME class Loaded with a Token Model
Step 1: Read the pretrained model into a stream.
InputStream modelIn = new FileInputStream("en-token.bin");
Step 2: Read the stream to a Tokenizer model.
TokenizerModel model = new TokenizerModel(modelIn);
Step 3: Initialize the tokenizer with the model.
TokenizerME tokenizer = new TokenizerME(model);
Step 4: Use the TokenizerME.tokenize() method to extract the tokens into a String array.
String tokens[] = tokenizer.tokenize("John is 26 years old.");
Step 5: Use TokenizerME.getTokenProbabilities() to get, for each token, the probability the model assigned to it. The values refer to the most recent call to tokenize().
double tokenProbs[] = tokenizer.getTokenProbabilities();
Step 6: Finally, print the results.
Putting everything together gives the following program:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
/**
 * www.tutorialkart.com
 * Tokenizer example in Apache OpenNLP using the TokenizerME class loaded with a pre-trained token model
 */
public class TokenizerModelExample {
    public static void main(String[] args) {
        // try-with-resources closes the model stream even if loading fails
        try (InputStream modelIn = new FileInputStream("en-token.bin")) {
            TokenizerModel model = new TokenizerModel(modelIn);
            TokenizerME tokenizer = new TokenizerME(model);
            String tokens[] = tokenizer.tokenize("John is 26 years old.");
            // probabilities refer to the most recent tokenize() call
            double tokenProbs[] = tokenizer.getTokenProbabilities();
            System.out.println("Token\t: Probability\n-------------------------------");
            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + "\t: " + tokenProbs[i]);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
When the above program is run, the output to the console is as shown below:
Token : Probability
-------------------------------
John : 1.0
is : 1.0
26 : 1.0
years : 1.0
old : 0.9954218897531331
. : 1.0
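Since the probabilities line up index by index with the tokens, one possible use is to flag tokens the model was less sure about. The following is a minimal sketch in plain Java, not part of the OpenNLP API: the class name LowConfidenceTokensSketch, the below() helper, and the 0.999 threshold are all hypothetical choices for illustration, and the arrays are copied by hand from the output above rather than produced by a tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative helper (hypothetical, not OpenNLP code): flags tokens whose
 * probability falls below a chosen threshold.
 */
class LowConfidenceTokensSketch {

    // tokens[i] and probs[i] are assumed to describe the same token.
    static List<String> below(String[] tokens, double[] probs, double threshold) {
        if (tokens.length != probs.length) {
            throw new IllegalArgumentException("tokens and probs must have the same length");
        }
        List<String> flagged = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (probs[i] < threshold) {
                flagged.add(tokens[i]);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Values copied from the console output shown above.
        String[] tokens = {"John", "is", "26", "years", "old", "."};
        double[] probs = {1.0, 1.0, 1.0, 1.0, 0.9954218897531331, 1.0};
        // 0.999 is an arbitrary threshold chosen for this example.
        System.out.println(below(tokens, probs, 0.999)); // prints [old]
    }
}
```

In a real program the two arrays would come from tokenizer.tokenize() and tokenizer.getTokenProbabilities(), called in that order.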
WhitespaceTokenizer
The following example demonstrates the WhitespaceTokenizer of the OpenNLP Tokenizer API.
WhiteSpaceTokenizerExample.java
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.WhitespaceTokenizer;
/**
 * www.tutorialkart.com
 * Tokenizer example in Apache OpenNLP using WhitespaceTokenizer
 */
public class WhiteSpaceTokenizerExample {
    public static void main(String[] args) {
        Tokenizer tokenizer = WhitespaceTokenizer.INSTANCE;
        String tokens[] = tokenizer.tokenize("John is 26 years old.");
        System.out.println("Token\n----------------");
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i]);
        }
    }
}
When the above program is run, the output to the console is as shown in the following.
Output
Token
----------------
John
is
26
years
old.
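Note that the final token is "old." with the period still attached, because only whitespace separates tokens here. To make that behavior concrete, the sketch below approximates it in plain Java with a regular-expression split. It is not OpenNLP's implementation; the class name WhitespaceSplitSketch and its tokenize() method are hypothetical names used only for illustration.

```java
import java.util.Arrays;

/**
 * Plain-Java approximation of whitespace tokenization (hypothetical,
 * not OpenNLP code).
 */
class WhitespaceSplitSketch {

    // Splits on runs of whitespace; punctuation stays attached to its word.
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new String[0]; // avoid returning a single empty token
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // prints [John, is, 26, years, old.]
        System.out.println(Arrays.toString(tokenize("John is 26 years old.")));
    }
}
```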
SimpleTokenizer
The following example demonstrates the SimpleTokenizer of the OpenNLP Tokenizer API.
SimpleTokenizerExample.java
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;
/**
 * www.tutorialkart.com
 * Tokenizer example in Apache OpenNLP using SimpleTokenizer
 */
public class SimpleTokenizerExample {
    public static void main(String[] args) {
        Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
        String tokens[] = tokenizer.tokenize("John is 26 years old.");
        System.out.println("Token\n----------------");
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i]);
        }
    }
}
When the above program is run, the output to the console is as shown in the following.
Output
Token
----------------
John
is
26
years
old
.
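Here the period becomes its own token because the character class changes from letter to punctuation between "d" and ".". The sketch below approximates that idea in plain Java: it starts a new token whenever the class of the current character (letter, digit, whitespace, other) differs from the previous one. It is only an illustration under that assumption and does not reproduce every rule of the library; for instance, it keeps any run of "other" characters together, which may differ from OpenNLP. The names CharClassSplitSketch, CharClass, and classify() are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Plain-Java approximation of character-class tokenization (hypothetical,
 * not OpenNLP code).
 */
class CharClassSplitSketch {

    private enum CharClass { WHITESPACE, LETTER, DIGIT, OTHER }

    private static CharClass classify(char c) {
        if (Character.isWhitespace(c)) return CharClass.WHITESPACE;
        if (Character.isLetter(c)) return CharClass.LETTER;
        if (Character.isDigit(c)) return CharClass.DIGIT;
        return CharClass.OTHER;
    }

    static String[] tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // start index of the current token, -1 while in whitespace
        CharClass previous = CharClass.WHITESPACE;
        for (int i = 0; i < text.length(); i++) {
            CharClass current = classify(text.charAt(i));
            if (current != previous) {
                // The character class changed: close the open token, if any.
                if (start != -1) {
                    tokens.add(text.substring(start, i));
                }
                start = (current == CharClass.WHITESPACE) ? -1 : i;
            }
            previous = current;
        }
        if (start != -1) {
            tokens.add(text.substring(start)); // flush the last token
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // prints [John, is, 26, years, old, .]
        System.out.println(Arrays.toString(tokenize("John is 26 years old.")));
    }
}
```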
Conclusion
In this Apache OpenNLP tutorial, we have seen the three ways of tokenization that the OpenNLP Tokenizer API provides: TokenizerME with a pre-trained model, WhitespaceTokenizer, and SimpleTokenizer.