Tokenizer Example in Apache openNLP

In this tutorial, we shall look into Tokenizer Example in Apache OpenNLP. Also, a little understanding of the Tokenizaion process.

What is tokenization?

Tokenization is a process of segmenting strings into smaller parts called tokens(say sub-strings). These tokens are usually words, punctuation marks, sequence of digits, and like.

An example is shown in the following table.

Input to TokenizerJohn is 26 years old.
Output of Tokenizer[John, is, 26, years, old]

Tokenization in OpenNLP

Tokenizer API in OpenNLP provides following three ways for tokenization:

Note : OpenNLP version used is 1.7.2.

Please observe the differences in the output from these three ways of tokenization in the examples provided below.

TokenizerME class Loaded with a Token Model

Step 1: Read the pretrained model into a stream.

</>
Copy
InputStream modelIn = new FileInputStream("en-token.bin");

Step 2: Read the stream to a Tokenizer model.

</>
Copy
TokenizerModel model = new TokenizerModel(modelIn);

Step 3: Initialize the tokenizer with the model.

</>
Copy
TokenizerME tokenizer = new TokenizerME(model);

Step 4: Use TokenizerME.tokenize() method to extract the tokens to a String Array.

</>
Copy
String tokens[] = tokenizer.tokenize("John is 26 years old.");

Step 5: Use TokenizerME.getTokenProbabilities() to get the probabilities for the segments to be tokens.

</>
Copy
double tokenProbs[] = tokenizer.getTokenProbabilities();

Step 6: Finally, print the results.

Everything put together, is the below below program :

</>
Copy
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

/**
 * www.tutorialkart.com
 * Tokenizer Example in Apache openNLP using TokenizerME class loaded with pre-trained token model
 */
public class TokenizerModelExample {

	public static void main(String[] args) {
		InputStream modelIn = null;

		try {
			modelIn = new FileInputStream("en-token.bin");
			TokenizerModel model = new TokenizerModel(modelIn);
			TokenizerME tokenizer = new TokenizerME(model);
			String tokens[] = tokenizer.tokenize("John is 26 years old.");
			double tokenProbs[] = tokenizer.getTokenProbabilities();
			
			System.out.println("Token\t: Probability\n-------------------------------");
			for(int i=0;i<tokens.length;i++){
				System.out.println(tokens[i]+"\t: "+tokenProbs[i]);
			}
		}
		catch (IOException e) {
			e.printStackTrace();
		}
		finally {
			if (modelIn != null) {
				try {
					modelIn.close();
				}
				catch (IOException e) {
				}
			}
		}
	}
}

When the above program is run, the output to the console is as shown below :

Token	: Probability
-------------------------------
John	: 1.0
is  	: 1.0
26  	: 1.0
years	: 1.0
old 	: 0.9954218897531331
.   	: 1.0

WhitespaceTokenizer

Following is the example to demonstrate WhitespaceTokenizer of OpenNLP Tokenizer API.

WhiteSpaceTokenizerExample.java

</>
Copy
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.WhitespaceTokenizer;

/**
 * www.tutorialkart.com
 * Tokenizer Example in Apache openNLP using WhitespaceTokenizer
 */
public class WhiteSpaceTokenizerExample {

	public static void main(String[] args) {
		Tokenizer tokenizer = WhitespaceTokenizer.INSTANCE;
		String tokens[] = tokenizer.tokenize("John is 26 years old.");

		System.out.println("Token\n----------------");
		for(int i=0;i<tokens.length;i++){
			System.out.println(tokens[i]);
		}
	}
}

When the above program is run, the output to the console is as shown in the following.

Output

Token
----------------
John
is
26
years
old.

SimpleTokenizer

Following is the example to demostrateSimpleTokenizer of OpenNLP Tokenizer API.

SimpleTokenizerExample.java

</>
Copy
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;

/**
 * www.tutorialkart.com
 * Tokenizer Example in Apache openNLP using SimpleTokenizer
 */
public class SimpleTokenizerExample {

	public static void main(String[] args) {
		Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
		String tokens[] = tokenizer.tokenize("John is 26 years old.");

		System.out.println("Token\n----------------");
		for(int i=0;i<tokens.length;i++){
			System.out.println(tokens[i]);
		}
	}
}

When the above program is run, the output to the console is as shown in the following.

Output

Token
----------------
John
is
26
years
old
.

Conclusion

In this Apache OpenNLP Tutorial, we have seen different ways of tokenization the OpenNLP Tokenizer API provides.

Following are some of the other examples of openNLP :