NER Training in OpenNLP with Name Finder Training Java Example
In this OpenNLP tutorial, we shall learn how to build a model for Named Entity Recognition (NER) from custom training data, which varies from one requirement to another. We shall write a Java program that uses the OpenNLP Name Finder training API to generate a model. The model can then detect named entities that are specific to our requirement and similar to those annotated in the training file.
Prerequisites :
To follow this tutorial, you should have a basic understanding of the Java programming language and a Java project set up with the OpenNLP libraries, so that the OpenNLP Name Finder training API is available.
The following is a step-by-step process for generating a model from custom training data :
Step 1: Prepare Training Data
As suggested by the OpenNLP manual, the training file should contain at least 15,000 sentences for the trained model to perform well.
Annotations should be provided for named entities in the training file using the format below. The file contains one sentence per line, and every token, including the <START> and <END> tags, is separated from its neighbours by a space.
<START:named_entity_type> Named Entity <END> remaining sentence .
An example could be : <START:person> Johny <END> and <START:person> Ricky <END> are brothers .
Note : If there is only one named entity type, mentioning named_entity_type is not required. <START> Johny <END> and <START> Ricky <END> are brothers .
Multiple types could be given in a single training file.
An example of a training sentence having multiple types is : <START:person> Johny <END> and <START:person> Ricky <END> are <START:relation> brothers <END> .
The type is mentioned after the <START: tag.
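A malformed line in the training file is easy to overlook, so it can help to check the annotations before training. The following is a minimal sketch in plain Java, not part of OpenNLP: the class name AnnotationLineCheck and its check method are hypothetical. It assumes that tags must be whitespace-separated tokens and that every <START> needs a matching <END> on the same line.

```java
import java.util.ArrayList;
import java.util.List;

public class AnnotationLineCheck {

    /** Returns the problems found in one training line; the list is empty if none were found. */
    public static List<String> check(String line) {
        List<String> problems = new ArrayList<>();
        boolean insideEntity = false;
        for (String token : line.trim().split("\\s+")) {
            if (token.matches("<START(:[^:>\\s]+)?>")) {
                // an opening tag while another entity is still open
                if (insideEntity) {
                    problems.add("<START> inside an unclosed entity: " + token);
                }
                insideEntity = true;
            } else if (token.equals("<END>")) {
                if (!insideEntity) {
                    problems.add("<END> without a matching <START>");
                }
                insideEntity = false;
            } else if (token.contains("<START") || token.contains("<END>")) {
                // a tag glued to a word, for example <START:person>Johny<END>
                problems.add("tag not separated by spaces: " + token);
            }
        }
        if (insideEntity) {
            problems.add("missing <END> at end of line");
        }
        return problems;
    }

    public static void main(String[] args) {
        // well-formed line: prints []
        System.out.println(check("<START:person> Johny <END> and <START:person> Ricky <END> are brothers ."));
        // tags glued to the word: prints one problem
        System.out.println(check("<START:person>Johny<END> and Ricky are brothers ."));
    }
}
```

Running check over every line of AnnotatedSentences.txt and printing the non-empty results gives a quick sanity pass before the training step.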
AnnotatedSentences.txt [the source is from Apache OpenNLP, but modified to demonstrate the usage of multiple types for the named entities]
Once we are ready with the training data, we shall proceed with writing the Java program to train on these sentences.
Step 2: Read the training data
Read the training data file into an ObjectStream<NameSample>.
// factory that opens the training file on demand
InputStreamFactory in = null;
try {
    in = new MarkableFileInputStreamFactory(new File("AnnotatedSentences.txt"));
} catch (FileNotFoundException e2) {
    e2.printStackTrace();
}
// read the file line by line and parse each annotated line into a NameSample
ObjectStream<NameSample> sampleStream = null;
try {
    sampleStream = new NameSampleDataStream(
            new PlainTextByLineStream(in, StandardCharsets.UTF_8));
} catch (IOException e1) {
    e1.printStackTrace();
}
Step 3: Training Parameters.
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, 70);
params.put(TrainingParameters.CUTOFF_PARAM, 1);
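The iterations parameter sets the number of training iterations, and the cutoff is the minimum number of times a feature must be seen in the training data to be used. The plain Java sketch below only illustrates the idea of a cutoff. It is not OpenNLP's implementation, and the class CutoffSketch, the method keepFrequent and the feature strings such as "w=and" are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class CutoffSketch {

    /** Keeps only the features that were observed at least cutoff times. */
    public static Set<String> keepFrequent(List<String> observedFeatures, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (String feature : observedFeatures) {
            counts.merge(feature, 1, Integer::sum);
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // hypothetical features; "w=and" occurs twice, the others once
        List<String> seen = Arrays.asList("w=Johny", "w=and", "w=Ricky", "w=and", "w=brothers");
        System.out.println(keepFrequent(seen, 1)); // [w=Johny, w=Ricky, w=and, w=brothers]
        System.out.println(keepFrequent(seen, 2)); // [w=and]
    }
}
```

With a cutoff of 1, as used in this tutorial, nothing is discarded, which suits a small training file. A larger corpus may benefit from a higher cutoff so that one-off features are ignored.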
Step 4: Train the model.
TokenNameFinderModel nameFinderModel = null;
try {
    // "en" is the language code; the null type keeps the entity types given in the training data
    nameFinderModel = NameFinderME.train("en", null, sampleStream,
            params, TokenNameFinderFactory.create(null, null, Collections.emptyMap(), new BioCodec()));
} catch (IOException e) {
    e.printStackTrace();
}
Step 5: Save the model to a file.
Once you have generated the model, save it to a file so that it can be loaded on other computers or used at a later point in time.
File output = new File("ner-custom-model.bin");
// try-with-resources closes the stream even if serialization fails
try (FileOutputStream outputStream = new FileOutputStream(output)) {
    nameFinderModel.serialize(outputStream);
}
Step 6: Test the program.
To verify the model, pass it to a NameFinderME and predict the entity types in a tokenized test sentence.
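The name finder takes the sentence as an array of tokens and returns spans of token indices, where the start index is inclusive and the end index is exclusive. The helper below is a small sketch in plain Java of those two steps: splitting a sentence on whitespace and recovering the text covered by a span. The class TokenSpanSketch and its methods are hypothetical and stand apart from OpenNLP, which also ships its own tokenizers.

```java
import java.util.Arrays;

public class TokenSpanSketch {

    /** Splits a sentence on whitespace into the token array the name finder expects. */
    public static String[] tokenize(String sentence) {
        String trimmed = sentence.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Joins tokens[start] up to tokens[end - 1]; the end index is exclusive. */
    public static String cover(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("Alisa Fernandes is a tourist from Spain");
        System.out.println(tokens.length);       // 7
        System.out.println(cover(tokens, 0, 2)); // Alisa Fernandes
    }
}
```

Whitespace splitting is only adequate when punctuation is already separated from words. The tokenization applied at prediction time should match the one used to prepare the training file.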
The complete program is given below :
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import opennlp.tools.namefind.BioCodec;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinder;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.Span;
import opennlp.tools.util.TrainingParameters;
/**
 * NER Training in OpenNLP with Name Finder Training Java Example
 * @author www.tutorialkart.com
 */
public class NERTrainingExample {

    public static void main(String[] args) {

        // reading training data
        InputStreamFactory in;
        try {
            in = new MarkableFileInputStreamFactory(new File("AnnotatedSentences.txt"));
        } catch (FileNotFoundException e) {
            e.printStackTrace();
            return; // nothing to train on without the training file
        }

        // setting the parameters for training
        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, 70);
        params.put(TrainingParameters.CUTOFF_PARAM, 1);

        // training the model; try-with-resources closes the sample stream afterwards
        TokenNameFinderModel nameFinderModel;
        try (ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
                new PlainTextByLineStream(in, StandardCharsets.UTF_8))) {
            nameFinderModel = NameFinderME.train("en", null, sampleStream,
                    params, TokenNameFinderFactory.create(null, null, Collections.emptyMap(), new BioCodec()));
        } catch (IOException e) {
            e.printStackTrace();
            return; // training failed, so there is no model to save or test
        }

        // saving the model to "ner-custom-model.bin" file
        try (FileOutputStream outputStream = new FileOutputStream(new File("ner-custom-model.bin"))) {
            nameFinderModel.serialize(outputStream);
        } catch (IOException e) {
            e.printStackTrace();
        }

        // testing the model and printing the types it found in the input sentence
        TokenNameFinder nameFinder = new NameFinderME(nameFinderModel);
        String[] testSentence = {"Alisa", "Fernandes", "is", "a", "tourist", "from", "Spain"};
        System.out.println("Finding types in the test sentence..");
        Span[] names = nameFinder.find(testSentence);
        for (Span name : names) {
            // join the tokens covered by the span (the end index is exclusive)
            StringBuilder entity = new StringBuilder();
            for (int i = name.getStart(); i < name.getEnd(); i++) {
                entity.append(testSentence[i]).append(" ");
            }
            System.out.println(name.getType() + " : " + entity + "\t [probability=" + name.getProb() + "]");
        }
    }
}
Output :
Indexing events using cutoff of 1
Computing event counts... done. 1392 events
Indexing... done.
Collecting events... Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 1392
Number of Outcomes: 3
Number of Predicates: 9268
Computing model parameters...
Performing 70 iterations.
1: . (1358/1392) 0.9755747126436781
2: . (1387/1392) 0.9964080459770115
3: . (1390/1392) 0.9985632183908046
4: . (1392/1392) 1.0
5: . (1392/1392) 1.0
6: . (1392/1392) 1.0
7: . (1392/1392) 1.0
Stopping: change in training set accuracy less than 1.0E-5
Stats: (1392/1392) 1.0
...done.
Compressed 9268 parameters to 428
4 outcome patterns
Finding types in the test sentence..
person : Alisa Fernandes [probability=0.6643846020606172]
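Each detected entity comes with a probability, which can be used to discard low-confidence detections. The sketch below is plain Java and independent of OpenNLP: the Detection class and the atLeast method are hypothetical, and the thresholds 0.5 and 0.8 are arbitrary values chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ProbabilityFilterSketch {

    /** Hypothetical holder for the values the program prints for one entity. */
    public static final class Detection {
        public final String type;
        public final String text;
        public final double probability;

        public Detection(String type, String text, double probability) {
            this.type = type;
            this.text = text;
            this.probability = probability;
        }
    }

    /** Returns the detections whose probability is at least the given threshold. */
    public static List<Detection> atLeast(List<Detection> detections, double threshold) {
        List<Detection> kept = new ArrayList<>();
        for (Detection detection : detections) {
            if (detection.probability >= threshold) {
                kept.add(detection);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // the single detection from the output above
        List<Detection> found = Arrays.asList(
                new Detection("person", "Alisa Fernandes", 0.6643846020606172));
        System.out.println(atLeast(found, 0.5).size()); // 1
        System.out.println(atLeast(found, 0.8).size()); // 0
    }
}
```

A suitable threshold depends on the application and is best chosen by evaluating the model on held-out annotated sentences.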
Once the program is run, the model is saved to “ner-custom-model.bin” in the working directory of the program.
Conclusion :
In this Apache OpenNLP tutorial, we have learnt how to train a custom model for Named Entity Recognition, save the model to a file, and use the model to predict named entity types in a test sentence.