What is Topic Modelling?
Topic Modelling is a natural language processing task that identifies the probable topic represented by the text of a document.
We often come across articles or documents whose text belongs to a topic: for example, news articles, research papers, or internet pages. Each of these describes or explains a topic; in fact, one usually starts writing with a topic in mind.
This tutorial itself is an example. We are discussing Topic Modelling, so our topic is “Topic Modelling”, and you might come across the following words more frequently than others:
- document
- natural language processing
- task
- topic
- model
- probability
As another example, if a document belongs to the topic “forest”, it might frequently contain words like trees, animals, types of forests, forest life cycle, and ecosystem.
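Concretely, LDA and similar models do not consume raw text; they consume word-count vectors. The following minimal sketch, with a hypothetical vocabulary and document (not part of this tutorial's data flow), shows how a document could be reduced to such a vector:
import java.util.Arrays;
import java.util.List;

public class BagOfWordsSketch {
    public static void main(String[] args) {
        // Hypothetical fixed vocabulary; a real pipeline would build this from the corpus
        List<String> vocabulary = Arrays.asList("forest", "trees", "animals", "ecosystem");
        String document = "the forest ecosystem depends on trees and the animals of the forest";
        // Count how often each vocabulary word occurs in the document
        double[] counts = new double[vocabulary.size()];
        for (String token : document.split(" ")) {
            int index = vocabulary.indexOf(token);
            if (index >= 0) {
                counts[index]++;
            }
        }
        System.out.println(Arrays.toString(counts)); // prints [2.0, 1.0, 1.0, 1.0]
    }
}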
To capture this kind of information in a mathematical model, Apache Spark MLlib provides Topic Modelling using Latent Dirichlet Allocation (LDA).
Topic Modelling using Latent Dirichlet Allocation in Apache Spark MLlib
Now, we shall learn the process of generating a topic model and using it for prediction, step by step.
Step 1 : Start Spark Context
Configure the Spark application to run locally and start the Spark Context.
// Configure spark application
SparkConf conf = new SparkConf().setAppName("TopicModellingLDAExampleApp")
.setMaster("local[2]");
// start Spark Context
JavaSparkContext jsc = new JavaSparkContext(conf);
Step 2 : Load Data into Spark RDD
Load and parse the sample data from data/mllib/sample_lda_data.txt (we are using the sample data provided in the Apache Spark MLlib examples on GitHub). Each line in the file represents a document as a space-separated list of word counts, one entry per vocabulary word. So, index each document with a unique id.
// Load and parse the sample data
String path = "data/mllib/sample_lda_data.txt";
JavaRDD<String> data = jsc.textFile(path);
JavaRDD<Vector> parsedData = data.map(s -> {
String[] sarray = s.trim().split(" ");
double[] values = new double[sarray.length];
for (int i = 0; i < sarray.length; i++) {
values[i] = Double.parseDouble(sarray[i]);
}
return Vectors.dense(values);
});
// Index documents with unique IDs : Long - document id, Vector - Transformed document contents
JavaPairRDD<Long, Vector> corpus =
JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(Tuple2::swap));
corpus.cache();
Step 3 : Run LDA Topic Modeller
Set the number of topics, and run the LDA Topic Modeller against the data corpus.
LDAModel ldaModel = new LDA().setK(3).run(corpus);
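The LDA estimator exposes a few tuning knobs beyond the number of topics. A hedged sketch follows; the values here are illustrative, not tuned for the sample data:
// Illustrative parameter values, not tuned for the sample data
LDAModel tunedModel = new LDA()
        .setK(3)                     // number of topics
        .setMaxIterations(50)        // run more iterations than the default
        .setDocConcentration(1.1)    // Dirichlet prior on per-document topic mixtures
        .setTopicConcentration(1.1)  // Dirichlet prior on per-topic word distributions
        .setSeed(42L)                // fix the seed for reproducible runs
        .run(corpus);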
Step 4 : Output Topic Distributions over Vocabulary
Once the model is generated, we can print each topic's distribution over the vocabulary. Each value is the weight of a word in a topic; higher weights mean the word is more representative of that topic (the weights are not normalized to sum to 1, as the sample output below shows).
System.out.println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize()
+ " words):");
Matrix topics = ldaModel.topicsMatrix();
for (int topic = 0; topic < 3; topic++) {
System.out.print("Topic " + topic + ":");
for (int word = 0; word < ldaModel.vocabSize(); word++) {
System.out.print(" " + topics.apply(word, topic));
}
System.out.println();
}
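Alternatively, describeTopics() returns each topic's top terms sorted by weight, which is often easier to read than dumping the whole matrix. A small sketch (the limit of 5 terms per topic is illustrative):
// Top 5 terms per topic; each entry pairs term indices with their weights
Tuple2<int[], double[]>[] topicDescriptions = ldaModel.describeTopics(5);
for (int t = 0; t < topicDescriptions.length; t++) {
    System.out.print("Topic " + t + " top terms:");
    int[] termIndices = topicDescriptions[t]._1();
    double[] termWeights = topicDescriptions[t]._2();
    for (int i = 0; i < termIndices.length; i++) {
        System.out.print(" " + termIndices[i] + " (" + termWeights[i] + ")");
    }
    System.out.println();
}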
Step 5 : Model Persistence
Save the generated model, so that it can be loaded later to predict topics for new documents.
ldaModel.save(jsc.sc(),
"TopicModellingLDAExampleApp");
Step 6 : Stop Spark Context
jsc.stop();
Java Example – Topic Modelling
In the following example program, we shall perform Topic Modelling using Latent Dirichlet Allocation.
TopicModellingLDAExample.java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.clustering.DistributedLDAModel;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.clustering.LDAModel;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
/**
 * Topic Modelling using Latent Dirichlet Allocation in Apache Spark MLlib
*/
public class TopicModellingLDAExample {
public static void main(String[] args) {
// Configure spark application
SparkConf conf = new SparkConf().setAppName("TopicModellingLDAExampleApp")
.setMaster("local[2]");
// start Spark Context
JavaSparkContext jsc = new JavaSparkContext(conf);
// Load and parse the sample data
String path = "data/mllib/sample_lda_data.txt";
JavaRDD<String> data = jsc.textFile(path);
JavaRDD<Vector> parsedData = data.map(s -> {
String[] sarray = s.trim().split(" ");
double[] values = new double[sarray.length];
for (int i = 0; i < sarray.length; i++) {
values[i] = Double.parseDouble(sarray[i]);
}
return Vectors.dense(values);
});
// Index documents with unique IDs : Long - document id, Vector - Transformed document contents
JavaPairRDD<Long, Vector> corpus =
JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(Tuple2::swap));
corpus.cache();
// Cluster the documents into three topics using LDA
LDAModel ldaModel = new LDA().setK(3).run(corpus);
// Output topics. Each is a distribution over words (matching word count vectors)
System.out.println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize()
+ " words):");
Matrix topics = ldaModel.topicsMatrix();
for (int topic = 0; topic < 3; topic++) {
System.out.print("Topic " + topic + ":");
for (int word = 0; word < ldaModel.vocabSize(); word++) {
System.out.print(" " + topics.apply(word, topic));
}
System.out.println();
}
// save the model
ldaModel.save(jsc.sc(),
"TopicModellingLDAExampleApp");
// test if the model loads
DistributedLDAModel sameModel = DistributedLDAModel.load(jsc.sc(),
"TopicModellingLDAExampleApp");
// stop the Spark Context
jsc.stop();
}
}
Output
Learned topics (as distributions over vocab of 11 words):
Topic 0: 7.576226952795377 5.816129763687888 3.443659463610819 13.523621733565031 5.564859588817557 6.605326794930297 14.782903558924001 3.063190448611529 2.8630735297090064 6.5170961047598635 17.015304210597
Topic 1: 8.966136190838393 7.450099807005361 4.338884068311933 18.900255115448275 9.235601145516164 7.2157902467479875 9.709434717615075 3.0358356116955343 2.2640073474546254 5.074403401553405 9.89058872292561
Topic 2: 9.457636856366229 15.73377042930675 4.2174564680772475 7.576123150986693 10.199539265666282 8.178882958321715 6.507661723460921 3.9009739396929377 2.8729191228363677 12.40850049368673 6.094107066477388
Conclusion
In this Spark Tutorial, we learned how to do Topic Modelling using Latent Dirichlet Allocation, with the help of an example Java application.