What is Topic Modelling?

Topic Modelling is a natural language processing task that identifies the probable topic represented by the text in a document.

We come across articles or documents whose text usually belongs to a topic: news articles, research papers, or web pages, for example. Each of these describes or explains a topic. In fact, one usually starts writing with a topic in mind.

An example is right here. This tutorial discusses Topic Modelling, so our topic is “Topic Modelling”, and you might come across the following words more frequently than others:

  • document
  • natural language processing
  • task
  • topic
  • model
  • probability

As another example, if a document belongs to the topic “forest”, it might frequently contain words like trees, animals, types of forest, life cycle, ecosystem, etc.

To capture this kind of information in a mathematical model, Apache Spark MLlib provides Topic Modelling using Latent Dirichlet Allocation (LDA).
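
For intuition, LDA is a generative probabilistic model: it assumes every document is a mixture over a fixed number of topics, and every topic is a distribution over the vocabulary. A minimal sketch of the assumed generative process, in standard textbook notation (these symbols are LDA conventions, not names from the Spark API):

\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of document } d \\
\phi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution of topic } k \\
z_{d,n} &\sim \mathrm{Multinomial}(\theta_d) && \text{topic of the $n$-th word in document $d$} \\
w_{d,n} &\sim \mathrm{Multinomial}(\phi_{z_{d,n}}) && \text{the observed word}
\end{align*}

Fitting the model means inferring the hidden \theta and \phi from the observed words alone.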

Topic Modelling using Latent Dirichlet Allocation in Apache Spark MLlib

Now, we shall learn the process of generating the topic model and using it for prediction, step by step.

Step 1 : Start Spark Context

Configure the Spark application to run locally, and start the Spark Context.

// Configure spark application
SparkConf conf = new SparkConf().setAppName("TopicModellingLDAExampleApp")
        .setMaster("local[2]");
// start Spark Context
JavaSparkContext jsc = new JavaSparkContext(conf);
 

Step 2 : Load Data into Spark RDD

Load and parse the sample data from data/mllib/sample_lda_data.txt (we are using the sample data provided with the Apache Spark MLlib examples on GitHub). Each line in the file represents a document, so index each document with a unique id.
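
Each line of the file is a space-separated vector of word counts over a fixed vocabulary (11 words in this sample, as the output below confirms); LDA in MLlib consumes these count vectors rather than raw text. A line in this format looks like the following (counts shown for illustration):

1 2 6 0 2 3 1 1 0 0 3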

// Load and parse the sample data
String path = "data/mllib/sample_lda_data.txt";
JavaRDD<String> data = jsc.textFile(path);
JavaRDD<Vector> parsedData = data.map(s -> {
    String[] sarray = s.trim().split(" ");
    double[] values = new double[sarray.length];
    for (int i = 0; i < sarray.length; i++) {
        values[i] = Double.parseDouble(sarray[i]);
    }
    return Vectors.dense(values);
});
 
// Index documents with unique IDs : Long - document id, Vector - Transformed document contents
JavaPairRDD<Long, Vector> corpus =
        JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(Tuple2::swap));
corpus.cache();

Step 3 : Run LDA Topic Modeller

Set the number of topics, and run the LDA Topic Modeller against the data corpus.

LDAModel ldaModel = new LDA().setK(3).run(corpus);
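
Besides the number of topics, the LDA estimator exposes several other knobs. A brief sketch with illustrative values (these are arbitrary examples, not tuned recommendations; the setters shown are part of the RDD-based MLlib LDA API):

// Illustrative configuration; values are arbitrary examples, not recommendations
LDAModel tunedModel = new LDA()
        .setK(3)                     // number of topics
        .setMaxIterations(100)       // optimizer iterations (EM is the default optimizer)
        .setDocConcentration(5.0)    // alpha: prior on per-document topic mixtures (> 1.0 for EM)
        .setTopicConcentration(1.1)  // beta: prior on per-topic word distributions (> 1.0 for EM)
        .setSeed(13L)                // fix the seed for reproducible runs
        .run(corpus);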

Step 4 : Output Topic Distributions over Vocabulary

Once the model is generated, we may print each topic’s distribution over the vocabulary.

System.out.println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize()
+ " words):");
Matrix topics = ldaModel.topicsMatrix();
for (int topic = 0; topic < 3; topic++) {
    System.out.print("Topic " + topic + ":");
    for (int word = 0; word < ldaModel.vocabSize(); word++) {
        System.out.print(" " + topics.apply(word, topic));
    }
    System.out.println();
}
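
Reading the raw matrix is tedious. As a complementary view, LDAModel also offers describeTopics(maxTermsPerTopic), which returns each topic’s top-weighted terms in sorted order. A minimal sketch (the terms are vocabulary indices here, since this example works on count vectors and never maps indices back to words):

// Print the top 5 terms per topic; each term is a vocabulary index paired with its weight
Tuple2<int[], double[]>[] topicSummaries = ldaModel.describeTopics(5);
for (int t = 0; t < topicSummaries.length; t++) {
    int[] terms = topicSummaries[t]._1();
    double[] weights = topicSummaries[t]._2();
    System.out.print("Topic " + t + ":");
    for (int i = 0; i < terms.length; i++) {
        System.out.print(" term#" + terms[i] + "=" + weights[i]);
    }
    System.out.println();
}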

Step 5 : Model Persistence

Save the generated model, so that it can be loaded later to predict topics for new documents.

ldaModel.save(jsc.sc(),
        "TopicModellingLDAExampleApp");

Step 6 : Stop Spark Context

jsc.stop();

Java Example – Topic Modelling

In the following example program, we shall perform Topic Modelling using Latent Dirichlet Allocation.

TopicModellingLDAExample.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
 
import scala.Tuple2;
 
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.clustering.DistributedLDAModel;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.clustering.LDAModel;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
 
/**
 * Topic Modelling using Latent Dirichlet Allocation in Apache Spark MLlib
 */
public class TopicModellingLDAExample {
    public static void main(String[] args) {
 
        // Configure spark application
        SparkConf conf = new SparkConf().setAppName("TopicModellingLDAExampleApp")
                .setMaster("local[2]");
        // start Spark Context
        JavaSparkContext jsc = new JavaSparkContext(conf);
 
        // Load and parse the sample data
        String path = "data/mllib/sample_lda_data.txt";
        JavaRDD<String> data = jsc.textFile(path);
        JavaRDD<Vector> parsedData = data.map(s -> {
            String[] sarray = s.trim().split(" ");
            double[] values = new double[sarray.length];
            for (int i = 0; i < sarray.length; i++) {
                values[i] = Double.parseDouble(sarray[i]);
            }
            return Vectors.dense(values);
        });
        
        // Index documents with unique IDs : Long - document id, Vector - Transformed document contents
        JavaPairRDD<Long, Vector> corpus =
                JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(Tuple2::swap));
        corpus.cache();
 
        // Cluster the documents into three topics using LDA
        LDAModel ldaModel = new LDA().setK(3).run(corpus);
 
        // Output topics. Each is a distribution over words (matching word count vectors)
        System.out.println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize()
        + " words):");
        Matrix topics = ldaModel.topicsMatrix();
        for (int topic = 0; topic < 3; topic++) {
            System.out.print("Topic " + topic + ":");
            for (int word = 0; word < ldaModel.vocabSize(); word++) {
                System.out.print(" " + topics.apply(word, topic));
            }
            System.out.println();
        }
 
        // save the model
        ldaModel.save(jsc.sc(),
                "TopicModellingLDAExampleApp");
        
        // test if the model loads
        DistributedLDAModel sameModel = DistributedLDAModel.load(jsc.sc(),
                "TopicModellingLDAExampleApp");
 
        // stop the Spark Context
        jsc.stop();
    }
}

Output

Learned topics (as distributions over vocab of 11 words):
Topic 0: 7.576226952795377 5.816129763687888 3.443659463610819 13.523621733565031 5.564859588817557 6.605326794930297 14.782903558924001 3.063190448611529 2.8630735297090064 6.5170961047598635 17.015304210597
Topic 1: 8.966136190838393 7.450099807005361 4.338884068311933 18.900255115448275 9.235601145516164 7.2157902467479875 9.709434717615075 3.0358356116955343 2.2640073474546254 5.074403401553405 9.89058872292561
Topic 2: 9.457636856366229 15.73377042930675 4.2174564680772475 7.576123150986693 10.199539265666282 8.178882958321715 6.507661723460921 3.9009739396929377 2.8729191228363677 12.40850049368673 6.094107066477388
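
As a further check, the loaded DistributedLDAModel can report the topic mixture of each training document. A minimal sketch, assuming the sameModel instance from the program above (javaTopicDistributions() is the Java-friendly accessor on the RDD-based DistributedLDAModel):

// Per-document topic mixtures: document id -> vector of K topic proportions
JavaPairRDD<Long, Vector> docTopics = sameModel.javaTopicDistributions();
for (Tuple2<Long, Vector> doc : docTopics.collect()) {
    System.out.println("Document " + doc._1() + " -> " + doc._2());
}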

Conclusion

In this Spark Tutorial, we learned how to do Topic Modelling using Latent Dirichlet Allocation, with the help of an example Java application.