Classification is the task of identifying the features of an entity and assigning the entity to one of a set of predefined classes/categories.

Logistic Regression is a model that learns the relation between a categorical output variable and the features of the observations in an experiment.

The “logistic” in the name refers to the logistic (sigmoid) function, which maps a weighted sum of the feature values to a probability between 0 and 1. In practice, this means identifying common features across all examples/experiments and transforming each example into a feature vector.

“Regression” refers to measuring the relation between the mean value of the output variable and the input (independent) variables. The output is the label a problem instance is classified to; the input variable values are the feature values.
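To make this concrete, here is a minimal sketch (plain Java, not Spark code) of how a trained binary logistic regression model scores a single example. The weights, intercept, and feature values below are hypothetical, for illustration only.

public class SigmoidSketch {
    // logistic (sigmoid) function: maps any real-valued score to (0, 1)
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        double[] weights = {0.8, -1.2, 0.3};  // hypothetical learned weights
        double intercept = 0.1;               // hypothetical learned intercept
        double[] features = {1.0, 0.5, 2.0};  // one example's feature vector

        // weighted sum of the features (the "regression" part)
        double z = intercept;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * features[i];
        }

        // probability of the positive class; classify as 1 if p >= 0.5
        double p = sigmoid(z);
        System.out.println("P(label = 1) = " + p + ", predicted label = " + (p >= 0.5 ? 1 : 0));
    }
}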

An Example for Classification using Logistic Regression in Apache Spark MLlib with Java


In this Apache Spark tutorial, we shall look into an example, with a step-by-step explanation, of generating a Logistic Regression model for classification using Spark MLlib.

1. Configure Spark.

// run locally with 2 threads and 2 GB of executor memory
SparkConf conf = new SparkConf().setAppName("LogisticRegressionClassifier")
                .setMaster("local[2]").set("spark.executor.memory","2g");

2. Start a Spark context.

JavaSparkContext jsc = new JavaSparkContext(conf);

3. Load the data and split it into training and testing sets. The data file used in this example, sample_libsvm_data.txt, is present in the “data/mllib” folder of the Apache Spark distribution downloaded from the official website; a sample of its LIBSVM format is shown after the snippet below.

// provide path to data transformed as [feature vectors]
String path = "data/mllib/sample_libsvm_data.txt";
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), path).toJavaRDD();

// Split initial RDD into two... [80% training data, 20% testing data].
JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[] {0.8, 0.2}, 11L);
JavaRDD<LabeledPoint> training = splits[0].cache();
JavaRDD<LabeledPoint> test = splits[1];
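Each line of a LIBSVM file encodes one labeled example as a label followed by sparse index:value feature pairs. The second line below is illustrative only; its indices and values are made up.

<label> <index1>:<value1> <index2>:<value2> ...
0 128:51 129:159 130:253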

4. Train a Logistic Regression model.

LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
        .setNumClasses(10)
        .run(training.rdd());
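LogisticRegressionWithLBFGS fits the model with the L-BFGS optimizer. setNumClasses configures multinomial logistic regression for up to 10 label classes; since sample_libsvm_data.txt contains only the labels 0 and 1, a binary variant (a sketch, not required for this example) would be:

LogisticRegressionModel binaryModel = new LogisticRegressionWithLBFGS()
        .setNumClasses(2)
        .run(training.rdd());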

5. Use the model to predict on the test data, and calculate accuracy.

// Compute raw scores on the test set.
JavaPairRDD<Object, Object> predictionAndLabels = test.mapToPair(p ->
        new Tuple2<>(model.predict(p.features()), p.label()));
 
// get evaluation metrics
MulticlassMetrics metrics = new MulticlassMetrics(predictionAndLabels.rdd());
double accuracy = metrics.accuracy();
System.out.println("Accuracy = " + accuracy);

6. Save the trained classifier model locally for future use.

model.save(jsc.sc(), "LogisticRegressionClassifier");
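The saved model can be loaded back later for prediction; a minimal sketch, assuming the same save path:

LogisticRegressionModel sameModel =
        LogisticRegressionModel.load(jsc.sc(), "LogisticRegressionClassifier");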

7. Stop the Spark context.

jsc.stop();

The complete example program is given below.

LogisticRegressionClassifierExample.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS;
import org.apache.spark.mllib.evaluation.MulticlassMetrics;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;

import scala.Tuple2;

/**
 * Example for Logistic Regression Classifier
 */
public class LogisticRegressionClassifierExample {
	public static void main(String[] args) {
		// configure spark
		SparkConf conf = new SparkConf().setAppName("LogisticRegressionClassifier")
				.setMaster("local[2]").set("spark.executor.memory","2g");
		
		// start a spark context
		JavaSparkContext jsc = new JavaSparkContext(conf);

		// provide path to data transformed as [feature vectors]
		String path = "data/mllib/sample_libsvm_data.txt";
		JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), path).toJavaRDD();

		// Split initial RDD into two... [80% training data, 20% testing data].
		JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[] {0.8, 0.2}, 11L);
		JavaRDD<LabeledPoint> training = splits[0].cache();
		JavaRDD<LabeledPoint> test = splits[1];

		// Run training algorithm to build the model.
		LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
				.setNumClasses(10)
				.run(training.rdd());

		// Compute raw scores on the test set.
		JavaPairRDD<Object, Object> predictionAndLabels = test.mapToPair(p ->
				new Tuple2<>(model.predict(p.features()), p.label()));

		// get evaluation metrics
		MulticlassMetrics metrics = new MulticlassMetrics(predictionAndLabels.rdd());
		double accuracy = metrics.accuracy();
		System.out.println("Accuracy = " + accuracy);

		// After training, save model to local for prediction in future
		model.save(jsc.sc(), "LogisticRegressionClassifier");

		// stop the spark context
		jsc.stop();
	}
}

Output

Accuracy = 0.9523809523809523