Spark – RDD Distinct

Spark RDD Distinct: The RDD<T> class provides the distinct() method, which returns the unique elements present in an RDD. In this tutorial, we learn to get the unique elements of an RDD using the RDD<T>.distinct() method, with the help of Java and Scala examples.

To get the distinct elements of an RDD, call distinct() on the RDD. The method returns a new RDD in which each element appears exactly once; the source RDD is not modified.
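Conceptually, distinct() can be thought of as three steps: pair every element with a placeholder value, merge the pairs that share the same key, and keep only the keys. The following plain-Java sketch models that idea on an ordinary list. It does not use Spark and is not Spark's source code; the class name DistinctSketch and its distinct helper are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DistinctSketch {

	// Hypothetical helper (not a Spark API): models distinct on a plain list.
	public static <T> List<T> distinct(List<T> input) {
		// Steps 1 and 2: treat each element as a key with a placeholder value,
		// and merge entries whose keys are equal (per equals/hashCode).
		Map<T, Boolean> reduced = new LinkedHashMap<>();
		for (T item : input) {
			reduced.merge(item, Boolean.TRUE, (first, second) -> first);
		}
		// Step 3: drop the placeholder values and keep only the keys.
		return new ArrayList<>(reduced.keySet());
	}

	public static void main(String[] args) {
		List<String> data = Arrays.asList("Learn", "Apache", "Spark", "Learn", "Spark", "RDD", "Functions");
		// prints [Learn, Apache, Spark, RDD, Functions]
		System.out.println(distinct(data));
	}
}
```

This sketch keeps the first-seen order because it uses a LinkedHashMap on a single machine. Spark makes no such ordering promise for the RDD returned by distinct().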

Spark – RDD distinct() – Java Example

In this example, we will take an RDD created from a list of strings that contains duplicates, and get its distinct elements using the RDD.distinct() method.

DistinctRDD.java

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DistinctRDD {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Spark RDD Distinct")
				.setMaster("local[2]")
				.set("spark.executor.memory", "2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);

		// read list to RDD
		List<String> data = Arrays.asList("Learn", "Apache", "Spark", "Learn", "Spark", "RDD", "Functions");
		JavaRDD<String> words = sc.parallelize(data, 1);

		// get distinct elements of RDD
		JavaRDD<String> rddDistinct = words.distinct();

		// print
		rddDistinct.foreach(item -> {
			System.out.println(item);
		});
		
		sc.close();
	}
}

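Output

Each distinct string in the input list is printed once. The order of the elements is not guaranteed and may differ from the order of the input.

Functions
Spark
Learn
Apache
RDD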
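One reason the output order need not follow the input order is that, by default, distinct() redistributes elements across partitions according to their hash codes before removing duplicates. The plain-Java sketch below is a simplified model of that behaviour under stated assumptions: it does not use Spark, and the class name HashPartitionDistinctSketch, the distinct helper, and the numPartitions value of 2 are hypothetical choices for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class HashPartitionDistinctSketch {

	// Hypothetical helper (not a Spark API): hash-partition, dedupe per partition, concatenate.
	public static <T> List<T> distinct(List<T> input, int numPartitions) {
		// One bucket per simulated partition.
		List<Set<T>> partitions = new ArrayList<>();
		for (int i = 0; i < numPartitions; i++) {
			partitions.add(new LinkedHashSet<>());
		}
		// Equal elements have equal hash codes, so all copies of an element
		// land in the same bucket, where the set keeps only one of them.
		for (T item : input) {
			int bucket = Math.floorMod(item.hashCode(), numPartitions);
			partitions.get(bucket).add(item);
		}
		// Reading the buckets one after another yields an order that depends
		// on the hash codes, not on the input order.
		List<T> result = new ArrayList<>();
		for (Set<T> partition : partitions) {
			result.addAll(partition);
		}
		return result;
	}

	public static void main(String[] args) {
		List<String> data = Arrays.asList("Learn", "Apache", "Spark", "Learn", "Spark", "RDD", "Functions");
		System.out.println(distinct(data, 2));
	}
}
```

If a stable order is needed for display, one option is to bring the distinct elements to the driver with collect() and sort them there before printing.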

Spark – RDD distinct() – Scala Example

In the following example, we will find the distinct elements of an RDD using RDD.distinct() in Scala.

RDDdistinct.scala

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object RDDdistinct {
	def main(args: Array[String]): Unit = {

		/* configure spark application */
		val conf = new SparkConf().setAppName("Spark RDD Distinct Example").setMaster("local[1]")

		/* spark context */
		val sc = new SparkContext(conf)

		/* create an RDD from a sequence of strings */
		val rdd = sc.parallelize(Seq("Learn", "Apache", "Spark", "Learn", "Spark", "RDD", "Functions"))

		/* get distinct elements of the RDD */
		val rddDist = rdd.distinct()

		/* print */
		rddDist.collect().foreach(println)

		/* or save the output; Spark writes a directory named "out.txt" and fails if it already exists */
		rddDist.saveAsTextFile("out.txt")

		sc.stop()
	}
}

Conclusion

In this Spark tutorial on RDD.distinct(), we have learnt to get the unique elements of a Spark RDD, with example programs in Java and Scala.