Spark – RDD Distinct
Spark RDD Distinct : RDD<T> class provides distinct() method to pick unique elements present in the RDD. In this tutorial, we learn to get unique elements of an RDD using RDD<T>.distinct() method with the help of Java, Scala and Python examples.
To get distinct elements of an RDD, apply the function distinct on the RDD. The method returns an RDD containing unique/distinct elements.
Spark – RDD distinct() – Java Example
In this example, we will take an RDD created from a list of strings, and find the distinct of them using RDD.distinct() method.
DistinctRDD.java
import java.util.Arrays; import java.util.List; import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; public class DistinctRDD { public static void main(String[] args) { // configure spark SparkConf sparkConf = new SparkConf().setAppName("Spark RDD Distinct") .setMaster("local[2]") .set("spark.executor.memory", "2g"); // start a spark context JavaSparkContext sc = new JavaSparkContext(sparkConf); // read list to RDD List<String> data = Arrays.asList("Learn", "Apache", "Spark", "Learn", "Spark", "RDD", "Functions"); JavaRDD<String> words = sc.parallelize(data, 1); // get distinct elements of RDD JavaRDD<String> rddDistinct = words.distinct(); // print rddDistinct.foreach(item -> { System.out.println(item); }); sc.close(); } }
Output
Functions Spark Tutorial Kart Learn Apache with RDD
Spark – RDD distinct() – Scala Example
In the following example, we will find the distinct elements in an RDD using RDD.distinct() with Scala programming language.
RDDdistinct.scala
import org.apache.spark.SparkContext import org.apache.spark.SparkConf object RDDdistinct { def main(args: Array[String]) { /* configure spark application */ val conf = new SparkConf().setAppName("Spark RDD Distinct Example").setMaster("local[1]") /* spark context*/ val sc = new SparkContext(conf) /* map */ var rdd = sc.parallelize(Seq("Learn", "Apache", "Spark", "Learn", "Spark", "RDD", "Functions")); /* reduce */ var rddDist = rdd.distinct() /* print */ rddDist.collect().foreach(println) /* or save the output to file */ rddDist.saveAsTextFile("out.txt") sc.stop() } }
Conclusion
In this Spark Tutorial – Spark RDD.distinct(), we have learnt to get unique/distinct elements of Spark RDD with example programs.