Spark – RDD.filter()
Spark RDD Filter : The RDD.filter() method returns a new RDD containing only those elements that pass a filter condition (a function) given as an argument to the method. In this tutorial, we learn to filter an RDD containing Integers, and an RDD containing Tuples, with example programs.
Steps to apply filter to Spark RDD
To apply a filter to a Spark RDD:
- Create a filter function to be applied on the RDD.
- Use the RDD<T>.filter() method with the filter function passed as an argument. The filter() method returns a new RDD<T> containing only the elements for which the function returns true, as shown in the sketch after this list.
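As a minimal sketch of this pattern (assuming a JavaRDD<Integer> named rdd already exists; the even-number condition here is purely illustrative):

// keep only the even numbers (illustrative condition)
Function<Integer, Boolean> isEven = n -> (n % 2 == 0);
JavaRDD<Integer> evens = rdd.filter(isEven); // returns a new RDD; rdd itself is unchanged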
Spark – RDD.filter() – Java Example
In this example, we will take an RDD of integers and filter them using the RDD.filter() method.
FilterRDD.java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class FilterRDD {

    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Spark RDD Filter")
                .setMaster("local[2]")
                .set("spark.executor.memory", "2g");

        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // read list to RDD
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        JavaRDD<Integer> rdd = sc.parallelize(data, 1);

        // filter : where the number (rdd element) is exactly divisible by 3
        Function<Integer, Boolean> filter = k -> (k % 3 == 0);

        // apply filter on rdd, with the filter function passed as argument
        JavaRDD<Integer> rddf = rdd.filter(filter);

        // print the filtered rdd
        rddf.foreach(element -> {
            System.out.println(element);
        });

        sc.close();
    }
}
Output
3
6
9
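Note that declaring the Function in a separate variable is optional. The same predicate can be passed inline as a lambda, so the filter step above could equally be written as:

// equivalent inline form : pass the predicate directly to filter()
JavaRDD<Integer> rddf = rdd.filter(k -> (k % 3 == 0));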
Spark – RDD.filter() – Filter RDD with Tuples
In this example, we will take an RDD with Tuples as elements and filter it using the filter() method. We will filter the elements based on the condition that the string length, which is the second element in each tuple, equals 5.
FilterRDD.java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

import scala.Tuple2;

public class FilterRDD {

    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Spark RDD filter")
                .setMaster("local[2]")
                .set("spark.executor.memory", "2g");

        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // read list to RDD
        List<String> data = Arrays.asList("Learn", "Apache", "Spark", "with", "Tutorial Kart");
        JavaRDD<String> words = sc.parallelize(data, 1);

        // map each word to -> (word, word length)
        JavaPairRDD<String, Integer> wordsRDD = words.mapToPair(s -> new Tuple2<>(s, s.length()));

        // filter : where the second element in the tuple is equal to 5 (i.e., word length == 5)
        Function<Tuple2<String, Integer>, Boolean> filterFunction = w -> (w._2 == 5);

        // apply the filter on wordsRDD
        JavaPairRDD<String, Integer> rddf = wordsRDD.filter(filterFunction);

        // print filtered rdd
        rddf.foreach(item -> {
            System.out.println(item);
        });

        sc.close();
    }
}
Output
(Learn,5)
(Spark,5)
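A note on printing : in local mode, the println inside foreach() writes to the same console, but on a cluster it runs on the executors and its output does not reach the driver. If you need the filtered elements on the driver (fine here, since the result is small), one option is to collect() them first. A minimal sketch:

// collect the filtered tuples to the driver and print them there
List<Tuple2<String, Integer>> results = rddf.collect();
for (Tuple2<String, Integer> tuple : results) {
    System.out.println(tuple);
}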
Conclusion
In this Spark Tutorial – Spark RDD.filter(), we have learned how to filter the elements of a Spark RDD with example programs.