Spark RDD with custom class objects
To assign Spark RDD with custom class objects, implement the custom class with Serializable interface, create an immutable list of custom class objects, then parallelize the list with SparkContext. Parallelizing returns RDD created with custom class objects as elements.
Java Example
Following example demonstrates the creation of RDD with list of class objects.
CustomObjectsRDD.java
import java.io.Serializable; import java.util.List; import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import com.google.common.collect.ImmutableList; public class CustomObjectsRDD { public static void main(String[] args) { // configure spark SparkConf sparkConf = new SparkConf().setAppName("Print Elements of RDD") .setMaster("local[2]").set("spark.executor.memory","2g"); // start a spark context JavaSparkContext sc = new JavaSparkContext(sparkConf); // prepare list of objects List<Person> personList = ImmutableList.of( new Person("Arjun", 25), new Person("Akhil", 2)); // parallelize the list using SparkContext JavaRDD<Person> perJavaRDD = sc.parallelize(personList); for(Person person : perJavaRDD.collect()){ System.out.println(person.name); } sc.close(); } } class Person implements Serializable{ private static final long serialVersionUID = -2685444218382696366L; String name; int age; public Person(String name, int age){ this.name = name; this.age = age; } }
Output
18/02/10 21:59:10 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 18/02/10 21:59:10 INFO DAGScheduler: ResultStage 0 (collect at CustomObjectsRDD.java:29) finished in 0.223 s 18/02/10 21:59:10 INFO DAGScheduler: Job 0 finished: collect at CustomObjectsRDD.java:29, took 0.661038 s Arjun Akhil 18/02/10 21:59:10 INFO SparkContext: Invoking stop() from shutdown hook 18/02/10 21:59:10 INFO SparkUI: Stopped Spark web UI at http://192.168.1.104:4040 18/02/10 21:59:10 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
ADVERTISEMENT
Conclusion
In this Spark Tutorial – Spark RDD with custom class objects, we have learnt to initialize RDD from an immutable list of custom objects using SparkContext.parallelize(), with the help of an Example.