Spark RDD with custom class objects
To create a Spark RDD of custom class objects, implement the custom class with the Serializable interface, create an immutable list of custom class objects, then parallelize the list with the SparkContext. parallelize() returns an RDD whose elements are the custom class objects.
Java Example
The following example demonstrates the creation of an RDD from a list of custom class objects.
CustomObjectsRDD.java
import java.io.Serializable;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import com.google.common.collect.ImmutableList;

public class CustomObjectsRDD {
    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Print Elements of RDD")
                .setMaster("local[2]").set("spark.executor.memory", "2g");

        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // prepare a list of custom class objects
        List<Person> personList = ImmutableList.of(
                new Person("Arjun", 25),
                new Person("Akhil", 2));

        // parallelize the list using SparkContext
        JavaRDD<Person> perJavaRDD = sc.parallelize(personList);

        // collect the RDD elements to the driver and print their names
        for (Person person : perJavaRDD.collect()) {
            System.out.println(person.name);
        }

        sc.close();
    }
}

class Person implements Serializable {
    private static final long serialVersionUID = -2685444218382696366L;

    String name;
    int age;

    public Person(String name, int age) {
        this.name = name;
        this.age = age;
    }
}
Output
18/02/10 21:59:10 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/02/10 21:59:10 INFO DAGScheduler: ResultStage 0 (collect at CustomObjectsRDD.java:29) finished in 0.223 s
18/02/10 21:59:10 INFO DAGScheduler: Job 0 finished: collect at CustomObjectsRDD.java:29, took 0.661038 s
Arjun
Akhil
18/02/10 21:59:10 INFO SparkContext: Invoking stop() from shutdown hook
18/02/10 21:59:10 INFO SparkUI: Stopped Spark web UI at http://192.168.1.104:4040
18/02/10 21:59:10 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
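Once the RDD is created, the usual RDD transformations and actions apply to the custom class objects as well. The snippet below is a minimal sketch, not part of the original example; it assumes the same Person class and perJavaRDD variable defined above, filters the persons by age, and maps them to their names.
// A sketch assuming the Person class and perJavaRDD from the example above.
// Filter the RDD of custom objects by age.
JavaRDD<Person> adults = perJavaRDD.filter(person -> person.age >= 18);

// Map each Person object to its name.
JavaRDD<String> names = adults.map(person -> person.name);

// Collect the results to the driver and print them.
for (String name : names.collect()) {
    System.out.println(name);   // prints: Arjun
}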
Conclusion
In this Spark Tutorial – Spark RDD with custom class objects, we have learnt to initialize an RDD from an immutable list of custom objects using SparkContext.parallelize(), with the help of a Java example.