Spark – Create RDD
To create an RDD in Apache Spark, some of the possible ways are:
- Create RDD from List<T> using Spark parallelize.
- Create RDD from a text file.
- Create RDD from a JSON file.
In this tutorial, we will go through examples covering each of these approaches.
Example – Create RDD from List<T>
In this example, we will take a List of strings, and then create a Spark RDD from this list.
RDDfromList.java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
public class RDDfromList {
  public static void main(String[] args) {
    // configure spark
    SparkConf sparkConf = new SparkConf().setAppName("Create RDD from List")
        .setMaster("local[2]").set("spark.executor.memory", "2g");
    // start a spark context
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    // read list to RDD
    List<String> data = Arrays.asList("Learn", "Apache", "Spark", "with", "Tutorial Kart");
    JavaRDD<String> items = sc.parallelize(data, 1);
    // apply a function for each element of RDD
    items.foreach(item -> {
      System.out.println("* " + item);
    });
    // stop the spark context
    sc.close();
  }
}
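The second argument to parallelize sets how many partitions the RDD is split into; we passed 1 above. As a side note, here is a minimal sketch (the class name PartitionCountExample is our own) that checks the partition count with getNumPartitions():

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionCountExample {
  public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("Partition Count Example")
        .setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    List<String> data = Arrays.asList("Learn", "Apache", "Spark");
    // second argument to parallelize sets the number of partitions
    JavaRDD<String> items = sc.parallelize(data, 3);
    // getNumPartitions() returns how many partitions the RDD has
    System.out.println(items.getNumPartitions()); // prints 3
    sc.close();
  }
}
```

Choosing more partitions lets Spark process the elements in parallel across cores or executors.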
Example – Create RDD from Text file
In this example, we have data in a text file and will create an RDD from it.
ReadTextToRDD.java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
public class ReadTextToRDD {
  public static void main(String[] args) {
    // configure spark
    SparkConf sparkConf = new SparkConf().setAppName("Read Text to RDD")
        .setMaster("local[2]").set("spark.executor.memory", "2g");
    // start a spark context
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    // provide path to input text file
    String path = "data/rdd/input/sample.txt";
    // read text file to RDD
    JavaRDD<String> lines = sc.textFile(path);
    // collect RDD to the driver for printing
    for (String line : lines.collect()) {
      System.out.println(line);
    }
    // stop the spark context
    sc.close();
  }
}
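Once the lines are in an RDD, you can apply transformations before collecting. The following sketch (class name TextLineCount is our own; the file path is the same assumed sample file) counts lines with count() and splits lines into words with flatMap():

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TextLineCount {
  public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("Text Line and Word Count")
        .setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    // read text file to RDD; the path is an assumption, point it at any text file
    JavaRDD<String> lines = sc.textFile("data/rdd/input/sample.txt");
    // count() returns the number of lines in the RDD
    System.out.println("Lines: " + lines.count());
    // flatMap splits each line on spaces, producing an RDD of words
    JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
    System.out.println("Words: " + words.count());
    sc.close();
  }
}
```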
Example – Create RDD from JSON file
In this example, we will create an RDD from a JSON file.
JSONtoRDD.java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public class JSONtoRDD {
  public static void main(String[] args) {
    // configure spark
    SparkSession spark = SparkSession
        .builder()
        .appName("Spark Example - Read JSON to RDD")
        .master("local[2]")
        .getOrCreate();
    // read JSON file to a DataFrame, then convert it to an RDD of Rows
    String jsonPath = "data/employees.json";
    JavaRDD<Row> items = spark.read().json(jsonPath).toJavaRDD();
    // print each Row
    items.foreach(item -> {
      System.out.println(item);
    });
    // stop the spark session
    spark.stop();
  }
}
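Each element of the resulting RDD is a Row, whose values can be read by field name with getAs(). The following sketch (class name JSONRowFields is our own, and it assumes the JSON file has a "name" field) prints one field from each Row:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JSONRowFields {
  public static void main(String[] args) {
    SparkSession spark = SparkSession
        .builder()
        .appName("Read JSON Row Fields")
        .master("local[2]")
        .getOrCreate();
    JavaRDD<Row> rows = spark.read().json("data/employees.json").toJavaRDD();
    // getAs reads a Row value by field name;
    // "name" is an assumed field in the JSON file
    rows.foreach(row -> System.out.println(row.<String>getAs("name")));
    spark.stop();
  }
}
```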
Conclusion
In this Spark Tutorial, we have learnt to create a Spark RDD from a List, and from text and JSON files on the file system, with the help of example programs.