Read input text file to RDD
To read an input text file into an RDD, we can use the SparkContext.textFile() method.
In this tutorial, we will learn the syntax of the SparkContext.textFile() method, and how to use it in a Spark application to load data from a text file into an RDD, with the help of Java and Python examples.
Syntax of textFile()
The syntax of the textFile() method is:

```java
JavaRDD<String> textFile(String path, int minPartitions)
```

The textFile() method reads a text file from HDFS, the local file system, or any Hadoop-supported file system URI into the specified number of partitions, and returns it as an RDD of Strings.
| Parameter | Description |
|---|---|
| path | Required. The path to the input text file. |
| minPartitions | Optional. A suggested minimum number of partitions for the resulting RDD. |
Spark textFile() – Java Example
Following is a Java example where we read a local text file and load it into an RDD.
ReadTextToRDD.java
```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadTextToRDD {

    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Text to RDD")
                .setMaster("local[2]")
                .set("spark.executor.memory", "2g");

        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // provide path to input text file
        String path = "data/rdd/input/sample.txt";

        // read text file to RDD
        JavaRDD<String> lines = sc.textFile(path);

        // collect RDD for printing
        for (String line : lines.collect()) {
            System.out.println(line);
        }
    }
}
```
Input Text File
```
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
```
Output
```
17/11/28 10:33:55 INFO DAGScheduler: ResultStage 0 (collect at ReadTextToRDD.java:20) finished in 0.407 s
17/11/28 10:33:55 INFO DAGScheduler: Job 0 finished: collect at ReadTextToRDD.java:20, took 0.751794 s
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
17/11/28 10:33:55 INFO SparkContext: Invoking stop() from shutdown hook
```
Spark textFile() – Python Example
Following is a Python example where we read a local text file and load it into an RDD.
read-text-file-to-rdd.py
```python
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Read Text to RDD - Python")
    sc = SparkContext(conf=conf)

    # read input text file to RDD
    lines = sc.textFile("/home/arjun/workspace/spark/sample.txt")

    # collect the RDD to a list
    llist = lines.collect()

    # print the list
    for line in llist:
        print(line)
```
Submit this Python application to Spark using the following command:

```
~$ spark-submit /home/arjun/workspace/spark/read-text-file-to-rdd.py
```
```
17/11/28 15:03:13 INFO DAGScheduler: ResultStage 0 (collect at /home/arjun/workspace/spark/read-text-file-to-rdd.py:15) finished in 0.508 s
17/11/28 15:03:13 INFO DAGScheduler: Job 0 finished: collect at /home/arjun/workspace/spark/read-text-file-to-rdd.py:15, took 0.699556 s
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
17/11/28 15:03:13 INFO SparkContext: Invoking stop() from shutdown hook
```
Conclusion
In this Spark tutorial – Read Text File to RDD – we have learnt to read data from a text file into an RDD using the SparkContext.textFile() method, with the help of Java and Python examples.
In our next tutorial, we shall learn to Read multiple text files to single RDD.