Read input text file to RDD
To read an input text file into an RDD, we can use the SparkContext.textFile() method.
In this tutorial, we will learn the syntax of the SparkContext.textFile() method, and how to use it in a Spark application to load data from a text file into an RDD, with the help of Java and Python examples.
Syntax of textFile()
The syntax of textFile() method is
JavaRDD<String> textFile(String path, int minPartitions)
The textFile() method reads a text file from HDFS, the local file system, or any Hadoop-supported file system URI into the specified number of partitions, and returns it as an RDD of Strings.
Parameter | Description
---|---
path | Required. Path to the input text file.
minPartitions | Optional. A hint for the minimum number of partitions the resulting RDD should have.
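As a mental model, textFile() yields an RDD of Strings with one element per line of the input, with line terminators stripped. The following is a minimal pure-Python sketch of that behaviour, with no Spark involved; the temporary file simply stands in for the tutorial's sample input.

```python
import tempfile

# write a small sample file, mimicking the tutorial's input text file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("Welcome to TutorialKart\nLearn Apache Spark\nLearn to work with RDD\n")
    path = f.name

# conceptually, textFile(path) produces one String element per line,
# with the trailing newline removed
with open(path) as f:
    lines = [line.rstrip("\n") for line in f]

print(lines)
# → ['Welcome to TutorialKart', 'Learn Apache Spark', 'Learn to work with RDD']
```

The real textFile() does this lazily and in a distributed fashion, of course; nothing is read until an action such as collect() is called.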
Spark textFile() – Java Example
Following is a Java example where we read a local text file and load it into an RDD.
ReadTextToRDD.java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadTextToRDD {
    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Text to RDD")
                .setMaster("local[2]").set("spark.executor.memory", "2g");

        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // provide path to input text file
        String path = "data/rdd/input/sample.txt";

        // read text file to RDD
        JavaRDD<String> lines = sc.textFile(path);

        // collect RDD elements to the driver and print them
        for (String line : lines.collect()) {
            System.out.println(line);
        }

        // stop the spark context
        sc.close();
    }
}
Input Text File
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
Output
17/11/28 10:33:55 INFO DAGScheduler: ResultStage 0 (collect at ReadTextToRDD.java:20) finished in 0.407 s
17/11/28 10:33:55 INFO DAGScheduler: Job 0 finished: collect at ReadTextToRDD.java:20, took 0.751794 s
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
17/11/28 10:33:55 INFO SparkContext: Invoking stop() from shutdown hook
Spark textFile() – Python Example
Following is a Python example where we read a local text file and load it into an RDD.
read-text-file-to-rdd.py
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Read Text to RDD - Python")
    sc = SparkContext(conf=conf)

    # read input text file to RDD
    lines = sc.textFile("/home/arjun/workspace/spark/sample.txt")

    # collect the RDD to a list
    llist = lines.collect()

    # print the list
    for line in llist:
        print(line)

    # stop the spark context
    sc.stop()
Submit this Python application to Spark using the following command.
~$ spark-submit /workspace/spark/read-text-file-to-rdd.py
17/11/28 15:03:13 INFO DAGScheduler: ResultStage 0 (collect at /home/arjun/workspace/spark/read-text-file-to-rdd.py:15) finished in 0.508 s
17/11/28 15:03:13 INFO DAGScheduler: Job 0 finished: collect at /home/arjun/workspace/spark/read-text-file-to-rdd.py:15, took 0.699556 s
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
17/11/28 15:03:13 INFO SparkContext: Invoking stop() from shutdown hook
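Neither example above passes minPartitions; with PySpark one could write, for example, sc.textFile(path, 4) to hint that the resulting RDD should have at least four partitions. Spark actually splits text input by file byte ranges (via Hadoop input splits), but as a rough pure-Python illustration of dividing lines among a requested number of partitions, here is a sketch; the helper name is ours, not Spark's.

```python
def split_into_partitions(lines, min_partitions):
    """Divide a list of lines into roughly equal contiguous chunks.

    Illustration only: Spark partitions text input by file byte ranges,
    not by this simple slicing of an in-memory list.
    """
    size, rem = divmod(len(lines), min_partitions)
    partitions, start = [], 0
    for i in range(min_partitions):
        # spread the remainder across the first few partitions
        end = start + size + (1 if i < rem else 0)
        partitions.append(lines[start:end])
        start = end
    return partitions

sample = ["Welcome to TutorialKart", "Learn Apache Spark", "Learn to work with RDD"]
print(split_into_partitions(sample, 2))
# → [['Welcome to TutorialKart', 'Learn Apache Spark'], ['Learn to work with RDD']]
```

More partitions allow more tasks to run in parallel, which is why minPartitions matters for large input files.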
Conclusion
In this Spark Tutorial – Read Text file to RDD, we have learnt to read data from a text file into an RDD using the SparkContext.textFile() method, with the help of Java and Python examples.
In our next tutorial, we shall learn to Read multiple text files to single RDD.