Read input text file to RDD

To read an input text file into an RDD, we can use the SparkContext.textFile() method.

In this tutorial, we will learn the syntax of the SparkContext.textFile() method, and how to use it in a Spark application to load data from a text file into an RDD, with the help of Java and Python examples.

Syntax of textFile()

The syntax of the textFile() method in the Java API is:

JavaRDD<String> textFile(String path, int minPartitions)

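The textFile() method reads a text file from HDFS, the local file system, or any Hadoop-supported file system URI, and returns it as an RDD of Strings with one element per line. The minPartitions argument is a suggested minimum number of partitions, not an exact count. It is optional: textFile() can also be called with the path alone.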

Parameter        Description
path             Required. Specifies the path to the text file.
minPartitions    Optional. Suggests the minimum number of partitions the resulting RDD should have.
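
The effect of minPartitions can be checked by asking the resulting RDD how many partitions it has. The following is a minimal sketch in PySpark, not a definitive implementation. The helper names write_sample_file, load_lines and main are hypothetical and introduced only for illustration, as are the temporary file location and the local[2] master. Because minPartitions is only a suggested minimum, the printed partition count may be higher than the value passed in.

```python
import os
import tempfile


def write_sample_file(directory):
    """Write the three-line sample input used in this tutorial and return its path."""
    path = os.path.join(directory, "sample.txt")
    with open(path, "w") as f:
        f.write("Welcome to TutorialKart\nLearn Apache Spark\nLearn to work with RDD\n")
    return path


def load_lines(sc, path, min_partitions):
    """Read a text file into an RDD; return (partition count, list of lines)."""
    # minPartitions is a hint: Spark may create more partitions than requested
    lines = sc.textFile(path, minPartitions=min_partitions)
    return lines.getNumPartitions(), lines.collect()


def main():
    # imported here so the helpers above do not require Spark to be importable
    from pyspark import SparkConf, SparkContext

    # assumption: run locally with two worker threads
    conf = SparkConf().setAppName("textFile minPartitions sketch").setMaster("local[2]")
    sc = SparkContext(conf=conf)
    try:
        with tempfile.TemporaryDirectory() as tmp:
            path = write_sample_file(tmp)
            count, lines = load_lines(sc, path, min_partitions=3)
            print("partitions:", count)
            for line in lines:
                print(line)
    finally:
        # release the Spark context even if reading fails
        sc.stop()
```

To try it, call main() at the end of the script and run the script with spark-submit. Keeping load_lines() separate from the context setup means the same function can be reused with a context that is already running.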

Spark textFile() – Java Example

The following Java example reads a local text file and loads it into an RDD.

ReadTextToRDD.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadTextToRDD {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Read Text to RDD")
										.setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// provide path to input text file
		String path = "data/rdd/input/sample.txt";
		
		// read text file to RDD
		JavaRDD<String> lines = sc.textFile(path);
		
		// collect RDD for printing
		for(String line:lines.collect()){
			System.out.println(line);
		}
	}
}

Input Text File

Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD

Output

17/11/28 10:33:55 INFO DAGScheduler: ResultStage 0 (collect at ReadTextToRDD.java:20) finished in 0.407 s
17/11/28 10:33:55 INFO DAGScheduler: Job 0 finished: collect at ReadTextToRDD.java:20, took 0.751794 s
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
17/11/28 10:33:55 INFO SparkContext: Invoking stop() from shutdown hook

Spark textFile() – Python Example

The following Python example reads a local text file and loads it into an RDD.

read-text-file-to-rdd.py

import sys

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

  # create Spark context with Spark configuration
  conf = SparkConf().setAppName("Read Text to RDD - Python")
  sc = SparkContext(conf=conf)

  # read input text file to RDD
  lines = sc.textFile("/home/arjun/workspace/spark/sample.txt")

  # collect the RDD to a list
  llist = lines.collect()

  # print the list
  for line in llist:
    print(line)

Submit this Python application to Spark using the following command. The lines after the command are an excerpt of the console output.

~$ spark-submit /workspace/spark/read-text-file-to-rdd.py
17/11/28 15:03:13 INFO DAGScheduler: ResultStage 0 (collect at /home/arjun/workspace/spark/read-text-file-to-rdd.py:15) finished in 0.508 s
17/11/28 15:03:13 INFO DAGScheduler: Job 0 finished: collect at /home/arjun/workspace/spark/read-text-file-to-rdd.py:15, took 0.699556 s
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
17/11/28 15:03:13 INFO SparkContext: Invoking stop() from shutdown hook

Conclusion

In this Spark Tutorial – Read Text file to RDD, we have learnt to read data from a text file into an RDD using the SparkContext.textFile() method, with the help of Java and Python examples.

In our next tutorial, we shall learn to read multiple text files into a single RDD.