Spark RDD flatMap()

In this Spark Tutorial, we shall learn to flatMap one RDD to another. Flat-Mapping is transforming each RDD element using a function that could return multiple elements to new RDD. Simple example would be applying a flatMap to Strings and using split function to return words to new RDD.

Spark RDD flatMap

Syntax

</>
Copy
RDD.flatMap(<function>)

where <function>  is the transformation function that could return multiple elements to new RDD for each of the element of source RDD.

Java Example – Spark RDD flatMap

In this example, we will use flatMap() to convert a list of strings into a list of words. In this case, flatMap() kind of converts a list of sentences to a list of words.

RDDflatMapExample.java

</>
Copy
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDflatMapExample {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Read Text to RDD")
										.setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// provide path to input text file
		String path = "data/rdd/input/sample.txt";
		
		// read text file to RDD
		JavaRDD<String> lines = sc.textFile(path);
		
		// flatMap each line to words in the line
		JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(s.split(" ")).iterator()); 
		
		// collect RDD for printing
		for(String word:words.collect()){
			System.out.println(word);
		}
	}
}

Following is the input file we used in the above Java application.

sample.text

Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD

Run this Spark Java application, and you will get following output in the console.

Output

17/11/29 12:33:59 INFO DAGScheduler: ResultStage 0 (collect at RDDflatMapExample.java:26) finished in 0.513 s
17/11/29 12:33:59 INFO DAGScheduler: Job 0 finished: collect at RDDflatMapExample.java:26, took 0.793858 s
Welcome
to
TutorialKart
Learn
Apache
Spark
Learn
to
work
with
RDD
17/11/29 12:33:59 INFO SparkContext: Invoking stop() from shutdown hook

Python Example – Spark RDD.flatMap()

We shall implement the same use case as in the previous example, but as a Python application.

spark-rdd-flatmap-example.py

</>
Copy
import sys

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

  # create Spark context with Spark configuration
  conf = SparkConf().setAppName("Read Text to RDD - Python")
  sc = SparkContext(conf=conf)

  # read input text file to RDD
  lines = sc.textFile("/home/tutorialkart/workspace/spark/sample.txt")

  # flatMap each line to words
  words = lines.flatMap(lambda line: line.split(" "))

  # collect the RDD to a list
  llist = words.collect()

  # print the list
  for line in llist:
    print line

Run the following command in your console, from the location of your python file.

$ spark-submit spark-rdd-flatmap-example.py

Spark will submit this python application for running.

17/11/29 15:15:30 INFO DAGScheduler: ResultStage 0 (collect at /home/tutorialkart/workspace/spark/spark-rdd-flatmap-example.py:18) finished in 1.127 s
17/11/29 15:15:30 INFO DAGScheduler: Job 0 finished: collect at /home/tutorialkart/workspace/spark/spark-rdd-flatmap-example.py:18, took 1.299076 s
Welcome
to
TutorialKart
Learn
Apache
Spark
Learn
to
work
with
RDD
17/11/29 15:15:30 INFO SparkContext: Invoking stop() from shutdown hook

Conclusion

In this Spark Tutorial, we learned the syntax and examples for RDD.flatMap() method.