Spark RDD flatMap()
In this Spark Tutorial, we shall learn to flatMap one RDD to another. Flat-Mapping is transforming each RDD element using a function that could return multiple elements to new RDD. Simple example would be applying a flatMap to Strings and using split function to return words to new RDD.
Syntax
RDD.flatMap(<function>)
where <function> is the transformation function that could return multiple elements to new RDD for each of the element of source RDD.
Java Example – Spark RDD flatMap
In this example, we will use flatMap() to convert a list of strings into a list of words. In this case, flatMap() kind of converts a list of sentences to a list of words.
RDDflatMapExample.java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
public class RDDflatMapExample {
public static void main(String[] args) {
// configure spark
SparkConf sparkConf = new SparkConf().setAppName("Read Text to RDD")
.setMaster("local[2]").set("spark.executor.memory","2g");
// start a spark context
JavaSparkContext sc = new JavaSparkContext(sparkConf);
// provide path to input text file
String path = "data/rdd/input/sample.txt";
// read text file to RDD
JavaRDD<String> lines = sc.textFile(path);
// flatMap each line to words in the line
JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(s.split(" ")).iterator());
// collect RDD for printing
for(String word:words.collect()){
System.out.println(word);
}
}
}
Following is the input file we used in the above Java application.
sample.text
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
Run this Spark Java application, and you will get following output in the console.
Output
17/11/29 12:33:59 INFO DAGScheduler: ResultStage 0 (collect at RDDflatMapExample.java:26) finished in 0.513 s
17/11/29 12:33:59 INFO DAGScheduler: Job 0 finished: collect at RDDflatMapExample.java:26, took 0.793858 s
Welcome
to
TutorialKart
Learn
Apache
Spark
Learn
to
work
with
RDD
17/11/29 12:33:59 INFO SparkContext: Invoking stop() from shutdown hook
Python Example – Spark RDD.flatMap()
We shall implement the same use case as in the previous example, but as a Python application.
spark-rdd-flatmap-example.py
import sys
from pyspark import SparkContext, SparkConf
if __name__ == "__main__":
# create Spark context with Spark configuration
conf = SparkConf().setAppName("Read Text to RDD - Python")
sc = SparkContext(conf=conf)
# read input text file to RDD
lines = sc.textFile("/home/tutorialkart/workspace/spark/sample.txt")
# flatMap each line to words
words = lines.flatMap(lambda line: line.split(" "))
# collect the RDD to a list
llist = words.collect()
# print the list
for line in llist:
print line
Run the following command in your console, from the location of your python file.
$ spark-submit spark-rdd-flatmap-example.py
Spark will submit this python application for running.
17/11/29 15:15:30 INFO DAGScheduler: ResultStage 0 (collect at /home/tutorialkart/workspace/spark/spark-rdd-flatmap-example.py:18) finished in 1.127 s
17/11/29 15:15:30 INFO DAGScheduler: Job 0 finished: collect at /home/tutorialkart/workspace/spark/spark-rdd-flatmap-example.py:18, took 1.299076 s
Welcome
to
TutorialKart
Learn
Apache
Spark
Learn
to
work
with
RDD
17/11/29 15:15:30 INFO SparkContext: Invoking stop() from shutdown hook
Conclusion
In this Spark Tutorial, we learned the syntax and examples for RDD.flatMap() method.