Load data from JSON data source and execute Spark SQL query

The Apache Spark Dataset and DataFrame APIs provide an abstraction over data sources for Spark SQL. A Dataset combines the benefits of RDDs with the optimizations of Spark SQL's execution engine.

A Dataset loads a JSON data source as a distributed collection of data. A DataFrame is a Dataset with the data arranged into named columns. The flow from the JSON data source through Dataset and DataFrame to Spark SQL is shown below:

JSON -> Dataset -> DataFrame -> Spark SQL -> SQL Query

Load data from JSON file and execute SQL query

The following is a step-by-step process to load data from a JSON file and run an SQL query on the loaded data.


1. Create a Spark Session

Provide an application name and set the master to local with two threads.

SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark SQL data source JSON example")
        .master("local[2]")
        .getOrCreate();

2. Read JSON data source

SparkSession.read().json(String path) accepts either a single text file or a directory of text files and loads the data into a Dataset&lt;Row&gt;.

Dataset<Row> people = spark.read().json("path-to-json-files");
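Note that spark.read().json() expects JSON Lines format by default: one complete JSON object per line. If the file instead holds a single pretty-printed document or a JSON array spanning multiple lines, the multiLine option must be enabled. A minimal sketch, assuming a hypothetical file data/sql/json/people-multiline.json:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MultiLineJsonExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Multi-line JSON example")
                .master("local[2]")
                .getOrCreate();

        // multiLine tells Spark that a single record may span several
        // lines (e.g. a pretty-printed JSON array), instead of the
        // default one-object-per-line JSON Lines format.
        Dataset<Row> people = spark.read()
                .option("multiLine", true)
                .json("data/sql/json/people-multiline.json"); // hypothetical path

        people.printSchema();
        spark.stop();
    }
}
```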

3. Create a temporary view using the DataFrame

people.createOrReplaceTempView("people");
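A view created with createOrReplaceTempView is scoped to the current SparkSession and disappears when that session stops. If the view needs to be shared across SparkSessions in the same application, Spark also provides global temporary views, which live under the reserved global_temp database. A sketch of the difference, reusing the same JSON directory as above:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GlobalTempViewExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Global temp view example")
                .master("local[2]")
                .getOrCreate();

        Dataset<Row> people = spark.read().json("data/sql/json/");

        // Session-scoped view: dropped when this SparkSession stops.
        people.createOrReplaceTempView("people");

        // Application-scoped view: shared across SparkSessions and
        // queried through the reserved global_temp database.
        people.createOrReplaceGlobalTempView("people_global");
        spark.sql("SELECT name FROM global_temp.people_global").show();

        spark.stop();
    }
}
```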

4. Run SQL query

The temporary view can be treated as a table, with the attributes under the schema root as its columns:

root
 |-- name: string (nullable = true)
 |-- salary: long (nullable = true)

Table: people. Columns: name, salary.

Dataset<Row> namesDF = spark.sql("SELECT name FROM people WHERE salary>3500 ");
namesDF.show();
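The same query can also be written with the DataFrame API instead of an SQL string, using the built-in col function. A sketch, reading the same JSON directory; note that the filter on salary must come before the projection, since select("name") drops the salary column:

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DatasetApiQueryExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Dataset API query example")
                .master("local[2]")
                .getOrCreate();

        Dataset<Row> people = spark.read().json("data/sql/json/");

        // Filter first (salary is still in scope), then project name --
        // equivalent to: SELECT name FROM people WHERE salary > 3500
        Dataset<Row> namesDF = people
                .where(col("salary").gt(3500))
                .select("name");
        namesDF.show();

        spark.stop();
    }
}
```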

5. Stop the Spark session

Call spark.stop() to release the resources held by the session.

The complete Java program to load data from a JSON file and execute an SQL query is given below:

data/sql/json/people.json

{"name":"Michael", "salary":3000}
{"name":"Andy", "salary":4500}
{"name":"Justin", "salary":3500}
{"name":"Berta", "salary":4000}
{"name":"Raju", "salary":3000}
{"name":"Chandy", "salary":4500}
{"name":"Joey", "salary":3500}
{"name":"Mon", "salary":4000}
{"name":"Rachel", "salary":4000}

JSONsqlExample.java

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
 
public class JSONsqlExample {
 
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Java Spark SQL data source JSON example")
                .master("local[2]")
                .getOrCreate();
 
        // A JSON dataset is pointed to by path.
        // The path can be either a single text file or a directory storing text files
        Dataset<Row> people = spark.read().json("data/sql/json/");
 
        // The inferred schema can be visualized using the printSchema() method
        System.out.println("Schema\n=======================");
        people.printSchema();
 
        // Creates a temporary view using the DataFrame
        people.createOrReplaceTempView("people");
 
        // SQL statements can be run by using the sql methods provided by spark
        Dataset<Row> namesDF = spark.sql("SELECT name FROM people WHERE salary>3500 ");
        System.out.println("\n\nSQL Result\n=======================");
        namesDF.show();
 
        // stop spark session
        spark.stop();
    }
}

Output

Schema
=======================
root
 |-- name: string (nullable = true)
 |-- salary: long (nullable = true)
 
SQL Result
=======================
+------+
|  name|
+------+
|  Andy|
| Berta|
|Chandy|
|   Mon|
|Rachel|
+------+

Conclusion

In this Apache Spark tutorial, we learned to load a JSON file into a Dataset and query the data with SQL through Spark SQL.