Mastering Spark Session Creation and Configuration in Apache Spark

Sachin D N
Jul 14, 2024

Apache Spark is a powerful open-source processing engine for big data. At the heart of Spark is the Spark Session, which serves as the main entry point for working with DataFrames and Spark SQL.

Creation of Spark Session

A Spark Session is the entry point for running code on a Spark cluster through the higher-level APIs such as DataFrames and Spark SQL. Lower-level RDD operations go through the Spark Context, which the session exposes as spark.sparkContext.

Since Spark 2.0, the Spark Session acts as an umbrella, encapsulating and unifying the older contexts: Spark Context, Hive Context, and SQL Context.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark Session Example") \
    .getOrCreate()

In this code snippet, we’re using the builder pattern to create a new Spark Session. The appName method sets the name of the application, which will be displayed in the Spark web UI. The getOrCreate method returns an existing Spark Session if there’s already one in the environment, or creates a new one if necessary.
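
The reuse behaviour of getOrCreate can be checked with a small sketch. The helper names get_session and show_reuse are hypothetical, and the sketch assumes a local PySpark installation with a working Java runtime.

```python
def get_session(app_name="Spark Session Example"):
    """Return the current SparkSession, creating one only if none exists."""
    # Imported inside the function so the helper can be defined even on a
    # machine where PySpark is not installed yet.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()


def show_reuse():
    """Call the builder twice and report whether the session was reused."""
    first = get_session()
    second = get_session("A Different Name")
    # The second call should find the session created by the first one.
    print("same session object:", first is second)
    # The lower-level Spark Context is reachable from the session.
    rdd = first.sparkContext.parallelize([1, 2, 3])
    print("RDD element count:", rdd.count())
    first.stop()
```

Calling show_reuse() in a Python shell with PySpark available should report that both calls returned the same session, which is why setting a different application name on the second call does not start a second application.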

Customizing Spark Session

Apache Spark provides a variety of options to customize the Spark Session according to your needs. You can specify custom configurations for your Spark Session using the config method. This method takes two arguments: the name of the configuration property and its value.

spark = SparkSession.builder \
    .appName("Spark Session Example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
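
When several properties are needed, the config calls can be chained from a dictionary. The following is a minimal sketch, not a recommended setup: apply_settings and build_session are hypothetical helper names, and the two property values are picked only for illustration.

```python
def apply_settings(builder, settings):
    """Chain one .config(key, value) call per entry and return the builder."""
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder


# Illustrative values only; tune them for your own workload.
SETTINGS = {
    "spark.sql.shuffle.partitions": "8",
    "spark.sql.session.timeZone": "UTC",
}


def build_session():
    """Create (or reuse) a session with SETTINGS applied."""
    # Imported lazily so the helpers above work without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("Spark Session Example")
    return apply_settings(builder, SETTINGS).getOrCreate()
```

With a session from build_session(), a value can be read back with spark.conf.get("spark.sql.shuffle.partitions") to confirm that the setting took effect.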

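The master method sets the master URL for the Spark Session, which determines where the Spark application will run. For example, local[*] runs Spark on the local machine with as many worker threads as there are logical cores.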

spark = SparkSession.builder \
    .appName("Spark Session Example") \
    .master("local[*]") \
    .getOrCreate()
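
The most common master URL forms can be summarized in a small lookup. This is only an illustrative sketch: describe_master is a hypothetical helper, not part of Spark, and it deliberately ignores less common forms such as local[K,F].

```python
import re


def describe_master(url):
    """Return a short description of a Spark master URL (illustrative only)."""
    if url == "local":
        return "local mode, 1 worker thread"
    match = re.fullmatch(r"local\[(\*|\d+)\]", url)
    if match:
        threads = match.group(1)
        if threads == "*":
            return "local mode, one worker thread per logical core"
        return f"local mode, {threads} worker thread(s)"
    if url.startswith("spark://"):
        return "Spark standalone cluster"
    if url == "yarn":
        return "YARN cluster"
    if url.startswith("k8s://"):
        return "Kubernetes cluster"
    return "unrecognized master URL"


# The host name below is a placeholder.
for url in ["local", "local[4]", "local[*]", "spark://host:7077", "yarn"]:
    print(f"{url:20} -> {describe_master(url)}")
```

In practice the master is often left out of the code and passed at submission time instead, so the same script can run locally and on a cluster.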

If you’re working with Hive, you can enable Hive support using the enableHiveSupport method. This provides a Spark Session with Hive support, including connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions (UDFs).

spark = SparkSession.builder \
    .appName("Spark Session Example") \
    .enableHiveSupport() \
    .getOrCreate()

You can set the location of the Spark warehouse, which is the default directory where Spark stores the data of managed tables, using the config method with the spark.sql.warehouse.dir property.

spark = SparkSession.builder \
    .appName("Spark Session Example") \
    .config("spark.sql.warehouse.dir", "/path/to/warehouse") \
    .getOrCreate()

Spark Application Deployment Modes

Every Spark application has one driver and one or more executors. The driver coordinates the application and schedules tasks, while the executors run those tasks on the worker nodes. There are two modes for deploying Spark applications:

  1. Client Mode (Interactive Mode): The driver runs on the machine that submits the application, such as a client machine or gateway node. This mode is suitable for interactive work and debugging.
  2. Cluster Mode (Non-interactive Mode): The driver runs on a node inside the cluster, chosen by the cluster manager. This mode is suitable for running applications in production.
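
The deploy mode is usually chosen at submission time with the --deploy-mode flag of spark-submit. The sketch below only assembles the command line; spark_submit_command is a hypothetical helper, my_app.py is a placeholder script name, and YARN is assumed as the cluster manager.

```python
import shlex


def spark_submit_command(app_path, master="yarn", deploy_mode="client"):
    """Build the spark-submit argument list for the given deploy mode."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", master,
        "--deploy-mode", deploy_mode,
        app_path,
    ]


# Print the two variants so they can be copied into a terminal.
for mode in ("client", "cluster"):
    print(shlex.join(spark_submit_command("my_app.py", deploy_mode=mode)))
```

Returning a list rather than a single string keeps the arguments safe to pass to subprocess.run without shell quoting issues.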

In conclusion, understanding the creation and usage of Spark Session is crucial for leveraging the power of Apache Spark. It provides the entry point for using DataFrame and Dataset APIs and allows you to run relational queries and manipulate data. The Spark Session builder provides a variety of methods to customize your Spark Session, enabling you to effectively configure your Spark environment.

Thanks for reading, and for your patience. I hope you liked the post.

Happy Learning!!