Different Ways of Creating a DataFrame in Spark
Apache Spark is a powerful open-source processing engine built around speed, ease of use, and sophisticated analytics. One of its core data structures is the DataFrame, a distributed collection of data organized into named columns. Here are several ways to create a DataFrame in Spark; the examples below use PySpark:
Using spark.read
We can create a DataFrame from a data source file like CSV, JSON, or Parquet. Here’s an example using CSV:
df = spark.read.format("csv").option("header", "true").load(filePath)
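For example, assuming a CSV file at a hypothetical path like /data/people.csv, a slightly fuller read might look like this (inferSchema asks Spark to guess column types from the data):
filePath = "/data/people.csv"  # hypothetical path, adjust to a real file
df = (spark.read
      .format("csv")
      .option("header", "true")       # first line holds column names
      .option("inferSchema", "true")  # infer column types instead of reading everything as strings
      .load(filePath))
df.printSchema()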
Using spark.sql
We can create a DataFrame as a result of a Spark SQL query:
df = spark.sql("SELECT * FROM table_name")
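Note that spark.sql can only query tables or views that Spark knows about. As a rough, self-contained sketch, we could first register an existing DataFrame (existing_df and the column names below are illustrative) as a temporary view and then query it:
existing_df.createOrReplaceTempView("table_name")  # make the DataFrame queryable by name
df = spark.sql("SELECT column_name_1 FROM table_name WHERE column_name_2 IS NOT NULL")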
Using spark.table
We can create a DataFrame from a table in Spark’s catalog:
df = spark.table("table_name")
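This works for permanent tables in the catalog as well as temporary views. A small illustrative sketch, assuming some DataFrame df was created earlier:
df.write.mode("overwrite").saveAsTable("table_name")  # persist as a managed table
df2 = spark.table("table_name")                       # read it back from the catalog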
Using spark.range
We can create a DataFrame with a single column named id (of type long), containing elements from a range:
df = spark.range(start, end, step)
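For instance, the following produces a DataFrame whose id column contains 0, 2, 4, 6, and 8 (the end value is exclusive):
df = spark.range(0, 10, 2)  # start=0, end=10 (exclusive), step=2
df.show()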
Creating DataFrame from Local List
We can create a DataFrame from a local list, for example a list of tuples where each tuple becomes a row:
df = spark.createDataFrame(data).toDF("column_name")
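A minimal, self-contained sketch (the data and column name are made up for illustration):
data = [("Alice",), ("Bob",)]  # each one-element tuple becomes one row
df = spark.createDataFrame(data).toDF("name")
df.show()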
Creating DataFrame with Explicit Schema
We can create a DataFrame with an explicit schema:
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
    StructField("column_name_1", StringType(), True),
    StructField("column_name_2", StringType(), True)
])
df = spark.createDataFrame(data, schema)
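Here, data is any local collection whose rows match the schema, for example:
data = [("value_1a", "value_2a"), ("value_1b", "value_2b")]  # illustrative values
Providing an explicit schema avoids type inference and keeps column names and types together.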
Creating DataFrame from RDD
We can create a DataFrame from an RDD (Resilient Distributed Dataset), another fundamental data structure in Spark:
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF()
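As with local lists, toDF works when the RDD's elements are tuples or Row objects; a list of column names can also be passed to toDF. A short illustrative sketch:
data = [("Alice", 34), ("Bob", 45)]
rdd = spark.sparkContext.parallelize(data)  # distribute the local list as an RDD
df = rdd.toDF(["name", "age"])              # convert to a DataFrame with named columns
df.show()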
In conclusion, Spark provides various ways to create DataFrames to suit different needs, making it a versatile tool for big data processing and analytics.