Different Ways of Creating a DataFrame in Spark
Apache Spark is a powerful open-source processing engine built around speed, ease of use, and sophisticated analytics. One of its core data structures is the DataFrame, a distributed collection of data organized into named columns. Here are several ways to create a DataFrame in Spark; the examples below use PySpark:
Using spark.read
We can create a DataFrame from a data source file like CSV, JSON, or Parquet. Here’s an example using CSV:
df = spark.read.format("csv").option("header", "true").load(filePath)
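For example, assuming a CSV file at a hypothetical path like /data/people.csv, a slightly fuller read might look like this (inferSchema asks Spark to guess column types from the data):
filePath = "/data/people.csv"  # hypothetical path, adjust to a real file
df = (spark.read
      .format("csv")
      .option("header", "true")       # first line holds column names
      .option("inferSchema", "true")  # infer column types instead of reading everything as strings
      .load(filePath))
df.printSchema()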
Using spark.sql
We can create a DataFrame as a result of a Spark SQL query:
df = spark.sql("SELECT * FROM table_name")
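Note that spark.sql can only query tables or views that Spark knows about. As a rough, self-contained sketch, we could first register an existing DataFrame (existing_df and the column names below are illustrative) as a temporary view and then query it:
existing_df.createOrReplaceTempView("table_name")  # make the DataFrame queryable by name
df = spark.sql("SELECT column_name_1 FROM table_name WHERE column_name_2 IS NOT NULL")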
Using spark.table
We can create a DataFrame from a table in Spark’s catalog:
df = spark.table("table_name")
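This works for permanent tables in the catalog as well as temporary views. A small illustrative sketch, assuming some DataFrame df was created earlier:
df.write.mode("overwrite").saveAsTable("table_name")  # persist as a managed table
df2 = spark.table("table_name")                       # read it back from the catalog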
Using spark.range
We can create a DataFrame with a single column named id (of type long), containing elements from a range:
df = spark.range(start, end, step)
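For instance, the following produces a DataFrame whose id column contains 0, 2, 4, 6, and 8 (the end value is exclusive):
df = spark.range(0, 10, 2)  # start=0, end=10 (exclusive), step=2
df.show()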
Creating DataFrame from Local List
We can create a DataFrame from a local list, for example a list of tuples where each tuple becomes a row:
df = spark.createDataFrame(data).toDF("column_name")
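A minimal, self-contained sketch (the data and column name are made up for illustration):
data = [("Alice",), ("Bob",)]  # each one-element tuple becomes one row
df = spark.createDataFrame(data).toDF("name")
df.show()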
Creating DataFrame with Explicit Schema
We can create a DataFrame with an explicit schema:
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
    StructField("column_name_1", StringType(), True),
    StructField("column_name_2", StringType(), True)
])
df = spark.createDataFrame(data, schema)
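Here, data is any local collection whose rows match the schema, for example:
data = [("value_1a", "value_2a"), ("value_1b", "value_2b")]  # illustrative values
Providing an explicit schema avoids type inference and keeps column names and types together.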
Creating DataFrame from RDD
We can create a DataFrame from an RDD (Resilient Distributed Dataset), another fundamental data structure in Spark:
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF()
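As with local lists, toDF works when the RDD's elements are tuples or Row objects; a list of column names can also be passed to toDF. A short illustrative sketch:
data = [("Alice", 34), ("Bob", 45)]
rdd = spark.sparkContext.parallelize(data)  # distribute the local list as an RDD
df = rdd.toDF(["name", "age"])              # convert to a DataFrame with named columns
df.show()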
In conclusion, Spark provides various ways to create DataFrames to suit different needs, making it a versatile tool for big data processing and analytics.