Sachin D N
2 min read · Dec 10, 2024

Schema Evolution in Apache Spark

Schema evolution is particularly important in Big Data systems where data is stored in self-describing formats like Parquet or Avro. These formats embed the schema alongside the data, so each file carries its own structure and readers interpret it at read time rather than depending on a single fixed schema defined up front. This flexibility makes it easier to adapt to changes in the data structure over time.

➡️ Events Triggering Schema Change

Several events can bring about changes in the schema:

✅ Adding new columns/fields
✅ Dropping existing columns/fields
✅ Changing the datatypes of existing fields

These changes can be handled gracefully with schema evolution, ensuring that your data processing systems can adapt to evolving data structures.

➡️ Benefits of Schema Evolution

1️⃣ Flexibility: Schema evolution allows the data structure to change over time without requiring all existing data to be rewritten.
2️⃣ Efficiency: Because only the schema (which is far smaller than the data itself) needs to change, schema evolution saves significant storage and processing resources.
3️⃣ Robustness: Systems that support schema evolution can absorb changes in the data structure, making them more resilient to real-world conditions where schemas inevitably drift.

➡️ Example of Schema Evolution

Consider a generic example: a dataset of "products" with an initial schema containing the fields `product_id` and `product_name`. This data is loaded into a DataFrame and written out in Parquet format.

Later, a new field `product_category` is added to the dataset. The updated data now includes `product_id`, `product_name`, and `product_category`, and is loaded into a DataFrame and appended to the same Parquet location.

By default, schema merging is disabled: Spark infers the schema from only a subset of the Parquet files, so a read of the combined dataset can silently drop the new `product_category` column (or fail, depending on how the schemas conflict), leaving you with incomplete data.

To handle this, we can enable the `mergeSchema` option when reading the data: `option("mergeSchema", True)`. This tells Spark to reconcile the different schemas it finds across the Parquet files, allowing it to read the entire dataset with the evolved schema; rows written before the change simply carry nulls in the new column.

#BigData #DataEngineering #SchemaEvolution #Parquet #DataFrames #ApacheSpark #DataOptimization
