Apache Spark Logical and Physical Plans
In Apache Spark, a query goes through several steps before it runs, from parsing the query text to generating a physical plan for execution. Together, these steps are known as query planning.
1️⃣ Parsed Logical Plan (Unresolved)
The first step in query planning is parsing the query to check that its syntax is correct. If there are any syntax errors, a ParseException is thrown. At this stage, Spark checks syntax only; it cannot tell whether entities such as the table names or column names used in the query actually exist. That is why the resulting plan is called unresolved.
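To make the "syntax only" point concrete, here is a minimal sketch in plain Python. It is a toy model and not Spark's parser: the `ParseError` class, the `parse` function, and the tiny SELECT grammar are all assumptions made for illustration.

```python
import re


class ParseError(Exception):
    """Hypothetical stand-in for Spark's ParseException."""


# Toy grammar (an assumption): SELECT <cols> FROM <table> [WHERE <condition>]
_QUERY = re.compile(
    r"^\s*SELECT\s+(?P<cols>[\w\s,*]+?)\s+FROM\s+(?P<table>\w+)"
    r"(?:\s+WHERE\s+(?P<cond>.+?))?\s*;?\s*$",
    re.IGNORECASE,
)


def parse(sql: str) -> dict:
    """Check syntax only and return an unresolved plan.

    Table and column names are kept as plain strings; nothing is
    looked up in a catalog at this stage.
    """
    match = _QUERY.match(sql)
    if match is None:
        raise ParseError(f"Syntax error in query: {sql!r}")
    return {
        "table": match.group("table"),
        "columns": [c.strip() for c in match.group("cols").split(",")],
        "filter": match.group("cond"),
        "resolved": False,
    }


# Parsing succeeds even though nobody has checked that the table exists.
plan = parse("SELECT id, name FROM table_that_may_not_exist WHERE id > 10")
print(plan)
```

A misspelled keyword such as `SELEC id FROM t` fails here with `ParseError`, while a well-formed query over a missing table passes, which mirrors the behaviour described above.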
2️⃣ Analyzed Logical Plan (Resolved)
The next step is to check whether all the entities used in the query (table names, column names, views, etc.) actually exist. Spark does this by cross-checking them against the Catalog. If, for instance, a table named in the query doesn't exist, an AnalysisException is thrown. Once every entity has been resolved, the result is a Resolved Logical Plan.
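The catalog lookup can be sketched the same way. Again this is a toy model under stated assumptions: `AnalysisError`, `CATALOG`, and `analyze` are hypothetical names, and the catalog is just a dictionary from table name to column names.

```python
class AnalysisError(Exception):
    """Hypothetical stand-in for Spark's AnalysisException."""


# Hypothetical catalog: table name -> set of known column names.
CATALOG = {"customers": {"id", "name", "country"}}


def analyze(unresolved: dict, catalog: dict) -> dict:
    """Resolve table and column names against the catalog."""
    table = unresolved["table"]
    if table not in catalog:
        raise AnalysisError(f"Table or view not found: {table}")
    missing = [c for c in unresolved["columns"] if c not in catalog[table]]
    if missing:
        raise AnalysisError(f"Cannot resolve column(s) {missing} in table {table}")
    # Every name was found, so the plan can be marked as resolved.
    return {**unresolved, "resolved": True}


resolved = analyze({"table": "customers", "columns": ["id", "name"]}, CATALOG)
print(resolved)
```

Passing a plan that refers to an unknown table (for example `orders`) or an unknown column (for example `email`) raises `AnalysisError` instead of returning a resolved plan.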
3️⃣ Optimized Logical Plan
Spark's Catalyst optimizer then applies a set of predefined rules to the resolved logical plan, rewriting it into an equivalent plan that is cheaper to execute. Examples of these optimizations include:
📌 Predicate Pushdown: Filters are pushed down, that is, applied as early as possible in the plan. This ensures that later operations are performed only on relevant data.
📌 Combining multiple projections into a single projection.
📌 Combining multiple filters into a single operation.
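The three rules above can be sketched as rewrites on a small plan tree. This is a toy model, not Catalyst's code: the tuple-based plan encoding and the `rewrite` and `optimize` functions are assumptions, and the rules are only valid here because the projections simply select columns and the filters only touch columns that pass through unchanged.

```python
# Plan nodes (hypothetical encoding):
#   ("scan", table)
#   ("filter", predicate, child)
#   ("project", columns, child)


def rewrite(plan):
    """Apply one bottom-up pass of the three toy rules."""
    op = plan[0]
    if op == "scan":
        return plan
    # Rewrite the child first, then try to rewrite this node.
    arg, child = plan[1], rewrite(plan[2])
    if op == "filter" and child[0] == "filter":
        # Combine two adjacent filters into a single conjunction.
        return ("filter", f"({child[1]}) AND ({arg})", child[2])
    if op == "filter" and child[0] == "project":
        # Predicate pushdown: move the filter below the projection.
        return ("project", child[1], rewrite(("filter", arg, child[2])))
    if op == "project" and child[0] == "project":
        # Collapse two projections: the outer column list wins.
        return ("project", arg, child[2])
    return (op, arg, child)


def optimize(plan):
    """Repeat passes until the plan stops changing (a fixed point)."""
    while True:
        new_plan = rewrite(plan)
        if new_plan == plan:
            return plan
        plan = new_plan


unoptimized = (
    "filter", "age > 30",
    ("project", ("name", "age"),
     ("filter", "country = 'DE'",
      ("project", ("name", "age", "country"),
       ("scan", "people")))),
)
optimized = optimize(unoptimized)
print(optimized)
```

For this input, the two filters end up merged directly above the scan and the two projections collapse into one, so the five-node plan shrinks to project, filter, scan.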
4️⃣ Physical Plan
The final step is to generate a physical plan, which decides how each logical operation is actually executed, for example which join or aggregation strategy gives the best query performance. Examples include:
📌 Whether to use Hash Aggregate or Sort Aggregate.
📌 Which type of join to use: Broadcast Hash Join, Sort-Merge Join, or Shuffle Hash Join.
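As a rough illustration of the join decision, the sketch below picks a strategy from estimated table sizes. It is a simplified heuristic, not Spark's planner: `choose_join_strategy` and its "three times smaller" rule are assumptions for illustration. The 10 MB default mirrors Spark's `spark.sql.autoBroadcastJoinThreshold` setting, and the `prefer_sort_merge` flag loosely mirrors `spark.sql.join.preferSortMergeJoin`; join hints, join types, and non-equi joins are ignored.

```python
# Spark's default for spark.sql.autoBroadcastJoinThreshold is 10 MB.
AUTO_BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_join_strategy(
    left_bytes: int,
    right_bytes: int,
    threshold: int = AUTO_BROADCAST_THRESHOLD_BYTES,
    prefer_sort_merge: bool = True,
) -> str:
    """Pick a join strategy for an equi-join from estimated input sizes."""
    smaller, larger = sorted((left_bytes, right_bytes))
    if smaller <= threshold:
        # One side is small enough to ship to every executor.
        return "BroadcastHashJoin"
    if not prefer_sort_merge and smaller * 3 <= larger:
        # Assumption: hash the clearly smaller side after shuffling both.
        return "ShuffledHashJoin"
    # Fallback for two large inputs: shuffle, sort, then merge.
    return "SortMergeJoin"


GB = 1024 ** 3
print(choose_join_strategy(5 * 1024 * 1024, 50 * GB))
print(choose_join_strategy(20 * GB, 50 * GB))
print(choose_join_strategy(1 * GB, 50 * GB, prefer_sort_merge=False))
```

To see the plans Spark actually produced for a real query, `df.explain(mode="extended")` in PySpark (Spark 3.0 and later) prints the parsed, analyzed, and optimized logical plans followed by the physical plan.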
By understanding these plans, we can better understand how Spark executes queries and how to optimize them for better performance.
#ApacheSpark #DataEngineering #BigData #QueryPlanning #LogicalPlan #PhysicalPlan #SparkOptimization #SparkPerformance