Sachin D N
2 min read · May 3, 2024

Understanding Hive Tables: Managed and External 🐝

Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, querying, and analysis. Hive offers a SQL-like interface (HiveQL) for querying data stored in the databases and file systems that integrate with Hadoop. Tables are a core concept in Hive, and they come in two types: Managed Tables and External Tables.

Managed Tables 📦

🔹 With a Managed table, both the data and the metadata are managed and controlled by Hive.
🔸 When you drop a managed table (DROP TABLE), Hive deletes both the metadata and the data stored in HDFS.
🔹 Managed tables are a good fit when Hive is the sole owner of the data and no external tools or technologies access it.

Here's an example of creating and loading data into a Hive Managed Table:

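-- Replace "datatype" with real Hive types such as INT or STRING.
CREATE TABLE IF NOT EXISTS managed_table (
  column1 datatype,
  column2 datatype,
  …
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Without the LOCAL keyword, LOAD DATA INPATH moves (does not copy)
-- the file from its HDFS path into the table's directory.
LOAD DATA INPATH '/path/to/your/data.csv' INTO TABLE managed_table;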
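
To see how Hive has registered the table and what happens when it is dropped, a short sketch along these lines can help. It reuses the managed_table name from the example above and assumes default settings; the exact output layout differs between Hive versions.

```sql
-- Show the table's metadata. For a managed table the output includes
-- "Table Type: MANAGED_TABLE" and a Location under the Hive warehouse directory.
DESCRIBE FORMATTED managed_table;

-- Dropping a managed table removes the metadata and the data files.
-- If HDFS trash is enabled, the files are moved to trash rather than erased at once.
DROP TABLE IF EXISTS managed_table;

-- PURGE skips the trash, so the data cannot be recovered afterwards.
-- DROP TABLE IF EXISTS managed_table PURGE;
```

Running DESCRIBE FORMATTED before a DROP TABLE is a cheap way to check whether the data will go away with the table.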

External Tables 🌍

🔹 With an External table, only the metadata is managed and controlled by Hive. The data lives at an external location, usually in distributed storage such as HDFS or S3.
🔸 When you drop an external table, Hive deletes only the metadata. The data remains intact in the external storage.
🔹 External tables are a good fit when other tools or technologies access the data. This avoids the risk of losing data through an accidental drop.

Here's an example of creating a Hive External Table:

CREATE EXTERNAL TABLE IF NOT EXISTS external_table (
  column1 datatype,
  column2 datatype,
  …
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
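
The statement above has no LOCATION clause, so Hive places the table under its default warehouse path. External tables are more commonly pointed at a directory that already holds the data. The sketch below shows that pattern; the table name sales_ext, its columns, and the path /data/shared/sales are hypothetical and only there for illustration.

```sql
-- Hypothetical table, columns, and path, used only for illustration.
CREATE EXTERNAL TABLE IF NOT EXISTS sales_ext (
  order_id INT,
  amount   DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/shared/sales';  -- a directory, not a single file

-- Removes only the metastore entry; the files under /data/shared/sales remain.
DROP TABLE IF EXISTS sales_ext;
```

Re-running the same CREATE EXTERNAL TABLE statement over the same LOCATION makes the data queryable again, since the files were never touched.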

When to use Managed Tables? ✅

Managed tables are best used when:

📌 Hive is the sole owner and manager of the data.
📌 No other external tools or technologies are accessing the data.
📌 You want Hive to handle the lifecycle of the data along with the metadata.

When to use External Tables? ✅

External tables are best used when:

📌 The data is shared across multiple platforms, not just Hive.
📌 The files behind the table are read or written by tools outside Hive.
📌 You want to control the lifecycle of the data yourself, independently of the table definition.
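
If a table was created as managed and later needs to be protected from DROP TABLE, one commonly used approach is to flip its EXTERNAL table property. This is a sketch only: it reuses the managed_table name from earlier and assumes the cluster permits the conversion. Newer Hive releases, where managed tables are transactional, may reject it, so it is worth checking on a test table first.

```sql
-- Mark an existing managed table as external so that a later DROP TABLE
-- leaves its files in place. The property name and value are written in
-- upper case because some Hive versions are case-sensitive here.
ALTER TABLE managed_table SET TBLPROPERTIES ('EXTERNAL' = 'TRUE');

-- Check the result: "Table Type" should now read EXTERNAL_TABLE.
DESCRIBE FORMATTED managed_table;
```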

Conclusion 🎓

Understanding the difference between Managed and External tables is crucial when working with Hive. The choice depends on the specific use case and on how the data is accessed and managed. Always remember, Hive is a powerful tool in the Hadoop ecosystem for processing structured and semi-structured data. 🐝📊📈

#Hive #Hadoop #BigData #DataWarehouse #ManagedTables #ExternalTables #ApacheHive #DataAnalytics #SQL #DataStorage #Metadata #DistributedStorage #FaultTolerance #DataWarehousing
