Different File Formats in Big Data
When designing a solution architecture for big data, how data is stored in the backend is a crucial consideration. Two important factors that play a major role in data storage are file formats and compression techniques.
1️⃣ Why Do We Need Different File Formats?
Different file formats are needed for:
✅ Saving storage
✅ Faster processing
✅ Reduced time for I/O operations
There are several file formats available that provide one or more of the following features:
- Faster reads
- Faster writes
- Splittability
- Schema evolution support
- Support for advanced compression techniques
The file format that best meets the project's requirements is chosen for processing.
2️⃣ Two Broad Categories of File Formats
➡️ Row Based
In row-based file formats, all the column values of a record are stored together, followed by the values of the next record, and so on.
- 📌 Used when fast writes are a requirement, since appending a new row to a row-based layout is easy (see the sketch below).
- 📌 Slower reads: reading a subset of columns is inefficient because the entire dataset has to be scanned.
- 📌 Provides less effective compression, since values of different data types are interleaved within each record.
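As a rough illustration in plain Python (a toy layout, not any real format's on-disk encoding), here is how records sit in a row-based layout, and why appending is cheap:

```python
# Two records: (id, name, age)
records = [(1, "alice", 34), (2, "bob", 29)]

# Row-based layout: all values of one record are stored contiguously.
row_layout = [value for record in records for value in record]
print(row_layout)  # [1, 'alice', 34, 2, 'bob', 29]

# Appending a new record only touches the end of the storage.
row_layout.extend((3, "carol", 41))
```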
➡️ Column Based
In column-based file formats, the values of a single column across all records are stored together, followed by the values of the next column for all records, and so on.
- 📌 Efficient reads when only a subset of columns is needed: because of the way the data is laid out, just the relevant columns can be read without going through the entire dataset.
- 📌 Slower writes, as even a single new record requires updating the storage in multiple places, one per column (see the sketch below).
- 📌 Provides very good compression, since all the values stored together belong to the same column and share the same data type.
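Continuing the same toy illustration, the column-based layout stores each column contiguously; reading one column is trivial, while appending a single record touches every column:

```python
records = [(1, "alice", 34), (2, "bob", 29)]

# Column-based layout: all values of one column are stored contiguously.
col_layout = [list(column) for column in zip(*records)]
print(col_layout)  # [[1, 2], ['alice', 'bob'], [34, 29]]

# Reading a single column never touches the other columns.
ages = col_layout[2]

# Appending one record requires writing into every column's storage.
for column, value in zip(col_layout, (3, "carol", 41)):
    column.append(value)
```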
3️⃣ File Formats Not Suited for Big Data Processing
➡️ Text files like CSV
📍 Stores all values internally as strings/text, and thereby consumes a lot of memory for storage and processing.
📍 If numeric operations like addition or subtraction have to be performed on numeric-looking values that are internally stored as strings, they first have to be cast to the desired types such as integer, long, or date (see the sketch below).
📍 Casting/conversion is a costly and time-consuming operation.
📍 The data size is larger, so the network bandwidth required to transfer the data is also higher.
📍 Since the data size is larger, I/O operations also take a lot of time.
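A minimal sketch of the casting problem using Python's built-in csv module (the data here is made up for illustration):

```python
import csv
import io

# A CSV file holds everything as text, even numeric-looking values.
data = io.StringIO("id,amount\n1,250\n2,175\n")
rows = list(csv.DictReader(data))
print(type(rows[0]["amount"]))  # <class 'str'>

# Any arithmetic requires casting every value first - costly at scale.
total = sum(int(row["amount"]) for row in rows)
print(total)  # 425
```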
➡️ XML and JSON
📍 All the disadvantages of the text file format also apply to XML and JSON files.
📍 Since they carry the schema (tags/keys) inline with every record, these file formats are bulky (see the sketch below).
📍 These file formats are not splittable, which means a single file cannot be processed in parallel.
📍 A lot of I/O is involved.
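To see why JSON is bulky, note that the field names travel with every single record; a minimal sketch with the standard json module:

```python
import json

records = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
payload = json.dumps(records)
print(payload)
# [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
# "id" and "name" are repeated for every record, inflating the file size.
```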
4️⃣ Specialized File Formats for Big Data Domain
There are three main file formats well suited for big data problems; a quick sketch of writing each one follows the list:
✅ PARQUET
✅ AVRO
✅ ORC
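As a minimal sketch of producing each format in Python, assuming the pyarrow and fastavro packages are available (the file names and the tiny table are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.orc as orc
from fastavro import writer, parse_schema

table = pa.table({"id": [1, 2], "name": ["alice", "bob"]})

# PARQUET and ORC: columnar formats, written via pyarrow.
pq.write_table(table, "data.parquet")
orc.write_table(table, "data.orc")

# AVRO: a row-based format, written via fastavro with an explicit schema.
schema = parse_schema({
    "type": "record",
    "name": "Row",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})
with open("data.avro", "wb") as f:
    writer(f, schema, table.to_pylist())

# Columnar payoff: read back only the columns you need.
names = pq.read_table("data.parquet", columns=["name"])
```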
#BigData #DataEngineering #FileFormats #CompressionTechniques #BigDataOptimization