
Apache Iceberg vs Apache Parquet: Metadata Deep Dive


Apache Iceberg is a table format for huge analytic tables. Iceberg brings the ability to treat your cloud storage data like SQL tables and makes it possible for query engines to operate on your cloud data concurrently.

Apache Parquet is a column-oriented data file format designed for efficient data storage and retrieval.

These two technologies can work together.  You can have an Iceberg table that consists of Parquet files.  This post isn’t an Iceberg vs Parquet comparison, as the two do different things.  However, they both collect metadata about your data.

While driving adoption of Iceberg at my company, I couldn't answer a question about what Iceberg does differently from Parquet when collecting metadata.  One of Iceberg's selling points is that it collects metadata about each file, making queries faster by enabling files or chunks of files to be skipped.  This is known as pruning.

So, what is the difference between how query engines can use Parquet's metadata and how they can use Iceberg's?  What metadata differs between the two?  How can Iceberg be better than plain Parquet when it comes to pruning?
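The idea behind stats-based pruning can be sketched in a few lines. This is a toy model, not any engine's real implementation: the file names and fare ranges are made up, and a real engine would read these stats from the Parquet footer or the Iceberg manifest rather than a hard-coded list.

```python
from dataclasses import dataclass

# Hypothetical per-file stats, standing in for what both Parquet footers
# and Iceberg manifests record: a min/max range per column per file.
@dataclass
class FileStats:
    path: str
    fare_min: float
    fare_max: float

def prune(files, lo, hi):
    """Keep only files whose [min, max] range can overlap [lo, hi]."""
    return [f for f in files if f.fare_max >= lo and f.fare_min <= hi]

files = [
    FileStats("part-0.parquet", fare_min=0.0, fare_max=25.0),
    FileStats("part-1.parquet", fare_min=20.0, fare_max=90.0),
]

# A predicate like WHERE fare_amount BETWEEN 50 AND 60 only needs part-1;
# part-0's max of 25.0 proves it cannot contain a matching row.
print([f.path for f in prune(files, 50.0, 60.0)])  # ['part-1.parquet']
```

The whole question of this post is where those stats live and how many files you must open to read them.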

The Experiment

I grabbed one Parquet file from NYC's taxi dataset.  This is not ideal for a benchmark or a true comparison because it's only one file, and with a single file you won't see the effects of pruning when you query it.  But in this experiment I just wanted to explore the metadata specifically, so that part is irrelevant.

I launched a Spark shell with:

$SPARK_HOME/bin/spark-shell --packages="org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.0" \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.iceberg_catalog=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.iceberg_catalog.type=hadoop \
    --conf spark.sql.catalog.iceberg_catalog.warehouse=$PWD/warehouse

This created a local Iceberg catalog in my file system and launched a Spark shell with the Iceberg dependencies.

Then I ran this code to create an Iceberg table on this Parquet file:

val df = spark.read.parquet("yellow_tripdata_2023-01.parquet")
df.createOrReplaceTempView("temp_view")
spark.sql("create table iceberg_catalog.db.yellow_taxi using iceberg as select * from temp_view")

I also used parquet-tools on my Parquet file to inspect the footer and metadata of the file.  The resulting metadata from both Iceberg and Parquet are in this GitHub Repo: https://github.com/clettieri/Parquet-vs-Iceberg.

Parquet Metadata

The Parquet file does contain metadata like column names, types, counts, min/max, and null counts in the footer.  As summarized here:

Row group 0:  count: 3066766  15.54 B records  start: 4  total(compressed): 45.453 MB total(uncompressed):79.356 MB 
--------------------------------------------------------------------------------
                       type      encodings count     avg size   nulls   min / max
VendorID               INT64     G _ R     3066766   0.11 B     0       "1" / "2"
tpep_pickup_datetime   INT64     G _ R_ F  3066766   3.72 B     0       "2008-12-31T23:01:42.000000" / "2023-02-01T00:56:53.000000"
tpep_dropoff_datetime  INT64     G _ R_ F  3066766   3.83 B     0       "2009-01-01T14:29:11.000000" / "2023-02-02T09:28:47.000000"
passenger_count        DOUBLE    G _ R     3066766   0.20 B     71743   "-0.0" / "9.0"
trip_distance          DOUBLE    G _ R     3066766   1.40 B     0       "-0.0" / "258928.15"
RatecodeID             DOUBLE    G _ R     3066766   0.07 B     71743   "1.0" / "99.0"
store_and_fwd_flag     BINARY    G _ R     3066766   0.01 B     71743   "N" / "Y"
PULocationID           INT64     G _ R     3066766   0.74 B     0       "1" / "265"
DOLocationID           INT64     G _ R     3066766   1.00 B     0       "1" / "265"
payment_type           INT64     G _ R     3066766   0.13 B     0       "0" / "4"
fare_amount            DOUBLE    G _ R     3066766   1.01 B     0       "-900.0" / "1160.1"
extra                  DOUBLE    G _ R     3066766   0.25 B     0       "-7.5" / "12.5"
mta_tax                DOUBLE    G _ R     3066766   0.03 B     0       "-0.5" / "53.16"
tip_amount             DOUBLE    G _ R     3066766   1.13 B     0       "-96.22" / "380.8"
tolls_amount           DOUBLE    G _ R     3066766   0.12 B     0       "-65.0" / "196.99"
improvement_surcharge  DOUBLE    G _ R     3066766   0.02 B     0       "-1.0" / "1.0"
total_amount           DOUBLE    G _ R     3066766   1.60 B     0       "-751.0" / "1169.4"
congestion_surcharge   DOUBLE    G _ R     3066766   0.08 B     71743   "-2.5" / "2.5"
airport_fee            DOUBLE    G _ R     3066766   0.08 B     71743   "-1.25" / "1.25"
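All of these stats live in the footer at the end of the Parquet file: the file's last 8 bytes are a 4-byte little-endian footer length followed by the magic bytes `PAR1`, and an engine seeks there to find the metadata. A minimal sketch of that layout, using placeholder bytes in place of the real Thrift-encoded footer:

```python
import struct

# A Parquet file ends with: <footer metadata> <4-byte LE footer length> "PAR1".
# We fake the footer with zero bytes just to illustrate the tail layout.
footer = b"\x00" * 123                                  # stand-in for Thrift metadata
tail = footer + struct.pack("<I", len(footer)) + b"PAR1"

magic = tail[-4:]                                       # must be b"PAR1"
footer_len = struct.unpack("<I", tail[-8:-4])[0]        # how far back the footer starts
assert magic == b"PAR1"
print(footer_len)  # 123
```

This is why checking a Parquet file's stats always means opening the file: the reader has to fetch at least the tail before it can decide whether to skip the data.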

Iceberg Metadata

In Iceberg, a snapshot points to a manifest list.  The manifest list stores the snapshot's list of manifest files, along with the range of values for each partition field and some summary info like row counts.  Each manifest file in turn contains more detailed metadata and the paths to individual Parquet files.  In my example there is only one manifest file, but when you look at it you see basically the same info as the Parquet footer (counts, sizes, min/max, nulls).

From the Iceberg documentation:

Iceberg first filters manifests using the partition value ranges in the manifest list. Then, it reads each manifest to get data files. With this scheme, the manifest list acts as an index over the manifest files, making it possible to plan without reading all manifests. Manifest files include a tuple of partition data and column-level stats for each data file.
During planning, query predicates are automatically converted to predicates on the partition data and applied first to filter data files. Next, column-level value counts, null counts, lower bounds, and upper bounds are used to eliminate files that cannot match the query predicate.

Here is what the contents of the manifest file look like:

{"status":1,"snapshot_id":{"long":320058616475059124},"data_file":{"file_path":"/Users/chris.lettieri/warehouse/db/yellow_taxi/data/00004-15-5e171517-2fc6-4cf7-8a83-ffc0b2f417c8-00001.parquet","file_format":"PARQUET","partition":{},"record_count":3066766,"file_size_in_bytes":48719399,"block_size_in_bytes":67108864,"column_sizes":{"array":[{"key":1,"value":374167},{"key":2,"value":11698526},{"key":3,"value":12009850},{"key":4,"value":641073},{"key":5,"value":4306404},{"key":6,"value":253907},{"key":7,"value":52224},{"key":8,"value":2223718},{"key":9,"value":3082435},{"key":10,"value":437749},{"key":11,"value":3138099},{"key":12,"value":785262},{"key":13,"value":115199},{"key":14,"value":3529007},{"key":15,"value":392137},{"key":16,"value":81744},{"key":17,"value":4932342},{"key":18,"value":285209},{"key":19,"value":278301}]},"value_counts":{"array":[{"key":1,"value":3066766},{"key":2,"value":3066766},{"key":3,"value":3066766},{"key":4,"value":3066766},{"key":5,"value":3066766},{"key":6,"value":3066766},{"key":7,"value":3066766},{"key":8,"value":3066766},{"key":9,"value":3066766},{"key":10,"value":3066766},{"key":11,"value":3066766},{"key":12,"value":3066766},{"key":13,"value":3066766},{"key":14,"value":3066766},{"key":15,"value":3066766},{"key":16,"value":3066766},{"key":17,"value":3066766},{"key":18,"value":3066766},{"key":19,"value":3066766}]},"null_value_counts":{"array":[{"key":1,"value":0},{"key":2,"value":0},{"key":3,"value":0},{"key":4,"value":71743},{"key":5,"value":0},{"key":6,"value":71743},{"key":7,"value":71743},{"key":8,"value":0},{"key":9,"value":0},{"key":10,"value":0},{"key":11,"value":0},{"key":12,"value":0},{"key":13,"value":0},{"key":14,"value":0},{"key":15,"value":0},{"key":16,"value":0},{"key":17,"value":0},{"key":18,"value":71743},{"key":19,"value":71743}]},"nan_value_counts":{"array":[{"key":4,"value":0},{"key":5,"value":0},{"key":6,"value":0},{"key":11,"value":0},{"key":12,"value":0},{"key":13,"value":0},{"key":14,"value":0},{"key":15,"value":
0},{"key":16,"value":0},{"key":17,"value":0},{"key":18,"value":0},{"key":19,"value":0}]},"lower_bounds":{"array":[{"key":1,"value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"€¡ñ°__\u0004\u0000"},{"key":3,"value":"Àûá¥l_\u0004\u0000"},{"key":4,"value":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":5,"value":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":6,"value":"\u0000\u0000\u0000\u0000\u0000\u0000ð?"},{"key":7,"value":"N"},{"key":8,"value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":9,"value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":10,"value":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":11,"value":"\u0000\u0000\u0000\u0000\u0000 ŒÀ"},{"key":12,"value":"\u0000\u0000\u0000\u0000\u0000\u0000\u001EÀ"},{"key":13,"value":"\u0000\u0000\u0000\u0000\u0000\u0000à¿"},{"key":14,"value":"®Gáz\u0014\u000EXÀ"},{"key":15,"value":"\u0000\u0000\u0000\u0000\u0000@PÀ"},{"key":16,"value":"\u0000\u0000\u0000\u0000\u0000\u0000ð¿"},{"key":17,"value":"\u0000\u0000\u0000\u0000\u0000x‡À"},{"key":18,"value":"\u0000\u0000\u0000\u0000\u0000\u0000\u0004À"},{"key":19,"value":"\u0000\u0000\u0000\u0000\u0000\u0000ô¿"}]},"upper_bounds":{"array":[{"key":1,"value":"\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"@\u001Fì˜ó\u0005\u0000"},{"key":3,"value":"Àñ\u00191´ó\u0005\u0000"},{"key":4,"value":"\u0000\u0000\u0000\u0000\u0000\u0000\"@"},{"key":5,"value":"3333›\u000FA"},{"key":6,"value":"\u0000\u0000\u0000\u0000\u0000ÀX@"},{"key":7,"value":"Y"},{"key":8,"value":"\t\u0001\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":9,"value":"\t\u0001\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":10,"value":"\u0004\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":11,"value":"fffff 
’@"},{"key":12,"value":"\u0000\u0000\u0000\u0000\u0000\u0000)@"},{"key":13,"value":"\u0014®Gáz”J@"},{"key":14,"value":"ÍÌÌÌÌÌw@"},{"key":15,"value":"Ház\u0014®Ÿh@"},{"key":16,"value":"\u0000\u0000\u0000\u0000\u0000\u0000ð?"},{"key":17,"value":"š™™™™E’@"},{"key":18,"value":"\u0000\u0000\u0000\u0000\u0000\u0000\u0004@"},{"key":19,"value":"\u0000\u0000\u0000\u0000\u0000\u0000ô?"}]},"key_metadata":null,"split_offsets":{"array":[4]},"sort_order_id":{"int":0}}}

This is an Avro file converted to JSON.  Each "key" is the field ID of a column in the table, and the lower/upper bound values are the raw binary encodings of each column's min and max.
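Those bounds look like garbage above because Iceberg stores them as raw binary (little-endian single-value serialization per the Iceberg spec), not as readable strings. For a `DOUBLE` column that means 8 little-endian IEEE 754 bytes, which round-trip like so (using the `trip_distance` upper bound from the Parquet table as the example value):

```python
import struct

# Iceberg serializes a double bound as 8 little-endian IEEE 754 bytes.
raw = struct.pack("<d", 258928.15)      # what lands in "upper_bounds"
value = struct.unpack("<d", raw)[0]     # what a reader decodes back out
print(len(raw), value)  # 8 258928.15
```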

Conclusion

While Iceberg doesn't appear to collect any metadata beyond what can be found in a Parquet file, it still provides performance benefits over using Parquet alone.  With a folder of Parquet files, your query engine needs to check each file's metadata before knowing whether it can skip reading that file.  This can be expensive because you have to list and open all the files.

With Iceberg’s metadata located in the manifest files, plus partition pruning on the manifest list, you get two layers of indices into your data files.  Your query engine only needs to read these to know which data files to scan.  With Iceberg, we skip entire files without ever opening them to check their footers.
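Those two layers can be sketched as a toy scan planner. The class names, partition field, and stats below are illustrative only, loosely modeled on Iceberg's layout rather than its actual data structures:

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    path: str
    fare_min: float
    fare_max: float

@dataclass
class Manifest:
    # Partition range this manifest covers, e.g. pickup month (illustrative).
    month_min: int
    month_max: int
    files: list

def plan_scan(manifests, month, fare_lo, fare_hi):
    """Layer 1: drop manifests via partition ranges in the manifest list.
    Layer 2: drop data files via column stats in the surviving manifests."""
    survivors = [m for m in manifests if m.month_min <= month <= m.month_max]
    return [f.path
            for m in survivors
            for f in m.files
            if f.fare_max >= fare_lo and f.fare_min <= fare_hi]

manifests = [
    Manifest(1, 1, [DataFile("jan-0.parquet", 0.0, 50.0),
                    DataFile("jan-1.parquet", 40.0, 900.0)]),
    Manifest(2, 2, [DataFile("feb-0.parquet", 0.0, 900.0)]),
]

# WHERE month = 1 AND fare_amount > 100: one manifest is pruned by its
# partition range, then one file by its stats -- no Parquet file opened.
print(plan_scan(manifests, month=1, fare_lo=100.0, fare_hi=float("inf")))
```

The key point is that planning touched only the metadata tree; the Parquet files themselves stayed closed until the engine knew exactly which ones to read.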

While the metadata is similar between the two, Iceberg offers a range of additional benefits over simply using Parquet files in a folder.