Introducing Apache Iceberg materialized views in AWS Glue Data Catalog

Hundreds of thousands of customers build artificial intelligence and machine learning (AI/ML) and analytics applications on AWS, frequently transforming data through multiple stages for improved query performance—from raw data to processed datasets to final analytical tables. Data engineers must solve complex problems, including detecting what data has changed in base tables, writing and maintaining transformation logic, scheduling and orchestrating workflows across dependencies, provisioning and managing compute infrastructure, and troubleshooting failures while monitoring pipeline health. Consider an ecommerce company where data engineers need to continuously merge clickstream logs with orders data for analytics. Each transformation requires building robust change detection mechanisms, writing complex joins and aggregations, coordinating multiple workflow steps, scaling compute resources appropriately, and maintaining operational oversight—all while supporting data quality and pipeline reliability. This complexity demands months of dedicated engineering effort and ongoing maintenance, making data transformation costly and time-intensive for organizations seeking to unlock insights from their data.

To address these challenges, AWS announced a new materialized view capability for Apache Iceberg tables in the AWS Glue Data Catalog. The new materialized view capability simplifies data pipelines and accelerates data lake query performance. A materialized view is a managed table in the AWS Glue Data Catalog that stores the pre-computed results of a query in Iceberg format and is incrementally updated to reflect changes in the underlying datasets. This alleviates the need to build and maintain complex data pipelines to generate transformed datasets, and it accelerates query performance. Apache Spark engines across Amazon Athena, Amazon EMR, and AWS Glue support the new materialized views and intelligently rewrite queries to use them, speeding up performance while reducing compute costs.

In this post, we show you how Iceberg materialized views work and how to get started.

How Iceberg materialized views work

Iceberg materialized views offer a simple, managed solution built on familiar SQL syntax. Instead of building complex pipelines, you can create materialized views using standard SQL queries from Spark, transforming data with aggregates, filters, and joins without writing custom data pipelines. The AWS Glue Data Catalog automatically handles change detection, incremental updates, and source table monitoring, refreshing materialized views as new data arrives and alleviating the need for manual pipeline orchestration. Data transformations run on fully managed compute infrastructure, removing the burden of provisioning, scaling, or maintaining servers.

The resulting pre-computed data is stored as Iceberg tables in an Amazon Simple Storage Service (Amazon S3) general purpose bucket or in Amazon S3 Tables buckets within your account, making transformed data immediately accessible to multiple query engines, including Athena, Amazon Redshift, and the AWS optimized Spark runtime. Spark engines across Athena, Amazon EMR, and AWS Glue support automatic query rewrite functionality that intelligently uses materialized views, delivering automatic performance improvements for data processing jobs and interactive notebook queries.

In the following sections, we walk through the steps to create, query, and refresh materialized views.

Prerequisites

To follow along with this post, you must have an AWS account.

To run the instructions on Amazon EMR, complete the following steps to configure the cluster:

  1. Launch an Amazon EMR cluster with release 7.12.0 or higher.
  2. Connect to the primary node of your Amazon EMR cluster using SSH, and run the following command to start a Spark application with the required configurations:
    spark-sql \
      --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
      --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog \
      --conf spark.sql.catalog.glue_catalog.type=glue \
      --conf spark.sql.catalog.glue_catalog.warehouse=s3://amzn-s3-demo-bucket/warehouse \
      --conf spark.sql.catalog.glue_catalog.glue.region=us-east-1 \
      --conf spark.sql.catalog.glue_catalog.glue.id=123456789012 \
      --conf spark.sql.catalog.glue_catalog.glue.account-id=123456789012 \
      --conf spark.sql.catalog.glue_catalog.client.region=us-east-1 \
      --conf spark.sql.catalog.glue_catalog.glue.lakeformation-enabled=true \
      --conf spark.sql.optimizer.answerQueriesWithMVs.enabled=true \
      --conf spark.sql.defaultCatalog=glue_catalog
      

To run the instructions on AWS Glue for Spark, complete the following steps to configure the job:

  1. Create a job with AWS Glue version 5.1 or higher.
  2. Configure the following job parameter:
    1. Key: --conf
    2. Value: spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
  3. Configure your job with the following script:
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
            .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
            .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
            .config("spark.sql.catalog.glue_catalog.type", "glue")
            .config("spark.sql.catalog.glue_catalog.warehouse", "s3://amzn-s3-demo-bucket/warehouse")
            .config("spark.sql.catalog.glue_catalog.glue.region", "us-east-1")
            .config("spark.sql.catalog.glue_catalog.glue.id", "123456789012")
            .config("spark.sql.catalog.glue_catalog.glue.account-id", "123456789012")
            .config("spark.sql.catalog.glue_catalog.client.region", "us-east-1")
            .config("spark.sql.catalog.glue_catalog.glue.lakeformation-enabled", "true")
            .config("spark.sql.optimizer.answerQueriesWithMVs.enabled", "true")
            .config("spark.sql.defaultCatalog", "glue_catalog")
            .getOrCreate()
    )
  4. Run the following queries using Spark SQL to set up a base table. In AWS Glue, you can run them through spark.sql("QUERY STATEMENT").
    CREATE DATABASE IF NOT EXISTS iceberg_mv;
    
    USE iceberg_mv;
    
    CREATE TABLE IF NOT EXISTS base_tbl (
        id INT,
        customer_name STRING,
        amount INT,
        order_date DATE);
        
    INSERT INTO base_tbl VALUES (1, 'John Doe', 150, DATE('2025-12-01')), (2, 'Jane Smith', 200, DATE('2025-12-02')), (3, 'Bob Johnson', 75, DATE('2025-12-03'));
    
    SELECT * FROM base_tbl;

In the subsequent sections, we create a materialized view with this base table.

If you want to store your materialized views in Amazon S3 Tables instead of a general purpose Amazon S3 bucket, refer to Appendix 1 at the end of this post for the configuration details.

Create a materialized view

To create a materialized view, run the following command:

CREATE MATERIALIZED VIEW mv
AS SELECT
    customer_name, 
    COUNT(*) as mv_order_count, 
    SUM(amount) as mv_total_amount 
FROM glue_catalog.iceberg_mv.base_tbl
GROUP BY customer_name;

After you create a materialized view, Spark’s in-memory metadata cache needs time to populate with information about the new materialized view. During this cache population period, queries against the base table run normally without using the materialized view. After the cache is fully populated (typically within tens of seconds), Spark automatically detects that the materialized view can satisfy the query and rewrites it to use the pre-computed results instead, improving performance.

To see this behavior, run the following EXPLAIN command immediately after creating the materialized view:

EXPLAIN EXTENDED
SELECT customer_name, COUNT(*) as mv_order_count, SUM(amount) as mv_total_amount 
FROM base_tbl
GROUP BY customer_name;

The following output shows the initial result before cache population:

== Parsed Logical Plan ==
'Aggregate ['customer_name], ['customer_name, 'COUNT(1) AS mv_order_count#0, 'SUM('amount) AS mv_total_amount#1]
+- 'UnresolvedRelation [base_tbl] , [], false

== Analyzed Logical Plan ==
customer_name: string, mv_order_count: bigint, mv_total_amount: bigint
Aggregate [customer_name#8], [customer_name#8, count(1) AS mv_order_count#0L, sum(amount#9) AS mv_total_amount#1L]
+- SubqueryAlias glue_catalog.iceberg_mv.base_tbl
   +- RelationV2[id#7, customer_name#8, amount#9, order_date#10] glue_catalog.iceberg_mv.base_tbl glue_catalog.iceberg_mv.base_tbl

== Optimized Logical Plan ==
Aggregate [customer_name#8], [customer_name#8, count(1) AS mv_order_count#0L, sum(amount#9) AS mv_total_amount#1L]
+- RelationV2[customer_name#8, amount#9] glue_catalog.iceberg_mv.base_tbl

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[customer_name#8], functions=[count(1), sum(amount#9)], output=[customer_name#8, mv_order_count#0L, mv_total_amount#1L], schema specialized)
   +- Exchange hashpartitioning(customer_name#8, 1000), ENSURE_REQUIREMENTS, [plan_id=19]
      +- HashAggregate(keys=[customer_name#8], functions=[partial_count(1), partial_sum(amount#9)], output=[customer_name#8, count#27L, sum#29L], schema specialized)
         +- BatchScan glue_catalog.iceberg_mv.base_tbl[customer_name#8, amount#9] glue_catalog.iceberg_mv.base_tbl (branch=null) [filters=, groupedBy=, pushedLimit=None] RuntimeFilters: []

In this initial execution plan, Spark scans the base_tbl directly (BatchScan glue_catalog.iceberg_mv.base_tbl) and runs aggregations (COUNT and SUM) on the raw data. This is the behavior before the materialized view metadata cache is populated.

After waiting tens of seconds for the metadata cache to populate, run the same EXPLAIN command again. The following output shows the primary differences in the query optimization plan after cache population:

== Optimized Logical Plan ==
Aggregate [customer_name#97], [customer_name#97, coalesce(sum(mv_order_count#98L), 0) AS mv_order_count#72L, sum(mv_total_amount#99L) AS mv_total_amount#73L]
+- RelationV2[customer_name#97, mv_order_count#98L, mv_total_amount#99L] glue_catalog.iceberg_mv.mv

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[customer_name#97], functions=[sum(mv_order_count#98L), sum(mv_total_amount#99L)], output=[customer_name#97, mv_order_count#72L, mv_total_amount#73L], schema specialized)
   +- Exchange hashpartitioning(customer_name#97, 1000), ENSURE_REQUIREMENTS, [plan_id=51]
      +- HashAggregate(keys=[customer_name#97], functions=[partial_sum(mv_order_count#98L), partial_sum(mv_total_amount#99L)], output=[customer_name#97, sum#113L, sum#115L], schema specialized)
         +- BatchScan glue_catalog.iceberg_mv.mv[customer_name#97, mv_order_count#98L, mv_total_amount#99L] glue_catalog.iceberg_mv.mv (branch=null) [filters=, groupedBy=, pushedLimit=None] RuntimeFilters: []

After the cache is populated, Spark now scans the materialized view (BatchScan glue_catalog.iceberg_mv.mv) instead of the base table. The query has been automatically rewritten to read from the pre-computed aggregated data in the materialized view. The output specifically shows the aggregation functions now simply sum the pre-computed values (sum(mv_order_count) and sum(mv_total_amount)) rather than recalculating COUNT and SUM from raw data.
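The rewrite is valid because COUNT and SUM are re-aggregable: summing the per-group pre-computed values gives the same answer as aggregating the raw rows. The following plain-Python sketch illustrates that equivalence on the sample data (this is illustrative only, not the Spark optimizer's implementation):

```python
from collections import defaultdict

# The three base_tbl rows from the walkthrough: (customer_name, amount).
base_tbl = [("John Doe", 150), ("Jane Smith", 200), ("Bob Johnson", 75)]

def aggregate_raw(rows):
    """COUNT(*) and SUM(amount) per customer over raw rows."""
    out = defaultdict(lambda: [0, 0])  # name -> [count, total]
    for name, amount in rows:
        out[name][0] += 1
        out[name][1] += amount
    return {k: tuple(v) for k, v in out.items()}

def aggregate_from_mv(mv_rows):
    """Re-aggregate pre-computed view rows: COUNT(*) becomes
    SUM(mv_order_count) and SUM(amount) becomes SUM(mv_total_amount)."""
    out = defaultdict(lambda: [0, 0])
    for name, order_count, total in mv_rows:
        out[name][0] += order_count
        out[name][1] += total
    return {k: tuple(v) for k, v in out.items()}

# Build the materialized view contents, then verify both plans agree.
mv = [(name, cnt, total) for name, (cnt, total) in aggregate_raw(base_tbl).items()]
assert aggregate_from_mv(mv) == aggregate_raw(base_tbl)
```

Because the view already holds one row per group here, re-aggregating it touches far fewer rows than scanning the base table, which is where the performance gain comes from.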

Create a materialized view with a scheduled automatic refresh

By default, a newly created materialized view contains the initial query results. It’s not automatically updated when the underlying base table data changes. To keep your materialized view synchronized with the base table data, you can configure automatic refresh schedules. To enable automatic refresh, use the REFRESH EVERY clause when creating the materialized view. This clause accepts a time interval and unit, so you can specify how frequently the materialized view is updated.

The following example creates a materialized view that automatically refreshes every 24 hours:

CREATE MATERIALIZED VIEW mv
REFRESH EVERY 24 HOURS
AS SELECT
    customer_name, 
    COUNT(*) as mv_order_count, 
    SUM(amount) as mv_total_amount 
FROM glue_catalog.iceberg_mv.base_tbl
GROUP BY customer_name;

You can configure the refresh interval using any of the following time units: SECONDS, MINUTES, HOURS, or DAYS. Choose an appropriate interval based on your data freshness requirements and query patterns.

If you prefer more control over when your materialized view updates, or need to refresh it outside of the scheduled intervals, you can trigger manual refreshes at any time. We provide detailed instructions on manual refresh options, including full and incremental refresh, later in this post.

Query a materialized view

To query a materialized view on your Amazon EMR cluster and retrieve its aggregated data, you can use a standard SELECT statement:

SELECT * FROM mv;

This query retrieves all rows from the materialized view. The output shows the aggregated customer order counts and total amounts. The result displays three customers with their respective metrics:

-- Result
Jane Smith    1    200
Bob Johnson    1    75
John Doe    1    150

Additionally, you can query the same materialized view from Athena SQL. The following screenshot shows the same query run on Athena and the resulting output.

Refresh a materialized view

You can refresh materialized views using two refresh types: full refresh or incremental refresh. Full refresh re-computes the entire materialized view from all base table data, and is ideal when you need guaranteed consistency or after significant data changes. Incremental refresh processes only the changes since the last refresh, making it faster and less costly when you need frequent, timely updates. The following examples show both refresh types.

To use full refresh, complete the following steps:

  1. Insert three new records into the base table to simulate new data arriving:
    INSERT INTO base_tbl VALUES 
    (4, 'Jane Smith', 350, DATE('2025-11-29')), 
    (5, 'Bob Johnson', 100, DATE('2025-11-30')), 
    (6, 'Kwaku Mensah', 40, DATE('2025-12-01'));
  2. Query the materialized view to verify it still shows the old aggregated values:
    SELECT * FROM mv;
    
    -- Result
    Jane Smith    1    200
    Bob Johnson    1    75
    John Doe    1    150
  3. Run a full refresh of the materialized view using the following command:
    REFRESH MATERIALIZED VIEW mv FULL;
  4. Query the materialized view again to verify the aggregated values now include the new records:
    SELECT * FROM mv;
    
    -- Result
    Jane Smith    2    550 -- Updated
    Bob Johnson    2    175 -- Updated
    John Doe    1    150
    Kwaku Mensah    1    40 -- Added
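A full refresh simply re-runs the defining query over all base table rows. As a sanity check on the numbers above, here is a small plain-Python recomputation over the six base rows (illustrative only, not the managed refresh engine):

```python
# All six base_tbl rows after the second INSERT: (customer_name, amount).
rows = [
    ("John Doe", 150), ("Jane Smith", 200), ("Bob Johnson", 75),
    ("Jane Smith", 350), ("Bob Johnson", 100), ("Kwaku Mensah", 40),
]

# Recompute COUNT(*) and SUM(amount) per customer from scratch.
mv = {}
for name, amount in rows:
    cnt, total = mv.get(name, (0, 0))
    mv[name] = (cnt + 1, total + amount)

assert mv["Jane Smith"] == (2, 550)    # Updated
assert mv["Bob Johnson"] == (2, 175)   # Updated
assert mv["John Doe"] == (1, 150)
assert mv["Kwaku Mensah"] == (1, 40)   # Added
```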

To use incremental refresh, complete the following steps:

  1. Enable incremental refresh by setting the following Spark configuration property:
    SET spark.sql.optimizer.incrementalMVRefresh.enabled=true;
  2. Insert two additional records into the base table:
    INSERT INTO base_tbl VALUES 
    (7, 'Jane Smith', 120, DATE('2025-11-28')), 
    (8, 'Kwaku Mensah', 90, DATE('2025-12-02'));
  3. Run an incremental refresh using the REFRESH command without the FULL clause. To verify if incremental refresh is enabled, refer to Appendix 2 at the end of this post.
    REFRESH MATERIALIZED VIEW mv;
  4. Query the materialized view to confirm the incremental changes are reflected in the aggregated results:
    SELECT * FROM mv;
    
    --Result
    Jane Smith    3    670 -- Updated
    Bob Johnson    2    175
    John Doe    1    150
    Kwaku Mensah    2    130 -- Updated
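Conceptually, an incremental refresh merges only the delta rows into the stored aggregates instead of rescanning the base table. The following plain-Python sketch shows that merge for the two newly inserted rows (the actual mechanics are managed by AWS Glue and may differ):

```python
# Materialized view state after the full refresh:
# customer_name -> (mv_order_count, mv_total_amount).
mv = {
    "Jane Smith": (2, 550),
    "Bob Johnson": (2, 175),
    "John Doe": (1, 150),
    "Kwaku Mensah": (1, 40),
}

# Only the rows inserted since the last refresh are processed.
delta = [("Jane Smith", 120), ("Kwaku Mensah", 90)]

for name, amount in delta:
    cnt, total = mv.get(name, (0, 0))
    mv[name] = (cnt + 1, total + amount)

assert mv["Jane Smith"] == (3, 670)    # Updated
assert mv["Kwaku Mensah"] == (2, 130)  # Updated
assert mv["Bob Johnson"] == (2, 175)   # Unchanged
```

Because the merge only touches the groups present in the delta, its cost scales with the size of the change rather than the size of the base table.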

In addition to using Spark SQL, you can also trigger manual refreshes through AWS Glue APIs when you need updates outside your scheduled intervals. Run the following AWS CLI command:

$ aws glue start-materialized-view-refresh-task-run \
    --catalog-id <ACCOUNT_ID> \
    --database-name <DATABASE_NAME> \
    --table-name <MV_TABLE_NAME>

The AWS Lake Formation console displays refresh history for API-triggered updates. Open your materialized view to see the refresh type (INCREMENTAL or FULL), start and end times, status, and more:

You have learned how to use Iceberg materialized views to make your data processing and queries more efficient. You created a materialized view using Spark on Amazon EMR, queried it from both Amazon EMR and Athena, and used two refresh mechanisms: full refresh and incremental refresh. Iceberg materialized views help you transform and optimize your data pipelines effortlessly.

Considerations

There are important aspects to consider for optimal usage of the capability:

  • We introduced new SQL syntax to manage materialized views in the AWS optimized Spark runtime engine only. These new SQL commands are available in Spark version 3.5.6 and above across Athena, Amazon EMR, and AWS Glue. Open source Spark is not supported.
  • Materialized views are eventually consistent with base tables. When source tables change, the materialized views are updated through background refresh processes as defined by users in the refresh schedule at creation. During the refresh window, queries directly accessing materialized views might see outdated data. However, customers who need immediate access to the most up-to-date datasets can run a manual refresh with a simple REFRESH MATERIALIZED VIEW SQL command.

Clean up

To avoid incurring future charges, clean up the resources you created during this walkthrough:

  1. Run the following commands to delete a materialized view and tables:
    DROP TABLE mv PURGE;
    -- Or, DROP MATERIALIZED VIEW mv;
    
    DROP TABLE base_tbl PURGE;
    -- If necessary, delete the database by DROP DATABASE iceberg_mv;
  2. For Amazon EMR, terminate the Amazon EMR cluster.
  3. For AWS Glue, delete the AWS Glue job.

Conclusion

This post demonstrated how Iceberg materialized views facilitate efficient data lake operations on AWS. The new materialized view capability simplifies data pipelines and improves query performance by storing pre-computed results that are automatically updated as base tables change. You can create materialized views using familiar SQL syntax, using both full and incremental refresh mechanisms to maintain data consistency. This solution alleviates the need for complex pipeline maintenance while providing seamless integration with AWS services like Athena, Amazon EMR, and AWS Glue. The automatic query rewrite functionality further optimizes performance by intelligently utilizing materialized views when applicable, making it a powerful tool for organizations looking to streamline their data transformation workflows and accelerate query performance.

Appendix 1: Spark configuration to use Amazon S3 Tables storing Apache Iceberg materialized views

As mentioned earlier in this post, materialized views are stored as Iceberg tables either in an Amazon S3 general purpose bucket or in Amazon S3 Tables buckets within your account. When you want to use Amazon S3 Tables as the storage location for your materialized views instead of a general purpose Amazon S3 bucket, you must configure Spark with the Amazon S3 Tables catalog.

The difference from the standard AWS Glue Data Catalog configuration shown in the prerequisites section is the glue.id parameter format. For Amazon S3 Tables, use the format <account-id>:s3tablescatalog/<s3-tables-bucket-name> instead of just the account ID:

spark-sql \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.s3t_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.s3t_catalog.type=glue \
  --conf spark.sql.catalog.s3t_catalog.warehouse="s3://amzn-s3-demo-bucket/warehouse" \
  --conf spark.sql.catalog.s3t_catalog.glue.region="us-east-1" \
  --conf spark.sql.catalog.s3t_catalog.glue.id="123456789012:s3tablescatalog/amzn-s3-demo-table-bucket" \
  --conf spark.sql.catalog.s3t_catalog.glue.account-id=123456789012 \
  --conf spark.sql.catalog.s3t_catalog.client.region="us-east-1" \
  --conf spark.sql.catalog.s3t_catalog.glue.lakeformation-enabled=true \
  --conf spark.sql.optimizer.answerQueriesWithMVs.enabled=true \
  --conf spark.sql.defaultCatalog=s3t_catalog

After you configure Spark with these settings, you can create and manage materialized views using the same SQL commands shown in this post, and the materialized views are stored in your Amazon S3 Tables bucket.
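Because the glue.id value for Amazon S3 Tables is longer and easy to mistype, you may find a small helper useful when generating Spark configurations programmatically. The function below (`s3_tables_glue_id` is a hypothetical helper, not part of any AWS SDK) simply builds the string in the required `<account-id>:s3tablescatalog/<s3-tables-bucket-name>` format:

```python
def s3_tables_glue_id(account_id: str, table_bucket: str) -> str:
    """Build the glue.id value for an Amazon S3 Tables catalog in the
    format <account-id>:s3tablescatalog/<s3-tables-bucket-name>."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit AWS account ID")
    if not table_bucket:
        raise ValueError("table_bucket must not be empty")
    return f"{account_id}:s3tablescatalog/{table_bucket}"

# Matches the value used in the spark-sql example above.
print(s3_tables_glue_id("123456789012", "amzn-s3-demo-table-bucket"))
# 123456789012:s3tablescatalog/amzn-s3-demo-table-bucket
```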

Appendix 2: Verify refreshing a materialized view with Spark SQL

Run the following SHOW TBLPROPERTIES command in Spark SQL and inspect the lastRefreshType property to check which refresh method was used:

SHOW TBLPROPERTIES mv;

+-------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
|key                            |value                                                                                                                             |
+-------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
|IMV_ansiEnabled                |false                                                                                                                             |
|IMV_catalogInfo                |[{"catalogId":"123456789012","catalogName":"glue_catalog"}]                                                                       |
|IMV_mvCatalogID                |123456789012                                                                                                                      |
|IMV_mvNamespace                |iceberg_mv                                                                                                                        |
|IMV_region                     |us-east-1                                                                                                                         |
|IMV_sparkVersion               |3.5.6-amzn-1                                                                                                                      |
|current-snapshot-id            |5750703934418352571                                                                                                               |
|format                         |iceberg/parquet                                                                                                                   |
|format-version                 |2                                                                                                                                 |
|isMaterializedView             |true                                                                                                                              |
|lastRefreshType                |INCREMENTAL                                                                                                                       |
|subObjects                     |[{"Version":"4887707562550190856","DatabaseName":"iceberg_mv","Region":"us-east-1","CatalogId":"123456789012","Name":"base_tbl"}] |
|tableVersionToken              |*********(redacted)                                                                                                               |
|viewOriginalText               |SELECT\ncustomer_name, \nCOUNT(*) as mv_order_count, \nSUM(amount) as mv_total_amount \nFROM base_tbl\nGROUP BY customer_name     |
|viewVersionId                  |5750703934418352571                                                                                                               |
|viewVersionToken               |*********(redacted)                                                                                                               |
|write.parquet.compression-codec|zstd                                                                                                                              |
+-------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
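If you collect these key/value rows programmatically (for example, from spark.sql("SHOW TBLPROPERTIES mv").collect()), checking the refresh type reduces to a dictionary lookup. A minimal sketch, using hardcoded pairs standing in for the collected rows:

```python
# Abbreviated key/value pairs from the SHOW TBLPROPERTIES output above.
props = {
    "isMaterializedView": "true",
    "lastRefreshType": "INCREMENTAL",
    "format": "iceberg/parquet",
}

# Confirm the table is a materialized view, then read the refresh type.
assert props.get("isMaterializedView") == "true"
last_refresh = props.get("lastRefreshType", "UNKNOWN")
print(f"Last refresh type: {last_refresh}")
```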
