The Amazon SageMaker Lakehouse Architecture now supports Tag-Based Access Control for federated catalogs

The Amazon SageMaker lakehouse architecture has expanded its tag-based access control (TBAC) capabilities to include federated catalogs. This enhancement extends beyond the default AWS Glue Data Catalog resources to encompass Amazon S3 Tables and Amazon Redshift data warehouses. TBAC is also supported on federated catalogs from data sources such as Amazon DynamoDB, MySQL, PostgreSQL, SQL Server, Oracle, Amazon DocumentDB, Google BigQuery, and Snowflake. TBAC provides sophisticated permission management that uses tags to create logical groupings of catalog resources, enabling administrators to implement fine-grained access controls across their entire data landscape without managing individual resource-level permissions.

Traditional data access management often requires manual assignment of permissions at the resource level, creating significant administrative overhead. TBAC solves this by introducing an automated, inheritance-based permission model. When administrators apply tags to data resources, access permissions are automatically inherited, eliminating the need for manual policy modifications when new tables are added. This streamlined approach not only reduces administrative burden but also enhances security consistency across the data ecosystem.
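The inheritance idea can be illustrated in a few lines of plain Python. This is an illustrative toy model only, not Lake Formation's actual evaluation engine: tags applied at the database level flow down to tables, so a table added later is covered by existing tag-based grants without any policy change.

```python
# Toy model of LF-Tag inheritance: tables inherit the tags of their
# parent database unless they override a key explicitly.
def effective_tags(db_tags, table_tags=None):
    """Table-level tags override database-level tags key by key."""
    merged = dict(db_tags)
    merged.update(table_tags or {})
    return merged

db_tags = {"Domain": "sales", "Sensitivity": "false"}

# A table added later with no tags of its own inherits the database tags,
# so a grant keyed on Domain=sales covers it automatically.
new_table = effective_tags(db_tags)
assert new_table == {"Domain": "sales", "Sensitivity": "false"}

# A table that re-tags sensitive data overrides only that one key.
sensitive = effective_tags(db_tags, {"Sensitivity": "true"})
assert sensitive["Domain"] == "sales"
assert sensitive["Sensitivity"] == "true"
```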

TBAC can be set up through the AWS Lake Formation console, and the resources it protects are accessible using Amazon Redshift, Amazon Athena, Amazon EMR, AWS Glue, and Amazon SageMaker Unified Studio. This makes it valuable for organizations managing complex data landscapes with multiple data sources and large datasets. TBAC is especially beneficial for enterprises implementing data mesh architectures, maintaining regulatory compliance, or scaling their data operations across multiple departments. Furthermore, TBAC enables efficient data sharing across different accounts, making it easier to maintain secure collaboration.

In this post, we illustrate how to get started with fine-grained access control of S3 Tables and Redshift tables in the lakehouse using TBAC. We also show how to access these lakehouse tables using your choice of analytics services, such as Athena, Redshift, and Apache Spark in Amazon EMR Serverless in Amazon SageMaker Unified Studio.

Solution overview

For illustration, we consider a fictional company called Example Retail Corp, as covered in the blog post Accelerate your analytics with Amazon S3 Tables and Amazon SageMaker Lakehouse. Example Retail’s leadership has decided to use the SageMaker lakehouse architecture to unify data across S3 Tables and their Redshift data warehouse. With this lakehouse architecture, they can now conduct analyses across their data to identify at-risk customers, understand the impact of personalized marketing campaigns on customer churn, and develop targeted retention and sales strategies.

Alice is a data administrator with the AWS Identity and Access Management (IAM) role LHAdmin in Example Retail Corp, and she wants to implement tag-based access control to scale permissions across their data lake and data warehouse resources. She is using S3 Tables with Iceberg transactional capability to achieve scalability as updates are streamed across billions of customer interactions, while providing the same durability, availability, and performance characteristics that Amazon S3 is known for. She already has a Redshift namespace, which contains historical and current data about sales, customer prospects, and churn information. Alice supports an extended team of developers, engineers, and data scientists who require access to the data environment to develop business insights, dashboards, ML models, and knowledge bases. This team includes:

  • Bob, a data steward with IAM role DataSteward, is the domain owner and manages access to the S3 Tables and warehouse data. He enables other teams to build reports that are shared with leadership.
  • Charlie, a data analyst with IAM role DataAnalyst, builds interactive reports and dashboards that track customer prospects and their conversions across multiple touchpoints, and makes those available to thousands of sales team members.
  • Doug, a BI engineer with IAM role BIEngineer, builds ML forecasting models for sales growth using the pipeline of customer conversions across multiple touchpoints, and makes those available to finance and planning teams.

Alice decides to use the SageMaker lakehouse architecture to unify data across S3 Tables and Redshift data warehouse. Bob can now bring his domain data into one place and manage access to multiple teams requesting access to his data. Charlie can quickly build Amazon QuickSight dashboards and use his Redshift and Athena expertise to provide quick query results. Doug can build Spark-based processing with AWS Glue or Amazon EMR to build ML forecasting models.

Alice’s goal is to use TBAC to make fine-grained access control much more scalable, because permissions can be granted on many resources at once and are updated automatically when tags for resources are added, changed, or removed. The following diagram illustrates the solution architecture.

Alice, as lakehouse admin, and Bob, as data steward, determine that the following high-level steps are needed to deploy the solution:

  1. Create an S3 Tables bucket and enable integration with the Data Catalog. This will make the resources available under the federated catalog s3tablescatalog in the lakehouse architecture with Lake Formation for access control. Create a namespace and a table under the table bucket where the data will be stored.
  2. Create a Redshift cluster with tables, publish your data warehouse to the Data Catalog, and create a catalog registering the namespace. This will make the resources available under a federated catalog in the lakehouse architecture with Lake Formation for access control.
  3. Delegate permissions to create tags and grant permissions on Data Catalog resources to DataSteward.
  4. As DataSteward, define a tag ontology based on the use case and create LF-Tags. Assign these LF-Tags to resources (databases or tables) to logically group lakehouse resources for sharing based on access patterns.
  5. Share the S3 Tables catalog table and Redshift table using tag-based access control to DataAnalyst, who uses Athena for analysis and Redshift Spectrum for generating the report.
  6. Share the S3 Tables catalog table and Redshift table using tag-based access control to BIEngineer, who uses Spark in EMR Serverless to further process the datasets.

The data steward defines the LF-Tags and assigns them to resources as follows:

  • Domain = sales, Sensitivity = false: S3 Tables table customer, non-sensitive columns: c_salutation, c_preferred_cust_flag, c_first_sales_date_sk, c_customer_sk, c_login, c_current_cdemo_sk, c_current_hdemo_sk, c_current_addr_sk, c_customer_id, c_last_review_date_sk, c_birth_month, c_birth_country, c_birth_day, c_first_shipto_date_sk
  • Domain = sales, Sensitivity = true: S3 Tables table customer, sensitive columns: c_first_name, c_last_name, c_email_address, c_birth_year
  • Domain = sales, Sensitivity = false: Redshift table sales.store_sales
The following table summarizes the tag expression that is granted to roles for resource access:

  • Bob (DataSteward): granted Super user on the catalogs, giving admin access to customer and store_sales.
  • Charlie (DataAnalyst): granted the expression Domain = sales and Sensitivity = false, giving access to non-sensitive data in the sales domain: customer (non-sensitive columns) and store_sales.
  • Doug (BIEngineer): granted the expression Domain = sales, giving access to all datasets in the sales domain: customer and store_sales.
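The grants above can be sketched as a small matching model in plain Python (an illustrative toy, not Lake Formation's evaluation engine): a principal's tag expression matches a resource when every key/value pair in the expression is present on the resource's tags.

```python
# Toy model of LF-Tag expression evaluation against tagged resources.
def matches(grant_expression, resource_tags):
    """True when every key/value in the grant is present on the resource."""
    return all(resource_tags.get(k) == v for k, v in grant_expression.items())

# Resource tags from the assignment table above.
customer_nonsensitive = {"Domain": "sales", "Sensitivity": "false"}
customer_sensitive = {"Domain": "sales", "Sensitivity": "true"}
store_sales = {"Domain": "sales", "Sensitivity": "false"}

# Grant expressions for the two analyst personas.
data_analyst = {"Domain": "sales", "Sensitivity": "false"}  # Charlie
bi_engineer = {"Domain": "sales"}                           # Doug

# Charlie sees only non-sensitive sales data.
assert matches(data_analyst, customer_nonsensitive)
assert not matches(data_analyst, customer_sensitive)

# Doug's broader expression matches everything in the sales domain.
assert matches(bi_engineer, customer_sensitive)
assert matches(bi_engineer, store_sales)
```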

Prerequisites

To follow along with this post, complete the following prerequisite steps:

  1. Have an AWS account and admin user with access to the following AWS services:
    1. Athena
    2. Amazon EMR
    3. IAM
    4. Lake Formation and the Data Catalog
    5. Amazon Redshift
    6. Amazon S3
    7. IAM Identity Center
    8. Amazon SageMaker Unified Studio
  2. Create a data lake admin (LHAdmin). For instructions, see Create a data lake administrator.
  3. Create an IAM role named DataSteward and attach permissions for AWS Glue and Lake Formation access. For instructions, refer to Data lake administrator permissions.
  4. Create an IAM role named DataAnalyst and attach permissions for Amazon Redshift and Athena access. For instructions, refer to Data analyst permissions.
  5. Create an IAM role named BIEngineer and attach permissions for Amazon EMR access. This is also the EMR runtime role that the Spark job will use to access the tables. For instructions on the role permissions, refer to Job runtime roles for EMR serverless.
  6. Create an IAM role named RedshiftS3DataTransferRole following the instructions in Prerequisites for managing Amazon Redshift namespaces in the AWS Glue Data Catalog.
  7. Create an EMR Studio and attach an EMR Serverless namespace in a private subnet to it, following the instructions in Run interactive workloads on Amazon EMR Serverless from Amazon EMR Studio.

Create data lake tables using an S3 Tables bucket and integrate with the lakehouse architecture

Alice completes the following steps to create a table bucket and enable integration with analytics services:

  1. Sign in to the Amazon S3 console as LHAdmin.
  2. Choose Table buckets in the navigation pane and create a table bucket.
  3. For Table bucket name, enter a name, such as tbacblog-customer-bucket.
  4. For Integration with AWS analytics services, choose Enable integration.
  5. Choose Create table bucket.
  6. After you create the table bucket, choose the link for the table bucket name.
  7. Choose Create table with Athena.
  8. Create a namespace and provide a namespace name. For example, tbacblog_namespace.
  9. Choose Create namespace.
  10. Now proceed to create the table schema and populate it by choosing Create table with Athena.
  11. On the Athena console, run the following SQL script to create a table:
    CREATE TABLE `tbacblog_namespace`.customer (
      c_salutation string, 
      c_preferred_cust_flag string, 
      c_first_sales_date_sk int, 
      c_customer_sk int, 
      c_login string, 
      c_current_cdemo_sk int, 
      c_first_name string, 
      c_current_hdemo_sk int, 
      c_current_addr_sk int, 
      c_last_name string, 
      c_customer_id string, 
      c_last_review_date_sk int, 
      c_birth_month int, 
      c_birth_country string, 
      c_birth_year int, 
      c_birth_day int, 
      c_first_shipto_date_sk int, 
      c_email_address string)
    TBLPROPERTIES ('table_type' = 'iceberg');
    
    
    INSERT INTO tbacblog_namespace.customer
    VALUES('Dr.','N',2452077,13251813,'Y',1381546,'Joyce',2645,2255449,'Deaton','AAAAAAAAFOEDKMAA',2452543,1,'GREECE',1987,29,2250667,'Joyce.Deaton@qhtrwert.edu'),
    ('Dr.','N',2450637,12755125,'Y',1581546,'Daniel',9745,4922716,'Dow','AAAAAAAAFLAKCMAA',2432545,1,'INDIA',1952,3,2450667,'Daniel.Cass@hz05IuguG5b.org'),
    ('Dr.','N',2452342,26009249,'Y',1581536,'Marie',8734,1331639,'Lange','AAAAAAAABKONMIBA',2455549,1,'CANADA',1934,5,2472372,'Marie.Lange@ka94on0lHy.edu'),
    ('Dr.','N',2452342,3270685,'Y',1827661,'Wesley',1548,11108235,'Harris','AAAAAAAANBIOBDAA',2452548,1,'ROME',1986,13,2450667,'Wesley.Harris@c7NpgG4gyh.edu'),
    ('Dr.','N',2452342,29033279,'Y',1581536,'Alexandar',8262,8059919,'Salyer','AAAAAAAAPDDALLBA',2952543,1,'SWISS',1980,6,2650667,'Alexander.Salyer@GxfK3iXetN.edu'),
    ('Miss','N',2452342,6520539,'Y',3581536,'Jerry',1874,36370,'Tracy','AAAAAAAALNOHDGAA',2452385,1,'ITALY',1957,8,2450667,'Jerry.Tracy@VTtQp8OsUkv2hsygIh.edu');
    
    SELECT * FROM tbacblog_namespace.customer;

You have now created the S3 Tables table customer, populated it with data, and integrated it with the lakehouse architecture.
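If Alice prefers to script this step, the same statements can be submitted through the Athena API. A sketch using boto3 follows; the result bucket is a placeholder, and the payload shape follows Athena's start_query_execution API with the federated catalog name from the steps above.

```python
def athena_request(sql, output_s3):
    """Build a start_query_execution payload targeting the federated catalog."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {
            # Catalog name registered by the S3 Tables integration
            "Catalog": "s3tablescatalog/tbacblog-customer-bucket",
            "Database": "tbacblog_namespace",
        },
        # Placeholder result location -- replace with your own bucket
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

def run_athena_query(sql, output_s3):
    import boto3  # imported here so the payload helper has no dependencies
    athena = boto3.client("athena")
    resp = athena.start_query_execution(**athena_request(sql, output_s3))
    return resp["QueryExecutionId"]
```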

Set up data warehouse tables using Amazon Redshift and integrate them with the lakehouse architecture

In this section, Alice sets up data warehouse tables using Amazon Redshift and integrates them with the lakehouse architecture.

Create a Redshift cluster and publish it to the Data Catalog

Alice completes the following steps to create a Redshift cluster and publish it to the Data Catalog:

  1. Create a Redshift Serverless namespace called salescluster. For instructions, refer to Get started with Amazon Redshift Serverless data warehouses.
  2. Sign in to the Redshift endpoint salescluster as an admin user.
  3. Run the following script in the dev database to create a sales schema and a table under it:
    CREATE SCHEMA sales;
    CREATE TABLE sales.store_sales (
    sale_id INTEGER IDENTITY(1,1) PRIMARY KEY,
    customer_sk INTEGER NOT NULL,
    sale_date DATE NOT NULL,
    sale_amount DECIMAL(10, 2) NOT NULL,
    product_name VARCHAR(100) NOT NULL,
    last_purchase_date DATE
    );
    
    INSERT INTO sales.store_sales (customer_sk, sale_date, sale_amount, product_name, last_purchase_date)
    VALUES
    (13251813, '2023-01-15', 150.00, 'Widget A', '2023-01-15'),
    (29033279, '2023-01-20', 200.00, 'Gadget B', '2023-01-20'),
    (12755125, '2023-02-01', 75.50, 'Tool C', '2023-02-01'),
    (26009249, '2023-02-10', 300.00, 'Widget A', '2023-02-10'),
    (3270685, '2023-02-15', 125.00, 'Gadget B', '2023-02-15'),
    (6520539, '2023-03-01', 100.00, 'Tool C', '2023-03-01'),
    (10251183, '2023-03-10', 250.00, 'Widget A', '2023-03-10'),
    (10251283, '2023-03-15', 180.00, 'Gadget B', '2023-03-15'),
    (10251383, '2023-04-01', 90.00, 'Tool C', '2023-04-01'),
    (10251483, '2023-04-10', 220.00, 'Widget A', '2023-04-10'),
    (10251583, '2023-04-15', 175.00, 'Gadget B', '2023-04-15'),
    (10251683, '2023-05-01', 130.00, 'Tool C', '2023-05-01'),
    (10251783, '2023-05-10', 280.00, 'Widget A', '2023-05-10'),
    (10251883, '2023-05-15', 195.00, 'Gadget B', '2023-05-15'),
    (10251983, '2023-06-01', 110.00, 'Tool C', '2023-06-01'),
    (10251083, '2023-06-10', 270.00, 'Widget A', '2023-06-10'),
    (10252783, '2023-06-15', 185.00, 'Gadget B', '2023-06-15'),
    (10253783, '2023-07-01', 95.00, 'Tool C', '2023-07-01'),
    (10254783, '2023-07-10', 240.00, 'Widget A', '2023-07-10'),
    (10255783, '2023-07-15', 160.00, 'Gadget B', '2023-07-15');
    
    SELECT * FROM sales.store_sales;

  4. On the Redshift Serverless console, open the namespace.
  5. On the Actions dropdown menu, choose Register with AWS Glue Data Catalog to integrate with the lakehouse architecture.
  6. Select the same AWS account and choose Register.

Create a catalog for Amazon Redshift

Alice completes the following steps to create a catalog for Amazon Redshift:

  1. Sign in to the Lake Formation console as the data lake administrator LHAdmin.
  2. In the navigation pane, under Data Catalog, choose Catalogs.
    Under Pending catalog invitations, you will see the invitation initiated from the Redshift Serverless namespace salescluster.
  3. Select the pending invitation and choose Approve and create catalog.
  4. Provide a name for the catalog. For example, redshift_salescatalog.
  5. Under Access from engines, select Access this catalog from Iceberg-compatible engines and choose RedshiftS3DataTransferRole for IAM role.
  6. Choose Next.
  7. Choose Add permissions.
  8. Under Principals, choose the LHAdmin role for IAM users and roles, choose Super user for Catalog permissions, and choose Add.
  9. Choose Create catalog.

After you create the catalog redshift_salescatalog, you can inspect the sub-catalog dev, the namespace and database sales, and the table store_sales underneath it.

Alice has now completed creating an S3 Tables catalog table and a Redshift federated catalog table in the Data Catalog.

Delegate LF-Tags creation and resource permission to the DataSteward role

Alice completes the following steps to delegate LF-Tags creation and resource permission to Bob as DataSteward:

  1. Sign in to the Lake Formation console as the data lake administrator LHAdmin.
  2. In the navigation pane, choose LF Tags and permissions, then choose the LF-Tag creators tab.
  3. Choose Add LF-Tag creators.
  4. Choose DataSteward for IAM users and roles.
  5. Under Permission, select Create LF-Tag and choose Add.
  6. In the navigation pane, choose Data permissions, then choose Grant.
  7. In the Principals section, for IAM users and roles, choose the DataSteward role.
  8. In the LF-Tags or catalog resources section, select Named Data Catalog resources.
  9. Choose <account_id>:s3tablescatalog/tbacblog-customer-bucket and <account_id>:redshift_salescatalog/dev for Catalogs.
  10. In the Catalog permissions section, select Super user for permissions.
  11. Choose Grant.

You can verify permissions for DataSteward on the Data permissions page.

Alice has now completed delegating LF-tags creation and assignment permissions to Bob, the DataSteward. She had also granted catalog level permissions to Bob.

Create LF-Tags

Bob as DataSteward completes the following steps to create LF-Tags:

  1. Sign in to the Lake Formation console as DataSteward.
  2. In the navigation pane, choose LF Tags and permissions, then choose the LF-tags tab.
  3. Choose Add LF-Tag.
  4. Create LF-Tags as follows:
    1. Key: Domain and Values: sales, marketing
    2. Key: Sensitivity and Values: true, false
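The console steps above could equivalently be scripted. A sketch using boto3's Lake Formation client follows; the two payloads mirror Bob's tag ontology, and the call shape follows the create_lf_tag API.

```python
def lf_tag_definitions():
    """The two tag keys from Bob's ontology, as create_lf_tag payloads."""
    return [
        {"TagKey": "Domain", "TagValues": ["sales", "marketing"]},
        {"TagKey": "Sensitivity", "TagValues": ["true", "false"]},
    ]

def create_lf_tags():
    import boto3  # imported here so the payload helper has no dependencies
    lf = boto3.client("lakeformation")
    for tag in lf_tag_definitions():
        # Requires the LF-Tag creator permission delegated to DataSteward
        lf.create_lf_tag(**tag)
```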

Assign LF-Tags to the S3 Tables database and table

Bob as DataSteward completes the following steps to assign LF-Tags to the S3 Tables database and table:

  1. In the navigation pane, choose Catalogs and choose s3tablescatalog.
  2. Choose tbacblog-customer-bucket and choose tbacblog_namespace.
  3. Choose Edit LF-Tags.
  4. Assign the following tags:
    1. Key: Domain and Value: sales
    2. Key: Sensitivity and Value: false
  5. Choose Save.
  6. On the View dropdown menu, choose Tables.
  7. Choose the customer table and choose the Schema tab.
  8. Choose Edit schema and select the columns c_first_name, c_last_name, c_email_address, and c_birth_year.
  9. Choose Edit LF-Tags and modify the tag value:
    1. Key: Sensitivity and Value: true
  10. Choose Save.
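The same assignments can be made with the add_lf_tags_to_resource API. A sketch follows; the account ID is a placeholder, and the CatalogId format for the federated s3tablescatalog is an assumption based on the catalog names shown in the console steps.

```python
ACCOUNT_ID = "111122223333"  # placeholder -- replace with your account ID
CATALOG = f"{ACCOUNT_ID}:s3tablescatalog/tbacblog-customer-bucket"

def namespace_tag_request():
    """Tag the whole namespace; tables inside inherit these tags."""
    return {
        "Resource": {"Database": {"CatalogId": CATALOG,
                                  "Name": "tbacblog_namespace"}},
        "LFTags": [
            {"TagKey": "Domain", "TagValues": ["sales"]},
            {"TagKey": "Sensitivity", "TagValues": ["false"]},
        ],
    }

def sensitive_columns_tag_request():
    """Override Sensitivity=true on the four PII columns of customer."""
    return {
        "Resource": {"TableWithColumns": {
            "CatalogId": CATALOG,
            "DatabaseName": "tbacblog_namespace",
            "Name": "customer",
            "ColumnNames": ["c_first_name", "c_last_name",
                            "c_email_address", "c_birth_year"],
        }},
        "LFTags": [{"TagKey": "Sensitivity", "TagValues": ["true"]}],
    }

def apply_tags():
    import boto3  # real call; needs DataSteward credentials
    lf = boto3.client("lakeformation")
    lf.add_lf_tags_to_resource(**namespace_tag_request())
    lf.add_lf_tags_to_resource(**sensitive_columns_tag_request())
```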

Assign LF-Tags to the Redshift database and table

Bob as DataSteward completes the following steps to assign LF-Tags to the Redshift database and table:

  1. In the navigation pane, choose Catalogs and choose redshift_salescatalog.
  2. Choose dev and select sales.
  3. Choose Edit LF-Tags and assign the following tags:
    1. Key: Domain and Value: sales
    2. Key: Sensitivity and Value: false
  4. Choose Save.

Grant catalog permission to the DataAnalyst and BIEngineer roles

Bob as DataSteward completes the following steps to grant catalog permission to the DataAnalyst and BIEngineer roles (Charlie and Doug, respectively):

  1. In the navigation pane, choose Data lake permissions, then choose Grant.
  2. In the Principals section, for IAM users and roles, choose the DataAnalyst and BIEngineer roles.
  3. In the LF-Tags or catalog resources section, select Named Data Catalog resources.
  4. For Catalogs, choose <account_id>:s3tablescatalog/tbacblog-customer-bucket and <account_id>:redshift_salescatalog/dev.
  5. In the Catalog permissions section, choose Describe for permissions.
  6. Choose Grant.

Grant permission to the DataAnalyst role for the sales domain and non-sensitive data

Bob as DataSteward completes the following steps to grant permission to the DataAnalyst role (Charlie) for the sales domain for non-sensitive data:

  1. In the navigation pane, choose Data lake permissions, then choose Grant.
  2. In the Principals section, for IAM users and roles, choose the DataAnalyst role.
  3. In the LF-Tags or catalog resources section, select Resources matched by LF-Tags and provide the following values:
    1. Key: Domain and Value: sales
    2. Key: Sensitivity and Value: false

  4. In the Database permissions section, choose Describe for permissions.
  5. In the Table permissions section, select Select and Describe for permissions.
  6. Choose Grant.
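This grant can also be expressed through the grant_permissions API with an LFTagPolicy resource. A sketch follows; the account ID and role ARN are placeholders, and the payload shape follows the Lake Formation GrantPermissions API.

```python
ACCOUNT_ID = "111122223333"  # placeholder -- replace with your account ID

def data_analyst_grant():
    """SELECT/DESCRIBE on tables matching Domain=sales AND Sensitivity=false."""
    return {
        "Principal": {
            "DataLakePrincipalIdentifier":
                f"arn:aws:iam::{ACCOUNT_ID}:role/DataAnalyst"
        },
        "Resource": {"LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [
                {"TagKey": "Domain", "TagValues": ["sales"]},
                {"TagKey": "Sensitivity", "TagValues": ["false"]},
            ],
        }},
        "Permissions": ["SELECT", "DESCRIBE"],
    }

def grant():
    import boto3  # real call; needs DataSteward credentials
    boto3.client("lakeformation").grant_permissions(**data_analyst_grant())
```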

Grant permission to the BIEngineer role for sales domain data

Bob as DataSteward completes the following steps to grant permission to the BIEngineer role (Doug) for all sales domain data:

  1. In the navigation pane, choose Datalake permissions, then choose Grant.
  2. In the Principals section, for IAM users and roles, choose the BIEngineer role.
  3. In the LF-Tags or catalog resources section, select Resources matched by LF-Tags and provide the following values:
    1. Key: Domain and Value: sales
  4. In the Database permissions section, choose Describe for permissions.
  5. In the Table permissions section, select Select and Describe for permissions.
  6. Choose Grant.

This completes the steps to grant S3 Tables and Redshift federated tables permissions to various data personas using LF-TBAC.

Verify data access

In this step, we log in as individual data personas and query the lakehouse tables that are available to each persona.

Use Athena to analyze customer information as the DataAnalyst role

Charlie signs in to the Athena console as the DataAnalyst role. He runs the following sample SQL query:

SELECT * FROM
"redshift_salescatalog/dev"."sales"."store_sales" s
JOIN
"s3tablescatalog/tbacblog-customer-bucket"."tbacblog_namespace"."customer" c 
ON c.c_customer_sk = s.customer_sk
LIMIT 5;

Next, run a sample query that accesses the four columns in the S3 Tables customer table that DataAnalyst does not have access to. You should receive an error, as shown in the following screenshot. This verifies column-level fine-grained access using LF-Tags on the lakehouse tables.

Use the Redshift query editor to analyze customer data as the DataAnalyst role

Charlie signs in to the Redshift query editor v2 as the DataAnalyst role and runs the following sample SQL query:

SELECT * FROM
"dev@redshift_salescatalog"."sales"."store_sales" s
JOIN
"tbacblog-customer-bucket@s3tablescatalog"."tbacblog_namespace"."customer" c 
ON c.c_customer_sk = s.customer_sk
LIMIT 5;

This verifies the DataAnalyst role's access to the lakehouse tables with LF-Tags based permissions, using Redshift Spectrum.

Use Amazon EMR to process customer data as the BIEngineer role

Doug uses Amazon EMR to process customer data with the BIEngineer role:

  1. Sign in to the EMR Studio as Doug, with the BIEngineer role. Ensure an EMR Serverless application is attached to the workspace with BIEngineer as the EMR runtime role.
  2. Download the PySpark notebook tbacblog_emrs.ipynb and upload it to your Studio environment.
  3. Change the account ID, AWS Region, and resource names as per your setup. Restart the kernel and clear the output.
  4. When your PySpark kernel is ready, run the cells and verify access.

This verifies access to the lakehouse tables using LF-Tags as the EMR runtime role. For demonstration, we also provide the PySpark script tbacblog_sparkscript.py, which you can run as an EMR Serverless batch job or an AWS Glue 5.0 ETL job.

Doug has also set up Amazon SageMaker Unified Studio, as covered in the blog post Accelerate your analytics with Amazon S3 Tables and Amazon SageMaker Lakehouse. Doug logs in to SageMaker Unified Studio and selects the previously created project to perform his analysis. He navigates to the Build options and chooses JupyterLab under IDE & Applications. He uses the downloaded PySpark notebook and updates it as per his Spark query requirements. He then runs the cells, selecting the compute project.spark.fineGrained.

Doug can now start using Spark SQL and start processing data as per fine grained access controlled by the Tags.
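In Spark SQL, lakehouse tables are addressed with the federated catalog name as the leading identifier, as in the Athena and Redshift queries above. A minimal sketch of building such a query follows; the backtick quoting is an assumption based on the '/' and '-' characters in the catalog names.

```python
def lakehouse_ref(catalog, namespace, table):
    """Build a fully qualified Spark SQL identifier for a lakehouse table.

    Backticks are needed because federated catalog names contain '/' and '-'.
    """
    return f"`{catalog}`.`{namespace}`.`{table}`"

# Catalog, namespace, and table names from this walkthrough.
CUSTOMER = lakehouse_ref("s3tablescatalog/tbacblog-customer-bucket",
                         "tbacblog_namespace", "customer")
STORE_SALES = lakehouse_ref("redshift_salescatalog/dev", "sales", "store_sales")

QUERY = (
    f"SELECT c.c_customer_sk, s.sale_amount "
    f"FROM {STORE_SALES} s JOIN {CUSTOMER} c "
    f"ON c.c_customer_sk = s.customer_sk LIMIT 5"
)

def run(spark):
    # spark: a SparkSession from a Lake Formation enabled EMR Serverless
    # application; results contain only rows and columns that the runtime
    # role's LF-Tags permit.
    return spark.sql(QUERY)
```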

Clean up

Complete the following steps to delete the resources you created to avoid unexpected costs:

  1. Delete the Redshift Serverless workgroups.
  2. Delete the Redshift Serverless associated namespace.
  3. Delete the EMR Studio and EMR Serverless instance.
  4. Delete the AWS Glue catalogs, databases, and tables and Lake Formation permissions.
  5. Delete the S3 Tables bucket.
  6. Empty and delete the S3 bucket.
  7. Delete the IAM roles created for this post.

Conclusion

In this post, we demonstrated how you can use Lake Formation tag-based access control with the SageMaker lakehouse architecture to achieve unified and scalable permissions to your data warehouse and data lake. Now administrators can add access permissions to federated catalogs using attributes and tags, creating automated policy enforcement that scales naturally as new assets are added to the system. This eliminates the operational overhead of manual policy updates. You can use this model for sharing resources across accounts and Regions to facilitate data sharing within and across enterprises.

We encourage AWS data lake customers to try this feature and share your feedback in the comments. To learn more about tag-based access control, visit the Lake Formation documentation.

Acknowledgment: A special thanks to everyone who contributed to the development and launch of TBAC: Joey Ghirardelli, Xinchi Li, Keshav Murthy Ramachandra, Noella Jiang, Purvaja Narayanaswamy, Sandya Krishnanand.


About the Authors

Sandeep Adwankar is a Senior Product Manager with Amazon SageMaker Lakehouse. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that help customers improve how they manage, secure, and access data.

Srividya Parthasarathy is a Senior Big Data Architect with Amazon SageMaker Lakehouse. She works with the product team and customers to build robust features and solutions for their analytical data platform. She enjoys building data mesh solutions and sharing them with the community.

Aarthi Srinivasan is a Senior Big Data Architect with Amazon SageMaker Lakehouse. She works with AWS customers and partners to architect lakehouse solutions, enhance product features, and establish best practices for data governance.
