
The Amazon SageMaker lakehouse architecture has expanded its tag-based access control (TBAC) capabilities to include federated catalogs. This enhancement extends beyond the default AWS Glue Data Catalog resources to encompass Amazon S3 Tables and Amazon Redshift data warehouses. TBAC is also supported on federated catalogs from data sources such as Amazon DynamoDB, MySQL, PostgreSQL, SQL Server, Oracle, Amazon DocumentDB, Google BigQuery, and Snowflake. TBAC provides sophisticated permission management that uses tags to create logical groupings of catalog resources, so administrators can implement fine-grained access controls across their entire data landscape without managing individual resource-level permissions.
Traditional data access management often requires manual assignment of permissions at the resource level, creating significant administrative overhead. TBAC solves this by introducing an automated, inheritance-based permission model. When administrators apply tags to data resources, access permissions are automatically inherited, eliminating the need for manual policy modifications when new tables are added. This streamlined approach not only reduces administrative burden but also enhances security consistency across the data ecosystem.
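The matching rule behind this model is simple: a principal holds a grant expressed as tag key-value pairs, and a resource is in scope when its tags satisfy every pair in the grant. The following Python sketch is purely illustrative (hypothetical names, not Lake Formation's implementation), but it shows why a newly tagged table inherits access with no policy change:

```python
# Illustrative sketch of tag-based access evaluation (not AWS code):
# a grant is a tag expression; a resource is covered when its tags
# satisfy every key/value pair in the expression.

def has_access(grant: dict, resource_tags: dict) -> bool:
    """Return True if the resource's tags match every key in the grant."""
    return all(resource_tags.get(key) == value for key, value in grant.items())

analyst_grant = {"Domain": "sales", "Sensitivity": "false"}

existing_table = {"Domain": "sales", "Sensitivity": "false"}
sensitive_data = {"Domain": "sales", "Sensitivity": "true"}

print(has_access(analyst_grant, existing_table))   # True
print(has_access(analyst_grant, sensitive_data))   # False

# A brand-new table inherits access the moment it is tagged -- no
# per-resource policy update is needed.
new_table = {"Domain": "sales", "Sensitivity": "false"}
print(has_access(analyst_grant, new_table))        # True
```

The same rule also explains revocation: removing or changing a resource's tag immediately takes it out of scope for every grant that referenced that tag.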
TBAC can be set up through the AWS Lake Formation console and is accessible from Amazon Redshift, Amazon Athena, Amazon EMR, AWS Glue, and Amazon SageMaker Unified Studio. This makes it valuable for organizations managing complex data landscapes with multiple data sources and large datasets. TBAC is especially beneficial for enterprises implementing data mesh architectures, maintaining regulatory compliance, or scaling their data operations across multiple departments. Furthermore, TBAC enables efficient data sharing across different accounts, making it easier to maintain secure collaboration.
In this post, we illustrate how to get started with fine-grained access control of S3 Tables and Redshift tables in the lakehouse using TBAC. We also show how to access these lakehouse tables using your choice of analytics services, such as Athena, Redshift, and Apache Spark in Amazon EMR Serverless in Amazon SageMaker Unified Studio.
Solution overview
For illustration, we consider a fictional company called Example Retail Corp, as covered in the blog post Accelerate your analytics with Amazon S3 Tables and Amazon SageMaker Lakehouse. Example Retail’s leadership has decided to use the SageMaker lakehouse architecture to unify data across S3 Tables and their Redshift data warehouse. With this lakehouse architecture, they can now conduct analyses across their data to identify at-risk customers, understand the impact of personalized marketing campaigns on customer churn, and develop targeted retention and sales strategies.
Alice is a data administrator with the AWS Identity and Access Management (IAM) role LHAdmin in Example Retail Corp, and she wants to implement tag-based access control to scale permissions across their data lake and data warehouse resources. She is using S3 Tables with Iceberg transactional capability to achieve scalability as updates are streamed across billions of customer interactions, while providing the same durability, availability, and performance characteristics that S3 is known for. She already has a Redshift namespace, which contains historical and current data about sales, customer prospects, and churn information. Alice supports an extended team of developers, engineers, and data scientists who require access to the data environment to develop business insights, dashboards, ML models, and knowledge bases. This team includes:
- Bob, a data steward with IAM role DataSteward, is the domain owner and manages access to the S3 Tables and warehouse data. He enables other teams who build reports to be shared with leadership.
- Charlie, a data analyst with IAM role DataAnalyst, builds ML forecasting models for sales growth using the pipeline of customer conversions across multiple touchpoints, and makes those available to finance and planning teams.
- Doug, a BI engineer with IAM role BIEngineer, builds interactive dashboards to funnel customer prospects and their conversions across multiple touchpoints, and makes those available to thousands of sales team members.
Alice decides to use the SageMaker lakehouse architecture to unify data across S3 Tables and Redshift data warehouse. Bob can now bring his domain data into one place and manage access to multiple teams requesting access to his data. Charlie can quickly build Amazon QuickSight dashboards and use his Redshift and Athena expertise to provide quick query results. Doug can build Spark-based processing with AWS Glue or Amazon EMR to build ML forecasting models.
Alice’s goal is to use TBAC to make fine-grained access much more scalable, because she can grant permissions on many resources at once, and permissions are updated accordingly when tags for resources are added, changed, or removed.

The following diagram illustrates the solution architecture.
Alice, as lakehouse admin, and Bob, as data steward, determine that the following high-level steps are needed to deploy the solution:
- Create an S3 Tables bucket and enable integration with the Data Catalog. This will make the resources available under the federated catalog s3tablescatalog in the lakehouse architecture with Lake Formation for access control. Create a namespace and a table under the table bucket where the data will be stored.
- Create a Redshift cluster with tables, publish your data warehouse to the Data Catalog, and create a catalog registering the namespace. This will make the resources available under a federated catalog in the lakehouse architecture with Lake Formation for access control.
- Delegate permissions to create tags and grant permissions on Data Catalog resources to DataSteward.
- As DataSteward, define a tag ontology based on the use case and create LF-Tags. Assign these LF-Tags to the resources (database or table) to logically group lakehouse resources for sharing based on access patterns.
- Share the S3 Tables catalog table and Redshift table using tag-based access control with DataAnalyst, who uses Athena for analysis and Redshift Spectrum for generating the report.
- Share the S3 Tables catalog table and Redshift table using tag-based access control with BIEngineer, who uses Spark in EMR Serverless to further process the datasets.
The data steward defines the tags and their assignment to resources as follows:
| Tags | Data Resources |
| --- | --- |
| Domain = sales, Sensitivity = false | S3 Table: customer (c_salutation, c_preferred_cust_flag, c_first_sales_date_sk, …) |
| Domain = sales, Sensitivity = true | S3 Table: customer (c_first_name, c_last_name, c_email_address, c_birth_year) |
| Domain = sales, Sensitivity = false | Redshift Table: sales.store_sales |
The following table summarizes the tag expression that is granted to roles for resource access:
| User | Persona | Permission Granted | Access |
| --- | --- | --- | --- |
| Bob | DataSteward | SUPER_USER on catalogs | Admin access on customer and store_sales |
| Charlie | DataAnalyst | Domain = sales, Sensitivity = false | Access to non-sensitive data aligned to the sales domain: customer (non-sensitive columns) and store_sales |
| Doug | BIEngineer | Domain = sales | Access to all datasets aligned to the sales domain: customer and store_sales |
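As a concrete illustration of the grants above, this Python sketch (illustrative only; Lake Formation evaluates tag expressions server-side, and only four of the customer columns are shown) computes which columns each persona would see:

```python
# Illustrative only: compute which customer columns each persona can see,
# given the LF-tag assignments from the table above. A column is visible
# when its tags satisfy every key/value pair in the role's grant.

COLUMN_TAGS = {
    "c_salutation":          {"Domain": "sales", "Sensitivity": "false"},
    "c_preferred_cust_flag": {"Domain": "sales", "Sensitivity": "false"},
    "c_first_name":          {"Domain": "sales", "Sensitivity": "true"},
    "c_email_address":       {"Domain": "sales", "Sensitivity": "true"},
}

GRANTS = {
    "DataAnalyst": {"Domain": "sales", "Sensitivity": "false"},
    "BIEngineer":  {"Domain": "sales"},  # no Sensitivity key: both values match
}

def visible_columns(role: str) -> list[str]:
    grant = GRANTS[role]
    return [col for col, tags in COLUMN_TAGS.items()
            if all(tags.get(k) == v for k, v in grant.items())]

print(visible_columns("DataAnalyst"))  # ['c_salutation', 'c_preferred_cust_flag']
print(visible_columns("BIEngineer"))   # all four columns
```

Because Doug's grant omits the Sensitivity key, it matches resources with either value of that tag, which is why he sees the sensitive columns as well.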
Prerequisites
To follow along with this post, complete the following prerequisite steps:
- Have an AWS account and admin user with access to the following AWS services:
- Athena
- Amazon EMR
- IAM
- Lake Formation and the Data Catalog
- Amazon Redshift
- Amazon S3
- IAM Identity Center
- Amazon SageMaker Unified Studio
- Create a data lake admin (LHAdmin). For instructions, see Create a data lake administrator.
- Create an IAM role named DataSteward and attach permissions for AWS Glue and Lake Formation access. For instructions, refer to Data lake administrator permissions.
- Create an IAM role named DataAnalyst and attach permissions for Amazon Redshift and Athena access. For instructions, refer to Data analyst permissions.
- Create an IAM role named BIEngineer and attach permissions for Amazon EMR access. This is also the EMR runtime role that the Spark job will use to access the tables. For instructions on the role permissions, refer to Job runtime roles for EMR Serverless.
- Create an IAM role named RedshiftS3DataTransferRole following the instructions in Prerequisites for managing Amazon Redshift namespaces in the AWS Glue Data Catalog.
- Create an EMR Studio and attach an EMR Serverless application in a private subnet to it, following the instructions in Run interactive workloads on Amazon EMR Serverless from Amazon EMR Studio.
Create data lake tables using an S3 Tables bucket and integrate with the lakehouse architecture
Alice completes the following steps to create a table bucket and enable integration with analytics services:
- Sign in to the Amazon S3 console as LHAdmin.
- Choose Table buckets in the navigation pane and create a table bucket.
- For Table bucket name, enter a name, such as tbacblog-customer-bucket.
- For Integration with AWS analytics services, choose Enable integration.
- Choose Create table bucket.
- After you create the table bucket, choose the hyperlink of the table bucket name.
- Choose Create table with Athena.
- Create a namespace and provide a namespace name, for example, tbacblog_namespace.
- Choose Create namespace.
- Proceed to creating the table schema and populating it by choosing Create table with Athena.
- On the Athena console, run the following SQL script to create and populate the table:

```sql
CREATE TABLE `tbacblog_namespace`.customer (
    c_salutation string,
    c_preferred_cust_flag string,
    c_first_sales_date_sk int,
    c_customer_sk int,
    c_login string,
    c_current_cdemo_sk int,
    c_first_name string,
    c_current_hdemo_sk int,
    c_current_addr_sk int,
    c_last_name string,
    c_customer_id string,
    c_last_review_date_sk int,
    c_birth_month int,
    c_birth_country string,
    c_birth_year int,
    c_birth_day int,
    c_first_shipto_date_sk int,
    c_email_address string
) TBLPROPERTIES ('table_type' = 'iceberg');

INSERT INTO tbacblog_namespace.customer VALUES
('Dr.','N',2452077,13251813,'Y',1381546,'Joyce',2645,2255449,'Deaton','AAAAAAAAFOEDKMAA',2452543,1,'GREECE',1987,29,2250667,'Joyce.Deaton@qhtrwert.edu'),
('Dr.','N',2450637,12755125,'Y',1581546,'Daniel',9745,4922716,'Dow','AAAAAAAAFLAKCMAA',2432545,1,'INDIA',1952,3,2450667,'Daniel.Cass@hz05IuguG5b.org'),
('Dr.','N',2452342,26009249,'Y',1581536,'Marie',8734,1331639,'Lange','AAAAAAAABKONMIBA',2455549,1,'CANADA',1934,5,2472372,'Marie.Lange@ka94on0lHy.edu'),
('Dr.','N',2452342,3270685,'Y',1827661,'Wesley',1548,11108235,'Harris','AAAAAAAANBIOBDAA',2452548,1,'ROME',1986,13,2450667,'Wesley.Harris@c7NpgG4gyh.edu'),
('Dr.','N',2452342,29033279,'Y',1581536,'Alexandar',8262,8059919,'Salyer','AAAAAAAAPDDALLBA',2952543,1,'SWISS',1980,6,2650667,'Alexander.Salyer@GxfK3iXetN.edu'),
('Miss','N',2452342,6520539,'Y',3581536,'Jerry',1874,36370,'Tracy','AAAAAAAALNOHDGAA',2452385,1,'ITALY',1957,8,2450667,'Jerry.Tracy@VTtQp8OsUkv2hsygIh.edu');

SELECT * FROM tbacblog_namespace.customer;
```
You have now created the S3 Tables table customer, populated it with data, and integrated it with the lakehouse architecture.
Set up data warehouse tables using Amazon Redshift and integrate them with the lakehouse architecture
In this section, Alice sets up data warehouse tables using Amazon Redshift and integrates them with the lakehouse architecture.
Create a Redshift cluster and publish it to the Data Catalog
Alice completes the following steps to create a Redshift cluster and publish it to the Data Catalog:
- Create a Redshift Serverless namespace called salescluster. For instructions, refer to Get started with Amazon Redshift Serverless data warehouses.
- Sign in to the Redshift endpoint salescluster as an admin user.
- Run the following script to create the sales schema and store_sales table under the dev database:

```sql
CREATE SCHEMA sales;

CREATE TABLE sales.store_sales (
    sale_id INTEGER IDENTITY(1,1) PRIMARY KEY,
    customer_sk INTEGER NOT NULL,
    sale_date DATE NOT NULL,
    sale_amount DECIMAL(10, 2) NOT NULL,
    product_name VARCHAR(100) NOT NULL,
    last_purchase_date DATE
);

INSERT INTO sales.store_sales (customer_sk, sale_date, sale_amount, product_name, last_purchase_date) VALUES
(13251813, '2023-01-15', 150.00, 'Widget A', '2023-01-15'),
(29033279, '2023-01-20', 200.00, 'Gadget B', '2023-01-20'),
(12755125, '2023-02-01', 75.50, 'Tool C', '2023-02-01'),
(26009249, '2023-02-10', 300.00, 'Widget A', '2023-02-10'),
(3270685, '2023-02-15', 125.00, 'Gadget B', '2023-02-15'),
(6520539, '2023-03-01', 100.00, 'Tool C', '2023-03-01'),
(10251183, '2023-03-10', 250.00, 'Widget A', '2023-03-10'),
(10251283, '2023-03-15', 180.00, 'Gadget B', '2023-03-15'),
(10251383, '2023-04-01', 90.00, 'Tool C', '2023-04-01'),
(10251483, '2023-04-10', 220.00, 'Widget A', '2023-04-10'),
(10251583, '2023-04-15', 175.00, 'Gadget B', '2023-04-15'),
(10251683, '2023-05-01', 130.00, 'Tool C', '2023-05-01'),
(10251783, '2023-05-10', 280.00, 'Widget A', '2023-05-10'),
(10251883, '2023-05-15', 195.00, 'Gadget B', '2023-05-15'),
(10251983, '2023-06-01', 110.00, 'Tool C', '2023-06-01'),
(10251083, '2023-06-10', 270.00, 'Widget A', '2023-06-10'),
(10252783, '2023-06-15', 185.00, 'Gadget B', '2023-06-15'),
(10253783, '2023-07-01', 95.00, 'Tool C', '2023-07-01'),
(10254783, '2023-07-10', 240.00, 'Widget A', '2023-07-10'),
(10255783, '2023-07-15', 160.00, 'Gadget B', '2023-07-15');

SELECT * FROM sales.store_sales;
```

- On the Redshift Serverless console, open the namespace.
- On the Actions dropdown menu, choose Register with AWS Glue Data Catalog to integrate with the lakehouse architecture.
- Select the same AWS account and choose Register.
Create a catalog for Amazon Redshift
Alice completes the following steps to create a catalog for Amazon Redshift:
- Sign in to the Lake Formation console as the data lake administrator LHAdmin.
- In the navigation pane, under Data Catalog, choose Catalogs. Under Pending catalog invitations, you will see the invitation initiated from the Redshift Serverless namespace salescluster.
- Select the pending invitation and choose Approve and create catalog.
- Provide a name for the catalog, for example, redshift_salescatalog.
- Under Access from engines, select Access this catalog from Iceberg-compatible engines and choose RedshiftS3DataTransferRole for IAM role.
- Choose Next.
- Choose Add permissions.
- Under Principals, choose the LHAdmin role for IAM users and roles, choose Super user for Catalog permissions, and choose Add.
- Choose Create catalog.

After you create the catalog redshift_salescatalog, you can inspect the sub-catalog dev, the namespace and database sales, and the table store_sales underneath it.
Alice has now completed creating an S3 Tables catalog table and a Redshift federated catalog table in the Data Catalog.
Delegate LF-Tags creation and resource permission to the DataSteward role
Alice completes the following steps to delegate LF-Tag creation and resource permissions to Bob as DataSteward:
- Sign in to the Lake Formation console as the data lake administrator LHAdmin.
- In the navigation pane, choose LF-Tags and permissions, then choose the LF-Tag creators tab.
- Choose Add LF-Tag creators.
- Choose DataSteward for IAM users and roles.
- Under Permission, select Create LF-Tag and choose Add.
- In the navigation pane, choose Data permissions, then choose Grant.
- In the Principals section, for IAM users and roles, choose the DataSteward role.
- In the LF-Tags or catalog resources section, select Named Data Catalog resources.
- For Catalogs, choose <account_id>:s3tablescatalog/tbacblog-customer-bucket and <account_id>:redshift_salescatalog/dev.
- In the Catalog permissions section, select Super user for permissions.
- Choose Grant.

You can verify the permissions for DataSteward on the Data permissions page.
Alice has now completed delegating LF-Tag creation and assignment permissions to Bob, the data steward. She also granted catalog-level permissions to Bob.
Create LF-Tags
Bob as DataSteward completes the following steps to create LF-Tags:

- Sign in to the Lake Formation console as DataSteward.
- In the navigation pane, choose LF-Tags and permissions, then choose the LF-Tags tab.
- Choose Add LF-Tag.
- Create LF-Tags as follows:
  - Key: Domain and Values: sales, marketing
  - Key: Sensitivity and Values: true, false
Assign LF-Tags to the S3 Tables database and table
Bob as DataSteward completes the following steps to assign LF-Tags to the S3 Tables database and table:

- In the navigation pane, choose Catalogs and choose s3tablescatalog.
- Choose tbacblog-customer-bucket and choose tbacblog_namespace.
- Choose Edit LF-Tags.
- Assign the following tags:
  - Key: Domain and Value: sales
  - Key: Sensitivity and Value: false
- Choose Save.
- On the View dropdown menu, choose Tables.
- Choose the customer table and choose the Schema tab.
- Choose Edit schema and select the columns c_first_name, c_last_name, c_email_address, and c_birth_year.
- Choose Edit LF-Tags and modify the tag value:
  - Key: Sensitivity and Value: true
- Choose Save.
Assign LF-Tags to the Redshift database and table
Bob as DataSteward completes the following steps to assign LF-Tags to the Redshift database and table:

- In the navigation pane, choose Catalogs and choose redshift_salescatalog.
- Choose dev and select sales.
- Choose Edit LF-Tags and assign the following tags:
  - Key: Domain and Value: sales
  - Key: Sensitivity and Value: false
- Choose Save.
Grant catalog permission to the DataAnalyst and BIEngineer roles
Bob as DataSteward completes the following steps to grant catalog permissions to the DataAnalyst and BIEngineer roles (Charlie and Doug, respectively):

- In the navigation pane, choose Data lake permissions, then choose Grant.
- In the Principals section, for IAM users and roles, choose the DataAnalyst and BIEngineer roles.
- In the LF-Tags or catalog resources section, select Named Data Catalog resources.
- For Catalogs, choose <account_id>:s3tablescatalog/tbacblog-customer-bucket and <account_id>:redshift_salescatalog/dev.
- In the Catalog permissions section, choose Describe for permissions.
- Choose Grant.
Grant permission to the DataAnalyst role for the sales domain and non-sensitive data
Bob as DataSteward completes the following steps to grant permissions to the DataAnalyst role (Charlie) for non-sensitive data in the sales domain:

- In the navigation pane, choose Data lake permissions, then choose Grant.
- In the Principals section, for IAM users and roles, choose the DataAnalyst role.
- In the LF-Tags or catalog resources section, select Resources matched by LF-Tags and provide the following values:
  - Key: Domain and Value: sales
  - Key: Sensitivity and Value: false
- In the Database permissions section, choose Describe for permissions.
- In the Table permissions section, select Select and Describe for permissions.
- Choose Grant.
Grant permission to the BIEngineer role for sales domain data
Bob as DataSteward completes the following steps to grant permissions to the BIEngineer role (Doug) for all sales domain data:

- In the navigation pane, choose Data lake permissions, then choose Grant.
- In the Principals section, for IAM users and roles, choose the BIEngineer role.
- In the LF-Tags or catalog resources section, select Resources matched by LF-Tags and provide the following values:
  - Key: Domain and Value: sales
- In the Database permissions section, choose Describe for permissions.
- In the Table permissions section, select Select and Describe for permissions.
- Choose Grant.
This completes the steps to grant S3 Tables and Redshift federated tables permissions to various data personas using LF-TBAC.
Verify data access
In this step, we log in as individual data personas and query the lakehouse tables that are available to each persona.
Use Athena to analyze customer information as the DataAnalyst role
Charlie signs in to the Athena console as the DataAnalyst role and runs the following sample SQL query:

```sql
SELECT * FROM
    "redshift_salescatalog/dev"."sales"."store_sales" s
JOIN
    "s3tablescatalog/tbacblog-customer-bucket"."tbacblog_namespace"."customer" c
ON c.c_customer_sk = s.customer_sk
LIMIT 5;
```
Next, run a sample query that accesses the four columns in the S3 Tables customer table that DataAnalyst does not have access to. You should receive an error, as shown in the screenshot. This verifies column-level fine-grained access using LF-Tags on the lakehouse tables.
Use the Redshift query editor to analyze customer data as the DataAnalyst role
Charlie signs in to the Redshift query editor v2 as the DataAnalyst role and runs the following sample SQL query:

```sql
SELECT * FROM
    "dev@redshift_salescatalog"."sales"."store_sales" s
JOIN
    "tbacblog-customer-bucket@s3tablescatalog"."tbacblog_namespace"."customer" c
ON c.c_customer_sk = s.customer_sk
LIMIT 5;
```
This verifies DataAnalyst access to the lakehouse tables with LF-Tag based permissions, using Redshift Spectrum.
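Note that the two engines address the same federated objects with different naming conventions: Athena joins the catalog and sub-catalog with a slash ("redshift_salescatalog/dev"), whereas Redshift writes the sub-catalog as database@catalog ("dev@redshift_salescatalog"). A small illustrative helper (hypothetical, not an AWS API) makes the two patterns explicit:

```python
# Illustrative helpers contrasting the identifier conventions the two
# engines use for the same federated-catalog table (not an AWS API).

def athena_identifier(catalog: str, subcatalog: str, schema: str, table: str) -> str:
    # Athena: catalog and sub-catalog form one slash-joined, quoted identifier.
    return f'"{catalog}/{subcatalog}"."{schema}"."{table}"'

def redshift_identifier(catalog: str, subcatalog: str, schema: str, table: str) -> str:
    # Redshift: the sub-catalog database is written as database@catalog.
    return f'"{subcatalog}@{catalog}"."{schema}"."{table}"'

print(athena_identifier("redshift_salescatalog", "dev", "sales", "store_sales"))
# "redshift_salescatalog/dev"."sales"."store_sales"
print(redshift_identifier("redshift_salescatalog", "dev", "sales", "store_sales"))
# "dev@redshift_salescatalog"."sales"."store_sales"
```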
Use Amazon EMR to process customer data as the BIEngineer role
Doug uses Amazon EMR to process customer data with the BIEngineer role:

- Sign in to EMR Studio as Doug, with the BIEngineer role. Make sure an EMR Serverless application is attached to the workspace with BIEngineer as the EMR runtime role.
- Download the PySpark notebook tbacblog_emrs.ipynb and upload it to your studio environment.
- Change the account ID, AWS Region, and resource names according to your setup. Restart the kernel and clear the output.
- When your PySpark kernel is ready, run the cells and verify access.

This verifies access using LF-Tags to the lakehouse tables as the EMR runtime role. For demonstration, we also provide the PySpark script tbacblog_sparkscript.py, which you can run as an EMR batch job or an AWS Glue 5.0 ETL job.
Doug has also set up Amazon SageMaker Unified Studio as covered in the blog post Accelerate your analytics with Amazon S3 Tables and Amazon SageMaker Lakehouse. Doug logs in to SageMaker Unified Studio and selects the previously created project to perform his analysis. He navigates to the Build options and chooses JupyterLab under IDE & Applications. He uses the downloaded PySpark notebook and updates it per his Spark query requirements. He then runs the cells, selecting project.spark.fineGrained as the compute.

Doug can now use Spark SQL and start processing data according to the fine-grained access controlled by the LF-Tags.
Clean up
Complete the following steps to delete the resources you created to avoid unexpected costs:
- Delete the Redshift Serverless workgroups.
- Delete the Redshift Serverless associated namespace.
- Delete the EMR Studio and EMR Serverless instance.
- Delete the AWS Glue catalogs, databases, and tables and Lake Formation permissions.
- Delete the S3 Tables bucket.
- Empty and delete the S3 bucket.
- Delete the IAM roles created for this post.
Conclusion
In this post, we demonstrated how you can use Lake Formation tag-based access control with the SageMaker lakehouse architecture to achieve unified and scalable permissions to your data warehouse and data lake. Now administrators can add access permissions to federated catalogs using attributes and tags, creating automated policy enforcement that scales naturally as new assets are added to the system. This eliminates the operational overhead of manual policy updates. You can use this model for sharing resources across accounts and Regions to facilitate data sharing within and across enterprises.
We encourage AWS data lake customers to try this feature and share your feedback in the comments. To learn more about tag-based access control, visit the Lake Formation documentation.
Acknowledgment: A special thanks to everyone who contributed to the development and launch of TBAC: Joey Ghirardelli, Xinchi Li, Keshav Murthy Ramachandra, Noella Jiang, Purvaja Narayanaswamy, Sandya Krishnanand.
About the Authors
Sandeep Adwankar is a Senior Product Manager with Amazon SageMaker Lakehouse. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that help customers improve how they manage, secure, and access data.
Srividya Parthasarathy is a Senior Big Data Architect with Amazon SageMaker Lakehouse. She works with the product team and customers to build robust features and solutions for their analytical data platform. She enjoys building data mesh solutions and sharing them with the community.
Aarthi Srinivasan is a Senior Big Data Architect with Amazon SageMaker Lakehouse. She works with AWS customers and partners to architect lakehouse solutions, enhance product features, and establish best practices for data governance.