Getting started with Amazon S3 Tables in Amazon SageMaker Unified Studio

Modern data teams face a critical challenge: their analytical datasets are scattered across multiple storage systems and formats, creating operational complexity that slows down insights and hampers collaboration. Data scientists waste valuable time navigating between different tools to access data stored in various locations, while data engineers struggle to maintain consistent performance and governance across disparate storage solutions. Teams often find themselves locked into specific query engines or analytics tools based on where their data resides, limiting their ability to choose the best tool for each analytical task.

Amazon SageMaker Unified Studio addresses this fragmentation by providing a single environment where teams can access and analyze organizational data using AWS analytics and AI/ML services. The new Amazon S3 Tables integration solves a fundamental problem: it enables teams to store their data in a unified, high-performance table format while maintaining the flexibility to query that same data seamlessly across multiple analytics engines—whether through JupyterLab notebooks, Amazon Redshift, Amazon Athena, or other integrated services. This eliminates the need to duplicate data or compromise on tool choice, allowing teams to focus on generating insights rather than managing data infrastructure complexity.

Table buckets are the third type of S3 bucket, joining the existing general purpose buckets and directory buckets (vector buckets, a fourth type, have since been introduced). You can think of a table bucket as an analytics warehouse that stores Apache Iceberg tables with various schemas. Additionally, S3 Tables deliver the same durability, availability, scalability, and performance characteristics as S3 itself, and automatically optimize your storage to maximize query performance and minimize cost.

In this post, you learn how to integrate SageMaker Unified Studio with S3 Tables and query your data using Athena, Redshift, or Apache Spark on Amazon EMR and AWS Glue.

Integrating S3 Tables with AWS analytics services

S3 table buckets integrate with AWS Glue Data Catalog and AWS Lake Formation to allow AWS analytics services to automatically discover and access your table data. For more information, see Creating an S3 Tables catalog.

Before you get started with SageMaker Unified Studio, your administrator must first create a domain in SageMaker Unified Studio and provide you with the URL. For more information, see the SageMaker Unified Studio Administrator Guide.

If you’ve never used S3 Tables in SageMaker Unified Studio, you can enable the S3 Tables analytics integration when you create your first S3 Tables catalog in SageMaker Unified Studio.

Note: This integration needs to be configured individually in each AWS Region.

When you integrate using SageMaker Unified Studio, it takes the following actions in your account:

  • Creates a new AWS Identity and Access Management (IAM) service role that gives AWS Lake Formation access to all your tables and table buckets in the same AWS Region where you are going to provision the resources. This allows Lake Formation to manage access, permissions, and governance for all current and future table buckets.
  • Creates a catalog from an S3 table bucket in the AWS Glue Data Catalog.
  • Adds the Redshift service role (AWSServiceRoleForRedshift) as a Lake Formation read-only administrator.

Creating catalogs from S3 table buckets in SageMaker Unified Studio

To get started using S3 Tables in SageMaker Unified Studio, you create a new Lakehouse catalog with an S3 table bucket as the source using the following steps.

  1. Open the SageMaker console and use the region selector in the top navigation bar to choose the appropriate AWS Region.
  2. Select your SageMaker domain.
  3. Select an existing project, or create a new one, in which you want to create the table bucket.
  4. In the navigation menu, select Data, then select + to add a new data source.
  5. Choose Create Lakehouse catalog.
  6. In the Add catalog menu, choose S3 Tables as the source.
  7. Enter a name for the catalog, for example blogcatalog.
  8. Enter a database name, for example taxidata.
  9. Choose Create catalog.
  10. Wait for the catalog provisioning to finish. Provisioning creates the following resources in your AWS account:
    1. A new S3 table bucket and the corresponding AWS Glue child catalog under the parent catalog s3tablescatalog.
    2. A new database within that Glue child catalog, with the name you provided. To verify, open the AWS Glue console, expand Data Catalog, and choose Databases.
  11. Create tables in your database, then use the Query Editor or a Jupyter notebook to run queries against them.

Creating and querying S3 table buckets

After adding an S3 Tables catalog, it can be queried using the format s3tablescatalog/blogcatalog. You can begin creating tables within the catalog and query them in SageMaker Unified Studio using the Query Editor or JupyterLab. For more information, see Querying S3 Tables in SageMaker Unified Studio.

Note: In SageMaker Unified Studio, you can create S3 tables only using the Athena engine. However, once the tables are created, they can be queried using Athena, Redshift, or Spark on Amazon EMR and AWS Glue.

Using the query editor

Creating a table in the query editor

  1. From the SageMaker Unified Studio home page, choose the project you created in the top center menu.
  2. Expand the Build menu in the top navigation bar, then choose Query editor.
  3. Launch a new Query Editor tab. This tool functions as a SQL notebook, enabling you to query across multiple engines and build visual data analytics solutions.
  4. Select a data source for your queries by using the menu in the upper-right corner of the Query Editor.
    1. Under Connections, choose Lakehouse (Athena) to connect to your Lakehouse resources.
    2. Under Catalogs, choose s3tablescatalog/blogcatalog.
    3. Under Databases, choose the name of the database for your S3 tables.
  5. Select Choose to connect to the database and query engine.
  6. Run the following SQL query to create a new table in the catalog.
    CREATE TABLE taxidata.taxi_trip_data_iceberg (
        pickup_datetime timestamp,
        dropoff_datetime timestamp,
        pickup_longitude double,
        pickup_latitude double,
        dropoff_longitude double,
        dropoff_latitude double,
        passenger_count bigint,
        fare_amount double
    )
    PARTITIONED BY (day(pickup_datetime))
    TBLPROPERTIES ('table_type' = 'iceberg');

    After you create the table, you can browse to it in the Data explorer by choosing s3tablescatalog → blogcatalog → taxidata → taxi_trip_data_iceberg.

  7. Insert data into a table with the following DML statement.
    INSERT INTO taxidata.taxi_trip_data_iceberg VALUES (
        TIMESTAMP '2025-07-20 10:00:00',
        TIMESTAMP '2025-07-20 10:45:00',
        -73.985,
        40.758,
        -73.982,
        40.761,
        2,
        23.75
    );

  8. Select data from a table with the following query.
    SELECT * FROM taxidata.taxi_trip_data_iceberg
    WHERE pickup_datetime >= TIMESTAMP '2025-07-20'
    AND pickup_datetime < TIMESTAMP '2025-07-21';

You can learn more about the Query Editor and explore additional SQL examples in the SageMaker Unified Studio documentation.
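
You can also run the same queries programmatically through the Athena API. The following is a minimal boto3 sketch, assuming the catalog, database, and table names used in this post and a hypothetical general purpose bucket for Athena query results:

    import boto3

    # Athena client in the same AWS Region as the S3 Tables catalog.
    athena = boto3.client("athena", region_name="us-east-1")

    # Reference the table by its fully qualified name, using the
    # s3tablescatalog/blogcatalog catalog format described earlier.
    response = athena.start_query_execution(
        QueryString=(
            'SELECT * FROM "s3tablescatalog/blogcatalog"."taxidata"'
            '."taxi_trip_data_iceberg" LIMIT 10'
        ),
        # Hypothetical general purpose bucket for query results.
        ResultConfiguration={"OutputLocation": "s3://amzn-s3-demo-bucket/athena-results/"},
    )
    print(response["QueryExecutionId"])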

Before proceeding with JupyterLab setup:

To create tables using the Spark engine via a Spark connection, you must attach the AmazonS3TablesFullAccess policy to the project role identified by the Project Role ARN.

  1. Locate the Project Role ARN in the SageMaker Unified Studio project Overview.
  2. Go to the IAM console, then select Roles.
  3. Search for and select the project role.
  4. Attach the AmazonS3TablesFullAccess policy to the role so that the project has full access to interact with S3 Tables. If you prefer to script this step, see the sketch following this list.
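
The following is a minimal boto3 sketch of that policy attachment; the role name shown is hypothetical, so substitute the role from your own Project Role ARN:

    import boto3

    iam = boto3.client("iam")

    # Attach the AmazonS3TablesFullAccess managed policy to the project role.
    # The role name below is a placeholder; use the name embedded in your
    # project's Project Role ARN.
    iam.attach_role_policy(
        RoleName="datazone_usr_role_example",  # hypothetical project role name
        PolicyArn="arn:aws:iam::aws:policy/AmazonS3TablesFullAccess",
    )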

Using JupyterLab

  1. From the SageMaker Unified Studio home page, choose the project you created in the top center menu.
  2. Expand the Build menu in the top navigation bar, then choose JupyterLab.
  3. Create a new notebook.
  4. Select Python3 Kernel.
  5. Choose PySpark as the connection type.
  6. Select your table bucket and namespace as the data source for your queries:
    1. For the Spark engine, run the query USE s3tablescatalog_blogcatalog (the slash in the catalog path becomes an underscore in Spark), as shown in the sketch after this list.
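
The following is a minimal PySpark sketch, assuming the preconfigured spark session that the PySpark connection provides and the catalog and table names used earlier in this post:

    # Switch to the S3 Tables catalog created earlier; the slash in
    # s3tablescatalog/blogcatalog becomes an underscore in Spark.
    spark.sql("USE `s3tablescatalog_blogcatalog`")

    # List the tables in the taxidata namespace.
    spark.sql("SHOW TABLES IN taxidata").show()

    # Query the Iceberg table created in the Query Editor.
    spark.sql("SELECT * FROM taxidata.taxi_trip_data_iceberg LIMIT 10").show()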

Querying data using Redshift

In this section, we walk through how to query the data using Redshift within SageMaker Unified Studio.

  1. From the SageMaker Unified Studio home page, choose your project name in the top center navigation bar.
  2. In the navigation panel, expand the Redshift project folder.
  3. Open the blogcatalog@s3tablescatalog database.
  4. Expand the taxidata schema.
  5. Under the Tables section, locate and expand taxi_trip_data_iceberg.
  6. Review the table metadata to view all columns and their corresponding data types.
  7. Open the Sample data tab to preview a small, representative subset of records.
  8. Choose Actions.
  9. Select Preview data from the dropdown to open and view the full dataset in the data viewer.

When you select your table, the Query Editor automatically opens with a pre-populated SQL query. This default query retrieves the top 10 records from the table, giving you an instant preview of your data. It references the table by its fully qualified database.schema.table name, which ensures the query targets the intended table even in environments with multiple databases or schemas.

Best practices and considerations

Keep the following considerations in mind:

  • When you create an S3 table bucket using the S3 console, integration with AWS analytics services is enabled by default; you can also set up the integration manually through a guided process in the console. When you create an S3 table bucket programmatically using the AWS SDKs, AWS CLI, or REST APIs, the integration is not configured automatically. You need to manually perform the steps required to integrate the table bucket with AWS Glue Data Catalog and Lake Formation so that these services can discover and access the table data. A short programmatic sketch follows this list.
  • When creating an S3 table bucket for use with AWS analytics services like Athena, use all lowercase letters for the table bucket name. This ensures proper integration and visibility within the AWS analytics ecosystem. Learn more in Getting started with S3 Tables.
  • S3 Tables offer automatic table maintenance features like compaction, snapshot management, and unreferenced file removal to optimize data for analytics workloads. However, there are some limitations to consider; see Considerations and limitations for maintenance jobs.
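
As a minimal illustration of the programmatic path noted above, the following boto3 sketch creates a table bucket; the bucket name is hypothetical, and the analytics integration steps still need to be performed separately:

    import boto3

    # Create a table bucket programmatically. Unlike console creation, this
    # does not set up the AWS Glue Data Catalog / Lake Formation integration.
    s3tables = boto3.client("s3tables", region_name="us-east-1")

    response = s3tables.create_table_bucket(
        name="amzn-s3-demo-table-bucket"  # hypothetical name; must be lowercase
    )
    print(response["arn"])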

Conclusion

In this post, we discussed how to use SageMaker Unified Studio’s integration with S3 Tables to enhance your data analytics workflows. The post explained the setup process, including creating a Lakehouse catalog with an S3 table bucket source, configuring the necessary IAM roles, and establishing integration with AWS Glue Data Catalog and Lake Formation. We walked you through practical implementation steps, from creating and managing Apache Iceberg-based S3 tables to executing queries through both the Query Editor and JupyterLab with PySpark, as well as accessing and analyzing data using Redshift.

To get started with the SageMaker Unified Studio and S3 Tables integration, see the Amazon SageMaker Unified Studio documentation.

