Introduction
In the dynamic landscape of data management and analytics, organisations are constantly seeking innovative solutions to streamline processes, unlock insights, and drive decision-making. Unity Catalog offers a single solution for this by providing centralised access control, auditing, lineage, and data discovery capabilities across Databricks workspaces. In short, Unity Catalog unifies all data and AI assets across all workspaces on any cloud platform.
Key features of Unity Catalog
- Define once, secure everywhere
Unity Catalog offers a single place to administer data access policies that apply across all workspaces.
- Standards-compliant security model
Unity Catalog’s security model is based on standard ANSI SQL and allows administrators to grant permissions in their existing data lake using familiar syntax, at the level of catalogs, databases (also called schemas), tables, and views.
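As a sketch of that familiar ANSI-style syntax, permissions can be granted at each level of the hierarchy (the catalog, schema, table, and group names below are illustrative placeholders):

```sql
-- Grant access at the catalog, schema, and table level
-- (all names here are hypothetical examples)
GRANT USE CATALOG ON CATALOG main TO `analysts`;
GRANT USE SCHEMA ON SCHEMA main.sales TO `analysts`;
GRANT SELECT ON TABLE main.sales.orders TO `analysts`;
```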
- Data discovery
Unity Catalog lets you tag and document data assets, and provides a search interface to help data consumers find data.
- Built-in auditing and lineage
Unity Catalog automatically captures user-level audit logs that record access to your data. Unity Catalog also captures lineage data that tracks how data assets are created and used across all languages.
- System tables (Public Preview)
Unity Catalog lets you easily access and query your account’s operational data, including audit logs, billable usage, and lineage.
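As an example, audit events can be queried directly with SQL once system tables are enabled; the `system.access.audit` table and its columns reflect the current public preview and may change:

```sql
-- Query recent audit events from the system tables (Public Preview)
SELECT event_time, user_identity.email, action_name
FROM system.access.audit
WHERE event_date >= current_date() - INTERVAL 7 DAYS
ORDER BY event_time DESC
LIMIT 100;
```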
Unity Catalog Object Model
In Unity Catalog, the hierarchy of primary data objects flows from metastore to table or volume:
![notion image](https://www.notion.so/image/https%3A%2F%2Fprod-files-secure.s3.us-west-2.amazonaws.com%2F16ebcd74-563a-48a2-9ce7-686f30d5c337%2F48070b96-bd34-4b22-b97d-4fc259fa6373%2FUntitled.png%3FspaceId%3D16ebcd74-563a-48a2-9ce7-686f30d5c337?table=block&id=12e4f88c-5dce-47ac-b098-7e5634de41a3&cache=v2)
- Metastore: The top-level container for metadata. Each metastore exposes a three-level namespace (catalog.schema.table) that organises your data. It registers metadata about data and AI assets and the permissions that govern access to them.
- Catalog: The first layer of the object hierarchy, used to organise your data assets. Users can see all catalogs on which they have been assigned the USE CATALOG permission.
- Schema: A schema, also known as a database, is the second layer of the object hierarchy and contains tables and views. Users can see all schemas on which they have been granted the USE SCHEMA permission, along with the USE CATALOG permission on the schema’s parent catalog.
- Tables, views, and volumes: At the lowest level in the data object hierarchy are tables, views, and volumes. Volumes provide governance for non-tabular data.
- Models: Although they are not, strictly speaking, data assets, registered models can also be managed in Unity Catalog and reside at the lowest level in the object hierarchy.
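The three-level namespace means every table is addressed as catalog.schema.table. A minimal sketch, with placeholder names:

```sql
-- Fully qualified reference: <catalog>.<schema>.<table>
SELECT * FROM main.sales.orders;

-- Or set the context first and use shorter names
USE CATALOG main;
USE SCHEMA sales;
SELECT * FROM orders;
```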
How to Setup Unity Catalog in Databricks
Step 1: Confirm that your workspace is enabled for Unity Catalog
In this step, you determine whether your workspace is already enabled for Unity Catalog, where enablement is defined as having a Unity Catalog metastore attached to the workspace. If your workspace is not enabled for Unity Catalog, you must enable your workspace for Unity Catalog manually.
To confirm, use the account console:
- As a Databricks account admin, log into the account console.
- Click Workspaces.
- Find your workspace and check the Metastore column. If a metastore name is present, your workspace is attached to a Unity Catalog metastore and therefore enabled for Unity Catalog.
Next steps if your workspace is not enabled for Unity Catalog
If your workspace is not enabled for Unity Catalog (attached to a metastore), the next step depends on whether or not you already have a Unity Catalog metastore defined for your workspace region:
- If your account already has a Unity Catalog metastore defined for your workspace region, you can simply attach your workspace to the existing metastore. Go to Enable your workspace for Unity Catalog
- If there is no Unity Catalog metastore defined for your workspace’s region, you must create a metastore and then attach the workspace. Go to Create a Unity Catalog metastore.
When your workspace is enabled for Unity Catalog, go to the next step.
Step 2: Add users and assign the workspace admin role
The user who creates the workspace is automatically added as a workspace user with the workspace admin role. As a workspace admin, you can add and invite users to the workspace, can assign the workspace admin role to other users, and can create service principals and groups.
Account admins also have the ability to add users, service principals, and groups to your workspace. They can grant the account admin and metastore admin roles.
Step 3: Create clusters or SQL warehouses that users can use to run queries and create objects
To run Unity Catalog workloads, compute resources must comply with certain security requirements. Non-compliant compute resources cannot access data or other objects in Unity Catalog. SQL warehouses always comply with Unity Catalog requirements, but some cluster access modes do not. See Access modes.
As a workspace admin, you can opt to make compute creation restricted to admins or let users create their own SQL warehouses and clusters. You can also create cluster policies that enable users to create their own clusters, using Unity Catalog-compliant specifications that you enforce. See Cluster access control and Create and manage compute policies.
Step 4: Grant privileges to users
To create objects in Unity Catalog catalogs and schemas, and to access them, a user must have the appropriate permissions. The samples below show how to grant those permissions to users.
Grant privileges
For example, to grant a group the ability to create new schemas in demo_instance, the catalog owner can run the following in the SQL Editor or a notebook:
```sql
GRANT CREATE SCHEMA ON CATALOG demo_instance TO `data-enthusiasts`;
```
If your workspace was enabled for Unity Catalog automatically, the workspace admin owns the workspace catalog and can grant the ability to create new schemas:
```sql
GRANT CREATE SCHEMA ON CATALOG <workspace-catalog> TO `data-enthusiasts`;
```
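Note that to read a table, a principal needs privileges at every level of the hierarchy, not just on the table itself. A sketch, using hypothetical schema and table names:

```sql
-- USE CATALOG and USE SCHEMA are prerequisites for SELECT on a table
-- (schema and table names here are illustrative)
GRANT USE CATALOG ON CATALOG demo_instance TO `data-enthusiasts`;
GRANT USE SCHEMA ON SCHEMA demo_instance.analytics TO `data-enthusiasts`;
GRANT SELECT ON TABLE demo_instance.analytics.trips TO `data-enthusiasts`;
```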
Step 5: Create new catalogs and schemas
To start using Unity Catalog, you must have at least one catalog defined. Catalogs are the primary unit of data isolation and organisation in Unity Catalog. All schemas and tables live in catalogs, as do volumes, views, and models.
Catalog creation example
The following example shows the creation of a catalog with managed storage, followed by granting the SELECT privilege on the catalog:
```sql
CREATE CATALOG IF NOT EXISTS mycatalog
  MANAGED LOCATION 's3://dept/cloud-data';

GRANT SELECT ON CATALOG mycatalog TO `data-enthusiasts`;
```
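Once the catalog exists, schemas and tables follow the same pattern. A minimal sketch, with placeholder schema, table, and column names:

```sql
-- Create a schema inside the new catalog, then a managed table in it
CREATE SCHEMA IF NOT EXISTS mycatalog.analytics;

CREATE TABLE IF NOT EXISTS mycatalog.analytics.trips (
  trip_id BIGINT,
  fare DOUBLE
);
```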
Use Cases of Unity Catalog
- Data Lineage
Unity Catalog supports end-to-end visibility of data flow through the Lakehouse, visualised with data lineage. Lineage graphs are access-control-aware and restrict what each user can see based on their permissions. You can visualise lineage via the UI or make API calls to integrate it with other catalogs. Databricks scans the underlying Spark code written in notebooks to document lineage automatically, which makes it easy for organisations to adopt.
![notion image](https://www.notion.so/image/https%3A%2F%2Fprod-files-secure.s3.us-west-2.amazonaws.com%2F16ebcd74-563a-48a2-9ce7-686f30d5c337%2F84ec1ce7-4f44-4035-a228-fe623f587134%2FUntitled.png%3FspaceId%3D16ebcd74-563a-48a2-9ce7-686f30d5c337?table=block&id=5fe75c68-46ca-407b-8ab3-38c0fcbcc005&cache=v2)
This makes capturing data lineage for your pipelines with Databricks both convenient and low-effort.
- Data Governance
The Databricks Unity Catalog is designed to provide a search and discovery experience backed by a central repository of all data assets, such as files, tables, views, and dashboards. Coupled with a data governance framework and an extensive audit log of all actions performed on the data stored in a Databricks account, this makes Unity Catalog very attractive for businesses.
Challenges and Limitations
Unity Catalog limitations vary by Databricks Runtime and access mode. Structured Streaming workloads have additional limitations based on Databricks Runtime and access mode.
- R Workloads Limitations: Workloads in R don't support dynamic views for row-level or column-level security.
- Shallow Clones: Shallow clones to create Unity Catalog managed tables are supported from Databricks Runtime 13.1 and above.
- Bucketing Limitations: Bucketing isn't supported for Unity Catalog tables and attempts to create them result in exceptions.
- Cross-Region Writing: Writing to the same path or Delta Lake table from workspaces in multiple regions can lead to unreliable performance.
- Partition Scheme Limitations: Custom partition schemes created using commands like ALTER TABLE ADD PARTITION aren't supported for Unity Catalog tables.
- Overwrite Mode for DataFrame Write Operations: Overwrite mode for DataFrame write operations into Unity Catalog is supported only for Delta tables.
- Workspace-Level Groups Limitation: Groups created at the workspace level can't be used in Unity Catalog GRANT statements. Use account-level groups instead.
- Thread Pool Usage: Standard Scala thread pools aren't supported; use the special thread pools in org.apache.spark.util.ThreadUtils instead.
The following limitations apply for all object names in Unity Catalog:
- Object names cannot exceed 255 characters.
- The following special characters are not allowed:
- Period (.)
- Space ( )
- Forward slash (/)
- All ASCII control characters (00-1F hex)
- The DELETE character (7F hex)
- Unity Catalog stores all object names as lowercase.
- When referencing UC names in SQL, you must use backticks to escape names that contain special characters such as hyphens (-).
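For example, a hyphenated schema or table name must be wrapped in backticks wherever it appears (the names below are placeholders):

```sql
-- Backticks escape special characters such as hyphens in names
SELECT * FROM main.`my-schema`.`daily-report`;
GRANT SELECT ON TABLE main.`my-schema`.`daily-report` TO `data-enthusiasts`;
```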