Announcing the General Availability of Row and Column Level Security with Databricks Unity Catalog



We are excited to announce the general availability of Row Filters and Column Masks in Unity Catalog on AWS, Azure and GCP! Managing fine-grained access controls on rows and columns in tables is critical to ensuring data security and meeting compliance requirements. With Unity Catalog, you can define row filters and column masks as standard SQL functions. Row Filters let you control which subsets of your tables’ rows are visible to hierarchies of groups and users within your organization. Column Masks let you redact your table values based on the same dimensions.

“Distributing data governance through Databricks Unity Catalog transformed Akamai’s approach to managing and governing data. With Unity Catalog, we are now managing and governing over six petabytes of data with fine-grained access controls on rows and columns.”

— Gilad Asulin, Big Data Team Leader, Akamai

This blog discusses how you can enable fine-grained access controls using Row Filters and Column Masks.

What are Coarse-Grained Entity-Level Permissions?

Before this announcement, Unity Catalog already supported entity-level permissions. For example, you can use GRANT and REVOKE SQL commands over securable objects such as tables and functions to adjust which users and groups are allowed to inspect, query, or modify them:

USE CATALOG main;
CREATE SCHEMA accounts;
CREATE TABLE accounts.purchase_history(
  amount_cents BIGINT,
  region STRING,
  payment_type STRING,
  purchase_date DATE DEFAULT CURRENT_DATE())
USING DELTA;

We can grant read access to the accounts_team:

GRANT SELECT ON TABLE accounts.purchase_history TO accounts_team;

Now, the accounts_team has access to query (but not modify) the purchase_history table.

Prior Approaches for Sharing Subsets of Data with Different Groups

But what if we have separate accounts teams for different regions? Previously, we could create a daily job to copy subsets of data into different tables and set their permissions accordingly:

-- Create a table for data from the EMEA region and grant
-- read access to the corresponding accounts group.
CREATE TABLE accounts.purchase_history_emea(
  amount_cents BIGINT,
  payment_type STRING,
  purchase_date DATE DEFAULT CURRENT_DATE())
USING DELTA;

GRANT SELECT ON TABLE accounts.purchase_history_emea TO accounts_team_emea;

-- Run this daily to update the custom table.
-- Use the previous day to make sure all the data is available before
-- copying it.
INSERT INTO accounts.purchase_history_emea
SELECT * EXCEPT (region) FROM accounts.purchase_history
WHERE region = 'EMEA' AND purchase_date = DATE_SUB(CURRENT_DATE(), 1);

While this approach effectively addresses query needs, it comes with drawbacks. By duplicating data, we increase storage and compute usage. Also, the duplicated data lags behind the original, introducing staleness. Moreover, this solution serves only read queries: regional users hold SELECT on the copy alone, so any writes must still go to the primary table.

Another strategy uses dynamic views. Before row filters, you could define a view intended for consumption by specific users or groups:

CREATE VIEW accounts.purchase_history_emea
AS SELECT amount_cents, payment_type, purchase_date
FROM accounts.purchase_history
WHERE region = 'EMEA';

GRANT SELECT ON VIEW accounts.purchase_history_emea
TO accounts_team_emea;

Now we’ve solved the data copying problem, but users still have to remember to query the accounts.purchase_history_emea view if they are in the EMEA region or the accounts.purchase_history_apac view if they are in the APAC region, and so on.

From an administrator’s perspective, dynamic views also create complexity for several reasons:

  • Numerous views must be created and maintained, one for each region
  • Shared SQL logic is cumbersome to reuse across different regional teams
  • The extra views clutter the Catalog Explorer
  • Views are limited to queries: data cannot be inserted or updated through them
Introducing Row Filters

With row filters, you can apply predicates to a table, ensuring that only rows meeting specific criteria are returned in subsequent queries.

Each row filter is implemented as a SQL user-defined function (UDF). To begin, write a SQL UDF that returns a boolean and whose parameter types match the types of the target-table columns you want to control access by.

For consistency, let’s continue using the region column of the previous accounts.purchase_history table for this purpose.

CREATE FUNCTION accounts.purchase_history_row_filter(region STRING)
RETURN CASE
  WHEN IS_ACCOUNT_GROUP_MEMBER('emea') THEN region = 'EMEA'
  WHEN IS_ACCOUNT_GROUP_MEMBER('admin') THEN TRUE
  ELSE FALSE
END;

We can test this logic by performing a few queries over the target table and applying the function directly. For a member of accounts_team_emea who belongs to the emea account group that the function checks, such a query might look like this:

SELECT amount_cents,
  region,
  accounts.purchase_history_row_filter(region) AS filtered 
FROM accounts.purchase_history;

+--------------+--------+----------+
| amount_cents | region | filtered |
+--------------+--------+----------+
| 42           | EMEA   | TRUE     |
| 1042         | EMEA   | TRUE     |
| 2042         | APAC   | FALSE    |
+--------------+--------+----------+

Or, for someone in the admin group (and not in emea) who is setting the access control logic in the first place, the function returns TRUE for every row:

SELECT amount_cents, region, accounts.purchase_history_row_filter(region) AS filtered
FROM accounts.purchase_history;

+--------------+--------+----------+
| amount_cents | region | filtered |
+--------------+--------+----------+
| 42           | EMEA   | TRUE     |
| 1042         | EMEA   | TRUE     |
| 2042         | APAC   | TRUE     |
+--------------+--------+----------+

Now we’re ready to apply this logic to our target table as a policy function, and grant read access to the accounts_team_emea group:

ALTER TABLE accounts.purchase_history
SET ROW FILTER accounts.purchase_history_row_filter ON (region);

GRANT SELECT ON TABLE accounts.purchase_history TO accounts_team_emea;

Or, we can assign this policy directly to the table at creation time to make sure there is no period where the table exists, but the policy does not yet apply:

CREATE TABLE accounts.purchase_history(
  amount_cents BIGINT,
  region STRING,
  payment_type STRING,
  purchase_date DATE DEFAULT CURRENT_DATE())
USING DELTA
WITH ROW FILTER accounts.purchase_history_row_filter ON (region);

GRANT SELECT ON TABLE accounts.purchase_history TO accounts_team_emea;

After that, querying the table should return the subsets of rows corresponding to the results of our testing above. For example, accounts_team_emea members who belong to the emea account group will receive the following result:

SELECT amount_cents, region FROM accounts.purchase_history;

+--------------+--------+
| amount_cents | region |
+--------------+--------+
| 42           | EMEA   |
| 1042         | EMEA   |
+--------------+--------+

Now, we can share the same accounts.purchase_history table with different groups without copying the data or adding many new names into our namespace.
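To make the filter's semantics concrete without a workspace, here is a minimal plain-Python model of the policy function defined earlier. It is only a sketch under stated assumptions: `row_filter` and `visible_rows` are hypothetical helper names, group membership is passed in as a set instead of being resolved by IS_ACCOUNT_GROUP_MEMBER, and the three rows are the sample rows from the result tables above.

```python
# Plain-Python model of accounts.purchase_history_row_filter.
# Illustrative only: this is not a Databricks API.

ROWS = [
    {"amount_cents": 42, "region": "EMEA"},
    {"amount_cents": 1042, "region": "EMEA"},
    {"amount_cents": 2042, "region": "APAC"},
]


def row_filter(region, groups):
    """Mirror the CASE expression: branches are tried in order, first match wins."""
    if "emea" in groups:
        return region == "EMEA"
    if "admin" in groups:
        return True
    return False


def visible_rows(rows, groups):
    """Keep only the rows for which the filter returns True for this user."""
    return [row for row in rows if row_filter(row["region"], groups)]


if __name__ == "__main__":
    print(visible_rows(ROWS, {"emea"}))   # the two EMEA rows
    print(visible_rows(ROWS, {"admin"}))  # all three rows
    print(visible_rows(ROWS, set()))      # no rows
```

One consequence of the branch order is worth noting: a user who belongs to both emea and admin matches the first branch and therefore sees only EMEA rows. If administrators should always see everything, the admin check would need to come first.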

You can view this information in Catalog Explorer. Looking at the purchase_history table, we see that a row filter applies.

Clicking on the row filter, we can see the policy function name:

[Screenshot: the row filter's policy function name in Catalog Explorer]

Following the “view” button reveals the function contents:

[Screenshot: the policy function's contents]

Introducing Column Masks

We have demonstrated how to create and apply fine-grained access controls to tables using row filters, selectively filtering out rows that the invoking user does not have access to read at query time. But what if we want to control access to columns instead, eliding some column values and leaving others intact within each row?

Here we announce column masks!

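Each column mask is also implemented as a SQL user-defined function (UDF). However, unlike row filter functions, which return boolean results, each column mask policy function takes the column value as its first argument and returns a value of the same type. It can also accept additional columns as further arguments, as the USING COLUMNS example later in this section shows.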

Let’s go ahead and mask out the purchase amount column of the accounts.purchase_history table when the value is one thousand cents ($10) or more:

CREATE FUNCTION accounts.purchase_history_mask(amount_cents BIGINT)
RETURN CASE
  WHEN IS_ACCOUNT_GROUP_MEMBER('admin') THEN amount_cents
  WHEN amount_cents < 1000 THEN amount_cents
  ELSE NULL
END;

Now, only administrators have permission to look at the purchase amounts of $10 or greater.

Let’s go ahead and test the policy function. Non-admin users see this:

SELECT amount_cents,
  accounts.purchase_history_mask(amount_cents) AS masked,
  region
FROM accounts.purchase_history;

+--------------+--------+----------+
| amount_cents | masked | region   |
+--------------+--------+----------+
| 42           | 42     | EMEA     |
| 1042         | NULL   | EMEA     |
| 2042         | NULL   | APAC     |
+--------------+--------+----------+

But administrators have access to view all the data:

SELECT amount_cents,
  accounts.purchase_history_mask(amount_cents) AS masked,
  region
FROM accounts.purchase_history;

+--------------+--------+----------+
| amount_cents | masked | region   |
+--------------+--------+----------+
| 42           | 42     | EMEA     |
| 1042         | 1042   | EMEA     |
| 2042         | 2042   | APAC     |
+--------------+--------+----------+

Looks great! Let’s apply the mask to our table:

ALTER TABLE accounts.purchase_history
ALTER COLUMN amount_cents
SET MASK accounts.purchase_history_mask;

After that, querying from the table should redact specific column values corresponding to the results of our testing above. For example, non-administrators will receive the following result:

SELECT amount_cents, region FROM accounts.purchase_history;

+--------------+--------+
| amount_cents | region |
+--------------+--------+
| 42           | EMEA   |
| NULL         | EMEA   |
| NULL         | APAC   |
+--------------+--------+

It works correctly.
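The masking rule, including its boundary at exactly 1000 cents, can be modelled in a few lines of plain Python. This is a sketch and not Databricks code: `masked_amounts` is a hypothetical helper, group membership is passed in as a set, `None` stands in for SQL NULL, and the amounts 999 and 1000 are extra illustrative values that do not appear in the sample table.

```python
# Plain-Python model of accounts.purchase_history_mask.
# Illustrative only: this is not a Databricks API.


def purchase_history_mask(amount_cents, groups):
    """Return the amount for admins or for amounts under 1000 cents, else None."""
    if "admin" in groups:
        return amount_cents
    # In SQL, NULL < 1000 is not true, so a NULL amount falls through to ELSE NULL.
    if amount_cents is not None and amount_cents < 1000:
        return amount_cents
    return None


def masked_amounts(amounts, groups):
    """Apply the mask to every value, as a query over the masked column would."""
    return [purchase_history_mask(amount, groups) for amount in amounts]


if __name__ == "__main__":
    amounts = [42, 999, 1000, 1042, 2042]
    print(masked_amounts(amounts, set()))      # [42, 999, None, None, None]
    print(masked_amounts(amounts, {"admin"}))  # [42, 999, 1000, 1042, 2042]
```

Unlike a row filter, the mask never changes how many rows come back; it only rewrites the value in the masked column.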

We can also inspect the values of other columns to make our masking decision. For example, we can modify the function to look at the region column instead of the purchase amount:

ALTER TABLE accounts.purchase_history ALTER COLUMN amount_cents DROP MASK;

CREATE FUNCTION accounts.purchase_history_region_mask(
  amount_cents BIGINT,
  region STRING)
RETURN CASE
  WHEN IS_ACCOUNT_GROUP_MEMBER('admin') THEN amount_cents
  WHEN region = 'APAC' THEN amount_cents
  ELSE NULL
END;

Now we can apply the mask with the USING COLUMNS clause to specify the additional column name(s) to pass into the policy function:

ALTER TABLE accounts.purchase_history
ALTER COLUMN amount_cents
SET MASK accounts.purchase_history_region_mask
USING COLUMNS (region);

Thereafter, querying from the table should redact certain column values differently for non-administrators:

SELECT amount_cents, region FROM accounts.purchase_history;

+--------------+--------+
| amount_cents | region |
+--------------+--------+
| NULL         | EMEA   |
| NULL         | EMEA   |
| 2042         | APAC   |
+--------------+--------+
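As a rough illustration of how a mask with an extra input column behaves, the sketch below models the two-argument policy function in plain Python. The `query` helper is hypothetical, group membership is passed in as a set, and `None` stands in for SQL NULL; the rows are the sample rows used throughout this post.

```python
# Plain-Python model of accounts.purchase_history_region_mask.
# Illustrative only: this is not a Databricks API.

ROWS = [(42, "EMEA"), (1042, "EMEA"), (2042, "APAC")]


def purchase_history_region_mask(amount_cents, region, groups):
    """First argument is the masked column; region is the extra USING COLUMNS input."""
    if "admin" in groups:
        return amount_cents
    if region == "APAC":
        return amount_cents
    return None


def query(rows, groups):
    """Model SELECT amount_cents, region: every row is kept, one column is rewritten."""
    return [
        (purchase_history_region_mask(amount, region, groups), region)
        for amount, region in rows
    ]


if __name__ == "__main__":
    print(query(ROWS, set()))      # [(None, 'EMEA'), (None, 'EMEA'), (2042, 'APAC')]
    print(query(ROWS, {"admin"}))  # [(42, 'EMEA'), (1042, 'EMEA'), (2042, 'APAC')]
```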

We can look at the mask by looking at the table column in the Catalog Explorer:

[Screenshot: the column mask on the table column in Catalog Explorer]

Like before, following the “view” button reveals the function contents:

[Screenshot: the mask function's contents]

Storing Access Control Lists in Mapping Tables

Row filter and column mask policy functions almost always need to compare the current user against a list of allowed users, or check the user’s group memberships against an explicit list of allowed groups. Listing these allowlists in the policy functions themselves works well for lists of reasonable size. For larger lists, or when we want extra assurance that the identities of the allowed users or groups stay hidden from other users, we can use mapping tables instead.

These mapping tables act like personalized gatekeepers, deciding which data rows users or groups can access in your original table. The beauty of mapping tables lies in their seamless integration with fact tables, making your data security strategy more effective.

This approach is a game-changer for various custom requirements:

  • Tailored User Access: You can impose restrictions based on individual user profiles while accommodating specific rules for user groups. This ensures that each user sees only what they should.
  • Handling Complex Hierarchies: Whether it’s intricate organizational structures or diverse sets of rules, mapping tables can navigate the complexities, ensuring that data access adheres to your unique hierarchy.
  • Seamless External Model Replication: Replicating complex security models from external source systems becomes a breeze. Mapping tables help you mirror these intricate setups without breaking a sweat.

For example:

CREATE TABLE accounts.purchase_history_groups
AS VALUES ('emea'), ('apac') t(group);

CREATE OR REPLACE FUNCTION accounts.purchase_history_row_filter(region STRING)
RETURN EXISTS(SELECT 1 FROM accounts.purchase_history_groups phg
WHERE IS_ACCOUNT_GROUP_MEMBER(phg.group));

Now, we can extend the accounts.purchase_history_groups table to large numbers of groups without making the policy function itself complex, and also restrict access to the rows of that table to only the administrator that created the accounts.purchase_history_row_filter SQL UDF.
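The EXISTS lookup can be modelled in plain Python to show what the function above actually decides. This is a sketch, not Databricks code: the list `PURCHASE_HISTORY_GROUPS` stands in for the mapping table, and group membership is passed in as a set. Note that the SQL function as written never consults its region argument, so a member of any listed group passes the filter for every row. The second function, `region_aware_row_filter`, and its `GROUP_REGIONS` mapping are hypothetical additions that show one way a mapping table could also carry the region each group may read.

```python
# Plain-Python model of the mapping-table row filter.
# Illustrative only: this is not a Databricks API.

# Stand-in for the accounts.purchase_history_groups table.
PURCHASE_HISTORY_GROUPS = ["emea", "apac"]


def row_filter(region, user_groups, allowed_groups=PURCHASE_HISTORY_GROUPS):
    """Model EXISTS(SELECT 1 ... WHERE IS_ACCOUNT_GROUP_MEMBER(phg.group)).

    As in the SQL version, the region argument is accepted but never used.
    """
    return any(group in user_groups for group in allowed_groups)


# Hypothetical variant: each mapping row pairs a group with a region it may read.
GROUP_REGIONS = [("emea", "EMEA"), ("apac", "APAC")]


def region_aware_row_filter(region, user_groups, mapping=GROUP_REGIONS):
    """Pass only when the user is in a group mapped to this row's region."""
    return any(
        group in user_groups and allowed_region == region
        for group, allowed_region in mapping
    )


if __name__ == "__main__":
    print(row_filter("APAC", {"emea"}))               # True
    print(region_aware_row_filter("APAC", {"emea"}))  # False
    print(region_aware_row_filter("EMEA", {"emea"}))  # True
```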

Using Row and Column Level Security with Lakehouse Federation

With Lakehouse Federation, Unity Catalog solves critical data management challenges to simplify how organizations handle disparate data systems. This provides the ability to create a unified view of your entire data estate, structured and unstructured, enabling secure access and exploration for all users regardless of data source. It allows efficient querying and data combination through a single engine, accelerating various data analysis and AI applications without requiring data ingestion. Additionally, it provides a consistent permission model for data security, applying access rules and ensuring compliance across different platforms.

The fine-grained access controls announced here work seamlessly with Lakehouse Federation tables. You can share federated tables within your organization with custom row- and column-level access policies for different groups, without copying data or creating many duplicate or similar table and view names in your catalogs.

For example, you can create a federated connection to an existing MySQL database. Then, browse the Catalog Explorer to inspect the foreign catalog:

[Screenshot: the foreign catalog in Catalog Explorer]

Inside the catalog, we find a mysql_demo_nyc_pizza_rating table:

[Screenshot: the mysql_demo_nyc_pizza_rating table in the foreign catalog]

Let’s apply our row filter to that table:

ALTER TABLE mysql_catalog.qf_mysql_demo_database.mysql_demo_nyc_pizza_rating 
SET ROW FILTER main.accounts.purchase_history_row_filter ON (name);

Looking at the table overview afterwards, it reflects the change:

[Screenshot: the table overview showing the row filter]

Clicking on the row filter reveals the name of the function, just like before:

[Screenshot: the row filter's function name]

Now, queries over this federated MySQL table will return different subsets of rows depending on each invoking user’s identity and group memberships. We’ve successfully integrated fine-grained access control with Lakehouse Federation, resulting in simplified usability and unified governance for Delta Lake and MySQL tables in the same organization.

Getting started with Row and Column Level Security

With Row Filters and Column Masks, you now gain the power to streamline your data management, making excessive ETL pipelines and data copies a thing of the past. This is your gateway to a new world of unified data security, where you can confidently share data with multiple users and groups, all while maintaining control and ensuring that sensitive information remains protected.

To get started with Row Filters and Column Masks, check out our documentation for AWS, Azure, and GCP.

Our team will discuss this release and other advanced access controls in Unity Catalog in our Data + AI Summit 2024 session, “Attribute-Based Access Controls in Unity Catalog—Building a Scalable Access Management Framework.” We hope to see you the week of June 10th. Register for Data + AI Summit today!




By stp2y
