We are excited to announce the general availability of Row Filters and Column Masks in Unity Catalog on AWS, Azure, and GCP! Managing fine-grained access controls on rows and columns in tables is critical for ensuring data security and meeting compliance requirements. With Unity Catalog, you can define row filters and column masks as standard SQL functions. Row Filters let you control which subsets of your tables’ rows are visible to hierarchies of groups and users within your organization. Column Masks let you redact your table values based on the same dimensions.
“Distributing data governance through Databricks Unity Catalog transformed Akamai’s approach to managing and governing data. With Unity Catalog, we are now managing and governing over six petabytes of data with fine-grained access controls on rows and columns.”
— Gilad Asulin, Big Data Team Leader, Akamai
This blog discusses how you can enable fine-grained access controls using Row Filters and Column Masks.
What are Coarse-Grained Entity-Level Permissions?
Before this announcement, Unity Catalog already supported entity-level permissions. For example, you can use GRANT and REVOKE SQL commands over securable objects such as tables and functions to adjust which users and groups are allowed to inspect, query, or modify them. Suppose we have the following table:
USE CATALOG main;
CREATE SCHEMA accounts;
CREATE TABLE accounts.purchase_history(
amount_cents BIGINT,
region STRING,
payment_type STRING,
purchase_date DATE DEFAULT CURRENT_DATE())
USING DELTA;
We can grant read access to the accounts_team group:
GRANT SELECT ON TABLE accounts.purchase_history TO accounts_team;
Now, the accounts_team group can query (but not modify) the purchase_history table.
Prior Approaches for Sharing Subsets of Data with Different Groups
But what if we have separate accounts teams for different regions? Previously, one option was to create a daily job that copies subsets of the data into different tables and to set their permissions accordingly:
-- Create a table for data from the EMEA region and grant
-- read access to the corresponding accounts group.
CREATE TABLE accounts.purchase_history_emea(
amount_cents BIGINT,
payment_type STRING,
purchase_date DATE DEFAULT CURRENT_DATE())
USING DELTA;
GRANT SELECT ON TABLE accounts.purchase_history_emea TO accounts_team_emea;
-- Run this daily to update the custom table.
-- Use the previous day to make sure all the data is available before
-- copying it.
INSERT INTO accounts.purchase_history_emea
SELECT * EXCEPT (region) FROM accounts.purchase_history
WHERE region = 'EMEA' AND purchase_date = DATE_SUB(CURRENT_DATE(), 1);
While this approach addresses query needs, it comes with drawbacks. Duplicating data increases storage and compute usage. The duplicated data also lags behind the original, introducing staleness. Moreover, this solution serves only queries: users hold read-only permissions on the copies, so any writes still have to go to the primary table.
Another strategy uses dynamic views. Until now, you could define a view intended for consumption by specific users or groups:
CREATE VIEW accounts.purchase_history_emea
AS SELECT amount_cents, payment_type, purchase_date
FROM accounts.purchase_history
WHERE region = 'EMEA';
GRANT SELECT ON VIEW accounts.purchase_history_emea
TO accounts_team_emea;
Now we’ve solved the data copying problem, but users still have to remember to query the accounts.purchase_history_emea view if they are in the EMEA region, the accounts.purchase_history_apac view if they are in the APAC region, and so on.
From an administrator’s perspective, dynamic views also create complexity for several reasons:
- Administrators must create and maintain a separate view for each region
- Shared SQL logic is cumbersome to reuse across different regional teams
- The extra views clutter the Catalog Explorer
- Views are limited to queries: users cannot insert or update data through them
Introducing Row Filters
With row filters, you can apply predicates to a table, ensuring that only rows meeting specific criteria are returned in subsequent queries.
Each row filter is implemented as a SQL user-defined function (UDF). To begin, write a SQL UDF with a boolean result whose parameter type(s) are the same as the column(s) of your target table that you want to control access by.
For consistency, let’s continue using the region column of the previous accounts.purchase_history table for this purpose.
CREATE FUNCTION accounts.purchase_history_row_filter(region STRING)
RETURN CASE
WHEN IS_ACCOUNT_GROUP_MEMBER('emea') THEN region = 'EMEA'
WHEN IS_ACCOUNT_GROUP_MEMBER('admin') THEN TRUE
ELSE FALSE
END;
We can test this logic by running a few queries over the target table and applying the function directly. For someone in the accounts_team_emea group, such a query might look like this:
SELECT amount_cents,
region,
accounts.purchase_history_row_filter(region) AS filtered
FROM accounts.purchase_history;
+--------------+--------+----------+
| amount_cents | region | filtered |
+--------------+--------+----------+
| 42           | EMEA   | TRUE     |
| 1042         | EMEA   | TRUE     |
| 2042         | APAC   | FALSE    |
+--------------+--------+----------+
Or, for someone in the admin group who is setting the access control logic in the first place, the function returns TRUE for every row in the table:
SELECT amount_cents, region, accounts.purchase_history_row_filter(region) AS filtered
FROM accounts.purchase_history;
+--------------+--------+----------+
| amount_cents | region | filtered |
+--------------+--------+----------+
| 42           | EMEA   | TRUE     |
| 1042         | EMEA   | TRUE     |
| 2042         | APAC   | TRUE     |
+--------------+--------+----------+
Now we’re ready to apply this logic to our target table as a policy function and grant read access to the accounts_team_emea group:
ALTER TABLE accounts.purchase_history
SET ROW FILTER accounts.purchase_history_row_filter ON (region);
GRANT SELECT ON TABLE accounts.purchase_history TO accounts_team_emea;
Or, we can assign this policy directly to the table at creation time to make sure there is no period where the table exists, but the policy does not yet apply:
CREATE TABLE accounts.purchase_history(
amount_cents BIGINT,
region STRING,
payment_type STRING,
purchase_date DATE DEFAULT CURRENT_DATE())
USING DELTA
WITH ROW FILTER accounts.purchase_history_row_filter ON (region);
GRANT SELECT ON TABLE accounts.purchase_history TO accounts_team_emea;
After that, querying the table should return the subsets of rows corresponding to the results of our testing above. For example, the accounts_team_emea members will receive the following result:
SELECT amount_cents, region FROM accounts.purchase_history;
+--------------+--------+
| amount_cents | region |
+--------------+--------+
| 42           | EMEA   |
| 1042         | EMEA   |
+--------------+--------+
Now, we can share the same accounts.purchase_history table with different groups without copying the data or adding many new names into our namespace.
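To make the filter's behavior concrete, here is a minimal plain-Python sketch of the CASE logic above. It is a model for reasoning, not Databricks code: the `user_groups` set is a hypothetical stand-in for what IS_ACCOUNT_GROUP_MEMBER checks, and `visible_rows` is an illustrative helper that plays the role of the table scan.

```python
# Plain-Python model of accounts.purchase_history_row_filter (illustrative only).
# `user_groups` is a hypothetical stand-in for IS_ACCOUNT_GROUP_MEMBER lookups.
def purchase_history_row_filter(region, user_groups):
    if "emea" in user_groups:      # WHEN IS_ACCOUNT_GROUP_MEMBER('emea')
        return region == "EMEA"
    if "admin" in user_groups:     # WHEN IS_ACCOUNT_GROUP_MEMBER('admin')
        return True
    return False                   # ELSE FALSE


# The three sample rows used in the result tables above.
ROWS = [
    {"amount_cents": 42, "region": "EMEA"},
    {"amount_cents": 1042, "region": "EMEA"},
    {"amount_cents": 2042, "region": "APAC"},
]


def visible_rows(rows, user_groups):
    """Keep only the rows for which the filter returns True."""
    return [row for row in rows
            if purchase_history_row_filter(row["region"], user_groups)]


emea_amounts = [r["amount_cents"] for r in visible_rows(ROWS, {"emea"})]
admin_amounts = [r["amount_cents"] for r in visible_rows(ROWS, {"admin"})]
both_amounts = [r["amount_cents"] for r in visible_rows(ROWS, {"emea", "admin"})]
other_amounts = [r["amount_cents"] for r in visible_rows(ROWS, set())]
```

One consequence worth noticing: because a CASE expression takes the first matching branch, a user who belongs to both groups sees only the EMEA rows in this model. If administrators should always see everything, the admin check would need to come first.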
You can also view this information in the Catalog Explorer. The overview of the purchase_history table shows that a row filter applies. Clicking on the row filter shows the policy function name, and following the “view” button reveals the function contents.
Introducing Column Masks
We have demonstrated how to create and apply fine-grained access controls to tables using row filters, selectively filtering out rows that the invoking user does not have access to read at query time. But what if we want to control access to columns instead, eliding some column values and leaving others intact within each row?
This is where column masks come in!
Each column mask is also implemented as a SQL user-defined function (UDF). However, unlike row filter functions, which return boolean results, each column mask policy function accepts the column value as its argument and returns a value of the same type as that input.
Let’s go ahead and mask out the purchase amount column of the accounts.purchase_history table when the value is one thousand cents or more:
CREATE FUNCTION accounts.purchase_history_mask(amount_cents BIGINT)
RETURN CASE
WHEN IS_ACCOUNT_GROUP_MEMBER('admin') THEN amount_cents
WHEN amount_cents < 1000 THEN amount_cents
ELSE NULL
END;
Now, only administrators have permission to look at the purchase amounts of $10 or greater.
Let’s go ahead and test the policy function. Non-admin users see this:
SELECT amount_cents,
accounts.purchase_history_mask(amount_cents) AS masked,
region
FROM accounts.purchase_history;
+--------------+--------+----------+
| amount_cents | masked | region   |
+--------------+--------+----------+
| 42           | 42     | EMEA     |
| 1042         | NULL   | EMEA     |
| 2042         | NULL   | APAC     |
+--------------+--------+----------+
But administrators have access to view all the data:
SELECT amount_cents,
accounts.purchase_history_mask(amount_cents) AS masked,
region
FROM accounts.purchase_history;
+--------------+--------+----------+
| amount_cents | masked | region   |
+--------------+--------+----------+
| 42           | 42     | EMEA     |
| 1042         | 1042   | EMEA     |
| 2042         | 2042   | APAC     |
+--------------+--------+----------+
Looks great! Let’s apply the mask to our table:
ALTER TABLE accounts.purchase_history
ALTER COLUMN amount_cents
SET MASK accounts.purchase_history_mask;
After that, querying from the table should redact specific column values corresponding to the results of our testing above. For example, non-administrators will receive the following result:
SELECT amount_cents, region FROM accounts.purchase_history;
+--------------+--------+
| amount_cents | region |
+--------------+--------+
| 42           | EMEA   |
| NULL         | EMEA   |
| NULL         | APAC   |
+--------------+--------+
It works correctly.
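The threshold behavior is easy to get wrong at the boundary, so here is a small plain-Python sketch of the mask's CASE logic. It is a model under stated assumptions rather than Databricks code: `user_groups` is a hypothetical stand-in for IS_ACCOUNT_GROUP_MEMBER, and `None` stands in for SQL NULL.

```python
# Plain-Python model of accounts.purchase_history_mask (illustrative only).
# `user_groups` is a hypothetical stand-in for IS_ACCOUNT_GROUP_MEMBER lookups.
def purchase_history_mask(amount_cents, user_groups):
    if "admin" in user_groups:                     # admins see the raw value
        return amount_cents
    if amount_cents is not None and amount_cents < 1000:
        return amount_cents                        # below $10.00 stays visible
    return None                                    # ELSE NULL


SAMPLE_AMOUNTS = (42, 999, 1000, 1042)

non_admin_view = [purchase_history_mask(a, set()) for a in SAMPLE_AMOUNTS]
admin_view = [purchase_history_mask(a, {"admin"}) for a in SAMPLE_AMOUNTS]
```

In this model a purchase of exactly 1000 cents is masked for non-administrators, because the comparison is a strict "less than".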
We can also inspect the values of other columns to make our masking decision. For example, we can drop the existing mask and write a new function that looks at the region column instead of the purchase amount:
ALTER TABLE accounts.purchase_history ALTER COLUMN amount_cents DROP MASK;
CREATE FUNCTION accounts.purchase_history_region_mask(
amount_cents BIGINT,
region STRING)
RETURN CASE
WHEN IS_ACCOUNT_GROUP_MEMBER('admin') THEN amount_cents
WHEN region = 'APAC' THEN amount_cents
ELSE NULL
END;
Now we can apply the mask with the USING COLUMNS clause to specify the additional column name(s) to pass into the policy function:
ALTER TABLE accounts.purchase_history
ALTER COLUMN amount_cents
SET MASK accounts.purchase_history_region_mask
USING COLUMNS (region);
Thereafter, querying from the table should redact certain column values differently for non-administrators:
SELECT amount_cents, region FROM accounts.purchase_history;
+--------------+--------+
| amount_cents | region |
+--------------+--------+
| NULL         | EMEA   |
| NULL         | EMEA   |
| 2042         | APAC   |
+--------------+--------+
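As a rough illustration of how the extra columns reach the policy function, the following plain-Python sketch models a mask applied with additional columns. It is not Databricks code: `apply_mask` and the `user_groups` set are hypothetical helpers invented for this sketch, and `None` stands in for SQL NULL.

```python
# Plain-Python model of a column mask that also reads another column.
# `user_groups` is a hypothetical stand-in for IS_ACCOUNT_GROUP_MEMBER lookups.
def purchase_history_region_mask(amount_cents, region, user_groups):
    if "admin" in user_groups:
        return amount_cents
    if region == "APAC":
        return amount_cents
    return None  # ELSE NULL


def apply_mask(rows, column, mask_fn, using_columns, user_groups):
    """Return copies of `rows` with `column` replaced by the mask's result.

    The masked column's value is passed first, followed by the values of
    the columns named in `using_columns`, in the order they are listed.
    """
    masked = []
    for row in rows:
        extra = [row[name] for name in using_columns]
        new_row = dict(row)  # leave the underlying data untouched
        new_row[column] = mask_fn(row[column], *extra, user_groups)
        masked.append(new_row)
    return masked


ROWS = [
    {"amount_cents": 42, "region": "EMEA"},
    {"amount_cents": 1042, "region": "EMEA"},
    {"amount_cents": 2042, "region": "APAC"},
]

non_admin = apply_mask(ROWS, "amount_cents", purchase_history_region_mask,
                       ["region"], set())
non_admin_amounts = [row["amount_cents"] for row in non_admin]
```

The stored rows are never modified in this model; masking happens on the copy that is returned to the reader, which mirrors the idea that a mask changes what a query sees rather than what the table holds.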
We can inspect the mask by looking at the table column in the Catalog Explorer. Like before, following the “view” button reveals the function contents.
Storing Access Control Lists in Mapping Tables
Row filter and column mask policy functions almost always need to compare the current user against a list of allowed users, or check the user’s group memberships against an explicit list of allowed groups. Listing these allowlists in the policy functions themselves works well for lists of reasonable size. For larger lists, or when we want extra assurance that the identities of the allowed users or groups stay hidden from other users, we can take advantage of mapping tables instead.
These mapping tables act like personalized gatekeepers, deciding which data rows users or groups can access in your original table. The beauty of mapping tables lies in their seamless integration with fact tables, making your data security strategy more effective.
This approach is a game-changer for various custom requirements:
- Tailored User Access: You can impose restrictions based on individual user profiles while accommodating specific rules for user groups. This ensures that each user sees only what they should.
- Handling Complex Hierarchies: Whether it’s intricate organizational structures or diverse sets of rules, mapping tables can navigate the complexities, ensuring that data access adheres to your unique hierarchy.
- Seamless External Model Replication: Replicating complex security models from external source systems becomes a breeze. Mapping tables help you mirror these intricate setups without breaking a sweat.
For example:
CREATE TABLE accounts.purchase_history_groups
AS VALUES ('emea'), ('apac') t(group);
CREATE OR REPLACE FUNCTION accounts.purchase_history_row_filter(region STRING)
RETURN EXISTS(SELECT 1 FROM accounts.purchase_history_groups phg
WHERE IS_ACCOUNT_GROUP_MEMBER(phg.group));
Now, we can extend the accounts.purchase_history_groups table to large numbers of groups without making the policy function itself complex, and also restrict access to the rows of that table to only the administrator that created the accounts.purchase_history_row_filter SQL UDF.
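The lookup can be sketched in plain Python as follows. This is a model for reasoning, not Databricks code. The first function mirrors the EXISTS query above, which accepts the region argument but does not consult it. The second is a hypothetical variant, not taken from the example above, in which the mapping rows carry (group, region) pairs so that each group only unlocks its own region.

```python
# Plain-Python model of a mapping-table lookup (illustrative only).
# `user_groups` is a hypothetical stand-in for IS_ACCOUNT_GROUP_MEMBER lookups.

# Rows of accounts.purchase_history_groups from the example above.
PURCHASE_HISTORY_GROUPS = ["emea", "apac"]


def row_filter_as_written(region, user_groups):
    # Mirrors the EXISTS query: true if the user is in any listed group.
    # `region` is accepted but not consulted, as in the SQL function.
    return any(group in user_groups for group in PURCHASE_HISTORY_GROUPS)


# Hypothetical variant: each mapping row pairs a group with a region.
GROUP_REGIONS = [("emea", "EMEA"), ("apac", "APAC")]


def row_filter_by_region(region, user_groups):
    # True only if some group the user belongs to is mapped to this region.
    return any(group in user_groups and mapped == region
               for group, mapped in GROUP_REGIONS)


emea_sees_apac_as_written = row_filter_as_written("APAC", {"emea"})
emea_sees_apac_by_region = row_filter_by_region("APAC", {"emea"})
```

Under these assumptions, a member of the emea group passes the filter for every row with the function as written, but only for EMEA rows in the region-aware variant. Either way, growing the allowlist means adding rows to the mapping table, not editing the function.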
Using Row and Column Level Security with Lakehouse Federation
With Lakehouse Federation, Unity Catalog solves critical data management challenges to simplify how organizations handle disparate data systems. This provides the ability to create a unified view of your entire data estate, structured and unstructured, enabling secure access and exploration for all users regardless of data source. It allows efficient querying and data combination through a single engine, accelerating various data analysis and AI applications without requiring data ingestion. Additionally, it provides a consistent permission model for data security, applying access rules and ensuring compliance across different platforms.
The fine-grained access controls announced here work seamlessly with Lakehouse Federation tables. You can share federated tables within your organization with custom row- and column-level access policies for different groups, without copying data or creating many duplicate or similar table and view names in your catalogs.
For example, you can create a federated connection to an existing MySQL database and then browse the Catalog Explorer to inspect the foreign catalog. Inside the catalog, we find a mysql_demo_nyc_pizza_rating table. Let’s apply our row filter to that table:
ALTER TABLE mysql_catalog.qf_mysql_demo_database.mysql_demo_nyc_pizza_rating
SET ROW FILTER main.accounts.purchase_history_row_filter ON (name);
Afterwards, the table overview in the Catalog Explorer reflects the change, and clicking on the row filter reveals the name of the function, just like before.
Now, queries over this federated MySQL table will return different subsets of rows depending on each invoking user’s identity and group memberships. We’ve successfully integrated fine-grained access control with Lakehouse Federation, resulting in simplified usability and unified governance for Delta Lake and MySQL tables in the same organization.
Getting started with Row and Column Level Security
With Row Filters and Column Masks, you now gain the power to streamline your data management, making excessive ETL pipelines and data copies a thing of the past. This is your gateway to a new world of unified data security, where you can confidently share data with multiple users and groups, all while maintaining control and ensuring that sensitive information remains protected.
To get started with Row Filters and Column Masks, check out our documentation for AWS, Azure, and GCP.
Our team will discuss this release and other advanced access controls in Unity Catalog in our Data + AI Summit 2024 session, “Attribute-Based Access Controls in Unity Catalog—Building a Scalable Access Management Framework.” We hope to see you the week of June 10th. Register for Data + AI Summit today!