Building global enterprise applications means handling diverse languages and inconsistent data entry. How does a database know to sort “Äpfel” after “Apfel” in German or treat “ç” as “c” in French? Or handle users typing “John Smith” versus “john smith” and decide if they’re the same?
Collations streamline data processing by defining rules for sorting and comparing text in ways that respect language and case sensitivity. Collations make databases language- and context-aware, ensuring they handle text as users expect.
We are excited to share that collations are now available in Public Preview with Databricks Runtime 16.1 (coming soon to Databricks SQL and Databricks Delta Live Tables). Collations provide a mechanism for defining string comparison rules tailored to specific language requirements, such as case sensitivity and accent sensitivity. In this blog, we’ll explore how collations work, why they matter, and how to choose the right one for your needs.
Now, with collations, users can choose from over 100 language-specific collation rules to apply within their data workflows, facilitating operations such as sorting, searching, and joining multilingual text datasets. Collation support also makes it easier to apply the same rules when migrating from legacy database systems. This functionality significantly improves performance and simplifies code, especially for common queries that require case-insensitive and accent-insensitive comparisons.
Key features of collation support
Databricks collation support includes:
- Over 100 languages, with case and accent sensitivity variations
- Over 100 Spark & SQL expressions
- Compatibility with all data operations (joins, sorting, aggregation, clustering, etc.)
- Photon-optimized implementation
- Native support for Delta tables, including performance optimizations such as data skipping, z-ordering, liquid clustering, dynamic partition and file pruning
- Simplified migrations from legacy database systems
Collation support is fully open source and integrated into Apache Spark™ and Delta Lake.
Using collations in your queries
Collations integrate smoothly with established Spark functionality, so operations such as joins, aggregates, window functions, and filters work seamlessly with collated data. Most string expressions are compatible with collations, so they can be used in expressions like CONTAINS, STARTSWITH, REPLACE, and TRIM, among others. More details are in the collation documentation.
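As a quick illustration (the table and column names here are hypothetical), a collation can also be applied ad hoc with the COLLATE expression, without changing the column's declared type:

```sql
-- Case-insensitive substring search via an ad-hoc collation.
SELECT name
FROM users
WHERE CONTAINS(name COLLATE UTF8_LCASE, 'smith');
```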
Solving common tasks with collations
To get started with collations, create (or modify) a table column with the appropriate collation. For Greek names, you would use the EL_AI collation, where EL is the language identifier for Greek and AI stands for accent-insensitive. For English names (which don’t have accents), you would use UTF8_LCASE.
To showcase the scenarios unlocked by collations, let’s perform the following tasks:
- Use case-insensitive comparison to find English names
- Use Greek alphabet ordering to sort Greek names
- Search for Greek names in an accent-insensitive manner
We will use a table containing the names of heroes from Homer’s Iliad in both Greek and English to demonstrate:
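Below is a minimal setup sketch; the table name and sample rows are illustrative, not taken from the original post:

```sql
-- Demo table: the columns start as plain (UTF8_BINARY) strings and
-- are given collations with ALTER TABLE below.
CREATE TABLE heroes (
  english_name STRING,
  greek_name   STRING
);

INSERT INTO heroes VALUES
  ('Achilles',  'Ἀχιλλεύς'),
  ('Agamemnon', 'Ἀγαμέμνων'),
  ('Odysseus',  'Ὀδυσσεύς'),
  ('Hector',    'Ἕκτωρ');

-- Assign the collations discussed above.
ALTER TABLE heroes ALTER COLUMN english_name TYPE STRING COLLATE UTF8_LCASE;
ALTER TABLE heroes ALTER COLUMN greek_name   TYPE STRING COLLATE EL_AI;
```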
To list all available collations, you can query the collations() table-valued function:
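```sql
SELECT * FROM collations();
```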
You should run the ANALYZE command after the ALTER commands to make sure that subsequent queries are able to leverage data skipping:
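A sketch for our demo table, assuming the Databricks ANALYZE variant that recomputes Delta file-level statistics:

```sql
-- Refresh Delta statistics so data skipping works with the
-- newly collated columns.
ANALYZE TABLE heroes COMPUTE DELTA STATISTICS;
```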
Now you no longer need to apply LOWER before comparing English names, and file pruning will also happen under the hood:
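```sql
-- Case-insensitive match; no LOWER() needed on the UTF8_LCASE column.
SELECT english_name
FROM heroes
WHERE english_name = 'achilles';
```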
To sort according to Greek language rules, you can simply use ORDER BY. Note that the result will be different from sorting without the EL_AI collation.
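Against the demo table above:

```sql
-- EL_AI orders greek_name by Greek alphabet rules.
SELECT greek_name
FROM heroes
ORDER BY greek_name;
```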
And for accent-insensitive search, say for all rows that refer to Agamemnon (Ἀγαμέμνων in Greek), you just apply a filter with an unaccented spelling; it will match the accented version of the Greek name:
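```sql
-- Under EL_AI, the unaccented spelling matches the stored accented value.
SELECT *
FROM heroes
WHERE greek_name = 'Αγαμεμνων';
```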
Performance with collations
Collation support eliminates the need for costly workarounds to achieve case-insensitive results. The graph below compares execution time using the LOWER SQL function versus collation support, measured on 1B randomly generated strings. The query filters some column 'col' for all strings equal to 'abc' in a case-insensitive manner. With the default UTF8_BINARY collation, the filter condition is LOWER(col) = 'abc'. When 'col' is collated with the UTF8_LCASE collation, the filter condition is simply col = 'abc', which achieves the same result. Using the collation yields up to 22x faster query execution by leveraging Delta file skipping (in this case, Photon is not used in either query).
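Schematically, the two equivalent filters look like this (table and column names are illustrative):

```sql
-- Legacy approach: normalize case at query time on a UTF8_BINARY column.
SELECT count(*) FROM t WHERE LOWER(col) = 'abc';

-- Collation-aware approach: col is declared STRING COLLATE UTF8_LCASE,
-- so plain equality is already case-insensitive and enables file skipping.
SELECT count(*) FROM t WHERE col = 'abc';
```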
With Photon, the performance improvement can be even more significant (actual speedups vary depending on the collation, function, and data). The graph below shows speeds with and without Photon for equality comparison and the STARTSWITH, ENDSWITH, and CONTAINS SQL functions with the UTF8_LCASE collation. The functions were run on a dataset of randomly generated ASCII-only strings 1,000 characters long. In this example, STARTSWITH and ENDSWITH showed a 10x speedup when using collations.
With the exception of the Photon-optimized implementation, all collation features are available in open source Spark. There are no data format changes: data remains UTF-8 encoded in the underlying files, and all features are supported across both open source Spark and Delta Lake. This means customers are not locked in and can treat their code as portable across the Spark ecosystem.
What’s next
In the near future, customers will be able to set collations at the catalog, schema, or table level. Support for RTRIM is also coming soon, allowing string comparisons to ignore undesired trailing whitespace. Stay tuned to the Databricks Homepage and What's Coming documentation pages for updates.
Getting started
To get started with collations, read the Databricks documentation.
To learn more about Databricks SQL, visit our website or read the documentation. You can also check out the product tour for Databricks SQL. If you want to migrate your existing warehouse to a high-performance, serverless data warehouse with a great user experience and lower total cost, then Databricks SQL is the solution — try it for free.