We’re excited to announce native support in Databricks for ingesting XML data.
XML is a popular file format for representing complex data structures in different use cases for manufacturing, healthcare, law, travel, finance, and more. As these industries find new opportunities for analytics and AI, they increasingly need to leverage their troves of XML data. Databricks customers ingest this data into the Data Intelligence Platform, where other capabilities like Mosaic AI and Databricks SQL can then be used to drive business value.
However, it can take a lot of work to build resilient XML pipelines. Since XML files are semi-structured and arbitrarily large, they’re often complex to process. Until now, XML ingestion has required the use of open source packages or the conversion of XML into another file format, which in turn requires data engineers to maintain these complex pipelines.
To streamline that process, we’ve developed native support for XML files within Auto Loader and COPY INTO. (Note that Auto Loader for XML works with Delta Live Tables and Databricks Workflows.) This support enables direct ingestion, querying, and parsing without any external packages or file type conversions. Users can also take advantage of powerful capabilities like schema inference and evolution in Auto Loader.
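For SQL-first pipelines, COPY INTO can load XML files directly as well. As a minimal sketch (the table name and source path here are placeholders, and the format option mirrors the `rowTag` reader option used in the examples):

```sql
COPY INTO main.default.books
FROM 's3://my-bucket/xml-landing/'
FILEFORMAT = XML
FORMAT_OPTIONS ('rowTag' = 'book')
COPY_OPTIONS ('mergeSchema' = 'true')
```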
Example 1: Ingest an XML file for batch workloads
# Read XML files, treating each <book> element as one row
df = (spark.read
  .option("rowTag", "book")
  .xml(inputPath))
For a sample input file containing the following XML:
<books>
<book id="103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
</book>
<book id="104">
<author>Corets, Eva</author>
<title>Oberon's Legacy</title>
</book>
</books>
The query above infers the following schema and produces this parsed result:
root
|-- _id: long (nullable = true)
|-- author: string (nullable = true)
|-- title: string (nullable = true)
+---+-----------+---------------+
|_id|author |title |
+---+-----------+---------------+
|103|Corets, Eva|Maeve Ascendant|
|104|Corets, Eva|Oberon's Legacy|
+---+-----------+---------------+
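For comparison, this is roughly the flattening logic a data engineer would otherwise hand-roll, sketched here with Python's standard library against the same sample input (the native reader does this, plus type inference, for you):

```python
import xml.etree.ElementTree as ET

# Sample input from the example above, inlined for a self-contained sketch.
xml_doc = """<books>
  <book id="103">
    <author>Corets, Eva</author>
    <title>Maeve Ascendant</title>
  </book>
  <book id="104">
    <author>Corets, Eva</author>
    <title>Oberon's Legacy</title>
  </book>
</books>"""

root = ET.fromstring(xml_doc)

# Flatten each <book> element into a row, mirroring the inferred schema:
# the id attribute becomes _id, child elements become columns.
rows = [
    {
        "_id": int(book.get("id")),
        "author": book.findtext("author"),
        "title": book.findtext("title"),
    }
    for book in root.findall("book")
]

print(rows[0])  # {'_id': 103, 'author': 'Corets, Eva', 'title': 'Maeve Ascendant'}
```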
Customers also benefit from new, XML-specific features. For example, they can now validate each row-level XML record against an XML schema definition (XSD). They can also use the from_xml Apache Spark function to parse XML strings that are embedded in SQL columns or streaming data sources (like Apache Kafka, Amazon Kinesis, and so on).
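In Databricks SQL, from_xml takes an XML string and a schema. A minimal sketch (the table `raw_events` and its `payload` column are hypothetical names for illustration):

```sql
-- Parse an XML string column into a struct using a DDL-style schema
SELECT from_xml(payload, 'author STRING, title STRING') AS book
FROM raw_events
```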
Example 2: Ingest an XML file using Auto Loader for streaming workloads
This example demonstrates schema inference, schema evolution, and XSD validation.
# Stream XML files with Auto Loader, validating each record against an XSD
(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "xml")
  .option("rowTag", "book")
  .option("rowValidationXSDPath", xsdPath)
  .option("cloudFiles.schemaLocation", schemaPath)
  .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
  .load(inputPath)
  .writeStream
  .format("delta")
  .option("mergeSchema", "true")
  .option("checkpointLocation", checkPointPath)
  .trigger(availableNow=True)
  .start(outputPath))
XML data ingestion at Lufthansa
Lufthansa Industry Solutions ingests XML data sources for their Lufthansa Cargo data solution, built on the Data Intelligence Platform. The new XML support has helped the team streamline ingestion and automate much of the data engineering burden. As a result, practitioners can focus on innovation, instead of maintaining complex pipelines.
“Lufthansa Cargo managed to streamline the integration of XML data with Auto Loader, which marks a significant advancement in handling complex airfreight booking data. Cost-efficiency, reliable data ‘landing’, and schema inference and evolution are enabling an ‘autopilot’ mode. Overall, the collaboration with Databricks and Lufthansa Industry Solutions enables our teams to focus on critical tasks and innovation.”
— Björn Roccor, Head of AD&M BI Analytics, Lufthansa Cargo & Jens Weppner, Technology Manager Analytics, Lufthansa Cargo
Next Steps
Native XML support is now in Public Preview on all cloud platforms and is available in both Delta Live Tables and Databricks SQL. Learn more by reading the documentation.