Confluent Inc. today announced new features in its cloud service that make it easier for users of its Apache Kafka-based streaming engine to store data in the Apache Iceberg table format.
The new Confluent Tableflow lets users convert Kafka topics, along with their associated schemas and metadata, to Iceberg tables with one click, making it easier to feed analytic workloads in data lakes and data warehouses.
That compares with what had previously been a “painful” process, said Addison Huddy, vice president of Kafka at Confluent. “Today, you have to think about how to partition data, consume it and write it out to S3 in a cost-performant and stable way,” he said, referring to Amazon Web Services Inc.’s object storage service. “You end up with really small files in S3 that need to be compacted, and often you lose the schema. You end up reorganizing, grouping and adding a schema with a whole bunch of pipelines” built with Apache Spark and Apache Flink.
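For readers unfamiliar with that pattern, here is a minimal sketch of the kind of hand-built pipeline Huddy describes, using Spark Structured Streaming in Python. The topic, bucket and checkpoint paths are hypothetical placeholders, and the Kafka connector package is assumed to be available.

```python
# Minimal sketch of the manual Kafka-to-S3 pipeline described above.
# Topic, bucket and checkpoint paths are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-s3").getOrCreate()

# Consume the raw Kafka topic; records arrive as untyped key/value bytes,
# so the schema is lost unless it is reapplied downstream.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
)

# Write micro-batches to S3 as Parquet. Each trigger tends to produce
# many small files, which later need separate compaction jobs.
query = (
    stream.selectExpr("CAST(value AS STRING) AS value")
    .writeStream.format("parquet")
    .option("path", "s3a://my-bucket/orders/")
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/orders/")
    .start()
)
query.awaitTermination()
```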
“Tableflow takes all that complexity and makes it pushbutton simple,” he said. “You look at a topic in Kafka. It already has a schema, so you know its shape. You push a button, and it flips that stream and turns it into a table. With the Iceberg metadata interface, we can expose it as a Kafka and an S3 endpoint. I think of it like getting Iceberg data bottled at the source.”
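Confluent has not published the endpoint details, but conceptually the result should be readable like any Iceberg table. The sketch below imagines consuming a Tableflow-exposed table through a REST catalog with PyIceberg; the catalog URI, token and table name are assumptions for illustration, not Confluent’s actual interface.

```python
# Hypothetical sketch: reading a Tableflow-exposed Iceberg table through
# a REST catalog with PyIceberg. URI, token and table name are invented.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "tableflow",
    **{
        "uri": "https://example.confluent.cloud/iceberg",  # hypothetical endpoint
        "token": "MY_API_TOKEN",
    },
)

table = catalog.load_table("kafka.orders")  # one table per Kafka topic
df = table.scan().to_pandas()               # read the current snapshot
print(df.head())
```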
De facto standards
Kafka has a nearly 39% market share in the fragmented queueing, messaging and background processing market, according to 6sense Insights Inc. It’s used by more than 80% of Fortune 100 companies.
Apache Iceberg is an open-source table format popular in data lakes for its flexibility and consistency. Iceberg supports schema evolution, hidden partitioning and snapshot isolation for reliability. It can also scale to manage petabytes of data across billions of rows.
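Two of those features can be illustrated with PyIceberg. The sketch below adds a column without rewriting existing data files and reads back an older snapshot; the catalog and table names are hypothetical.

```python
# Illustrative sketch of Iceberg schema evolution and snapshot isolation
# with PyIceberg. Catalog and table names are hypothetical.
from pyiceberg.catalog import load_catalog
from pyiceberg.types import StringType

table = load_catalog("default").load_table("db.events")

# Schema evolution: adding a column is a metadata-only change,
# with no rewrite of the existing data files.
with table.update_schema() as update:
    update.add_column("region", StringType())

# Snapshot isolation: every commit produces an immutable snapshot,
# so readers can scan a consistent historical version of the table.
oldest = table.history()[0].snapshot_id
df = table.scan(snapshot_id=oldest).to_pandas()
```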
Tableflow works with the existing capabilities of Confluent’s data streaming platform, including stream governance features and stream processing with Apache Flink, an open-source framework that unifies stream and batch processing of large data volumes.
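As a rough illustration of the Flink side, the PyFlink sketch below runs a continuous aggregation over a Kafka topic. It assumes the Flink Kafka SQL connector is available, and the broker address, topic and fields are placeholders.

```python
# Minimal PyFlink sketch of stream processing over a Kafka topic.
# Broker, topic and fields are hypothetical; the Kafka SQL connector
# jar must be on the classpath for this to run.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE orders (
        order_id STRING,
        amount   DOUBLE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'broker:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# A continuous per-key aggregation over the unbounded stream.
t_env.execute_sql(
    "SELECT order_id, SUM(amount) AS total FROM orders GROUP BY order_id"
).print()
```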
Iceberg stores data in the open-source Parquet columnar file format. “Our job is to ensure the files are well organized,” Huddy said. “A whole ecosystem has evolved around using Iceberg maps to get all your Parquet data.” Tableflow handles the translation and makes updates available in real time. It’s currently available as part of an early access program.
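The mapping Huddy refers to can be seen directly in Iceberg’s metadata: planning a table scan resolves to the concrete Parquet files that back it. A short PyIceberg sketch, continuing the hypothetical table from above:

```python
# Sketch: Iceberg metadata maps a table scan to its underlying Parquet
# data files. Catalog and table names remain hypothetical.
from pyiceberg.catalog import load_catalog

table = load_catalog("default").load_table("db.events")

# Each planned task points at one Parquet data file owned by the table.
for task in table.scan().plan_files():
    print(task.file.file_path, task.file.record_count)
```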
Confluent is also expanding its catalog of connectors to other data sources to more than 80 and adding support for private networking using DNS forwarding and Egress Access Points on the Amazon Web Services and Microsoft Corp. Azure cloud platforms. Provisioning time has been shortened, and the data transfer throughput price has been cut to 2.5 cents per gigabyte. “You can now set up a connection much more quickly and know right away that it’s working,” Huddy said.
Confluent Cloud customers will also now have the company’s Stream Governance platform automatically enabled in their environments, providing access to a schema registry, a data portal, real-time stream lineage and other features.
“Kafka is the first time in a streaming pipeline that data is written, so you want to make sure data is governed the minute it’s written,” Huddy said. “With data masking policies, you can immediately apply governance the minute data is created.”
A component of stream governance called Schema Registry helps enforce universal data standards to ensure data quality and consistency. The enterprise-focused Stream Governance Advanced now offers a 99.99% service-level agreement for Schema Registry.
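For a sense of what enforcing a schema at write time looks like, here is a hedged sketch of registering an Avro schema with Schema Registry via the confluent-kafka Python client; the URL, credentials and subject name are placeholders.

```python
# Sketch: registering an Avro schema with Confluent Schema Registry.
# URL, credentials and subject name are placeholders.
from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

client = SchemaRegistryClient({
    "url": "https://psrc-example.us-east-2.aws.confluent.cloud",
    "basic.auth.user.info": "API_KEY:API_SECRET",
})

avro_schema = Schema(
    '{"type": "record", "name": "Order", "fields": ['
    '{"name": "order_id", "type": "string"},'
    '{"name": "amount", "type": "double"}]}',
    schema_type="AVRO",
)

# Registering under "orders-value" ties this shape to the topic's values,
# so producers and consumers agree on the data contract from the start.
schema_id = client.register_schema("orders-value", avro_schema)
print(f"Registered schema id: {schema_id}")
```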