Delta Lake gives Apache Spark data sets new powers

Serdar Yegulalp, April 24, 2019


Databricks, the main commercial backer for Apache Spark, has released Delta Lake, an open source storage layer for Spark that provides ACID transactions and other data-management functions for machine learning and other big data work.

Many kinds of data work need features like ACID transactions or schema enforcement for consistency, metadata management for security, and the ability to work with discrete versions of data. Features like those don’t come standard with every data source out there, so Delta Lake provides those features for any Spark DataFrame data source.

Delta Lake can be used as a drop-in replacement for accessing storage systems like HDFS. Data ingested into Spark through Delta Lake is stored in Parquet format in a cloud storage service of your choice. Developers can use their choice of Java, Python, or Scala to access Delta Lake’s API set.

Delta Lake supports most of the existing Spark SQL DataFrame functions for reading and writing data. It also supports Spark Structured Streaming as a source or destination, although not the DStream API. Every read and write through Delta Lake has an ACID transaction guarantee, so that multiple writers will have their writes serialized and multiple readers will see consistent snapshots.
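To make the guarantee concrete, here is a minimal sketch in plain Python of what "serialized writes and consistent snapshots" means. It is a toy in-memory model, not Delta Lake's implementation or API; the `ToyTable` class and its methods are hypothetical names invented for this illustration.

```python
import threading


class ToyTable:
    """Hypothetical in-memory table; models the guarantee, not Delta Lake itself."""

    def __init__(self):
        self._lock = threading.Lock()
        self._snapshot = ()  # immutable tuple of rows, swapped on each commit

    def append(self, rows):
        # Writers are serialized: only one commit can run at a time,
        # so each batch lands in the table as one contiguous unit.
        with self._lock:
            self._snapshot = self._snapshot + tuple(rows)

    def read(self):
        # Readers receive an immutable snapshot; later commits never alter it.
        return self._snapshot


table = ToyTable()
before = table.read()  # snapshot taken before any writes

# Four concurrent writers, each committing a batch of 100 rows.
writers = [
    threading.Thread(target=table.append, args=([(w, i) for i in range(100)],))
    for w in range(4)
]
for t in writers:
    t.start()
for t in writers:
    t.join()

after = table.read()
print(len(before), len(after))
```

The snapshot held in `before` stays empty even after the writers finish, while `after` contains all four batches, each one intact rather than interleaved with the others.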

Reading a specific version of a data set—what the Delta Lake documentation calls “time travel”—works by simply reading a DataFrame with an associated time stamp or version ID. Delta Lake also ensures the schema of the DataFrame being written matches the table it’s being written to; if there’s a mismatch, it throws an exception rather than change the schema. (Spark’s file APIs will replace the table in such a case.)
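The two behaviors described above, reading by version or time stamp and rejecting mismatched writes, can be sketched with a small plain-Python model. This is an illustration of the concepts only, under the assumption that each commit produces a new numbered version; `VersionedTable`, `SchemaMismatchError`, and the integer time stamps are hypothetical and are not part of Delta Lake's API.

```python
class SchemaMismatchError(Exception):
    """Raised when written rows do not match the table's schema."""


class VersionedTable:
    """Hypothetical toy table that keeps one full snapshot per commit."""

    def __init__(self, schema):
        self.schema = dict(schema)  # column name -> expected Python type
        self._versions = []         # list of (timestamp, rows) per commit

    def write(self, rows, timestamp):
        # Schema enforcement: reject the whole write instead of altering the schema.
        for row in rows:
            same_columns = set(row) == set(self.schema)
            if not same_columns or any(
                not isinstance(row[col], typ) for col, typ in self.schema.items()
            ):
                raise SchemaMismatchError(f"row {row!r} does not match {self.schema!r}")
        previous = self._versions[-1][1] if self._versions else ()
        self._versions.append((timestamp, previous + tuple(rows)))
        return len(self._versions) - 1  # version ID of this commit

    def read(self, version=None, timestamp=None):
        # "Time travel": select a snapshot by version ID or by time stamp.
        if version is not None:
            return list(self._versions[version][1])
        if timestamp is not None:
            older = [rows for ts, rows in self._versions if ts <= timestamp]
            if not older:
                raise ValueError("no version exists at or before that time stamp")
            return list(older[-1])
        return list(self._versions[-1][1]) if self._versions else []


events = VersionedTable({"id": int, "name": str})
v0 = events.write([{"id": 1, "name": "a"}], timestamp=100)
v1 = events.write([{"id": 2, "name": "b"}], timestamp=200)

try:
    events.write([{"id": "3", "name": "c"}], timestamp=300)  # id has the wrong type
    rejected = False
except SchemaMismatchError:
    rejected = True

print(events.read(version=v0), events.read(timestamp=150), rejected)
```

Storing a full snapshot per commit keeps the sketch short; a real storage layer would be expected to record changes more economically.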

Future releases of Delta Lake may support more of Spark’s public API set, although DataFrameReader/Writer are the main focus for now.
