- Delta Lake in Python [Local]
  - [Create a table](https://docs.delta.io/latest/quick-start.html#id4)
    - Py code
    - output
    - Result
  - Read data
    - Py code
  - [Update table data](https://docs.delta.io/latest/quick-start.html#id6)
    - Overwrite
    - Update with condition
    - Delete with condition
    - Upsert
  - [Read older versions of data using time travel](https://docs.delta.io/latest/quick-start.html#id9)
  - Vacuum
Create a table
The first time you create a table with Delta Lake, some jars are downloaded.
Py code

    from delta import *
    import pyspark

    if __name__ == '__main__':
        # Register the Delta SQL extension and catalog on the session builder
        builder = (
            pyspark.sql.SparkSession.builder.appName("MyApp")
            .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
            .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        )
        # configure_spark_with_delta_pip adds the Delta Lake package to the session configuration
        spark = configure_spark_with_delta_pip(builder).getOrCreate()

        # Write the numbers 0..4 as a new Delta table
        data = spark.range(0, 5)
        data.write.format("delta").save("/tmp/delta-table")

output

    /Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/delta/test.py
    :: loading settings :: url = jar:file:/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
    Ivy Default Cache set to: /Users/gavin/.ivy2/cache
    The jars for the packages stored in: /Users/gavin/.ivy2/jars
    io.delta#delta-core_2.12 added as a dependency
    :: resolving dependencies :: org.apache.spark#spark-submit-parent-bc2db27b-406e-4a59-a5a6-81e4ceefcdd9;1.0
        confs: [default]
        found io.delta#delta-core_2.12;1.0.0 in central
        found org.antlr#antlr4;4.7 in central
        found org.antlr#antlr4-runtime;4.7 in local-m2-cache
        found org.antlr#antlr-runtime;3.5.2 in central
        found org.antlr#ST4;4.0.8 in central
        found org.abego.treelayout#org.abego.treelayout.core;1.0.3 in central
        found org.glassfish#javax.json;1.0.4 in central
        found com.ibm.icu#icu4j;58.2 in central
    downloading https://repo1.maven.org/maven2/io/delta/delta-core_2.12/1.0.0/delta-core_2.12-1.0.0.jar ...
        [SUCCESSFUL ] io.delta#delta-core_2.12;1.0.0!delta-core_2.12.jar (4279ms)
    downloading https://repo1.maven.org/maven2/org/antlr/antlr4/4.7/antlr4-4.7.jar ...
        [SUCCESSFUL ] org.antlr#antlr4;4.7!antlr4.jar (1363ms)
    downloading file:/Users/gavin/.m2/repository/org/antlr/antlr4-runtime/4.7/antlr4-runtime-4.7.jar ...
        [SUCCESSFUL ] org.antlr#antlr4-runtime;4.7!antlr4-runtime.jar (5ms)
    downloading https://repo1.maven.org/maven2/org/antlr/antlr-runtime/3.5.2/antlr-runtime-3.5.2.jar ...
        [SUCCESSFUL ] org.antlr#antlr-runtime;3.5.2!antlr-runtime.jar (310ms)
    downloading https://repo1.maven.org/maven2/org/antlr/ST4/4.0.8/ST4-4.0.8.jar ...
        [SUCCESSFUL ] org.antlr#ST4;4.0.8!ST4.jar (349ms)
    downloading https://repo1.maven.org/maven2/org/abego/treelayout/org.abego.treelayout.core/1.0.3/org.abego.treelayout.core-1.0.3.jar ...
        [SUCCESSFUL ] org.abego.treelayout#org.abego.treelayout.core;1.0.3!org.abego.treelayout.core.jar(bundle) (211ms)
    downloading https://repo1.maven.org/maven2/org/glassfish/javax.json/1.0.4/javax.json-1.0.4.jar ...
        [SUCCESSFUL ] org.glassfish#javax.json;1.0.4!javax.json.jar(bundle) (255ms)
    downloading https://repo1.maven.org/maven2/com/ibm/icu/icu4j/58.2/icu4j-58.2.jar ...
        [SUCCESSFUL ] com.ibm.icu#icu4j;58.2!icu4j.jar (5898ms)
    :: resolution report :: resolve 21270ms :: artifacts dl 12676ms
        :: modules in use:
        com.ibm.icu#icu4j;58.2 from central in [default]
        io.delta#delta-core_2.12;1.0.0 from central in [default]
        org.abego.treelayout#org.abego.treelayout.core;1.0.3 from central in [default]
        org.antlr#ST4;4.0.8 from central in [default]
        org.antlr#antlr-runtime;3.5.2 from central in [default]
        org.antlr#antlr4;4.7 from central in [default]
        org.antlr#antlr4-runtime;4.7 from local-m2-cache in [default]
        org.glassfish#javax.json;1.0.4 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   8   |   8   |   8   |   0   ||   8   |   8   |
        ---------------------------------------------------------------------
    :: retrieving :: org.apache.spark#spark-submit-parent-bc2db27b-406e-4a59-a5a6-81e4ceefcdd9
        confs: [default]
        8 artifacts copied, 0 already retrieved (15452kB/21ms)
    21/12/01 17:32:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

    Process finished with exit code 0

Note: the log above shows the jars being downloaded on this first run.
Result

    gavin@GavindeMacBook-Pro delta-table % ll -R
    total 48
    drwxr-xr-x 3 gavin wheel 96 Dec 1 17:32 _delta_log
    -rw-r--r-- 1 gavin wheel 296 Dec 1 17:32 part-00000-860e9aa3-7a9e-4511-a38f-0fc76096bf3e-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 1 17:32 part-00002-d47de088-a3aa-4ead-b28f-b2f59e749cc3-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 1 17:32 part-00004-2b098d53-196a-40de-9591-98f70edca7f0-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 1 17:32 part-00007-a5b43491-c803-41fa-9bac-54d9423f3122-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 1 17:32 part-00009-b5972d12-16ea-4108-8e19-2ae72ee35acf-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 1 17:32 part-00011-b21c674e-d9f1-47f2-b96f-5555622570d6-c000.snappy.parquet

    ./_delta_log:
    total 8
    -rw-r--r-- 1 gavin wheel 1602 Dec 1 17:32 00000000000000000000.json

The JSON file contains the commitInfo, protocol, and metaData entries, followed by one add record for each data file written by the commit:

    {"commitInfo":{"timestamp":1638351150505,"operation":"WRITE","operationParameters":{"mode":"ErrorIfExists","partitionBy":"[]"},"isBlindAppend":true,"operationMetrics":{"numFiles":"6","numOutputBytes":"2611","numOutputRows":"5"}}}
    {"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
    {"metaData":{"id":"11a935b4-4f4b-435c-bd40-84a3a07aaf1f","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1638351148221}}
    {"add":{"path":"part-00000-860e9aa3-7a9e-4511-a38f-0fc76096bf3e-c000.snappy.parquet","partitionValues":{},"size":296,"modificationTime":1638351150000,"dataChange":true}}
    {"add":{"path":"part-00002-d47de088-a3aa-4ead-b28f-b2f59e749cc3-c000.snappy.parquet","partitionValues":{},"size":463,"modificationTime":1638351150000,"dataChange":true}}
    {"add":{"path":"part-00004-2b098d53-196a-40de-9591-98f70edca7f0-c000.snappy.parquet","partitionValues":{},"size":463,"modificationTime":1638351150000,"dataChange":true}}
    {"add":{"path":"part-00007-a5b43491-c803-41fa-9bac-54d9423f3122-c000.snappy.parquet","partitionValues":{},"size":463,"modificationTime":1638351150000,"dataChange":true}}
    {"add":{"path":"part-00009-b5972d12-16ea-4108-8e19-2ae72ee35acf-c000.snappy.parquet","partitionValues":{},"size":463,"modificationTime":1638351150000,"dataChange":true}}
    {"add":{"path":"part-00011-b21c674e-d9f1-47f2-b96f-5555622570d6-c000.snappy.parquet","partitionValues":{},"size":463,"modificationTime":1638351150000,"dataChange":true}}

Read data
Py code

    from delta import *
    import pyspark

    if __name__ == '__main__':
        builder = (
            pyspark.sql.SparkSession.builder.appName("MyApp")
            .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
            .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        )
        spark_session = configure_spark_with_delta_pip(builder).getOrCreate()

        # Load the Delta table written earlier and print its rows
        df = spark_session.read.format("delta").load("/tmp/delta-table")
        df.show()
        '''
        result:
        +---+
        | id|
        +---+
        |  1|
        |  4|
        |  0|
        |  2|
        |  3|
        +---+
        '''

Update table data
Overwrite

    from delta import *
    import pyspark

    if __name__ == '__main__':
        builder = (
            pyspark.sql.SparkSession.builder.appName("MyApp")
            .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
            .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        )
        spark = configure_spark_with_delta_pip(builder).getOrCreate()

        # Replace the table contents with the numbers 5..9
        data = spark.range(5, 10)
        data.write.format("delta").mode("overwrite").save("/tmp/delta-table")

        # Read the table back to confirm the overwrite
        df = spark.read.format("delta").load("/tmp/delta-table")
        df.show()
        '''
        result:
        +---+
        | id|
        +---+
        |  9|
        |  6|
        |  8|
        |  5|
        |  7|
        +---+
        '''

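An overwrite does not delete the earlier Parquet files. It commits a new JSON file under _delta_log in which remove entries retire the old files and add entries register the new ones. The sketch below is a minimal, standard-library-only illustration of replaying such commits to find the files that make up the table; `live_files` and the shortened file names are hypothetical, and this is not how Delta Lake itself is implemented.

```python
import json


def live_files(commits, version=None):
    """Replay Delta-style commits in order and return the set of live file paths.

    `commits[i]` holds the newline-delimited JSON text of version i.
    If `version` is given, the replay stops after that version.
    """
    last = len(commits) - 1 if version is None else version
    files = set()
    for text in commits[: last + 1]:
        for line in text.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


# Hypothetical, shortened commit contents (real file names are much longer)
commit_0 = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    json.dumps({"add": {"path": "part-a.parquet"}}),
    json.dumps({"add": {"path": "part-b.parquet"}}),
])
commit_1 = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    json.dumps({"remove": {"path": "part-a.parquet"}}),
    json.dumps({"remove": {"path": "part-b.parquet"}}),
    json.dumps({"add": {"path": "part-c.parquet"}}),
])

print(sorted(live_files([commit_0, commit_1])))             # ['part-c.parquet']
print(sorted(live_files([commit_0, commit_1], version=0)))  # ['part-a.parquet', 'part-b.parquet']
```

Stopping the replay at an earlier version yields the older file set, which is why the old Parquet files have to stay on disk for earlier versions to remain readable.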
Files in dir:

    gavin@GavindeMacBook-Pro delta-table % ll -R
    total 96
    drwxr-xr-x 4 gavin wheel 128 Dec 3 15:54 _delta_log
    -rw-r--r-- 1 gavin wheel 296 Dec 1 17:32 part-00000-860e9aa3-7a9e-4511-a38f-0fc76096bf3e-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 296 Dec 3 15:53 part-00000-c57742fe-94a4-45c7-b5e1-6320e9092be3-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 1 17:32 part-00002-d47de088-a3aa-4ead-b28f-b2f59e749cc3-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 15:53 part-00002-e7c58054-17b4-4cb8-8796-4c9f2ab31f4a-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 1 17:32 part-00004-2b098d53-196a-40de-9591-98f70edca7f0-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 15:53 part-00004-7a09f511-5473-474b-9612-b804c56917a8-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 15:53 part-00007-78c06975-6ecf-4f3f-a3ee-eb7a5e2aa046-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 1 17:32 part-00007-a5b43491-c803-41fa-9bac-54d9423f3122-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 15:53 part-00009-1cf0e1ef-1f0c-436f-869f-81fa918c08b4-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 1 17:32 part-00009-b5972d12-16ea-4108-8e19-2ae72ee35acf-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 15:53 part-00011-5edd12d4-ac7f-4f40-acf4-5a309c5cdbc8-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 1 17:32 part-00011-b21c674e-d9f1-47f2-b96f-5555622570d6-c000.snappy.parquet

    ./_delta_log:
    total 16
    -rw-r--r-- 1 gavin wheel 1602 Dec 1 17:32 00000000000000000000.json
    -rw-r--r-- 1 gavin wheel 2475 Dec 3 15:53 00000000000000000001.json

Update with condition

    from delta import *
    import pyspark
    from delta.tables import *
    from pyspark.sql.functions import *

    builder = (
        pyspark.sql.SparkSession.builder.appName("MyApp")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")

    # Update every even value by adding 100 to it
    deltaTable.update(
        condition = expr("id % 2 == 0"),
        set = { "id": expr("id + 100") })

    deltaTable.toDF().show()
    '''
    result:
    +---+
    | id|
    +---+
    |  9|
    |  5|
    |106|
    |  7|
    |108|
    +---+
    '''

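The result above can be checked by hand: the table held 5 to 9 after the overwrite, so only 6 and 8 satisfy the condition. A plain-Python equivalent of the conditional update, offered only as an illustration (`update_where` is a hypothetical helper, not part of the Delta API):

```python
def update_where(rows, condition, assign):
    """Return a new list in which rows matching `condition` are replaced by `assign(row)`."""
    return [assign(row) if condition(row) else row for row in rows]


ids = list(range(5, 10))  # table contents after the overwrite: 5, 6, 7, 8, 9
updated = update_where(ids, lambda i: i % 2 == 0, lambda i: i + 100)
print(sorted(updated))  # [5, 7, 9, 106, 108]
```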
Files in dir:

    gavin@GavindeMacBook-Pro delta-table % ll -R
    total 112
    drwxr-xr-x 5 gavin wheel 160 Dec 3 16:03 _delta_log
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:03 part-00000-37c74c73-e771-41d7-a1d3-a7ce0e382787-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 296 Dec 1 17:32 part-00000-860e9aa3-7a9e-4511-a38f-0fc76096bf3e-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 296 Dec 3 15:53 part-00000-c57742fe-94a4-45c7-b5e1-6320e9092be3-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:03 part-00001-f80990f3-0902-4926-8fcd-370364d10ffe-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 1 17:32 part-00002-d47de088-a3aa-4ead-b28f-b2f59e749cc3-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 15:53 part-00002-e7c58054-17b4-4cb8-8796-4c9f2ab31f4a-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 1 17:32 part-00004-2b098d53-196a-40de-9591-98f70edca7f0-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 15:53 part-00004-7a09f511-5473-474b-9612-b804c56917a8-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 15:53 part-00007-78c06975-6ecf-4f3f-a3ee-eb7a5e2aa046-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 1 17:32 part-00007-a5b43491-c803-41fa-9bac-54d9423f3122-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 15:53 part-00009-1cf0e1ef-1f0c-436f-869f-81fa918c08b4-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 1 17:32 part-00009-b5972d12-16ea-4108-8e19-2ae72ee35acf-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 15:53 part-00011-5edd12d4-ac7f-4f40-acf4-5a309c5cdbc8-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 1 17:32 part-00011-b21c674e-d9f1-47f2-b96f-5555622570d6-c000.snappy.parquet

    ./_delta_log:
    total 24
    -rw-r--r-- 1 gavin wheel 1602 Dec 1 17:32 00000000000000000000.json
    -rw-r--r-- 1 gavin wheel 2475 Dec 3 15:53 00000000000000000001.json
    -rw-r--r-- 1 gavin wheel 1075 Dec 3 16:03 00000000000000000002.json

Delete with condition

    from delta import *
    import pyspark
    from delta.tables import *
    from pyspark.sql.functions import *

    builder = (
        pyspark.sql.SparkSession.builder.appName("MyApp")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")

    # Delete every even value
    deltaTable.delete(condition = expr("id % 2 == 0"))

    deltaTable.toDF().show()
    '''
    result:
    +---+
    | id|
    +---+
    |  9|
    |  5|
    |  7|
    +---+
    '''

Upsert

    from delta import *
    import pyspark
    from delta.tables import *
    from pyspark.sql.functions import *

    builder = (
        pyspark.sql.SparkSession.builder.appName("MyApp")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")

    # Upsert (merge) new data
    newData = spark.range(0, 20)

    (
        deltaTable.alias("oldData")
        .merge(
            newData.alias("newData"),
            "oldData.id = newData.id")
        .whenMatchedUpdate(set = { "id": col("newData.id") })
        .whenNotMatchedInsert(values = { "id": col("newData.id") })
        .execute()
    )

    deltaTable.toDF().show()
    '''
    result:
    +---+
    | id|
    +---+
    | 12|
    |  9|
    |  5|
    | 19|
    | 16|
    | 10|
    | 18|
    | 17|
    |  3|
    | 14|
    | 11|
    |  8|
    |  4|
    | 15|
    |  7|
    |  1|
    |  6|
    | 13|
    |  0|
    |  2|
    +---+
    '''

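The merge pairs rows on id: rows present on both sides are updated, and rows found only in the new data are inserted. A plain-Python sketch of that rule, assuming the keys in the new data are unique (`upsert` is a hypothetical helper, not part of the Delta API):

```python
def upsert(target, source, key):
    """Merge `source` rows into `target` rows, matching on `key(row)`."""
    merged = {key(row): row for row in target}
    for row in source:
        # Matched key: the existing row is replaced (update).
        # Unmatched key: the row is added (insert).
        merged[key(row)] = row
    return list(merged.values())


old_rows = [{"id": 5}, {"id": 7}, {"id": 9}]    # table contents after the delete
new_rows = [{"id": i} for i in range(0, 20)]    # same values as spark.range(0, 20)

result = upsert(old_rows, new_rows, key=lambda row: row["id"])
print(len(result))                                          # 20
print(sorted(row["id"] for row in result) == list(range(20)))  # True
```

Three ids (5, 7, 9) match and are updated in place, and the other seventeen are inserted, which accounts for the twenty rows shown above.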
Files in dir:

    gavin@GavindeMacBook-Pro delta-table % ll -R
    total 288
    drwxr-xr-x 7 gavin wheel 224 Dec 3 16:09 _delta_log
    -rw-r--r-- 1 gavin wheel 296 Dec 3 16:07 part-00000-220c8521-a0b2-4a17-91c8-bd1aa000364a-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:03 part-00000-37c74c73-e771-41d7-a1d3-a7ce0e382787-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 296 Dec 1 17:32 part-00000-860e9aa3-7a9e-4511-a38f-0fc76096bf3e-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 296 Dec 3 15:53 part-00000-c57742fe-94a4-45c7-b5e1-6320e9092be3-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 296 Dec 3 16:09 part-00000-eab34ac9-29df-4863-8021-fe200de0de64-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:03 part-00001-f80990f3-0902-4926-8fcd-370364d10ffe-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 1 17:32 part-00002-d47de088-a3aa-4ead-b28f-b2f59e749cc3-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 15:53 part-00002-e7c58054-17b4-4cb8-8796-4c9f2ab31f4a-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 1 17:32 part-00004-2b098d53-196a-40de-9591-98f70edca7f0-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 15:53 part-00004-7a09f511-5473-474b-9612-b804c56917a8-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00004-d22db67c-fd60-4b3c-b873-54e210693b4c-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00005-cebbe7c3-e06d-4a28-8fea-63fe0f447c9c-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 15:53 part-00007-78c06975-6ecf-4f3f-a3ee-eb7a5e2aa046-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 1 17:32 part-00007-a5b43491-c803-41fa-9bac-54d9423f3122-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 15:53 part-00009-1cf0e1ef-1f0c-436f-869f-81fa918c08b4-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 1 17:32 part-00009-b5972d12-16ea-4108-8e19-2ae72ee35acf-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 15:53 part-00011-5edd12d4-ac7f-4f40-acf4-5a309c5cdbc8-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 1 17:32 part-00011-b21c674e-d9f1-47f2-b96f-5555622570d6-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00011-d2e04069-de33-4f4b-9f5f-b375d0d81469-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00045-531a3391-78ee-42e6-8f05-ddf2ec84bbce-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00049-da28086b-f548-4ee2-897f-a2bf70c42d30-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00058-9527c574-c1ef-4ed3-ae46-b929ef76925a-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00068-03a6df86-a802-45f2-8110-d4f198903a52-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00069-cb5bfa45-68cf-4ea8-a504-fb873131ab19-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00077-82317c4e-d4de-4e0a-bd25-6df7c16ad9de-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00107-08d1051b-63df-4bef-992d-3b55d979ae25-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00112-fdcd61ef-b173-4078-9a67-331709ec2367-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00116-6e0d3ddd-1b77-4124-9a16-7e0a9f097e79-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00121-4a1199a0-c7f8-4b79-9f0b-9d5509d1cd26-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00128-b401d079-7583-4542-8aaa-822e15bd0a64-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00140-9578ffcf-073d-4a3f-b9c2-fd2b28bce50f-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00143-b99b9424-bc51-4069-8619-b258eb452773-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00150-a000aae6-c7b9-4349-98ee-b3094a997b98-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00154-81ed9b75-4401-4849-9ef1-41249c34cd9c-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00164-e68e77a9-cb7c-431b-8196-28e5abbf1526-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00190-222d17a8-92f6-4fde-b0df-c6ee0ce36f84-c000.snappy.parquet

    ./_delta_log:
    total 48
    -rw-r--r-- 1 gavin wheel 1602 Dec 1 17:32 00000000000000000000.json
    -rw-r--r-- 1 gavin wheel 2475 Dec 3 15:53 00000000000000000001.json
    -rw-r--r-- 1 gavin wheel 1075 Dec 3 16:03 00000000000000000002.json
    -rw-r--r-- 1 gavin wheel 910 Dec 3 16:07 00000000000000000003.json
    -rw-r--r-- 1 gavin wheel 4747 Dec 3 16:09 00000000000000000004.json

Read older versions of data using time travel

    from delta import *
    import pyspark
    from delta.tables import *
    from pyspark.sql.functions import *

    builder = (
        pyspark.sql.SparkSession.builder.appName("MyApp")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # Version 0: the initial write
    df0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table")
    df0.show()
    '''
    +---+
    | id|
    +---+
    |  1|
    |  4|
    |  0|
    |  2|
    |  3|
    +---+
    '''

    # Version 1: after the overwrite
    df1 = spark.read.format("delta").option("versionAsOf", 1).load("/tmp/delta-table")
    df1.show()
    '''
    +---+
    | id|
    +---+
    |  9|
    |  6|
    |  8|
    |  5|
    |  7|
    +---+
    '''

    # Version 2: after the conditional update
    df2 = spark.read.format("delta").option("versionAsOf", 2).load("/tmp/delta-table")
    df2.show()
    '''
    +---+
    | id|
    +---+
    |  9|
    |  5|
    |106|
    |  7|
    |108|
    +---+
    '''

    # Version 3: after the conditional delete
    df3 = spark.read.format("delta").option("versionAsOf", 3).load("/tmp/delta-table")
    df3.show()
    '''
    +---+
    | id|
    +---+
    |  9|
    |  5|
    |  7|
    +---+
    '''

    # Version 4: after the upsert
    df4 = spark.read.format("delta").option("versionAsOf", 4).load("/tmp/delta-table")
    df4.show()
    '''
    +---+
    | id|
    +---+
    | 12|
    |  9|
    |  5|
    | 19|
    | 16|
    | 10|
    | 18|
    | 17|
    |  3|
    | 14|
    | 11|
    |  8|
    |  4|
    | 15|
    |  7|
    |  1|
    |  6|
    | 13|
    |  0|
    |  2|
    +---+
    '''

Vacuum
Before vacuuming, the files in the table directory are unchanged from the listing shown after the upsert (total 288, log versions 0 to 4).
To remove files that are no longer needed by any version from the last hour:

    from delta import *
    import pyspark
    from delta.tables import *
    from pyspark.sql.functions import *

    # The retention check is disabled so that a retention shorter than the default is accepted
    builder = (
        pyspark.sql.SparkSession.builder.appName("MyApp")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .config("spark.databricks.delta.retentionDurationCheck.enabled", "false")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")

    # Vacuum files not required by versions more than 1 hour old
    deltaTable.vacuum(1)

Result:

    gavin@GavindeMacBook-Pro delta-table %
    gavin@GavindeMacBook-Pro delta-table % date
    Fri Dec 3 16:53:20 CST 2021
    gavin@GavindeMacBook-Pro delta-table %
    gavin@GavindeMacBook-Pro delta-table % ll -R
    total 240
    drwxr-xr-x 7 gavin wheel 224 Dec 3 16:09 _delta_log
    -rw-r--r-- 1 gavin wheel 296 Dec 3 16:07 part-00000-220c8521-a0b2-4a17-91c8-bd1aa000364a-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:03 part-00000-37c74c73-e771-41d7-a1d3-a7ce0e382787-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 296 Dec 3 15:53 part-00000-c57742fe-94a4-45c7-b5e1-6320e9092be3-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 296 Dec 3 16:09 part-00000-eab34ac9-29df-4863-8021-fe200de0de64-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:03 part-00001-f80990f3-0902-4926-8fcd-370364d10ffe-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 15:53 part-00002-e7c58054-17b4-4cb8-8796-4c9f2ab31f4a-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 15:53 part-00004-7a09f511-5473-474b-9612-b804c56917a8-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00004-d22db67c-fd60-4b3c-b873-54e210693b4c-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00005-cebbe7c3-e06d-4a28-8fea-63fe0f447c9c-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 15:53 part-00007-78c06975-6ecf-4f3f-a3ee-eb7a5e2aa046-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 15:53 part-00009-1cf0e1ef-1f0c-436f-869f-81fa918c08b4-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 15:53 part-00011-5edd12d4-ac7f-4f40-acf4-5a309c5cdbc8-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00011-d2e04069-de33-4f4b-9f5f-b375d0d81469-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00045-531a3391-78ee-42e6-8f05-ddf2ec84bbce-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00049-da28086b-f548-4ee2-897f-a2bf70c42d30-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00058-9527c574-c1ef-4ed3-ae46-b929ef76925a-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00068-03a6df86-a802-45f2-8110-d4f198903a52-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00069-cb5bfa45-68cf-4ea8-a504-fb873131ab19-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00077-82317c4e-d4de-4e0a-bd25-6df7c16ad9de-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00107-08d1051b-63df-4bef-992d-3b55d979ae25-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00112-fdcd61ef-b173-4078-9a67-331709ec2367-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00116-6e0d3ddd-1b77-4124-9a16-7e0a9f097e79-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00121-4a1199a0-c7f8-4b79-9f0b-9d5509d1cd26-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00128-b401d079-7583-4542-8aaa-822e15bd0a64-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00140-9578ffcf-073d-4a3f-b9c2-fd2b28bce50f-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00143-b99b9424-bc51-4069-8619-b258eb452773-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00150-a000aae6-c7b9-4349-98ee-b3094a997b98-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00154-81ed9b75-4401-4849-9ef1-41249c34cd9c-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00164-e68e77a9-cb7c-431b-8196-28e5abbf1526-c000.snappy.parquet
    -rw-r--r-- 1 gavin wheel 463 Dec 3 16:09 part-00190-222d17a8-92f6-4fde-b0df-c6ee0ce36f84-c000.snappy.parquet

    ./_delta_log:
    total 48
    -rw-r--r-- 1 gavin wheel 1602 Dec 1 17:32 00000000000000000000.json
    -rw-r--r-- 1 gavin wheel 2475 Dec 3 15:53 00000000000000000001.json
    -rw-r--r-- 1 gavin wheel 1075 Dec 3 16:03 00000000000000000002.json
    -rw-r--r-- 1 gavin wheel 910 Dec 3 16:07 00000000000000000003.json
    -rw-r--r-- 1 gavin wheel 4747 Dec 3 16:09 00000000000000000004.json
    gavin@GavindeMacBook-Pro delta-table %

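In this listing the six Parquet files dated Dec 1 are gone, while every file written on Dec 3 is still present and the _delta_log directory is untouched. The sketch below models that outcome under a simplifying assumption: a file is removed only when it is both unreferenced by the current table version and older than the retention cutoff. This is an illustration, not Delta Lake's actual implementation; `vacuum_candidates`, the file names, and the timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def vacuum_candidates(files, live, now, retention_hours):
    """Return the files a simplified vacuum would delete.

    `files` maps file name to last-modified time, `live` is the set of
    files referenced by the current table version.
    """
    cutoff = now - timedelta(hours=retention_hours)
    return sorted(
        name
        for name, modified in files.items()
        if name not in live and modified < cutoff
    )


# Hypothetical files and timestamps, loosely modelled on the listings above
now = datetime(2021, 12, 3, 16, 53)
files = {
    "old-unreferenced.parquet": datetime(2021, 12, 1, 17, 32),
    "recent-unreferenced.parquet": datetime(2021, 12, 3, 16, 3),
    "current.parquet": datetime(2021, 12, 3, 16, 9),
}
live = {"current.parquet"}

print(vacuum_candidates(files, live, now, retention_hours=1))
# ['old-unreferenced.parquet']
```

Only the file that is both unreferenced and older than one hour is selected. Files that are still recent stay on disk even though no current version needs them, so recent versions remain readable through time travel.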