**What happened**:
When interoperating with Databricks we get inconsistent encodings for the partition paths, which makes it impossible to read the table across clients.
After the fix from #1613 we're getting closer, but still not consistent with Databricks.
**Python**

When writing from Python the partitions are formatted as:

`partition_date=2023-09-15%2000%3A00%3A00.000000`

When writing from Rust we see:

`partition_date=2023-09-15 00:00:00`
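
For what it's worth, the Python path looks like a full percent-encoding of the timestamp string. A quick sketch with the standard library reproduces it exactly (this just illustrates the encoding scheme, not the writer's actual code path):

```python
from urllib.parse import quote

# Percent-encode every reserved character, including the space.
value = "2023-09-15 00:00:00.000000"
print(quote(value, safe=""))
# -> 2023-09-15%2000%3A00%3A00.000000
```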
**Databricks**

When writing from Spark on Databricks to Delta Lake we see partial encoding:

`partition_date=2023-09-15 00%3A00%3A00`
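
The Databricks path looks like Hive-style escaping, where only a fixed set of special characters (colons among them) is percent-encoded and spaces are left alone. A rough sketch of that scheme, assuming a simplified character set rather than Spark's exact internal list:

```python
# Hive-style partition path escaping (simplified sketch): only characters
# in a fixed blocklist are percent-encoded; this blocklist is an assumption,
# not Spark's exact list.
SPECIAL = set('":=%\\{}[]^?#\'')

def escape_path_name(value: str) -> str:
    return "".join(f"%{ord(c):02X}" if c in SPECIAL else c for c in value)

print(escape_path_name("2023-09-15 00:00:00"))
# -> 2023-09-15 00%3A00%3A00  (colons escaped, space kept)
```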
**What you expected to happen**:
We would expect consistent encoding across platforms.
**How to reproduce it**:
Write to Azure using Databricks, then inspect the partition layout.
Sample Python run locally:
```python
import datetime

import pyarrow as pa
from deltalake import write_deltalake

# One-row table with a datetime partition column.
data = pa.table({
    "data": pa.array(["mydata"]),
    "inserted_at": pa.array([datetime.datetime.now()]),
    "partition_column": pa.array([datetime.datetime(2023, 9, 15)]),
})

write_deltalake(
    table_or_uri="./unqueryable_table2",
    mode="append",
    data=data,
    partition_by=["partition_column"],
)
```
> When interoperating with Databricks we get inconsistent encodings for the partition paths, which makes it impossible to read the table across clients.

When you say *encoding of the partition*, are you referring to the one in the log, or just the file paths?
FWIW, the file paths shouldn't be consequential as long as they can be read and recognized. The partition values are taken from the log, not the directory structure.
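
If it helps to check, the value the reader should be using is the one stored in the add action's `partitionValues` map, which you can inspect directly in the first commit file (the path below assumes the `./unqueryable_table2` table from the snippet above):

```python
import json
from pathlib import Path

# Print the path and partitionValues of each add action in the first commit.
log_file = Path("./unqueryable_table2/_delta_log/00000000000000000000.json")
for line in log_file.read_text().splitlines():
    action = json.loads(line)
    if "add" in action:
        print("path:           ", action["add"]["path"])
        print("partitionValues:", action["add"]["partitionValues"])
```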
I still seem to be running into issues reading from Delta tables partitioned by datetime:
```python
# write_table.py
import datetime

import pyarrow as pa
from deltalake import write_deltalake

# One-row table partitioned by a datetime column.
data = pa.table({
    "id": pa.array([425], type=pa.int32()),
    "data": pa.array(["python-module-test-write"]),
    "t": pa.array([datetime.datetime(2023, 9, 15)]),
})

write_deltalake(
    table_or_uri="./dt",
    mode="append",
    data=data,
    partition_by=["t"],
)
```
```python
# read_table.py
from deltalake import DeltaTable

dt = DeltaTable(table_uri="./dt")
dataset = dt.to_pyarrow_dataset()
print(dataset.count_rows())
```
```
> python read_table.py
Traceback (most recent call last):
  File "/Users/crathbone/offline-spark/simple/read_table.py", line 4, in <module>
    dataset = dt.to_pyarrow_dataset()
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/deltalake/table.py", line 540, in to_pyarrow_dataset
    for file, part_expression in self._table.dataset_partitions(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/scalar.pxi", line 88, in pyarrow.lib.Scalar.cast
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: error parsing '2023-09-15%2000%3A00%3A00.000000' as scalar of type timestamp[us]
```
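
The string in the error is the percent-encoded form, and decoding it first makes it parse cleanly, which suggests the reader hands the encoded value to Arrow without unescaping it. A quick diagnostic sketch (not a fix):

```python
from urllib.parse import unquote

import pyarrow as pa

raw = pa.scalar("2023-09-15%2000%3A00%3A00.000000")
# raw.cast(pa.timestamp("us"))  # raises ArrowInvalid, as in the traceback
decoded = pa.scalar(unquote(raw.as_py()))
print(decoded.cast(pa.timestamp("us")))  # parses once decoded
```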