I've upgraded the Python delta-spark package to 1.2.0 (and pyspark to 3.3.0). When I run pyspark, I see the error below (under observed results).
I believe that, as part of #954, the LogStore class was changed to use io.delta.storage.S3SingleDriverLogStore. The problem appears to be that the LogStore class isn't present in the Python artefact, leading to the error below. Is this an oversight in the packaging, or should I be specifying an extra dependency to bring in this class?
I also see errors about S3SingleDriverLogStore, which I assume are a consequence of this.
Steps to reproduce
Upgrade to delta-spark 1.2.0 and run via Python (with the default logger). A minimal sketch of such a session is shown below.
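For context, a session along these lines might look like the following sketch; the builder options are the standard Delta quickstart ones, and the S3 path is a placeholder rather than my actual setup.

```python
# Minimal sketch of the failing setup, assuming delta-spark 1.2.0 from pip.
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("delta-logstore-repro")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Writing to an s3:// path triggers the LogStore lookup that fails with
# NoClassDefFoundError: io/delta/storage/LogStore. The bucket is a placeholder.
spark.range(5).write.format("delta").save("s3://my-bucket/delta-table")
```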
Observed results
py4j.protocol.Py4JJavaError: An error occurred while calling o2176.execute.
: com.google.common.util.concurrent.ExecutionError: java.lang.NoClassDefFoundError: io/delta/storage/LogStore
Expected results
A successful run, using the LogStore class to log to S3.
Further details
Environment information
Delta Lake version: 1.2.0
Spark version: 3.2.1
Scala version: n/a
Willingness to contribute
The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?
Yes. I can contribute a fix for this bug independently.
Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
No. I cannot contribute a bug fix at this time.
Thanks @zsxwing - pulling the delta-storage jar down from Maven does the trick.
For future reference, is this the expected approach from a python standpoint, or do we expect the jar to be available as part of pip or similar?
Thanks!
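For anyone else landing here, the workaround amounts to pulling io.delta:delta-storage from Maven alongside delta-core. A rough sketch (version numbers assume delta-core 1.2.0, as in this thread; adjust to your install):

```python
# Sketch of the workaround: fetch io.delta:delta-storage from Maven together
# with delta-core, so io.delta.storage.LogStore ends up on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-with-storage-jar")
    .config("spark.jars.packages",
            "io.delta:delta-core_2.12:1.2.0,io.delta:delta-storage:1.2.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```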
[Feature Request] Include a better error message for NoClassDefFoundError: io/delta/storage/LogStore (#1199)
Quick question - are you purposefully trying to use the old S3SingleDriverLogStore LogStore implementation? We have reimplemented it in our new delta-storage artifact, under the class name io.delta.storage.S3SingleDriverLogStore.
And by the way, for paths with S3 scheme, we will automatically use this io.delta.storage.S3SingleDriverLogStore implementation. There's no need to specify spark.delta.logStore.class.
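To make that concrete, here is a minimal sketch (the bucket path is a placeholder, and the explicit pin is optional; omitting it gives the same behaviour for s3:// paths):

```python
# For s3:// paths, Delta 1.2+ automatically picks
# io.delta.storage.S3SingleDriverLogStore, so spark.delta.logStore.class does
# not need to be set. If you want to pin the implementation anyway, the
# scheme-based key can be supplied on the builder.
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("delta-s3-logstore")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Optional explicit pin; the same class is used by default for s3 paths.
    .config("spark.delta.logStore.s3.impl",
            "io.delta.storage.S3SingleDriverLogStore")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.range(5).write.format("delta").save("s3://my-bucket/delta-table")
```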
Great, thanks @scottsand-db. I'm happy to move to the new LogStore implementation (i.e. drop S3SingleDriverLogStore), and to confirm, zsxwing's solution fixes it for me. Thanks for the tip about the path too - that's useful.
No more questions from me - thanks again both
Hi @bobrippling @scottsand-db, I am using Spark 3.2.0 on EMR and I still have this issue after specifying the packages.
Do you have any idea what might be going on? Thanks
The error is:
>>> spark.range(1,5).write.format("delta").save("s3://eth-etl-delta/test/delta-table")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 740, in save
    self._jwrite.save(path)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1310, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o78.save.
: com.google.common.util.concurrent.ExecutionError: java.lang.NoClassDefFoundError: io/delta/storage/S3SingleDriverLogStore
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
	at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
	at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
	at org.apache.spark.sql.delta.DeltaLog$.getDeltaLogFromCache$1(DeltaLog.scala:577)
	at org.apache.spark.sql.delta.DeltaLog$.apply(DeltaLog.scala:584)
	at org.apache.spark.sql.delta.DeltaLog$.forTable(DeltaLog.scala:472)
	at org.apache.spark.sql.delta.sources.DeltaDataSource.createRelation(DeltaDataSource.scala:149)
22/08/29 14:13:38 INFO SparkContext: Added JAR file:/tmp/spark-5d4cffaa-b1e1-4eeb-9f44-fe3dd4cf3340/delta-core_2.12-1.2.1.jar at spark://py-stats-0dd6da82e9f3422a-driver-svc.streaming.svc:7078/jars/delta-core_2.12-1.2.1.jar with timestamp 1661782417538
22/08/29 14:13:38 INFO SparkContext: Added JAR file:/tmp/spark-5d4cffaa-b1e1-4eeb-9f44-fe3dd4cf3340/delta-storage-1.2.1.jar at spark://py-stats-0dd6da82e9f3422a-driver-svc.streaming.svc:7078/jars/delta-storage-1.2.1.jar with timestamp 1661782417538
22/08/29 14:13:38 INFO SparkContext: Added JAR file:/tmp/spark-5d4cffaa-b1e1-4eeb-9f44-fe3dd4cf3340/delta-contribs_2.12-1.2.1.jar at spark://py-stats-0dd6da82e9f3422a-driver-svc.streaming.svc:7078/jars/delta-contribs_2.12-1.2.1.jar with timestamp 1661782417538
@0xdarkman could you check your Spark's jars directory and see if there are any delta jars there?
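For example, a quick check along these lines (the /usr/lib/spark/jars path is the typical EMR location and is only an assumption here) would show whether an older Delta jar already ships with the cluster and shadows the ones passed via spark.jars.packages:

```python
# Rough sketch: list any Delta jars bundled with the cluster's Spark install,
# which could conflict with the jars added at submit time.
# /usr/lib/spark/jars is the usual EMR location; adjust if yours differs.
import glob

for jar in sorted(glob.glob("/usr/lib/spark/jars/*delta*.jar")):
    print(jar)
```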
I am not able to write anything. I see _delta_log and I see the data, but it seems I have some problem with writing. Would you mind looking into the question I have just opened on SO?
I'm using Delta 2.3. I downloaded the delta-storage jar and am also passing it as a dependency when creating the Spark session, but now I'm getting a different error during the write emp_details.write.format("delta").mode("overwrite").save(delta_path):
23/08/08 16:33:43 WARN TaskSetManager: Lost task 3.0 in stage 1.0 (TID 4) (cluster-b6b8-w-1.c.test-382806.internal executor 2): java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 in instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF
at java.base/java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2076)
at java.base/java.io.ObjectStreamClass$FieldReflector.checkObjectFieldValueTypes(ObjectStreamClass.java:2039)
at java.base/java.io.ObjectStreamClass.checkObjFieldValueTypes(ObjectStreamClass.java:1293)
at java.base/java.io.ObjectInputStream.defaultCheckFieldValues(ObjectInputStream.java:2512)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2419)
The PySpark version (pyspark.version) is 3.4.1 and the Spark version (spark.version) is 3.3.0, as reported from a Jupyter notebook attached to the Dataproc cluster. The version reported on the Dataproc shell is below.
I tried with Delta 2.4, and also with Delta 2.3 directly in the Spark shell instead of Jupyter. In both cases I run into the same error: "cannot assign instance of java.lang.invoke.SerializedLambda..."
Earlier I had run pip install delta-spark==2.3.0; looking at the logs, it pulled in PySpark 3.4, so I uninstalled it. I have the configuration below now, but am still running into the same error: "java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to..."
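Not an authoritative diagnosis, but that SerializedLambda ClassCastException is the usual symptom of a Spark version mismatch between the client and the cluster (here pyspark 3.4.1 against Spark 3.3.0), rather than a Delta packaging problem. A sketch of pinning matching versions for a Spark 3.3 cluster, with the Delta line built against it, might look like this:

```python
# Rough sketch, not a verified fix: match the client-side pyspark to the
# cluster's Spark version (3.3.x here) and use the Delta line built for it
# (delta-spark 2.3.x targets Spark 3.3, delta-spark 2.4.x targets Spark 3.4).
#
#   pip install "pyspark==3.3.0" "delta-spark==2.3.0"
#
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip
import pyspark

builder = (
    SparkSession.builder.appName("delta-version-check")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# If these two disagree, executors deserialize lambdas compiled against a
# different Spark build, which is what the ClassCastException points at.
print("client pyspark:", pyspark.__version__)
print("cluster spark: ", spark.version)
```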