Problem

When you cancel a running Apache Spark write operation and then attempt to rerun it, the following error occurs:

Error: org.apache.spark.sql.AnalysisException: Cannot create the managed table('`testdb`.`testtable`').
The associated location ('dbfs:/user/hive/warehouse/testdb.db/metastore_cache_testtable') already exists.;

Cause

This problem is due to a change in the default behavior of Spark in version 2.4.

This problem can occur if:

  • The cluster is terminated while a write operation is in progress.
  • A temporary network issue occurs.
  • The job is interrupted.

Once the metastore data for a particular table is corrupted, it is hard to recover except by dropping the files in that location manually. The root cause is that a metadata directory called _STARTED is not deleted automatically when Databricks tries to overwrite it.
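
To confirm this, you can list the table location from the error message and look for the leftover _STARTED entry. This is a minimal sketch; the path below is the one shown in the example error above, so substitute the location from your own error message:

%scala
// List the contents of the table location reported in the error message.
// A leftover _STARTED entry indicates the interrupted write.
display(dbutils.fs.ls("dbfs:/user/hive/warehouse/testdb.db/metastore_cache_testtable"))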

You can reproduce the problem by following these steps (a consolidated sketch follows the list):

  1. Create a DataFrame:
    val df = spark.range(1000)
  2. Write the DataFrame to a location in overwrite mode:
    df.write.mode(SaveMode.Overwrite).saveAsTable("testdb.testtable")
  3. Cancel the command while it is executing.
  4. Re-run the write command.
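
Put together, the reproduction looks like the following single notebook cell. This is a minimal sketch that assumes the database testdb already exists:

%scala
// Minimal reproduction sketch; assumes the database `testdb` already exists.
import org.apache.spark.sql.SaveMode

// 1. Create a small DataFrame.
val df = spark.range(1000)

// 2. Write it as a managed table in overwrite mode.
//    Cancel this cell while it is running, then re-run it to hit the error above.
df.write.mode(SaveMode.Overwrite).saveAsTable("testdb.testtable")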

Solution

Set the flag spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation to true. This flag deletes the _STARTED directory and returns the process to the original state. For example, you can set it in a notebook:

%python
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")

Or you can set it at the cluster level in the Spark config (AWS | Azure | GCP):

spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation true

Another option is to manually clean up the data directory specified in the error message. You can do this with dbutils.fs.rm.

%scala
dbutils.fs.rm("<path-to-directory>", true)
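
For the example table above, the call would look like the following. The path is the one reported in the error message; substitute the location from your own error:

%scala
// Recursively delete the leftover table directory reported in the error message.
// This path matches the example error above; replace it with your own.
dbutils.fs.rm("dbfs:/user/hive/warehouse/testdb.db/metastore_cache_testtable", true)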