Escaping double quotes in spark dataframe - Cloudera Community


I am reading a CSV file into a Spark dataframe. Some of the fields contain double quotes ("") and I want to escape them. Can anyone let me know how I can do this? Since the double quote is also used in the parameter list of the option method, I don't know how to escape double quotes in the data.

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter", "|").option("escape", -----

Col1|Col2|Col3|Col4
12|34|"56|78"|9A
"AB"|"CD"|EF|"GH:"|:"IJ"

If I load it with Spark I get

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true")
                   .option("delimiter", "|").option("escape", ":").load("/tmp/test.csv")
df.show()
+----+----+-----+-------+
|Col1|Col2| Col3|   Col4|
+----+----+-----+-------+
|  12|  34|56|78|     9A|
|  AB|  CD|   EF|GH"|"IJ|
+----+----+-----+-------+

So the example contains a delimiter inside quotes as well as escaped quotes. I use ":" to escape quotes; you can use many other characters (but don't use e.g. "#").
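To make the result above easier to follow, here is a minimal sketch in plain Scala of how a delimiter inside quotes and an escaped quote can be handled. It is a simplified illustration, not the parser that spark-csv actually uses, and the names PipeCsvSketch and splitLine are made up for this example. It also does not reproduce every corner case of the real parser (for instance, a quote that follows a leading space).

```scala
object PipeCsvSketch {
  // Splits one line on `delimiter`, honouring `quote` and `escape`.
  // Hypothetical helper for illustration only.
  def splitLine(line: String,
                delimiter: Char = '|',
                quote: Char = '"',
                escape: Char = ':'): Vector[String] = {
    val fields = Vector.newBuilder[String]
    val current = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (c == escape && i + 1 < line.length && line.charAt(i + 1) == quote) {
        // Escape character followed by a quote: keep the quote as data.
        current.append(quote)
        i += 1
      } else if (c == quote) {
        // Unescaped quote: enter or leave a quoted section.
        inQuotes = !inQuotes
      } else if (c == delimiter && !inQuotes) {
        // Delimiter outside quotes ends the current field.
        fields += current.toString
        current.clear()
      } else {
        current.append(c)
      }
      i += 1
    }
    fields += current.toString
    fields.result()
  }

  def main(args: Array[String]): Unit = {
    // The two data rows from the example file.
    println(splitLine("12|34|\"56|78\"|9A"))
    println(splitLine("\"AB\"|\"CD\"|EF|\"GH:\"|:\"IJ\""))
  }
}
```

The last field of the second row comes out as GH"|"IJ because both quotes preceded by ":" are kept as data, so the "|" between them is still inside the quoted section.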

Is this something you want to achieve?
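Regarding the original question of how to pass a double quote to the option method: in a Scala string literal a double quote is written with a backslash, i.e. "\"". The following is a sketch only. It assumes Spark 2.x or later with the built-in CSV reader and a SparkSession (the thread itself uses sqlContext with com.databricks.spark.csv), and it assumes the file marks a quote inside a quoted field by doubling it. The object name and app name are made up.

```scala
import org.apache.spark.sql.SparkSession

object QuoteEscapeSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is fine for trying this out.
    val spark = SparkSession.builder()
      .appName("quote-escape-sketch")
      .master("local[*]")
      .getOrCreate()

    val df = spark.read
      .option("header", "true")
      .option("delimiter", "|")
      .option("quote", "\"")   // "\"" is a one-character string holding a double quote
      .option("escape", "\"")  // use the double quote itself as the escape character
      .csv("/tmp/test.csv")

    df.show()
    spark.stop()
  }
}
```

If the file uses some other character in front of embedded quotes, pass that character to "escape" instead, as in the ":" example above.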

There is an issue with the space in front of "EF":

Let's use the following. You don't need the "escape" option here; it is only useful if you need to, e.g., get quotes into the dataframe.

val df = sqlContext.read.format("com.databricks.spark.csv")
          .option("header", "true")
          .option("delimiter", "|")
          .load("/tmp/test.csv")
df.show()

With space in front of "EF":

+----+----+----+-----+
|Col1|Col2|Col3| Col4|
+----+----+----+-----+
|  AB|  CD|  DE| "EF"|
+----+----+----+-----+

Without space in front of "EF":

+----+----+----+----+
|Col1|Col2|Col3|Col4|
+----+----+----+----+
|  AB|  CD|  DE|  EF|
+----+----+----+----+

Can you remove the space before loading the csv into Spark?
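If removing the space by hand is not practical, the lines could be cleaned before parsing. Below is a minimal sketch in plain Scala; the name stripSpaceBeforeQuote is hypothetical, and it assumes "|" is the delimiter and that whitespace directly between a delimiter (or the start of the line) and an opening quote is never meaningful data.

```scala
object CsvCleanupSketch {
  // Removes whitespace that sits between a delimiter (or line start)
  // and an opening double quote, e.g. `| "EF"` becomes `|"EF"`.
  def stripSpaceBeforeQuote(line: String): String =
    line.replaceAll("(^|\\|)\\s+\"", "$1\"")

  def main(args: Array[String]): Unit = {
    println(stripSpaceBeforeQuote("AB|CD|DE| \"EF\""))
  }
}
```

The Spark CSV reader also documents an ignoreLeadingWhiteSpace option, which may be worth trying first; whether it resolves this particular case would need to be checked against the Spark version in use.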
