AWS Glue Dynamic Dataframe relationalize

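[The following question has been translated.] I loaded JSON data and used the relationalize method on a dynamic frame to flatten the nested JSON objects, then saved the result in Parquet format. The problem is that once the data is saved as Parquet for faster Athena queries, the column names contain dots, which violates Athena SQL query syntax, so I cannot run column-specific queries.

To work around this, I also renamed the columns in the Glue job, replacing the dots with underscores. My question is which of the two approaches below is better, and why? (Efficiency: memory? Execution speed on the nodes? Something else?)

Also, because of the unbelievable AWS Glue documentation, I could not find a solution that uses dynamic frames only. I cannot get the column names dynamically from a dynamic frame, so I am using toDF().

The first approach gets the column names from a DataFrame extracted from the dynamic frame: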
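To illustrate where the dotted names come from, here is a minimal pure-Python sketch of flattening a nested record. It is not Glue's implementation of relationalize, only an approximation of its naming for nested objects; the `flatten` helper and the sample record are hypothetical, and arrays (which relationalize pivots into separate tables) are ignored.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts, joining parent and child keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects; arrays are not handled in this sketch.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical sample record with one nested object.
record = {"name": "example", "url": "https://example.com", "sample": {"key": "value"}}
print(sorted(flatten(record)))  # ['name', 'sample.key', 'url']
```

The nested `sample` object becomes a single `sample.key` column, which is the kind of name that later needs quoting or renaming.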

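from awsglue.transforms import Relationalize, RenameField
# Flatten the nested JSON and keep the root table of the resulting collection.
relationalize1 = Relationalize.apply(frame=datasource0, transformation_ctx="relationalize1").select("roottable")
df_relationalize1 = relationalize1.toDF()
for field in df_relationalize1.schema.fields:
    # Backticks make RenameField treat the dotted name as a single field.
    relationalize1 = RenameField.apply(frame=relationalize1, old_name="`" + field.name + "`", new_name=field.name.replace(".", "_"), transformation_ctx="renamefield_" + field.name)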
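
The second approach extracts a DataFrame from the dynamic frame, renames the fields on the PySpark DataFrame (rather than on the dynamic frame), then converts it back to a dynamic frame and saves it in Parquet format.

Is there a better approach? Can a crawler rename columns? How fast is the .fromDF() method? Is there better documentation of the functions and methods than the PDF developer guide?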
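The second approach is described without code, so here is a rough sketch of what it might look like. The helper names `underscore_names` and `rename_dotted_columns` are hypothetical, and the function assumes it runs inside a Glue job where a `GlueContext` is available; it makes no claim about which approach is faster.

```python
def underscore_names(names):
    """Return the column names with every dot replaced by an underscore."""
    return [name.replace(".", "_") for name in names]


def rename_dotted_columns(dynamic_frame, glue_context, name="renamed"):
    """Rename dotted columns on the Spark DataFrame, then wrap it again.

    Hypothetical helper: assumes a Glue job environment.
    """
    # Imported lazily so underscore_names can be used outside a Glue job.
    from awsglue.dynamicframe import DynamicFrame

    df = dynamic_frame.toDF()
    # DataFrame.toDF(*cols) returns a new DataFrame with all columns renamed in one call.
    renamed_df = df.toDF(*underscore_names(df.columns))
    return DynamicFrame.fromDF(renamed_df, glue_context, name)


print(underscore_names(["name", "url", "sample.key"]))  # ['name', 'url', 'sample_key']
```

Renaming all columns in a single `toDF(*names)` call avoids applying one transform per field, although two different dotted names could collide once the dots are replaced.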

Topics: Analytics
Tags: AWS Glue
Language: Chinese (Simplified)


               
                
rePost Polyglot (EXPERT) asked 2 years ago, 37 views


1 Answer


[The following answer has been translated.] You can run column-specific queries on a Parquet table whose column names contain dots.
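As a hedged illustration of that point, the sketch below builds a SELECT statement that wraps each identifier in double quotes, which is how Athena expects special characters in column names to be escaped in queries. The helpers `quote_identifier` and `build_select` are hypothetical, and the table and column names are used only as examples.

```python
def quote_identifier(name):
    """Wrap an identifier in double quotes, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_select(table, columns):
    """Build a SELECT statement with every identifier quoted."""
    cols = ", ".join(quote_identifier(c) for c in columns)
    return f"SELECT {cols} FROM {quote_identifier(table)}"


query = build_select("parquet_table", ["name", "sample.key"])
print(query)  # SELECT "name", "sample.key" FROM "parquet_table"
```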


                    
Table: parquet_table

|-- name: string
|-- url: string
|-- sample.key: string