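3.6. S3 select operations (Technology Preview)

As a developer, you can use the S3 select API with high-level analytic applications, such as Spark-SQL, to improve latency and throughput. For example, given a CSV S3 object with several gigabytes of data, a user can extract a single column, filtered by another column, with the following query: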

select customerid from s3Object where age>30 and age<65;

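Currently, the data of the S3 object must be retrieved from the Ceph OSDs before it can be filtered and extracted. The performance gain is greater when the object is large and the query is more specific.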
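To make the query above concrete, the following Python sketch performs the same filter and projection on the client side. The sample data and the function name select_customerid are hypothetical and exist only for illustration; with S3 select, the query expresses this work in SQL so that only the selected column is returned.

```python
import csv
import io

# Hypothetical sample object; the column names mirror the query in the text.
SAMPLE_CSV = """customerid,name,age
c001,Ann,29
c002,Bob,41
c003,Eve,64
c004,Dan,65
"""

def select_customerid(csv_text, low=30, high=65):
    """Client-side equivalent of:
    select customerid from s3Object where age>30 and age<65;
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["customerid"] for row in reader if low < int(row["age"]) < high]

print(select_customerid(SAMPLE_CSV))
```

Both bounds are exclusive, as in the query, so a row with age 65 is not selected.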

3.6.1. Prerequisites

A running Red Hat Ceph Storage cluster.
A RESTful client.
An S3 user created with user access.

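3.6.2. S3 select content from an object

The select object content API filters the content of an object through structured query language (SQL). In the request, you must specify the data serialization format of the object as comma-separated values (CSV) to retrieve the specified content. The Amazon Web Services (AWS) command-line interface (CLI) select object content command uses the CSV format to parse the object data into records and returns only the records specified in the query. You must also specify the data serialization format for the response. This operation requires the s3:GetObject permission.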

Syntax

POST /BUCKET/KEY?select&select-type=2 HTTP/1.1\r\n

Example

POST /testbucket/sample1csv?select&select-type=2 HTTP/1.1\r\n

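Request entities

Bucket

Description: The bucket to select object content from.

Key

Description: The object key. Length constraints: minimum length of 1.

SelectObjectContentRequest

Description: Root-level tag for the select object content request parameters.

Expression

Description: The expression that is used to query the object.

ExpressionType

Description: The type of the provided expression, for example SQL.

InputSerialization

Description: Describes the format of the data in the object that is being queried.

OutputSerialization

Description: The format of the returned data, with a comma as the field separator and a new-line as the record delimiter.

Response entities

If the operation is successful, the service sends back an HTTP 200 response. The service returns data in XML format: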

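Payload

Description: Root-level tag for the payload parameters.

Records

Description: The records event. Type: Base64-encoded binary data object.

Stats

Description: The stats event.

The Ceph Object Gateway supports the following response:

{:event-type,records} {:content-type,application/octet-stream} {:message-type,event}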
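The request entities described above can be assembled into an XML body. The following Python sketch shows one possible layout. It assumes the child elements of InputSerialization and OutputSerialization use the same names as the AWS CLI options shown in this section (CSV, FileHeaderInfo, FieldDelimiter, CompressionType); the function name build_select_request is hypothetical, no XML namespace is set, and this document does not confirm the exact set of elements that the Ceph Object Gateway accepts.

```python
import xml.etree.ElementTree as ET

def build_select_request(expression, header_info="USE"):
    """Build a SelectObjectContentRequest body (illustrative layout only)."""
    root = ET.Element("SelectObjectContentRequest")
    ET.SubElement(root, "Expression").text = expression
    ET.SubElement(root, "ExpressionType").text = "SQL"

    # Assumed child layout, mirroring the --input-serialization JSON.
    input_ser = ET.SubElement(root, "InputSerialization")
    ET.SubElement(input_ser, "CompressionType").text = "NONE"
    csv_in = ET.SubElement(input_ser, "CSV")
    ET.SubElement(csv_in, "FileHeaderInfo").text = header_info
    ET.SubElement(csv_in, "FieldDelimiter").text = ","

    # Mirrors --output-serialization '{"CSV": {}}'.
    output_ser = ET.SubElement(root, "OutputSerialization")
    ET.SubElement(output_ser, "CSV")

    return ET.tostring(root, encoding="unicode")

body = build_select_request("select count(0) from stdin where int(_1)<10;")
print(body)
```

The serializer escapes the < in the expression as &lt;, which is required for the body to remain well-formed XML.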

Syntax

aws --endpoint-url http://localhost:80 s3api select-object-content \
 --bucket BUCKET_NAME \
 --expression-type 'SQL' \
 --input-serialization '{"CSV": {"FieldDelimiter": "," , "QuoteCharacter": "\"" , "RecordDelimiter" : "\n" , "QuoteEscapeCharacter" : "\\" , "FileHeaderInfo": "USE" }, "CompressionType": "NONE"}' \
 --output-serialization '{"CSV": {}}' \
 --key OBJECT_NAME \
 --expression "select count(0) from stdin where int(_1)<10;" output.csv

Example

aws --endpoint-url http://localhost:80 s3api select-object-content \
 --bucket testbucket \
 --expression-type 'SQL' \
 --input-serialization '{"CSV": {"FieldDelimiter": "," , "QuoteCharacter": "\"" , "RecordDelimiter" : "\n" , "QuoteEscapeCharacter" : "\\" , "FileHeaderInfo": "USE" }, "CompressionType": "NONE"}' \
 --output-serialization '{"CSV": {}}' \
 --key testobject \
 --expression "select count(0) from stdin where int(_1)<10;" output.csv

Supported features

Currently, only part of the AWS s3 select command is supported:

| Feature | Details | Description | Example |
| --- | --- | --- | --- |
| Arithmetic operators | `^ * % / + - ( )` | | `select (int(_1)+int(_2))*int(_9) from stdin;` |
| Arithmetic operators | `%` modulo | | `select count(*) from stdin where cast(_1 as int)%2 == 0;` |
| Arithmetic operators | `^` power-of | | `select cast(2^10 as int) from stdin;` |
| Compare operators | `> < >= <= == !=` | | `select _1,_2 from stdin where (int(_1)+int(_3))>int(_5);` |
| Logical operator | AND OR NOT | | `select count(*) from stdin where not (int(_1)>123 and int(_5)<200);` |
| Logical operator | is null | Returns true/false for null indication in expression. | |
| Logical operator and NULL | is not null | Returns true/false for null indication in expression. | |
| Logical operator and NULL | | Review null-handle and observe the results of logical operations with NULL. The query returns 0. | `select count(*) from stdin where null and (3>2);` |
| Arithmetic operator with NULL | | Review null-handle and observe the results of binary operations with NULL. The query returns 0. | `select count(*) from stdin where (null+1) and (3>2);` |
| Compare with NULL | | Review null-handle and observe the results of compare operations with NULL. The query returns 0. | `select count(*) from stdin where (null*1.5) != 3;` |
| Missing column | | | `select count(*) from stdin where _1 is null;` |
| Projection column | Similar to if or then or else | | `select case when (1+1==(2+1)*3) then 'case_1' when ((4*3)==(12)) then 'case_2' else 'case_else' end, age*2 from stdin;` |
| Logical operator | coalesce | Returns the first non-null argument. | `select coalesce(nullif(5,5),nullif(1,1.0),age+12) from stdin;` |
| Logical operator | nullif | Returns null if both arguments are equal, otherwise the first argument: nullif(1,1)=NULL, nullif(null,1)=NULL, nullif(2,1)=2. | `select nullif(cast(_1 as int),cast(_2 as int)) from stdin;` |
| Logical operator | {expression} in ( .. {expression} ..) | | `select count(*) from stdin where 'ben' in (trim(_5),substring(_1,char_length(_1)-3,3),last_name);` |
| Logical operator | {expression} between {expression} and {expression} | | `select count(*) from stdin where substring(_3,char_length(_3),1) between "x" and trim(_1) and substring(_3,char_length(_3)-1,1) == ":";` |
| Logical operator | {expression} like {match-pattern} | | `select count(*) from stdin where first_name like '%de_';` `select count(*) from stdin where _1 like "%a[r-s]";` |
| Casting operator | | | `select cast(123 as int)%2 from stdin;` |
| Casting operator | | | `select cast(123.456 as float)%2 from stdin;` |
| Casting operator | | | `select cast('ABC0-9' as string),cast(substr('ab12cd',3,2) as int)*4 from stdin;` |
| Casting operator | | | `select cast(substring('publish on 2007-01-01',12,10) as timestamp) from stdin;` |
| Non-AWS casting operator | | | `select int(_1),int( 1.2 + 3.4) from stdin;` |
| Non-AWS casting operator | | | `select float(1.2) from stdin;` |
| Non-AWS casting operator | | | `select timestamp('1999:10:10-12:23:44') from stdin;` |
| Aggregation function | sum | | `select sum(int(_1)) from stdin;` |
| Aggregation function | avg | | `select avg(cast(_1 as float) + cast(_2 as int)) from stdin;` |
| Aggregation function | max, min | | `select max(float(_1)),min(int(_5)) from stdin;` |
| Aggregation function | count | | `select count(*) from stdin where (int(_1)+int(_3))>int(_5);` |
| Timestamp functions | extract | | `select count(*) from stdin where extract('year',timestamp(_2)) > 1950 and extract('year',timestamp(_1)) < 1960;` |
| Timestamp functions | dateadd | | `select count(0) from stdin where datediff('year',timestamp(_1),dateadd('day',366,timestamp(_1))) == 1;` |
| Timestamp functions | datediff | | `select count(0) from stdin where datediff('month',timestamp(_1),timestamp(_2)) == 2;` |
| Timestamp functions | utcnow | | `select count(0) from stdin where datediff('hours',utcnow(),dateadd('day',1,utcnow())) == 24;` |
| String functions | substring | | `select count(0) from stdin where int(substring(_1,1,4))>1950 and int(substring(_1,1,4))<1960;` |
| String functions | trim | | `select trim(' foobar ') from stdin;` |
| String functions | trim | | `select trim(trailing from ' foobar ') from stdin;` |
| String functions | trim | | `select trim(leading from ' foobar ') from stdin;` |
| String functions | trim | | `select trim(both '12' from '1112211foobar22211122') from stdin;` |
| String functions | char_length, character_length | | `select count(*) from stdin where char_length(_3)==3;` |
| Complex queries | | | `select sum(cast(_1 as int)),max(cast(_3 as int)), substring('abcdefghijklm',(2-1)*3+sum(cast(_1 as int))/sum(cast(_1 as int))+1,(count(*)+count(0))/count(0)) from stdin;` |
| Alias support | | | `select int(_1) as a1, int(_2) as a2 , (a1+a2) as a3 from stdin where a3>100 and a3<300;` |

For more details, see Amazon's S3 Select Object Content API.

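3.6.3. S3 supported select functions

S3 select supports the following functions:

Timestamp

timestamp(string)

Description: Converts a string to the basic timestamp type. Currently it converts: yyyy:mm:dd hh:mi:dd

extract(date-part,timestamp)

Description: Returns an integer extracted from the input timestamp according to date-part. Supported date-part values: year, month, week, day.

dateadd(date-part ,integer,timestamp)

Description: Returns a timestamp calculated from the input timestamp and date-part. Supported date-part values: year, month, day.

datediff(date-part,timestamp,timestamp)

Description: Returns an integer, the difference between the two timestamps calculated according to date-part. Supported date-part values: year, month, day, hours.

utcnow()

Description: Returns the timestamp of the current time.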

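Aggregation

count()

Description: Returns an integer, the number of rows that match the condition, if there is one.

sum(expression)

Description: Returns the sum of the expression over every row that matches the condition, if there is one.

avg(expression)

Description: Returns the average of the expression over every row that matches the condition, if there is one.

max(expression)

Description: Returns the maximum result of the expression over all rows that match the condition, if there is one.

min(expression)

Description: Returns the minimum result of the expression over all rows that match the condition, if there is one.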
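The aggregation functions above can be illustrated with a small client-side Python sketch that computes the same values over the first column (_1) of an in-memory CSV. The sample data and the function name aggregate_first_column are hypothetical, and returning None for sum, avg, max, and min when no row matches is an assumption borrowed from standard SQL, not a behavior this document states for S3 select.

```python
import csv
import io

# Hypothetical headerless object; _1 is the first column.
SAMPLE = "5,a\n12,b\n7,c\n30,d\n"

def aggregate_first_column(csv_text, condition=lambda v: True):
    """Mimic count/sum/avg/max/min over int(_1) for rows matching condition."""
    values = []
    for row in csv.reader(io.StringIO(csv_text)):
        v = int(row[0])  # int(_1): columns are numbered from 1 in the query
        if condition(v):
            values.append(v)
    if not values:
        # Assumption: empty input yields NULL (None) for all but count.
        return {"count": 0, "sum": None, "avg": None, "max": None, "min": None}
    return {
        "count": len(values),
        "sum": sum(values),
        "avg": sum(values) / len(values),
        "max": max(values),
        "min": min(values),
    }

# Comparable to: select count(0), sum(int(_1)), ... from stdin where int(_1)<10;
result = aggregate_first_column(SAMPLE, lambda v: v < 10)
print(result)
```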

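String

substring(string,from,to)

Description: Returns a string extracted from the input string according to the from and to inputs.

Char_length

Description: Returns the number of characters in the string. Character_length serves the same purpose.

Trim

Description: Trims the leading or trailing characters from the target string. The default is the blank character.

Upper\lower

Description: Converts characters to uppercase or lowercase.

NULL

A NULL value is missing or unknown; that is, NULL cannot produce a value in any arithmetic operation. The same applies to arithmetic comparison: any comparison to NULL is NULL, that is, unknown.

| A is NULL | Result (NULL=UNKNOWN) |
| --- | --- |
| Not A | NULL |
| A or False | NULL |
| A or True | True |
| A or A | NULL |
| A and False | False |
| A and True | NULL |
| A and A | NULL |

For more details, see Amazon's S3 Select Object Content API.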
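The NULL rules above follow standard SQL three-valued logic, which can be sketched in Python with None standing in for NULL. The helper names not3, and3, and or3 are hypothetical. The sketch also shows why a query such as select count(*) from stdin where null and (3>2); returns 0: a row is counted only when the condition evaluates to true, and an unknown result is not true.

```python
def not3(a):
    """NOT with NULL (None) propagation."""
    return None if a is None else (not a)

def and3(a, b):
    """AND: False dominates, otherwise NULL propagates."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def or3(a, b):
    """OR: True dominates, otherwise NULL propagates."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

A = None  # A is NULL
truth_table = {
    "Not A": not3(A),
    "A or False": or3(A, False),
    "A or True": or3(A, True),
    "A or A": or3(A, A),
    "A and False": and3(A, False),
    "A and True": and3(A, True),
    "A and A": and3(A, A),
}
for expression, value in truth_table.items():
    print(expression, "=>", "NULL" if value is None else value)

# A where clause keeps a row only if the condition is exactly True.
row_selected = and3(None, 3 > 2) is True
```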

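3.6.4. S3 alias programming construct

The alias programming construct is an essential part of the s3 select language because it enables better programming with objects that contain many columns or with complex queries. When a statement with an alias construct is parsed, the alias is replaced with a reference to the right projection column, and on query execution the reference is evaluated like any other expression. An alias maintains a result cache: if an alias is used more than once, the same expression is not evaluated again, and the same result is returned from the cache. Currently, Red Hat supports the column alias.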

select int(_1) as a1, int(_2) as a2 , (a1+a2) as a3 from s3object where a3>100 and a3<300;
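The result-cache behavior described above can be illustrated with a short Python sketch that evaluates the same aliases for each row and counts how often an alias expression is actually computed. The class AliasRow, the ALIASES mapping, and run_query are hypothetical names; this is a model of the idea under those assumptions, not the Ceph implementation.

```python
class AliasRow:
    """Evaluates alias expressions lazily and caches each result per row."""

    def __init__(self, columns, aliases):
        self.columns = columns      # raw column values of one row
        self.aliases = aliases      # alias name -> function(row)
        self.cache = {}
        self.evaluations = 0        # how many alias expressions were computed

    def get(self, name):
        if name not in self.cache:
            self.evaluations += 1
            self.cache[name] = self.aliases[name](self)
        return self.cache[name]

# int(_1) as a1, int(_2) as a2, (a1+a2) as a3
ALIASES = {
    "a1": lambda row: int(row.columns[0]),
    "a2": lambda row: int(row.columns[1]),
    "a3": lambda row: row.get("a1") + row.get("a2"),
}

def run_query(rows):
    """Mimic: select a1, a2, a3 from s3object where a3>100 and a3<300;"""
    selected = []
    for columns in rows:
        row = AliasRow(columns, ALIASES)
        if 100 < row.get("a3") < 300:
            selected.append((row.get("a1"), row.get("a2"), row.get("a3")))
    return selected

print(run_query([["60", "70"], ["10", "20"], ["200", "150"]]))
```

Within one row, a3 is referenced in both the where clause and the projection, yet each of the three alias expressions is computed only once.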

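3.6.5. S3 CSV parsing explained

You can define the CSV definitions with input serialization, which uses these default values:

Use {\n} for the row-delimiter.
Use {"} for the quote character.
Use {\} for the escape character.

The csv-header-info is parsed; it is the first row in the input object and contains the schema. Currently, output serialization and compression-type are not supported. The S3 select engine has a CSV parser which parses S3 objects as follows:

Each row ends with a row-delimiter.
The field-separator separates adjacent columns.
Successive field-separators define a NULL column.
The quote-character overrides the field-separator; that is, a field-separator between quotes is treated as an ordinary character.
The escape character disables any special character except the row-delimiter.

The following are examples of the CSV parsing rules: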

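Table 3.5. CSV parsing

| Feature | Description | Input (Tokens) |
| --- | --- | --- |
| `NULL` | Successive field delimiters. | `,,1,,2, ==> {null}{null}{1}{null}{2}{null}` |
| `QUOTE` | The quote character overrides the field delimiter. | `11,22,"a,b,c,d",last ==> {11}{22}{"a,b,c,d"}{last}` |
| `Escape` | The escape character overrides the meta-character. | |
| `Row delimiter` | There is no closing quote; the row delimiter closes the line. | `11,22,a="str,44,55,66 ==> {11}{22}{a="str,44,55,66}` |
| `CSV header info` | FileHeaderInfo tag | |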
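The parsing rules in the table can be sketched as a small single-row tokenizer in Python. The function name tokenize_row is hypothetical, and two details are assumptions that this section does not settle: the escape character itself is dropped from the token, and quote characters are kept in the token (as the QUOTE example suggests). This is an illustration of the rules, not the parser used by the S3 select engine.

```python
def tokenize_row(row, sep=",", quote='"', escape="\\"):
    """Split one row (without its row delimiter) into tokens.

    Empty fields become None, modeling the NULL column rule.
    """
    tokens = []
    current = []
    in_quotes = False
    i = 0
    while i < len(row):
        ch = row[i]
        if ch == escape and i + 1 < len(row):
            # Assumption: the escaped character is kept, the escape is dropped.
            current.append(row[i + 1])
            i += 2
            continue
        if ch == quote:
            in_quotes = not in_quotes
            current.append(ch)          # quotes stay part of the token
        elif ch == sep and not in_quotes:
            tokens.append("".join(current) if current else None)
            current = []
        else:
            current.append(ch)
        i += 1
    # End of row closes the last token, even inside an unclosed quote.
    tokens.append("".join(current) if current else None)
    return tokens

print(tokenize_row(",,1,,2,"))
print(tokenize_row('11,22,"a,b,c,d",last'))
print(tokenize_row('11,22,a="str,44,55,66'))
```

In the third call the quote is never closed, so everything up to the end of the row is collected into a single token, matching the row delimiter example in the table.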