openLooKeng documentation

<property>
    <!-- https://community.hortonworks.com/content/supportkb/247055/errorjavalangunsupportedoperationexception-storage.html -->
    <name>metastore.storage.schema.reader.impl</name>
    <value>org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader</value>
</property>

connector.name=hive-hadoop2
hive.metastore.uri=thrift://example.net:9083

hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml

-DHADOOP_USER_NAME=hdfs_user

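Property name	Description	Default
`hive.metastore`	Type of Hive metastore to use.	`thrift`
`hive.config.resources`	Optional comma-separated list of HDFS configuration files. These files must exist on the machines running openLooKeng. Specify this property only when it is absolutely necessary for accessing HDFS. Example: `/etc/hdfs-site.xml`
`hive.recursive-directories`	Allow reading data from subdirectories of table or partition locations. If disabled, subdirectories are ignored. This is equivalent to the `hive.mapred.supports.subdirectories` property in Hive.
`hive.storage-format`	Default file format used when creating new tables.	`ORC`
`hive.compression-codec`	Compression codec used when writing files.	`GZIP`
`hive.force-local-scheduling`	Force splits to be scheduled on the same node as the Hadoop DataNode process serving the split data. This is useful for installations where openLooKeng is co-located with every DataNode.	`false`
`hive.respect-table-format`	Whether new partitions are written using the existing table format or the default openLooKeng format.	`true`
`hive.immutable-partitions`	Whether new data can be inserted into existing partitions.	`false`
`hive.create-empty-bucket-files`	Whether empty files are created for buckets that have no data.	`false`
`hive.max-partitions-per-writers`	Maximum number of partitions per writer.	100
`hive.max-partitions-per-scan`	Maximum number of partitions for a single table scan.	100000
`hive.hdfs.authentication.type`	HDFS authentication type. Possible values are `NONE` or `KERBEROS`.	`NONE`
`hive.hdfs.impersonation.enabled`	Enable HDFS end-user impersonation.	`false`
`hive.hdfs.presto.principal`	The Kerberos principal that openLooKeng uses when connecting to HDFS.
`hive.hdfs.presto.keytab`	HDFS client keytab location.
`hive.security`	See Hive Security Configuration.
`security.config-file`	Path of the configuration file to use when `hive.security=file`. See File Based Authorization for details.
`hive.non-managed-table-writes-enabled`	Enable writes to non-managed (external) Hive tables.	`false`
`hive.non-managed-table-creates-enabled`	Enable creating non-managed (external) Hive tables.	`true`
`hive.collect-column-statistics-on-write`	Enable automatic collection of column-level statistics on write. See Table Statistics for details.	`true`
`hive.s3select-pushdown.enabled`	Enable query pushdown to the AWS S3 Select service.	`false`
`hive.s3select-pushdown.max-connections`	Maximum number of simultaneously open connections to S3 for S3 Select pushdown.	500
`hive.orc.use-column-names`	To support dropping columns with ALTER TABLE, add `hive.orc.use-column-names=true` to the Hive properties; otherwise dropping a column may not work properly.	false
`hive.orc-predicate-pushdown-enabled`	Enable predicate pushdown when reading ORC files.	`false`
`hive.orc.time-zone`	Sets the default time zone for legacy ORC files that did not declare a time zone.	JVM default
`hive.parquet.time-zone`	Adjusts timestamp values to a specific time zone. For Hive 3.1+, set this to UTC.	JVM default
`hive.rcfile.time-zone`	Adjusts binary-encoded timestamp values to a specific time zone. For Hive 3.1+, set this to UTC.	JVM default
`hive.vacuum-service-threads`	Number of threads running in the vacuum service.	2
`hive.auto-vacuum-enabled`	Enable auto-vacuum on Hive tables. To enable auto-vacuum on the engine side, add `auto-vacuum.enabled=true` to config.properties on the coordinator node.	`false`
`hive.vacuum-delta-num-threshold`	Maximum number of delta directories allowed without compaction. Minimum value is 2.	10
`hive.vacuum-delta-percent-threshold`	Maximum percentage of delta directories allowed without compaction. The value should be between 0.1 and 1.0.	0.1
`hive.vacuum-cleanup-recheck-interval`	Interval after which the vacuum cleanup task is resubmitted. Minimum value is 5 minutes.	`5 Minutes`
`hive.vacuum-collector-interval`	Interval after which the vacuum collector task is resubmitted.	`5 Minutes`
`hive.max-splits-to-group`	Maximum number of splits that can be grouped. A value of 1 disables grouping; the minimum value is 1. Many small splits create many drivers, which need more memory, scheduling, and context switching, and this hurts read performance. Grouping small splits together reduces the number of splits and drivers created, so fewer resources are needed and performance improves.	1
`hive.metastore-client-service-threads`	Number of threads the metastore client uses to communicate with the Hive metastore in parallel.	4
`hive.worker-metastore-cache-enabled`	Also enable caching of the Hive metastore on worker nodes.	`false`
`hive.metastore-write-batch-size`	Number of partitions sent to the metastore per request.	8
`hive.metastore-cache-ttl`	Metastore cache eviction time for table and partition metadata.	`0s`
`hive.metastore-refresh-interval`	Time after which metastore cache entries for table and partition metadata are refreshed from the Hive metastore.	`1s`
`hive.metastore-db-cache-ttl`	Metastore cache eviction time for database, role, configuration, table list, and view list objects.	`0s`
`hive.metastore-db-refresh-interval`	Time after which metastore cache entries for database, table list, view list, and role objects are refreshed from the Hive metastore.	`1s`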
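
Several properties in the table above carry explicit limits (a minimum of 2, a range of 0.1 to 1.0, a minimum of 1). The sketch below shows one way such limits could be checked before a catalog file is deployed. The helper `check_hive_limits` is hypothetical and not part of openLooKeng, and treating both ends of the 0.1 to 1.0 range as inclusive is an assumption.

```python
def check_hive_limits(props):
    """Return a list of problems found in a dict of catalog properties.

    Hypothetical helper: only the three limits stated in the property
    table are checked. Missing keys fall back to the documented defaults.
    """
    problems = []

    # hive.vacuum-delta-num-threshold: minimum value is 2 (default 10).
    num = int(props.get("hive.vacuum-delta-num-threshold", 10))
    if num < 2:
        problems.append("hive.vacuum-delta-num-threshold must be >= 2")

    # hive.vacuum-delta-percent-threshold: between 0.1 and 1.0 (default 0.1).
    # Inclusive bounds are an assumption.
    pct = float(props.get("hive.vacuum-delta-percent-threshold", 0.1))
    if not 0.1 <= pct <= 1.0:
        problems.append("hive.vacuum-delta-percent-threshold must be in [0.1, 1.0]")

    # hive.max-splits-to-group: minimum value is 1 (default 1, no grouping).
    splits = int(props.get("hive.max-splits-to-group", 1))
    if splits < 1:
        problems.append("hive.max-splits-to-group must be >= 1")

    return problems


if __name__ == "__main__":
    print(check_hive_limits({"hive.vacuum-delta-num-threshold": "1"}))
```

An empty result means none of the three documented limits is violated; it says nothing about the other properties.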

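Property name	Description
`hive.metastore.uri`	The URI(s) of the Hive metastore to connect to using the Thrift protocol. If multiple URIs are provided, the first URI is used by default and the remaining URIs are fallback metastores. This property is required. Example: `thrift://192.0.2.3:9083` or `thrift://192.0.2.3:9083,thrift://192.0.2.4:9083`
`hive.metastore.username`	The username openLooKeng uses to access the Hive metastore.
`hive.metastore.authentication.type`	Hive metastore authentication type. Possible values are `NONE` or `KERBEROS` (defaults to `NONE`).
`hive.metastore.thrift.impersonation.enabled`	Enable Hive metastore end-user impersonation.
`hive.metastore.thrift.client.ssl.enabled`	Use SSL when connecting to the metastore. Defaults to `false`. When set to true, either a keystore or a truststore is required. The keystore/truststore path and password must be set in `jvm.config`. The keys are: `-Djavax.net.ssl.keystoreType= e.g. jks` `-Djavax.net.ssl.keyStore=` `-Djavax.net.ssl.keyStorePassword=` `-Djavax.net.ssl.trustStore=` `-Djavax.net.ssl.trustStorePassword=`
`hive.metastore.service.principal`	The Kerberos principal of the Hive metastore service.
`hive.metastore.client.principal`	The Kerberos principal that openLooKeng uses when connecting to the Hive metastore service.
`hive.metastore.client.keytab`	Hive metastore client keytab location.
`hive.metastore.thrift.is-role-name-case-sensitive`	Whether role names are case-sensitive. Defaults to false.
`hive.metastore.krb5.conf.path`	Kerberos configuration file location.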
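
The `hive.metastore.uri` entry above says the first URI is used by default and the rest act as fallbacks. As a rough illustration of that ordering, the sketch below splits a property value into a primary endpoint and a fallback list. The function `split_metastore_uris` is a hypothetical helper for checking a configuration value, not the connector's own parsing code.

```python
from urllib.parse import urlsplit


def split_metastore_uris(value):
    """Split a hive.metastore.uri value into (primary, fallbacks).

    Each endpoint is returned as a (host, port) tuple. Hypothetical helper
    that only mirrors the ordering rule described in the table.
    """
    endpoints = []
    for raw in value.split(","):
        parts = urlsplit(raw.strip())
        # Require the thrift scheme plus an explicit host and port.
        if parts.scheme != "thrift" or parts.hostname is None or parts.port is None:
            raise ValueError(f"not a thrift://host:port URI: {raw!r}")
        endpoints.append((parts.hostname, parts.port))
    # The first URI is the default; any others are fallback metastores.
    return endpoints[0], endpoints[1:]


if __name__ == "__main__":
    primary, fallbacks = split_metastore_uris(
        "thrift://192.0.2.3:9083,thrift://192.0.2.4:9083"
    )
    print("primary:", primary)
    print("fallbacks:", fallbacks)
```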

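Property name	Description
`hive.metastore.glue.region`	AWS region of the Glue Catalog. This is required when not running in EC2, or when the catalog is in a different region. Example: `us-east-1`
`hive.metastore.glue.pin-client-to-current-region`	Pin Glue requests to the same region as the EC2 instance where openLooKeng is running (defaults to `false`).
`hive.metastore.glue.max-connections`	Maximum number of concurrent connections to Glue (defaults to `5`).
`hive.metastore.glue.default-warehouse-dir`	Default warehouse directory for the Hive Glue metastore.
`hive.metastore.glue.aws-access-key`	AWS access key to use to connect to the Glue Catalog. If specified along with `hive.metastore.glue.aws-secret-key`, this parameter takes precedence over `hive.metastore.glue.iam-role`.
`hive.metastore.glue.aws-secret-key`	AWS secret key to use to connect to the Glue Catalog. If specified along with `hive.metastore.glue.aws-access-key`, this parameter takes precedence over `hive.metastore.glue.iam-role`.
`hive.metastore.glue.iam-role`	ARN of the IAM role to use when connecting to the Glue Catalog.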

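Property name	Description
`hive.s3.use-instance-credentials`	Use the EC2 metadata service to retrieve API credentials (defaults to `true`). This works with IAM roles in EC2.
`hive.s3.aws-access-key`	Default AWS access key to use.
`hive.s3.aws-secret-key`	Default AWS secret key to use.
`hive.s3.iam-role`	IAM role to use.
`hive.s3.endpoint`	The S3 storage endpoint server. This can be used to connect to an S3-compatible storage system instead of AWS. When using v4 signatures, it is recommended to set this property to the AWS region-specific endpoint (for example, `http[s]://.s3-.amazonaws.com`).
`hive.s3.signer-type`	Specify a different signer type for S3-compatible storage. Example: `S3SignerType` for the v2 signer type.
`hive.s3.path-style-access`	Use path-style access for all requests to the S3-compatible storage. This is for S3-compatible storage that does not support virtual-hosted-style access (defaults to `false`).
`hive.s3.staging-directory`	Local staging directory for data written to S3. Defaults to the Java temporary directory specified by the JVM system property `java.io.tmpdir`.
`hive.s3.pin-client-to-current-region`	Pin S3 requests to the same region as the EC2 instance where openLooKeng is running (defaults to `false`).
`hive.s3.ssl.enabled`	Use HTTPS to communicate with the S3 API (defaults to `true`).
`hive.s3.sse.enabled`	Use S3 server-side encryption (defaults to `false`).
`hive.s3.sse.type`	The type of key management for S3 server-side encryption. Use `S3` for S3-managed keys or `KMS` for KMS-managed keys (defaults to `S3`).
`hive.s3.sse.kms-key-id`	The KMS key ID to use for S3 server-side encryption with KMS-managed keys. If not set, the default key is used.
`hive.s3.kms-key-id`	If set, use S3 client-side encryption, store the encryption keys in AWS KMS, and use the value of this property as the KMS key ID for newly created objects.
`hive.s3.encryption-materials-provider`	If set, use S3 client-side encryption and use the value of this property as the fully qualified name of a Java class that implements the AWS SDK's `EncryptionMaterialsProvider` interface. If the class also implements `Configurable` from the Hadoop API, the Hadoop configuration is passed in after the object has been created.
`hive.s3.upload-acl-type`	Canned ACL to use when uploading files to S3 (defaults to `Private`).
`hive.s3.skip-glacier-objects`	Ignore Glacier objects rather than failing the query. This skips data that may belong to the table or partition. Defaults to `false`.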

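Property name	Description	Default
`hive.s3.max-error-retries`	Maximum number of error retries, set on the S3 client.	`10`
`hive.s3.max-client-retries`	Maximum number of read retries.	`5`
`hive.s3.max-backoff-time`	Use exponential backoff starting at 1 second up to this maximum value when communicating with S3.	`10 minutes`
`hive.s3.max-retry-time`	Maximum time to retry communicating with S3.	`10 minutes`
`hive.s3.connect-timeout`	TCP connect timeout.	`5 seconds`
`hive.s3.socket-timeout`	TCP socket read timeout.	`5 seconds`
`hive.s3.max-connections`	Maximum number of simultaneous open connections to S3.	`500`
`hive.s3.multipart.min-file-size`	Minimum file size before multi-part upload to S3 is used.	`16 MB`
`hive.s3.multipart.min-part-size`	Minimum part size for multi-part uploads.	`5 MB`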
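
To make the `hive.s3.max-backoff-time` description more concrete, here is a small sketch of an exponential backoff schedule that starts at 1 second and is capped at the configured maximum (600 seconds for the `10 minutes` default). The doubling factor and the function name `backoff_delays` are assumptions made for illustration; the table only states the starting point and the cap, so the connector's real schedule may differ.

```python
def backoff_delays(attempts, max_backoff_seconds=600.0, first_delay_seconds=1.0):
    """Return the wait before each retry, capped at max_backoff_seconds.

    Illustrative only: the delay doubles after every attempt, which is an
    assumed growth factor rather than a documented one.
    """
    delays = []
    delay = first_delay_seconds
    for _ in range(attempts):
        # Never wait longer than the configured maximum backoff.
        delays.append(min(delay, max_backoff_seconds))
        delay *= 2
    return delays


if __name__ == "__main__":
    for attempt, wait in enumerate(backoff_delays(12), start=1):
        print(f"retry {attempt}: wait {wait:.0f}s")
```

Under these assumptions the cap is reached on the eleventh retry, after which every wait equals the maximum.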

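Property name	Description
`hive.gcs.json-key-file-path`	JSON key file used to authenticate with Google Cloud Storage.
`hive.gcs.use-access-token`	Use a client-provided OAuth token to access Google Cloud Storage. This is mutually exclusive with a global JSON key file.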

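Property name	Description	Default
`hive.orc.file-tail.cache.enabled`	Enable the ORC file tail cache.	`false`
`hive.orc.file-tail.cache.ttl`	TTL for the ORC file tail cache.	`4 hours`
`hive.orc.file-tail.cache.limit`	Maximum number of entries in the ORC file tail cache.	`50,000`
`hive.orc.stripe-footer.cache.enabled`	Enable the ORC stripe footer cache.	`false`
`hive.orc.stripe-footer.cache.ttl`	TTL for the ORC stripe footer cache.	`4 hours`
`hive.orc.stripe-footer.cache.limit`	Maximum number of entries in the ORC stripe footer cache.	`250,000`
`hive.orc.row-index.cache.enabled`	Enable the ORC row index cache.	`false`
`hive.orc.row-index.cache.ttl`	TTL for the ORC row index cache.	`4 hours`
`hive.orc.row-index.cache.limit`	Maximum number of entries in the ORC row index cache.	`250,000`
`hive.orc.bloom-filters.cache.enabled`	Enable the ORC bloom filter cache.	`false`
`hive.orc.bloom-filters.cache.ttl`	TTL for the ORC bloom filter cache.	`4 hours`
`hive.orc.bloom-filters.cache.limit`	Maximum number of entries in the ORC bloom filter cache.	`250,000`
`hive.orc.row-data.block.cache.enabled`	Enable the ORC row group block cache.	`false`
`hive.orc.row-data.block.cache.ttl`	TTL for the ORC row group cache.	`4 hours`
`hive.orc.row-data.block.cache.max.weight`	Maximum weight of the ORC row group cache.	`20 GB`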

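Column type	Collectible statistics
`TINYINT`	number of nulls, number of distinct values, min/max values
`SMALLINT`	number of nulls, number of distinct values, min/max values
`INTEGER`	number of nulls, number of distinct values, min/max values
`BIGINT`	number of nulls, number of distinct values, min/max values
`DOUBLE`	number of nulls, number of distinct values, min/max values
`REAL`	number of nulls, number of distinct values, min/max values
`DECIMAL`	number of nulls, number of distinct values, min/max values
`DATE`	number of nulls, number of distinct values, min/max values
`TIMESTAMP`	number of nulls, number of distinct values, min/max values
`VARCHAR`	number of nulls, number of distinct values
`CHAR`	number of nulls, number of distinct values
`VARBINARY`	number of nulls
`BOOLEAN`	number of nulls, number of true/false values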

ANALYZE table_name WITH (
    partitions = ARRAY[
        ARRAY['p1_value1', 'p1_value2'],
        ARRAY['p2_value1', 'p2_value2']])

CREATE TABLE hive_acid_table (
    id int,
    name string )
  WITH (format='ORC', transactional=true);

INSERT INTO hive_acid_table
  VALUES
     (1, 'foo'),
     (2, 'bar');

UPDATE hive_acid_table
  SET name='john'
  WHERE id=2;

lk:default> SELECT * FROM hive_acid_table;
 id | name
----+------
  2 | bar
  1 | foo
(2 rows)

lk:default> SELECT * FROM hive_acid_table;
 id | name
----+------
  2 | john
  1 | foo
(2 rows)

DELETE FROM hive_acid_table
  WHERE id=2;

lk:default> SELECT * FROM hive_acid_table;
 id | name
----+------
  2 | john
  1 | foo
(2 rows)

lk:default> SELECT * FROM hive_acid_table;
 id | name
----+------
  1 | foo
(1 row)

VACUUM TABLE hive_acid_table;

VACUUM TABLE hive_acid_table
  FULL;

CREATE TABLE hive_acid_table_partitioned (
    id int,
    name string,
    class int)
WITH (format='ORC', transactional=true, partitioned_by=ARRAY['class']);

INSERT INTO hive_acid_table_partitioned
  VALUES
    (1, 'foo', 5),
    (2, 'bar', 10);

VACUUM TABLE hive_acid_table_partitioned
   PARTITION 'class=5';

VACUUM TABLE hive_acid_table
  AND WAIT;

VACUUM TABLE hive_acid_table
  AND WAIT;

VACUUM TABLE hive_acid_table_partitioned
  PARTITION 'class=5'
  AND WAIT;

CREATE TABLE hive.avro.avro_data (
   id bigint
)
WITH (
   format = 'AVRO',
   avro_schema_url = '/usr/local/avro_data.avsc'
)

CREATE SCHEMA hive.web
WITH (location = 's3://my-bucket/')

DELETE FROM hive.web.page_views
WHERE ds = DATE '2016-08-09'
  AND country = 'US'

CALL system.create_empty_partition(
    schema_name => 'web',
    table_name => 'page_views',
    partition_columns => ARRAY['ds', 'country'],
    partition_values => ARRAY['2016-08-09', 'US']);

SELECT * FROM hive.web.page_views

SELECT * FROM hive.web."page_views$partitions"

CREATE TABLE hive.web.request_logs (
  request_time timestamp,
  url varchar,
  ip varchar,
  user_agent varchar
)
WITH (
  format = 'TEXTFILE',
  external_location = 's3://my-bucket/data/logs/'
)

ANALYZE hive.web.request_logs;

DROP TABLE hive.web.request_logs

DROP SCHEMA hive.web

# Table & Partition Cache specific configurations
hive.metastore-cache-ttl=24h
hive.metastore-refresh-interval=23h
# DB, Table & View list, Roles, configurations related cache configuration
hive.metastore-db-cache-ttl=4m
hive.metastore-db-refresh-interval=3m

REFRESH META CACHE

```
SET SESSION task_writer_count=<num>;
#Note: `num` is the default number of local parallel table writer jobs per worker; it must be a power of 2.
#Recommended value: 50% of the total CPU cores available on the worker node.
```
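
The two constraints in the note above (a power of 2, and roughly 50% of the worker's CPU cores) do not always coincide, for example on a 24-core worker. One possible way to reconcile them is sketched below: take half the cores and round down to the nearest power of two. Rounding down is an assumption, and `recommended_task_writer_count` is a hypothetical helper, not an openLooKeng API.

```python
def recommended_task_writer_count(cpu_cores):
    """Largest power of two that does not exceed half of cpu_cores.

    Rounding down (rather than up) is an assumption; the result is never
    smaller than 1.
    """
    half = max(1, cpu_cores // 2)
    # bit_length() - 1 is the exponent of the highest set bit of `half`.
    return 1 << (half.bit_length() - 1)


if __name__ == "__main__":
    for cores in (4, 16, 24, 96):
        count = recommended_task_writer_count(cores)
        print(f"{cores} cores: SET SESSION task_writer_count={count};")
```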
```
VACUUM TABLE catalog_sales FULL UNIFY;
```
```
SET SESSION hive.write_partition_distribution=true
#Default: false
```
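```
hive.metastore-timeout=<TimeWithUnit>;
#Note: 'TimeWithUnit' is a duration given in seconds or minutes.
#Default: 10s (where 's' denotes seconds)
#Recommended: for operations on large partitioned tables, 60s or higher, configured according to the data volume. The value shown here is for reference only; adjust it to your actual situation.
```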
```
SET SESSION hive.metastore-client-service-threads = 4
#Default: 4
#Recommended: The number of running hive metastore service instances * 4.
```
```
hive.metastore-write-batch-size = 64
#Default: 8
#Recommended: 64 or higher, so that more partition writes are batched into each request to the Hive metastore service.
```
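
Since `hive.metastore-write-batch-size` is the number of partitions sent to the metastore per request, the effect of raising it can be estimated with simple arithmetic. The sketch below counts the requests needed for a given number of partitions; `metastore_write_requests` is a hypothetical helper, and it assumes every request is filled to the batch size except possibly the last.

```python
def metastore_write_requests(partition_count, batch_size=8):
    """Number of metastore requests needed to send partition_count partitions.

    Assumes each request carries up to batch_size partitions (default 8,
    the documented default of hive.metastore-write-batch-size).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Integer ceiling division: a partial final batch still costs a request.
    return (partition_count + batch_size - 1) // batch_size


if __name__ == "__main__":
    for size in (8, 64):
        requests = metastore_write_requests(1000, size)
        print(f"batch size {size}: {requests} requests for 1000 partitions")
```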

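Version: 1.10.0

Hive Connector

Overview

Supported File Types

Configuration

Multiple Hive Clusters

HDFS Configuration

HDFS Username and Permissions

Accessing Hadoop Clusters Protected with Kerberos Authentication

Hive Configuration Properties

Hive Thrift Metastore Configuration Properties

AWS Glue Catalog Configuration Properties

Amazon S3 Configuration

S3 Configuration Properties

S3 Credentials

Custom S3 Credentials Provider

Tuning Properties

S3 Data Encryption

S3 Select Pushdown

Is S3 Select a Good Fit for My Workload?

Considerations and Limitations

Enabling S3 Select Pushdown

Understanding and Tuning the Maximum Connections

Google Cloud Storage Configuration

GCS Configuration Properties

ORC Cache Configuration

ORC Cache Properties

Table Statistics

Updating Table and Partition Statistics

Hive ACID Support

Creating Transactional Tables with the Hive Connector

INSERT on Transactional Tables

UPDATE on Transactional Tables

DELETE on Transactional Tables

VACUUM on Transactional Tables

VACUUM

VACUUM FULL

VACUUM on Partitioned Tables

AND WAIT Option

Schema Evolution

Avro Schema Evolution

Limitations

Procedures

Examples

Cleaning Up

Metastore Cache:

Performance Tuning Notes:

INSERT

Hive Metastore Timeout

Parallel Metastore Operations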