
Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at massive scale. The warehouse provides central storage for information that you can easily analyze to make informed, data-driven decisions. Hive lets users read, write, and manage petabytes of data using SQL.

Hive

Put simply, Hive makes data storage logical: it turns operations on files in a distributed file system into the SQL-like commands developers already know, and it can process petabyte-scale offline data efficiently.
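
For example, a Hive table is just metadata layered over files in HDFS, queried with ordinary SQL; the table name and columns below are purely illustrative:

-- illustrative HiveQL: describe a tab-delimited file layout as a table, then query it
CREATE TABLE demo_logs (id INT, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

SELECT COUNT(*) FROM demo_logs;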

Configure the backend storage

The index information and other metadata that Hive creates for its data must be stored in a backend database. Several kinds of relational databases can be used, either external or built-in; because MySQL's SQL is relatively simple and widely known, MySQL is used as the example here.

Install the MySQL 5.7 server (other versions work as well; this article uses MySQL 5.7 as the example, and the installation package can be downloaded from the official website.)

sudo dnf install mysql-server

Start the database

sudo systemctl start mysqld
sudo systemctl enable mysqld

Log in to the database (if there is no password by default, just press Enter twice.)

mysql -uroot -p

Tip: in more recent versions, a random password is generated when the service is first started; it can be retrieved with the following command.

sudo grep -i 'password' /var/log/mysqld.log

Run the following SQL to create the database and dedicated user for Hive's metastore; configuring or using the root account is not recommended.

> CREATE USER 'hive'@'%' IDENTIFIED BY 'Hadoop@2022';
> GRANT ALL ON *.* TO 'hive'@'%' IDENTIFIED BY 'Hadoop@2022';
> FLUSH PRIVILEGES;
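
The GRANT ... IDENTIFIED BY form above only works on MySQL 5.7 and earlier. If you happen to run MySQL 8.0 instead (not the version used in this guide), the equivalent would be roughly:

> CREATE USER 'hive'@'%' IDENTIFIED BY 'Hadoop@2022';
> GRANT ALL ON *.* TO 'hive'@'%';
> FLUSH PRIVILEGES;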

Common issues

a) If the error reads as follows

ERROR 1820 (HY000): You must reset your password using ALTER USER statement before executing this statement.

This is because MySQL installed from the official repository requires the default password to be reset before any other statement can be executed

SET PASSWORD = PASSWORD('Hadoop@2022!');
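
On MySQL 8.0 and later, where the PASSWORD() function has been removed, the same reset would instead use ALTER USER (shown only as a fallback for newer versions):

> ALTER USER 'root'@'localhost' IDENTIFIED BY 'Hadoop@2022!';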

b) If the error reads as follows

ERROR 1819 (HY000): Your password does not satisfy the current policy requirements

This is because the default password policy in the official packages is fairly strict: the password must mix upper- and lower-case letters, digits, and punctuation characters. Change the password to a high-strength one.
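
If you would rather inspect or relax the policy than pick a stronger password, the validate_password plugin variables can be queried and adjusted (names shown are for MySQL 5.7; MySQL 8.0 uses validate_password.policy instead):

> SHOW VARIABLES LIKE 'validate_password%';
> SET GLOBAL validate_password_policy = LOW;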

Download

Download the Hive 3.1.3 stable release

wget https://dlcdn.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz

Extract

sudo tar xf apache-hive-3.1.3-bin.tar.gz -C /opt/

Change ownership

sudo chown -R $USER:$USER /opt/apache-hive-3.1.3-bin/
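
The remaining steps assume that HIVE_HOME points at the extracted directory and that its bin directory is on the PATH; a minimal sketch, for example appended to ~/.bashrc, would be:

export HIVE_HOME=/opt/apache-hive-3.1.3-bin
export PATH=$PATH:$HIVE_HOME/bin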

Download the MySQL JDBC connector (Connector/J) driver

cd $HIVE_HOME/lib
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.18/mysql-connector-java-8.0.18.jar

Configuration

Create the Hive configuration file

cd $HIVE_HOME
vim conf/hive-site.xml

Add the following content

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false&amp;serverTimezone=GMT</value>
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <!-- change to the MySQL user actually in use -->
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>Username to use against metastore database</description>
  </property>
  <property>
    <!-- change to the MySQL password actually in use -->
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>Hadoop@2022</value>
    <description>password to use against metastore database</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>
  <property>
    <name>hive.server2.webui.host</name>
    <value>0.0.0.0</value>
    <description>The host address the HiveServer2 WebUI will listen on</description>
  </property>
  <property>
    <name>hive.server2.webui.port</name>
    <value>10002</value>
    <description>The port the HiveServer2 WebUI will listen on. This can be set to 0 or a negative integer to disable the web UI</description>
  </property>
  <property>
    <name>hive.server2.webui.max.threads</name>
    <value>50</value>
    <description>The max HiveServer2 WebUI threads</description>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
    <description>
      Enforce metastore schema version consistency.
      True: Verify that version information stored in is compatible with one from Hive jars.  Also disable automatic
            schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures
            proper metastore schema migration. (Default)
      False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
    </description>
  </property>
  <property>
    <name>hive.metastore.schema.verification.record.version</name>
    <value>false</value>
    <description>
      When true the current MS version is recorded in the VERSION table. If this is disabled and verification is
       enabled the MS will be unusable.
    </description>
  </property>
  <property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
    <description>Whether to include the current database in the Hive prompt.</description>
  </property>
  <property>
    <name>hive.cli.print.header</name>
    <value>true</value>
    <description>Whether to print the names of the columns in query output.</description>
  </property>
  <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
    <description>Port number of HiveServer2 Thrift interface when hive.server2.transport.mode is 'binary'.</description>
  </property>
  <property>
    <name>hive.server2.thrift.bind.host</name>
    <value>0.0.0.0</value>
    <description>Bind host on which to run the HiveServer2 Thrift service.</description>
  </property>
  <property>
    <name>hive.metastore.port</name>
    <value>9083</value>
    <description>Hive metastore listener port</description>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hadoop1:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>
  <property>
    <name>hive.metastore.event.db.notification.api.auth</name>
    <value>false</value>
    <description>
      Should metastore do authorization against database notification related APIs such as get_next_notification.
      If set to true, then only the superusers in proxy settings have the permission
    </description>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>false</value>
    <description>
      Setting this property to true will have HiveServer2 execute
      Hive operations as the user making the calls to it.
    </description>
  </property>
</configuration>
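
Before initializing, it can be worth confirming that the account configured in hive-site.xml can actually reach MySQL. An optional quick check (adjust host, user, and password to your own values):

mysql -h localhost -u hive -p'Hadoop@2022' -e "SHOW DATABASES;"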

Initialization

Create Hive's temporary and warehouse directories in HDFS, and grant write permission

hdfs dfs -mkdir /tmp
hdfs dfs -chmod g+w /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse

Check that the Hive HDFS paths were created successfully

$ hdfs dfs -ls /
Found 2 items
drwx-w----   - hadoop supergroup          0 2022-11-10 10:48 /tmp
drwxr-xr-x   - hadoop supergroup          0 2022-11-10 10:48 /user

Initialize the Hive metastore schema

schematool -initSchema -dbType mysql

You will then see a large amount of output; after a short wait, the following keywords indicate that the schema initialization succeeded.

Initialization script completed
schemaTool completed
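
As an optional cross-check, the metastore tables should now exist on the MySQL side; DBS, TBLS, and VERSION are among the tables created by the standard schema:

mysql -u hive -p'Hadoop@2022' -e "USE hive; SHOW TABLES;"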

Start the services

Start the services (while debugging, you can open two extra terminal windows and run them in the foreground)

hive --service metastore
hive --service hiveserver2

Once everything checks out, the services can be started in the background instead:

nohup hive --service metastore > /dev/null 2>&1 &
nohup hive --service hiveserver2 > /dev/null 2>&1 &
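
A quick way to confirm that both services came up is to check the listening ports (9083 for the metastore, 10000 and 10002 for HiveServer2); the ss command below is just one way to do this:

ss -lnt | grep -E ':(9083|10000|10002)'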

Test the services

Test that the hive CLI connects normally (the double hash marks indicate the hive statements actually entered; take care to tell them apart from the output)

$ hive
which: no hbase in (/opt/apache-hive-3.1.3-bin/bin:/opt/hadoop-3.3.1//bin:/opt/hadoop-3.3.1//sbin:/home/uos/.local/bin:/home/uos/bin:/opt/hadoop-3.3.1//bin:/opt/hadoop-3.3.1//sbin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apache-hive-3.1.3-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = 24afea97-b330-4711-9dd6-a584224b158d
Logging initialized using configuration in file:/opt/apache-hive-3.1.3-bin/conf/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Hive Session ID = f17ed485-640e-4f08-96f9-2dfa9f9f3371
## list databases
hive (default)> show databases;
database_name
default
Time taken: 1.268 seconds, Fetched: 1 row(s)

Test that beeline connects normally (replace 10.10.10.10 with the actual address of hadoop1)

$ beeline
Beeline version 3.1.3 by Apache Hive
## connect to the database
beeline> !connect jdbc:hive2://10.10.10.10:10000
Connecting to jdbc:hive2://10.10.10.10:10000
Enter username for jdbc:hive2://10.10.10.10:10000: hive
Enter password for jdbc:hive2://10.10.10.10:10000: **********
Connected to: Apache Hive (version 3.1.3)
Driver: Hive JDBC (version 3.1.3)
Transaction isolation: TRANSACTION_REPEATABLE_READ
## list databases
0: jdbc:hive2://10.10.10.10:10000> show databases;
INFO  : Compiling command(queryId=uos_20221115163538_4fff1545-e191-4c47-9ebf-85fdb7b9b86b): show databases
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:database_name, type:string, comment:from deserializer)], properties:null)
INFO  : Completed compiling command(queryId=uos_20221115163538_4fff1545-e191-4c47-9ebf-85fdb7b9b86b); Time taken: 1.718 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=uos_20221115163538_4fff1545-e191-4c47-9ebf-85fdb7b9b86b): show databases
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=uos_20221115163538_4fff1545-e191-4c47-9ebf-85fdb7b9b86b); Time taken: 0.08 seconds
INFO  : OK
INFO  : Concurrency mode is disabled, not creating a lock manager
+----------------+
| database_name  |
+----------------+
| default        |
+----------------+
1 row selected (2.6 seconds)
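
The same check can also be run non-interactively, which is handy for scripting; -u, -n, and -e are standard beeline options, and the address mirrors the interactive session above:

beeline -u jdbc:hive2://10.10.10.10:10000 -n hive -e 'show databases;'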

Visit http://hadoop1:10002 in a browser to see the HiveServer2 management page
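
If no browser is at hand, a header-only request is enough to confirm that the Web UI is responding (assuming curl is installed):

curl -sI http://hadoop1:10002/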

Common issues

a) Large amounts of useless INFO output

If you do not want the client to print large amounts of INFO output, add the --hiveconf hive.server2.logging.operation.level=NONE parameter when launching it

beeline --hiveconf hive.server2.logging.operation.level=NONE

Or disable it in hive-site.xml

  <property>
    <name>hive.server2.logging.operation.level</name>
    <value>NONE</value>
    <description>
      Expects one of [none, execution, performance, verbose].
      HS2 operation logging mode available to clients to be set at session level.
      For this to work, hive.server2.logging.operation.enabled should be set to true.
        NONE: Ignore any logging
        EXECUTION: Log completion of tasks
        PERFORMANCE: Execution + Performance logs
        VERBOSE: All logs
    </description>
  </property>

b) Running the services under systemd

$ cat /usr/lib/systemd/system/hive-metastore.service
[Unit]
Description=Apache Hive Metastore
[Service]
User=hadoop
Group=hadoop
Environment="HIVE_HOME=/data/apache-hive-3.1.3-bin"
Environment="HADOOP_HOME=/data/hadoop-3.2.2"
WorkingDirectory=/data/apache-hive-3.1.3-bin
ExecStart=/bin/bash /data/apache-hive-3.1.3-bin/bin/hive --service metastore
Restart=always
[Install]
WantedBy=multi-user.target

$ cat /usr/lib/systemd/system/hive-server2.service
[Unit]
Description=Apache HiveServer2
[Service]
User=hadoop
Group=hadoop
Environment="HIVE_HOME=/data/apache-hive-3.1.3-bin"
Environment="HADOOP_HOME=/data/hadoop-3.2.2"
WorkingDirectory=/data/apache-hive-3.1.3-bin
ExecStart=/bin/bash /data/apache-hive-3.1.3-bin/bin/hive --service hiveserver2
Restart=always
[Install]
WantedBy=multi-user.target
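
Once the unit files are in place, the services are managed like any other systemd unit:

sudo systemctl daemon-reload
sudo systemctl enable --now hive-metastore hive-server2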

Appendix

Reference links