Kerberos is the most headache-inducing authentication setup out there, but with the Hadoop stack there is no way around it, so you just have to grit your teeth and get on with it. Using Trino and StarRocks as examples, this article walks through how a non-EMR node can connect to an Alibaba Cloud EMR cluster's Kerberos via a series of magical configurations. StarRocks and Trino differ a little in configuration style: Trino already exposes Kerberos-related settings in its catalog properties, so those can replace part of what would otherwise live in the xxx-site.xml files, whereas StarRocks exposes no Kerberos interface at all and relies entirely on xxx-site.xml files. In theory, the StarRocks approach therefore applies to any software built on the Hadoop client libraries.
Buy a new EMR cluster and a test ECS instance
First, make sure the EMR cluster and the ECS instance are created in the same security group. If they can't even reach each other over the network, don't bother with the rest.
Once you have the new EMR cluster, run hive to create a test table, and right away you will hit the following error:
[root@master-1-1(172.26.95.71) ~]# hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apps/HIVE/hive-3.1.3-hadoop3.1-1.0.4/lib/log4j-slf4j-impl-2.17.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/apps/HADOOP-COMMON/hadoop-3.2.1-1.2.7-alinux3/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = 1b337056-6f87-46e8-b3b7-52665e5622bf
Logging initialized using configuration in file:/etc/taihao-apps/hive-conf/hive-log4j2.properties Async: true
Exception in thread "main" java.lang.RuntimeException: java.io.IOException: DestHost:destPort master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com:9000 , LocalHost:localPort master-1-1.c-8120a41f6b0c44
3d.cn-zhangjiakou.emr.aliyuncs.com/172.26.95.71:0. Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:651)
at org.apache.hadoop.hive.ql.session.SessionState.beginStart(SessionState.java:591)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:747)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:683)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
......
Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:770)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:733)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:827)
at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:421)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1606)
at org.apache.hadoop.ipc.Client.call(Client.java:1435)
... 34 more
Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:173)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:390)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:627)
at org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:421)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:814)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:810)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:810)
... 37 more
That is because the machine has never run kinit; we need to pick a principal and kinit with it.
You can log into kadmin with kadmin.local and run list_principals to see that the EMR cluster already ships with the following principals.
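For example, on the master node (assuming you are root, since kadmin.local reads the KDC database directly) the same listing can be obtained non-interactively:
kadmin.local -q "list_principals"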
HTTP/core-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
HTTP/core-1-2.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
HTTP/core-1-3.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
HTTP/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
K/M@EMR.C-8120A41F6B0C443D.COM
emr-monitor/core-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
emr-monitor/core-1-2.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
emr-monitor/core-1-3.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
emr-monitor/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
flink/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
hadoop/core-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
hadoop/core-1-2.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
hadoop/core-1-3.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
hadoop/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
hdfs/core-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
hdfs/core-1-2.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
hdfs/core-1-3.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
hdfs/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
hive/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
host/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
kadmin/admin@EMR.C-8120A41F6B0C443D.COM
kadmin/changepw@EMR.C-8120A41F6B0C443D.COM
kadmin/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
kiprop/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
krbtgt/EMR.C-8120A41F6B0C443D.COM@EMR.C-8120A41F6B0C443D.COM
rangeradmin/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
rangerlookup/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
rangerusersync/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
root/admin@EMR.C-8120A41F6B0C443D.COM
spark/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
trino/core-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
trino/core-1-2.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
trino/core-1-3.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
trino/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
zookeeper/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
I won't bother creating my own principal here; I'll simply reuse the existing spark one, since EMR has surely already granted it the permissions it needs across the various components.
Run the following kinit command:
kinit -kt /etc/taihao-apps/spark-conf/keytab/spark.keytab spark/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
Afterwards you can confirm the kinit succeeded with klist.
[root@master-1-1(172.26.95.71) ~]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: spark/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
Valid starting Expires Service principal
2023-08-20T22:53:12 2023-08-21T22:53:12 krbtgt/EMR.C-8120A41F6B0C443D.COM@EMR.C-8120A41F6B0C443D.COM
renew until 2023-08-27T22:53:12
From this point on, you can run any table-creation operations on this new EMR cluster.
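As a quick smoke test, you can create the database and table that are queried later in this post; the table name and columns here are just an illustration:
CREATE DATABASE IF NOT EXISTS smith;
CREATE TABLE smith.hello (id INT, name STRING);
INSERT INTO smith.hello VALUES (1, 'world');
SELECT * FROM smith.hello;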
ECS Configuration
My machine runs CentOS 7. The ECS instance first needs the Kerberos packages installed; run the following command:
yum install -y krb5-server krb5-libs krb5-workstation
After installation, edit /etc/krb5.conf: all you need to do is add a [realms] section and set default_realm. For what to put in these two fields, just copy them from the /etc/krb5.conf on the EMR cluster. Trimmed down, mine looks like this:
[libdefaults]
    default_realm = EMR.C-8120A41F6B0C443D.COM

[realms]
    EMR.C-8120A41F6B0C443D.COM = {
        kdc = master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com:88
        admin_server = master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com:749
    }
default_realm must be set, otherwise Trino will fail at startup.
Then copy spark.keytab from the EMR master to the ECS instance:
scp root@172.26.95.71:/etc/taihao-apps/spark-conf/keytab/spark.keytab .
Then run kinit:
kinit -kt spark.keytab spark/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
And confirm with klist that the kinit succeeded:
Ticket cache: FILE:/tmp/krb5cc_1025
Default principal: spark/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
Valid starting Expires Service principal
08/20/2023 23:07:05 08/21/2023 23:07:04 krbtgt/EMR.C-8120A41F6B0C443D.COM@EMR.C-8120A41F6B0C443D.COM
With that, the ECS-side configuration is done.
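Optionally, a quick reachability check against the KDC (88), the Hive metastore (9083) and the HDFS NameNode (9000) can save debugging time later. The nc invocations below are only a sketch; exact flags depend on which netcat variant is installed, and UDP probes in particular can report misleading results:
nc -vz  master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com 88     # KDC over TCP
nc -vzu master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com 88     # KDC over UDP
nc -vz  master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com 9083   # Hive metastore
nc -vz  master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com 9000   # HDFS NameNode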
Trino Configuration
The Trino Hive catalog is configured as follows:
connector.name=hive
hive.metastore.uri=thrift://master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com:9083
hive.metastore.authentication.type=KERBEROS
# Note: this is the hive principal, not the spark principal we kinit'ed on the ECS
hive.metastore.service.principal=hive/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
hive.metastore.client.principal=spark/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
# The keytab for the spark principal
hive.metastore.client.keytab=/home/disk1/smith/kerberos/spark.keytab
hive.hdfs.authentication.type=KERBEROS
hive.hdfs.trino.principal=spark/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
hive.hdfs.trino.keytab=/home/disk1/smith/kerberos/spark.keytab
hive.config.resources=/home/disk1/smith/tools/trino-server-405/etc/catalog/kerberos/hdfs-site.xml
Note that we additionally reference the hdfs-site.xml that ships with the EMR cluster. Normally this shouldn't be needed, but Alibaba Cloud's EMR appears to carry some extra, non-standard settings. If you don't set hive.config.resources, HDFS access will fail; the exact error is described under [Troubleshooting - org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block] below.
The EMR cluster's own hdfs-site.xml can be copied verbatim from /etc/taihao-apps/hadoop-conf/hdfs-site.xml.
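To verify the catalog, run a simple query through the Trino CLI; the server address below is an assumption for a local coordinator, and the table matches the example used later in this post:
./trino --server localhost:8080 --catalog hive
trino> SHOW SCHEMAS;
trino> SELECT * FROM hive.smith.hello LIMIT 10;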
StarRocks Configuration
Again using the Hive catalog as the example: simply place hdfs-site.xml, core-site.xml, and hive-site.xml into the corresponding conf directories of the FE and BE, as laid out below.
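Roughly, the layout looks like this (assuming a standard StarRocks deployment directory, referred to here as $STARROCKS_HOME; restart the FE and BE after placing the files):
$STARROCKS_HOME/fe/conf/hive-site.xml
$STARROCKS_HOME/fe/conf/core-site.xml
$STARROCKS_HOME/fe/conf/hdfs-site.xml
$STARROCKS_HOME/be/conf/core-site.xml
$STARROCKS_HOME/be/conf/hdfs-site.xml   # copied verbatim from the EMR cluster, see below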
FE
hive-site.xml is as follows.
Note that it uses hive, not the spark principal we kinit'ed on the ECS:
<configuration>
    <property>
        <name>hive.metastore.kerberos.principal</name>
        <value>hive/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM</value>
    </property>
    <property>
        <name>hive.metastore.sasl.enabled</name>
        <value>true</value>
    </property>
</configuration>
core-site.xml is as follows:
<configuration>
    <property>
        <name>hadoop.security.authentication</name>
        <value>KERBEROS</value>
    </property>
</configuration>
hdfs-site.xml is as follows.
Note that hdfs is configured here, not spark:
<configuration>
    <property>
        <name>dfs.datanode.kerberos.principal</name>
        <value>hdfs/_HOST@EMR.C-8120A41F6B0C443D.COM</value>
    </property>
    <property>
        <name>dfs.namenode.kerberos.principal</name>
        <value>hdfs/_HOST@EMR.C-8120A41F6B0C443D.COM</value>
    </property>
</configuration>
BE
The BE does not talk to Hive, so hive-site.xml is not needed.
core-site.xml is as follows:
<configuration>
    <property>
        <name>hadoop.security.authentication</name>
        <value>KERBEROS</value>
    </property>
</configuration>
The hdfs-site.xml here is more complex than the FE's, presumably because of Alibaba Cloud EMR's special settings; just copy the EMR cluster's own hdfs-site.xml directly. Otherwise you will hit the error described under [Troubleshooting - org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block].
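Once the FE and BE are restarted, you can create a Hive catalog and run a query to verify the setup; the catalog name below is just an example, and the metastore URI matches the one used in the Trino section:
CREATE EXTERNAL CATALOG hive_catalog
PROPERTIES (
    "type" = "hive",
    "hive.metastore.uris" = "thrift://master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com:9083"
);
SELECT * FROM hive_catalog.smith.hello LIMIT 10;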
Troubleshooting
How to get more Kerberos error details
Add -Dsun.security.krb5.debug=true to the JVM startup arguments; you will then see much more Kerberos authentication detail in the logs.
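For reference, a sketch of where the flag can go, assuming default Trino and StarRocks layouts:
# Trino: add a line to etc/jvm.config (one JVM option per line)
-Dsun.security.krb5.debug=true

# StarRocks FE: append the flag to the existing JAVA_OPTS line in fe/conf/fe.conf, e.g.
# JAVA_OPTS="<existing options> -Dsun.security.krb5.debug=true"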
Kerberos connection timeout: Receive timed out
You will see logs like the following:
2023-08-20 23:18:21,489 ERROR (starrocks-mysql-nio-pool-0|163) [TSaslTransport.open():307] SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) ~[jdk.security.jgss:?]
at org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:95) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:265) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:38) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.hadoop.hive.metastore.security.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:51) ~[hive-apache-3.1.2-13.jar:?]
at org.apache.hadoop.hive.metastore.security.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:48) ~[hive-apache-3.1.2-13.jar:?]
at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
at javax.security.auth.Subject.doAs(Subject.java:423) ~[?:?]
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) ~[hadoop-common-3.3.6.jar:?]
at org.apache.hadoop.hive.metastore.security.TUGIAssumingTransport.open(TUGIAssumingTransport.java:48) ~[hive-apache-3.1.2-13.jar:?]
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:528) ~[starrocks-fe.jar:?]
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:301) ~[starrocks-fe.jar:?]
at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:?]
at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:?]
at jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:?]
at java.lang.reflect.Constructor.newInstance(Constructor.java:490) ~[?:?]
at org.apache.hadoop.hive.metastore.utils.JavaUtils.newInstance(JavaUtils.java:84) ~[hive-apache-3.1.2-13.jar:?]
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:95) ~[hive-apache-3.1.2-13.jar:?]
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:148) ~[hive-apache-3.1.2-13.jar:?]
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:119) ~[hive-apache-3.1.2-13.jar:?]
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:112) ~[hive-apache-3.1.2-13.jar:?]
at com.starrocks.connector.hive.HiveMetaClient$RecyclableClient.<init>(HiveMetaClient.java:94) ~[starrocks-fe.jar:?]
at com.starrocks.connector.hive.HiveMetaClient$RecyclableClient.<init>(HiveMetaClient.java:83) ~[starrocks-fe.jar:?]
at com.starrocks.connector.hive.HiveMetaClient.getClient(HiveMetaClient.java:138) ~[starrocks-fe.jar:?]
at com.starrocks.connector.hive.HiveMetaClient.callRPC(HiveMetaClient.java:154) ~[starrocks-fe.jar:?]
at com.starrocks.connector.hive.HiveMetaClient.callRPC(HiveMetaClient.java:146) ~[starrocks-fe.jar:?]
at com.starrocks.connector.hive.HiveMetaClient.getDb(HiveMetaClient.java:232) ~[starrocks-fe.jar:?]
at com.starrocks.connector.hive.HiveMetastore.getDb(HiveMetastore.java:85) ~[starrocks-fe.jar:?]
at com.starrocks.connector.hive.CachingHiveMetastore.loadDb(CachingHiveMetastore.java:281) ~[starrocks-fe.jar:?]
at com.google.common.cache.CacheLoader$FunctionToCacheLoader.load(CacheLoader.java:169) ~[spark-dpp-1.0.0.jar:?]
at com.google.common.cache.CacheLoader$1.load(CacheLoader.java:192) ~[spark-dpp-1.0.0.jar:?]
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3570) ~[spark-dpp-1.0.0.jar:?]
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2312) ~[spark-dpp-1.0.0.jar:?]
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2189) ~[spark-dpp-1.0.0.jar:?]
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2079) ~[spark-dpp-1.0.0.jar:?]
at com.google.common.cache.LocalCache.get(LocalCache.java:4011) ~[spark-dpp-1.0.0.jar:?]
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4034) ~[spark-dpp-1.0.0.jar:?]
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010) ~[spark-dpp-1.0.0.jar:?]
at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:5017) ~[spark-dpp-1.0.0.jar:?]
at com.starrocks.connector.hive.CachingHiveMetastore.get(CachingHiveMetastore.java:522) ~[starrocks-fe.jar:?]
at com.starrocks.connector.hive.CachingHiveMetastore.getDb(CachingHiveMetastore.java:277) ~[starrocks-fe.jar:?]
at com.starrocks.connector.hive.CachingHiveMetastore.loadDb(CachingHiveMetastore.java:281) ~[starrocks-fe.jar:?]
at com.google.common.cache.CacheLoader$FunctionToCacheLoader.load(CacheLoader.java:169) ~[spark-dpp-1.0.0.jar:?]
at com.google.common.cache.CacheLoader$1.load(CacheLoader.java:192) ~[spark-dpp-1.0.0.jar:?]
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3570) ~[spark-dpp-1.0.0.jar:?]
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2312) ~[spark-dpp-1.0.0.jar:?]
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2189) ~[spark-dpp-1.0.0.jar:?]
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2079) ~[spark-dpp-1.0.0.jar:?]
at com.google.common.cache.LocalCache.get(LocalCache.java:4011) ~[spark-dpp-1.0.0.jar:?]
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4034) ~[spark-dpp-1.0.0.jar:?]
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010) ~[spark-dpp-1.0.0.jar:?]
at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:5017) ~[spark-dpp-1.0.0.jar:?]
at com.starrocks.connector.hive.CachingHiveMetastore.get(CachingHiveMetastore.java:522) ~[starrocks-fe.jar:?]
at com.starrocks.connector.hive.CachingHiveMetastore.getDb(CachingHiveMetastore.java:277) ~[starrocks-fe.jar:?]
at com.starrocks.connector.hive.HiveMetastoreOperations.getDb(HiveMetastoreOperations.java:142) ~[starrocks-fe.jar:?]
at com.starrocks.connector.hive.HiveMetadata.getDb(HiveMetadata.java:100) ~[starrocks-fe.jar:?]
at com.starrocks.server.MetadataMgr.lambda$getDb$1(MetadataMgr.java:149) ~[starrocks-fe.jar:?]
at java.util.Optional.map(Optional.java:265) ~[?:?]
at com.starrocks.server.MetadataMgr.getDb(MetadataMgr.java:149) ~[starrocks-fe.jar:?]
at com.starrocks.server.GlobalStateMgr.changeCatalogDb(GlobalStateMgr.java:3597) ~[starrocks-fe.jar:?]
at com.starrocks.mysql.MysqlProto.negotiate(MysqlProto.java:231) ~[starrocks-fe.jar:?]
at com.starrocks.mysql.nio.AcceptListener.lambda$handleEvent$1(AcceptListener.java:86) ~[starrocks-fe.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:834) ~[?:?]
Caused by: org.ietf.jgss.GSSException: No valid credentials provided (Mechanism level: Receive timed out)
at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:776) ~[java.security.jgss:?]
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:266) ~[java.security.jgss:?]
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:196) ~[java.security.jgss:?]
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192) ~[jdk.security.jgss:?]
... 64 more
Caused by: java.net.SocketTimeoutException: Receive timed out
at java.net.PlainDatagramSocketImpl.receive0(Native Method) ~[?:?]
at java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:181) ~[?:?]
at java.net.DatagramSocket.receive(DatagramSocket.java:814) ~[?:?]
at sun.security.krb5.internal.UDPClient.receive(NetClient.java:205) ~[java.security.jgss:?]
at sun.security.krb5.KdcComm$KdcCommunication.run(KdcComm.java:404) ~[java.security.jgss:?]
at sun.security.krb5.KdcComm$KdcCommunication.run(KdcComm.java:364) ~[java.security.jgss:?]
at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
at sun.security.krb5.KdcComm.send(KdcComm.java:348) ~[java.security.jgss:?]
at sun.security.krb5.KdcComm.sendIfPossible(KdcComm.java:253) ~[java.security.jgss:?]
at sun.security.krb5.KdcComm.send(KdcComm.java:229) ~[java.security.jgss:?]
at sun.security.krb5.KdcComm.send(KdcComm.java:200) ~[java.security.jgss:?]
at sun.security.krb5.KrbTgsReq.send(KrbTgsReq.java:246) ~[java.security.jgss:?]
at sun.security.krb5.KrbTgsReq.sendAndGetCreds(KrbTgsReq.java:261) ~[java.security.jgss:?]
at sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:308) ~[java.security.jgss:?]
at sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:126) ~[java.security.jgss:?]
at sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:458) ~[java.security.jgss:?]
at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:695) ~[java.security.jgss:?]
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:266) ~[java.security.jgss:?]
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:196) ~[java.security.jgss:?]
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192) ~[jdk.security.jgss:?]
... 64 more
The Kerberos debug log shows:
Using builtin default etypes for default_tgs_enctypes
default etypes for default_tgs_enctypes: 18 17 20 19 16 23.
>>> CksumType: sun.security.krb5.internal.crypto.RsaMd5CksumType
>>> EType: sun.security.krb5.internal.crypto.Aes256CtsHmacSha1EType
>>> KrbKdcReq send: kdc=master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com UDP:88, timeout=30000, number of retries =3, #bytes=907
>>> KDCCommunication: kdc=master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com UDP:88, timeout=30000,Attempt =1, #bytes=907
SocketTimeOutException with attempt: 1
>>> KDCCommunication: kdc=master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com UDP:88, timeout=30000,Attempt =2, #bytes=907
SocketTimeOutException with attempt: 2
>>> KDCCommunication: kdc=master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com UDP:88, timeout=30000,Attempt =3, #bytes=907
SocketTimeOutException with attempt: 3
>>> KrbKdcReq send: error trying master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com:88
java.net.SocketTimeoutException: Receive timed out
at java.base/java.net.PlainDatagramSocketImpl.receive0(Native Method)
at java.base/java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:181)
at java.base/java.net.DatagramSocket.receive(DatagramSocket.java:814)
at java.security.jgss/sun.security.krb5.internal.UDPClient.receive(NetClient.java:205)
at java.security.jgss/sun.security.krb5.KdcComm$KdcCommunication.run(KdcComm.java:404)
at java.security.jgss/sun.security.krb5.KdcComm$KdcCommunication.run(KdcComm.java:364)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.security.jgss/sun.security.krb5.KdcComm.send(KdcComm.java:348)
It looks as if the request to port 88 over UDP is timing out. After some digging, it turned out my security group did not allow UDP.
You can add the line udp_preference_limit = 1 under [libdefaults] in /etc/krb5.conf to force TCP instead.
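With that change, the [libdefaults] section from the earlier krb5.conf ends up looking like this:
[libdefaults]
    default_realm = EMR.C-8120A41F6B0C443D.COM
    udp_preference_limit = 1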
Trino: Failed connecting to Hive metastore
Running a query gives:
trino> select * from hive.smith.hello;
Query 20230820_002828_00000_7etqp failed: Failed connecting to Hive metastore: [master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com:9083]
However, the logs show that the Kerberos authentication actually succeeded.
Client Principal = spark/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
Server Principal = spark/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM
Session Key = EncryptionKey: keyType=18 keyBytes (hex dump)=
0000: 60 4E EE 46 9F 3E E1 24 B2 88 24 5A 43 34 49 A8 `N.F.>.$..$ZC4I.
0010: 8A 06 A9 C2 51 A9 DC EF D9 46 AB A3 78 F9 86 4C ....Q....F..x..L
Forwardable Ticket true
Forwarded Ticket false
Proxiable Ticket false
Proxy Ticket false
Postdated Ticket false
Renewable Ticket false
Initial Ticket false
Auth Time = Sun Aug 20 08:28:28 CST 2023
Start Time = Sun Aug 20 08:28:28 CST 2023
End Time = Mon Aug 21 08:28:28 CST 2023
Renew Till = null
Client Addresses Null
2023-08-20T08:28:37.940+0800 INFO Query-20230820_002828_00000_7etqp-147 stdout >>> KrbApReq: APOptions are 00100000 00000000 00000000 00000000
2023-08-20T08:28:37.941+0800 INFO Query-20230820_002828_00000_7etqp-147 stdout >>> EType: sun.security.krb5.internal.crypto.Aes256CtsHmacSha1EType
2023-08-20T08:28:37.942+0800 INFO Query-20230820_002828_00000_7etqp-147 stdout Krb5Context setting mySeqNumber to: 820609998
2023-08-20T08:28:37.942+0800 INFO Query-20230820_002828_00000_7etqp-147 stdout Created InitSecContextToken:
0000: 01 00 6E 82 03 11 30 82 03 0D A0 03 02 01 05 A1 ..n...0.........
0010: 03 02 01 0E A2 07 03 05 00 20 00 00 00 A3 82 01 ......... ......
0020: D3 61 82 01 CF 30 82 01 CB A0 03 02 01 05 A1 1C .a...0..........
0030: 1B 1A 45 4D 52 2E 43 2D 44 31 44 31 44 36 36 42 ..EMR.C-D1D1D66B
0040: 35 31 37 32 32 45 37 43 2E 43 4F 4D A2 51 30 4F 51722E7C.COM.Q0O
0050: A0 03 02 01 00 A1 48 30 46 1B 05 73 70 61 72 6B ......H0F..spark
0060: 1B 3D 6D 61 73 74 65 72 2D 31 2D 31 2E 63 2D 64 .=master-1-1.c-d
0070: 31 64 31 64 36 36 62 35 31 37 32 32 65 37 63 2E 1d1d66b51722e7c.
0080: 63 6E 2D 7A 68 61 6E 67 6A 69 61 6B 6F 75 2E 65 cn-zhangjiakou.e
0090: 6D 72 2E 61 6C 69 79 75 6E 63 73 2E 63 6F 6D A3 mr.aliyuncs.com.
00A0: 82 01 51 30 82 01 4D A0 03 02 01 12 A1 03 02 01 ..Q0..M.........
00B0: 02 A2 82 01 3F 04 82 01 3B 43 BD 58 AB 7C 93 A5 ....?...;C.X....
00C0: 15 E7 4C 36 F1 B0 10 DF 1E 3A 83 74 8B CB 7A ED ..L6.....:.t..z.
00D0: C3 01 E3 63 CF ED B6 B6 F7 E4 C2 84 B4 EC 85 A0 ...c............
00E0: 2C E3 01 94 38 6C AB 86 43 A4 3B 90 4E BF DE 7E ,...8l..C.;.N...
00F0: F5 02 86 A3 6A 96 E5 DC 0F 44 0A C4 B4 F1 E2 14 ....j....D......
0100: 03 B8 B5 85 57 17 FE D7 AC 45 82 10 FA 6E E2 A0 ....W....E...n..
0110: 4C FB 02 A0 2C 44 90 3B 9A 1A 3F F5 08 29 27 21 L...,D.;..?..)'!
0120: 26 8E 66 E4 A1 F5 89 CB 6C A9 9D A8 B9 9E F0 03 &.f.....l.......
0130: C1 22 5F 4A 08 77 F3 0E 63 4E DF 43 C6 5E 41 10 ."_J.w..cN.C.^A.
0140: 88 CC 11 0D 67 5B 4C CD C4 24 59 00 64 F0 1A 38 ....g[L..$Y.d..8
This is because hive.metastore.service.principal in the Trino Hive catalog is wrong: it should not be spark/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM; change it to the hive one, hive/master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com@EMR.C-8120A41F6B0C443D.COM.
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block xxx
Both StarRocks and Trino hit this error; the stack looks roughly like Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1115803670-172.26.95.71-1691556308511:blk_1073742076_1253 file=/user/hive/warehouse/smith.db/orc_map_late_bug_table/part-00001-bc8f96b0-17f3-46ec-903e-0c04b8fb2687-c000.snappy.orc.
It doesn't look like a Kerberos error. You only need to copy the hdfs-site.xml file over from the EMR cluster and use it; presumably Alibaba Cloud's EMR HDFS carries some unusual settings that cause this. The file has too many configuration items to audit one by one.
2023-08-21T11:07:12.876+0800 ERROR stage-scheduler io.trino.execution.StageStateMachine Stage 20230821_030658_00000_n6ytk.1 failed [5/4518]
io.trino.spi.TrinoException: Error opening Hive split hdfs://master-1-1.c-8120a41f6b0c443d.cn-zhangjiakou.emr.aliyuncs.com:9000/user/hive/warehouse/smith.db/or
c_map_late_bug_table/part-00001-bc8f96b0-17f3-46ec-903e-0c04b8fb2687-c000.snappy.orc (offset=0, length=793): Could not obtain block: BP-1115803670-172.26.95.71-169155
6308511:blk_1073742076_1253 file=/user/hive/warehouse/smith.db/orc_map_late_bug_table/part-00001-bc8f96b0-17f3-46ec-903e-0c04b8fb2687-c000.snappy.orc
at io.trino.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:474)
at io.trino.plugin.hive.orc.OrcPageSourceFactory.createPageSource(OrcPageSourceFactory.java:197)
at io.trino.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:291)
at io.trino.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:196)
at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:49)
at io.trino.split.PageSourceManager.createPageSource(PageSourceManager.java:62)
at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:308)
at io.trino.operator.Driver.processInternal(Driver.java:411)
at io.trino.operator.Driver.lambda$process$10(Driver.java:314)
at io.trino.operator.Driver.tryWithLock(Driver.java:706)
at io.trino.operator.Driver.process(Driver.java:306)
at io.trino.operator.Driver.processForDuration(Driver.java:277)
at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:752)
at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:164)
at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:519)
at io.trino.$gen.Trino_405____20230821_030529_2.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1115803670-172.26.95.71-1691556308511:blk_1073742076_1253 file=/user/hive/warehouse/smith.db/orc_map_late_bug_table/part-00001-bc8f96b0-17f3-46ec-903e-0c04b8fb2687-c000.snappy.orc
at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:879)
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:862)
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:841)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:567)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
at java.base/java.io.DataInputStream.read(DataInputStream.java:151)
at java.base/java.io.DataInputStream.read(DataInputStream.java:151)
at io.trino.hdfs.FSDataInputStreamTail.readTail(FSDataInputStreamTail.java:59)
at io.trino.filesystem.hdfs.HdfsInput.readTail(HdfsInput.java:56)
at io.trino.filesystem.TrinoInput.readTail(TrinoInput.java:46)
at io.trino.plugin.hive.orc.HdfsOrcDataSource.readTailInternal(HdfsOrcDataSource.java:66)
at io.trino.orc.AbstractOrcDataSource.readTail(AbstractOrcDataSource.java:93)
at io.trino.orc.OrcReader.wrapWithCacheIfTiny(OrcReader.java:325)
at io.trino.orc.OrcReader.createOrcReader(OrcReader.java:103)
at io.trino.orc.OrcReader.createOrcReader(OrcReader.java:94)
at io.trino.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:274)
... 18 more
Later, thanks to a tip from someone more experienced, it turns out adding just the following to hdfs-site.xml is enough:
<property>
    <name>dfs.data.transfer.protection</name>
    <value>integrity</value>
</property>
I haven't verified this myself since my environment is gone; give it a try if you need it.
Why not just copy the EMR cluster's hdfs-site.xml, core-site.xml and hive-site.xml over wholesale?
Most of the time that works too, but those files contain so many settings that it is hard to say they won't cause some unforeseen consequences for the system.
Original article by Smith. If you repost, please credit the source: https://www.inlighting.org/archives/trino-starrocks-emr-kerberos-setup