Alluxio worker throws "Failed to find any Kerberos tgt" in a Kerberos Hadoop environment

Description

  • When working with a Kerberos-secured HDFS environment, the following exception is thrown periodically:

    2018-11-13 11:01:47,993 WARN HdfsUnderFileSystem - 2 try to open hdfs://gdccluster/data/gamein/xyq_us/ods/ods_xyq_ach_show_day/dt=20171216/OdsXyqAchShow_20171216ACHSHOW-r-00000 : Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "gdc-worker15-uspresto.i.nease.net/10.191.58.46"; destination host is: "gdc-nn02-formal.i.nease.net":9000;
    2018-11-13 11:01:47,997 INFO RetryInvocationHandler - Exception while invoking getBlockLocations of class ClientNamenodeProtocolTranslatorPB over gdc-nn02-formal.i.nease.net/10.160.254.249:9000 after 8 fail over attempts. Trying to fail over immediately.
    java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "gdc-worker15-uspresto.i.nease.net/10.191.58.46"; destination host is: "gdc-nn02-formal.i.nease.net":9000;
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
    at org.apache.hadoop.ipc.Client.call(Client.java:1479)
    at org.apache.hadoop.ipc.Client.call(Client.java:1412)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy46.getBlockLocations(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:255)
    at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy47.getBlockLocations(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1226)
    at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1213)
    at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1201)
    at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:306)
    at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:272)
    at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:264)
    at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1526)
    at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
    at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
    at alluxio.underfs.hdfs.HdfsUnderFileSystem.open(HdfsUnderFileSystem.java:439)
    at alluxio.underfs.UnderFileSystemWithLogging$25.call(UnderFileSystemWithLogging.java:439)
    at alluxio.underfs.UnderFileSystemWithLogging$25.call(UnderFileSystemWithLogging.java:436)
    at alluxio.underfs.UnderFileSystemWithLogging.call(UnderFileSystemWithLogging.java:556)
    at alluxio.underfs.UnderFileSystemWithLogging.open(UnderFileSystemWithLogging.java:436)
    at alluxio.worker.block.UfsInputStreamManager.acquire(UfsInputStreamManager.java:252)
    at alluxio.worker.block.UfsInputStreamManager.acquire(UfsInputStreamManager.java:185)
    at alluxio.worker.block.UnderFileSystemBlockReader.updateUnderFileSystemInputStream(UnderFileSystemBlockReader.java:299)
    at alluxio.worker.block.UnderFileSystemBlockReader.init(UnderFileSystemBlockReader.java:133)
    at alluxio.worker.block.UnderFileSystemBlockReader.create(UnderFileSystemBlockReader.java:102)
    at alluxio.worker.block.UnderFileSystemBlockStore.getBlockReader(UnderFileSystemBlockStore.java:241)
    at alluxio.worker.block.DefaultBlockWorker.readUfsBlock(DefaultBlockWorker.java:438)
    at alluxio.worker.block.AsyncCacheRequestManager.cacheBlockFromUfs(AsyncCacheRequestManager.java:139)
    at alluxio.worker.block.AsyncCacheRequestManager.lambda$submitRequest$0(AsyncCacheRequestManager.java:94)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
    Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
    at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:737)
    at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
    at org.apache.hadoop.ipc.Client.call(Client.java:1451)
    ... 41 more
    Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
    at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
    at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:414)
    at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:560)
    at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:375)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:729)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:725)
    ... 44 more
    Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
    at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
    at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
    at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
    at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
    at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
    at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
    at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
    ... 53 more

  • From the exception it is obvious that the worker process tries to access files in HDFS but cannot find a Kerberos TGT.

  • This seems to be a known limitation according to the official v1.8 documentation (http://www.alluxio.org/docs/1.8/en/ufs/HDFS.html#running-alluxio-locally-with-hdfs), which states: "a known limitation is that the Kerberos TGT may expire after the max renewal lifetime. You can work around this by renewing the TGT periodically. Otherwise you may see No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt) when starting Alluxio services."

  • However, for me this workaround increases the maintenance cost:

  • You need to add a crontab entry to periodically refresh the Kerberos TGT, and you then have to monitor the running status of that crontab.

  • It is common for us to log onto the machine and kinit as another user to run some commands, which destroys the Kerberos TGT credential cache needed by the Alluxio worker.

  • IMO, a chore thread is needed to do the kinit/relogin job; `UserGroupInformation.loginUserFromKeytabAndReturnUGI`, which is commonly used by Hadoop/HBase, is recommended.

  • Any suggestions are appreciated.
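To make the proposal above concrete, here is a minimal sketch of what such a relogin chore thread could look like. The class name `KerberosReloginChore`, the constructor parameters, and the injected `Runnable` are all hypothetical; in a real worker, the injected action would wrap a Hadoop call such as `UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab()` (or a fresh `loginUserFromKeytabAndReturnUGI`), which is kept out of the sketch so the scheduling logic stands on its own.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of a background chore that periodically renews
 * Kerberos credentials. The relogin action is injected as a Runnable so
 * this example is self-contained; in Alluxio it would call into Hadoop's
 * UserGroupInformation keytab relogin methods.
 */
class KerberosReloginChore implements AutoCloseable {
  private final ScheduledExecutorService mExecutor =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "kerberos-relogin");
        t.setDaemon(true); // do not keep the worker JVM alive
        return t;
      });

  KerberosReloginChore(Runnable reloginAction, long periodMs) {
    mExecutor.scheduleWithFixedDelay(() -> {
      try {
        reloginAction.run();
      } catch (RuntimeException e) {
        // Swallow and retry on the next tick: an uncaught exception would
        // silently cancel all future runs of a scheduled task.
        System.err.println("Kerberos relogin failed: " + e.getMessage());
      }
    }, periodMs, periodMs, TimeUnit.MILLISECONDS);
  }

  @Override
  public void close() {
    mExecutor.shutdownNow();
  }
}
```

Because the relogin runs inside the worker process itself, it does not depend on an external crontab and is unaffected by someone running kinit as a different user in a login shell, since the worker would hold its own keytab-based credentials.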

Environment

  • OS version: Debian 8.9

  • Alluxio version: 1.7.1

  • UFS Hadoop version: CDH-5.6.0-Hadoop-2.6.0

Assignee

Gene Pang

Reporter

Shuang Li (louShang)

Priority

Major