Friday, January 26, 2018

java.lang.RuntimeException: Failed to start Service "Cluster" (ServiceState=SERVICE_STOPPED, STATE_JOINING)

Oracle Access Manager 11.1.2.3

While trying to start the OAM Policy Manager managed server on the second node of a multi-node setup, the server itself starts but the application fails with a stack trace similar to the one below:

       at weblogic.application.utils.StateMachineDriver.nextState(StateMachineDriver.java:52)
       at weblogic.application.internal.BaseDeployment.activate(BaseDeployment.java:212)
       at weblogic.application.internal.EarDeployment.activate(EarDeployment.java:59)
       at weblogic.application.internal.DeploymentStateChecker.activate(DeploymentStateChecker.java:161)
       at weblogic.deploy.internal.targetserver.AppContainerInvoker.activate(AppContainerInvoker.java:80)
       at weblogic.deploy.internal.targetserver.BasicDeployment.activate(BasicDeployment.java:187)
       at weblogic.deploy.internal.targetserver.BasicDeployment.activateFromServerLifecycle(BasicDeployment.java:379)
       at weblogic.management.deploy.internal.DeploymentAdapter$1.doActivate(DeploymentAdapter.java:51)
       at weblogic.management.deploy.internal.DeploymentAdapter.activate(DeploymentAdapter.java:200)
       at weblogic.management.deploy.internal.AppTransition$2.transitionApp(AppTransition.java:30)
       at weblogic.management.deploy.internal.ConfiguredDeployments.transitionApps(ConfiguredDeployments.java:240)
       at weblogic.management.deploy.internal.ConfiguredDeployments.activate(ConfiguredDeployments.java:169)
       at weblogic.management.deploy.internal.ConfiguredDeployments.deploy(ConfiguredDeployments.java:123)
       at weblogic.management.deploy.internal.DeploymentServerService.resume(DeploymentServerService.java:180)
       at weblogic.management.deploy.internal.DeploymentServerService.start(DeploymentServerService.java:96)
       at weblogic.t3.srvr.SubsystemRequest.run(SubsystemRequest.java:64)
       at weblogic.work.ExecuteThread.execute(ExecuteThread.java:263)
       at weblogic.work.ExecuteThread.run(ExecuteThread.java:221)
Caused By: java.lang.RuntimeException: Failed to start Service "Cluster" (ServiceState=SERVICE_STOPPED, STATE_JOINING)
       at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.start(Service.CDB:38)
       at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.start(Grid.CDB:6)
       at com.tangosol.coherence.component.net.Cluster.onStart(Cluster.CDB:56)
       at com.tangosol.coherence.component.net.Cluster.start(Cluster.CDB:11)
       at com.tangosol.coherence.component.util.SafeCluster.startCluster(SafeCluster.CDB:3)
       at com.tangosol.coherence.component.util.SafeCluster.restartCluster(SafeCluster.CDB:10)
       at com.tangosol.coherence.component.util.SafeCluster.ensureRunningCluster(SafeCluster.CDB:31)
       at com.tangosol.coherence.component.util.SafeCluster.start(SafeCluster.CDB:2)
       at com.tangosol.net.CacheFactory.ensureCluster(CacheFactory.java:427)
       at com.tangosol.net.DefaultConfigurableCacheFactory.ensureServiceInternal(DefaultConfigurableCacheFactory.java:978)
       at com.tangosol.net.DefaultConfigurableCacheFactory.ensureService(DefaultConfigurableCacheFactory.java:947)
       at com.tangosol.net.DefaultConfigurableCacheFactory.ensureCache(DefaultConfigurableCacheFactory.java:929)
       at com.tangosol.net.DefaultConfigurableCacheFactory.configureCache(DefaultConfigurableCacheFactory.java:1306)

This happens because of incorrect Coherence cluster configuration in OAM. OAM stores its Coherence configuration in oam-config.xml, but it also accepts all of the parameters as server startup arguments (JVM system properties) prefixed with oam.coherence. In this example, the VM running OAM had more than one network interface attached (multiple IPs). The default Coherence port for OAM is 9095, and the OAM server and the Policy Manager server each form their own Coherence cluster, hence the port clash.
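To confirm the clash, you can check what is already listening on the default Coherence port on the affected node (a quick diagnostic; on Linux, -p needs root to show the owning process):

netstat -anp | grep 9095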

To fix this error, I changed the Coherence port on the Policy Manager managed servers to 15001 using -Doam.coherence.localport=15001 and restarted them, which solved the problem.
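One way to pass the property, assuming the managed servers are started through the domain scripts and that your Policy Manager servers are named oam_policy_mgr1 and oam_policy_mgr2 (adjust to your topology; you can equally set it under Server Start > Arguments in the Admin Console), is in $DOMAIN_HOME/bin/setDomainEnv.sh:

# Only the Policy Manager servers get the alternate Coherence port
if [ "${SERVER_NAME}" = "oam_policy_mgr1" ] || [ "${SERVER_NAME}" = "oam_policy_mgr2" ] ; then
  EXTRA_JAVA_PROPERTIES="${EXTRA_JAVA_PROPERTIES} -Doam.coherence.localport=15001"
  export EXTRA_JAVA_PROPERTIES
fi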

Another variant of this error seems to indicate an issue with cluster authentication, which can be solved by specifying all the IPs on the VMs as the authentication nodes, e.g.:

-Doam.coherence.auth.1=IP1
-Doam.coherence.auth.2=IP2
-Doam.coherence.auth.3=IP3
-Doam.coherence.auth.4=IP4

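For example, on a two-node setup where each VM has two interfaces (the IPs below are placeholders; substitute your own), the index simply increments across every address in the cluster:

-Doam.coherence.auth.1=10.10.13.21
-Doam.coherence.auth.2=10.10.113.21
-Doam.coherence.auth.3=10.10.13.22
-Doam.coherence.auth.4=10.10.113.22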
You can also add a range of auth addresses using the following syntax, where each numeric suffix pairs a from/to boundary for one range:
-Doam.coherence.auth.range.to.1=10.10.13.xx -Doam.coherence.auth.range.from.1=10.10.113.xx
-Doam.coherence.auth.range.to.2=10.12.13.xx -Doam.coherence.auth.range.from.2=10.12.113.xx