Wednesday, 1 March 2017

Checking Infinispan cluster health and Kubernetes/OpenShift

Modern applications and microservices often need to expose their health status. A common example is Spring Boot Actuator, but there are many other ways of doing it. 

Starting from Infinispan 9.0.0.Beta2 we introduced the HealthCheck API. It is accessible in both Embedded and Client/Server mode. 

Cluster Health and Embedded Mode


The HealthCheck API can be obtained directly from the EmbeddedCacheManager and looks like this:

Health health = embeddedCacheManager.getHealth();

ClusterHealth clusterHealth = health.getClusterHealth();
clusterHealth.getNumberOfNodes();    // These two methods allow checking whether the
clusterHealth.getNodeNames();        // proper number of nodes joined the cluster
clusterHealth.getClusterName();      // Might be helpful for managing multiple clusters
clusterHealth.getHealthStatus();     // UNHEALTHY, HEALTHY or REBALANCING

HostInfo hostInfo = health.getHostInfo();
hostInfo.getNumberOfCpus();          // These three methods might be
hostInfo.getTotalMemoryKb();         // useful for dynamic cloud
hostInfo.getFreeMemoryInKb();        // environments

List<CacheHealth> cacheHealth = health.getCacheHealth();
cacheHealth.get(0).getStatus();      // UNHEALTHY, HEALTHY or REBALANCING
cacheHealth.get(0).getCacheName();   // Cache name
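
If you want to react to the health status programmatically, for example from a custom readiness endpoint, a minimal sketch could look like the one below (the ClusterReadiness helper is hypothetical, not part of Infinispan):

import java.util.List;

import org.infinispan.health.CacheHealth;
import org.infinispan.health.Health;
import org.infinispan.health.HealthStatus;
import org.infinispan.manager.EmbeddedCacheManager;

public class ClusterReadiness {

    // Returns true only when both the cluster and every cache report HEALTHY.
    public static boolean isReady(EmbeddedCacheManager cacheManager) {
        Health health = cacheManager.getHealth();
        if (health.getClusterHealth().getHealthStatus() != HealthStatus.HEALTHY) {
            return false;
        }
        List<CacheHealth> cacheHealths = health.getCacheHealth();
        return cacheHealths.stream()
              .allMatch(cacheHealth -> cacheHealth.getStatus() == HealthStatus.HEALTHY);
    }
}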

The nice thing about the API is that it is also exposed through JMX by default.
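
For example, here is a rough sketch of listing the health attributes over JMX from within the same JVM; the ObjectName pattern below is an assumption and may need to be adjusted to your cache manager's JMX domain and name:

import java.lang.management.ManagementFactory;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class JmxHealthDump {

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Assumed pattern for the health component of the cache manager; adjust to your setup.
        ObjectName pattern = new ObjectName("org.infinispan:type=CacheManager,component=CacheContainerHealth,*");
        for (ObjectName name : server.queryNames(pattern, null)) {
            // Print every attribute exposed by the matching MBean.
            for (MBeanAttributeInfo attribute : server.getMBeanInfo(name).getAttributes()) {
                System.out.println(attribute.getName() + " = " + server.getAttribute(name, attribute.getName()));
            }
        }
    }
}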


More information about using the HealthCheck API in Embedded Mode can be found here:

Cluster Health and Server Mode


Since Infinispan Server is based on WildFly, we decided to use the CLI as well as the built-in Management REST interface.

Here's an example of checking the status of a running server:

ispn-cli.sh -c "/subsystem=datagrid-infinispan/cache-container=clustered/health=HEALTH:read-resource(include-runtime=true)"
{
    "outcome" => "success",
    "result" => {
        "cache-health" => "HEALTHY",
        "cluster-health" => ["test"],
        "cluster-name" => "clustered",
        "free-memory" => 99958L,
        "log-tail" => [
            "2016-08-10 11:54:14,706 INFO [org.infinispan.server.endpoint] (MSC service thread 1-5) DGENDPT10001: HotRodServer listening on 127.0.0.1:11222",
            "2016-08-10 11:54:14,706 INFO [org.infinispan.server.endpoint] (MSC service thread 1-1) DGENDPT10001: MemcachedServer listening on 127.0.0.1:11211",
            "2016-08-10 11:54:14,785 INFO [org.jboss.as.clustering.infinispan] (MSC service thread 1-6) DGISPN0001: Started ___protobuf_metadata cache from clustered container",
            "2016-08-10 11:54:14,800 INFO [org.jboss.as.clustering.infinispan] (MSC service thread 1-6) DGISPN0001: Started ___script_cache cache from clustered container",
            "2016-08-10 11:54:15,159 INFO [org.jboss.as.clustering.infinispan] (MSC service thread 1-5) DGISPN0001: Started ___hotRodTopologyCache cache from clustered container",
            "2016-08-10 11:54:15,210 INFO [org.infinispan.rest.NettyRestServer] (MSC service thread 1-6) ISPN012003: REST server starting, listening on 127.0.0.1:8080",
            "2016-08-10 11:54:15,210 INFO [org.infinispan.server.endpoint] (MSC service thread 1-6) DGENDPT10002: REST mapped to /rest",
            "2016-08-10 11:54:15,306 INFO [org.jboss.as] (Controller Boot Thread) WFLYSRV0060: Http management interface listening on http://127.0.0.1:9990/management",
            "2016-08-10 11:54:15,307 INFO [org.jboss.as] (Controller Boot Thread) WFLYSRV0051: Admin console listening on http://127.0.0.1:9990",
            "2016-08-10 11:54:15,307 INFO [org.jboss.as] (Controller Boot Thread) WFLYSRV0025: Infinispan Server 9.0.0-SNAPSHOT (WildFly Core 2.2.0.CR9) started in 8681ms - Started 196 of 237 services (121 services are lazy, passive or on-demand)"
        ],
        "number-of-cpus" => 8,
        "number-of-nodes" => 1,
        "total-memory" => 235520L
    }
}

Querying the HealthCheck API using the Management REST interface is also very simple:

curl --digest -L -D - "http://localhost:9990/management/subsystem/datagrid-infinispan/cache-container/clustered/health/HEALTH?operation=resource&include-runtime=true&json.pretty=1" --header "Content-Type: application/json" -u ispnadmin:ispnadmin
HTTP/1.1 401 Unauthorized
Connection: keep-alive
WWW-Authenticate: Digest realm="ManagementRealm",domain="/management",nonce="AuZzFxz7uC4NMTQ3MDgyNTU1NTQ3OCfIJBHXVpPHPBdzGUy7Qts=",opaque="00000000000000000000000000000000",algorithm=MD5,qop="auth"
Content-Length: 77
Content-Type: text/html
Date: Wed, 10 Aug 2016 10:39:15 GMT

HTTP/1.1 200 OK
Connection: keep-alive
Authentication-Info: nextnonce="AuZzFxz7uC4NMTQ3MDgyNTU1NTQ3OCfIJBHXVpPHPBdzGUy7Qts=",qop="auth",rspauth="b518c3170e627bd732055c382ce5d970",cnonce="NGViOWM0NDY5OGJmNjY0MjcyOWE4NDkyZDU3YzNhYjY=",nc=00000001
Content-Type: application/json; charset=utf-8
Content-Length: 1927
Date: Wed, 10 Aug 2016 10:39:15 GMT

{
    "cache-health" : "HEALTHY",
    "cluster-health" : ["test", "HEALTHY"],
    "cluster-name" : "clustered",
    "free-memory" : 96778,
    "log-tail" : [
        "2016-08-10 11:54:14,706 INFO [org.infinispan.server.endpoint] (MSC service thread 1-5) DGENDPT10001: HotRodServer listening on 127.0.0.1:11222",
        "2016-08-10 11:54:14,706 INFO [org.infinispan.server.endpoint] (MSC service thread 1-1) DGENDPT10001: MemcachedServer listening on 127.0.0.1:11211",
        "2016-08-10 11:54:14,785 INFO [org.jboss.as.clustering.infinispan] (MSC service thread 1-6) DGISPN0001: Started ___protobuf_metadata cache from clustered container",
        "2016-08-10 11:54:14,800 INFO [org.jboss.as.clustering.infinispan] (MSC service thread 1-6) DGISPN0001: Started ___script_cache cache from clustered container",
        "2016-08-10 11:54:15,159 INFO [org.jboss.as.clustering.infinispan] (MSC service thread 1-5) DGISPN0001: Started ___hotRodTopologyCache cache from clustered container",
        "2016-08-10 11:54:15,210 INFO [org.infinispan.rest.NettyRestServer] (MSC service thread 1-6) ISPN012003: REST server starting, listening on 127.0.0.1:8080",
        "2016-08-10 11:54:15,210 INFO [org.infinispan.server.endpoint] (MSC service thread 1-6) DGENDPT10002: REST mapped to /rest",
        "2016-08-10 11:54:15,306 INFO [org.jboss.as] (Controller Boot Thread) WFLYSRV0060: Http management interface listening on http://127.0.0.1:9990/management",
        "2016-08-10 11:54:15,307 INFO [org.jboss.as] (Controller Boot Thread) WFLYSRV0051: Admin console listening on http://127.0.0.1:9990",
        "2016-08-10 11:54:15,307 INFO [org.jboss.as] (Controller Boot Thread) WFLYSRV0025: Infinispan Server 9.0.0-SNAPSHOT (WildFly Core 2.2.0.CR9) started in 8681ms - Started 196 of 237 services (121 services are lazy, passive or on-demand)"
    ],
    "number-of-cpus" : 8,
    "number-of-nodes" : 1,
    "total-memory" : 235520
}

Note that for the REST endpoint, you have to use proper credentials. 
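
If you prefer to query the same resource from Java instead of curl, a small sketch (host, port, and credentials taken from the example above) can rely on java.net.Authenticator, which answers the Digest challenge for HttpURLConnection:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.Authenticator;
import java.net.HttpURLConnection;
import java.net.PasswordAuthentication;
import java.net.URL;

public class ManagementHealthClient {

    public static void main(String[] args) throws Exception {
        // Replace with your own management user credentials.
        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                return new PasswordAuthentication("ispnadmin", "ispnadmin".toCharArray());
            }
        });

        URL url = new URL("http://localhost:9990/management/subsystem/datagrid-infinispan/"
              + "cache-container/clustered/health/HEALTH?operation=resource&include-runtime=true&json.pretty=1");
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();

        // Print the JSON health report returned by the management interface.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}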

More information about the HealthCheck API in Server Mode can be found here:

Cluster Health and Kubernetes/OpenShift


Monitoring cluster health is crucial for cloud platforms such as Kubernetes and OpenShift. These platforms use the concept of immutable Pods, which means that every time you need to change anything in your application (for example its configuration), you need to replace the old instances with new ones. There are several ways of doing that, but we highly recommend using Rolling Updates. We also recommend tuning the configuration and instructing Kubernetes/OpenShift to replace Pods one by one (I will show you an example in a moment). 

Our goal is to configure Kubernetes/OpenShift in such a way that each time a new Pod joins or leaves the cluster, a State Transfer is triggered. While data is being transferred between the nodes, the Readiness Probe needs to report failures and prevent Kubernetes/OpenShift from making progress in the Rolling Update procedure. Once the cluster is back in a stable state, Kubernetes/OpenShift can replace the next node. This process loops until all nodes have been replaced. 

Luckily, we introduced two scripts in our Docker image, which can be used out of the box for the Liveness and Readiness Probes:

  • /usr/local/bin/is_running.sh for the Liveness Probe
  • /usr/local/bin/is_healthy.sh for the Readiness Probe

At this point we are ready to put all the things together and assemble the DeploymentConfig:

- apiVersion: v1
  kind: DeploymentConfig
  metadata:
    name: transactions-repository-new
  spec:
    replicas: 3
    strategy:
      type: Rolling
      rollingParams:
        updatePeriodSeconds: 10
        intervalSeconds: 20
        timeoutSeconds: 600
        maxUnavailable: 1
        maxSurge: 1
    template:
      spec:
        containers:
        - args:
          - -Djboss.default.jgroups.stack=kubernetes
          image: jboss/infinispan-server:latest
          imagePullPolicy: Always
          name: infinispan-server
          ports:
          - containerPort: 8181
            protocol: TCP
          - containerPort: 8888
            protocol: TCP
          - containerPort: 9990
            protocol: TCP
          - containerPort: 11211
            protocol: TCP
          - containerPort: 11222
            protocol: TCP
          - containerPort: 57600
            protocol: TCP
          - containerPort: 7600
            protocol: TCP
          - containerPort: 8080
            protocol: TCP
          env:
          - name: OPENSHIFT_KUBE_PING_NAMESPACE
            valueFrom: {fieldRef: {apiVersion: v1, fieldPath: metadata.namespace}}
          terminationMessagePath: /dev/termination-log
          livenessProbe:
            exec:
              command:
              - /usr/local/bin/is_running.sh
            initialDelaySeconds: 10
            timeoutSeconds: 80
            periodSeconds: 60
            successThreshold: 1
            failureThreshold: 5
          readinessProbe:
            exec:
              command:
              - /usr/local/bin/is_healthy.sh
            initialDelaySeconds: 10
            timeoutSeconds: 40
            periodSeconds: 30
            successThreshold: 2
            failureThreshold: 5
        terminationGracePeriodSeconds: 90

Interesting parts of the configuration:
  • maxUnavailable and maxSurge: We allocate additional capacity for the Rolling Update and allow only one Pod to be down at a time. This ensures that Kubernetes/OpenShift replaces the nodes one by one.
  • terminationGracePeriodSeconds: Sometimes shutting a Pod down takes a little while. It is always better to wait until it terminates gracefully than to risk losing data.
  • livenessProbe: Note that a node which is transferring data might be highly occupied. It is wise to set a higher value for failureThreshold.
  • readinessProbe: The same rule applies here. The bigger the cluster, the higher the values of successThreshold and failureThreshold should be.
Feel free to check out other articles about deploying Infinispan on Kubernetes/OpenShift:
