Using Prometheus and Grafana to monitor a Nutanix Cluster.

Using a small Python script, we can liberate data from the “Analysis” page of Prism Element and send it to Prometheus, where we can combine cluster metrics with other data and view them all on some nice Grafana dashboards.

Output

Prism Analysis Page vs. Grafana Dashboard

Method

The method we will use here is to create a Python script which pulls stats from Prism Element via the REST API – and exposes them in a format that Prometheus can consume. The available metrics expose many interesting details – but are updated only every 20-30 seconds. This is enough to do trending, and fits nicely with a typical Prometheus scrape interval.
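As a sketch, the Prometheus side needs nothing more than a scrape job pointing at the exporter (the job name and target host here are placeholders, not from the original post):

```yaml
scrape_configs:
  - job_name: 'nutanix-entity-stats'
    scrape_interval: 30s                     # matches the 20-30s update cadence upstream
    static_configs:
      - targets: ['<exporter_host>:8000']    # wherever entity_stats.py is running
```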

The metrics are aggregated under four buckets: per VM, per Storage Container, per Host and cluster-wide.

To create the Per VM CPU panel, we divide the ppm metric by 10,000 to get a percentage – which is what we see on the Analysis page in Prism.
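Given the gauge and label naming used by the script below, the Grafana panel query would look something like this (a sketch, assuming the exporter's `vms` gauge with a `metric_name` label):

```promql
# Per-VM CPU usage as a percentage (ppm / 10,000)
vms{metric_name="hypervisor_cpu_usage_ppm"} / 10000
```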

Useful metrics

Within these groupings/aggregations, I have found the following metrics to be the most useful for monitoring resource usage on my test cluster. For CPU usage, the API returns what you would expect: for a VM we get the % of provisioned vCPU used, and for a host we get the % of physical CPU used.

Metric name                       Description
controller_num_iops               Overall IO rate per VM, container, host or cluster, per second
controller_io_bandwidth_kBps      Overall throughput per VM, container, host or cluster, in kilobytes per second
controller_avg_io_latency_usecs   Average IO response time per VM, container, host or cluster, in microseconds
hypervisor_cpu_usage_ppm          CPU usage in parts per million (divide by 10,000 to get %)

Some useful metrics
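To make the ppm conversion concrete, here is a minimal sketch of how a single entity's stats come back from the v1 API and how the CPU value maps to a percentage. The field names mirror those used by the script below; the values are invented for illustration:

```python
# Illustrative shape of one "vms" entity from the Prism v1 REST API.
# Field names match those used by the script below; values are made up.
entity = {
    "vmName": "test-vm-01",
    "stats": {
        "hypervisor_cpu_usage_ppm": "250000",   # parts per million, as a string
        "controller_num_iops": "1200",
    },
}

# ppm -> percent: 1,000,000 ppm == 100%, so divide by 10,000
cpu_percent = int(entity["stats"]["hypervisor_cpu_usage_ppm"]) / 10000
print(cpu_percent)  # 25.0
```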

Python Script

Run the Python code below and supply the IP of a CVM (or the cluster VIP), a username and a password. e.g. I save the code below as entity_stats.py, then point a Prometheus scraper at port 8000 on whatever machine the code is running. The heavy lifting is done by the prometheus_client Python module.

$ python ./entity_stats.py --vip <CVM_IP_ADDRESS> --username admin --password <password>
import requests
from requests.auth import HTTPBasicAuth
import prometheus_client
from prometheus_client import Gauge, start_http_server
import time
import argparse
import os

#Attempt to filter spurious response time values for very low IO rates
filter_spurious_response_times=True
spurious_iops_threshold=50
            
# Entity Centric groups the stats by entities (e.g. vms, containers, hosts) - counters are labels
def main():
    global username,password
    parser=argparse.ArgumentParser()
    parser.add_argument("-v","--vip",action="store",help="The Virtual IP for the cluster",default=os.environ.get("VIP"))
    parser.add_argument("-u","--username",action="store",help="The prism username",default=os.environ.get("PRISMUSER"))
    parser.add_argument("-p","--password",action="store",help="The prism password",default=os.environ.get("PRISMPASS"))
    args=parser.parse_args()
    vip=args.vip
    username=args.username
    password=args.password
    if not (vip and username and password):
        print("Need a vip, username and password")
        exit(1)
    
    check_prism_accessible(vip)
    #Instantiate the prometheus gauges to store metrics
    setup_prometheus_endpoint_entity_centric()
    #Start prometheus end-point on port 8000 after the Gauges are instantiated        
    start_http_server(8000)

    #Loop forever getting the metrics for all available entities (They may come and go)
    #then expose the metrics for those entities on prometheus exporter ready for scraping
    while(True):
        for family in ["containers","vms","hosts","clusters"]:
            entities=get_entity_names(vip,family)
            push_entity_centric_to_prometheus(family,entities)
        #The counters are meant for trending and are quite coarse
        #Tens of seconds is a reasonable scrape interval
        time.sleep(10)

def setup_prometheus_endpoint_entity_centric():
    #
    # Setup gauges for VMs, Hosts, Containers and the Cluster
    #
    global gVM,gHOST,gCTR,gCLUSTER
    gVM = Gauge('vms', 'Stats grouped by VM',labelnames=['vm_name','metric_name'])
    gHOST = Gauge('hosts', 'Stats grouped by Physical Host',labelnames=['host_name','metric_name'])
    gCTR = Gauge('containers', 'Stats grouped by Storage Container',labelnames=['container_name','metric_name'])
    gCLUSTER = Gauge('cluster','Stats grouped by cluster',labelnames=['cluster_name','metric_name'])


def push_entity_centric_to_prometheus(family,entities):

    if family == "vms":
        gGAUGE=gVM
    elif family == "containers":
        gGAUGE=gCTR
    elif family == "hosts":
        gGAUGE=gHOST
    elif family == "clusters":
        gGAUGE=gCLUSTER

    #Get data from the dictionary passed in and set the gauges
    for entity in entities:
        #Each family may use a different identifier for the entity name.
        if family == "vms":
            entity_name=entity["vmName"]
        else:
            entity_name=entity["name"]
        # Regardless of the family, the stats are always stored in a
        # structure called stats.  Within the stats structure the data
        # is laid out as Key:Value.  We just walk through and make a
        # prometheus gauge for whatever we find
        for metric_name in entity["stats"]:
            stat_value=entity["stats"][metric_name]
            if any(prefix in metric_name for prefix in ["controller","hypervisor","guest"]):
                print(entity_name,metric_name,stat_value)
                gid=gGAUGE.labels(entity_name,metric_name)
                gid.set(stat_value)
        #Overwrite value with -1 if the IO rate is below the spurious IO rate threshold.
        #This avoids misleading response times for entities that are doing very little IO.
        if filter_spurious_response_times:
            print("Suppressing spurious values - entity centric - family",entity_name,family)
            read_rate_iops=entity["stats"]["controller_num_read_iops"]
            write_rate_iops=entity["stats"]["controller_num_write_iops"]
            rw_rate_iops=entity["stats"]["controller_num_iops"]

            if (int(read_rate_iops)<spurious_iops_threshold):
                print("read iops too low, suppressing read response times")
                gGAUGE.labels(entity_name,"controller_avg_read_io_latency_usecs").set(-1)
            if (int(write_rate_iops)<spurious_iops_threshold):
                print("write iops too low, suppressing write response times")
                gGAUGE.labels(entity_name,"controller_avg_write_io_latency_usecs").set(-1)
            if (int(rw_rate_iops)<spurious_iops_threshold):
                print("RW iops too low, suppressing overall response times")
                gGAUGE.labels(entity_name,"controller_avg_io_latency_usecs").set(-1)

def get_entity_names(vip,family):
    requests.packages.urllib3.disable_warnings()
    v1_stats_url="https://"+vip+":9440/PrismGateway/services/rest/v1/"+family+"/"
    response=requests.get(v1_stats_url, auth=HTTPBasicAuth(username,password),verify=False)
    response.raise_for_status()
    result=response.json()
    entities=result["entities"]
    return entities

def check_prism_accessible(vip):
    #Check that prism is reachable and that the name resolves
    url="https://"+vip+":9440"

    status = None
    message = ''
    try:
        resp = requests.head(url, verify=False, timeout=10)
        status = str(resp.status_code)
    except requests.exceptions.ConnectionError as e:
        if ("[Errno 11001] getaddrinfo failed" in str(e) or     # Windows
            "[Errno -2] Name or service not known" in str(e) or # Linux
            "[Errno 8] nodename nor servname " in str(e)):      # OS X
            message = 'DNSLookupError'
        else:
            raise
    return url, status, message

if __name__ == '__main__':
    main()
