Ganglia HowTo

In Brief:
  • Many large AIX High Performance Computing (HPC) clusters use this excellent tool to monitor performance across large clusters of machines.
  • The data is displayed graphically on a website and includes configuration and performance statistics. Ganglia is also increasingly being used in commercial data centers to monitor large groups of machines.
  • Ganglia can also be used to monitor a group of logical partitions (LPARs) on a single machine - these just look like a cluster to Ganglia.
  • Ganglia is not limited to AIX, which makes it even more useful in heterogeneous computer rooms.
  • For more information go to the Ganglia home website at http://ganglia.sourceforge.net/
  • For the Ganglia for AIX and Linux on POWER binaries go to http://www.perzl.org/ganglia/
  • Briefly, a daemon runs on each node, machine or LPAR, and the data is collected by a further daemon and placed in an rrdtool database. Ganglia then uses PHP scripts on a web server to generate the graphs as directed by the user. There is also an on-going project to add POWER5 micro-partition statistics.

Contents

  1. Introduction to Ganglia
    1. Performance Monitoring in General
    2. Uses for Ganglia in Performance Monitoring
    3. Have you seen Ganglia yet?
    4. The components of Ganglia
  2. Ganglia Setup
    1. Before you start
    2. Setting up the simplest possible Ganglia with the two following nodes
    3. Larger Setup with groups of machines
  3. POWER5 additions
  4. Advanced topics
When the content is related to IBM POWER5 based machines running Advanced POWER Virtualization and Logical Partitions you should find this little logo.
Otherwise the content should apply to any machine running ganglia.

Performance Monitoring in General

Every systems administrator knows that they should be monitoring the performance of their machines to:

  • Check general machine and OS health,
  • Spot longer term trends,
  • Avoid the "hitting the performance wall" issues,
  • Identify bottlenecks etc.

The problems are also many:

  • If you manage one or two machines this is easy but if you are in charge of a few hundred then you just don't have the time.
  • Many monitoring tools provide "gallons" of statistics, far too many to deal with regularly, so what you need is a smaller number of stats and just the important ones.
  • You also rapidly get a data overload problem - with hundreds of servers capturing data files you need to manage all the files and data so that you can sort it and find it later.
  • Next you find that the different operating systems have different tools, with different stats and data displayed in different ways.

From the data centres that I visit, I find various approaches to this problem:

  • They have purchased a large cross platform performance monitoring suite. The downside is that these tools are usually expensive, have a performance hit and you are at the mercy of the vendor to update the tool.
  • They do nothing.
  • They capture raw data to the disk and only look at the data in post mortem style i.e. when it all goes horribly wrong.

Now as the developer of the nmon tool for detailed monitoring on AIX and Linux systems I have long been a fan of low level, high volume performance data, but I have come to understand that this is impractical for large numbers of machines and does not let you take the overall view of the computer room. To be honest, I was shocked when I first came across Ganglia. This was clearly the tool I had been looking for for a long time and was even thinking of writing myself. When I first got it running I was shocked again - the flexibility and ability to add new stats was amazing.

I have since shown my working version of Ganglia to many system administrators and the reaction is always exactly the same:

  • "Wow! That is really cool ... I think that is exactly what I want ... where do I get it?"

There is one problem. Getting started with Ganglia is quite hard work. The problem is that the designers' and developers' brains are too large and us regular guys struggle to understand the basic setup. The developers tend to be High Performance Computing (HPC) people running 50 to 2000+ nodes at universities. The Ganglia documents refer to distributed, scalable, multiple-resolution, network broadcast, XML protocol models - all very well, but how can I get it working quickly?

The rest of this article is for regular guys who just want to get it working and get the benefits - the theory can come later.

Uses for Ganglia in Performance Monitoring

In Ganglia we have a number of terms:

  • Node - a machine - typically small rack-mounted machines with 1, 2 or 4 CPUs, all essentially helping to do one job, task or calculation
  • Cluster - a group of nodes
  • GRID - a group of clusters

I see multiple uses of Ganglia:

  • Large scale clusters (i.e. what it was designed for) - this is the focus area of the developers and it works very well. In this case, each parallel computing infrastructure is seen as a Ganglia cluster and if you have more than one you can get Ganglia to view these as a GRID of clusters, so you can seamlessly see the performance stats of the whole Grid on the Ganglia website.
  • Just a bunch of machines - if you have a group of machines in your data centre with different purposes, sizes, speeds and applications then you can call all of these machines your cluster. Fortunately, Ganglia also displays the important machine information like hostname, number of CPUs and memory. Using this you can see both the machine details and the performance. So you can find the busy machines and then see what is going on. Hopefully, your machine names will make sense!
  • Data Centre(s) sprawl - if you have even more machines you may view them in different terms. You may split them into groups in terms of production, test and development. You may split them up in terms of geography like London, Glasgow, Swansea, Birmingham and Paris. Or you might have a functional split like external customers, admin, sales, human resources etc. Whatever grouping you decide on can be used with Ganglia. In Ganglia terms each of these groups is a cluster and the groups in total are your GRID.
  • Logical Partitions (LPAR) and Virtualization engines - this is a growing area in the computer industry and performance monitoring in this area is often forgotten. These operating system images are sharing a physical computer but you want to track which of them is taking what resources from the pool. In Ganglia terms each of these operating system images is a node and the machine as a whole is your Ganglia cluster.
            HPC                   Bunch of machines      Data center           Virtualization
Node        each machine          each machine           each machine          LPAR
Cluster     all the nodes used    all the nodes in       whatever grouping     the LPARs of a
            for a single task     the computer room      you decide            single machine
GRID        groups of clusters    other machine rooms    all the cluster       multiple machines
                                  or not used            groups as a whole

You will have to decide what makes sense for you. Below we will show, at "black box" level, what makes up Ganglia, then set up a tiny two machine cluster (bunch of machines style) that you can follow for practice and then a Virtualization example, which is easy once you have the basic understanding.

Have you seen Ganglia yet?

If not, now would be a good time to have a look. Fortunately some of the largest clusters in the world use Ganglia and have made the user interface public so you can see them from the Internet. Here is a good link:

  • University of California Berkeley Grid at http://monitor.millennium.berkeley.edu/
    • This is at the Grid level - you can select a cluster to see all the nodes and the overall cluster stats for CPU and memory
    • This simply lets you drill down to the cluster you are interested in
    • An example of a cluster within the Grid is the Nano Cluster - scroll down, find the Nano cluster and click on the name or the graph for Nano
  • You should end up at the Nano Cluster URL http://monitor.millennium.berkeley.edu/?c=Nano
    • At this level you can select the statistics for the whole cluster that you want to see for all nodes (at the bottom).
    • At the top you have:
      • the node and CPU counts (these nodes have 2 CPUs each)
      • the summary graphs for the cluster
      • the pie chart showing how much of the cluster is busy
      • a set of small graphs, one per node
    • Click on the "Physical View" top right and you will find for the nodes:
      • the total CPUs, memory and disk space
      • the fullest disk
      • each node has 2 GHz CPUs and 2 GB of memory
      • This is a good view to spot odd nodes or configurations
    • Go back to the "Full View" by clicking on it or on the Nano Cluster name
    • You can look at lots of different stats and configuration details
    • For example click on the Metric (default is load_one = the CPU load in the last 1 minute) and select "cpu_user".
    • Now the graphs show you the nodes User Utilisation numbers.
    • Select "machine_type" and you see they are powerpc machines - actually the IBM JS20 Blades.
    • Select "os_name" and then "os_release" and you see they are all Linux using the 2.6 kernel.
    • Now select the Last field and you will see you can view the graphs over the last hour, day, week, month and year
    • If you then select a node of the Nano cluster from the list, or just click on the n1 graph, you see the machine (node) details for node 1
  • You should end up at the n1 node URL at http://monitor.millennium.berkeley.edu/?c=Nano&h=n1
    • Now you see all the graphs for this one node.
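
The three levels map directly onto the URLs shown above: no parameters for the Grid, c= for a cluster and c= plus h= for a node. As a rough sketch (the function name ganglia_url is made up for this article, and no URL escaping is attempted, so names containing spaces would need extra work), the pattern can be written as:

```shell
# Build a Ganglia web front-end URL for the Grid, a cluster or one node.
# ganglia_url is a hypothetical helper, not part of Ganglia.
# Usage: ganglia_url BASE [CLUSTER [HOST]]
ganglia_url() {
  base=$1
  cluster=${2:-}
  host=${3:-}
  if [ -z "$cluster" ]; then
    echo "$base"                              # Grid level
  elif [ -z "$host" ]; then
    echo "${base}?c=${cluster}"               # cluster level
  else
    echo "${base}?c=${cluster}&h=${host}"     # node level
  fi
}

# The node URL for n1 in the Nano cluster, as used in the walk-through above
ganglia_url http://monitor.millennium.berkeley.edu/ Nano n1
```

This is handy for bookmarking or for linking straight to a node from other tools.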

Have a browse around. Once you understand the levels (Grid, cluster, node) it is relatively easy to work it all out.

Here are some screen dumps from a Grid of IBM pSeries machines using Ganglia to monitor Virtualization Engines (LPARs in IBM speak) on a few machines:

Below is the Grid View and the graphs are the summary of each cluster:

  • Cluster demo_p505 is the machine with virtualization
  • The "other" cluster is just a collection of older machines that I also want to monitor

Below is the Cluster View of cluster demo_p505 with the node graphs at the bottom

Below is the Physical View of the demo_p505 cluster showing the CPU and memory of each node

Below shows some of the different stats that can be shown and how you select the different time periods:

Below is a quick check of the Operating System and version configuration:

Note: in the above you can see the Virtual I/O Server, two copies of AIX and many copies of Linux running (Red Hat EL4, SUSE SLES 9 and Fedora4)

Below are the stats for network packets out, disk write and memory free (in that order):



Below is a look at the physical CPU use - this is a new stat added for POWER5 and shows how much of the real CPU time each LPAR is taking up

For POWER5 running AIX and Linux LPAR weight, SMT status, Entitlement, Capped status, kernel64bit status and others have also been added.
 

Below are the details for one individual node




Note: the above also shows some POWER5 only options but the bulk is standard Ganglia

The components of Ganglia

The components of Ganglia are as follows:

  • The data collector (G)
    • The daemon is a single file and called gmond (Ganglia MONitor Daemon!)
    • Its configuration file /etc/gmond.conf
    • This goes on each node
  • The data consolidator (G)
    • The daemon is a single file and called gmetad (Ganglia METAdata Daemon!)
    • Its configuration file /etc/gmetad.conf
    • You need one of these for each cluster. On massive clusters you can have more than one and a hierarchy.
    • This daemon collects the gmond data set via the network and saves it in an rrdtool database.
  • The database
    • Ganglia uses the well known and respected Open Source tool called rrdtool
  • The Web GUI tools (G)
    • These are a collection of PHP scripts started by the Webserver to extract the ganglia data and generate the graphs for the website
  • The web server with PHP
    • This could be any web server that supports PHP, SSL and XML
    • Everyone uses Apache2 - you are on your own if you use anything else!
  • Additional advanced tools (G)
    • gmetric to add extra stats - in fact anything you like, numbers or strings, with units etc.
    • gstat to get at the Ganglia data to do anything else you like

The parts that are marked up with (G) are part of Ganglia.
The other parts you have to get and install as pre-requisites namely Apache2, PHP and rrdtool - these may also have pre-requisites.
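
Before starting an install it can help to see which of these parts are already present on a machine. The following is a minimal sketch, not an official Ganglia check: the function name check_component is invented here, and the binary names httpd and php are assumptions (your web server may be called apache2, and binaries under /opt/freeware/sbin may not be on your PATH).

```shell
# Report whether a command can be found on the current PATH.
# check_component is a hypothetical helper, not part of Ganglia.
check_component() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
  fi
}

# Ganglia parts (G) and the pre-requisites; httpd and php are assumed names
for c in gmond gmetad gmetric gstat rrdtool httpd php; do
  check_component "$c"
done
```

A "missing" line only means the command is not on the PATH, so check the install directories by hand before concluding that a package is absent.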

The below diagram shows the connections:

In this diagram you should note the following features:

  • The left hand side shows the gmond daemon process running, one on each node of the cluster. This is configured by a single /etc/gmond.conf file on each node. So the installation on the nodes is very simple - just two files that are identical on each node (assuming it is the same hardware and OS). You also need to make sure the gmond process is started every time the machine is rebooted. The gmond.conf only needs three or four lines changed for each cluster, like the cluster name and where to forward the stats.
  • The top right hand side shows the more complicated central machine (which normally is one of the nodes in the cluster but does not have to be). On this machine the gmetad daemon process collects the performance stats and saves them to rrdtool databases. Again this is controlled by a single configuration file - in this case /etc/gmetad.conf. This only needs a couple of lines changed for each cluster too, and if it is to report a grid it needs one line of configuration to be able to find the other gmetad daemons and get the stats they hold.
  • The lower right hand side shows the website details. The user browses to the website and invokes the PHP scripts that fetch the data from the co-located rrdtool databases and dynamically generate the graphs you have seen above.
  • The setup of the right versions of Apache2 and with the right built-in features is the hardest part and depends on your operating system.

A word on suitable Web Servers:

  • On Linux this is really easy as most recent versions include Apache2 and PHP4 or PHP5 with the distribution, i.e. on the standard CDROM or on your network install server.
  • On the other UNIX platforms, you have to either check that Apache2 and PHP are available or compile them from the Open Source code directly. Fortunately, this is relatively simple - even for AIX!
  • For AIX, see the "AIX and Open Source" Wiki page for details on how to compile your own version of Apache2 and PHP with the required features for Ganglia. At the time of writing, I could find no downloadable version that would work with Ganglia for AIX.

Ganglia Setup

Before you start

It can be tricky if you change some things after you have Ganglia running. So before you start:

  • Make sure you are not going to change your hostnames. This is a given in production, so think about this mainly for prototype and test systems.
  • Make sure you are not going to change IP addresses.
  • Make sure the timezone, time and date are consistent on all machines in a cluster. The use of NTP is recommended.

Setting up the simplest possible Ganglia with the two following nodes

  • One Ganglia Client node with just the gmond data collector
  • One ganglia Server node with gmond, gmetad, rrdtool, Apache2 and PHP5

We will tackle this in three steps:

  • Install and setup of gmond on the Client node
  • Install and setup of gmetad on the Server node
  • Install and setup of Ganglia web "front end" PHP scripts on the Server node

Before you start I hope you have determined the type of configuration you want in terms of Grid and Cluster names. In this simple case we are going to ignore the Grid level so you only need a Cluster name. This can, in fact, be the hardest part.
For this worked example we are going to call the Cluster "serenity".

Note: there is documentation for Ganglia that can be found at http://ganglia.sourceforge.net/docs/ganglia.html

Install and setup of gmond on the Client node - simplest example

You now need the gmond package for your operating system and platform.
At the time of writing ganglia 3.0.3 is the latest available but you might be on a later release.
You are looking for something with a name like:

  • ganglia-gmond-3.0.3-<OS-and-platform>.rpm
    For example:
    Operating System          RPM Filename
    AIX 5.3                   ganglia-gmond-3.0.3-1.aix5.3.ppc.rpm
    SUSE on POWER machines    ganglia-gmond-3.0.3-1.suse.ppc64.rpm
    Red Hat on x86            ganglia-gmond-3.0.3-1.rhel4.x86_64.rpm

The Ganglia Download website is at http://ganglia.sourceforge.net/downloads.php or the AIX or Linux on POWER binaries at http://www.perzl.org/ganglia/

If you can't find the right download then you are going to have to recompile the Open Source code yourself but first check the end of this wiki page.

To install the gmond daemon/command use:

rpm -Uvh filename.rpm

This will
  • Install the gmond binary - usually in /usr/sbin or /opt/freeware/sbin
  • Create a /etc/gmond.conf default config file
  • Set your system to automatically restart gmond on reboot (on most systems) and
  • Start the gmond process.

You now need to edit the /etc/gmond.conf file, then kill and restart gmond.
The gmond command can be used to create the default gmond.conf file like this: gmond -t >/etc/gmond.conf
The file needs to be changed as follows. Change:

cluster {
  name = "unspecified"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}

to:
cluster {
  name = "serenity"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}
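
If you have many nodes you may prefer to script this edit. The following is a minimal sketch assuming the cluster section is laid out as shown above, with "cluster {" and the closing "}" each starting a line. The function name set_cluster_name is made up for this article, and the cluster name must not contain "/" or "&" because it is passed straight to sed. The demonstration works on a temporary copy; point it at /etc/gmond.conf only once you are happy with the result.

```shell
# Set the name inside the cluster { ... } section of a gmond.conf style file.
# set_cluster_name is a hypothetical helper. It writes to a temporary file and
# then moves it into place, because "sed -i" is not available on every platform.
set_cluster_name() {
  conf=$1
  name=$2
  tmp="${conf}.tmp.$$"
  sed "/^cluster[[:space:]]*{/,/^}/ s/^\([[:space:]]*name[[:space:]]*=[[:space:]]*\).*/\1\"${name}\"/" \
    "$conf" > "$tmp" && mv "$tmp" "$conf"
}

# Demonstration on a scratch copy of the default cluster section
demo=$(mktemp)
printf 'cluster {\n  name = "unspecified"\n  owner = "unspecified"\n}\n' > "$demo"
set_cluster_name "$demo" serenity
cat "$demo"
rm -f "$demo"
```

Only the name line inside the cluster section is changed; owner, latlong and url are left alone.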

Note for POWER5 additions - with Linux on POWER the gmond process needs access to /proc/ppc64/lparcfg but this is only readable by the root super user. You can either chmod or chown this pseudo file on every reboot, or stop gmond from giving up its root privileges by changing the /etc/gmond.conf file. In the "globals" section at the top, change:
setuid = yes

to
setuid = no

This will be using the defaults for the rest of the setup including the broadcast address which we will change in the more complex example.
You now need to kill and restart gmond for this new configuration file.
On most systems you will find the automatic control script at either:
  • /etc/init.d/gmond
  • /etc/rc.d/init.d/gmond

You can then restart gmond, for example: /etc/init.d/gmond restart
The options to this script are: start|stop|restart|status

That is all - sounds complex and takes time to explain but much simpler to do. In practice:

  • FTP the rpm file to the machine or better yet have some shared disk over NFS
  • Run the rpm command
  • Edit the gmond.conf file
  • Restart gmond

That comes to about 10 seconds per node and you could automate it as it is only a couple of files and they are identical on each node.
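
One cautious way to automate this is to generate the per-node commands first and review them before running anything. The sketch below only prints the commands. It assumes root ssh access to each node and an NFS directory /shared/ganglia holding the RPM and a prepared gmond.conf; both of these, the node names and the function name print_rollout are hypothetical.

```shell
# Print (do not run) the command that installs and configures gmond on each node.
# Usage: print_rollout RPM_FILE CONF_FILE NODE...
print_rollout() {
  rpm_file=$1
  conf_file=$2
  shift 2
  for node in "$@"; do
    echo "ssh root@${node} 'rpm -Uvh ${rpm_file} && cp ${conf_file} /etc/gmond.conf && /etc/init.d/gmond restart'"
  done
}

# Hypothetical shared paths and node names
print_rollout /shared/ganglia/ganglia-gmond-3.0.3-1.aix5.3.ppc.rpm \
              /shared/ganglia/gmond.conf node1 node2
```

Once the printed lines look right, they can be pasted into a terminal or piped to sh.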

To check it is still running use either:

  • /etc/init.d/gmond status
  • ps -ef | grep gmond
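
Note that a plain "ps -ef | grep gmond" usually also lists the grep command itself, which can make it look as if gmond is running when it is not. A small filter, sketched here under the assumption that your ps output has the command at the end of each line (the function name daemon_lines is invented for this article), removes that false hit:

```shell
# Filter "ps -ef" style text on stdin down to lines for one daemon,
# ignoring the grep process itself. Returns non-zero if nothing matches.
daemon_lines() {
  grep -w -- "$1" | grep -v -w grep
}

# Real use would be:  ps -ef | daemon_lines gmond
# Demonstration with made-up sample ps output:
printf '%s\n' \
  'nobody  101   1  0 10:00 ?     00:00:01 /usr/sbin/gmond' \
  'nigel   202 150  0 10:01 pts/0 00:00:00 grep gmond' |
  daemon_lines gmond
```

The same filter works for gmetad later on.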

Help about gmond options can be found using: gmond --help

Problem determination

If you find gmond fails to start, or is not running after you start it and check with ps (as above), then it is simple to start gmond in debug mode for more information. You need to run the actual gmond command (not the start up script in /etc/ ... ) in debug mode, which keeps it in the foreground - i.e. the output is to the screen and a control-C will halt gmond.

# gmond --debug=9
udp_recv_channel mcast_join=239.2.11.71 mcast_if=NULL port=8649 bind=239.2.11.71
tcp_accept_channel bind=NULL port=8649
udp_send_channel mcast_join=239.2.11.71 mcast_if=NULL host=NULL port=8649

        metric 'cpu_user' being collected now
        metric 'cpu_user' has value_threshold 1.000000
        metric 'cpu_system' being collected now
        metric 'cpu_system' has value_threshold 1.000000
        metric 'cpu_idle' being collected now
...
...
<Control-C>

In the case above it started normally. If there is a problem starting gmond it should be detailed in the output and gmond will stop. If you watch the gmond debug output for a minute or so, you will notice the gmond processes on all the nodes "chatting" to each other. This allows the auto-discovery of new nodes in a cluster to work but does suggest that on large clusters the default time between sending the performance information could be tuned to be less often.
One problem that has happened to me, and whose cause is difficult to determine, is error messages about "failing to create a multicast server". This is caused by not having a network gateway (default route) set up on your system. Check if you have a route with:
  • on Linux the "route" command
  • on AIX the "netstat -C" command
    and look in the output for a "default destination" line with Flags set to "UG". In a production environment a missing gateway is unlikely but in a quick test setup for Ganglia it is easy to forget to set the gateway.
    Don't use the "route -f" on AIX - you might think it would list full output but it actually flushes (drops) ALL default routes i.e. gateways - the exact opposite of what you want and may cause lots of network problems until you add back the default route i.e. gateway (been there, done that).
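
The check for a "UG" default route can be scripted. This is a rough sketch that reads routing table text on standard input, so it can be fed from whichever listing command your platform provides. It assumes the destination is shown as "default" or "0.0.0.0" in the first column and that the flags are in the third or fourth column; the function name has_default_gateway and the sample table are made up for illustration.

```shell
# Succeed if the routing table text on stdin contains a default route
# whose flags start with UG (route is Up and uses a Gateway).
has_default_gateway() {
  awk '($1 == "default" || $1 == "0.0.0.0") && ($3 ~ /^UG/ || $4 ~ /^UG/) { found = 1 }
       END { exit !found }'
}

# Demonstration with a made-up Linux style routing table
if has_default_gateway <<'EOF'
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.1.1     0.0.0.0         UG    0      0        0 eth0
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
EOF
then
  echo "default gateway found"
else
  echo "no default gateway - gmond may fail to create its multicast server"
fi
```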

Install and setup of gmetad on the Server node - simplest example

If the Server side is on a node of the cluster then you should, of course, install the gmond data collector on this node too. It is done exactly the same way as described above and in this simple example it is assumed that you will.

Now we need to set up the data management side followed by the Web Server. Just like the installation of the gmond daemon we need to locate the RPM file for the gmetad daemon. And you can probably guess what that file is going to be called too. Make sure it is the same version number as the gmond you are using. To install the gmetad daemon/command use: rpm -Uvh filename.rpm
This will

  • Install the gmetad binary - usually in /usr/sbin or /opt/freeware/sbin
  • Create a /etc/gmetad.conf default config file
  • Set your system to automatically restart gmetad on reboot (on most systems)
  • Start the gmetad process.

You may find that this rpm command will fail due to pre-requisites. This depends on whether you have previously installed other libraries and tools.
For my system I needed: rrdtool and libart_lgpl

There may be other pre-reqs for rrdtool but on my system these were already installed as Apache and PHP pre-reqs, including:

    • libpng-1.2.1-6.aix5.1.ppc.rpm
    • freetype2-2.1.7-2.aix5.1.ppc.rpm
    • zlib-1.2.2-4.aix5.1.ppc.rpm
    • perl 5.8.2

You now need to edit the /etc/gmetad.conf file, then kill and restart gmetad.
The file needs to be changed as follows. Find the section with comments about the data_source syntax and add the line:

data_source "serenity" localhost

The name "serenity" identifies the cluster whose data is to be saved on this machine and "localhost" means that this machine will hold a copy rather than getting the information that is stored in another gmetad database on a different machine. This will be covered more in the more complex example.
This will be using the defaults for the rest of the setup including the broadcast address which we will change in the more complex example, later on.
Now restart gmetad using the command: /etc/rc.d/init.d/gmetad restart (or /etc/rc.d/gmetad restart depending on your system).
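
If you script the server setup, the data_source line can be added in a repeatable way. The sketch below appends the line only when no data_source entry for that cluster name already starts a line in the file, so running it twice does no harm. The function name add_data_source is made up for this article, and the demonstration uses a scratch file rather than the real /etc/gmetad.conf.

```shell
# Append a data_source line for a cluster unless one is already present.
# Usage: add_data_source CONF_FILE CLUSTER_NAME HOST
add_data_source() {
  conf=$1
  cluster=$2
  host=$3
  if grep -q "^data_source \"${cluster}\"" "$conf"; then
    return 0    # already configured, nothing to do
  fi
  printf 'data_source "%s" %s\n' "$cluster" "$host" >> "$conf"
}

# Demonstration on a scratch file
demo=$(mktemp)
add_data_source "$demo" serenity localhost
add_data_source "$demo" serenity localhost    # second call changes nothing
cat "$demo"
rm -f "$demo"
```

Remember that gmetad still has to be restarted after the file changes.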

To confirm gmetad is still running use either:

  • /etc/init.d/gmetad status
  • ps -ef | grep gmetad
  • ps -ef | grep gmetad

Also take a look at where the daemon is saving the data in rrdtool databases. The directory is actually set in the /etc/gmetad.conf file but the default is /var/lib/ganglia/rrds. There should be a series of directories and files in here for each cluster, node and statistic, with some summaries too.
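
A quick way to confirm that data is arriving is to list the database files under that directory. This sketch assumes the files end in ".rrd", which is an assumption on my part rather than something stated above; the function name list_rrd_files and the scratch directory in the demonstration are also made up.

```shell
# List the rrdtool database files below a directory (default: gmetad's default).
list_rrd_files() {
  rrd_dir=${1:-/var/lib/ganglia/rrds}
  find "$rrd_dir" -type f -name '*.rrd' | sort
}

# Demonstration on a scratch tree shaped like cluster/node/statistic
demo=$(mktemp -d)
mkdir -p "$demo/serenity/node1"
: > "$demo/serenity/node1/cpu_user.rrd"
: > "$demo/serenity/node1/load_one.rrd"
list_rrd_files "$demo"
rm -rf "$demo"
```

On a live server, running list_rrd_files with no argument a minute or two after starting gmetad should show a growing list.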

Install and setup of Ganglia web "front end" PHP scripts on the Server node - simplest example

Warning: This section assumes you have a Web Server with PHP, SSL and XML support built-in.

You now need the front-end PHP scripts package, which is independent of operating system and platform.
At the time of writing ganglia 3.0.3 is the latest available but you might be on a later release.
You are looking for the file: ganglia-web-3.0.3-1.noarch.rpm

It can be found at the Ganglia Download website at http://ganglia.sourceforge.net/downloads.php or with the AIX or Linux on POWER binaries at http://www.perzl.org/ganglia/

Install the RPM with: rpm -Uvh ganglia-web-3.0.3-1.noarch.rpm
Now the bad news - this is installed at /var/www/html/ganglia
You must move this directory to the directory served by your web server.
This directory could be anywhere but popular examples are:

  • /usr/local/apache2/htdocs
  • /srv/www/htdocs
  • /webpages
    For Apache this directory is set in the httpd.conf configuration file in the line (for example):
    DocumentRoot "/usr/local/apache2/htdocs"

    The UNIX owner of the files in this directory should match the user and group set in the lines:
    User apache
    Group apache

You can rename the "ganglia" directory but we will retain this for this example and will assume your top level web server directory is /usr/local/apache2/htdocs
Copy the files and set the right owner with:

cp -R /var/www/html/ganglia /usr/local/apache2/htdocs
chown -R apache:apache /usr/local/apache2/htdocs/ganglia
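
If you are not sure which directory your web server serves, the DocumentRoot line can be pulled out of httpd.conf. The sketch below assumes a single uncommented DocumentRoot line with no spaces in the path; the function name document_root is invented here and the location of httpd.conf varies between systems, so the demonstration uses a scratch file.

```shell
# Print the first uncommented DocumentRoot path from an Apache config file.
document_root() {
  awk '$1 == "DocumentRoot" { gsub(/"/, "", $2); print $2; exit }' "$1"
}

# Demonstration on a scratch file holding the example lines from above
demo=$(mktemp)
printf '%s\n' \
  '# DocumentRoot: where the web pages live' \
  'DocumentRoot "/usr/local/apache2/htdocs"' \
  'User apache' \
  'Group apache' > "$demo"
document_root "$demo"
rm -f "$demo"
```

The printed path is the directory to use in place of /usr/local/apache2/htdocs in the cp and chown commands.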

Now point your browser at the ganglia scripts with the following URL:
  • http://<your-webserver-here>/ganglia

Problem determination

Problem 1) If the above URL does not work, your web server may not automatically find index.php files (as it normally finds index.html files when you leave the file name off the end of the URL), so try: http://<your-webserver-here>/ganglia/index.php

Problem 2) If naming the index.php file does not work, try creating a file called test.php in the ganglia directory with the contents:

<h1>PHP Test</h1>
<?PHP phpinfo() ?>

Make this file readable with:
chmod 755 /usr/local/apache2/htdocs/ganglia/test.php

Then try the following URL: http://<your-webserver-here>/ganglia/test.php
This should show you lots of PHP details.

Problem 3) If it does not work and you only get the words "PHP Test", or just the raw text content of the file, or it refuses with an error, then you probably do not have PHP support on your web server. Sorry, but adding PHP support to whatever software you run for your web server is beyond the scope of this article. For recent (last 2 years) Linux systems we can recommend the Apache that comes with your distribution as these all seem to have built in PHP support - or at least the ones I have tried, which are primarily SUSE SLES9 and Red Hat EL4. For the AIX platform, all we can offer is the instructions for using the latest Apache and PHP - this is best done by recompiling the source code and is not as hard as it sounds. Find the details on the "AIX and Open Source" Wiki page. For other platforms, you need to ask your vendor or start searching the Internet for a suitable download. It can be very hard to determine if a web server and PHP download has all the optional components required to support Ganglia without actually trying it. If you have success perhaps you can add to the list below:

  • AIX - best to recompile Apache and PHP, details on the "AIX and Open Source" Wiki page
  • SUSE SLES9 - Apache and PHP from the distribution works fine on x86 and POWER hardware.
  • Red Hat EL4 - Apache and PHP from the distribution works fine on x86 and POWER hardware.

Larger Setup with groups of machines

In this section we assume you have tried Ganglia or have worked through the above simple example. In the above, we accepted lots of the default settings for the Ganglia gmond and gmetad daemons to make the setup simple. In this section, we are only going to set a few extra options to allow multiple clusters. These clusters (as described above) could be grouping machines together for a number of purposes like:

  • They are all nodes of an HPC super computer working as a whole
  • They are different machine rooms or geographical groupings
  • They are functionally grouped together like web, database, admin, app servers
  • They are the logical partitions of a virtualized machine and share hardware.

Whatever the reason, the mechanics of setting these clusters up are the same. Actually deciding your clustering groups and their names is far harder than setting them up! Also note that Ganglia is a powerful tool and there are hundreds of options and possible ways of setting it up. We are going to cover here only what is necessary to have a single Ganglia Grid with multiple clusters containing multiple nodes.

WARNING:
There are network issues here that need to be understood. Ganglia by default uses network broadcast packets from the gmond daemon which are picked up by any listening gmetad daemon. This is for maximum flexibility and minimum setup. These packets are issued only every few seconds - roughly every 10 to 15 seconds. As Ganglia is designed to scale to thousands of machines this is unlikely to cause network bottlenecks, but you need to be aware this is happening and if you are not the network administrator then you need to discuss this with them. The default is to broadcast over UDP to the IP address 239.2.11.71 using port 8649. These can be changed but, beyond changing the IP address, this is not covered here in further detail.

Below is a diagram showing three clusters (green, blue and yellow) being available from one web server co-hosted with the yellow cluster:

Here we have four examples of clusters

  1. Yellow Cluster - with local nodes and supporting the front end user interface
    • This cluster is supporting the web server from which users can view the Ganglia data
    • It will show data for the locally supported node (yellow nodes) and the remotely supported blue and green clusters
    • The stats for the yellow nodes
  2. Yellow Cluster - without local nodes and supporting the front end user interface
    • As above but just running the front end web server, as it is not mandatory that there are locally supported nodes
  3. Blue Cluster - this is a group of nodes with no local data repository
    • The blue nodes will share their performance data and it is collected on one node
    • The gmetad daemon of the Yellow cluster collects this information and stores it
    • If the yellow cluster was not saving the stats the information would be lost
  4. Green Cluster - this is a group of nodes with local data repository
    • The green nodes will share their performance data and it is collected on one node
    • The gmetad daemon of this cluster collects this information and stores it
    • This means its data is independent of whether the yellow cluster saves the data (unlike the Blue cluster).

This shows some of the options for Ganglia setup - there are many more. In practice you would simplify things and not have one of each type. I suggest two typical setups:

  • Lots of green clusters and one yellow cluster with local nodes
  • Lots of blue clusters and one yellow cluster without local nodes

Note: make all the gmond.conf files the same in each cluster to make life simple.

So how to setup the Green Cluster?

This is the same as the simple example above but we don't need to setup the web server, PHP or the Ganglia PHP scripts. The gmond daemon on each node of the cluster broadcasts the performance and configuration information and the gmetad daemon for the cluster saves the data in the rrdtool database. This data is later sent on to the higher level gmetad with the web server. A simple way to get this to work effectively is to install the gmond and gmetad processes (just as before) and then make the following changes to the gmond.conf and gmetad.conf files. We want all the nodes to only send their data to the gmetad daemon for this cluster. There is no point in others knowing about or seeing these packets. This is controlled by the multicast parameters. In this example:

  • cluster name green
  • The gmetad is running on a machine with a host name of "green23" and also running a gmond daemon.

Change the cluster name = "unspecified" to the cluster name green and change each reference to 239.2.11.71 to green23 (or its IP address if you prefer). The top of the /etc/gmond.conf file will look like this:

/* This configuration is as close to 2.5.x default behavior as possible
   The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
  daemonize = yes
  setuid = yes
  user = nobody
  debug_level = 0
  max_udp_msg_len = 1472
  mute = no
  deaf = no
  host_dmax = 0 /*secs */
  cleanup_threshold = 300 /*secs */
  gexec = no
}

/* If a cluster attribute is specified, then all gmond hosts are wrapped inside
 * of a <CLUSTER> tag.  If you do not specify a cluster tag, then all <HOSTS> will
 * NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
  name = "green"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}

/* The host section describes attributes of the host, like the location */
host {
  location = "unspecified"
}

/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  mcast_join = green23
  port = 8649
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  port = 8649
  family = inet4
}
...

Notes:
  • Remember for POWER5 and Linux LPARS, you should have "setuid = no".
  • It is tempting to set the owner, latlong, url and location fields, but these will not normally appear on the resulting Ganglia website. There may be advanced settings to display this information but the writer has not found them yet! If you know how to display these then please add it here. The latlong field has been used to draw world maps showing the sites of Ganglia clusters. The other fields could be useful too, for example, knowing who to notify of a problem or how to find a failed machine in a large computer room.
  • Don't forget to update all the /etc/gmond.conf files on all the nodes.
  • Don't forget to restart all the gmond daemons on all the nodes.
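The last two notes are easy to get wrong by hand on a large cluster, so here is a minimal sketch of a helper that pushes one master gmond.conf to every node and restarts gmond there. It is only an illustration: the function name push_gmond_conf, the node list file, password-less scp/ssh access and the init script path /etc/rc.d/init.d/gmond are all assumptions, so adjust them to your own systems.

```shell
#!/bin/sh
# Sketch: copy one master gmond.conf to every node and restart gmond there.
# Assumptions (not from this HowTo): password-less scp/ssh to each node,
# a plain text node list with one hostname per line, and the init script
# path used in RESTART_GMOND below.
COPY_CMD=${COPY_CMD:-scp}
REMOTE_CMD=${REMOTE_CMD:-ssh}
RESTART_GMOND=${RESTART_GMOND:-/etc/rc.d/init.d/gmond restart}

push_gmond_conf() {
    conf=$1        # master copy of gmond.conf
    nodes=$2       # file listing the node hostnames
    failed=0
    for node in $(cat "$nodes"); do
        # Copy the file, then restart the daemon; report each node's result.
        if $COPY_CMD "$conf" "$node:/etc/gmond.conf" &&
           $REMOTE_CMD "$node" "$RESTART_GMOND" </dev/null; then
            echo "updated $node"
        else
            echo "FAILED $node" >&2
            failed=1
        fi
    done
    return $failed
}

# Example: push_gmond_conf ./gmond.conf.green ./green_nodes.txt
```

The function returns non-zero if any node could not be updated, so it can be used in a larger script to stop before you go looking for missing nodes on the website.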

For the machine or LPAR running gmetad you need to add a single data source line to /etc/gmetad.conf, so that it gathers the data and saves it in local rrdtool files, as below:

data_source "green" localhost

So how to setup the Blue Cluster?

This is very much like the Green cluster except there is no node running gmetad. Just select one node, for example "bigblue", replace the 239.2.11.71 with bigblue (or its IP address if you prefer) and change the cluster name to "blue". Change all the /etc/gmond.conf files and restart all the daemons. This node bigblue will forward the stats to the Yellow cluster when asked. Which node you select is not important; it just needs to be available.

So how to setup the Yellow Cluster?

This is very much like the Green cluster except the setup for the gmetad daemon machine is a bit special. Again change the /etc/gmond.conf file: set the name of the cluster to "yellow" and replace the 239.2.11.71 with the hostname of the node running gmetad (or its IP address if you prefer). Change all the /etc/gmond.conf files and restart all the daemons.

The details for /etc/gmetad.conf are a little more complex. Assume we already have the Blue and Green clusters running, we have local nodes in the Yellow cluster and we also want to display the other clusters. At the top of the /etc/gmetad.conf file we add the following lines:

data_source "yellow" localhost
data_source "blue" bigblue
data_source "green" green23

This directs gmetad to:
  • contact the local gmond daemon to get the stats for the yellow nodes
  • contact the bigblue node to get information about the blue nodes - as bigblue runs only a gmond daemon, the yellow gmetad will save this data locally in rrdtool
  • contact the green23 node to get information about the green cluster - as this has gmetad running and saving local data, yellow will collect summary stats from green23 and will ask green23 for more data if required for the front end website graphs.

In addition we want them to appear in one Grid called "Rainbow". Further down the /etc/gmetad.conf file we set the following line:

gridname "Rainbow"

Now the Ganglia website should display a grid called Rainbow and have three clusters of yellow, green and blue. If you drill down into one of the clusters you should see only the nodes of that cluster and the summaries of the cluster should reflect just the correct nodes.
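If a cluster or some of its nodes do not show up, it can help to look at the XML that the daemons hand out before suspecting the web server. The sketch below counts the hosts in each cluster from such an XML dump read on standard input. The function name hosts_per_cluster is made up for this example, and it assumes the usual Ganglia XML layout with one <CLUSTER NAME="..."> or <HOST NAME="..."> element per line.

```shell
#!/bin/sh
# Sketch: summarise the XML that a gmond or gmetad prints on its TCP port.
# Reads the XML on stdin and prints "<cluster> <number of hosts>" per cluster.
# Assumption: cluster and host elements start with <CLUSTER NAME="..." and
# <HOST NAME="..." and each sits on its own line.
hosts_per_cluster() {
    awk -F'"' '
        /<CLUSTER / { cluster = $2; count[cluster] = 0; order[++n] = cluster }
        /<HOST /    { if (cluster != "") count[cluster]++ }
        END         { for (i = 1; i <= n; i++) print order[i], count[order[i]] }
    '
}
```

For example, if netcat is installed (an assumption, telnet works too) then "nc bigblue 8649 | hosts_per_cluster" asks the bigblue gmond directly, and on the yellow gmetad machine "nc localhost 8651 | hosts_per_cluster" shows what gmetad itself has collected, assuming the default gmetad XML port of 8651 has not been changed.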

POWER5 additions

The POWER5 and POWER5+ machines from IBM can run AIX and Linux on POWER in logical partitions (LPARs) that are smaller than one CPU or use fractions of a CPU. The range is from 0.1 CPUs up to the maximum of 64 CPUs in increments of 0.01 of a CPU. These are called Micro-partitions or Shared Processor partitions. Additions have been made to Ganglia to support the statistics from these types of partition so that the LPARs form a Ganglia cluster. This makes Ganglia an ideal extra performance tool for monitoring such partitioned machines.

What are the new POWER5 stats that are available?

The following additional metrics are defined for AIX and Linux on POWER:

Ganglia stat name  Value                      Notes
capped             boolean 0=false, 1=true    false means Uncapped
cpu_entitlement    a number
cpu_in_lpar        a number                   Either the number of Dedicated CPUs or, for Shared CPU LPARs, the Virtual Processor number
cpu_in_machine     a number
cpu_in_pool        a number
cpu_pool_idle      a number
cpu_used           a number                   also called Physc or Physical Consumed in some tools
disk_read          a number
disk_write         a number
kernel64bit        boolean 0=false, 1=true
lpar               boolean 0=false, 1=true
lpar_name          a string
lpar_num           a number
oslevel            a string                   OS version and release in as much detail as possible
serial_num         a string                   Serial number of the machine
smt                boolean 0=false, 1=true    Simultaneous Multi Threading
splpar             boolean 0=false, 1=true    Shared Processor Logical Partition
weight             boolean 0=false, 1=true

The meaning of these new performance stats should be fairly obvious to experienced systems administrators familiar with POWER5 and Micro-partitions, except cpu_in_lpar, which is the Virtual Processor number on POWER5 and just the number of CPUs on POWER4 machines. The stats should also work on non-POWER5 machines; some details will clearly not be available, but they should be reported in a suitable way. If you are new to POWER5 and the Advanced POWER Virtualisation (APV) features then there are two excellent Redbooks to read up on the subject:

Redbook URL for Downloading the .pdf
Advanced POWER Virtualization on IBM System p5 http://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/sg247940.html
Advanced POWER Virtualization on IBM eServer p5 Servers: Architecture and Performance Considerations http://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/sg245768.html

Where do I get Ganglia?

The Ganglia download website has pre-built binaries for Fedora and Red Hat EL4 on two platforms: x86 (AMD and Intel) and ia64 (Itanium). Go to http://ganglia.sourceforge.net/downloads.php and take the "ganglia monitor core" link. The AIX and Linux on POWER binaries are at http://www.perzl.org/ganglia/ . If you have such a Linux system and want to run a quick test to learn Ganglia then these binaries are recommended, as Apache and PHP support come with the distribution.

Where do I get the Source code for Ganglia, Apache and PHP?

You can recompile the Ganglia daemons from the open source code found on the main Ganglia website, but it can be a little tricky as it needs quite a few supporting libraries and tools. Don't let me put you off. If you are a developer this can be done by downloading the code and then finding the latest versions of the required packages. It is simply a case of trying the ./configure and make commands and waiting for errors to tell you what is missing.

There are Ganglia instructions for compiling the daemons at http://ganglia.sourceforge.net/docs/ganglia.html

Many of the websites and hints for recompiling Apache 2 and PHP 5 are useful so check out the AIX and Open Source wiki page aixopen. You can get the POWER5 source code updates from http://www.perzl.org/ganglia/ . Note: the Ganglia front end PHP scripts are platform independent.

You could run just the gmond daemons on the POWER based machines (AIX and/or Linux) and the gmetad plus Apache + PHP on a x86 Linux based machine.

Where do I download the binaries for standard Ganglia or Ganglia for AIX and Linux on POWER with the POWER5 additions?

                         POWER AIX                 POWER Linux               x86 Linux                  Other platform or Operating System
Web Server Apache2+PHP5  See details at aixopen    with Linux Distro         with Linux Distro          Ask vendor
Ganglia FE PHP scripts   Platform independent set  Platform independent set  Platform independent set   Platform independent set
gmetad                   POWER5 RPMs               POWER5 RPMs               Ganglia download site (2)  Need to compile (2)
gmond                    POWER5 RPMs               POWER5 RPMs               Ganglia download site      Need to compile (2)
  • (2) If you want the web front end on non POWER machines (like a Linux/PC), just make sure you run the gmetad on POWER5 (to enable the POWER5 additions) and get the web front end gmetad to talk to the POWER5 gmetad for the data.

The POWER5 additions are not yet part of the standard Ganglia code. They are being offered to the Ganglia developers as updates and will hopefully be in the standard code when it is next released - possibly 3.1. The binary RPMs with POWER5 additions for gmond and gmetad are for:

  • AIX 5L v5.1
  • AIX 5L v5.2
  • AIX 5L v5.3
  • Linux SUSE SLES9 and SLES10 for POWER
  • Linux Red Hat EL AS 4 for POWER

Part of the design of Ganglia means you need a gmetad that requests these extra POWER5 stats from gmond, so you will need to run gmetad on POWER based AIX or Linux to be able to see the new POWER5 stats. This means the Ganglia web server normally also has to be on AIX or Linux on POWER. So in this case, you could run the gmond daemons, the gmetad daemon and Apache + PHP on the POWER based machines (AIX and/or Linux).

Advanced topics

Using gmetric to add more stats

You have your Ganglia cluster working nicely but you start thinking "I wish I could also monitor XYZ". Well, whatever XYZ is, if you can get a number or a string at the command line then you can add it to the Ganglia monitored data.
Examples for AIX might be:

  • Transaction rate of your database - this will depend on the database
  • Number of database users connected - this will depend on the database
  • Machine model - on AIX use: lsattr -El sys0 -a modelname -F value
  • Machine firmware level - on AIX use: lsattr -El sys0 -a fwversion -F value
  • Number of disks - on AIX use: lspv | wc -l

You can read the gmetric documentation at http://ganglia.sourceforge.net/docs/ganglia.html - it is near the bottom.

To add the firmware string:

gmetric --name firmware --value `lsattr -El sys0 -a fwversion -F value` --type "string"

To add the number of disks:
gmetric --name number_of_disks --value `lspv | wc -l` --type int32

To add the number of transactions, assuming you have a script called "transactions" that works this out and returns a number with a decimal point - you will have to write this yourself!
gmetric --name tpm --value `/usr/local/bin/transactions` --type double

The above will only save the statistics once. The firmware level is unlikely to change, the number of disks could change and the number of transactions per minute will definitely change. To keep these up to date, it is recommended to run the commands regularly (for example once every 60 seconds) via cron.
Seemingly by magic, these new stats or strings will appear on the Ganglia website. Find the machine involved and display all the data about the node, then click on the "Gmetrics" link - it was not obvious to me the first time that this is where the new data would appear! You may have to give it a minute or two for the values to appear.
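One way to do the regular run is to wrap the three gmetric examples in a small script and let cron call it every minute. This is only a sketch: the script path, the function names publish_metric and publish_all and the GMETRIC variable are made up for the example, and /usr/local/bin/transactions is still the script you have to write yourself.

```shell
#!/bin/sh
# Sketch of a script for cron to run every minute so the extra gmetric
# values stay current. The script name, path and function names are
# illustrative; only the gmetric options come from the examples above.
GMETRIC=${GMETRIC:-gmetric}

# publish_metric NAME TYPE VALUE: trim the value and skip it when empty,
# so a failed command does not publish a blank metric.
publish_metric() {
    value=$(printf '%s' "$3" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
    if [ -z "$value" ]; then
        echo "skipping $1: no value" >&2
        return 1
    fi
    $GMETRIC --name "$1" --value "$value" --type "$2"
}

publish_all() {
    publish_metric firmware string "$(lsattr -El sys0 -a fwversion -F value)"
    publish_metric number_of_disks int32 "$(lspv | wc -l)"
    publish_metric tpm double "$(/usr/local/bin/transactions)"
}

# Only publish on machines where gmetric is actually installed.
if command -v "$GMETRIC" >/dev/null 2>&1; then
    publish_all
fi

# Example crontab entry (hypothetical path):
# * * * * * /usr/local/bin/ganglia_extras.sh >/dev/null 2>&1
```

Trimming the value matters on AIX because wc -l pads its output with leading spaces, and skipping empty values means a missing command shows up as a gap in the graph rather than as a misleading number.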

Using gstat to extract data

The gstat tool reports information about your cluster and can be useful for determining a number of things.
For example, to check which hosts are up or dead just run: gstat

$ gstat
CLUSTER INFORMATION
       Name: demo_p505
      Hosts: 9
Gexec Hosts: 0
 Dead Hosts: 0
  Localtime: Wed Jun 21 17:51:05 2006

There are no hosts running gexec at this time

You can also get more information about the status with the --all option:
$ gstat --all --single_line
CLUSTER INFORMATION
       Name: demo_p505
      Hosts: 9
Gexec Hosts: 0
 Dead Hosts: 0
  Localtime: Wed Jun 21 17:56:29 2006

CLUSTER HOSTS
Hostname                     LOAD                       CPU              Gexec
 CPUs (Procs/Total) [     1,     5, 15min] [  User,  Nice, System, Idle, Wio]

daic4.aixncc.uk.ibm.com     4 (    0/   66) [  0.00,  0.00,  0.00] [   0.0,   0.0,   0.0,  99.9,   0.0] OFF
daic3.aixncc.uk.ibm.com     4 (    0/   82) [  0.00,  0.00,  0.00] [   0.0,   0.0,   0.1,  99.8,   0.1] OFF
daic2.aixncc.uk.ibm.com     4 (    0/   57) [  0.00,  0.00,  0.00] [   0.0,   0.0,   0.1,  99.9,   0.0] OFF
daivios1.aixncc.uk.ibm.com  4 (    0/   77) [  0.00,  0.00,  0.00] [   0.0,   0.0,   0.0,  99.9,   0.0] OFF
dainim.aixncc.uk.ibm.com    4 (    0/   77) [  0.00,  0.00,  0.00] [   0.0,   0.0,   0.0, 100.0,   0.0] OFF
dai6.aixncc.uk.ibm.com      4 (    0/   86) [  0.08,  0.03,  0.01] [   0.1,   0.0,   0.1,  99.5,   0.3] OFF
daic1.aixncc.uk.ibm.com     4 (    1/   60) [  1.00,  1.02,  1.09] [   0.0,   0.0,   0.0, 100.0,   0.0] OFF
daic5.aixncc.uk.ibm.com     2 (    0/   53) [  0.00,  0.00,  0.00] [   0.0,   0.0,   0.0,  99.9,   0.0] OFF
daivios.aixncc.uk.ibm.com   4 (    1/   74) [  2.01,  2.09,  1.94] [   0.1,   0.0,   0.6,  99.2,   0.0] OFF
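Because the gstat summary is plain text it is easy to use in your own scripts. As a small sketch, the two functions below (the names dead_hosts and check_cluster are invented here) read gstat output on standard input and warn when the "Dead Hosts" count is above zero. They rely only on the summary layout shown in the listings above.

```shell
#!/bin/sh
# Sketch: read gstat output on stdin and print the "Dead Hosts" count,
# relying only on the summary layout shown in the listings above.
dead_hosts() {
    awk -F: '/Dead Hosts/ { gsub(/[^0-9]/, "", $2); print $2; found = 1; exit }
             END { if (!found) exit 1 }'
}

# Print a one line verdict: return 0 if all hosts report, 1 if some are dead,
# 2 if the input did not look like gstat output at all.
check_cluster() {
    dead=$(dead_hosts) || { echo "no gstat summary found" >&2; return 2; }
    if [ "$dead" -gt 0 ]; then
        echo "WARNING: $dead dead host(s)"
        return 1
    fi
    echo "OK: all hosts reporting"
}

# Example: gstat | check_cluster
```

Run from cron, the non-zero return code can be used to trigger a mail or pager alert of your choice.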

The Cluster Summary Pie Chart

In the summary of the Cluster you are shown a pie chart for load_one (i.e. the CPU load over the last minute) for the various nodes in different colours. However, you might like to display some other number in the pie chart. To do this, go to the Ganglia web server directory. There you should find a file called conf.php - this file has all sorts of interesting options. The default_metric decides the stats used for the pie chart. For the POWER5 addition I wanted to have the new metric of cpu_used (this is the physical CPU used by the partition), so I changed

#
# Default metric
#
$default_metric = "load_one";

to
$default_metric = "cpu_used";

I have not tried this with a number that is not a percentage as it may fail or need other things changed.

Default sort order of the machines

You can change the default sort order to sorting by hostname so that the nodes always appear in the same order. This stops the order from changing depending on the statistics that you are looking at - which I find confusing. To achieve this edit get_context.php and change:

if (!$sort)
      $sort = "descending";

to
if (!$sort)
      $sort = "by hostname";

POWER5 Cross Partition/Whole Machine/CEC/Global LPAR View graphs - via Automated Add-on

This Add-on extracts the CPU use from the rrdtool databases in the /var/lib/ganglia/rrds directory.

With this POWER5 Add-On you have a new graph at the Ganglia Cluster (in our case the LPARs of one machine) of the added up CPU use and the size of the Shared CPU Pool.
An example is below:

You can see the dark blue line around this graph (this is the Zoomable Add-on) which means you can click on it to get a much larger and more detailed graph as below.

In the above graph you can see all the Logical Partitions (LPARs) on this two CPU pSeries p505 machine - this includes the following operating systems:

  • AIX 5.3 and AIX 6.1
  • Linux on POWER SUSE SLES 9, Fedora 5 and RedHat 4
  • Virtual I/O Server (called the p505ivm partition)

This is a crash and burn machine used for demonstrations. The "fake workload" is generated via the nstress ncpu programs on the red partition p505lpar9. The other workloads were started by hand to create a more interesting graph. You can see that the p505ivm LPAR (Virtual I/O Server) is in dark blue. Also note how little CPU time the mostly idle LPARs are taking - it is around 0.02% to 0.04% of a CPU. This is likely to be just the regular (100 times a second) timer interrupts, device drivers and daemons ticking over. At the top of the chart above, the number of CPUs in the Shared Pool is shown as a black line at the 2 CPU level. When using the Integrated Virtualization Manager (IVM) all CPUs are normally in the shared pool. We can also see that when the three LPARs are busy they take practically all the CPU time.

Where to get these Add-Ons?

Further cool additions - finer control of the graph times

By default Ganglia records enough data in the rrdtool database to draw last hour, day, week, month and year graphs. If you make a one line change to the /etc/gmetad.conf file to increase the data held (at the cost of a little more disk space), you can use the Calendar add-on to ask for graphs between any two dates and times. For example, if you find a peak from three weeks ago you can ask Ganglia to graph just that half day or even hour. See below for selecting the start or end date and time:

Further cool additions - custom graphs

Yes, it gets even better. Now you can specify what you want to graph: which stats, labels, dates, colours and more. This is a further add-on for custom graphs. See below for the options for specifying the graphs:

We have the CPU entitlement and the CPU actually used, plus the numbers of CPUs in the machine and pool (if different), for a particular peak a few days ago.

This generated the graph below - the graph details can be saved and used again later on.

POWER5 Cross Partition/Whole Machine/CEC/Global LPAR View graphs - via Manually written PHP scripts

As all the LPAR data is held in rrdtool databases in the /var/lib/ganglia/rrds directory of the gmetad and webserver machine, it is possible to extract the Physical CPU used in each Logical Partition (LPAR) of the machine to see the use of CPU power as a whole. The same goes for memory, disk and network stats.
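Before drawing anything it is worth checking what one of these rrd files actually holds. As a hedged sketch, the function below (rrd_average is an invented name) averages the values printed by the standard "rrdtool fetch" command. It assumes the usual fetch output of one "timestamp: value" row per sample and ignores rows that hold nan because no sample was stored.

```shell
#!/bin/sh
# Sketch: average the values printed by "rrdtool fetch", read from stdin.
# Data rows look like "1150908300: 1.2000000000e-01"; the header line and
# rows holding nan (no sample stored) are ignored.
rrd_average() {
    awk '$1 ~ /^[0-9]+:$/ && tolower($2) !~ /nan/ { sum += $2; n++ }
         END { if (n) printf "%.2f\n", sum / n; else print "nan" }'
}

# Example: average physical CPU used by one LPAR over the last hour
# rrdtool fetch /var/lib/ganglia/rrds/demo_p505/daic1.aixncc.uk.ibm.com/cpu_used.rrd \
#     AVERAGE --start end-1h | rrd_average
```

Running this for each LPAR directory in turn gives a quick text view of who is using the machine, which is a handy cross-check on the stacked graphs.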

This is still a work in progress but it is a start. Below are a few of the graphs generated so far, followed by an explanation of how this was done. Click on any thumbnail graph for a bigger version:

CPU Last Hour
CPU Last Day
CPU Last Week
CPU Last Month
CPU Last 3 Months
CPU Last year
Memory Free
Memory Total
Network In
Network Out
Disk Read
Disk Write
Run Queue

Each of these graphs is generated by a PHP script. Below is a sample one:

<?php
header("Content-type: image/gif");
passthru("/usr/bin/rrdtool graph - \
--title 'Global LPAR View for machine demo_p505 - Physical-CPU-Use for Last-Hour' \
--vertical-label 'Physical-CPUs' \
--start end-1h \
--width 800 \
--height 600 \
--lower-limit 0 \
DEF:LABEL1=/var/lib/ganglia/rrds/demo_p505/daivios.aixncc.uk.ibm.com/cpu_used.rrd:sum:AVERAGE \
AREA:LABEL1#0000FF:daivios \
DEF:LABEL2=/var/lib/ganglia/rrds/demo_p505/daic1.aixncc.uk.ibm.com/cpu_used.rrd:sum:AVERAGE \
STACK:LABEL2#00FF00:daic1 \
DEF:LABEL3=/var/lib/ganglia/rrds/demo_p505/daic11.aixncc.uk.ibm.com/cpu_used.rrd:sum:AVERAGE \
STACK:LABEL3#FF0000:daic11 \
DEF:LABEL4=/var/lib/ganglia/rrds/demo_p505/lpar9.aixncc.uk.ibm.com/cpu_used.rrd:sum:AVERAGE \
STACK:LABEL4#00FFFF:lpar9 \
DEF:LABEL5=/var/lib/ganglia/rrds/demo_p505/daic3.aixncc.uk.ibm.com/cpu_used.rrd:sum:AVERAGE \
STACK:LABEL5#FFFF00:daic3 \
DEF:LABEL6=/var/lib/ganglia/rrds/demo_p505/daic4.aixncc.uk.ibm.com/cpu_used.rrd:sum:AVERAGE \
STACK:LABEL6#FF00FF:daic4 \
DEF:LABEL7=/var/lib/ganglia/rrds/demo_p505/daic5.aixncc.uk.ibm.com/cpu_used.rrd:sum:AVERAGE \
STACK:LABEL7#000088:daic5 \
DEF:LABEL8=/var/lib/ganglia/rrds/demo_p505/dainim.aixncc.uk.ibm.com/cpu_used.rrd:sum:AVERAGE \
STACK:LABEL8#008800:dainim \
DEF:LABEL9=/var/lib/ganglia/rrds/demo_p505/dai6.aixncc.uk.ibm.com/cpu_used.rrd:sum:AVERAGE \
STACK:LABEL9#880000:dai6 \
DEF:LABEL10=/var/lib/ganglia/rrds/demo_p505/daivios1.aixncc.uk.ibm.com/cpu_used.rrd:sum:AVERAGE \
STACK:LABEL10#008888:daivios1 \
DEF:LABEL11=/var/lib/ganglia/rrds/demo_p505/lpar10.aixncc.uk.ibm.com/cpu_used.rrd:sum:AVERAGE \
STACK:LABEL11#888800:lpar10 \
DEF:LABEL12=/var/lib/ganglia/rrds/demo_p505/daivios.aixncc.uk.ibm.com/cpu_in_pool.rrd:sum:AVERAGE \
LINE3:LABEL12#FF0000:CPU_in_pool \
2>/tmp/err");
?>

These PHP scripts are generated from a small configuration file and a shell script.
The configuration file looks like this (called data505):
demo_p505
daivios.aixncc.uk.ibm.com daivios 0000FF
daic1.aixncc.uk.ibm.com daic1 00FF00
daic11.aixncc.uk.ibm.com daic11 FF0000
lpar9.aixncc.uk.ibm.com lpar9 00FFFF
daic3.aixncc.uk.ibm.com daic3 FFFF00
daic4.aixncc.uk.ibm.com daic4 FF00FF
daic5.aixncc.uk.ibm.com daic5 000088
dainim.aixncc.uk.ibm.com dainim 008800
dai6.aixncc.uk.ibm.com dai6 880000
daivios1.aixncc.uk.ibm.com daivios1 008888
lpar10.aixncc.uk.ibm.com lpar10 888800

Notes:
  1. First line is the cluster name as found in the directory in /var/lib/ganglia/rrds
  2. The rest of the lines are one per LPAR with:
    1. Hostname for the node as found in /var/lib/ganglia/rrds/<clustername>/ directory
    2. Short hand name you want on the graph
    3. Six digit Hexadecimal number for the colour
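The generator script reads this file with plain read loops, so a blank or malformed line ends up as a broken DEF entry in every generated PHP script. A small check before generating can save some head scratching. The sketch below is an illustration only (check_lpar_config is an invented name) and tests just the three rules in the notes above.

```shell
#!/bin/sh
# Sketch: sanity check a configuration file such as data505.
# Line 1 must hold only the cluster name; every other line must hold
# hostname, short name and a six digit hexadecimal colour. Blank lines are
# reported too, as a read loop would treat them as an LPAR with no name.
check_lpar_config() {
    awk '
        NR == 1 {
            if (NF != 1) { print "line 1: expected only the cluster name"; bad = 1 }
            next
        }
        NF != 3 {
            print "line " NR ": expected hostname, short name and colour"
            bad = 1
            next
        }
        length($3) != 6 || $3 ~ /[^0-9A-Fa-f]/ {
            print "line " NR ": colour " $3 " is not six hex digits"
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Example: check_lpar_config data505 && echo "data505 looks fine"
```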

The shell script is here (called create_global):

write_php()
{
title=$1
time=$2
period=$3
variable=$4
poolline=$5
units=$6

i=1
read machine

printf "<?php\n"
printf "header(\"Content-type: image/gif\");\n"
printf "passthru(\"/usr/bin/rrdtool graph - %c\n" '\'
printf -- "--title \'Global LPAR View for machine %s - %s for %s\' %c\n" $machine $title $time '\'
printf -- "--vertical-label \'%s\' %c\n" $units '\'
printf -- "--start %s %c\n" $period '\'
printf -- "--width 800 %c\n" '\'
printf -- "--height 600 %c\n" '\'
printf -- "--lower-limit 0 %c\n" '\'

# do the first line separately as it needs the AREA tag - other lines need STACK
read node1 name1 colour1
printf "DEF:LABEL%d=/var/lib/ganglia/rrds/%s/%s/%s.rrd:sum:AVERAGE %c\n" $i $machine $node1 $variable '\'
printf "AREA:LABEL%d#%s:%s %c\n" $i $colour1 $name1 '\'


while read node name colour
do
let i=i+1
printf "DEF:LABEL%d=/var/lib/ganglia/rrds/%s/%s/%s.rrd:sum:AVERAGE %c\n" $i $machine $node $variable '\'
printf "STACK:LABEL%d#%s:%s %c\n" $i $colour $name '\'
done

if [[ "$poolline" == "yes" ]]
then
let i=i+1
printf "DEF:LABEL%d=/var/lib/ganglia/rrds/%s/%s/cpu_in_pool.rrd:sum:AVERAGE %c\n" $i $machine $node1 '\'
printf "LINE3:LABEL%d#FF0000:CPU_in_pool %c\n" $i '\'
fi

printf -- "2>/tmp/err\");\n"
printf -- "?>\n"
}

# Main script here
input_file=$1
read machine  < $input_file
write_php "Physical-CPU-Use" "Last-Hour"    end-1h cpu_used yes "Physical-CPUs" <$1 >${machine}_hour.php
write_php "Physical-CPU-Use" "Last-Day"     end-1d cpu_used yes "Physical-CPUs" <$1 >${machine}_day.php
write_php "Physical-CPU-Use" "Last-Week"    end-1w cpu_used yes "Physical-CPUs" <$1 >${machine}_week.php
write_php "Physical-CPU-Use" "Last-Month"   end-1m cpu_used yes "Physical-CPUs" <$1 >${machine}_month.php
write_php "Physical-CPU-Use" "Last-Quarter" end-3m cpu_used yes "Physical-CPUs" <$1 >${machine}_quarter.php
write_php "Physical-CPU-Use" "Last-Year"    end-1y cpu_used yes "Physical-CPUs" <$1 >${machine}_year.php

write_php "CPU-Entitlement"  "Last-Day"    end-1d cpu_entitlement yes "Physical-CPUs" <$1 >${machine}_entitle.php

write_php "Memory-Free"      "Last-Day"    end-1d mem_free no "Bytes" <$1 >${machine}_mem_free.php
write_php "Memory-Total"     "Last-Day"    end-1d mem_total no "Bytes" <$1 >${machine}_mem_total.php

write_php "Network-In"       "Last-Day"    end-1d bytes_in no "Bytes" <$1 >${machine}_network_in.php
write_php "Network-Out"      "Last-Day"    end-1d bytes_out no "Bytes" <$1 >${machine}_network_out.php

write_php "Disk-Read"        "Last-Day"    end-1d disk_read no  "Bytes" <$1 >${machine}_disk_read.php
write_php "Disk-Write"       "Last-Day"    end-1d disk_write no "Bytes" <$1 >${machine}_disk_write.php

write_php "Run-Queue"        "Last-Day"    end-1d proc_run no "Processes" <$1 >${machine}_proc_run.php

The script is called as follows, from within a directory of the web server:
create_global data505

This generates the 14 PHP scripts. When the PHP scripts are accessed via the web browser, they generate the graphs on the fly. You might want to make this simple via a web page containing something like this:
<html>
<body>
<h1>Welcome to this Ganglia Cross Partition or Global LPAR View</h1>

CPU Graphs Over Time
<ol>
<li> <a href=demo_p505_hour.php>Last Hour</a>
<li> <a href=demo_p505_day.php>Last Day</a>
<li> <a href=demo_p505_week.php>Last Week</a>
<li> <a href=demo_p505_month.php>Last Month</a>
<li> <a href=demo_p505_quarter.php>Last Quarter</a>
<li> <a href=demo_p505_year.php>Last Year</a>
</ol>

For the Last Day only
<ol>
<li> <a href=demo_p505_entitle.php>Entitlement</a>
<li> <a href=demo_p505_proc_run.php>Run Queue</a>
<li> <a href=demo_p505_mem_total.php>Memory Total</a>
<li> <a href=demo_p505_mem_free.php>Memory Free</a>
<li> <a href=demo_p505_network_in.php>Network In</a>
<li> <a href=demo_p505_network_out.php>Network Out</a>
<li> <a href=demo_p505_disk_read.php>Disk Read</a>
<li> <a href=demo_p505_disk_write.php>Disk Write</a>
</ol>

</body>
</html>

Using unicast for multiple cluster configuration

The Ganglia webnode is an LPAR on the p550. We have two machines, a p505 and a p550, and the LPARs from each one should appear in a different cluster.
On the Ganglia webnode I used the following configuration for gmetad:

data_source "p550" localhost
data_source "p505" 172.28.255.203

And this gmond.conf:
cluster {
  name = "p550"
  owner = "Tomas Baublys"
  latlong = "unspecified"
  url = "unspecified"
}
#...
udp_send_channel {
 # The head node of the p550 cluster is the webnode itself
 host = 172.28.255.100
 port = 8666
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  port = 8666
}

On all p550 LPARs I used the same gmond.conf as above.
On the p505 cluster I chose one LPAR (172.28.255.203) to be the head node (running gmond only) with all the others sending information to it. I used this gmond.conf for all p505 LPARs:
cluster {
  name = "p505"
  owner = "Tomas Baublys"
  latlong = "unspecified"
  url = "unspecified"
}
#...
udp_send_channel {
host = 172.28.255.203
port = 8666
}

udp_recv_channel {
port = 8666
}

Larger detailed graphs via an enhanced Ganglia Web-Frontend script

People have noted that some Ganglia websites on the Internet allow you to click on the small Ganglia generated graphs to get a much larger and more detailed graph, and wondered how to get this on their own Ganglia system. The change is very simple to make and makes the graphs much more valuable.

Take this Link to Michael's Webpage with the details and download of the enhanced scripts.

Scenario: setting up Unicast configuration through Firewalls

Have a look at this Wiki page if you need to go for Unicast through a Firewall for more secure networks

Add your tips here ... please!

The postings on this site solely reflect the personal views of the authors and do not necessarily represent the views, positions, strategies or opinions of IBM or IBM management.

The rpm installation of the mentioned ganglia-web-3.0.3-1.noarch.rpm from http://ganglia.sourceforge.net/downloads.php, which is supposed to be platform independent, failed to install like this:
rpm -Uvh ganglia-web-3.0.3-1.noarch.rpm
package ganglia-web-3.0.3-1 is for a different operating system

Thus we used the PHP files from the Ganglia source as below, which works fine:
wget http://belnet.dl.sourceforge.net/sourceforge/ganglia/ganglia-3.0.3.tar.gz
gunzip ganglia-3.0.3.tar.gz; tar xvf ganglia-3.0.3.tar
mv ganglia-3.0.3/web /usr/local/apache2/htdocs/ganglia

Posted by gkuehnberger at Dec 17, 2006 18:38

To solve the ganglia-web-3.0.3-1.noarch.rpm install problem use:
rpm -Uvh --ignoreos ganglia-web-3.0.3-1.noarch.rpm

Posted by nagger at Apr 20, 2007 17:14

hi and thanks for this great tutorial. I am getting the following error trying to start gmetad

/etc/rc.d/init.d # gmetad start
exec(): 0509-036 Cannot load program gmetad because of the following errors:
0509-130 Symbol resolution failed for /opt/freeware/lib/librrd_th.a(librrd_th.so.2) because:
0509-136 Symbol art_free (number 112) is not exported from
dependent module /opt/freeware/lib/libart_lgpl_2.a(libart_lgpl_2.so.2).
0509-136 Symbol art_alloc (number 113) is not exported from
dependent module /opt/freeware/lib/libart_lgpl_2.a(libart_lgpl_2.so.2).
0509-026 System error: Error 0
0509-192 Examine .loader section symbols with the
'dump -Tv' command.

Can you please help me understand and resolve this issue.

Posted by latelatif at Mar 11, 2009 18:58

I resolved the above error by installing rrdtool-1.2.13-1.perl58.aix5.2.ppc.rpm from http://www.inet.hr/zmp/ibm/rrdtool/ instead of installing v1.2.30 from http://www.perzl.org/aix/index.php?n=Main.Rrdtool (which I had installed the first time around when I received the above error). I also installed libart_lgpl-devel although not sure if this was necessary.

My oslevel is 5300-07

ganglia-gmetad-3.0.7-1.aix5.3.ppc.rpm also seems to have an issue if I try to stop gmetad by using "gmetad stop". Only a restart command seems to work.

Also discovered that having a special character in the cluster name is not at all advisable. This is because, after starting, gmetad tries to create a directory under /var/lib/ganglia/rrds with the name of the cluster, and a special character will obviously fail for a directory name. I found this out by editing this line in /etc/gmetad.conf and enabling debug.

# debug_level 10

Also enabled setuid by editing this line and removing the comment in gmetad.conf

# setuid off
Posted by latelatif at Mar 11, 2009 19:42

Another tip

If you are changing the rrd_rootdir location in gmetad.conf, then don't forget to change the location in ${/your/app/server/root}/ganglia/conf.php as well.

The variable in conf.php is called $gmetad_root

Thanks for the great tutorial. We are up and running in production.

Posted by latelatif at Mar 13, 2009 12:44

Setting cpu_used for LPAR with dedicated processors in cluster view shows 100% CPU used. For LPARs with shared CPU pool problem doesn't exist. Other metrics in cluster view are displayed correctly.
I'm using ganglia 3.0.7 from Perzl.

Posted by wojtekr72 at Mar 30, 2009 13:20

Equipment :
p570 Power5 with 10 Gbit PCI-X2 interface ( AIX 5.3 TL07 SP03 )
Ganglia 3.0.7 from Perzl

Problem :
The value for the network transfer ( MB/sec ) is not shown correctly with 10 Gbit ( parameter "Network" ).
This became obvious when we compared it with nmon measurements.

The values shown do not exceed 1 Gbit performance ( 120 MB/sec ) but in reality > 250 MB/sec have been achieved.

Does anyone have an explanation?
Thanks in advance