Software Secret Weapons™


 
How To Monitor MYSQL Replication In Real Time With SNMP
by Pavel Simakov on 2007-05-07 16:12:53 under Linguine Watch, view comments
Bookmark and Share
 


Introduction
Many projects have successfully used MySQL master-slave replication feature. This type of replication is quite easy to setup and it works very well. Initially, I was skeptical of the replication and did not expect it to work reliably. I became more optimistic after reading some of Jeremy Zawodny’s presentations and his book "High performance MySQL" both showing that Yahoo! Finance team made replication work. I finally tried replication myself and it indeed works.

One big issue remained. In order to get a truly mission-critical database system one has to continuously monitor the state of both the master and the slave servers. In his writing Jeremy has hinted at the simple algorithm for monitoring replication status. Several simple scripts were written to explore his ideas. However, it was quite difficult for me to adapt these various scripts to my Java projects.

Finally I wrote MySQL replication monitor in Java. What is really cool is that I connected it with my Linguine Watch open-source monitoring project to allow monitoring of replication status with SNMP. This article describes how I have done it. Now, you can setup your MySQL master-slave replication monitoring with SNMP in five minutes!

General algorithm for monitoring
The basic algorithm for monitoring replication consists of the following steps:

  • write the timestamp into some dedicated table on the master
  • wait a bit for the replication to take place
  • read the timestamp value off the same dedicated table on the slave
  • subtract two timestamp values and see how much the slave is behind the master.
  • repeat the whole sequence

I created simple table _MASTERSTATUS on the master to store the timestamp and some other things. Here is the DDL for it:

 
CREATE TABLE `_masterstatus` (
  `ID` int(11) NOT NULL default '0',
  `SEQUENCE` int(11) default NULL,
  `UPDATED_ON` datetime default NULL,
  `MASTER_UID` int(11) default NULL,
  PRIMARY KEY  (`ID`),
  KEY `ID_idx` (`ID`)
);

There are several fields present in this table in addition to the timestamp. The absolute time can be very tricky to synchronize in the clustered environment and it should not be trusted. That is why I rely on the SEQUENCE number as an alternative. Instead of writing new timestamp to the master we first read the SEQUENCE number from the master. Then we increment its value by one and then write new value back into the same row on the master. Now we can wait for the replication and later read the SEQUENCE number off the slave and compare both values. We still put the timestamp into the UPDATED_ON field on the master, but for the convenience and not for the actual monitoring.

To figure our how much the slave is behind the master we multiply the difference in the SEQUENCE numbers by the frequency of the updates to the master. If the master has SEQUENCE number 10, the slave has 3 and master is updated once every 5 seconds, the slave is behind master by 7 * 5 = 35 seconds.

When the master-slave replication is configured a master is given a unique id. However, it is not possible to figure out the master's id while talking to the slave. To overcome this problem we need the MASTER_UID field. This field will help us to defend against the situation when one tries to accidentally monitor the master and the slave that actually do not work with each other. Now, every time the monitoring application starts it chooses a random number. This random number is written to the MASTER_UID field on all consecutive updates to the master. When reading the SEQUENCE number from the slave, we also check that the value of the MASTER_UID field matches with the master.

Purging binary logs
There is one annoying detail about MySQL master-slave replication that does not get mentioned frequently. For the replication to work, the binary logging must be enabled on the master. The binary logs take up large amount of disk space and must be cleaned up manually. MySQL replication monitor described in here takes care of purging binary logs as well.

After the slave status has been confirmed by checking SEQUENCE number and MASTER_UID, we determine the name of the current binary log file that is being processed by the slave. It can be obtained by issuing SHOW SLAVE STATUS statement on the slave (Appendix A). All prior binary logs have been already processed by the slave and can be purged from the master using PURGE MASTER LOGS TO statement.

Deployment diagram
The typical deployment diagram for the complete monitoring solution is presented below. It shows two MySQL servers: the master and the slave. Both servers are queried from the MySQL monitor I am describing here. The MySQL monitor contains a worker thread that issues periodic updates to the master. The monitor also checks how data is replicated from the master to the slave and purges the binary logs if everything works fine.

The MySQL monitor has built-in Java SNMP agent for reporting data to an SNMP monitoring station. A very good product OpManager from Adventnet can be used as a monitoring station. It has a free version that handles all aspects of SNMP monitoring (with alarms) for a limited number of hosts. Any other SNMP capable monitoring application will work as well.

Almost no programming skills are needed
If you are a Java programmer you can integrate this package into any of your existing applications. At the same time I keep forgetting that not all people in the world are programmers. If you are not a Java programmer, but still want to use MySQL replication monitor you can now do it without writing any custom code or compiling anything.

The MySQL monitor application can be started from the command line using the following batch file. Please properly set OY_LW_HOME and MYSQL_REPL_PROPS variables before use. Most of all possible configuration parameters can be set in a MYSQL_REPL_PROPS property file. So you don’t have to write any Java code to recompile anything. Here the sample properties file. The description of the parameters is in the Appendix B. The standard Java JDK 1.4+ is required.

See MySqlReplStatusMonitor in action
Start the master and the slave servers and MySqlReplTest monitoring application. Connect the OpManager to MySqlReplTest application via default SNMP port 161. Start some unit tests that generate moderate MySQL work. To test the replication progress I have set a maximum bin log file size to just 50K (max_binlog_size=50000), so I can force logs being purged frequently.

The real-time chart for the sample run is presented below. The same data can be seen on a periodic HTML report.

The red line shows how master SEQUENCE number is incremented every 5 seconds. The green line shows SEQUENCE number on the slave. Both follow each other closely. The blue line shows how much the slave is behind the master. It always stays at 0, confirming that the replication is always done within 5 seconds (in my test at least). The yellow line shows number of errors.

At around 120 seconds into the test I issue STOP SLAVE command on the slave server. The green line stops growing; new values for SEQUENCE number do not arrive to the slave. The blue line, the lag, increases correspondingly. As soon as I issue START SLAVE command at around 140 seconds slave immediately recovers and everything goes back to normal! There are no errors while replication is stopped; we are able to connect to both the slave and the master to check their statuses.

At around 180 seconds, I terminate the slave process and the number of errors and the lag start growing. The errors increase because we are not able to connect to slave anymore. Finally, at around 210 seconds, I terminate the master. The red line, that represents the master SEQUENCE number, stops growing. The number of errors goes up twice as fast.

Now we are ready to eliminate the last bit of manual work in the replication monitoring. Set up an email alarm inside the OpManager to trigger if:

  • the lag is over 12 (slave 1 minute behind), or
  • the number of errors is above 0.
We are done now!

Implementation of the MySQL monitor
The implementation consists of several Java classes. MySqlReplThread class is a thread that queries the master and the slave and purges the binary logs. MySqlReplCmd class handles details of SQL statements and manages database connections. In addition to the common SQL queries and updates, the following MySQL -specific statements are used (Appendix A):

  • SHOW MASTER STATUS;
  • SHOW SLAVE STATUS;
  • SHOW MASTER LOGS;
  • PURGE MASTER LOGS TO 'bin-log.XYZ';

MySqlReplContext class contains various options that control behavior of the package. MySqlReplManager class holds all of the other classes together. MySqlReplTest contains main() function and all the configuration code.

Following the ideas of the Linguine Watch framework for adding performance counters to large software applications, all the MySQL monitoring information is concentrated in one class MySqlReplStatusMonitor.

This class contains five individual performance counters including:

  • master_seq (master SEQUENCE number)
  • slave_seq (slave SEQUENCE number)
  • lag (the difference between two numbers above)
  • purge_to (last bin log number that was purged)
  • errors (number of SQL or other errors since the start time)

All performance counters are continuously updated by the MySqlReplThread as it communicates with the master and the slave. At the same time the performance counter values can be accessed directly in real-time by registering MySqlReplStatusMonitor with Linguine Watch built-in SNMP agent.

Implementation of SNMP agent
Linguine Watch is an open-source Java library for real-time monitoring of Java applications. It has a simple mechanism for publishing performance monitoring objects (like the MySqlReplStatusMonitor) and a built-in SNMP agent. Linguine Watch is free, simple, thread-safe, high-performance library that adds very little overhead. It is designed to be deployed with the application into production environment.

The SNMP defines a special MIB file format for listing counter definitions. The Linguine Watch generates MIB files automatically thus saving you from the need to learn MIB file format and writing the file by hand. The generated MIB file for the MySqlReplStatusMonitor is here.

Putting it all together
Finally we have to combine together the Linguine Watch SNMP agent and the MySQL replication monitor into MySqlReplTest application. The simple versions of this process are discussed in Tutorial 1 and Tutorial 2. But the MySqlReplTest example is more complex because both the SNMP agent and the MySqlReplManager have threads and require a coordination between their start() and stop() operations.

The issues here are similar to the issues of any Java or non-Java package where multiple dependent services must be started and stopped in a coordinated way. I do not want to pollute this article with the deep discussion of this and I will discuss this topic elsewhere.

To make long story short, the following adapted excerpt from MySqlReplTest shows the correct order for initialization and finalization of the Linguine Watch SNMP agent and MySqlReplManager. Notice how finalization sequence is in the reverse order to initialization.

 
public static void main(String  args){

MySqlReplManager man; LinguineSNMPWatch watch;

try { // create, customize, start watch watch = new LinguineSNMPWatch(); init(watch); watch.start(); try { // create, customize MySQLReplManager man = new MySqlReplManager(); init(man); // after watch has started we can now add counters watch.getPerfAgent().addMonitor(man.getMonitor()); try { // after counters have been added we can generate SNMP MIB file watch.getSNMPAgent().generateMIBFile("c:/test.mib");

// start MySQLReplManager threads man.start(); try {

// enter main message loop for your application here // do not return until application needs to exit ...

} finally{ // stop MySQLReplManager man.stop(); }

} finally { // remove monitor watch.getPerfAgent().removeMonitor(man.getMonitor()); } } finally { // stop watch last watch.stop(); } } catch(Exception e){ e.printStackTrace(); } }

Conslusion
If you think that only big boys like Microsoft, Oracle, IBM, Sun, or Cisco can afford built-in software monitoring, you are wrong. It is quite simple; you just have to go for it!

Appendix A
MySQL statements used in managing master-slave replication.

 
mysql> SHOW MASTER STATUS;
+------------------+----------+--------------+------------------+
| File             | Position | Binlog_do_db | Binlog_ignore_db |
+------------------+----------+--------------+------------------+
| my-bin-log.1666  | 678      |              |                  |
+------------------+----------+--------------+------------------+
1 row in set (0.00 sec)



mysql> PURGE MASTER LOGS TO 'my-bin-log.1580';
Query OK, 0 rows affected (0.01 sec)



mysql> SHOW SLAVE STATUS;
+-------------+-------------+-------------+---------------+------------------+...
| Master_Host | Master_User | Master_Port | Connect_retry | Master_Log_File  |...
+-------------+-------------+-------------+---------------+------------------+...
| 127.0.0.1   | admin       | 3011        | 60            | my-bin-log.1666  |...
+-------------+-------------+-------------+---------------+------------------+...
1 row in set (0.00 sec)


mysql> SHOW MASTER LOGS;
+------------------+
| Log_name         |
+------------------+
| my-bin-log.1663  |
| my-bin-log.1664  |
| my-bin-log.1665  |
| my-bin-log.1666  |
+------------------+
4 rows in set (0.00 sec)

Appendix B
Configuration parameters can be changed programmatically or via property file properties file.

Following configuration parameters can be set for MySqlReplContext:

 
	// the name of the housekeeping database that will store timestamp and sequence
	// numbers used to track replication status
	private String dbName = "MASTER";
	
	// the name of the housekeeping database
	private String tableName = "_MASTERSTATUS";
	
	// the table type for the CREATE TABLE command 
	private String tableCreatePostfix = "type = InnoDB";
		
	// change master timestamp and sequence number and check slave status
	// every this number of milliseconds
	private int checkDelayMillis = 5000;
	
	// purge master logs after confirming that slave is up to date
	// every this number of milliseconds
	private int purgeDelayMillis = 5 * 60 + 1000;
	
	// if set to true will purge binary logs once confirmed with slave
	// that they have been processed; is set to false will only report replication
	// status, increment master status, but will not purge bin logs
	private boolean canPurgeBinLogs = true;

Following configuration parameters can be set for MySqlReplContext.MySqlReplConnectionPool:

 
	// connection string to the master database
	private String masterConnStr;
		
	// connection string to the slave database
	private String slaveConnStr;

References

  1. Managing MySQL Replication by Jeremy Zawodny.
  2. Monitoring MySQL Replication Slaves by Christian G. Warden.

Comments (12)

  • Comment by Salvatore — July 30, 2007 @ 4:31 am

    Hi

    First of all congratulations for your very useful tutorial. I’m writing because I have an idea that I would discuss with you. I’m working for an italian radio company and we are building an IT infrastructure totally based on open source technology. So, I’m building a MySQL monitor starting from your tool.

    My work goals are:

    1) Configurable monitor for a generic MySQL replication topology;
    2) “Special agent” that, in case of master node fault, switch one of the slave in master and reorganize all the topology;
    3) Integration of some advising system (email,sms,etc) to alert if something happens.

    Are you interested in this kind of tool?
    I would to publish it under GPL licence, and I just would know your opinion about the project.
    Have you tried something of similar?
    Do you think that this kind of tools could be useful?

    I appreciate any suggestions.

    Thank you for your kindly attention.
    Salvatore

  • Comment by Pavel Simakov — August 1, 2007 @ 11:17 am

    Hi Salvatore,

    Thank for reading Software Secret Weapons!

    I have reviewed and had time to think about your project as you described it below. Several points can be noted. The project, given its superficial simplicity, might turn into very complex project and might require advanced software engineering skills and ongoing system administration. Do you have these already? Do you have budget for if? Can you hire skilled people?

    Replication topology and especially restructuring the topology is the issue of huge complexity. The best example I know of how it was done is described in Google patches to MySQL in the section Mysql4Patches here http://code.google.com/p/google-mysql-tools/. Google supports advanced dynamic topology restructuring, but it is Google… You can try to use Google version of MySQL, but I personally found documentation to be spotty and you have to be ready to “dig in” if something goes wrong. Good luck with that!

    So my advise is not even try for dynamic MySQL topology. MySQL is quite stable and you can improve reliability by using C-JDBC. Monitoring and advising systems don’t need to be built as they already exist. I personally had a very positive experience with free edition of AdventNet’s ManageEngine OpManager http://manageengine.adventnet.com/products/opmanager/index.html. It supports alarms and things like that. The LinguineWatch was specifically designed with SNMP support to allow using tools like OpManager for monitoring and advising.

    Good luck,

    - Pavel Simakov

  • Comment by Xaprb — October 1, 2007 @ 10:13 am

    The MySQL Heartbeat script, which Jeremy Cole and Six Apart contributed to the MySQL Toolkit (http://mysqltoolkit.sourceforge.net/) is the same idea, elegantly designed as a command-line tool.

  • Comment by yingkuan — February 21, 2008 @ 7:15 pm

    I wonder what’s the issue with using Seconds_Behind_Master in show slave status to check heartbeat ?

    For binlog rotation, from Mysql 5.0 you could set expire_logs_days
    instead of manually do it.

  • Comment by Alex F — February 26, 2008 @ 2:39 am

    I’m not a java programmer, all I need is to monitor MySQL replication. Is there any simple manual to install your scripts properly? I spent day trying to setup them with no success.

  • Comment by Agostino Curti — March 11, 2008 @ 12:09 pm

    Hello Pavel,
    I downloaded/installed Linguine Watch 1.2 because I think this is a very powerful tool. I configured the props file and I checked the software, but during the startup tasks I see the error in attach.. It seems the master connection works fine, but the slave is unreachable… Do you have some idea ? My master/slave works fine and they are configured normally….
    The only thing is: my 2 hosts are BOTH master and slave (Server1 <—> Server2). Is it a problem for Linguine ?

    The error message is: “The slave reports to a different master 1422432106.”

    Thanks for your help and my compliments for the software.

    Agostino Curti

  • Comment by Pavel Simakov — March 11, 2008 @ 12:17 pm

    Hi,

    Thank you for using Linguine Watch!

    This example of MySQL replication monitor assumes one master and one slave one way replication (master -> slave). The monitoring of two way replication needs slightly different approach.

    You need to run two replication monitors one for server1 -> server2 and another one for server2 -> server1. Each master must have different master status table name, for example: _MASTERSTATUS_1 and _MASTERSTATUS_2. These table names can be setup in MySqlReplTest.props.

    Everything should run well after that.

    - Pavel Simakov

  • Comment by maplebed — March 18, 2008 @ 1:18 pm

    re: yingkuan – seconds behind master has a few characteristics that make it less than ideal in more complicated situations.
    * it only has a value when the slave is running, so if there is an error that stops slaving you no longer know how far the slave is behind the master
    * if you are using chained replication (A -> B -> C), seconds behind on C reports the difference between C and B, not between C and A (which is often what you are actually interested in). If B falls behind, C may report that it is up to date, when in reality it is as far behind as B.

  • Comment by Bryon B — April 7, 2009 @ 8:49 pm

    Has anyone tried running this agent as a service – either in Windows or Linux? If so, I’d like to find how you set it up, so you can start/stop it gracefully, etc.

  • Comment by Bryon B — April 8, 2009 @ 8:21 pm

    Thanks to a tip from Pavel, I was able to get the agent running as a Windows service without too much difficulty, using the Community version of the Java Service Wrapper ( http://wrapper.tanukisoftware.org ).

    I used the WrapperSimpleApp Integration method, and that seems to be working. It doesn’t do a clean shutdown, as the service wrapper forcibly terminates the JVM to shut it down, but it seems to work. Now I can start/stop the monitoring agent through the Windows Services Manager. Thanks for the tip Pavel!

  • Comment by Omry — August 12, 2009 @ 4:31 pm

    Hi,
    I created a really flexible and easy to use replication monitor called MySQL Monk.
    take a look at it:

    http://projects.firefang.net/wiki/MySQLMonk

  • Comment by lecteur de cartes — December 14, 2009 @ 7:18 am

    MySQL replication allows you to have an exact copy of a database from a master server on another server (slave), and all updates to the database on the master server are immediately replicated to the database on the slave server so that both databases are in sync.


Leave a comment


 
Dog Emotional 2010 Calendar Dog Emotional Mousepad Dog Fashionable 2010 Calendar Dog Fashionable Mousepad

Copyright © 2004-2010 by Pavel Simakov
any conclusions, recommendations, ideas, thoughts or the source code presented on this site are my own and do not reflect a official opinion of my current or past employers, partners or clients
SourceForge.net Logo