Saturday, November 22, 2014

Why does TokuDB need a redo log sync for internal XA?

Like InnoDB, TokuDB has an option that controls whether the redo log is flushed (synced) after each transaction: tokudb_commit_sync. If you set tokudb_commit_sync=OFF, TokuDB does not sync the redo log on every commit; instead it syncs periodically, based on the tokudb_fsync_log_period option.

Usually we use the deferred redo log sync mode, because syncing the redo log on every commit costs a lot of disk write IO. There are many services where the data is not so critical (1~5 seconds of data loss is allowed), and semi-synchronous replication or Galera Cluster doesn't need the redo log to be synced on every commit either.

But in the current version of TokuDB (7.5.2 as of this writing), enabling the binary log is very expensive even when tokudb_commit_sync is OFF: TokuDB's redo log commit mode is automatically switched to ON (sync on every transaction commit) as soon as the server's binary log is activated.
Tokutek says this is because of the internal XA (two-phase commit) between the binary log and the TokuDB storage engine. But I can't understand why a redo log sync is needed for internal XA (in fact, the binary log is not synced when TokuDB does XA; only the TokuDB redo log is).
InnoDB flushes (syncs) both the redo log and the binary log in async mode. Why does only TokuDB need to sync its redo log?

I don't know TokuDB's internal story. Anyway, I changed TokuDB so that the redo log is flushed asynchronously even when the binary log is activated (with tokudb_commit_sync=OFF, of course).
Nothing weird happened in my simple crash scenarios, such as machine failure and a MySQL server crash (since tokudb_commit_sync is OFF, the last few seconds of data are lost, as expected). The replicated data was also fine.
But we can't simply disable TokuDB's XA feature, because without it the binary log and the TokuDB redo log can store transactions in different orders, so a slave can end up with different data from the master.

Binary log
  UPDATE account SET money=money*10 WHERE id=?;
  UPDATE account SET money=money+100 WHERE id=?;

Redo log
  UPDATE account SET money=money+100 WHERE id=?;
  UPDATE account SET money=money*10 WHERE id=?;

Actually, I heard that a recent version of TokuDB added a parameter for disabling XA. So what makes TokuDB require a redo log sync on XA so strictly?
I asked about this on the tokudb-dev Google group, but I have not heard the reason yet. Below is the question I wrote on the tokudb-dev group.

-------------------------------------------------
I have a question about the internal two-phase commit of MySQL(+TokuDB).
According to TokuDB ft-index source code (https://github.com/Tokutek/ft-index/blob/master/ft/txn/txn.cc),

toku_txn_prepare_txn() and toku_txn_commit_txn() use slightly different sync modes.

void toku_txn_prepare_txn(TOKUTXN txn, TOKU_XA_XID *xa_xid) {
    ...
    txn->do_fsync = (txn->force_fsync_on_commit || txn->roll_info.num_rollentries > 0);
    ...
}

// toku_txn_commit_txn() -> toku_txn_commit_with_lsn()
int toku_txn_commit_with_lsn(TOKUTXN txn, int nosync, LSN oplsn,
                             TXN_PROGRESS_POLL_FUNCTION poll, void *poll_extra)
{
    ...
    txn->do_fsync = !txn->parent && (txn->force_fsync_on_commit || (!nosync && txn->roll_info.num_rollentries > 0));
    ...
}

As you can see, toku_txn_commit_txn() takes the nosync parameter into account, and nosync is actually determined by the tokudb_commit_sync system variable. So if a user sets "tokudb_commit_sync=OFF", toku_txn_commit_txn() does not call fsync() for the TokuDB redo log.
But toku_txn_prepare_txn() does not take this system variable into account, so it always calls fsync() for the TokuDB redo log, even when "tokudb_commit_sync=OFF" is set.

Is there any reason why prepare() always calls fsync() for the TokuDB redo log (even on a TokuDB configured with tokudb_commit_sync=OFF)?


This is related to the thread below; we already talked about this issue a few months ago.
-------------------------------------------------

And we are still waiting for the answer.

Wednesday, September 17, 2014

A Problem with MySQL Online DDL

MySQL Online DDL

Online DDL (online schema change) is a new feature (the one every DBA wanted most ^^) of MySQL 5.6 and MariaDB 10.0.
Online DDL is processed through the four tasks below, in sequence.

1) InnoDB::ha_prepare_inplace_alter_table(..)
2) InnoDB::ha_inplace_alter_table(..)
3) InnoDB::ha_commit_inplace_alter_table(..)
4) mysql_rename_table(..)

The MySQL server allocates a memory buffer for online change logging in phase 1. This buffer is used to log the row changes requested by users while the ALTER runs.
The row log buffer is allocated in units of sort_buffer_size, and it can grow up to innodb_online_alter_log_max_size.
In phase 2, the ALTERing thread copies the old table's rows to the newly created table with the new schema.
When the row copy is done, MySQL takes an exclusive lock on the target table and applies the row log buffer to the newly created table.
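As a rough mental model of the buffering described above (my own simplification, not actual server code), the row log grows in sort_buffer_size chunks until it hits innodb_online_alter_log_max_size:

```python
def row_log_alloc(used_bytes: int, sort_buffer_size: int, log_max_size: int) -> int:
    """Simplified model of row log growth: allocation happens in
    sort_buffer_size chunks, capped at innodb_online_alter_log_max_size."""
    chunks = -(-used_bytes // sort_buffer_size)  # ceiling division
    allocated = chunks * sort_buffer_size
    if allocated > log_max_size:
        # In the real server, this is the point where the ALTER is rolled back.
        raise OverflowError("row log overflow: ALTER would be rolled back")
    return allocated

# With sort_buffer_size=256KB and a 128MB cap, logging 300KB of row
# changes occupies two chunks (512KB).
print(row_log_alloc(300 * 1024, 256 * 1024, 128 * 1024 * 1024))  # 524288
```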

There are a few failure cases, but the most unpredictable one (I think) is row log buffer overflow.
The MySQL server rolls back the ALTER statement when the row log buffer is not sufficient. If the ALTER runs a long time before the row log overflows, you have just wasted your precious time.
Even worse, the MySQL server will not tell you how much of the row log buffer is used and how much is free. All we can do is allocate a big buffer and wait for the MySQL server's judgment.
And the MySQL manual doesn't say anything about how much memory online DDL needs ㅠㅠ.


How DML affects row log buffer usage

So I ran a simple test of online row log buffer usage.
Every test scenario is as simple as below, and I printed the increment of row log buffer usage (I added a small fprintf to the source and recompiled).
----------------------------------------------------------------------
Session1 : ALTER TABLE tb_onlineddl ..
Session2 : INSERT INTO tb_onlineddl ..
Session2 : UPDATE tb_onlineddl SET ..
Session2 : DELETE FROM tb_onlineddl WHERE ..
----------------------------------------------------------------------



CASE-1 : Adding column

CREATE TABLE `tb_onlineddl` (
  `fdpk` int(11) NOT NULL AUTO_INCREMENT,
  `fd1` char(100) NOT NULL,
  `fd2` char(100) NOT NULL,
  `fd3` char(100) NOT NULL,
  `fd4` char(100) NOT NULL,
  `fd5` char(100) NOT NULL,
  PRIMARY KEY (`fdpk`)
) ENGINE=InnoDB AUTO_INCREMENT=131071 DEFAULT CHARSET=latin1

MariaDB [test]> alter table tb_onlineddl add fd6 char(100), lock=none, algorithm=inplace;

MariaDB [test]> insert into tb_onlineddl values (null, 'insert', 'insert', 'insert', 'insert', 'insert');
Query OK, 1 row affected (0.00 sec)

MariaDB [test]> update tb_onlineddl set fd1='update' where fdpk=10;
Query OK, 0 rows affected (0.00 sec)
Rows matched: 1  Changed: 0  Warnings: 0

MariaDB [test]> delete from tb_onlineddl where fdpk=4;
Query OK, 1 row affected (0.00 sec)

>> DEBUG for INSERT :: online ddl row log buffer usage : Current: 519 , Total: 519,  Max: 134217728
>> DEBUG for UPDATE :: online ddl row log buffer usage : Current: 519 , Total: 1038, Max: 134217728
>> DEBUG for DELETE :: online ddl row log buffer usage : Current: 22 ,  Total: 1060, Max: 134217728
(in byte)

From this test, I learned that MySQL stores the whole record in the row log buffer even when I update just one column. For a DELETE statement, however, the MySQL server stores just the primary key plus a few overhead bytes.

CASE-2 : Dropping column


CREATE TABLE `tb_onlineddl` (
  `fdpk` int(11) NOT NULL AUTO_INCREMENT,
  `fd1` char(100) NOT NULL,
  `fd2` char(100) NOT NULL,
  `fd3` char(100) NOT NULL,
  `fd4` char(100) NOT NULL,
  `fd5` char(100) NOT NULL,
  `fd6` char(100) NOT NULL,
  PRIMARY KEY (`fdpk`)
) ENGINE=InnoDB AUTO_INCREMENT=131071 DEFAULT CHARSET=latin1

Connection1 [test]> alter table tb_onlineddl drop fd6, lock=none, algorithm=inplace;

Connection2 [test]> insert into tb_onlineddl values (null, 'insert', 'insert', 'insert', 'insert', 'insert', 'insert');
Query OK, 1 row affected (0.01 sec)

Connection2 [test]> update tb_onlineddl set fd1='update' where fdpk=11;
Query OK, 0 rows affected (0.01 sec)
Rows matched: 1  Changed: 0  Warnings: 0

Connection2 [test]> delete from tb_onlineddl where fdpk=2;
Query OK, 1 row affected (0.00 sec)

>> DEBUG for INSERT :: online ddl row log buffer usage : Current: 620 , Total: 620,  Max: 134217728
>> DEBUG for UPDATE :: online ddl row log buffer usage : Current: 520 , Total: 1140, Max: 134217728
>> DEBUG for DELETE :: online ddl row log buffer usage : Current: 21 ,  Total: 1161, Max: 134217728

This test is not much different from CASE-1. One difference is that the UPDATE statement uses a little less row log buffer during an ALTER ... DROP COLUMN.


CASE-3 : Dropping column (With TEXT/BLOB type column) 

CREATE TABLE `tb_onlineddl` (
  `fdpk` int(11) NOT NULL AUTO_INCREMENT,
  `fd1` char(100) NOT NULL,
  `fd2` char(100) NOT NULL,
  `fd3` char(100) NOT NULL,
  `fd4` char(100) NOT NULL,
  `fd5` char(100) NOT NULL,
  `fd6` char(100) NOT NULL,
  PRIMARY KEY (`fdpk`)
) ENGINE=InnoDB AUTO_INCREMENT=131071 DEFAULT CHARSET=latin1

Connection1 [test]> alter table tb_onlineddl drop fd6, lock=none, algorithm=inplace;

Connection2 [test]> insert into tb_onlineddl values (null, 'insert', 'insert', 'insert', 'insert', 'insert', 'insert');
Query OK, 1 row affected (0.01 sec)

Connection2 [test]> update tb_onlineddl set fd1='update' where fdpk=10;
Query OK, 0 rows affected (0.01 sec)
Rows matched: 1  Changed: 0  Warnings: 0

Connection2 [test]> delete from tb_onlineddl where fdpk=6;
Query OK, 1 row affected (0.00 sec)

CREATE TABLE `tb_onlineddl` (
  `fdpk` int(11) NOT NULL AUTO_INCREMENT,
  `fd1` char(100) NOT NULL,
  `fd2` char(100) NOT NULL,
  `fd3` char(100) NOT NULL,
  `fd4` char(100) NOT NULL,
  `fd5` text,
  `fd6` char(100) DEFAULT NULL,
  PRIMARY KEY (`fdpk`)
) ENGINE=InnoDB AUTO_INCREMENT=131072 DEFAULT CHARSET=latin1


>> DEBUG for INSERT :: online ddl row log buffer usage : Current: 527 , Total: 527, Max: 134217728
>> DEBUG for UPDATE :: online ddl row log buffer usage : Current: 425 , Total: 952, Max: 134217728
>> DEBUG for DELETE :: online ddl row log buffer usage : Current: 22  , Total: 974, Max: 134217728

In this test, the large object column (BLOB/TEXT) also uses only as many bytes as it actually needs.


CASE-4 : Adding column (With Index) 

CREATE TABLE `tb_onlineddl` (
  `fdpk` int(11) NOT NULL AUTO_INCREMENT,
  `fd1` char(100) NOT NULL,
  `fd2` char(100) NOT NULL,
  `fd3` char(100) NOT NULL,
  `fd4` char(100) NOT NULL,
  `fd5` char(100) NOT NULL,
  PRIMARY KEY (`fdpk`),
  KEY `ix_fd1` (`fd1`),
  KEY `ix_fd1_fd2` (`fd1`,`fd2`)
) ENGINE=InnoDB AUTO_INCREMENT=131073 DEFAULT CHARSET=latin1

Connection1 [test]> alter table tb_onlineddl add fd6 char(100), lock=none, algorithm=inplace;

Connection2 [test]> insert into tb_onlineddl values (null, 'insert', 'insert', 'insert', 'insert', 'insert');
Query OK, 1 row affected (0.02 sec)

Connection2 [test]> update tb_onlineddl set fd1='update' where fdpk=11;
Query OK, 0 rows affected (0.01 sec)
Rows matched: 1  Changed: 0  Warnings: 0

Connection2 [test]> delete from tb_onlineddl where fdpk=8;
Query OK, 1 row affected (0.01 sec)

>> DEBUG for INSERT :: online ddl row log buffer usage : Current: 519 , Total: 519 , Max: 134217728
>> DEBUG for UPDATE :: online ddl row log buffer usage : Current: 519 , Total: 1038, Max: 134217728
>> DEBUG for DELETE :: online ddl row log buffer usage : Current: 22  , Total: 1060, Max: 134217728

In this test, secondary indexes did not affect row log buffer usage. The MySQL server rebuilds secondary indexes from the clustered key (the primary key index, which contains all columns' values).

So we can assume that the row-log apply phase (which needs an exclusive lock on the table) will be slower when there are many indexes on the table.



How to estimate row log buffer usage during ALTER

Now we can roughly estimate how much row log buffer is needed for an online ALTER TABLE.


CASE-1


  •   10 columns, average row size = 350 bytes
  •   innodb_online_alter_log_max_size = 128MB
  •   If the online DDL statement lasts 10 minutes, the maximum allowable row change rate (UPDATE+INSERT) is about 633/second (128MB / 350 bytes / 10 min / 60 sec).
  •   If it lasts 30 minutes, the maximum allowable row change rate is about 211/second.
  •   If it lasts 60 minutes, the maximum allowable row change rate is about 105/second.



CASE-2


  •   80 columns, average row size = 1.2KBytes
  •   innodb_online_alter_log_max_size = 128MB
  •   If the online DDL statement lasts 10 minutes, the maximum allowable row change rate (UPDATE+INSERT) is about 183/second.
  •   If it lasts 30 minutes, the maximum allowable row change rate is about 61/second.
  •   If it lasts 60 minutes, the maximum allowable row change rate is about 30/second.


I know this calculation is really rough, but I can't find a better approach.
Even if the calculation were accurate, estimating how many rows will be updated and inserted on a specific table is not so easy.
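For what it's worth, the back-of-envelope arithmetic above can be wrapped in a tiny helper (my own sketch; the results land close to the rough figures above):

```python
def max_changes_per_sec(avg_row_bytes: int, log_max_bytes: int, ddl_minutes: int) -> int:
    """Max sustainable row changes (INSERT+UPDATE) per second before the
    row log overflows and the online ALTER is rolled back."""
    return log_max_bytes // (avg_row_bytes * ddl_minutes * 60)

# CASE-1: 350-byte rows, 128MB row log, 10-minute ALTER
print(max_changes_per_sec(350, 128 * 1024 * 1024, 10))   # about 639/s
# CASE-2: 1.2KB rows, 128MB row log, 30-minute ALTER
print(max_changes_per_sec(1229, 128 * 1024 * 1024, 30))  # about 60/s
```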



TODO

I think the best solution would be for the MySQL server to report row log buffer usage and an estimate of the table-altering time.

But it's not so easy to make the MySQL server report row log buffer usage to the client the way MariaDB reports progress (in MySQL 5.5, ALTER was processed in the MySQL handler layer, but MySQL 5.6 online DDL is handled inside each storage engine).

I think the next best solution is adding new global status variables like the ones below.
(These would be global status variables, so two or more concurrent ALTERs would not be covered.)


  • Innodb_onlineddl_rowlog_rows :: row count stored in row log buffer
  • Innodb_onlineddl_rowlog_used :: row log buffer usage (in bytes)
  • Innodb_onlineddl_progress :: Online DDL progress (in copied-rows or %)



With these status variables, a DBA could estimate whether the online ALTER will succeed or not.
I will blog about it if this feature is ready. ^^

Thursday, July 17, 2014

Defragmenting an InnoDB table on MariaDB 10.0

In some of our services there are a lot of DELETEs, and in some cases the free space of removed rows cannot be reused (especially with an auto-increment primary key).
We had been looking for an online defragmentation feature for InnoDB for a long time.
The MySQL 5.6 Facebook patch has this feature (I found this a few weeks ago), but MariaDB does not.
So I ported the Facebook InnoDB defragmentation feature to MariaDB 10.0.

You can find a more detailed explanation and the patched source code at the Facebook GitHub repository and the Kakao tech blog below.

https://www.github.com/facebook/mysql-5.6
http://kakao-dbe.blogspot.kr/2014/07/defragment-innodb-table-on-mariadb-100.html


Basic test (Defragmentation efficiency)

After inserting 1 million rows, I removed some rows (50% from tb_t50, 30% from tb_t30, 20% from tb_t10) from the target tables.
Then I compared the three tables' data page counts after running the "ALTER TABLE tb_txx DEFRAGMENT" command.

CREATE TABLE `tb_t50` (
  `fdpk` int(11) NOT NULL AUTO_INCREMENT,
  `fd1` char(100) NOT NULL,
  `fd2` char(100) NOT NULL,
  PRIMARY KEY (`fdpk`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

CREATE TABLE `tb_t30` (
  `fdpk` int(11) NOT NULL AUTO_INCREMENT,
  `fd1` char(100) NOT NULL,
  `fd2` char(100) NOT NULL,
  PRIMARY KEY (`fdpk`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

CREATE TABLE `tb_t10` ( -- // ==> Actually this table name should have been tb_t20
  `fdpk` int(11) NOT NULL AUTO_INCREMENT,
  `fd1` char(100) NOT NULL,
  `fd2` char(100) NOT NULL,
  PRIMARY KEY (`fdpk`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;



Test scenario


  1. Insert 1,048,576 rows into tb_t50, tb_t30, and tb_t10 (the fdpk column is auto-incremented from 1)
  2. Restart the MariaDB server to clear the InnoDB buffer pool
  3. Check the loaded page count in the InnoDB buffer pool after running the queries below against tb_t50
    select count(*)
      from tb_t50 use index(primary)
      where fdpk between 1 and 1000000
      order by fdpk;
    select count(*)
      from information_schema.innodb_buffer_page
      where table_name='`test`.`tb_t50`';
  4. Delete 50% of rows from tb_t50, 30% from tb_t30, and 20% from tb_t10
    delete from tb_t50 where fdpk%2=0;
    delete from tb_t30 where fdpk%3=0;
    delete from tb_t10 where fdpk%5=0;
  5. Restart the MariaDB server
  6. Run the SELECT query against tb_t50 to load its data pages into the InnoDB buffer pool
  7. Check the loaded page count in the InnoDB buffer pool for tb_t50
  8. Repeat steps 6) ~ 7) for tb_t30 and tb_t10



Test result








According to the first graph, pages were merged automatically when 50% of the rows were removed.
But in the 1/3 and 1/5 cases, pages were not merged automatically and stayed intact.
The average row count per page decreased in the tb_t30 and tb_t10 tables (according to the second graph).

But after defragmentation, the average row count per page became about the same across the three tables (4th graph).
And the number of pages loaded into the InnoDB buffer pool became smaller than before the deletion (3rd graph).


Defragmentation Performance and Processing time

The graph below shows disk read IOPS according to the innodb_defragment_frequency system variable.




  1. innodb_defragment_frequency = 100
  2. innodb_defragment_frequency = 1000
  3. innodb_defragment_frequency = 10
  4. innodb_defragment_frequency = 50


innodb_defragment_frequency determines how many times the merge routine is called per second.
The greater the value, the more pages are merged per second, so the total defragmentation time goes down.
But the InnoDB defragment thread will also read more pages per second if you set a higher value.
If you have a fast SSD, or the whole data set fits in the InnoDB buffer pool, you can increase this variable.
Otherwise you should decrease it.
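A rough lower bound on total defragmentation time follows directly from this variable (my own estimate, assuming about one page merged per call):

```python
def defrag_seconds(n_pages: int, defragment_frequency: int) -> float:
    """Rough lower bound on defragmentation time: the defrag thread performs at
    most `innodb_defragment_frequency` merge calls per second (assuming roughly
    one page merged per call, which is a simplification)."""
    return n_pages / defragment_frequency

# e.g. a 100,000-page table with innodb_defragment_frequency=100
print(defrag_seconds(100_000, 100))  # 1000.0 seconds, about 17 minutes
```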

Warning

You cannot use this InnoDB defragmentation feature to shrink an InnoDB tablespace's size on disk.
This feature merges pages and returns freed pages to the table's reserved space, not to the operating system.

Monday, July 7, 2014

MariaDB 10.0.12 Multi Threaded Slave

MariaDB 10.0 MTS

MariaDB 10.0 and MySQL 5.6 both have a multi-threaded replication feature, but there are some differences.
Multi-threaded replication in MySQL 5.6 works per schema (database), while MariaDB 10.0's works per GTID domain id.
MariaDB 10.0's domain-id-based multi-threaded replication is related to the multi-source replication feature.
But we can also set a connection's (session's) domain id manually through the gtid_domain_id system variable to get multi-threaded slave replication in a single-master environment (not multi-source replication).
Roughly speaking, MySQL 5.6 is automatic and MariaDB 10.0 is semi-automatic multi-threaded replication.

Both features have benefits, as you can imagine. But sometimes we want to allocate a separate slave SQL thread for a special-purpose statement like an ALTER or a heavy DML statement.
So MariaDB 10.0's semi-automatic multi-threaded replication is not so bad. On the other hand, schema-based multi-threaded replication is sometimes useless because of sharding: a sharded MySQL instance usually has only one small database.

MariaDB 10.0 also has another multi-threaded replication feature. According to the MariaDB knowledge base, MariaDB 10.0 has two types of multi-threaded replication.
The first is out-of-order multi-threaded replication: the GTID domain-id-based replication described in the previous paragraph.
The second is in-order multi-threaded replication, which is based on binary log group commit on the master side.
(In this post, I will focus on the second type of MTS, based on group commit.)

Before going into multi-threaded replication on the slave side, let's check out MariaDB 10.0's group commit first.

Binary Log Group Commit

This is not the first time MySQL binary log group commit has been implemented: the feature was removed in 5.0 because of some issues, and was reborn in MySQL 5.5 and 5.6.
Anyway, MariaDB 10.0 group commit affects not only binary log flushing but also InnoDB transaction commit.
So InnoDB transactions might get slower when you activate the group commit parameters, but not by much if you tune them properly.

MariaDB 10.0 provides two system variables for binary log group commit.

binlog_commit_wait_usec
MariaDB flushes a group of binary log events from multiple transactions committed within a short time window. But no two transactions commit at exactly the same moment, so MariaDB has to wait a little for transactions to accumulate into a group.
MariaDB cannot know the right amount of time to wait, because it depends on your SLA and performance target. So it provides the binlog_commit_wait_usec system variable, with which the DBA controls how long the oldest transaction may be delayed.

binlog_commit_wait_count
The DBA can also control how often group commit happens by transaction count, not just by time. Note that this counts transactions, not binary log events.
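A toy model of this wait (my own sketch, not server code): the leader flushes the group when either enough transactions have queued up (binlog_commit_wait_count) or the time budget expires (binlog_commit_wait_usec), whichever comes first.

```python
import time

def wait_for_group(pending, wait_count: int, wait_usec: int):
    """Toy model of the binlog group commit wait: return the group to flush once
    `wait_count` transactions have queued or `wait_usec` microseconds have passed,
    whichever comes first. `pending` is a list other threads may append to."""
    deadline = time.monotonic() + wait_usec / 1_000_000
    while len(pending) < wait_count and time.monotonic() < deadline:
        time.sleep(0.0001)  # the real server blocks on a condition variable
    return list(pending)    # these transactions are flushed together

# A full group returns immediately; a lone transaction waits out its time budget.
print(wait_for_group([1, 2, 3], wait_count=3, wait_usec=5000))  # [1, 2, 3]
```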

We can observe how many transactions are committed together by decoding the MySQL binary log.

First, create a table "test" with 3 integer columns, and set the binlog_commit_wait_usec and binlog_commit_wait_count system variables.
Then open 4 connections and run the inserts at the same time. You can easily run the 4 INSERT statements simultaneously using a tool like csshX.

MariaDB [test]> set global binlog_commit_wait_usec=5000; /* 5 milli-seconds */
Query OK, 0 rows affected (0.00 sec)

MariaDB [test]> set global binlog_commit_wait_count=50;
Query OK, 0 rows affected (0.00 sec)

Client1 > insert into test values (1,1,1);
Client2 > insert into test values (2,2,2);
Client3 > insert into test values (3,3,3);
Client4 > insert into test values (4,4,4);

Each transaction will take more time (up to 5 milliseconds in this case), and some lucky transactions will take much less than 5 milliseconds.
In MariaDB 10.0, each transaction has a "COMMIT ID", and several transactions share the same "COMMIT ID" if they were committed within the same group commit.
You can decode the binary log using mysqlbinlog or the SHOW BINLOG EVENTS command, and you can find this "COMMIT ID" in the "BEGIN GTID ..." line, printed as "cid=".
The output below is the result of the 4-client test above.

MariaDB [test]> show binlog events in 'binlog.000015';
+-----------+-----+--------------+..+------------------------------------------------+
| Log_name  | Pos | Event_type   |..| Info                                           |
+-----------+-----+--------------+..+------------------------------------------------+
| binlog.00 |   4 | Format_desc  |..| Server ver: 10.0.11-MariaDB-log, Binlog ver: 4 |
| binlog.00 | 248 | Gtid_list    |..| [0-1-1213]                                     |
| binlog.00 | 287 | Binlog_check |..| binary-log.000014                              |
| binlog.00 | 327 | Binlog_check |..| binary-log.000015                              |
| binlog.00 | 367 | Gtid         |..| BEGIN GTID 0-1-1214 cid=1786                   |
| binlog.00 | 407 | Query        |..| use `test`; insert into test values (3,3,3)    |
| binlog.00 | 501 | Xid          |..| COMMIT /* xid=1786 */                          |
| binlog.00 | 528 | Gtid         |..| BEGIN GTID 0-1-1215 cid=1786                   |
| binlog.00 | 568 | Query        |..| use `test`; insert into test values (1,1,1)    |
| binlog.00 | 662 | Xid          |..| COMMIT /* xid=1788 */                          |
| binlog.00 | 689 | Gtid         |..| BEGIN GTID 0-1-1216 cid=1787                   |
| binlog.00 | 729 | Query        |..| use `test`; insert into test values (4,4,4)    |
| binlog.00 | 823 | Xid          |..| COMMIT /* xid=1787 */                          |
| binlog.00 | 850 | Gtid         |..| BEGIN GTID 0-1-1217 cid=1787                   |
| binlog.00 | 890 | Query        |..| use `test`; insert into test values (2,2,2)    |
| binlog.00 | 984 | Xid          |..| COMMIT /* xid=1789 */                          |
+-----------+-----+--------------+..+------------------------------------------------+

The binary log file (binlog.000015) contains 4 transactions (each transaction starts with "BEGIN GTID ...").
All transactions have different GTIDs, but transactions sharing a cid value were group-committed together: here 0-1-1214 and 0-1-1215 share cid=1786, and 0-1-1216 and 0-1-1217 share cid=1787, so the 4 transactions were committed in two groups.

MariaDB 10.0 Multi-Threaded Replication

MariaDB 10.0's in-order multi-threaded replication works on top of binary log group commit.
The coordinator thread on the slave side reads all transactions that have the same cid from the relay log and distributes them across multiple SQL threads.
The reason the slave can replay them in parallel is that transactions sharing a "COMMIT ID" do not conflict (do not change the same records); the absence of conflicts was already certified on the master. (I think this is a really cool idea. That's why I like MariaDB ^^.)

So the more transactions are group-committed on the master, the more parallelism we can expect.
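The scheduling idea above can be sketched as follows (a toy model, not MariaDB code; the GTIDs and cids are taken from the SHOW BINLOG EVENTS output shown earlier):

```python
from itertools import groupby

# Toy model of in-order MTS scheduling: transactions sharing a commit id (cid)
# were group-committed on the master and proved conflict-free, so the slave may
# apply each cid group in parallel, while the groups themselves stay in order.
events = [("0-1-1214", 1786), ("0-1-1215", 1786),
          ("0-1-1216", 1787), ("0-1-1217", 1787)]

for cid, group in groupby(events, key=lambda e: e[1]):
    batch = [gtid for gtid, _ in group]
    print(f"cid={cid}: apply {batch} in parallel")
```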

Advantages for non-grouped transactions

We can get some benefit even if there are no group-committed transactions on the master.
MariaDB can run the "commit" step of each transaction in parallel even when all transactions have different "COMMIT ID"s.
This gives a performance gain when the "commit" itself is heavy, for example with (sync_binlog=1 & log_slave_updates) or innodb_flush_log_at_trx_commit=1.


MTS Performance Test

Hardware spec


  • Intel(R) Xeon(R) CPU E3-1240 V2 @ 3.40GHz * 4 with HyperThreading
  • 32G memory
  • 2 SAS + Raid controller (R-1) with 512MB cache


MySQL Configurations


  • innodb_buffer_pool_size = 20G
  • binlog_commit_wait_usec = 1000
  • binlog_commit_wait_count = 50      ## only for MTS env.
  • slave_parallel_threads = 10        ## only for MTS env.
  • slave_parallel_max_queued = 524288 ## 512KB


Sysbench Configurations


  • 20 ~ 50 clients
  • table rows : 10 million
  • run only update_nokey update statement


Test Result


  • 15k Update statement / second  




  • 25k Update statement / second



In this test I ran 25k update statements per second, and there was no replication delay with MTS replication.
I did not test how many update statements per second can be replicated without delay, because it depends on query characteristics and hardware performance.
With the MTS configuration, the slave server consumes a lot of CPU cycles. (That's fair enough, because the coordinator and SQL threads have to synchronize their status all the time.)
And this is not a big deal on a standby slave, which is how we usually use this replication topology.

  • Group commit vs Master performance

There's one more thing to consider: transaction commit performance goes down when binlog_commit_wait_usec and binlog_commit_wait_count are set greater than 0, because each transaction has to wait until the group conditions are met.
(There were intermittent performance drops; I did not try to tune them away because I think they have nothing to do with group commit performance.)


I ran the test on both group-commit-activated and -deactivated MariaDB. As you can see, there is some performance regression.
But in this test the transaction waiting time affected throughput greatly because I used only 50 client threads. With 25k update statements per second coming from 5000 client threads, the difference would be much closer.
And if we choose a proper waiting time and group commit count (binlog_commit_wait_usec and binlog_commit_wait_count), this performance gap would not be a problem.



With an MTS slave, we can see multiple SQL threads replaying binary log events in parallel.

MariaDB [(none)]> show processlist;
+-----+-------------+..+---------+-------------------..-+-----------------------------------------...---+
| Id  | User        |..| Command | State             .. | Info                                    ...   |
+-----+-------------+..+---------+-------------------..-+-----------------------------------------...---+
|   3 | system user |..| Connect | Waiting for prior .. | COMMIT                                  ...   |
|   4 | system user |..| Connect | init              .. | COMMIT                                  ...   |
|   5 | system user |..| Connect | init              .. | UPDATE sbtest set c='491483753-416518378...74 |
|   6 | system user |..| Connect | init              .. | UPDATE sbtest set c='909812426-309814605...27 |
|   7 | system user |..| Connect | Waiting for prior .. | COMMIT                                  ...   |
|   8 | system user |..| Connect | Waiting for prior .. | COMMIT                                  ...   |
|   9 | system user |..| Connect | init              .. | UPDATE sbtest set c='112858395-197504108...38 |
|  10 | system user |..| Connect | freeing items     .. | NULL                                    ...   |
|  11 | system user |..| Connect | init              .. | UPDATE sbtest set c='55183839-759991643-...4- |
|  12 | system user |..| Connect | Waiting for prior .. | COMMIT                                  ...   |
|  14 | root        |..| Sleep   |                   .. | NULL                                    ...   |
|  18 | system user |..| Connect | Waiting for master.. | NULL                                    ...   |
|  19 | system user |..| Connect | Slave has read all.. | NULL                                    ...   |
| 155 | root        |..| Query   | init              .. | show processlist                        ...   |
+-----+-------------+..+---------+-------------------..-+-----------------------------------------...---+

Sunday, June 22, 2014

Memcached Replication (3)

This is the last article about Memcached replication. If you missed the previous two articles, read them first for background.

  1. http://seonguck.blogspot.kr/2014/06/memcached-replication-1.html
  2. http://seonguck.blogspot.kr/2014/06/memcached-replication-2.html



Install

First, download the KMC source from GitHub.
https://github.com/kakao/mysql_5.6.14_kmc

To build KMC, you need the basic libraries required for MySQL 5.6.
In particular, many systems lack libaio-devel, ncurses-devel, and cmake.
You also need to install the libmemcached library and its development package.

libmemcached-1.0.4-1.el5.remi.x86_64.rpm
libmemcached-devel-1.0.4-1.el5.remi.x86_64.rpm

Now you can build KMC; run cmake with a few options.

# cd mysql-5.6.14_kmc
# mkdir Release
# cd Release
# cmake .. \
'-DBUILD_CONFIG=mysql_release' \
'-DCMAKE_INSTALL_PREFIX=/usr/local/mysql' \
'-DWITH_INNODB_MEMCACHED=ON' \
'-DENABLED_LOCAL_INFILE=OFF' \
'-DHAVE_QUERY_CACHE=OFF' \
'-DOPTIMIZER_TRACE=OFF' \
'-DENABLE_DEBUG_SYNC=OFF' \
'-DENABLED_PROFILING=OFF' \
'-DWITH_ARCHIVE_STORAGE_ENGINE=OFF' \
'-DWITH_EMBEDDED_SERVER=OFF' \
'-DENABLE_DTRACE=OFF'

Of course you don't need all of these options, but you must set "-DWITH_INNODB_MEMCACHED=ON" for the Memcached plugin of MySQL.
After cmake, just run make && make install. After installing, you can find the built MySQL executables in the directory you specified as the install prefix.
Before starting the MySQL server, run the $MYSQL_HOME/scripts/mysql_install_db script to create the default dictionary schema, as on a normal MySQL server.


Creating the Memcached-related schema

After starting the MySQL server, create the Memcached-related schema to activate the Memcached plugin.
The initialization script is located in your MySQL home directory; just run it.
This procedure is also needed for the original MySQL Memcached plugin, not only for KMC.

mysql> source $MYSQL_HOME/share/innodb_memcached_config.sql
mysql> use innodb_memcache
mysql> show tables

Check that the three tables below have been created in the innodb_memcache database.


  • cache_policies
  • config_options
  • containers


Now we need to initialize KMC basic schema.

mysql> USE innodb_memcache;
mysql> INSERT INTO `containers` VALUES ('default','kmc','kmc_template','k','v','f','c','e','PRIMARY');

mysql> CREATE DATABASE kmc;
mysql> USE kmc;

mysql> DROP TABLE IF EXISTS `kmc_template`;
mysql> CREATE TABLE `kmc_template` (
  `k` varchar(255) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL DEFAULT '',
  `v` mediumblob,
  `f` int(11) NOT NULL DEFAULT '0',
  `c` bigint(20) unsigned NOT NULL DEFAULT '0',
  `e` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`k`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 STATS_PERSISTENT=0;

## This default row is probably no longer needed, but I am not sure.
mysql> INSERT INTO `kmc_template` VALUES ('1','DO-NOT-REMOVE',0,0,0);

In this procedure, the table name must start with the "kmc_" prefix.


Configurations

Finally, change the my.cnf configuration file for KMC. The options below are not only for KMC but also for Memcached performance.
We don't use KMC together with regular InnoDB or MyISAM tables, so I changed the InnoDB options minimally.
These options are not strict (except kmc_connect_string and binlog-format), so you can change them to match your own standards.

## InnoDB ------------------------
...
innodb_data_file_path = ibdata1:64M:autoextend
...
innodb_log_file_size = 32M
innodb_log_files_in_group = 2
innodb_log_buffer_size = 16M

## Memcached --------------------
daemon_memcached_option = '-m 20480 -p 11211 -c 80000 -t 8 -s /tmp/memcached.sock'
innodb_api_enable_binlog = 1
innodb_api_trx_level=0 ## READ-UNCOMMITTED
innodb_api_bk_commit_interval=1
daemon_memcached_r_batch_size=1
daemon_memcached_w_batch_size=1
innodb_api_enable_mdl=OFF

## Replication:Binary log -----------
server-id  = 1
## Replication query is always idempotent
slave_exec_mode = IDEMPOTENT
## Memcached connection string :: with socket file
kmc_connect_string = '--SOCKET="/tmp/memcached.sock" --BINARY-PROTOCOL --NOREPLY --TCP-NODELAY --TCP-KEEPALIVE'
binlog-checksum = NONE
sync_binlog = 0
master_info_repository = FILE
relay_log_info_repository = FILE
sync_master_info    = 0
sync_relay_log      = 0
sync_relay_log_info = 0
slave_checkpoint_group = 100000
slave_checkpoint_period = 1000
max_binlog_size = 100M
expire_logs_days=1
expire_max_log_files = 15
binlog-format = ROW

daemon_memcached_option holds the Memcached plugin's startup options, and the kmc_connect_string option is for the SQL thread of the slave MySQL server.
Now start the MySQL server and check that the basic functions work.
You can also set up replication and check replication status in the same way as with an original MySQL server.


Testing Memcached and Replication

The MySQL Memcached plugin supports both the binary and text protocols, so we can simply test memcached operations with the telnet program.

[root@matt001 ~]# telnet localhost 11211
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.

set matt 0 0 9
Seong Uck
STORED

get matt
VALUE matt 0 9
Seong Uck
END

delete matt
DELETED

get matt
END

quit
Connection closed by foreign host.
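The telnet session works because the memcached text protocol requires the byte count on the storage command line to match the payload exactly, which is why "set matt 0 0 9" is followed by the 9-byte value "Seong Uck". A minimal Python sketch of how those lines are framed (the function names are mine, not part of the plugin; no server is needed to check the framing):

```python
def frame_set(key: str, value: bytes, flags: int = 0, exptime: int = 0) -> bytes:
    """Frame a memcached text-protocol storage command:
    set <key> <flags> <exptime> <bytes>\r\n<data>\r\n"""
    header = f"set {key} {flags} {exptime} {len(value)}\r\n".encode()
    return header + value + b"\r\n"

def frame_get(key: str) -> bytes:
    """Frame a memcached text-protocol retrieval command."""
    return f"get {key}\r\n".encode()

# The exact bytes sent in the telnet session above:
cmd = frame_set("matt", b"Seong Uck")
assert cmd == b"set matt 0 0 9\r\nSeong Uck\r\n"
assert frame_get("matt") == b"get matt\r\n"
```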

If you want to check whether the socket file (/tmp/memcached.sock) is working correctly, use the netcat utility; telnet can't connect to a Unix domain socket.

[root@matt001 ~]# nc -U /tmp/memcached.sock
set matt 0 0 9
SEONG UCK
STORED

get matt
VALUE matt 0 9
SEONG UCK
END

quit
[root@matt001 ~]#

If you want to check the replicated data, run a GET operation on the slave memcached after running SET on the master memcached.
Of course you can also inspect the binary log file through the mysqlbinlog utility.



The Memcached replication feature (the added feature) doesn't have unit tests yet, so you should be careful.

Saturday, June 21, 2014

Memcached Replication (2)

Modifications

As written in the first article about Memcached Replication, some of the MySQL 5.6 source code needs to be changed.
The diagram below explains how the MySQL Memcached plugin works and what I modified.



(Blue lines are the original behavior of the MySQL 5.6 Memcached plugin's Caching and InnoDB-only modes; bold brown lines are my modifications)


The numbers in the diagram mean:


  1. The InnoDB API transfers the input data to the InnoDB engine (table) as configured by the innodb_api_bk_commit_interval, daemon_memcached_r_batch_size, and daemon_memcached_w_batch_size system variables, and after that it writes the changes to the binary log. But this does not happen in Cache-only mode. I modified the MySQL code so that in Cache-only mode memcached changes are also written to the binary log, but are not applied to the InnoDB engine. In the original MySQL 5.6 Memcached plugin, the dictionary table (innodb_memcache DB) and the container table are needed for the Caching and InnoDB-only cache modes but not for Cache-only mode. After my modification, the dictionary table and container table are also required in Cache-only mode, for binary log writing. Of course we will not store memcached data in the container table, but its template is needed for writing the binary log.
  2. On the slave side, the SQL thread has to apply the relay log fetched by the IO thread to the MySQL Memcached plugin, NOT to InnoDB tables. But in original MySQL replication, the SQL thread directly accesses the InnoDB API, not Memcached. I needed a way to reach the Memcached plugin, so I implemented the SQL thread to bring up its own memcached client (using the libmemcached client library). The non-blocking mode of the libmemcached client library can be used for fast relay log applying. Actually, in original MySQL 5.6 a libmemcached.so shared library already exists in $MYSQL_HOME/lib/plugin, so I renamed it to "libmemcachedserver.so" so that the libmemcached RPM can be installed separately on the same system.
  3. The replication SQL thread, acting as a memcached client, has to run as fast as possible, so I modified the SQL thread's memcached client to connect to the Memcached socket file rather than TCP port 11211, and to work in non-blocking mode for fast replication. Actually, the SQL thread applies relay log events in sync mode on every 100th request so that the slave can detect replication apply errors.
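The apply strategy in step 3 can be sketched as follows: fire requests in no-reply mode and synchronize every 100th request to surface apply errors. The client here is a stub of my own; the real code uses the libmemcached C API.

```python
# Sketch of the relay-log apply loop: buffered no-reply writes, with a
# synchronous flush-and-check every SYNC_EVERY-th event.
SYNC_EVERY = 100

class StubMemcachedClient:
    """Stand-in for a libmemcached connection (illustrative only)."""
    def __init__(self):
        self.pending = 0   # buffered no-reply requests
        self.applied = 0   # requests confirmed by the server
    def set_noreply(self, key, value):
        self.pending += 1  # buffered locally, no round trip yet
    def flush_and_check(self):
        # A real client would raise here if the server reported an error.
        self.applied += self.pending
        self.pending = 0

def apply_relay_events(client, events):
    for i, (key, value) in enumerate(events, start=1):
        client.set_noreply(key, value)
        if i % SYNC_EVERY == 0:     # every 100th event: sync and check errors
            client.flush_and_check()
    client.flush_and_check()        # drain the remaining buffered events

client = StubMemcachedClient()
apply_relay_events(client, [("k%d" % i, "v") for i in range(250)])
assert client.applied == 250 and client.pending == 0
```

The trade-off is the usual one: buffering amortizes round trips, while the periodic sync bounds how many events can fail silently before the slave notices.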




Another modifications


  • Originally the Memcached plugin never opens both the TCP port and the Unix domain socket at the same time, but I need both channels.
  • Copy the Memcached plugin's status metrics to MySQL server status variables so that we can monitor memcached status with MySQL monitoring tools. The InnoDB master thread copies the memcached status metrics to MySQL status variables every second.

mysql> show global status like '%kmc%';
+----------------------------------+--------------+
| Variable_name                    | Value        |
+----------------------------------+--------------+
| Innodb_kmc_connection_structures | 2004         |
| Innodb_kmc_curr_connections      | 2002         |
| Innodb_kmc_curr_items            | 70638969     |
| Innodb_kmc_pointer_size          | 64           |
| Innodb_kmc_threads               | 8            |
| Innodb_kmc_total_connections     | 5090         |
| Innodb_kmc_total_items           | 778120599    |
| Innodb_kmc_bytes                 | 18648687816  |
| Innodb_kmc_bytes_read            | 460650586044 |
| Innodb_kmc_bytes_written         | 497924594841 |
| Innodb_kmc_cmd_get               | 7781206118   |
| Innodb_kmc_cmd_set               | 778120599    |
| Innodb_kmc_evictions             | 498675673    |
| Innodb_kmc_get_hits              | 1043938514   |
| Innodb_kmc_get_misses            | 6737267604   |
| Innodb_kmc_limit_maxbytes        | 21474836480  |
+----------------------------------+--------------+


  • Fixed a memory leak in MySQL 5.6.14 and an expiration time bug in the Memcached plugin.
  • Removed every file sync call for fast processing (where possible).
  • Added the expire_max_log_files system variable so that we can control the total binary log size by file count. (This feature only uses the binary log file's suffix number, so if you frequently use "PURGE LOGS" or other commands that switch the binary log file, expire_max_log_files will not work as expected.)
  • Added the Binlog_purge_failed system status variable for monitoring binary log purging:
    • 0 : Okay
    • 1 : Set when the MySQL server attempted to remove an ACTIVE state binary log.
    • 2 : Set when the MySQL server attempted to remove an in-use binary log.
  • Added the kmc_connect_string system variable so that you can change how the SQL thread connects to its local memcached plugin. It is a dynamic variable, so you can change it whenever you want, but you have to run "STOP SLAVE" and "START SLAVE" to apply the change.
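The purge-by-count idea behind expire_max_log_files can be sketched like this: judge a file's age purely by the numeric suffix of the binlog name and drop everything beyond the newest N files. The helper names are mine; the real logic lives inside the server.

```python
import re

def files_to_purge(binlogs, expire_max_log_files):
    """Return the oldest binary logs (by suffix number) that exceed the
    configured file count. Relies only on the numeric suffix, which is
    exactly why manual PURGE LOGS can confuse the real feature."""
    numbered = sorted(binlogs, key=lambda f: int(re.search(r"(\d+)$", f).group(1)))
    excess = len(numbered) - expire_max_log_files
    return numbered[:excess] if excess > 0 else []

# 20 binlog files with expire_max_log_files = 15 -> the oldest 5 are purged.
logs = ["mysql-bin.%06d" % n for n in range(1, 21)]
assert files_to_purge(logs, 15) == ["mysql-bin.%06d" % n for n in range(1, 6)]
```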




Usable Memcached Operations

Unfortunately, in KMC (Kakao MemCached) some memcached operations are not usable. This is because of the characteristics of Memcached replication (and they have not been tested).

Usable operations

  • GET
  • SET
  • ADD
  • DELETE
  • REPLACE


Not implemented or Not tested

  • INCREMENT
  • DECREMENT
  • CAS


Performance

The performance of GET operations is the same as original Memcached. But SET/ADD/DELETE operations need to be written to disk (the binary log file), and a lot of complex processing is involved even though it's not sync mode. So SET/ADD/DELETE performance is dramatically lower than an original memcached server.
But for the general use of a Memcached server, SET/ADD/DELETE operations are not that frequent: once a data item is cached in Memcached, GET operations dominate.
If not (if there are a lot of SET/ADD/DELETE operations), that memcached server has little effect because the cache hit ratio is low, and in that case the Memcached server only adds overhead (I think).

I also modified the SQL thread to work in parallel on a per-row basis, but MySQL 5.6 multi-threaded replication creates a lot of synchronization overhead (CPU overhead) between the slave worker threads and the coordinator thread.

The MySQL Memcached plugin can process more SET/ADD/DELETE operations than normal MySQL SQL statements, but you have to consider replication.
The SQL thread on the slave side works as a single thread, and this can be a bottleneck.
On an Intel x86 commodity server, I think about 10k SET/ADD/DELETE operations per second is the limit of replication (with no replication delay).
So I added the expire_max_log_files system variable and the Binlog_purge_failed status variable. If you have enough memory, allocate some of it to the binary log directory using ramfs or a RAM disk for SET/ADD/DELETE performance. The added system variable and status variable will help with managing ramfs usage.





Some limitations


  • The container table name must start with "kmc_".
  • The CAS, INCR, and DECR operations can't be used.
  • KMC (Kakao MemCached) uses the two upper bytes of the Memcached flags field for the originating server_id, so this may not be compatible with your memcached client.
  • KMC converts your DELETE operation into a SET operation with a 1-second expiration (because we can't use the flags field in a DELETE operation).
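The flags-field convention described above can be sketched in a few lines: the originating server_id occupies the two upper bytes of the 32-bit memcached flags, leaving the two lower bytes for the client's own flags. The function names are mine, for illustration only.

```python
def pack_flags(server_id: int, client_flags: int) -> int:
    """KMC-style flags: server_id in the upper 16 bits, client flags below."""
    assert 0 <= server_id <= 0xFFFF and 0 <= client_flags <= 0xFFFF
    return (server_id << 16) | client_flags

def unpack_flags(flags: int):
    """Split a 32-bit flags value back into (server_id, client_flags)."""
    return flags >> 16, flags & 0xFFFF

flags = pack_flags(server_id=1, client_flags=3)
assert unpack_flags(flags) == (1, 3)

# A DELETE carries no flags field, so KMC replicates it as a SET with a
# 1-second expiration instead -- sketched here as the text command it might emit:
def delete_as_set(key: str, server_id: int) -> str:
    return f"set {key} {pack_flags(server_id, 0)} 1 0"  # exptime=1, empty value

assert delete_as_set("matt", 1).split()[3] == "1"  # 1-second expire
```

This is also why the scheme breaks ordinary clients: any client that stores its own value in the full 32-bit flags field will collide with the server_id bytes.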




Download

https://github.com/kakao/mysql_5.6.14_kmc