2014년 7월 17일 목요일

Defragmentation InnoDB table on MariaDB 10.0

Sometimes some of our service, there's a lot of DELETE but some case free space of removed rows can not be reused (Especially AutoIncrement primary key).
We have looking for online defragmentation feature for InnoDB long time.
MySQL 5.6 Facebook patch have this feature (I found this a few weeks ago), but not in MariaDB.
So I ported Facebook InnoDB defragmentation feature to MariaDB 10.0.

You can see more detailed explanation and patched source code from below facebook github and kakao tech blog.

https://www.github.com/facebook/mysql-5.6
http://kakao-dbe.blogspot.kr/2014/07/defragment-innodb-table-on-mariadb-100.html


Basic test (Defragmentation efficiency)

After insert 1 million rows, removed some rows(50% from tb_t50, 30% from tb_t30, 20% from tb_t10) from target tables.
And compare three table's data pages count after run "ALTER TABLE tb_txx DEFRAGMENT" command.

CREATE TABLE `tb_t50` (
  `fdpk` int(11) NOT NULL AUTO_INCREMENT,
  `fd1` char(100) NOT NULL,
  `fd2` char(100) NOT NULL,
  PRIMARY KEY (`fdpk`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

CREATE TABLE `tb_t30` (
  `fdpk` int(11) NOT NULL AUTO_INCREMENT,
  `fd1` char(100) NOT NULL,
  `fd2` char(100) NOT NULL,
  PRIMARY KEY (`fdpk`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

CREATE TABLE `tb_t10` -- // ==> Actually this table name must be tb_t20 )
  `fdpk` int(11) NOT NULL AUTO_INCREMENT,
  `fd1` char(100) NOT NULL,
  `fd2` char(100) NOT NULL,
  PRIMARY KEY (`fdpk`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;



Test scenario


  1. Insert 1048576 rows to tb_t50, tb_t30, tb_t10(fdpk column value is auto incremented from 1)
  2. Restart MariaDB server for clearing InnoDB buffer pool
  3. Check loaded page count from InnoDB buffer pool after running below query from tb_t50
    select count(*)
      from tb_t50 use index(primary)
      where fdpk between 1 and 1000000
      order by fdpk;
    select count(*)
      from information_schema.innodb_buffer_page
      where table_name='`test`.`tb_t50`';
  4. Delete 50% of rows from tb_t50, 30% of rows from tb_t30, 20% of rows from tb_t10
    delete from tb_t50 where fdpk%2=0;
    delete from tb_t30 where fdpk%3=0;
    delete from tb_t10 where fdpk%5=0;
  5. Restart MariaDB server
  6. Run select query from tb_t50 to load disk data page to InnoDB buffer pool
  7. Checking loaded page count from InnoDB buffer pool for tb_t50
  8. Doing 6) ~ 7) for tb_t30 and tb_t10



Test result








According to the first graph, pages will be merged automatically when removing 50% of rows from pages. 
But 1/3 and 1/5 case, pages are not merged automatically and stay intactly.
Average row count per page is decreased in tb_t20 and tb_t30 table (According to second graph).

But after defragmentation, average row count per page is getting same among three tables (4th graph). 
And loaded page of InnoDB buffer pool is getting smaller than before(before deletion)  (3rd graph).


Defragmentation Performance and Processing time

Below graph shows disk read iops according to innodb_defragment_frequency system variable.




  1. innodb_defragment_frequency = 100
  2. innodb_defragment_frequency = 1000
  3. innodb_defragment_frequency = 10
  4. innodb_defragment_frequency = 50


innodb_defragment_frequency is used to determine how many times merge method will be call per each second.
So greater value set to this variable more pages would be merged per second. So total defragmentation time is getting low.
But InnoDB defragment thread will read more pages if you set higher value to innodb_defragment_frequency system variable.
If you have fast ssd or whole data could be loaded into InnoDB buffer pool you should increase this variable.
Unless you should decrease this value.

Warning

You can not use this InnoDB defragmentation feature to shrink InnoDB tablespace's disk size.
This feature merge pages and free some page to reserved space of table, not to operating system.

2014년 7월 7일 월요일

MariaDB 10.0.12 Multi Threaded Slave

MariaDB 10.0 MTS

MariaDB 10.0 and MySQL 5.6 both have multi-threaded replication feature, but there's some differences.
Multi-threaded replication of MySQL 5.6 works based on schema(database) and MariaDB 10.0 works based on domain id of GTID.
MariaDB 10.0's multi-threaded replication based on GTID domain id is related with multi-source replication feature.
But we can control connection(session)'s domain id using gtid_domain_id system variable manually for slave multi-thread replication on single-master environment (Not multi-source replication).
Roughly, it is looks like MySQL 5.6 is automatic and MariaDB 10.0 is semi-automatic multi-threaded replication.

Both feature have some benefits as you think. But sometimes we want to allocate another slave sql thread for special purpose query like ALTER or heavy dml statement.
So MariaDB 10.0's semi-automatic multi-threaded replication is not so bad. On the other hand, sometimes schema based multi-threaded replication is useless because of the sharding. Usually sharded mysql instance has only one small database.

And MariaDB 10.0 has another multi-threaded replication feature. In MariaDB knowledge base, MariaDB 10.0 has two types of multi-threaded replication.
First thing is Out-of-ordered multi-threaded replication. it is GTID's domain id based replication written prior paragraph.
And second thing is In-ordered mutli-threaded replication. This type of multi-threaded replication is based on group commit of binary log on master side.
(In this blog, I will focus on second type of MTS based on group commit.)

Before going to multi-threaded replication on slave side, let's check MariaDB 10.0 group commit out first.

Binary Log Group Commit

This is not the first time MySQL binary log group commit is implemented. But this feature is removed in 5.0 because there's some issue and reborn in MySQL 5.5 and 5.6.
Whatever, MariaDB 10.0 group commit affect not only binary log flushing but also innodb transaction commit. 
So InnoDB transaction might get slow when you activate group commit parameters. But not so much if you control it properly.

MariaDB 10.0 supports two system parameters for binary log group commit.
binlog_commit_wait_usec
MariaDB will flush group of binary log events of multiple transactions requested in short time period. But there's no transactions committed at the exact same time. So MariaDB have to wait a little for waiting some transactions combined together
But MariaDB does not know the proper time to wait because it varies SLA and target performance. So MariaDB supports binlog_commit_wait_usec system variable so that DBA can control the time the oldest transaction can tolerate.
binlog_commit_wait_count
DBA can control how often group commit happen with not only time but also the count of transactions. It's transactions' count not binary log events' count.

We can observe how many transactions are committed together by decoding mysql binary log.

First create table "test" with 3 integer columns. And set binlog_commit_wait_usec and binlog_commit_wait_count system variables.
And open 4 connections and run the insert at the same time. You can run 4 insert statement using CSSHX like tool easily.

MariaDB [test]> set global binlog_commit_wait_usec=5000; /* 5 milli-seconds */
Query OK, 0 rows affected (0.00 sec)

MariaDB [test]> set global binlog_commit_wait_count=50;
Query OK, 0 rows affected (0.00 sec)

Client1 > insert into test values (1,1,1);
Client2 > insert into test values (2,2,2);
Client3 > insert into test values (3,3,3);
Client4 > insert into test values (4,4,4);

Each transaction will take more time(maximum 5 milli seconds in this case), and some trasnaction (lucky one) will take much less than 5 milli seconds.
On MariaDB 10.0, each transaction has "COMMIT ID" and a few transactions might have the same "COMMIT ID" if they committed within same group commit.
You can decode mysql binary log using mysqlbinlog or SHOW BINLOG EVENTS command and you can this "COMMIT ID" in the "BEGIN GTID ..." line. "COMMIT ID" is printed with "cid=".
Below sample is the result of above 4 clients' test.

MariaDB [test]> show binlog events in 'binlog.000015';
+-----------+-----+--------------+..+------------------------------------------------+
| Log_name  | Pos | Event_type   |..| Info                                           |
+-----------+-----+--------------+..+------------------------------------------------+
| binlog.00 |   4 | Format_desc  |..| Server ver: 10.0.11-MariaDB-log, Binlog ver: 4 |
| binlog.00 | 248 | Gtid_list    |..| [0-1-1213]                                     |
| binlog.00 | 287 | Binlog_check |..| binary-log.000014                              |
| binlog.00 | 327 | Binlog_check |..| binary-log.000015                              |
| binlog.00 | 367 | Gtid         |..| BEGIN GTID 0-1-1214 cid=1786                   |
| binlog.00 | 407 | Query        |..| use `test`; insert into test values (3,3,3)    |
| binlog.00 | 501 | Xid          |..| COMMIT /* xid=1786 */                          |
| binlog.00 | 528 | Gtid         |..| BEGIN GTID 0-1-1215 cid=1786                   |
| binlog.00 | 568 | Query        |..| use `test`; insert into test values (1,1,1)    |
| binlog.00 | 662 | Xid          |..| COMMIT /* xid=1788 */                          |
| binlog.00 | 689 | Gtid         |..| BEGIN GTID 0-1-1216 cid=1787                   |
| binlog.00 | 729 | Query        |..| use `test`; insert into test values (4,4,4)    |
| binlog.00 | 823 | Xid          |..| COMMIT /* xid=1787 */                          |
| binlog.00 | 850 | Gtid         |..| BEGIN GTID 0-1-1217 cid=1787                   |
| binlog.00 | 890 | Query        |..| use `test`; insert into test values (2,2,2)    |
| binlog.00 | 984 | Xid          |..| COMMIT /* xid=1789 */                          |
+-----------+-----+--------------+..+------------------------------------------------+

Binary log file (binlog.000015) has 4 transactions. (Each transaction start with "BEGIN GTID ...")
And all transactions have different GTID but same cid value "1786". This means all 4 transactions are (group) committed together.

MariaDB 10.0 Multi-Threaded Replication

MariaDB 10.0's In-ordered multi-threaded replication works based on binary log group commit.
Coordinator thread of slave side read all transactions which have same cid from relay log and distribute across multiple sql threads.
The reason slave can replay it parallel is all transactions within same "COMMIT ID" does not conflicted(changing same record). No confliction is certified on master MariaDB. (I think this is really cool idea. So I like MariaDB ^^).

So more transactions(group committed) on master more parallelism we can expect. 

Advantages for not grouped transactions

We can get some benefits if there's no group committed transactions on master.
MariaDB could run "commit" of each transaction parallel even though all transactions have different "COMMIT ID".
We can get a performance gain when "commit" itself is heavy like (sync_binlog=1 & log_slave_updates) or (innodb_flush_log_at_trx_commit이=1) configuration.


MTS Performance Test

Hardware spec


  • Intel(R) Xeon(R) CPU E3-1240 V2 @ 3.40GHz * 4 with HyperThreading
  • 32G memory
  • 2 SAS + Raid controller (R-1) with 512MB cache


MySQL Configurations


  • innodb_buffer_pool_size = 20G
  • binlog_commit_wait_usec = 1000
  • binlog_commit_wait_count = 50      ## only for MTS env.
  • slave_parallel_threads = 10        ## only for MTS env.
  • slave_parallel_max_queued = 524288 ## 512KB


Sysbench Configurations


  • 20 ~ 50 clients
  • table rows : 10 million
  • run only update_nokey update statement


Test Result


  • 15k Update statement / second  




  • 25k Update statement / second



 In this test I ran 25k update statements(per second), there's no replication delay on MTS replication.
I did not test how many update statements(per second) can be replicated without delay, because it's depend on query characteristics and performance.
And with MTS configuration, Slave server takes a lot of CPU cycles. (But it's fair enough, because Coordinator and SQL threads have to synchronize their status every time.)
And this is not a big deal on stand-by slave as we usually use this replication topology.

  • Group commit vs Master performance

There's one more thing we have to consider, transactions commit performance will be down when binlog_commit_wait_usec and binlog_commit_wait_count is set greater than 0.
Because each transactions have to wait until group conditions are met unless binlog_commit_wait_usec and binlog_commit_wait_count are 0. (Intermittent performance drop, I did not want to tune this because I think this performance drop is nothing to do with group commit performance.)


I ran the test both of group-commit activated and deactivated MariaDB. As you can see, there's some performance regression.
But in this test, this transaction waiting time affected throughput greatly because I made only 50 client threads. If there's 25k update statement (/second) with 5000 client thread, performance difference would be close.
And If we can choose proper waiting time and group-commit count(binlog_commit_wait_usec and binlog_commit_wait_count) this performance gap would not be a problem.



With MTS slave, we can see multiple sql thread replay binary event parallel.

MariaDB [(none)]> show processlist;
+-----+-------------+..+---------+-------------------..-+-----------------------------------------...---+
| Id  | User        |..| Command | State             .. | Info                                    ...   |
+-----+-------------+..+---------+-------------------..-+-----------------------------------------...---+
|   3 | system user |..| Connect | Waiting for prior .. | COMMIT                                  ...   |
|   4 | system user |..| Connect | init              .. | COMMIT                                  ...   |
|   5 | system user |..| Connect | init              .. | UPDATE sbtest set c='491483753-416518378...74 |
|   6 | system user |..| Connect | init              .. | UPDATE sbtest set c='909812426-309814605...27 |
|   7 | system user |..| Connect | Waiting for prior .. | COMMIT                                  ...   |
|   8 | system user |..| Connect | Waiting for prior .. | COMMIT                                  ...   |
|   9 | system user |..| Connect | init              .. | UPDATE sbtest set c='112858395-197504108...38 |
|  10 | system user |..| Connect | freeing items     .. | NULL                                    ...   |
|  11 | system user |..| Connect | init              .. | UPDATE sbtest set c='55183839-759991643-...4- |
|  12 | system user |..| Connect | Waiting for prior .. | COMMIT                                  ...   |
|  14 | root        |..| Sleep   |                   .. | NULL                                    ...   |
|  18 | system user |..| Connect | Waiting for master.. | NULL                                    ...   |
|  19 | system user |..| Connect | Slave has read all.. | NULL                                    ...   |
| 155 | root        |..| Query   | init              .. | show processlist                        ...   |
+-----+-------------+..+---------+-------------------..-+-----------------------------------------...---+