Thursday, January 11, 2018

More Write Set in MySQL: Group Replication Certification

This is the third post in the series on Write Set in MySQL.  In the first post, we explored how Write Set enables better parallel replication in MySQL 8.0.  In the second post, we saw how the MySQL 8.0 improvement is an extension of the work done in MySQL 5.7 to avoid replication delay/lag in Group Replication.  In this post, we will see how Write Set is used in Group Replication to detect conflicts in multi-writer mode during certification.  We will also see the impact, on conflict detection, of the Write Set bug that I presented in the first post.

As in the previous post, the examples in this post assume that you have a working Group Replication installation (with MySQL 5.7.20 if you want to see the impact of the Write Set bug).  I am not describing how to set up Group Replication here; you can read Matt Lord's post for that (MySQL Group Replication quick start guide).  For this post, the group needs to be multi-writer (the official documentation calls this mode multi-primary), so make sure that you run SET GLOBAL group_replication_single_primary_mode = OFF before starting the group.
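
As a quick sanity check (this query is my addition and not part of the original setup; the group_replication_single_primary_mode variable exists in MySQL 5.7.17 and later), you can confirm that the group is running in multi-primary mode before continuing; the query should return 0:
> SELECT @@GLOBAL.group_replication_single_primary_mode;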

In multi-primary mode, Group Replication allows transactions to be executed on many members of the group concurrently.  This comes with the hidden complexity (in the Group Replication implementation) of detecting (and avoiding/rejecting) transactions that generate conflicts.  When all writes are done on a single node, locking is enough to ensure data consistency, but things are a little more complex in a distributed system like Group Replication.

To protect against conflicts, Group Replication uses a process called certification.  From the documentation, we can read:
[...] there may be conflicts between transactions that execute concurrently on different servers. Such conflicts are detected by inspecting the write sets of two different and concurrent transactions [...]. The resolution procedure states that the transaction that was ordered first commits on all servers, whereas the transaction ordered second aborts, and thus is rolled back [..]. This is in fact a distributed first commit wins rule.
The image below, also from the manual, explains how Group Replication executes and commits transactions.  If certification fails (a conflict is detected), the transaction is rolled back.

[Image from the MySQL manual: how Group Replication executes, certifies and commits transactions.]
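
Certification activity is exposed in Performance Schema, and we will use it below to confirm that a conflict was detected.  As a quick illustration (this query is my addition and not part of the test; the columns come from the replication_group_member_stats table), you can peek at the certification statistics on any member with:
> SELECT MEMBER_ID, COUNT_TRANSACTIONS_CHECKED, COUNT_CONFLICTS_DETECTED
    ->   FROM performance_schema.replication_group_member_stats\G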
Let's see certification in action.  For that, we need to run a few initialisation commands:
> SELECT version();
+------------+
| version()  |
+------------+
| 5.7.20-log |
+------------+
1 row in set (0.00 sec)

> CREATE DATABASE test_jfg_ws;
Query OK, 1 row affected (0.01 sec)

> CREATE TABLE test_jfg_ws.test_jfg_ws (
    ->   id int(10) unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ->   str varchar(80) NOT NULL UNIQUE);
Query OK, 0 rows affected (0.01 sec)

> CREATE TABLE test_jfg_ws.test_jfg_ws2 (
    ->   id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ->   str VARCHAR(80) NOT NULL);
Query OK, 0 rows affected (0.01 sec)
Creating a conflict needs precise timing.  Like in the previous post, I do this using a long transaction.  So again, I need a table with many rows (one million in my case), which I created by running the following commands in a Linux shell:
# sql="INSERT INTO test_jfg_ws.test_jfg_ws2 (str) VALUES (RAND());"
# for i in $(seq 100); do
>   ( echo "BEGIN;"; yes "$sql" | head -n 10000; echo "COMMIT;"; ) | mysql
> done
With the table above and on my test system, executing the ALTER below takes between three and four seconds.  If it takes less than two seconds on your system, run the loop above again to get more rows in the table.
> ALTER TABLE test_jfg_ws.test_jfg_ws2 MODIFY COLUMN str VARCHAR(60);
Query OK, 1000000 rows affected (3.6 sec)
Records: 1000000  Duplicates: 0  Warnings: 0
Now we are ready to trigger a conflict.  On two different members of the group and at about the same time, we will INSERT a row that conflicts on a UNIQUE KEY.  For that, we run the first script below on one member of the group, and the second on a different member of the group.  The first script waits for the ALTER of the second script to arrive, sleeps a little to make sure the INSERT from the other member reaches this node and is certified but not yet applied (the ALTER is still running), and then tries the conflicting INSERT.
### On one member, run this in a Linux shell:
# sql_conflicts="SELECT COUNT_CONFLICTS_DETECTED FROM performance_schema.replication_group_member_stats\G";
# sql_count="SELECT count(*) as nb_z FROM test_jfg_ws.test_jfg_ws WHERE str = 'z'\G"
# while sleep 1; do
>   mysql <<< "SHOW PROCESSLIST" | grep -q "test_jfg_ws.test_jfg_ws2" || continue
>   sleep 0.5
>   mysql <<< "$sql_conflicts"
>   mysql <<< "BEGIN;
>              INSERT INTO test_jfg_ws.test_jfg_ws (str) VALUES ('z');
>              $sql_count
>              COMMIT"
>   mysql <<< "$sql_count $sql_conflicts"
>   break
> done

### On a different member, run this in a Linux shell:
# mysql <<< "
>   ALTER TABLE test_jfg_ws.test_jfg_ws2 MODIFY COLUMN str VARCHAR(80);
>   INSERT INTO test_jfg_ws.test_jfg_ws (str) VALUES ('z')"
If everything goes as expected (the INSERT on the waiting node is run while the ALTER is being applied and before the INSERT from the other node is executed), we get the output below from the above script on the waiting member.
*************************** 1. row ***************************
COUNT_CONFLICTS_DETECTED: 0
*************************** 1. row ***************************
nb_z: 1
ERROR 3101 (HY000) at line 4: Plugin instructed the server to rollback the current transaction.
*************************** 1. row ***************************
nb_z: 0
*************************** 1. row ***************************
COUNT_CONFLICTS_DETECTED: 1
As shown by the "nb_z: 1" line in the output above, the INSERT on the waiting member completed (we are not at COMMIT yet), so it is the COMMIT of the transaction that prints "ERROR 3101" (not the INSERT).  After that, we have no row with value "z" in the table (nb_z: 0), which proves that the INSERT of the other member is not yet executed on this member (only certified).  So thanks to Write Set, certification detects the conflict and the transaction is rolled back during COMMIT.

Let's look again at the last sentence of the previous paragraph: "certification detects the conflict and the transaction is rolled back during COMMIT".  I do not know of a situation where COMMIT can return an error on a standalone MySQL instance or on a master with standard or semi-sync slaves (if you know of such a situation, please share it in the comments).  In those deployments, COMMIT can hang (and eventually time out in the application, leaving a hung session on the MySQL server), or the connection to MySQL can be lost.  In both cases, as seen from the application, the result of COMMIT is undefined (it might have succeeded or the transaction might have been rolled back), but COMMIT did not return an error.  This is a new failure mode in Group Replication, so if you deploy a group in multi-primary mode, make sure that your application checks for COMMIT errors, as they can happen frequently if your workload generates many conflicts.
In Group Replication, COMMIT can fail (returning an error)!
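
To make this more concrete, below is a minimal sketch (bash and the mysql command-line client; this is my addition, not one of the test scripts) of what checking for a COMMIT error could look like.  A real application would do this through its connector and would retry the whole transaction, not only the COMMIT:
# sql="BEGIN;
>      INSERT INTO test_jfg_ws.test_jfg_ws (str) VALUES ('z');
>      COMMIT"
# if ! mysql <<< "$sql" 2> /tmp/commit_err; then
>   # An ERROR 3101 like the one above means certification rolled back the
>   # transaction at COMMIT time, so the whole transaction must be retried.
>   grep -q "ERROR 3101" /tmp/commit_err && echo "Certification conflict: retry the transaction"
> fi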
We have now seen certification in action to detect conflicts; let's now look at the impact of the Write Set bug presented in the first post of this series.  As a quick reminder, Write Set (in MySQL 5.7.20 and 8.0.3) does not identify transaction dependencies correctly when combined with a case-insensitive collation (Bug#86078).  Let's see what happens if we run a test case similar to the one above, but this time with a conflict involving data that differs only in case.
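
Before running the test, a quick illustration of why 'y' and 'Y' conflict at all.  With a case-insensitive collation (the default on my test instance; this query is an illustration I added, not part of the test), the two strings compare as equal, so the UNIQUE KEY on str treats them as duplicates, while the buggy Write Set hashing treats them as different values:
> SELECT 'y' = 'Y' AS equal_ci, 'y' = BINARY 'Y' AS equal_bin;
+----------+-----------+
| equal_ci | equal_bin |
+----------+-----------+
|        1 |         0 |
+----------+-----------+
1 row in set (0.00 sec)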
### On one member, run this in a shell:
# while sleep 1; do
>   mysql <<< "SHOW PROCESSLIST" | grep -q "test_jfg_ws.test_jfg_ws2" || continue
>   mysql <<< "INSERT INTO test_jfg_ws.test_jfg_ws (str) VALUES ('Y')"
>   break
> done

### On a different member, run this in a shell:
# mysql <<< "
>   ALTER TABLE test_jfg_ws.test_jfg_ws2 MODIFY COLUMN str VARCHAR(60);
>   INSERT INTO test_jfg_ws.test_jfg_ws (str) VALUES ('y')"
Running both scripts above succeeds without any scary output, but a few seconds later, in MySQL 5.7.20 and 8.0.3, Group Replication breaks.  We can look at the state of the group by running the following:
> SELECT MEMBER_ID, MEMBER_STATE FROM performance_schema.replication_group_members;
+--------------------------------------+--------------+
| MEMBER_ID                            | MEMBER_STATE |
+--------------------------------------+--------------+
| _member_uuid_                        | ERROR        |
+--------------------------------------+--------------+
1 row in set (0.00 sec)
I also have the following in the error log of the waiting member (where the "while" loop was run; line wrapping added for better readability):
2018-01-10T16:33:47.206156Z 57 [ERROR] Slave SQL for channel 'group_replication_applier':
  Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'y' for key 'str',
  Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062

2018-01-10T16:33:47.206189Z 57 [Warning] Slave: Duplicate entry 'y' for key 'str' Error_code: 1062

2018-01-10T16:33:47.206193Z 57 [ERROR] Error running query, slave SQL thread aborted.
  Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'FIRST' position 144

2018-01-10T16:33:47.206204Z 57 [ERROR] Plugin group_replication reported: 'The applier thread execution was aborted.
  Unable to process more transactions, this member will now leave the group.'

2018-01-10T16:33:47.206257Z 54 [ERROR] Plugin group_replication reported: 'Fatal error during execution
  on the Applier process of Group Replication. The server will now leave the group.'

2018-01-10T16:33:47.206325Z 54 [ERROR] Plugin group_replication reported: 'The server was automatically set into
  read only mode after an error was detected.'
The error log of the other member (where the ALTER was run) contains something very similar; only the first three lines are a little different:
2018-01-10T16:33:43.473575Z 28 [ERROR] Slave SQL for channel 'group_replication_applier':
  Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'Y' for key 'str',
  Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062

2018-01-10T16:33:43.473603Z 28 [Warning] Slave: Duplicate entry 'Y' for key 'str' Error_code: 1062

2018-01-10T16:33:43.473608Z 28 [ERROR] Error running query, slave SQL thread aborted.
  Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'FIRST' position 65
We can also see the following in Performance Schema on the two members respectively:
> SELECT * FROM performance_schema.replication_applier_status_by_worker
    ->   WHERE CHANNEL_NAME = 'group_replication_applier'\G
*************************** 1. row ***************************
         CHANNEL_NAME: group_replication_applier
            WORKER_ID: 0
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: UUID:111
    LAST_ERROR_NUMBER: 1062
   LAST_ERROR_MESSAGE: Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws;
Duplicate entry 'y' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY;
the event's master log FIRST, end_log_pos 168
 LAST_ERROR_TIMESTAMP: 2018-01-10 17:33:47
1 row in set (0.00 sec)

> SELECT * FROM performance_schema.replication_applier_status_by_worker
    ->   WHERE CHANNEL_NAME = 'group_replication_applier'\G
*************************** 1. row ***************************
         CHANNEL_NAME: group_replication_applier
            WORKER_ID: 0
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: UUID:1000003
    LAST_ERROR_NUMBER: 1062
   LAST_ERROR_MESSAGE: Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws;
Duplicate entry 'Y' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY;
the event's master log FIRST, end_log_pos 168
 LAST_ERROR_TIMESTAMP: 2018-01-10 17:33:43
1 row in set (0.00 sec)
Finally, let's look at the data stored on each node.  On the node where we ran the while loop and on the node where we ran the ALTER, we respectively have:
> SELECT * FROM test_jfg_ws.test_jfg_ws;
+----+-----+
| id | str |
+----+-----+
| 13 | Y   |
|  4 | z   |
+----+-----+
2 rows in set (0.00 sec)

> SELECT * FROM test_jfg_ws.test_jfg_ws;
+----+-----+
| id | str |
+----+-----+
| 11 | y   |
|  4 | z   |
+----+-----+
2 rows in set (0.00 sec)
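
Rather than eyeballing SELECT output, a quick way to confirm the divergence (CHECKSUM TABLE is standard MySQL, but this step is my addition to the test) is to compare a table checksum on both members; the two values will differ:
> CHECKSUM TABLE test_jfg_ws.test_jfg_ws;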

At this point, we have a data consistency problem (and an availability problem, as the group does not accept writes anymore).  So I opened Bug#89194: Wrong certification lead to data inconsistency and GR breakage.

Also, and I will not discuss those in more detail here, my tests on Group Replication allowed me to find a few behaviors that I think should be improved or that I consider wrong.  So I opened the following bugs/feature requests:

And this feature request from the last post is still relevant:

Fixing this broken group will be the subject of a future post.
