After upgrading some of our slaves to latest 5.7, I have found what looks like a serious regression introduced in MySQL 5.7.
A couple weeks ago I noticed that the error log file of one of our clusters, where I had implemented my in place transparent compression of binary logs, was literally flooded by the following error:
[ERROR] Binlog has bad magic number; It's not a binary log file that can be used by this version of MySQL
In the above setup this is an harmless error, and it should only happen at server startup, where mysqld opens and reads all available binary log files. The error is due to the fact that since files are now compressed, mysqld doesn't recognize them as valid - not an issue, as only older files are compressed, and only after ensuring all slaves already got them.
But I couldn't understand why the error was now hitting every minute or so, after upgrading from 5.6 to 5.7.
After a short investigation I was able to relate the errors to the exact moment where a binary log was closed and a new one created, this normally happens when the binary log file size reaches the value you have set in max_binlog_size variable. But this indicated that mysqld was continuously trying to open and read all the available binary logs, each time a file was rotated.
Ouch! This wasn't a nice finding.
On this server we have in excess of 3,500 binary log files, and I know about installations which have much more (mind you, we have a retention of 4 days only here). We like to keep binlog files small, we have them sized at 100 MB as this makes easier to peruse them when troubleshooting with mysqlbinlog. Anyways this wouldn't change anything, I mean, having 3,500 100M files or 350 1G files would generate exactly the same amount of I/O when you wanted to read them all....
Anyways, I had just found that every time a binary log is rotated, mysqld needs to open all of the files. What a waste of I/O and performances, I thought. So I decided to have a look at the code to understand what mysqld was actually trying to do with those binlogs.
It wasn't too difficult to narrow down the binlog scan with the function purge_logs_before_date()
that expires logs based on time, as set by the expire_logs_days variable.
But reading the binary log files shouldn't be really needed when you're only looking to expire them, so this looked weird to me.
Further digging into the source code shows that mysqld is actually looking for "lost GTIDs" in the binlog files, and does it every time a binary log is closed for rotation. I haven't looked deeper into what a "lost" GTID means in the code, however whatever the reason for this search is, for sure it shouldn't be done repeatedly at every binlog rotation.
The weird thing is, we aren't even using GTIDs!
But still mysqld opens and reads ALL the available binary log files all the time. This is a serious regression compared to 5.6 where this wasn't happening, and it can hit you hard when you have a large number of binary log files. Caveat emptor.
I have raised a bug for the above, and really hope Oracle gives it the needed attention.
Showing posts with label mysql. Show all posts
Showing posts with label mysql. Show all posts
Monday, December 11, 2017
Monday, November 6, 2017
when xtrabackup fails with 'read' returned OS error 0 after upgrading to 5.7
Here's something that has puzzled me for several weeks.
Right after migrating MySQL from 5.6 to 5.7, we started experiencing random xtrabackup failures on some, but not all, of our slaves.
The failures were only happening when taking an incremental backup, and it would always fail on the same table on each slave, with errors similar to the following:
171106 13:00:33 [01] Streaming ./gls/C_GLS_IDS_AUX.ibd
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: Retry attempts for reading partial data failed.
InnoDB: Tried to read 262144 bytes at offset 0, but was only able to read 114688
InnoDB: File (unknown): 'read' returned OS error 0. Cannot continue operation
InnoDB: Cannot continue operation.
Searching the web didn't yield any useful result.
The error was happening on one slave but not on another, OS, MySQL and Xtrabackup versions being the same:
I even tried to dump the offending schema from a failing slave and import it into a non failing slave, but on the non-failing slave the same table in that schema backed up just fine.
This has been going on for a while, where I used to run without incremental backups on the migrated slaves, until today, when it happened again on yet another slave that we had just migrated, and I had an intuition: since the failure of the incremental backup was only happening after the migration, of course it had to be related to the migration itself. So I started suspecting some incompatibility between 5.7 and/or xtrabackup, and tablespaces created with MySQL 5.6.
Verifying my suspect was as easy as running:
ALTER TABLE C_GLS_IDS_AUX Engine=InnoDB
Et voilĂ !! The rebuilt tablespace was not crashing the incremental backup anymore.
Eventually, another table(space) would crash the backup again later, but running a NULL alter table on it and relaunching the backup would fix that. Wash, rinse, repeat until the incremental completes, and you're done.
Hope this helps someone saving some time..
And if you are looking for xtrabackup based backup scripts, feel free to check my own set here on my GitHub page.
ADDENDUM
Looks like I have been a little overenthusiastic about this.
In fact, the problem reappeared the day after writing this blog post - apparently, the null alter table only fixes it temporarily.
Turns out this has to be some incompatibility between Xtrabackup 2.4.x and MySQL 5.7.x as I was able to reproduce the issue. Please see this bug if you are interested in the details.
Right after migrating MySQL from 5.6 to 5.7, we started experiencing random xtrabackup failures on some, but not all, of our slaves.
The failures were only happening when taking an incremental backup, and it would always fail on the same table on each slave, with errors similar to the following:
171106 13:00:33 [01] Streaming ./gls/C_GLS_IDS_AUX.ibd
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: Retry attempts for reading partial data failed.
InnoDB: Tried to read 262144 bytes at offset 0, but was only able to read 114688
InnoDB: File (unknown): 'read' returned OS error 0. Cannot continue operation
InnoDB: Cannot continue operation.
Searching the web didn't yield any useful result.
The error was happening on one slave but not on another, OS, MySQL and Xtrabackup versions being the same:
- CentOS release 6.8 (Final)
- xtrabackup version 2.4.8 based on MySQL server 5.7.13 Linux (x86_64) (revision id: 97330f7)
- mysqld Ver 5.7.19-17 for Linux on x86_64 (Percona Server (GPL), Release 17, Revision e19a6b7b73f)
I even tried to dump the offending schema from a failing slave and import it into a non failing slave, but on the non-failing slave the same table in that schema backed up just fine.
This has been going on for a while, where I used to run without incremental backups on the migrated slaves, until today, when it happened again on yet another slave that we had just migrated, and I had an intuition: since the failure of the incremental backup was only happening after the migration, of course it had to be related to the migration itself. So I started suspecting some incompatibility between 5.7 and/or xtrabackup, and tablespaces created with MySQL 5.6.
Verifying my suspect was as easy as running:
ALTER TABLE C_GLS_IDS_AUX Engine=InnoDB
Et voilĂ !! The rebuilt tablespace was not crashing the incremental backup anymore.
Eventually, another table(space) would crash the backup again later, but running a NULL alter table on it and relaunching the backup would fix that. Wash, rinse, repeat until the incremental completes, and you're done.
Hope this helps someone saving some time..
And if you are looking for xtrabackup based backup scripts, feel free to check my own set here on my GitHub page.
ADDENDUM
Looks like I have been a little overenthusiastic about this.
In fact, the problem reappeared the day after writing this blog post - apparently, the null alter table only fixes it temporarily.
Turns out this has to be some incompatibility between Xtrabackup 2.4.x and MySQL 5.7.x as I was able to reproduce the issue. Please see this bug if you are interested in the details.
Monday, July 3, 2017
SET PASSWORD will lock you out of your slave in a mixed 5.6/5.7 MySQL topology
Long time no post.... :-)
Here's something interesting.
Last week I decided to give MySQL 5.7 a try (yes, I am kinda conservative DBA...) and the very same day that I installed my first 5.7 replica I noticed that, after changing my own password on the 5.6 master, I could no longer connect to the 5.7 slave.
Very annoying, to say the least! So I went and dug out the root password (which we do not normally use) and when I connected to the slave I was surprised to see that my password's hash on the 5.7 slave was different than the hash on the 5.6 master. No wonder I couldn't connect....
A bit of research on the MySQL documentation and I understood that 5.7 introduced few changes around the way you work with users' passwords. SET PASSWORD is now deprecated in favour of ALTER USER: see MySQL 5.7 Reference Manual page.
So far so good, but it's obvious that when Oracle implemented those changes, they didn't consider upgrade paths, or mixed 5.6/5.7 topologies that are commonly found during a major MySQL version upgrade, like in this case.
Turns out there is a bug first reported by Gillian Gunson back in September 2016 (SET PASSWORD replicates incorrectly between 5.6 and 5.7) which Oracle acknowledged but didn't care to fix at the time of this blog post. Sorry for the rant, but I really feel this to be a real showstopper to any migration where there are many database users and where a password change may happen quite often.
As Gillian pointed out in the above bug, what really happens is that the password is hashed twice, and on the replica we get an hash of the already hashed password - no wonder it doesn't work :-)
The good news is that this is really an easy fix. I was able to patch this myself in few minutes, and the patch is one line of code.... you can find the diff below, this was applied to 5.7.18 Percona source code but I believe it will be the same in Oracle MySQL, regardless the minor version.
Basically, I added a check to avoid hashing the password if the thread is a slave thread and the command is a SET PASSWORD. This check may break the use of SET PASSWORD when master will be on 5.7, I didn't check that - but hey, SET PASSWORD is deprecated anyways... and I will only use my patched version of 5.7 during the migration.
Enjoy!
Here's something interesting.
Last week I decided to give MySQL 5.7 a try (yes, I am kinda conservative DBA...) and the very same day that I installed my first 5.7 replica I noticed that, after changing my own password on the 5.6 master, I could no longer connect to the 5.7 slave.
Very annoying, to say the least! So I went and dug out the root password (which we do not normally use) and when I connected to the slave I was surprised to see that my password's hash on the 5.7 slave was different than the hash on the 5.6 master. No wonder I couldn't connect....
A bit of research on the MySQL documentation and I understood that 5.7 introduced few changes around the way you work with users' passwords. SET PASSWORD is now deprecated in favour of ALTER USER: see MySQL 5.7 Reference Manual page.
So far so good, but it's obvious that when Oracle implemented those changes, they didn't consider upgrade paths, or mixed 5.6/5.7 topologies that are commonly found during a major MySQL version upgrade, like in this case.
Turns out there is a bug first reported by Gillian Gunson back in September 2016 (SET PASSWORD replicates incorrectly between 5.6 and 5.7) which Oracle acknowledged but didn't care to fix at the time of this blog post. Sorry for the rant, but I really feel this to be a real showstopper to any migration where there are many database users and where a password change may happen quite often.
As Gillian pointed out in the above bug, what really happens is that the password is hashed twice, and on the replica we get an hash of the already hashed password - no wonder it doesn't work :-)
The good news is that this is really an easy fix. I was able to patch this myself in few minutes, and the patch is one line of code.... you can find the diff below, this was applied to 5.7.18 Percona source code but I believe it will be the same in Oracle MySQL, regardless the minor version.
Basically, I added a check to avoid hashing the password if the thread is a slave thread and the command is a SET PASSWORD. This check may break the use of SET PASSWORD when master will be on 5.7, I didn't check that - but hey, SET PASSWORD is deprecated anyways... and I will only use my patched version of 5.7 during the migration.
Enjoy!
*** sql/auth/sql_user.cc Mon Jul 3 16:18:55 2017 --- sql/auth/sql_user.cc.patch Mon Jul 3 16:07:01 2017 *************** *** 565,570 **** --- 565,574 ---- */ if ((Str->uses_identified_by_clause || (Str->auth.length == 0 && !user_exists)) + // Rick Pizzi: don't change to hash if thread is a replication thread and a + // SET PASSWORD was issued on master, to avoid double hash + // (fix for https://bugs.mysql.com/bug.php?id=82979) + && (!thd->slave_thread || thd->lex->sql_command != SQLCOM_SET_OPTION)) { st_mysql_auth *auth= (st_mysql_auth *) plugin_decl(plugin)->info; inbuf= Str->auth.str;
Tuesday, May 9, 2017
Slides from Percona Live 2017: Caching your application data with MySQL and TokuDB
Here are the slides from my talk at Percona Live 2017 conference in Santa Clara, CA.
This has been my first opportunity to speak to a public audience and it has been really great!
Looking forward to speak about MySQL again soon. Thank you!!
This has been my first opportunity to speak to a public audience and it has been really great!
Looking forward to speak about MySQL again soon. Thank you!!
Wednesday, April 5, 2017
Caching your application data with MySQL and TokuDB @ Percona Live 2017
The great Percona Live 2017 database conference is approaching, and I am very excited to be there this year - not only as an attendee, but also as a speaker!!
By the way, this will be my first public talk so please bear with me if I'll be a bit nervous :-)
If you are attending, come see me and my colleague Andrea speaking about how we leverage the TokuDB engine to cache our application data using MySQL! We will show you some interesting benchmarks, and will describe our design, setup, and configuration in detail.
See you there!!
By the way, this will be my first public talk so please bear with me if I'll be a bit nervous :-)
If you are attending, come see me and my colleague Andrea speaking about how we leverage the TokuDB engine to cache our application data using MySQL! We will show you some interesting benchmarks, and will describe our design, setup, and configuration in detail.
See you there!!
Monday, February 6, 2017
How to expand a striped LVM database volume in Amazon AWS without downtime
This procedure can be used to expand an LVM database volume on Amazon AWS (but also apply to any storage area network environment equally).
Let me start with this assumption: when you create volumes for database use in AWS using EBS, you stripe data across them in order to enhance performance. If you aren't doing this... well, you should :-)
Under this assumption, when you need to add more disk space to an existing database volume, you can't just add the disk(s) to the volume, as this would make the added space non striped, and would eventually create hotspots in the dataset. The correct approach in this situation is to create a number of new EBS disks enough to contain entire dataset plus the desired added space,so that you can grow the existing dataset while re-striping properly.
To make this clear, let's suppose you have a dataset volume of 3 TB, made of 3 1TB EBS volumes which are striped across, but space is running out. You are in a hurry, so you quickly add a new 1 TB EBS volume to the AWS server, add it to the dataset LVM, and expand the volume, as follows:
# add the new EBS disk to LVM
pvcreate /dev/xvdl
# add the new EBS disk to the dataset volume
vgextend dataset /dev/xvdl
# extend the dataset volume using the new EBS disk
lvextend -i 1 -r /dev/dataset/db /dev/xvdl
This will add space to the dataset volume, however, the fourth EBS volume is just appended to the original striped dataset volume, not striped, thus creating possible hotspots.
The good news is that this can be done without downtime, if you follow the steps below.
IMPORTANT: before proceeding, make sure you have valid backups of the database!!
Now, let's suppose you have added 4 EBS 1 TB disks to the server already, with following path:
The original volume, after you added the "emergency" 1 TB EBS disk above, is using:
The procedure involves migrating the dataset from one group of EBS disks to the other by setting up a mirror, let it sync, then dropping the old mirror side to release the old disks. The important step here is to create the mirror so that it will stripe across the EBS disks properly.
pvcreate /dev/xvd[ponm]
# add new EBS disks to dataset volume
vgextend dataset /dev/xvd[ponm]
# create a mirror out of existing dataset volume
lvconvert -b --stripes 4 -m1 --mirrorlog core /dev/dataset/db /dev/xvd[ponm]
The above commands just add the new disks to LVM, then to the volume, and then the mirror process is started.
You have to wait until the mirror sync is complete, this may take some time, even days depending on the dataset size. You can verify it is synced using the following command and output:
# lvs -a -o +devices
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices
bck backup -wi-ao---- 3.00t /dev/xvdi(0),/dev/xvdj(0),/dev/xvdk(0)
db dataset mwi-aom--- 4.00t 100.00 db_mimage_0(0),db_mimage_1(0)
[db_mimage_0] dataset iwi-aom--- 4.00t /dev/xvdf(0),/dev/xvdg(0),/dev/xvdh(0)
[db_mimage_0] dataset iwi-aom--- 4.00t /dev/xvdl(0)
[db_mimage_1] dataset iwi-aom--- 4.00t /dev/xvdm(0),/dev/xvdn(0),/dev/xvdo(0),/dev/xvdp(0)
You need to check that he copy sync column shows 100% synced. You can run this command to see progress at any time. The syncing is done at very low priority and there is little to no overhead, and zero impact on database. In the example above you can clearly see that the db_mimage0 logical volume, which includes the original side of the mirror, is actually made by 3 striped disks (f, g, h) and one single striped disk (l), this is what we want to avoid and if you look at db_mimage1 (the new mirror side) you can clearly see it is properly striped over 4 disks now.
Once the mirror is in sync we can safely drop the original side of the mirror, then remove the physical disks from LVM and destroy them on the EBS side:
lvconvert -m0 /dev/dataset/db /dev/xvd[fghl]
# removes EBS disks from the dataset VG
vgreduce dataset /dev/xvd[fghl]
# removes EBS disks from LVM
pvremove /dev/xvd[fghl]
Done!! You can safely go to the AWS console and detach the EBS disks you just removed from LVM.
Be careful not to remove the wrong disks, or you will destroy your dataset and database!
Tuesday, January 10, 2017
Galera Cluster: adding a primary key to a MySQL table which lacks it... without downtime
OK, let me start with saying that a table without a primary key shouldn't be something that a DBA should ever stumble into. Sure, InnoDB will secretly add one for us without telling - but that's not something we can use at application or administration level, it's just to make the engine happy.
So let's suppose you find some tables that lack a primary key, of course you need to do something about it, right? Now, put Galera Cluster in the mix - Galera does not support tables without a primary key and will, secretly again, make that table inconsistent at cluster level.
You need to fix the damn table(s)!! And this is where the fun begins... as you can't afford downtime for the operation so you need to resort to an online schema change of some type.
Galera cluster is a very nice HA solution and personally I have embraced it since few years now. I really love it, especially its ability to self-heal when something happens (and when you're in the cloud, there's always something happening). Galera is very different from the classic MySQL replication, and the biggest difference is the fact that it replicates synchronously instead of asynchronously. Now, I will not get into details about how Galera works, there are plenty of blog posts and very good documentation online. But there is something important that you need to be aware of, when running Galera: in the standard operating mode of the cluster, any DDL that you will execute will freeze the entire cluster for the entire duration of the statement. They call this "Total Order Isolation" (TOI) and it is the default setting for the cluster.
Guess what. Months after migrating one of our clusters from classic master-master topology to Galera Cluster, I suddenly found myself in need to clean up some old and barely used schemas, which contained quite a few tables lacking a primary key (or any index, for that matter). My bad, I spent a lot of time checking the main schemas and tables and making sure they were optimized for Galera, but I left the unimportant ones behind and never went back checking on them.
So, here I am with my apparently simple (although boring) task - I need to add a primary key to a set of tables and I don't want downtime and/or block the cluster as a whole in the process.
Preliminary checks with DEVs confirmed that all INSERTs going into the tables I needed to fix were using an explicit field list, so the change was compatible (it is important to remember to check this, in order to avoid application failures).
So, first thing that I tried was to just add the PK using the great pt-online-schema-change tool from the Percona toolkit, which supports Galera Cluster just fine. Unfortunately, I was quickly remembered (by an error message) that the tool cannot operate if the table lacks a primary key or unique index, except when the column that you want to be primary key is already present in the table, which was not my case. D'oh!
Failing that, I thought about running the ALTER statement directly, without using an online schema change tool. Of course, this would have to be done on all nodes, one node at a time, after changing the wsrep_OSU_method global variable of the cluster to 'RSU'. RSU stands for "Rolling Schema Upgrade" - more info here. In this mode, any DDL that you execute will only apply to the current node and will not be propagated to other nodes; the node will be evicted from the cluster during the operation, and will rejoin afterwards, but again, the DDL will NOT replicate.
So you have to run it on each node in order to complete the change. Boring, and not very practical, right. May be good to add an index on a write-heavy critical table, but of course not to change a table definition... especially if you got many tables to change! But I was prepared to do it at any cost anyways (I eventually felt guilty for not checking this before the migration), so I went to one of the nodes, changed the OSU method to RSU, and executed the following DDL:
The operation completed OK, but shortly thereafter, I lost connection to the node (infamous MySQL has gone away.....) WTF?!? I verified and the mysqld process was dead. While it is not a big deal for a 3 node cluster to lose one node in our setup, I did not expect the node to die as a consequence of a DDL. But a quick look at the error log, and I immediately realized what the problem was:
What did just happen? Galera replicates between nodes using ROW based replication - this means that the row data gets replicated, instead of the statement that inserts or updates it. This is done for many good reasons (compatibility with non deterministic statements, for example). But in my situation, this was biting me hard - a row inserted by the application in one of the other cluster nodes reached my node where the table structure had been altered, the local node found that the first field of the row was of a different type, and failed as a result. In order to preserve consistency, after some unsuccessful retries, Galera aborted the MySQL server on the node.
Epic fail....
Note that the only difference here is that I omitted the FIRST clause at the end this time. This will cause the new column to be added to the bottom of the table, as last column.
After executing the DDL I anxiously watched the table waiting for a replicated row to arrive... and I was really happy to see that rows were entering without problems now. Yay!
But wait, how could I now repeat these steps on the other nodes in a consistent way? The new shiny primary key was an auto_increment one (and it had to be, because the application couldn't populate it for me, it didn't even know this column existed), but if I was to repeat the above DDL on other nodes, I would get different PK values inserted in each row - Galera Cluster uses different auto_increment values on each node by design (this is something we also use in classic master-master setups by the way). One option could have been to play with the global variables and change the increment manually on each node before executing my DDL, then put it back to what it was.... but I felt this too dangerous and error prone, and also a bit complicated to repeat for each and every table that I had to fix.
And, last but not least, I wanted the damn PK to be first column in the table (yes I'm picky).
But wait again, now that I've got a PK on this table - pt-online-schema-change could do the hard work for me maybe? pt-osc normally works on the master, creating a copy of the original table with the desired changes, and populates it by writing chunks of rows read from the original table. The original table is never written to, and is not checked on the other nodes. It shouldn't matter that the original table does not have a PK on the other nodes. I thought the tool should be able to complete my change.
I modified the wsrep_OSU_method and put it back to the default value of 'TOI', so that all changes would be propagated (remember that pt-osc executes DDLs to swap tables at the end of the copy process). I fired the tool with the following --alter option:
ADDENDUM:
There is an important step that I forgot to mention in the above blog post (please bear with me) . Of course the node that you set to RSU, and where you run the ALTER statement(s), should not be receiving any write traffic during the operation, to avoid stalls.
But the very important step is: after you ALTER all your tables on the above node, and *before* you start running your pt-online-schema-change session(s), you need to make this node your only master (all writes need to go to this node!) until you're done with pt-osc.
The reason is that the CREATE TRIGGER statements that pt-osc uses to keep your new table up to date while it runs, will fail on the other nodes, because the triggers needs to be created on the column you have just added, and that column is not present on the other nodes yet.
Beware, if you don't follow the above step, all the changes performed by the applications to the table(s) that happen while you run the pt-osc will be lost.
Make sure these changes only come to the node where you run pt-osc, and everything will be safe.
So let's suppose you find some tables that lack a primary key, of course you need to do something about it, right? Now, put Galera Cluster in the mix - Galera does not support tables without a primary key and will, secretly again, make that table inconsistent at cluster level.
You need to fix the damn table(s)!! And this is where the fun begins... as you can't afford downtime for the operation so you need to resort to an online schema change of some type.
Galera cluster is a very nice HA solution and personally I have embraced it since few years now. I really love it, especially its ability to self-heal when something happens (and when you're in the cloud, there's always something happening). Galera is very different from the classic MySQL replication, and the biggest difference is the fact that it replicates synchronously instead of asynchronously. Now, I will not get into details about how Galera works, there are plenty of blog posts and very good documentation online. But there is something important that you need to be aware of, when running Galera: in the standard operating mode of the cluster, any DDL that you will execute will freeze the entire cluster for the entire duration of the statement. They call this "Total Order Isolation" (TOI) and it is the default setting for the cluster.
Guess what. Months after migrating one of our clusters from classic master-master topology to Galera Cluster, I suddenly found myself in need to clean up some old and barely used schemas, which contained quite a few tables lacking a primary key (or any index, for that matter). My bad, I spent a lot of time checking the main schemas and tables and making sure they were optimized for Galera, but I left the unimportant ones behind and never went back checking on them.
So, here I am with my apparently simple (although boring) task - I need to add a primary key to a set of tables and I don't want downtime and/or block the cluster as a whole in the process.
Preliminary checks with DEVs confirmed that all INSERTs going into the tables I needed to fix were using an explicit field list, so the change was compatible (it is important to remember to check this, in order to avoid application failures).
So, first thing that I tried was to just add the PK using the great pt-online-schema-change tool from the Percona toolkit, which supports Galera Cluster just fine. Unfortunately, I was quickly remembered (by an error message) that the tool cannot operate if the table lacks a primary key or unique index, except when the column that you want to be primary key is already present in the table, which was not my case. D'oh!
Failing that, I thought about running the ALTER statement directly, without using an online schema change tool. Of course, this would have to be done on all nodes, one node at a time, after changing the wsrep_OSU_method global variable of the cluster to 'RSU'. RSU stands for "Rolling Schema Upgrade" - more info here. In this mode, any DDL that you execute will only apply to the current node and will not be propagated to other nodes; the node will be evicted from the cluster during the operation, and will rejoin afterwards, but again, the DDL will NOT replicate.
So you have to run it on each node in order to complete the change. Boring, and not very practical, right. May be good to add an index on a write-heavy critical table, but of course not to change a table definition... especially if you got many tables to change! But I was prepared to do it at any cost anyways (I eventually felt guilty for not checking this before the migration), so I went to one of the nodes, changed the OSU method to RSU, and executed the following DDL:
ALTER TABLE myschema.mytable ADD id INT(11) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
The operation completed OK, but shortly thereafter, I lost connection to the node (infamous MySQL has gone away.....) WTF?!? I verified and the mysqld process was dead. While it is not a big deal for a 3 node cluster to lose one node in our setup, I did not expect the node to die as a consequence of a DDL. But a quick look at the error log, and I immediately realized what the problem was:
2017-01-10 13:53:06 16667 [ERROR] Slave SQL: Column 0 of table 'myschema.mytable' cannot be converted from type 'char(3)' to type 'int(11) unsigned', Error_code: 1677
2017-01-10 13:53:06 16667 [ERROR] WSREP: Failed to apply trx 760719957 4 times
2017-01-10 13:53:06 16667 [ERROR] WSREP: Node consistency compromized, aborting...
Epic fail....
But I still needed to fix the damn tables. While the node was being (automatically) rebuilt, I spent some time thinking about a way to complete this task. Replication had failed because the first column of the modified table was found to be different type vs. the first column of the table on other nodes. Hmmm, what about moving the new PK column to the end of the table instead of keeping it at the top, I thought. Ugly for sure, but I was curious to see if replication would work correctly in this case. In theory, the columns are matched by position in RBR, so replication should not complain after all.
Pronto, as soon as the node came back in service I set it once again in RSU mode and executed my DDL again:
Pronto, as soon as the node came back in service I set it once again in RSU mode and executed my DDL again:
ALTER TABLE myschema.mytable ADD id INT(11) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;
Note that the only difference here is that I omitted the FIRST clause at the end this time. This will cause the new column to be added to the bottom of the table, as last column.
After executing the DDL I anxiously watched the table waiting for a replicated row to arrive... and I was really happy to see that rows were entering without problems now. Yay!
But wait, how could I now repeat these steps on the other nodes in a consistent way? The new shiny primary key was an auto_increment one (and it had to be, because the application couldn't populate it for me, it didn't even know this column existed), but if I was to repeat the above DDL on other nodes, I would get different PK values inserted in each row - Galera Cluster uses different auto_increment values on each node by design (this is something we also use in classic master-master setups by the way). One option could have been to play with the global variables and change the increment manually on each node before executing my DDL, then put it back to what it was.... but I felt this too dangerous and error prone, and also a bit complicated to repeat for each and every table that I had to fix.
And, last but not least, I wanted the damn PK to be first column in the table (yes I'm picky).
But wait again, now that I've got a PK on this table - pt-online-schema-change could do the hard work for me maybe? pt-osc normally works on the master, creating a copy of the original table with the desired changes, and populates it by writing chunks of rows read from the original table. The original table is never written to, and is not checked on the other nodes. It shouldn't matter that the original table does not have a PK on the other nodes. I thought the tool should be able to complete my change.
I modified the wsrep_OSU_method and put it back to the default value of 'TOI', so that all changes would be propagated (remember that pt-osc executes DDLs to swap tables at the end of the copy process). I fired the tool with the following --alter option:
modify column id INT(11) UNSIGNED NOT NULL AUTO_INCREMENT FIRST
This basically told the tool to alter the table, and just move the id column on top.
pt-osc diligently created a copy of the table with the PK moved to 1st column, and filled it with all data from the original table that I had previously altered; then swapped the original and the new table.
That did the trick, since all the operations performed by pt-osc got propagated to all nodes of the cluster via replication. When pt-osc completed, the table was consistent on all nodes, and with a brand new auto_increment PK on top.
pt-osc diligently created a copy of the table with the PK moved to 1st column, and filled it with all data from the original table that I had previously altered; then swapped the original and the new table.
That did the trick, since all the operations performed by pt-osc got propagated to all nodes of the cluster via replication. When pt-osc completed, the table was consistent on all nodes, and with a brand new auto_increment PK on top.
Mission accomplished!!
Bottom line: remember to always add primary keys to your tables...
ADDENDUM:
There is an important step that I forgot to mention in the above blog post (please bear with me) . Of course the node that you set to RSU, and where you run the ALTER statement(s), should not be receiving any write traffic during the operation, to avoid stalls.
But the very important step is: after you ALTER all your tables on the above node, and *before* you start running your pt-online-schema-change session(s), you need to make this node your only master (all writes need to go to this node!) until you're done with pt-osc.
The reason is that the CREATE TRIGGER statements that pt-osc uses to keep your new table up to date while it runs, will fail on the other nodes, because the triggers needs to be created on the column you have just added, and that column is not present on the other nodes yet.
Beware, if you don't follow the above step, all the changes performed by the applications to the table(s) that happen while you run the pt-osc will be lost.
Make sure these changes only come to the node where you run pt-osc, and everything will be safe.
Wednesday, December 14, 2016
In place transparent compression of MySQL binary logs
Any DBA who has administered a busy master knows how fast the disk space occupied by binary logs may grow. DBAs have no control on this: the growth depends on the workload, and the workload depends on many factors, e.g.:
- application changes (the applications start writing more due to code changes)
- traffic changes (the peak season arrive, your workload doubles in size)
- infrastructure changes (the devops add more servers)
- business changes (new business flows adds to existing workload)
So either you have being thoughtful and have planned in advance for a large enough storage space (to handle the increase in number of binary logs), or, sooner or later, you will face the usual dilemma - how many retention days dare you give up to accommodate for the binlog growth?
In my case, I was very thoughtful, but the boss didn't listen and gave me servers with very limited binlog storage space and, more important, using internal disks so no way to add more disk space on demand. Pity, as I had previously found a way to move binlogs to a different filesystem without downtime. But unfortunately, there is no "different filesystem" this time. I'm stuck with what I have.
So, the obvious answer at first was to reduce the retention of binary logs via the
expire_logs_days variable. Not the ideal option, as in my situation, I had to reduce it to a mere 2 days, not even enough to survive a weekend. At the same time I didn't like the idea to keep more days and be waken up at 3am by a filesystem full alert.
expire_logs_days variable. Not the ideal option, as in my situation, I had to reduce it to a mere 2 days, not even enough to survive a weekend. At the same time I didn't like the idea to keep more days and be waken up at 3am by a filesystem full alert.
So the idea of compressing the binary logs came to my mind, but I wanted to do it in a transparent way, preserving the existing binlog retention mechanisms (expire_logs_days variable and PURGE BINARY LOGS command should continue to work as before) and without breaking the MySQL server and/or replication. Ideally, the timestamps of the files themselves should also preserved.
I decided to try and compress the binary logs in place. In place means that when you look at your binlogs folder, you don't notice anything unusual. Here's (a part of ) the listing of one of our servers with compression active:
-rw-rw---- 1 mysql mysql 13631245 Dec 14 14:17 dbmd01.079826
-rw-rw---- 1 mysql mysql 13909321 Dec 14 14:23 dbmd01.079827
-rw-rw---- 1 mysql mysql 13656948 Dec 14 14:30 dbmd01.079828
-rw-rw---- 1 mysql mysql 13915222 Dec 14 14:37 dbmd01.079829
-rw-rw---- 1 mysql mysql 104883640 Dec 14 14:44 dbmd01.079830
-rw-rw---- 1 mysql mysql 104898032 Dec 14 14:50 dbmd01.079831
-rw-rw---- 1 mysql mysql 104861122 Dec 14 14:57 dbmd01.079832
You can spot which files are compressed by looking at the file size, but everything else is preserved. Looking with file, here's what's in each of them:
dbmd01.079826: gzip compressed data
dbmd01.079827: gzip compressed data
dbmd01.079828: gzip compressed data
dbmd01.079829: gzip compressed data
dbmd01.079830: MySQL replication log
dbmd01.079831: MySQL replication log
dbmd01.079832: MySQL replication log
Basically, after compression, the compressed file retains the original file name. This is to keep MySQL happy - you know, these files are tracked in the index file and read when mysqld starts.
Speaking of which, here's what happens when you start mysqld with in-place compression:
2016-10-26 03:06:13 0 [Note] /usr/sbin/mysqld (mysqld 5.6.34-79.1-log) starting as process 1407 ...
[ ... ]
2016-10-26 03:06:13 1407 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.34-79.1 started; log sequence number 2722915
2016-10-26 03:06:13 1407 [ERROR] Binlog has bad magic number; It's not a binary log file that can be used by this version of MySQL
2016-10-26 03:06:13 1407 [ERROR] Binlog has bad magic number; It's not a binary log file that can be used by this version of MySQL
2016-10-26 03:06:13 1407 [ERROR] Binlog has bad magic number; It's not a binary log file that can be used by this version of MySQL
2016-10-26 03:06:13 1407 [ERROR] Binlog has bad magic number; It's not a binary log file that can be used by this version of MySQL
Although these entries are flagged as ERROR, actually mysqld just ignores the compressed files, and more importantly, does not touch them.
This way, compressed binlog files can be managed exactly like uncompressed, standard ones. Of course, I need to remember that if I need to peruse their content, I have to uncompress them before feeding them to the mysqlbinlog utility.
Last but not least, in order to ensure that we do not inadvertently compress binlogs that haven't been downloaded by all slaves yet, the following extra safety rules are enforced:
- do not compress any binlog which is not at least 1 hour old
- only compress binlogs that have a sequence number which is less than the current sequence number of the most lagging slave
The script runs every 10 minutes, extracts the list of the binary logs which aren't compressed yet, applies the above rules, then it proceeds to compress them.
I have set expire_logs_days back to 7 days and have plenty of disk space for extra growth now....
You can find the compress_binlog.sh which I crafted for the above on my GitHub page.
One caveat - script hasn't been tested with binlog sequence numbers greater than 999,999. I am actually curious to see what happens to the binlog file name when we reach one million - need to either be very patient, or have a look at the source code... :-)
Enjoy!
Rick
Thursday, November 3, 2016
Can RocksDB and InnoDB play along in MySQL?
My recent post about importing a big dataset from InnoDB into MyRocks has attracted quite a lot of attention (thank you Mark!) and also it has been pointed out that what I wrote about coexistence of MyRocks and InnoDB was incorrect.
In fact, I had been asking about it at Percona Live last month, but got a negative answer.... plus I had tried it at first but got a mysqld crash during crash recovery once, so since it was not important for my use case I went ahead and disabled InnoDB entirely.
But of course, as I have written previously, using both engines in the same server is something that I would really, really love to have, so I thought to give it a try with a very simple use case.
So I created the following two tables, each one with a different engine:
In fact, I had been asking about it at Percona Live last month, but got a negative answer.... plus I had tried it at first but got a mysqld crash during crash recovery once, so since it was not important for my use case I went ahead and disabled InnoDB entirely.
But of course, as I have written previously, using both engines in the same server is something that I would really, really love to have, so I thought to give it a try with a very simple use case.
So I created the following two tables, each one with a different engine:
Create Table: CREATE TABLE `some_innodb_table` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`descr` char(10) COLLATE latin1_bin DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_bin
Create Table: CREATE TABLE `some_rocksdb_table` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`descr` char(10) COLLATE latin1_bin DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=ROCKSDB DEFAULT CHARSET=latin1 COLLATE=latin1_bin
The idea was to fill the two tables with some data, then kill the mysqld process and verify that the crash recovery was able to recover both engines fully.
Speaking with my friend, the great René Cannaò, he suggested a very simple way to set up a quick test for the above. Basically, it would be enough to prepare a file with the needed insert statement in them (one file per engine) and then execute many parallel mysql clients, feeding the file with statements into them.
So I quickly created the two files :
$ for i in $(seq 1 1000000); do echo "insert into some_rocksdb_table values (NULL, '$i');"; done > rocksdb_testfile.sql
$ for i in $(seq 1 1000000); do echo "insert into some_innodb_table values (NULL, '$i');"; done > innodb_testfile.sql
The generated files would look like this:
insert into some_innodb_table values (NULL, '1');
insert into some_innodb_table values (NULL, '2');
insert into some_innodb_table values (NULL, '3');
insert into some_innodb_table values (NULL, '4');
insert into some_innodb_table values (NULL, '5');
.....
Now, I needed to instrument the parallel launch of several mysql clients. The idea is that when the mysql server gets killed, each of the clients will spit out the row number for the statement that it was executing when the server died; so I could just get that number for each of the clients, subtract one, and easily get the number of committed transactions that I should find on each engine. This should work because we are running in auto-commit mode, so basically each statement that we send is within its own transaction.
The trouble here was to capture the information spit out by the mysql client in an ordered manner, as all the clients would be launched in background from the shell simultaneously. So I crafted a small script that captures each output and saves it to a dedicated log (I had to be able to differentiate the sums of the two engines).
#!/bin/bash
#
echo -n "Password: "
stty -echo
read pass
stty echo
echo
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_1.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_1.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_2.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_2.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_3.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_3.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_4.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_4.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_5.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_5.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_6.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_6.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_7.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_7.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_8.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_8.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_9.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_9.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_10.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_10.log &
wait
The script will launch 20 mysql clients in parallel (by running them in background), 10 for each engine, and capture their output for later perusal. Each client will be fed one million inserts, so if you let this run to completion, it will insert 20 million rows, 10 million per table.
Before launching the above script I restarted the server with the proper setting for a crash safe master, following the recommendations from the Wiki. The config used is the one below; notable options here are sync_binlog=1, innodb_flush_log_at_trx_commit=1 for (InnoDB) and rocksdb_disable_2pc=OFF (for MyRocks).
Please note that the coexistence of any two engines in MySQL at this time has problems around XA support - I got repeatable crashes and reported these here even if this is very likely an upstream bug. I have been recommended to enable binary logging to work around this problem, and it actually worked fine when I did.
Another issue I stumbled into while trying to set up this small test is a bug regarding the auto_increment_increment setting, which I accidentally had set to 2 (we use this setup in clusters). You can check this particular issue here. Hey, turns out I have been quite lucky with my previous use case :-)
# INNO
innodb_flush_method = O_DIRECT
innodb_log_files_in_group = 2
innodb_log_file_size = 512M
innodb_flush_log_at_trx_commit = 1
innodb_file_per_table = 1
innodb_buffer_pool_size = 10G
innodb_file_format=barracuda
innodb_log_buffer_size = 64M
# ROCKS
core-file
rocksdb
#skip-innodb
default-storage-engine=rocksdb
default-tmp-storage-engine=MyISAM
collation-server=latin1_bin
binlog_format = ROW
transaction-isolation = READ-COMMITTED
log_bin = /storage/rockstat/binlog/rocks
expire_logs_days = 1
sync_binlog=1
rocksdb_disable_2pc=OFF
rocksdb_max_open_files=-1
rocksdb_base_background_compactions=1
rocksdb_max_background_compactions=8
rocksdb_max_total_wal_size=4G
rocksdb_max_background_flushes=4
rocksdb_block_size=16384
rocksdb_block_cache_size=16G
rocksdb_table_cache_numshardbits=6
rocksdb_default_cf_options=write_buffer_size=128m;target_file_size_base=64m;max_bytes_for_level_base=512m;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=10;level0_stop_writes_trigger=15;max_write_buffer_number=4;compression_per_level=kNoCompression:kNoCompression:kNoCompression:kZlibCompression:kZlibCompression:kZlibCompression;bottommost_compression=kZlibCompression;compression_opts=-14:1:0;block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true
Well, I launched my script and waited a little while. While waiting I was curious to compare the insert speed between InnoDB and RocksDB:
mysql> select count(*) as innodb from some_innodb_table; select count(*) as rocksdb from some_rocksdb_table;
+---------+
| innodb |
+---------+
| 4544377 |
+---------+
1 row in set (1.28 sec)
+---------+
| rocksdb |
+---------+
| 4361642 |
+---------+
1 row in set (1.85 sec)
I was very happy to verify that RocksDB was able to keep up with InnoDB in this little insert speed test :-) Actually, InnoDB was about 10% faster, but remember that we are running with MyRock's 2 phase commit enabled here (where it is normally OFF). Also, InnoDB insert performance will weaken over time, where MyRocks should hold just fine.
Anyways, I went and killed the mysqld process with kill -9. My script exited and left 20 log files for my perusal:
bash-4.1$ ls -l *log
-rw-r--r-- 1 root root 153 Nov 3 12:45 innodb_1.log
-rw-r--r-- 1 root root 153 Nov 3 12:45 innodb_10.log
-rw-r--r-- 1 root root 153 Nov 3 12:45 innodb_2.log
-rw-r--r-- 1 root root 153 Nov 3 12:45 innodb_3.log
-rw-r--r-- 1 root root 153 Nov 3 12:45 innodb_4.log
-rw-r--r-- 1 root root 153 Nov 3 12:45 innodb_5.log
-rw-r--r-- 1 root root 153 Nov 3 12:45 innodb_6.log
-rw-r--r-- 1 root root 153 Nov 3 12:45 innodb_7.log
-rw-r--r-- 1 root root 153 Nov 3 12:45 innodb_8.log
-rw-r--r-- 1 root root 153 Nov 3 12:45 innodb_9.log
-rw-r--r-- 1 root root 153 Nov 3 12:45 rocksdb_1.log
-rw-r--r-- 1 root root 153 Nov 3 12:45 rocksdb_10.log
-rw-r--r-- 1 root root 153 Nov 3 12:45 rocksdb_2.log
-rw-r--r-- 1 root root 153 Nov 3 12:45 rocksdb_3.log
-rw-r--r-- 1 root root 153 Nov 3 12:45 rocksdb_4.log
-rw-r--r-- 1 root root 153 Nov 3 12:45 rocksdb_5.log
-rw-r--r-- 1 root root 153 Nov 3 12:45 rocksdb_6.log
-rw-r--r-- 1 root root 153 Nov 3 12:45 rocksdb_7.log
-rw-r--r-- 1 root root 153 Nov 3 12:45 rocksdb_8.log
-rw-r--r-- 1 root root 153 Nov 3 12:45 rocksdb_9.log
Each file contained the row corresponding to the statement that each client was executing, for example:
-bash-4.1$ cat rocksdb_1.log
Warning: Using a password on the command line interface can be insecure.
ERROR 2013 (HY000) at line 476838: Lost connection to MySQL server during query
In this example, the client had inserted 476837 rows successfully and died on the 476838th.
Another small script helped me gather the final information that I needed to compare with a count(*) later:
#!/bin/bash
#
c=$(echo $(fgrep ERROR innodb*log | cut -d " " -f 6 | cut -d":" -f 1 | tr "\n" "+" | sed -e "s/+$//") | bc)
echo "InnoDB count: $((c-1))"
c=$(echo $(fgrep ERROR rocksdb*log | cut -d " " -f 6 | cut -d":" -f 1 | tr "\n" "+" | sed -e "s/+$//") | bc)
echo "RocksDB count: $((c-1))"
Here's what I got after the kill:
InnoDB count: 4985188
RocksDB count: 4769893
Cool!
Now, it was time to start MySQL again and let it do crash recovery, then compare the # of rows in each table with the above numbers, hoping for a match.
Unfortunately the fun stopped here, with the following assertion:
mysqld: /home/rpizzi/mysql-fb/rocksdb/db/version_builder.cc:142: void rocksdb::VersionBuilder::Rep::CheckConsistency(rocksdb::VersionStorageInfo*): Assertion `f1->largest_seqno > f2->largest_seqno || (f1->largest_seqno == 0 && f2->largest_seqno == 0)' failed.
12:40:42 UTC - mysqld got signal 6 ;
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x50000
/usr/local/mysql/bin/mysqld(my_print_stacktrace+0x2c)[0x9baa7c]
/usr/local/mysql/bin/mysqld(handle_fatal_signal+0x3a7)[0x6ff7f7]
/lib64/libpthread.so.0(+0xf7e0)[0x7fa8e07d07e0]
/lib64/libc.so.6(gsignal+0x35)[0x7fa8de77b5e5]
/lib64/libc.so.6(abort+0x175)[0x7fa8de77cdc5]
/lib64/libc.so.6(+0x2b70e)[0x7fa8de77470e]
/lib64/libc.so.6(__assert_perror_fail+0x0)[0x7fa8de7747d0]
/usr/local/mysql/bin/mysqld(_ZN7rocksdb14VersionBuilder6SaveToEPNS_18VersionStorageInfoE+0x143)[0xdb6493]
/usr/local/mysql/bin/mysqld(_ZN7rocksdb10VersionSet11LogAndApplyEPNS_16ColumnFamilyDataERKNS_16MutableCFOptionsERKNS_10autovectorIPNS_11VersionEditELm8EEEPNS_17InstrumentedMutexEPNS_9DirectoryEbPKNS_19ColumnFamilyOptionsE+0x3fe)[0xc838fe]
/usr/local/mysql/bin/mysqld(_ZN7rocksdb10VersionSet11LogAndApplyEPNS_16ColumnFamilyDataERKNS_16MutableCFOptionsEPNS_11VersionEditEPNS_17InstrumentedMutexEPNS_9DirectoryEbPKNS_19ColumnFamilyOptionsE+0x44)[0xc0dc54]
/usr/local/mysql/bin/mysqld(_ZN7rocksdb6DBImpl15RecoverLogFilesERKSt6vectorImSaImEEPmb+0x1b8b)[0xc0545b]
/usr/local/mysql/bin/mysqld(_ZN7rocksdb6DBImpl7RecoverERKSt6vectorINS_22ColumnFamilyDescriptorESaIS2_EEbbb+0x8dc)[0xc062bc]
/usr/local/mysql/bin/mysqld(_ZN7rocksdb2DB4OpenERKNS_9DBOptionsERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorINS_22ColumnFamilyDescriptorESaISD_EEPSC_IPNS_18ColumnFamilyHandleESaISJ_EEPPS0_+0x784)[0xc06ee4]
/usr/local/mysql/bin/mysqld(_ZN7rocksdb13TransactionDB4OpenERKNS_9DBOptionsERKNS_20TransactionDBOptionsERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorINS_22ColumnFamilyDescriptorESaISG_EEPSF_IPNS_18ColumnFamilyHandleESaISM_EEPPS0_+0x819)[0xd2af19]
/usr/local/mysql/bin/mysqld[0xb95a80]
/usr/local/mysql/bin/mysqld(_Z24ha_initialize_handlertonP13st_plugin_int+0x48)[0x619908]
/usr/local/mysql/bin/mysqld[0x79e76b]
/usr/local/mysql/bin/mysqld(_Z11plugin_initPiPPci+0x5e0)[0x79f3c0]
/usr/local/mysql/bin/mysqld[0x60a5e7]
/usr/local/mysql/bin/mysqld(_Z11mysqld_mainiPPc+0x4d9)[0x60d8d9]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7fa8de767d1d]
/usr/local/mysql/bin/mysqld[0x5fd329]
Basically, while InnoDB recovered just fine, crash recovery for MyRocks failed with an assertion. I have been unable to recover from this situation (even using suggested recovery modes on MyRocks wiki) and had to delete all data :-(
I have raised an issue and hope to be able to repeat this test soon....
Monday, October 31, 2016
MyRocks: migrating a large MySQL dataset from InnoDB to RocksDB to reduce footprint
I have been following Facebook's MyRocks project (and Mark Callaghan's blog) for a long time. The idea of an LSM based engine for MySQL is actually a great idea.
We all know that InnoDB sucks at INSERTs. BTree in general sucks when it's about insertion speed, and the more rows you insert, the more it sucks at it. There are many blog posts on the web that shows the insert speed degradation in InnoDB when the amount of rows in the table grows. Things get much worse faster if your primary key is a random key, for example an UUID.
We hit this problem with our caching servers (yes, we do caching with MySQL!), and in order to be able to scale these servers up we moved since a couple years to the TokuDB engine with great success. TokuDB is based on fractal tree technology, and guarantees the same insert speed, no matter the number of rows you have in the table; furthermore, it has no problems with random INSERTs (inserting rows with random PK values) and it also achieves great compression when compared to InnoDB.
Enter RocksDB and the LSM (Log Structured Merge) technology: we are entering a new planet as far as data storage goes. The idea of an "append only" type of approach for writing database rows is something I never heard before. Databases usually are the most random writer ever! And although SSD will not suffer from write randomness, there are other factors that need to be taken into consideration. I'll not go into more details about how LSM or RocksDB work, or why LSM is good for your SSD and your workload, as there is plenty of information about it on Mark's blog (see above) or on the official page for RocksDB and for MyRocks (the RocksDB engine for MySQL) on GitHub; just check the Wiki there.
For quite some time I dreamed to give MyRocks a try with our giant (at least by our standards) write-mostly tracking database, currently occupying about 2.1T of disk space and growing about 200G per month (and yes we DO have data retention in place). This cluster has a very write intensive workload and the written data is very seldom read, usually to extract statistical reports, or for application or incident post mortem analysis. At the time of this post, it contains approximately 830 tables.
So I was really excited to hear at the recent Percona Live AMS Yoshinori Matsunobu announcing that MyRocks was ready for production use. I explicitly asked him if partitioning was supported by MyRocks (as we use range partitioning to prune old data) and got a positive response. Yay!!
Even if production ready, MyRocks is still in early stages and therefore I needed to clone the git repository and compile the server from source code. I also had to compile the GCC compiler first, as it is a good idea to use one of the GCC versions that were tested by the Facebook team so that I can report issues that may happen later. Peter Z announced that MyRocks will be included in Percona Server soon, but if you want to give it a try now, this is the only way.
Next step was to obtain a consistent dump of the cluster, along with master position in order to add the MyRocks slave to the cluster after the data import. I knew this dump was going to take a long time, so instead of using mysqldump I decided to use mydumper, the parallel dump-and-load tool that I have used many times in the past, especially to migrate out of Amazon RDS (anyone?).
Also, I knew that some editing of the dump was required in order to accomplish the following:
- remove references to InnoDB and/or replace them with ROCKSDB
- introduce binary collation required by MyRocks (insert "COLLATE latin1_bin" or "COLLATE utf8_bin" in the table definitions)
- remove InnoDB compression where enabled (many of the larger tables are compressed)
So I downloaded, compiled and installed mydumper and after a whopping 44 hours, my dump was complete (and occupying about 2T of disk space uncompressed).
MyDumper creates two files for each table, one contains the table creation command and the other one contains the actual dump of the table's data. I crafted a simple script that goes through all the table creation files and applies the changes above in them:
#!/bin/bash
#
for f in bigdump/*schema*sql
do
fn=$(basename $f)
sed -e "s/InnoDB/ROCKSDB/g" -e "s/ DEFAULT CHARSET=latin1/ DEFAULT CHARSET=latin1 COLLATE=latin1_bin/g" -e "s/ DEFAULT CHARSET=utf8/ DEFAULT CHARSET=utf8 COLLATE=utf8_bin/g" -e "s/ ROW_FORMAT=COMPRESSED/ /g" < bigdump/$fn > bigdump/$fn.mangled
mv bigdump/$fn tmp/$fn.orig
mv bigdump/$fn.mangled bigdump/$fn
done
So I was ready to import this 2T of stuff into MyRocks. Of course, this is also a very time consuming operation, when you have more than a few G's to import. So it is really important to use all the available optimizations in order to speed up the bulk load.
Turns out that MyRocks contains quite a few tweaks to help with this; the most notable ones being rocksdb_skip_unique_check, which disables checking on unique constraints, and rocksdb_commit_in_the_middle which only commits every 1,000 rows or so. More info about options suitable for bulk loading data can be found on the Wiki page here. Beware that these settings cannot be used in regular production mode, they can cause corruption of your data! Therefore it is best to create a dedicated config for bulk data loading purposes only, and use the standard "production" config after the import is completed.
Side note: before you attempt to load large datasets into MyRocks, make sure you set your open file limit high enough! I have learned it the hard way, that MyRocks keeps one file descriptor open for each SST file you have in your database, and the number of SST files can grow very high when you are importing lots of data. In my case, I have topped 34,000 open file descriptors when the load was completed, which are then kept open even during normal use. I had originally reported this issue and following that a better default was recommended to me for the variable target_file_size_base in rocksdb_default_cf_options. Changing the recommended value from 32m to 64m made a big change regarding the amount of file descriptors required, dropping the above number to 12,500.
After a couple failed attempt at importing the data (mostly due to me being too lazy to properly read the available documentation) I finally managed to complete the import with satisfying results, using the configuration below. You will notice by reading the config that MyRocks cannot at this time coexist with InnoDB. I have been inquiring with Yoshinori about this, and the reason is that at this time, the crash recovery mechanism of the two engines will conflict. I can't wait for the day when we will be able to use RocksDB and InnoDB in the same server!
core-file
rocksdb
skip-innodb
default-storage-engine=rocksdb
default-tmp-storage-engine=MyISAM
collation-server=latin1_bin
# config for bulk load
rocksdb_max_open_files=-1
rocksdb_base_background_compactions=1
rocksdb_max_total_wal_size=4G
rocksdb_block_size=16384
rocksdb_block_cache_size=16G
rocksdb_table_cache_numshardbits=6
rocksdb_default_cf_options=write_buffer_size=128m;target_file_size_base=64m;max_bytes_for_level_base=512m;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=256;level0_stop_writes_trigger=256;max_write_buffer_number=16;compression_per_level=kNoCompression:kNoCompression:kNoCompression:kZlibCompression:kZlibCompression:kZlibCompression;bottommost_compression=kZlibCompression;compression_opts=-14:1:0;block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true
rocksdb_override_cf_options=system={memtable=skip_list:16}
rocksdb_skip_unique_check=1
rocksdb_commit_in_the_middle=1
rocksdb_write_disable_wal=1
rocksdb_max_background_flushes=40
rocksdb_max_background_compactions=40
During the import I have been watching the disk space (footprint) occupied by MyRocks' data directory. MyRocks creates a .rocksdb subfolder in your datadir and stores all its files in there, so I ran a script to monitor both the # of file descriptor in use (with lsof) and the footprint (with du) every 60 seconds:
3888
223G /storage/rockstat/data/.rocksdb
3959
229G /storage/rockstat/data/.rocksdb
3901
219G /storage/rockstat/data/.rocksdb
3924
220G /storage/rockstat/data/.rocksdb
3916
226G /storage/rockstat/data/.rocksdb
3815
211G /storage/rockstat/data/.rocksdb
You can clearly see the compaction of data taking place while the data loads. Compaction pushes data to lower levels, packing and compressing it in the process and therefore the footprint decrease while the import progresses. Cool!!
But what was really cool is finding out that MyRocks was able to pack 2.1T worth of data into a mere 611G - reducing an already compressed dataset by more than 3 times! This is even more amazing if you consider that about a dozen of the larger tables in this particular dataset contain BLOBs which have compressed data in them. In the steps above I had disabled compression at the engine level where it was configured in the source dataset, but of course compressed data in BLOBs will remain compressed and MyRocks did a very good job at compressing regardless.
Importing the dump took about 17 hours. During the import I ran into some stalls that I reported here. Once the import completed I shut the server down and switched the my.cnf file used for bulk load with the "production" one below. Please note that I had to keep rocksdb_block_cache_size low due to other stuff using memory on the test server at the same time; you should raise this accordingly to the amount of memory available, keeping in mind that MyRocks, similarly to TokuDB, also uses filesystem cache extensively. I didn't find a recommendation on the MyRocks wiki, so I decided to set this parameter to 50% of the available server memory, which is the recommended value for Toku.
core-file
rocksdb
skip-innodb
default-storage-engine=rocksdb
default-tmp-storage-engine=MyISAM
collation-server=latin1_bin
rocksdb_max_open_files=-1
rocksdb_base_background_compactions=1
rocksdb_max_background_compactions=8
rocksdb_max_total_wal_size=4G
rocksdb_max_background_flushes=4
rocksdb_block_size=16384
rocksdb_block_cache_size=16G
rocksdb_table_cache_numshardbits=6
rocksdb_default_cf_options=write_buffer_size=128m;target_file_size_base=64m;max_bytes_for_level_base=512m;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=10;level0_stop_writes_trigger=15;max_write_buffer_number=4;compression_per_level=kNoCompression:kNoCompression:kNoCompression:kZlibCompression:kZlibCompression:kZlibCompression;bottommost_compression=kZlibCompression;compression_opts=-14:1:0;block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true
Using the information saved by mydumper in the "metadata" file I set up replication from the actual InnoDB based slave to the MyRocks slave and let it catch up. I used MTS (Multi Threaded Slave) to speed up the catch up (parallel replication happens to work very well for this cluster's workload) and the slave caught up pretty quickly. After the catch up, I stopped the slave and disabled parallel replication, as it is not compatible with relay_log_recovery in case of a slave crash, and the latter option is needed by MyRocks to guarantee a crash safe slave.
I quickly found out that having a slave which uses an engine different than the master needs also some other tweaks in the configuration. Remember that I had to disable InnoDB in order to make MyRocks work properly. So what happens when on the master some DDL is executed which explicitly references ENGINE=InnoDB?
Here's the answer:
Last_SQL_Errno: 1286
Last_SQL_Error: Worker 7 failed executing transaction '' at master log dbstat02.034383, end_log_pos 99530607; Error 'Unknown storage engine 'innodb'' on query. Default database: 'rick_rocks'. Query: 'create table ttable2 (id int not null auto_increment primary key) engine=innodb'
D'oh! Of course, this is not what I wanted. I quickly remembered about a SQL_MODE option that I always wondered what could be used for: NO_ENGINE_SUBSTITUTION.
From the manual page:
With NO_ENGINE_SUBSTITUTION disabled, for CREATE TABLE the default engine is used and a warning occurs if the desired engine is unavailable. For ALTER TABLE, a warning occurs and the table is not altered.
Hey, sounds like exactly what I want!! We never use ALTER TABLE anyways as we do everything via pt-online-schema-change, so this option should be all we need on the slave.
This option (NO_ENGINE_SUBSTITUTION) is enabled by default since MySQL 5.6.6, so I needed to revert the sql_mode to a blank string, by setting in my.cnf the following:
sql_mode=''
To my dismay, after restarting the MyRocks slave, the behaviour was unchanged. A bit of digging in the MySQL source code and I discovered that the slave uses the master's SQL_MODE, and not its own! The mode gets written to the binary log for each and every transaction, and is used by the slave when applying events.
So, I went to the master and changed the SQL_MODE there. After restarting replication, now the behaviour is correct and a table created with Engine=INNODB is created with engine ROCKSDB on the slave.
Finally, here's a comparison of the disk usage between a classic "InnoDB" slave and our new "RocksDB" one, over the last 48 hours:
MyRocks |
InnoDB |
Can you spot without using the caption which one of the two is our MyRocks slave? The compaction algorithm can be clearly spotted in the graph. The two big drops around 6am is when the older partitions are dropped every night to apply retention.
The graphs above show that the MyRocks slave grew from 47.75% to 48%, while the InnoDB slave grew from 77% to 78%. What they don't show is that the MyRocks storage is a mere 1.4T where the InnoDB one is double that, at 2.8T!
So, bottom line is, MyRocks grew 0.25% of 1.4T = 3.5G while InnoDB grew 1% of 2.8T = 28G. This is a 7x improvement!
We will be monitoring this new slave for some time trying to compare CPU metrics and to observe general behaviour, like lag and general reliability.
But for now, I would say that MyRocks really ... rocks!!! Well done FB Team!
Subscribe to:
Posts (Atom)