Monday, 3 September 2012

MySQL to PostgreSQL database transition

There are quite a few blog posts around about how to convert your database from MySQL to PostgreSQL, but I feel none of them really cover the entirety of the problem. I think the majority of people are dealing with small datasets or simple data and don't hit the issues I did, so I figured I'd document my findings. I'll leave the arguments about why you should use one database or the other for another time.

First you'll want to get your data out of MySQL. Personally, I found it faster and easier to just rewrite my entire database schema from scratch for PostgreSQL. You'll need to be careful with datatypes to make sure the data matches up okay. This is where I hit my first problem: MySQL doesn't have a real boolean type, it uses a tinyint with the data stored as 1 and 0. Since we'll be dumping the raw data later, the corresponding PostgreSQL column will have to be an integer type too for now; PostgreSQL has no tinyint, so use a smallint (you can change it to a proper boolean once the data is in).
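As a sketch of what I mean, assuming a hypothetical users table with an active flag (table and column names are just for illustration):

    # In your hand-written PostgreSQL schema, mirror MySQL's tinyint for now:
    psql -d mydb -c "CREATE TABLE users (id integer PRIMARY KEY, active smallint);"

    # Once the data is imported, convert the column to a real boolean:
    psql -d mydb -c "ALTER TABLE users ALTER COLUMN active TYPE boolean USING active <> 0;"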

Once you have your schema written up, it's time to dump the data. This is where I hit problem number two: no matter what charset your tables are using, the data will be dumped using the client's default character set. I assumed my MySQL tables were utf8, my PostgreSQL tables were utf8, good 'nuff. Wrong. You'll want to dump with --default-character-set=utf8 to make sure.
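If you want to check what your tables are actually using before you dump, something along these lines should do it (mydb is a placeholder for your database name):

    # Show the collation (and hence charset) each table is actually using
    mysql -e "SELECT table_name, table_collation
              FROM information_schema.tables
              WHERE table_schema = 'mydb';"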

Since we already wrote our schema, we can dump with --no-create-info. Some other handy options, so you don't lock up your MySQL server or run out of memory, are --skip-opt --quick --single-transaction. And there are a couple more just to keep bloat out of your dump file: --skip-comments --skip-dump-date.

Now, the main dump args you want to make sure the output is importable by PostgreSQL are --skip-add-locks --complete-insert --compatible=postgresql.
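Putting all of the dump options together, the command ends up looking something like this (database name and output file are placeholders):

    mysqldump --default-character-set=utf8 \
              --no-create-info \
              --skip-opt --quick --single-transaction \
              --skip-comments --skip-dump-date \
              --skip-add-locks --complete-insert \
              --compatible=postgresql \
              mydb > mydb-data.sql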

On to problem number three: MySQL uses backslash as an escape character in its dump data, escaping a whole bunch of stuff like tabs, newlines, etc., when it doesn't really need to. At first I thought I just needed to fix the escaped quotes, replacing \' with ''. That allowed me to fully insert the data and I thought I was done. But after a while, I noticed all the extra backslashes in my data. Ouch. I thought I was going to have to get my sed programming hat on and replace every single escaped MySQL character (there are lots). But as it turns out, thanks to some dudes in the PostgreSQL IRC channel, all you need is PGOPTIONS='-c standard_conforming_strings=off -c escape_string_warning=off'.
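PGOPTIONS is just an environment variable that psql (via libpq) passes through to the server when it connects, so the import looks something like this:

    # Turn off standard_conforming_strings for this session so PostgreSQL
    # interprets MySQL's backslash escapes, and silence the resulting warnings:
    PGOPTIONS='-c standard_conforming_strings=off -c escape_string_warning=off' \
        psql -d mydb -f mydb-data.sql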

Now we're good, right? We have our huge data file, the escape chars are taken care of, the encoding is fixed. Nope. Remember the datatypes I referred to earlier? It turns out PostgreSQL doesn't allow the null character (not the NULL value) in text or varchar fields. For any of your data that contains the null char, you're going to need to change that field to a bytea. And then guess what: your database dump is now useless, because PostgreSQL can't nicely import that "string" content into a bytea.

So make a call: do you really care about all your null characters? If you do, you're going to have to find another way to import your data. If you don't, strip them out with Perl or similar, making sure to use a negative look-behind so you strip \0 but not \\0 (an escaped backslash followed by a literal zero). Why Perl? Because sed doesn't support look-arounds.
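If you go the stripping route, a one-liner along these lines is the idea. It's a rough sketch: the simple look-behind won't catch a null escape that immediately follows an escaped backslash (\\\0), so eyeball your data first.

    # Strip \0 (escaped null) but leave \\0 (escaped backslash + zero) alone,
    # using a negative look-behind; writes the result to a new file
    perl -pe 's/(?<!\\)\\0//g' mydb-data.sql > mydb-data-nonull.sql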

All in all, it's a bit of a nightmare to keep all your data intact. I hope your transition goes smoother than mine.
