My words from my mind
From Unicorn to Unicode
What is worse than knowing that a unicorn exists in some other dimension but that you will never be able to see it?
My answer would be the \xA0 character from some encoding world that I didn’t even know existed. Being an Earthling, the only encoding world I have ever known is Unicode. More specifically, the UTF-8 realm.
Interestingly, many Unicode-based systems reject \xA0 (or any other non-convertible character) and fail outright. Take Python for example, and PostgreSQL later on.
In Python 2, there is a function called unicode() that converts a string from another encoding to Unicode.
unicode(object[, encoding[, errors]])
However, the “errors” argument defaults to “strict”. That means a UnicodeDecodeError is raised whenever a byte cannot be decoded. In practice, a single untranslatable character in the object you are trying to convert will break the system unless that error is handled.
There are two other options for handling conversion errors.
- “replace” replaces the untranslatable character with the official Unicode replacement character (U+FFFD)
- “ignore” silently drops the untranslatable character
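To see the three modes side by side, here is a minimal sketch. It uses the decode() method of a byte string rather than unicode() itself, because decode() accepts the same “errors” values and also runs on Python 3, where unicode() no longer exists. The sample input and the helper name decode_with are made up for illustration.

```python
# Hypothetical input: the byte 0xA0 is not valid UTF-8 on its own.
raw = b"caf\xa0"

def decode_with(errors):
    # Illustrative helper: decode the sample bytes as UTF-8
    # using the given error handling mode.
    return raw.decode("utf-8", errors)

# "strict" (the default) raises UnicodeDecodeError on the bad byte.
try:
    decode_with("strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace" swaps the bad byte for U+FFFD, the replacement character.
replaced = decode_with("replace")

# "ignore" drops the bad byte, leaving only the decodable part.
ignored = decode_with("ignore")
```

Note that “ignore” loses data without leaving a trace, while “replace” at least leaves a visible marker where the unicorn used to be.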
When inserting non-Unicode strings into a UTF-8 (Unicode-based) database, PostgreSQL will try to translate them first. The same thing happens if the string contains an untranslatable character: it throws an error.
This can be a hell of a problem, because it effectively breaks your system if it is one of those systems that process input and save it into a database.
So the solution is usually to catch these unicorns before they escape into the database.
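One way to do the catching is sketched below, assuming the input arrives as raw bytes that are supposed to be UTF-8. The function name clean_for_db and the choice to fall back to “replace” are my own assumptions, not the only reasonable policy. You might prefer to reject the input instead.

```python
def clean_for_db(raw, encoding="utf-8"):
    """Hypothetical helper: return (text, had_errors) for raw input bytes.

    The text is always safe to encode as UTF-8. had_errors tells the
    caller whether any byte had to be replaced, so it can log or reject.
    """
    try:
        # Strict decoding first, so clean input passes through unchanged.
        return raw.decode(encoding), False
    except UnicodeDecodeError:
        # A unicorn was spotted: keep the row, but mark the bad bytes
        # with U+FFFD instead of letting the database raise an error.
        return raw.decode(encoding, "replace"), True

text, had_errors = clean_for_db(b"unicorn\xa0sighting")
clean_text, clean_flag = clean_for_db(b"plain input")
```

Returning the flag alongside the text keeps the decision (store, log, or refuse) with the caller instead of hiding it inside the helper.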