Learning: Best to not permit NULL values in columns which are used for comparisons
If you compare a NULL value with any other character the output will always be false which sometimes can lead to confusion since it is not the desired behavior.
SELECT CASE WHEN NULL = 'test' THEN 1 ELSE 0 END -- result: 0 SELECT CASE WHEN NULL != 'test' THEN 1 ELSE 0 END -- result: still 0
The case
I am currently involved in developing an open source datawarehouse stack. Within the ETL process we rather want to load only new data into our warehouse. To detect changes we can calculate a hash over a concatinated string of all columns and save it in a separate column. If the hash changes then we know that some entry changed as well and the new row gets inserted (very simplified).
old entries
ColA | ColB | Concatination | md5() |
hello | world | hello_world | 99b1ff8f11781541f7f89f9bd41c4a17 |
bingo | bongo | bingo_bongo | d0946da002ae63f07dc9e1c4489b2c91 |
new entries
ColA | ColB | Concatination | md5() |
hello | world | hello_world | 99b1ff8f11781541f7f89f9bd41c4a17 |
bingo | bongi | bingo_bongi | a303447dc3952485141a4acb6f096f45 |
Note that we at first concatinated the strings using postgres’ || such as
SELECT md5(ColA||'_'|| ColB) FROM table
The pitfall here is, that a concatination with a NULL value results in a NULL value. This leads to a value of NULL in our hash column. This hash column in return is used to detect changes in a way such as:
INSERT INTO table SELECT * FROM table WHERE old.hash != new.hash -- if this statement is true, meaning the hashes differ from each other, changes occured in that row AND old.pk = new.pk
What it does is that it will only insert rows, which have been changed based on the hash. However as mentioned above, a comparison with a NULL value always yields a false result.
This means:
- If the old hash is NULL and our new hash is a legit string, the row does not get updated although changes occured
- A string concatination in such a way as above will result into NULL if any column is NULL. Given 10 columns with the value of NULL. If at any point in time 8 of them would receive entries, the resulting hash would still be NULL therefore detecting no changes even if we would wrap the hash with a COALESCE() function.
Given our insert logic this particular entry will never see an update value again.
Solution
- COALESCE(): By using COALESCE() or ISNULL() (t-sql) it is possible to replace the NULL value with a avlue which can be used to compare to each other, however as above mentioned,that might not work in all cases for example, when both hashes are NULL.
- IS DISTINCT FROM: This operator does exactly what we need – it compares two values and can discern NULL values from any character strings giving us a result of true if that’s the case. Note that this operator is not available on all DBMS.
- CONCAT(): All of this would not happen in the first place, if we would concatinate our strings by using the function CONCAT(). If a NULL value occurs it would replace it with an empty string, therefore preventing NULL values in our updatehash.
Note: Sometimes you might want a result returning NULL if any of its component have a NULL value, e.g. in the case of a compound key. Add the condition of NOT NULL to the key’s column and we have somewhat of a integrity check. Now the components of a compound key should not be NULL but if the case does exist, the concatination of the key results into a NULL value, which then throws an error in the process of insertion into the NOT NULL column which should be the expected behaviour.