Null values and their pitfalls

Learning: Best to not permit NULL values in columns which are used for comparisons

If you compare a NULL value with any other character the output will always be false which sometimes can lead to confusion since it is not the desired behavior.

SELECT CASE WHEN NULL = 'test' THEN 1 ELSE 0 END
-- result: 0
SELECT CASE WHEN NULL != 'test' THEN 1 ELSE 0 END
-- result: still 0

The case

I am currently involved in developing an open source datawarehouse stack. Within the ETL process we rather want to load only new data into our warehouse. To detect changes we can calculate a hash over a concatinated string of all columns and save it in a separate column. If the hash changes then we know that some entry changed as well and the new row gets inserted (very simplified).

old entries

ColA	ColB	Concatination	md5()
hello	world	hello_world	99b1ff8f11781541f7f89f9bd41c4a17
bingo	bongo	bingo_bongo	d0946da002ae63f07dc9e1c4489b2c91

new entries

ColA	ColB	Concatination	md5()
hello	world	hello_world	99b1ff8f11781541f7f89f9bd41c4a17
bingo	bongi	bingo_bongi	a303447dc3952485141a4acb6f096f45

The hash of the second row changed because bongo turned into bongi

Note that we at first concatinated the strings using postgres’ || such as

SELECT md5(ColA||'_'|| ColB) FROM table

The pitfall here is, that a concatination with a NULL value results in a NULL value. This leads to a value of NULL in our hash column. This hash column in return is used to detect changes in a way such as:

INSERT INTO table
SELECT * FROM table
WHERE old.hash != new.hash -- if this statement is true, meaning the hashes differ from each other, changes occured in that row
AND old.pk = new.pk

What it does is that it will only insert rows, which have been changed based on the hash. However as mentioned above, a comparison with a NULL value always yields a false result.
This means:

If the old hash is NULL and our new hash is a legit string, the row does not get updated although changes occured
A string concatination in such a way as above will result into NULL if any column is NULL. Given 10 columns with the value of NULL. If at any point in time 8 of them would receive entries, the resulting hash would still be NULL therefore detecting no changes even if we would wrap the hash with a COALESCE() function.

Given our insert logic this particular entry will never see an update value again.

Solution

COALESCE(): By using COALESCE() or ISNULL() (t-sql) it is possible to replace the NULL value with a avlue which can be used to compare to each other, however as above mentioned,that might not work in all cases for example, when both hashes are NULL.
IS DISTINCT FROM: This operator does exactly what we need – it compares two values and can discern NULL values from any character strings giving us a result of true if that’s the case. Note that this operator is not available on all DBMS.
CONCAT(): All of this would not happen in the first place, if we would concatinate our strings by using the function CONCAT(). If a NULL value occurs it would replace it with an empty string, therefore preventing NULL values in our updatehash.

Note: Sometimes you might want a result returning NULL if any of its component have a NULL value, e.g. in the case of a compound key. Add the condition of NOT NULL to the key’s column and we have somewhat of a integrity check. Now the components of a compound key should not be NULL but if the case does exist, the concatination of the key results into a NULL value, which then throws an error in the process of insertion into the NOT NULL column which should be the expected behaviour.

The case

old entries

new entries

Solution

Leave a Comment Cancel Reply