Yesterday I attempted to solve alphanumeric sorting in MySQL and failed. (read that article here)
I did get close though and had the right concept, just wrong execution.
Today, I woke up and had an epiphany…recursion.
The problem with recursion is that you have to understand recursion to be able to do recursion…and I don’t understand recursion enough to do recursion in MySQL.
However with a bit of Chat Gippity back and forth (by which I mean getting it to write what I asked for, getting back about 25% of what I asked for, fixing it and feeding it into a new chat so it doesn’t keep repeating itself for about 2 hours) I got a working answer!
To the point
May I present to you my swan song, my masterpiece, the answer to life itself (ok fine, the only working solution to true alphanumeric sorting in MySQL I have seen).
WITH RECURSIVE process_numbers AS (
SELECT
data_value,
data_value AS remaining_data,
CAST('' AS CHAR(20000)) AS processed_data,
1 AS iteration
FROM test_data
UNION ALL
SELECT
data_value,
CASE
WHEN LOCATE(REGEXP_SUBSTR(remaining_data, '[0-9]+'), remaining_data) > 0 THEN
SUBSTRING(
remaining_data,
LOCATE(REGEXP_SUBSTR(remaining_data, '[0-9]+'), remaining_data)
+ LENGTH(REGEXP_SUBSTR(remaining_data, '[0-9]+'))
)
ELSE ''
END AS remaining_data,
CONCAT(
processed_data,
CASE
WHEN LOCATE(REGEXP_SUBSTR(remaining_data, '[0-9]+'), remaining_data) > 0 THEN
LEFT(remaining_data, LOCATE(REGEXP_SUBSTR(remaining_data, '[0-9]+'), remaining_data) - 1)
ELSE remaining_data
END,
CASE
WHEN REGEXP_SUBSTR(remaining_data, '[0-9]+') IS NOT NULL THEN
RIGHT(CONCAT('0000000000', REGEXP_SUBSTR(remaining_data, '[0-9]+')), 10)
ELSE ''
END
) AS processed_data,
iteration + 1
FROM process_numbers
WHERE LENGTH(remaining_data) > 0
AND iteration < 100
)
SELECT
data_value,
CONCAT(processed_data, remaining_data) AS sort_key
FROM process_numbers
WHERE remaining_data = ""
ORDER BY sort_key;
And if you want to try it out (and try to break it) you can play with this DB fiddle
So how does this work?
It does what I originally wanted to do, take every group of numbers and pads them out to 10 digits total.
So obviously if you feed this a couple of strings with 11 consecutive numeric digits it won’t work without adjustment, but other than that it works fine!
You see, MySQL can sort numbers correctly, even in lexicographical ordering mode, but it has one flaw.
It counts “11” as smaller than “2” because of the fact it does sorting one character at a time (effectively). So “2” is bigger than “1” so it comes first. Then it checks the next character, by which point the sorting is incorrect (for numbers at least).
To understand this better, imagine if 1 was actually the letter “b” and 2 was the letter “c”.
That is how MySQL “sees” numbers, they are just another character.
So if I had “bb” and “c” you would expect “bb” to come before “c”. Now swap the numbers back in and you can see why “11” comes before “2”.
So this is a hack?
Yes, we remove the issue by moving the numbers “back” through padding.
Going back to our example, if we padded “11” and “2” to 3 in length and use “a” as 0, this is what happens:
011 = abb
002 = aac
notice how now the sorting would go:
- character 1: is “a” bigger than “a” – no, they are the same.
- character 2: is “b” bigger than “a” – yes, put the “a” before the “b”
- character 3: is now irrelevant and we have already found an occurence earlier that was different and larger.
So by that logic we now have:
002 = aac (the second "a" comes before the second "b" in the next row)
011 = abb
And that is how it works!
Are you going to explain the recursion thing?
Kind of. I have been “round the houses” with this one and my knowledge is surface level, but I will give it a go.
The problem came with how RegEx works in MySQL. REGEX_SUBSTR
will only ever find one match and then keep returning that for every other match it finds. So that was why my solution from yesterday was not working correctly.
But REGEX_REPLACE
has its own issues where it doesn’t seem to correctly expose the string length of a match (so we can’t LPAD
with it correctly)
That is why I thought about recursion as the answer.
I can use REGEX_SUBSTR
to get the correct padding behaviour, and as each loop through the RegEx is essentially a new function call it doesn’t “remember” the previous match, so it solves that problem.
And if you want a brief step through of the logic it is actually not as scary as it looks!
- We loop over a given string, looking for any numbers (the entire number, not just a single character).
- We then remove that from
remaining_data
so we don’t match it again. - We take that number we just matched and pad it out to be 10 digits long total.
- We then search for the next numeric part in the string and repeat the process, building up
processed_data
as our final string. - finally once we have no more numbers to process, we add any remaining letters to the end of
processed_data
to complete the transformation and we return this assort_key
.
Then we can use this sort_key
in our query to order the column correctly.
And the iteration
part is purely a safeguarding tool, to make sure it doesn’t completely run the MySQL server out of memory or crash the query if a sufficiently complex string is processed (or there is an error in the logic that means it would recurse forever).
That’s a wrap!
Isn’t it funny how sleeping on things brings new perspective?
Perhaps I should try out polyphasic sleep so I can sleep on problems 2-3 times more often each day and become a 10x developer? haha.
Anyway there you have it, a reasonably robust true alphanumeric sort.
Oh and in reality you should probably convert the sort_key
to a stored column on your database using GENERATE
or a stored procedure. Sadly the playground I use doesn’t seem to support that and it is a Sunday so I will leave that to you, dear viewer!
Have a great rest of your weekend and a great week.
Source link
lol