As you’ll be able to probably guess, this text details obstacles I bumped into at work after I found myself in an identical situation.
Ultimately, the answer was in a somewhat hack-y, but remarkably comprehensible, workaround using the native MySQL JSON data type.
The JSON data type in MySQL was added in version 5.7.8, and provides loads of useful utility for each storage and modeling.
Under the JSON data type umbrella (officially called “JSON documents”) are two different data structures: JSON arrays and JSON objects.
A JSON array can simply be regarded as an array (a listing, when you’re a Pythonista): values enclosed by square brackets [ ] and separated by commas.
- An example MySQL JSON array value:
[“foo”, “bar”, 1, 2]
A JSON object might be regarded as a hash table (or, again in Python terms, a dictionary): key-value pairs, separated by commas, and enclosed by curly brackets { }.
- An example of a MySQL JSON object value:
{“foo”: “bar”, 1: 2}
MySQL has quite a few functions that might be used to take care of each of those formats—almost none of which perform any form of aggregation.
Thankfully, though, there are two that do. They usually each return JSON documents, which suggests we are able to use MySQL’s built-in functions to access the values therein.
The MySQL function JSON_ARRAYAGG acts so much like GROUP_CONCAT. The most important difference is that it returns a JSON array, which, again, comes with several helpful built-in functions linked above.
The JSON array data type solves one among our two problems with astounding simplicity: the issue of reliably counting the variety of subscriptions in a mix. That is achieved using the JSON_LENGTH function. The syntax is splendidly straightforward:
SELECT JSON_LENGTH(JSON_ARRAY("foo", "bar", "hello", "world"));
-- JSON_ARRAY function used here simply to quickly create an example array
The results of this statement is 4, since there are 4 values within the generated JSON array.
But let’s return to the mixture of subscriptions. Unfortunately, JSON_ARRAYAGG doesn’t include the ordering functionality that GROUP_CONCAT has. Ordering the subscription values, even in a CTE before the bottom query, doesn’t return the specified results:
WITH
subscriptions_ordered AS (
SELECT
customer_id,
subscription
FROM subscriptions
ORDER BY subscription
)
, subscriptions_grouped AS (
SELECT
customer_id,
JSON_ARRAYAGG(subscription) AS subscriptions,
JSON_LENGTH(JSON_ARRAYAGG(subscription)) AS num_subscriptions
FROM
subscriptions_ordered
GROUP BY customer_id
)
SELECT
subscriptions,
COUNT(*) AS num_accounts
num_subscriptions
FROM subscriptions_grouped
GROUP BY subscriptions
;
The variety of subscriptions in each combination is there, due to the JSON_LENGTH function—but mixtures which are effectively the identical are once more mischaracterized as distinct due to their order.
Using ROW_NUMBER to force the ordering of values
ROW_NUMBER is a window function that creates an index. The index needs to be defined; that’s, you might have to inform it where to start out, how one can increment (directionally), and where to finish.
We are able to see a fast example of this by applying the ROW_NUMBER function and telling it to order by the subscription field:
SELECT
customer_id,
subscription,
ROW_NUMBER() OVER(ORDER BY subscription) AS alphabetical_row_num
FROM subscriptions
;
Look closely at the outcomes. Though we didn’t use an ORDER BY statement at the top of our query, the information is nonetheless ordered based on the ORDER BY within the OVER clause.
But after all this still isn’t exactly what we wish. What we want to do next is add a PARTITION BY clause to our window function, in order that the ordering of the outcomes are related to (and the truth is bounded by) each customer ID. Like so:
SELECT
customer_id,
subscription,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY subscription) AS alphabetical_order
FROM subscriptions
;
You’ll be able to probably see where that is going.
If we execute the JSON_ARRAYAGG function against these ends in a CTE, we see that the duplicate mixtures now look the exact same, due to the subscriptions being forced into an alphabetical order by the ROW_NUMBER function:
WITH
subscriptions_ordered AS (
SELECT
customer_id,
subscription,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY subscription) AS alphabetical_order
FROM subscriptions
)
SELECT
customer_id,
JSON_ARRAYAGG(subscription) AS subscriptions
FROM subscriptions_ordered
GROUP BY 1
ORDER BY 2
;
Now all we want to do is add within the grouping CTE following the one executing ROW_NUMBER, and alter the bottom query:
WITH
subscriptions_ordered AS (
SELECT
customer_id,
subscription,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY subscription) AS alphabetical_order
FROM subscriptions
)
, subscriptions_grouped AS (
SELECT
customer_id,
JSON_ARRAYAGG(subscription) AS subscriptions,
JSON_LENGTH(JSON_ARRAYAGG(subscription)) AS num_subscriptions
FROM subscriptions_ordered
GROUP BY customer_id
)
SELECT
subscriptions,
COUNT(*) AS num_customers,
num_subscriptions
FROM subscriptions_grouped
GROUP BY subscriptions
ORDER BY num_customers DESC
;
This provides not only accurately distinct mixtures of subscriptions, but additionally the number of consumers who’ve purchased those mixtures, and what number of subscriptions comprise each of them:
Voila!
- We desired to understand how many purchasers purchased different combination of subscriptions, and what number of subscriptions were in each of those mixtures. This presented two problems: how best to acquire the latter, and how one can generate accurately distinct subscription mixtures.
- To acquire the variety of subscriptions in each combination, we selected to go together with one among MySQL’s JSON functions,
JSON_ARRAYAGG. The resulting aggregation was returned to us as a JSON data type, allowing us to make use of theJSON_LENGTHfunction. - We then needed to force the ordering of values contained in the JSON array in order that duplicate mixtures didn’t mistakenly appear distinct. To do that, we used the window function
ROW_NUMBERin a CTE prior to the bottom query, partitioning by customer ID and ordering the subscriptions alphabetically (in ascending order). - This ultimately allowed us to aggregate as much as accurately distinct mixtures of subscriptions; and with this we were capable of use a straightforward
COUNTfunction to see how many purchasers had purchased each combination.