Why you may’t GROUP BY ordinal positions in SQL Server but can in others
After working commonly with open-source databases equivalent to MySQL and PostgreSQL, I recently had the possibility to work on a SQL Server project and discovered a subtle but vital difference within the SQL landscapes. I observed that in SQL Server, I used to be unable to GROUP BY ordinal positions (GROUP BY 1, 2, 3…), which is a functionality I ceaselessly used inside other databases, particularly for rapid testing.
This discovery led me to explore several nuances of each database systems, particularly the SQL execution order, which will likely be the main focus of this text.
Why does this matter? When working with database systems, understanding the subtle differences can greatly impact your workflow and increase your productivity. It will possibly prevent significant troubleshooting time. Also, by understanding the SQL execution order of assorted databases, you may craft more optimal SQL queries based on the system you might be working with.
In this text, we are going to look into one major use case where this behaviour occurs — GROUP BY — and investigate why. Nonetheless, this insight might be applied to the HAVING, WHERE or another SQL command clause.
Let’s take a look at this instance within the query below. This may not work in SQL Server regardless that it really works in MySQL:
SELECT
DATEPART(12 months, day) AS order_date,
SUM(cost) as cost
FROM clean
GROUP BY 1;
In the event you run this, you’ll likely get an error like this:
Each GROUP BY expression must contain at the least one column that shouldn't be an outer reference.
Nonetheless, this revised query works after replacing the GROUP BY ordinal reference with the express expression. You can even notice you could reference the ordinal positions within the ORDER BY clause, which I discovered strange:
SELECT
datepart(12 months, day),
sum(cost) as cost
from clean
GROUP BY datepart(12 months, day)
ORDER BY 1;
In SQL Server, I quickly learned that I had to make use of explicit column names or expressions within the GROUP BY clause. This is taken into account a best practice because it makes the code easier to grasp. Nonetheless, I used to be interested by why this behavior differed between databases. Moreover, I discovered it interesting that the ORDER BY clause in SQL Server works with ordinal positions, which further piqued my curiosity.
To seek out out, lets take a look at the SELECT statement execution/processing order for SQL server vs. other databases. It’s vital to notice that in SQL databases, each a part of a question is executed sequentially and that order differs from the way it is written.
In SQL Server, as an illustration, we are able to see from the image below and from Microsoft docs that the FROM clause is the primary command to be evaluated. Moreover, the SELECT clause runs after the GROUP BY clause. That’s the reason we were unable to reference a column‘s position and even its alias within the GROUP BY clause in our first example!
Nonetheless, we’re free to reference the ordinal position and/or alias within the ORDER BY clause, as that’s evaluated after the SELECT clause. The SELECT clause tells the database what columns will likely be returned and thus, the positioning is thought at this point. Cool, right?
SQL Server execution order
In MySQL nevertheless, I discovered it difficult to search out clear documentation stating the order of execution of a SQL query. The execution order seems to rely on the contents of the query and what the query optimiser defines as the perfect path.
But from what we are able to see from the MySQL docs here, the clue shows us how the execution order is perhaps and that the SELECT clause is evaluated before the GROUP BY clause:
For GROUP BY or HAVING clauses, it searches the FROM clause before searching within the select_expr values. (For GROUP BY and HAVING, this differs from the pre-MySQL 5.0 behavior that used the identical rules as for ORDER BY.)
If we also take a look at the GoogleSQL (formerly Standard SQL) docs which is the syntax utilized in Google BigQuery, you will notice similar deviation from the way in which queries are executed in SQL Server:
GROUP BY and ORDER BY may discuss with a 3rd group: Integer literals, which discuss with items within the SELECT list. The integer 1 refers back to the first item within the SELECT list, 2 refers back to the second item, etc.
As you may see, this behaviour shouldn’t be supported in SQL Server. The Google docs also mention that GROUP BY, ORDER BY, and HAVING, can discuss with aliases from the SELECT list.
With that, we are able to conclude with a high probability that the execution order for these other databases follows a path much like the image below:
MySQL, PostgreSQL & BigQuery probable execution order
Conclusion
This was a brief post where we checked out how the execution order in MySQL, GoogleSQL and other databases’ SQL syntaxes, differ from SQL Server’s, based on the observed behaviours and documentation. SQL Server emphasises explicitness within the GROUP BY clause for code clarity, while MySQL’s execution order definitely evaluates that SELECT clause before the GROUP BY clause, allowing us to reference the ordinal positions in it.
Be happy to share your thoughts on this subject and catch you in the next one.