Using LATERAL Derived Tables To Gather Row-Specific Aggregations In MySQL 8.0.14

0
54


After my database was accidentally upgraded to MySQL 8.0.28, I started to look at the features released in version 8. One fun feature that caught my eye was this idea of a LATERAL derived table. This is a derived table in which the resultant rows are calculated based on other tables in the query. Not only that, the “lateral rows” are based on other rows in recordset. This feature, added in MySQL 8.0.14, will make it easier to gather relevant aggregate data for each row in the results.

In the past, if I needed to pull back a number of related aggregations for a given row, I would JOIN to a derived “stats” table that contained the aggregations for the predicted records in the outer query. The downside to this approach is that – because the derived query can’t reference the outer query – I need to duplicate all of the filtering inside of the derived query in order to limit the number of records that it generates.

To see what I mean, let’s look at an example using data from this blog. In my database schema, I have two tables:

  • member – Contains all of the author information for the comments.
  • blog_comment – Contains all of the comments left by the members.

This is a 1-to-many relationship in which every member can leave N-number of comments (related by blog_comment.member_id). If I want to gather aggregate information about the comments for a given user, I would have to filter records using the member_id in both the outer query and the derived query:

SET @userID = 1;

SELECT
	m.id,
	m.name,
	-- Gather the aggregate data from the derived stats table.
	stats.commentCount,
	stats.minCreatedAt,
	stats.maxCreatedAt
FROM
	member m
INNER JOIN
	-- Generate a DERIVED TABLE with multiple aggregations relating to the
	-- commenting for the given user. This allows us to run the related query
	-- "once for the member" instead of "once for each aggregate".
	(

		SELECT
			c.member_id,
			COUNT( * ) AS commentCount,
			MIN( c.date_created ) AS minCreatedAt,
			MAX( c.date_created ) AS maxCreatedAt
		FROM
			blog_comment c
		-- NOTE: Since the derived query CANNOT reference the outer query - it
		-- has to be a constant evaluation - I have to repeat the filtering in
		-- the derived query in order to avoid a FULL TABLE SCAN. In this case,
		-- it's only one column; but, the more complex the filtering gets, the
		-- more conditions would need to be duplicated.
		WHERE
			c.member_id = @userID
		GROUP BY
			c.member_id

	) AS stats
ON
	(
			m.id = @userID -- Filter OUTER QUERY to given member.
		AND
			stats.member_id = m.id
	)
;

As you can see, I’m using the @userID to limit rows in both the outer query as well as in the derived stats query. Since the derived query cannot reference rows in the outer query – derived queries have to be “constant” in the query execution plan – I have to reproduce all of the desired filtering within the derived query in order to prevent a catastrophic full-table scan.

The new LATERAL derived table removes this point-of-friction and allows our derived query to reference rows in the outer query! All we have to do is include the LATERAL keyword prior to the SELECT. Here’s the same query, refactored to use a LATERAL derived table:

SET @userID = 1;

SELECT
	m.id,
	m.name,
	-- Gather the aggregate data from the derived stats table.
	stats.commentCount,
	stats.minCreatedAt,
	stats.maxCreatedAt
FROM
	member m
INNER JOIN
	-- Generate a DERIVED TABLE with multiple aggregations relating to the
	-- commenting for the given user. This allows us to run the related query
	-- "once for the member" instead of "once for each aggregate".
	LATERAL (

		SELECT
			COUNT( * ) AS commentCount,
			MIN( c.date_created ) AS minCreatedAt,
			MAX( c.date_created ) AS maxCreatedAt
		FROM
			blog_comment c
		-- Since we are using a LATERAL derived table, it means that this inner
		-- query is run once per outer-query row. Which means, we can now
		-- reference the row in the outer-query in order to perform the
		-- necessary filtering.
		WHERE
			c.member_id = m.id

	) AS stats
ON
	m.id = @userID -- Filter OUTER QUERY to given member.
;

As you can see, I included the LATERAL keyword just prior to my derived table calculation. This changes the derived table from one that must be constant within the query execution plan to one that will execute once per row in the outer query. This means that our derived table query can reference the outer query which allows us to simplify:

  1. Remove the duplicated filtering reference, @userID, in the derived table query – we can now limit the query based on the condition, c.member_id = m.id.

  2. Remove the INNER JOIN condition, stats.member_id = m.id. Since we know that our LATERAL derived table query is running once per row, there’s no need to limit the relationship within the ON clause – the derived table SQL is already applying the JOIN condition, essentially.

  3. Remove the c.member_id column from the derived table query. Since we no longer need to use it within the JOIN condition’s ON clause, we no longer need to select the column.

  4. Since we are able to remove the non-aggregate column (c.member_id) from the derived table query, we are also able to remove the GROUP BY and just run the aggregates (COUNT, MIN, MAX) on the entirety of the derived recordset.

Even though this query is relatively low in complexity – we’re filtering based on a single ID, using the LATERAL derived table is already simplifying the structure of the SQL. The more complex the query gets – and the more filtering it uses – the more savings we’ll see with this new type of per-row cross-product.





Source link

Leave a reply

Please enter your comment!
Please enter your name here