Lesser-Known Capabilities of SQL
1) Temp Tables and CTEs vs. Subqueries
Temporary tables (temp tables) and Common Table Expressions (CTEs) let you stage intermediate data transformations within a query, much like a subquery. The practical differences are readability and, in some engines, performance: the optimizer picks join algorithms based on what it knows about each dataset, and a query buried inside another query can be harder for it to plan well. Deeply nested subqueries can therefore lead to slower performance at scale.
SELECT t1.*, t2.bar
FROM table1 t1
LEFT JOIN (
    SELECT b.bar, f.foo
    FROM bar b
    INNER JOIN foo f ON f.id = b.id
) AS t2 ON t2.foo = t1.foo
The query within the query can trip up the optimizer. It's also harder to read and troubleshoot once there are two or more subqueries, or subqueries nested within subqueries.
Temp tables let you declare and populate a table before you run your main query. Unlike a subquery, a temp table doesn't disappear after one use: it persists for as long as your session (your SQL connection) stays open, which can be both good and bad.
Our example, but with a temp table:
CREATE TEMP TABLE table2 AS (
    SELECT f.foo, b.bar
    FROM bar b
    INNER JOIN foo f ON f.id = b.id
);

SELECT t1.*, t2.bar
FROM table1 t1
LEFT JOIN table2 t2 ON t2.foo = t1.foo
This is a bit easier to read, simply because we (or others troubleshooting) see what table2 is supposed to be before reading the query joining table2 and table1. Additionally, temp tables persist in memory, meaning they'll continue to exist in your session after the CREATE statement runs, and then disappear once you close out. I like to think of them like a bar tab, but with data. The statement I used to make this is CREATE TABLE AS, which lets you build the table straight from a query. Finally, temp tables deal with a lot of data really well, making them operate well at scale. The major caveat is that they'll keep using memory while your SQL session is still active, so you can clog up memory if you're not careful and dealing with a ton of data.
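Since they stick around, it's worth closing out the tab when you're done; a quick cleanup statement frees the memory before the session ends:

```sql
-- release the temp table as soon as you no longer need it
DROP TABLE IF EXISTS table2;
```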
Pros: Can handle lots of data, will exist after you've created it, readability
Cons: A bit clunky to write, can clog memory/require maintenance
Common Table Expressions (CTEs) are by far my preferred method because they're easy to write, fast, and memory efficient.
WITH table2 AS (
    SELECT b.bar, f.foo
    FROM bar b
    INNER JOIN foo f ON f.id = b.id
)
SELECT t1.*, t2.bar
FROM table1 t1
LEFT JOIN table2 t2 ON t2.foo = t1.foo
I like these because the syntax is so easy to use, and they're extremely flexible. They're created with a WITH clause. CTEs are great for a quick data transformation, or for comparing data to itself. Another strength is that they exist only within the single statement that defines them, so everything you need travels with the query. This is also their greatest weakness: once the statement finishes, the CTE is gone, and you have to redefine it in every query that needs it. Thus, it can be useful to chain CTEs together, or to switch to a temp table if several separate queries need the same intermediate result. CTEs can also struggle with large datasets (e.g. 100 million rows), so they don't scale as well as temp tables. Still, most subqueries can be replaced by CTEs to improve readability (and often performance), as they're basically the same thing (a query), just presented to the engine in different ways.
Pros: Speed, ease of use, no memory/resource issues, they look pretty
Cons: Gone once the statement finishes, must be redefined in every query that uses them, struggle at large scale
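Chaining CTEs just means listing several in one WITH clause, each able to reference the ones before it. A sketch reusing the hypothetical foo/bar/table1 tables from above, where a second CTE (table3, a made-up name) aggregates the first:

```sql
WITH table2 AS (
    SELECT b.bar, f.foo
    FROM bar b
    INNER JOIN foo f ON f.id = b.id
),
table3 AS (
    -- builds on table2, counting bar values per foo
    SELECT t2.foo, COUNT(t2.bar) AS bar_count
    FROM table2 t2
    GROUP BY t2.foo
)
SELECT t1.*, t3.bar_count
FROM table1 t1
LEFT JOIN table3 t3 ON t3.foo = t1.foo;
```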
TL;DR: CTEs and temp tables let you create new temporary tables on the fly, in code, to let you query and transform data even faster and in more complex ways.
2) Window Functions
Window functions are a great way to subdivide and organize your data. You can rank things by percentile, or only compare averages within a brand (or business). You can use them with SUM(), RANK(), or ROW_NUMBER() to show you a more useful picture of your data, or even just a simpler way to add it into a new table. Really nice for clean analysis, or for pushing into dashboards via Tableau or Excel.
This lets you tell SQL how to compare apples to apples, or departments to departments. There's a nice primer in Window Functions, and we can go over a quick example.
When you provide OVER(), you can include PARTITION BY, which basically tells the function which 'window' of rows to look through; leave it out and the window is the entire result set. If I wanted to rank employees by their sales numbers, I could write something like:
SELECT e.emp_id, e.emp_name, s.sales,
    RANK() OVER (ORDER BY s.sales DESC) AS sales_rank
FROM employees e
INNER JOIN sales s ON s.emp_id = e.emp_id
This orders the rows by the highest sales numbers and assigns a ranking based on that order. Adding a PARTITION BY (say, PARTITION BY e.region) would restart the ranking within each region; I usually think of PARTITION BY as a different way of saying GROUP BY. If you wanted to just see the top 10 performers, you can just add:
ORDER BY sales_rank
LIMIT 10
and you've got yourself a view to pop into Tableau or download to Excel.
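One catch: a rank alias can't go straight into a WHERE clause, because window functions are evaluated after WHERE. If you want "rank 10 or better" as a filter rather than an ORDER BY/LIMIT, wrap the ranking in a CTE first. A sketch using the same hypothetical employees/sales tables:

```sql
WITH ranked AS (
    SELECT e.emp_id, e.emp_name, s.sales,
        RANK() OVER (ORDER BY s.sales DESC) AS sales_rank
    FROM employees e
    INNER JOIN sales s ON s.emp_id = e.emp_id
)
SELECT *
FROM ranked
WHERE sales_rank <= 10  -- legal here: the outer WHERE runs after the window is computed
ORDER BY sales_rank;
```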
As mentioned earlier, you can use this logic on things like SUM() and AVG() as well. It can be useful when comparing countries or regions. For example:
WITH emp_sales AS (
    -- DISTINCT collapses the repeated windowed total when an employee has multiple sales rows
    SELECT DISTINCT e.emp_id, e.region,
        SUM(s.sales) OVER (PARTITION BY e.emp_id, e.region) AS emp_sales
    FROM employees e
    INNER JOIN sales s ON s.emp_id = e.emp_id
)
SELECT e.emp_id, e.emp_name, e.region, es.emp_sales,
    AVG(es.emp_sales) OVER (PARTITION BY e.region) AS regional_avg
FROM employees e
INNER JOIN emp_sales es ON es.emp_id = e.emp_id AND es.region = e.region
In addition to employee sales, we can also calculate the regional average for comparison, which will change based on region.
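You can take the comparison one step further and measure each employee against their region's average in the same pass. A sketch on the same hypothetical tables (vs_regional_avg is a made-up column name):

```sql
WITH emp_sales AS (
    SELECT e.emp_id, e.region, SUM(s.sales) AS emp_sales
    FROM employees e
    INNER JOIN sales s ON s.emp_id = e.emp_id
    GROUP BY e.emp_id, e.region
)
SELECT es.emp_id, es.region, es.emp_sales,
    -- positive means the employee is above their region's average
    es.emp_sales - AVG(es.emp_sales) OVER (PARTITION BY es.region) AS vs_regional_avg
FROM emp_sales es;
```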
Window functions can be a powerful tool when ranking, subdividing, or generally improving SQL to give you more accurate or representative answers. They're especially useful for dashboards, KPI tracking, and other comparative metrics where that 'apples to apples' link is increasingly important. As a wrap up, here is a list of Window Functions (different than the first link!) and Aggregate Functions that use OVER/PARTITION BY.
3) FILTER
FILTER attaches a WHERE condition to an individual aggregate, which is handy for pivot-style summaries. The classic workaround is a CASE expression inside the aggregate. Here's an example of a query that summarizes backorders for domestic and international orders based on the sales office code:
SELECT
    SUM(CASE WHEN h.sales_office_cde = '01' THEN l.backorder_qty ELSE 0 END) AS domestic_bo,
    SUM(CASE WHEN h.sales_office_cde = '02' THEN l.backorder_qty ELSE 0 END) AS intl_bo
FROM dcs.sales_order_line l
JOIN dcs.sales_order_header h
    ON h.sales_order_id = l.sales_order_id
    AND h.sales_order_entry_dte = l.sales_order_entry_dte
    AND h.sales_office_cde = l.sales_office_cde
Using FILTER instead, it would look like this:
SELECT
    SUM(l.backorder_qty) FILTER (WHERE h.sales_office_cde = '01') AS domestic_bo,
    SUM(l.backorder_qty) FILTER (WHERE h.sales_office_cde = '02') AS intl_bo
FROM dcs.sales_order_line l
JOIN dcs.sales_order_header h
    ON h.sales_order_id = l.sales_order_id
    AND h.sales_order_entry_dte = l.sales_order_entry_dte
    AND h.sales_office_cde = l.sales_office_cde
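FILTER isn't limited to SUM(); it works with any aggregate, which makes quick side-by-side counts easy too. A sketch against the same hypothetical header table:

```sql
SELECT
    COUNT(*) FILTER (WHERE h.sales_office_cde = '01') AS domestic_orders,
    COUNT(*) FILTER (WHERE h.sales_office_cde = '02') AS intl_orders
FROM dcs.sales_order_header h;
```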
4) generate_series()
generate_series() builds rows out of thin air. For example:
SELECT generate_series(1, 100)
Will create 100 rows, from 1 to 100. There is an optional third parameter that sets the step between values.
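For example, stepping by 10:

```sql
SELECT generate_series(0, 100, 10);
-- returns 0, 10, 20, ... 100 (11 rows)
```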
It also works with dates, so if we leverage the interval parameter we could do this:
SELECT generate_series('2018-01-01'::date, '2018-12-31'::date, interval '1 day')::date
To generate every day in 2018.
Why is this useful? If you're trying to build a graph of, say, invoices by day and you want to include all dates, a normal query would leave holes wherever there are no records for a given day. generate_series lets you define the basis of your X values so you don't miss any datapoints, even ones that don't exist in your "real" data.
SELECT s.invoice_date,
    SUM(i.direct_sales_qty) AS total_qty,
    SUM(i.direct_sales_amt) AS total_dollars
FROM generate_series('2018-01-01'::date, '2018-12-31'::date, interval '1 day') s(invoice_date)
LEFT JOIN dcs.invoice_line i
    ON s.invoice_date = i.invoice_dte
    AND i.invoice_dte BETWEEN '2018-01-01' AND '2018-12-31'
GROUP BY s.invoice_date
ORDER BY s.invoice_date