Lesser-Known Capabilities of SQL


86
views
2
3 months ago by
What useful thing(s) can you do with SQL that the average person probably doesn't know about?
Community: Zimmer Biomet

4 Answers


2
3 months ago by
I love this question. I've got a few but I'll break them down into different anwers:

1) Temp Tables and CTEs vs. Subqueries

Temporary tables (temp tables) and Common Table Expressions (CTEs) allow you to create temporary data transformations and views within a query, much like a subquery. The main difference is that subqueries often can't be optimized, making them slower at scale. The database optimizer applies different algorithms to join different datasets, but only focuses on physical tables (i.e. ones that have been declared before). Because a subquery is contained within the original query, it doesn't get treated as a physical table and can lead to slower performance.

For example:
SELECT t1.*, t2.bar
FROM table1 t1
LEFT JOIN (
           SELECT b.bar, f.foo
           FROM bar b
           INNER JOIN foo f
           ON f.id = b.id
           ) AS t2
ON t2.foo = t1.foo​

The query within the query can trip up the optimizer. Additionally, this can be harder to read and troubleshoot if there are 2+ subqueries, or subqueries within subqueries.

Temp tables allow you to declare and create tables before you run your query. They create temporary variables, so they don't disappear after being used once, and basically exist while you keep your SQL environment open (this can be good and bad).
Our example, but with a temp table:
CREATE TEMP TABLE table2 AS(
SELECT f.foo, b.bar
FROM bar b
INNER JOIN foo f
ON f.id = b.id
)

SELECT t1.*, t2.bar
FROM table1 t1
LEFT JOIN table2 t2
ON t2.foo = t1.foo​

This is a bit easier to read, simply because we (or others troubleshooting) see what table2 is supposed to be before reading the query joining table2 and table1. Additionally, temp tables persist in memory, meaning that they'll continue to exist in your local environment after you use this create statement, and then disappear once you close out. I like to think of them like a bar tab, but with data. The statement that I used to make this is a CREATE TABLE AS, which let's you just use a query to make the table. Finally, temp tables deal with a lot of data really well, making them operate well at scale. The major caveat is they'll continue to use memory while your SQL instance is still active, so you can clog up memory if you're not capable and dealing with a ton of data.

Pros: Can handle lots of data, will exist after you've created it, readability
Cons: A bit clunky to write, can clog memory/require maintenance


Common Table Expressions (CTEs) are by far my preferred method because they're easy to write, fast, and memory efficient.

WITH table2 AS(
SELECT b.bar, f.foo
FROM bar b
INNER JOIN foo f
ON f.id = b.id
)

SELECT t1.*, t2.bar
FROM table1 t1
LEFT JOIN table2 t2
ON t2.foo = t1.foo​

I like these because the syntax is so easy to use, and they're extremely flexible. They are created using WITH Queries. CTEs are great to use for a quick data transformation, or to compare data to itself. Another strength is that they only exist within the query, so they must be included within the query to run. This is also their greatest weakness, as they can only be used for a join once before being deleted. Thus, it can be useful to chain CTEs together, or simply use a temp table, if you'll be referencing the table more than once in the same query. CTEs are also pretty bad at large scale datasets (e.g. 100mil rows), so they don't scale as well as temp tables. However, most subqueries can be replaced by CTEs to improve performance, as they're basically the same thing (a query), just presented to the computer in different ways.

Pros: Speed, ease of use, no memory/resource issues, they look pretty
Cons: Disappear after 1 use, must be included in query every time, bad at large scale


TL;DR: CTEs and temp tables let you create new temporary tables on the fly, in code, to let you query and transform data even faster and in more complex ways.
2
3 months ago by
2) Window Functions

Window functions are a great way to subdivide and organize your data. You can rank things by percentile, or only compare averages by brand (or business). You can use them with SUM(), or RANK(), or ROW() to all show you a more useful picture of your data, or even just a simpler way to add it into a new table. Really nice for clean analysis, or for pushing into dashboards via Tableau or Excel.

OVER()
This lets you tell SQL how to compare apples to apples, or departments to departments. There's a nice primer in Window Functions, and we can go over a quick example.

When you provide OVER(), you usually want to include PARTITION BY, which basically tells the function what 'window' to look through. If I wanted to rank employees by their sales numbers, I could writer something like:

SELECT
e.emp_id,
e.emp_name,
s.sales,
RANK() OVER (PARTITION BY e.emp_id ORDER BY s.sales DESC) AS sales_rank
FROM employees e
INNER JOIN sales s
ON s.emp_id = e.emp_id​

This breaks up each rank number by distinct employee id, reorders it by the highest sales numbers, and then assigns a ranking based on that calc. I usually think of PARTITION BY as a different way of saying GROUP BY. If you wanted to just see the top 10 performers, you can just add: 
ORDER BY sales_rank DESC
LIMIT 10​

and you've got yourself a view to pop into Tableau or download to Excel.

Aggregate Functions

As mentioned earlier, you can use this logic on things like SUM() and AVG() as well. It can be useful when comparing countries or regions. For example:

WITH emp_sales AS(
SELECT
e.emp_id,
e.region,
SUM(s.sales) OVER (PARTITION BY e.emp_id, e.region) AS emp_sales
FROM employee e
INNER JOIN sales s
ON s.emp_id = e.emp_id
)


SELECT
e.emp_id,
e.emp_name,
e.region,
es.emp_sales,
AVG(es.emp_sales) OVER (PARTITION BY e.region) AS regional_avg
FROM employee e
INNER JOIN emp_sales es
ON es.emp_id = e.emp_id
AND es.region = e.region


In addition to employee sales, we can also calculate the regional average for comparison, which will change based on region.

Window functions can be a powerful tool when ranking, subdividing, or generally improving SQL to give you more accurate or representative answers. They're especially useful for dashboards, KPI tracking, and other comparative metrics where that 'apples to apples' link is increasingly important. As a wrap up, here is a list of Window Functions (different than the first link!) and Aggregate Functions that use OVER/PARTITION BY.

Nice explanation.  Two of the lesser-known and extremely useful windowing functions are lead and lag, which let you peek at the next (lead) or previous (lag) rows.
written 3 months ago by Christopher Hamel  
Lead and Lag are very useful with the history tables in JDE: work order, purchase orders, item branch revisions, etc. Allows you to see status at certain times. Has been useful with Class A measures, lead time history, etc.
written 3 months ago by Michael Zorn  
2
3 months ago by
The FILTER clause in aggregate functions.  This works in both grouping queries (group by) and windowing functions (partition over).

Example of a query that summarizes backorders for domestic and international based on the sales office code:

select
  sum (case when h.sales_office_cde = '01' then l.backorder_qty else 0 end) as domestic_bo,
  sum (case when h.sales_office_cde = '02' then l.backorder_qty else 0 end) as intl_bo
from
  dcs.sales_order_line l
  join dcs.sales_order_header h on
    h.sales_order_id = l.sales_order_id and
    h.sales_order_entry_dte = l.sales_order_entry_dte and
    h.sales_office_cde = l.sales_office_cde​


Using FILTER instead, it would look like this:

select
  sum (l.backorder_qty) filter (where h.sales_office_cde = '01') as domestic_bo,
  sum (l.backorder_qty) filter (where h.sales_office_cde = '02') as intl_bo
from
  dcs.sales_order_line l
  join dcs.sales_order_header h on
    h.sales_order_id = l.sales_order_id and
    h.sales_order_entry_dte = l.sales_order_entry_dte and
    h.sales_office_cde = l.sales_office_cde


0
3 months ago by
generate_series is very useful for creating a series of objects.  For example,

 select generate_series (1, 100)​

Will create 100 rows, from 1 to 100.  There is an optional third parameter that allows an interval.

It also works with dates, so if we leverage the interval parameter we could do this:

select generate_series ('2018-01-01'::date, '2018-12-31'::date, interval '1 day')::date​

To generate every day in 2018.

Why is this useful?  If you are trying to create a graph, for example, of invoices by day and you want to include all dates, even those with no data, a normal query would leave holes in the data when there are no records with those days.  Generate_series lets you define the basis of your X values and not miss any datapoints, even if they don't exist in your "real" data.

 select
   s.invoice_date, sum (i.direct_sales_qty) as total_qty, sum (i.direct_sales_amt) as total_dollars
 from
   generate_series ('2018-01-01'::date, '2018-12-31'::date, interval '1 day') s(invoice_date)
   left join dcs.invoice_line i on
     s.invoice_date = i.invoice_dte and
     i.invoice_dte between '2018-01-01' and '2018-12-31'
 group by
   s.invoice_date
 order by
   s.invoice_date​
Please login to add an answer/comment or follow this question.

Similar posts:
Search »