When a SQL query groups rows by home address, how do you get the names and ages of the members of the groups whose average age is under 50?
Wherever window functions are available, this kind of thing is super convenient with a window function. Where they are not (e.g. MySQL before 8.0), well... just simulate it by hand.
Suppose the schema looks like this:
create table persons (
id integer,
address varchar(255),
age integer,
name varchar(255));
Then this query returns the address, name, and age of everyone living at an address whose group average age is under 50:
select address, name, age
from (select address, name, age, avg(age) over (partition by address) as avg_age from persons) p
where p.avg_age < 50;
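Where window functions are not available (the MySQL-before-8.0 case mentioned above), a rough equivalent is to join the table against per-address averages computed with a plain GROUP BY. A minimal sketch, assuming the same persons table:
select p.address, p.name, p.age
from persons p
join (select address, avg(age) as avg_age
      from persons
      group by address) g
  on g.address = p.address
where g.avg_age < 50;
(One small difference: if address can be NULL, the join drops those rows, whereas partition by treats NULL as its own group.)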
Running the window-function version in PostgreSQL, for example:
postgres=# create table persons (id integer, address varchar(255), age integer, name varchar(255));
CREATE TABLE
postgres=# insert into persons (id, address, age, name) values (1, 'foo', 25, 'Kaname');
INSERT 0 1
postgres=# insert into persons (id, address, age, name) values (2, 'foo', 70, 'Nakajima');
INSERT 0 1
postgres=# insert into persons (id, address, age, name) values (3, 'foo', 30, 'Lightning');
INSERT 0 1
postgres=# insert into persons (id, address, age, name) values (4, 'bar', 40, 'Webber');
INSERT 0 1
postgres=# insert into persons (id, address, age, name) values (5, 'bar', 65, 'Cruze');
INSERT 0 1
postgres=# insert into persons (id, address, age, name) values (6, 'quux', 20, 'Tom');
INSERT 0 1
postgres=# select * from persons;
id | address | age | name
----+---------+-----+-----------
1 | foo | 25 | Kaname
2 | foo | 70 | Nakajima
3 | foo | 30 | Lightning
4 | bar | 40 | Webber
5 | bar | 65 | Cruze
6 | quux | 20 | Tom
(6 rows)
postgres=# select address, name, age, avg(age) over (partition by address) from persons;
address | name | age | avg
---------+-----------+-----+---------------------
bar | Webber | 40 | 52.5000000000000000
bar | Cruze | 65 | 52.5000000000000000
foo | Kaname | 25 | 41.6666666666666667
foo | Nakajima | 70 | 41.6666666666666667
foo | Lightning | 30 | 41.6666666666666667
quux | Tom | 20 | 20.0000000000000000
(6 rows)
postgres=# select address, name, age
postgres-# from (select address, name, age, avg(age) over (partition by address) as avg_age from persons) p
postgres-# where p.avg_age < 50;
address | name | age
---------+-----------+-----
foo | Kaname | 25
foo | Nakajima | 70
foo | Lightning | 30
quux | Tom | 20
(4 rows)
postgres=#
Or run the very same SQL statement in Spark SQL:
scala> val df = Seq(
| (1, "foo", 25, "Kaname"),
| (2, "foo", 70, "Nakajima"),
| (3, "foo", 30, "Lightning"),
| (4, "bar", 40, "Webber"),
| (5, "bar", 65, "Cruze"),
| (6, "quux", 20, "Tom")
| ).toDF("id", "address", "age", "name")
df: org.apache.spark.sql.DataFrame = [id: int, address: string ... 2 more fields]
scala> df.createOrReplaceTempView("persons")
scala> spark.sql("""
| select address, name, age
| from (select address, name, age, avg(age) over (partition by address) as avg_age from persons) p
| where p.avg_age < 50
| """).show
+-------+---------+---+
|address| name|age|
+-------+---------+---+
| foo| Kaname| 25|
| foo| Nakajima| 70|
| foo|Lightning| 30|
| quux| Tom| 20|
+-------+---------+---+
scala>
Super convenient.
Comparing the execution plans for this statement in PostgreSQL and Spark SQL, they turn out to be almost identical:
PostgreSQL:
                                  QUERY PLAN
------------------------------------------------------------------------------
 Subquery Scan on p  (cost=12.85..14.95 rows=23 width=1036)
   Filter: (p.avg_age < '50'::numeric)
   ->  WindowAgg  (cost=12.85..14.07 rows=70 width=1068)
         ->  Sort  (cost=12.85..13.02 rows=70 width=1036)
               Sort Key: persons.address
               ->  Seq Scan on persons  (cost=0.00..10.70 rows=70 width=1036)
Spark SQL:
== Physical Plan ==
*Project [address#10, name#12, age#11]
+- *Filter (isnotnull(avg_age#430) && (avg_age#430 < 50.0))
   +- Window [avg(cast(age#11 as bigint)) windowspecdefinition(address#10, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS avg_age#430], [address#10]
      +- *Sort [address#10 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(address#10, 200)
            +- LocalTableScan [address#10, name#12, age#11]
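Both plans are easy to reproduce yourself: in psql, prefix the statement with EXPLAIN; in the Spark shell, call explain on the DataFrame returned by spark.sql. For instance, in psql (a sketch against the same persons table as above):
explain
select address, name, age
from (select address, name, age, avg(age) over (partition by address) as avg_age from persons) p
where p.avg_age < 50;
The cost and row estimates depend on the table's statistics, so the exact numbers will vary.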
>_<
select name, age
from persons
where address in
  (select address from persons
   group by address
   having avg(age) < 50);
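This is the hand-rolled route for engines without window functions: the inner query keeps only the addresses whose average age is under 50, and the outer query pulls back every person at those addresses. On the sample data above it picks out foo and quux and drops bar, so it returns the same four people as the window-function version. A quick check of the inner query alone:
select address, avg(age)
from persons
group by address
having avg(age) < 50;
-- expected with the sample rows above: foo (~41.67) and quux (20); bar (52.5) is filtered out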