
SQL: grouping by home address, how do you query the names and ages of the members of every group whose average age is under 50?


Wherever window functions are available, this kind of thing is super convenient with a window function. Where they are not available (e.g. MySQL before 8.0), well... just simulate it by hand, as sketched right below.
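A minimal sketch of the hand-rolled version (not part of the original answer; it assumes the persons schema defined below): precompute the per-address average in a derived table with GROUP BY, then join it back to the rows you want to keep:

-- hypothetical workaround for engines without window functions:
-- compute the per-address average separately, then join it back
select p.address, p.name, p.age
from persons p
join (select address, avg(age) as avg_age
      from persons
      group by address) g
  on g.address = p.address
where g.avg_age < 50;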

Suppose we have a schema like this:

create table persons (
  id integer,
  address varchar(255),
  age integer,
  name varchar(255));

Then this query returns the address, name, and age of everyone in the address groups whose average age is under 50 (the window aggregate lives in a derived table because a window function can't be referenced directly in the WHERE clause):

select address, name, age
from (select address, name, age, avg(age) over (partition by address) as avg_age from persons) p
where p.avg_age < 50;

For example, in PostgreSQL:

postgres=# create table persons (id integer, address varchar(255), age integer, name varchar(255));
CREATE TABLE
postgres=# insert into persons (id, address, age, name) values (1, 'foo', 25, 'Kaname');
INSERT 0 1
postgres=# insert into persons (id, address, age, name) values (2, 'foo', 70, 'Nakajima');
INSERT 0 1
postgres=# insert into persons (id, address, age, name) values (3, 'foo', 30, 'Lightning');
INSERT 0 1
postgres=# insert into persons (id, address, age, name) values (4, 'bar', 40, 'Webber');
INSERT 0 1
postgres=# insert into persons (id, address, age, name) values (5, 'bar', 65, 'Cruze');
INSERT 0 1
postgres=# insert into persons (id, address, age, name) values (6, 'quux', 20, 'Tom');
INSERT 0 1
postgres=# select * from persons;
 id | address | age |   name
----+---------+-----+-----------
  1 | foo     |  25 | Kaname
  2 | foo     |  70 | Nakajima
  3 | foo     |  30 | Lightning
  4 | bar     |  40 | Webber
  5 | bar     |  65 | Cruze
  6 | quux    |  20 | Tom
(6 rows)

postgres=# select address, name, age, avg(age) over (partition by address) from persons;
 address |   name    | age |         avg
---------+-----------+-----+---------------------
 bar     | Webber    |  40 | 52.5000000000000000
 bar     | Cruze     |  65 | 52.5000000000000000
 foo     | Kaname    |  25 | 41.6666666666666667
 foo     | Nakajima  |  70 | 41.6666666666666667
 foo     | Lightning |  30 | 41.6666666666666667
 quux    | Tom       |  20 | 20.0000000000000000
(6 rows)

postgres=# select address, name, age
postgres-# from (select address, name, age, avg(age) over (partition by address) as avg_age from persons) p
postgres-# where p.avg_age < 50;
 address |   name    | age
---------+-----------+-----
 foo     | Kaname    |  25
 foo     | Nakajima  |  70
 foo     | Lightning |  30
 quux    | Tom       |  20
(4 rows)

postgres=#

Or the exact same SQL statement in Spark SQL:

scala> val df = Seq(
| (1, "foo", 25, "Kaname"),
| (2, "foo", 70, "Nakajima"),
| (3, "foo", 30, "Lightning"),
| (4, "bar", 40, "Webber"),
| (5, "bar", 65, "Cruze"),
| (6, "quux", 20, "Tom")
| ).toDF("id", "address", "age", "name")
df: org.apache.spark.sql.DataFrame = [id: int, address: string ... 2 more fields]

scala> df.createOrReplaceTempView("persons")

scala> spark.sql("""
| select address, name, age
| from (select address, name, age, avg(age) over (partition by address) as avg_age from persons) p
| where p.avg_age < 50
| """).show
+-------+---------+---+
|address|     name|age|
+-------+---------+---+
|    foo|   Kaname| 25|
|    foo| Nakajima| 70|
|    foo|Lightning| 30|
|   quux|      Tom| 20|
+-------+---------+---+

scala>

Super convenient.

Compare the execution plans for this statement in PostgreSQL and Spark SQL: they are almost identical:
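(A side note, not from the original answer: plans like the ones below can be printed with a plain EXPLAIN in psql, and on the Spark side by calling .explain on the Dataset returned by spark.sql(...). Roughly:)

-- print PostgreSQL's estimated plan for the window-function query
explain
select address, name, age
from (select address, name, age,
             avg(age) over (partition by address) as avg_age
      from persons) p
where p.avg_age < 50;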

PostgreSQL:

                                  QUERY PLAN
------------------------------------------------------------------------------
 Subquery Scan on p  (cost=12.85..14.95 rows=23 width=1036)
   Filter: (p.avg_age < '50'::numeric)
   ->  WindowAgg  (cost=12.85..14.07 rows=70 width=1068)
         ->  Sort  (cost=12.85..13.02 rows=70 width=1036)
               Sort Key: persons.address
               ->  Seq Scan on persons  (cost=0.00..10.70 rows=70 width=1036)

Spark SQL:

== Physical Plan ==
*Project [address#10, name#12, age#11]
+- *Filter (isnotnull(avg_age#430) && (avg_age#430 < 50.0))
   +- Window [avg(cast(age#11 as bigint)) windowspecdefinition(address#10, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS avg_age#430], [address#10]
      +- *Sort [address#10 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(address#10, 200)
            +- LocalTableScan [address#10, name#12, age#11]

>_<


select name, age
from persons
where address in
  (select address
   from persons
   group by address
   having avg(age) < 50);


