SparkSQL-优雅行列转换

行列转换是数据处理与分析中的关键操作，它能够将数据的结构从行转为列，或从列转为行。这种转换不仅简化了复杂的数据展示，还提升了数据分析的效率。在业务场景中，行列转换常用于报表生成、数据透视和多维度数据分析，通过更直观的方式呈现数据，帮助管理者快速获取关键信息。此外，它还能有效减少数据冗余，优化查询性能，满足灵活多变的业务需求。无论是在财务报表、销售分析，还是市场趋势分析中，行列转换都是不可或缺的工具。

本文会基于 SparkSQL 3.5.x 给出常用的行列转换方式，但本文的重点是介绍pivot和unpivot子句在行列转换场景的应用，其中细节、优雅程度交由开发者自己选择

一、数据准备

以下是城市各年GDP 数据的表结构和测试数据，用于后续演示行列转换

sql">create table city_gdp
(city string comment '城市名',year int comment '年份',gdp  double comment '单位：亿'
) comment '城市 gdp'stored as parquet;insert into city_gdp (city, year, gdp)
values ('北京', 2018, 30320.00),('北京', 2019, 35370.00),('北京', 2020, 36100.00),('上海', 2018, 32600.00),('上海', 2019, 38160.00),('上海', 2020, 38700.00),('广州', 2018, 23000.00),('广州', 2019, 23628.00),('广州', 2020, 25019.00);

二、行列转换

2.1 传统方式

2.1.1 行转列

统计各城市 2018-2020 之间的 GDP，不同年份作为单独列显示。下面是 sql 实现

sql">select city,max(case when year = 2018 then gdp end) as `2018`,max(case when year = 2019 then gdp end) as `2019`,max(case when year = 2020 then gdp end) as `2020`
from city_gdp
group by city;+-------+----------+----------+----------+
| city  |   2018   |   2019   |   2020   |
+-------+----------+----------+----------+
| 上海    | 32600.0  | 38160.0  | 38700.0  |
| 广州    | 23000.0  | 23628.0  | 25019.0  |
| 北京    | 30320.0  | 35370.0  | 36100.0  |
+-------+----------+----------+----------+

2.1.2 列转行

为了更好的演示，将 2.1.1 的结果存储到临时表

sql">create table tmp_city_gdp stored as parquet as
select city,max(case when year = 2018 then gdp end) as `2018`,max(case when year = 2019 then gdp end) as `2019`,max(case when year = 2020 then gdp end) as `2020`
from city_gdp
group by city;

基于 2.1.1 的结果，统计各城市 2018-2020 之间的 GDP，不同年份作为单独行显示。下面是 sql 实现

sql">select city, '2018' as year, `2018` as gdp
from tmp_city_gdp
union all
select city, '2019', `2019`
from tmp_city_gdp
union all
select city, '2020', `2020`
from tmp_city_gdp;+-------+-------+----------+
| city  | year  |   gdp    |
+-------+-------+----------+
| 上海    | 2018  | 32600.0  |
| 北京    | 2018  | 30320.0  |
| 广州    | 2018  | 23000.0  |
| 上海    | 2019  | 38160.0  |
| 北京    | 2019  | 35370.0  |
| 广州    | 2019  | 23628.0  |
| 上海    | 2020  | 38700.0  |
| 北京    | 2020  | 36100.0  |
| 广州    | 2020  | 25019.0  |
+-------+-------+----------+

2.2 pivot 和 unpivot

2.2.1 行转列

pivot的标准语法如下

sql">SELECT [columns]
FROM (SELECT [columns]FROM table_name
) 
PIVOT (aggregate_function([column]) FOR [column_to_pivot] IN ([pivot_values])
)

pivot子句是可以完全替代case when

sql">select *
from (select city, year, gdpfrom city_gdp)
pivot (sum(gdp) for year in (2018, 2019, 2020)
);+-------+----------+----------+----------+
| city  |   2018   |   2019   |   2020   |
+-------+----------+----------+----------+
| 上海    | 32600.0  | 38160.0  | 38700.0  |
| 北京    | 30320.0  | 35370.0  | 36100.0  |
| 广州    | 23000.0  | 23628.0  | 25019.0  |
+-------+----------+----------+----------+

2.2.2 列转行

unpivot的标准语法如下

sql">SELECT [columns]
FROM table_name 
UNPIVOT [ { INCLUDE | EXCLUDE } NULLS ](column_value_unpivot FOR name_column IN (unpivot_column [as alias])
)

Tips:

unpivot 子句会产生新的字段，列名需要自定义
column_value_unpivot 定义 unpivot 度量值字段名
name_column 定义 unpivot 维度值字段名
unpivot_column 限定 unpivot 需要转成行的列，从 table_name 中选取且作为 name_colume 的取值
unpivot 列转行时默认剔除 null 值，如需要体现 null 值所在的 unpivot_column 需要使用 unpivot include nulls

unpivot子句是可以完全替代union all

sql">select *
from tmp_city_gdp
unpivot include nulls (gdp for year in (`2018`, `2019`, `2020`)
);+-------+-------+----------+
| city  | year  |   gdp    |
+-------+-------+----------+
| 上海    | 2018  | 32600.0  |
| 上海    | 2019  | 38160.0  |
| 上海    | 2020  | 38700.0  |
| 北京    | 2018  | 30320.0  |
| 北京    | 2019  | 35370.0  |
| 北京    | 2020  | 36100.0  |
| 广州    | 2018  | 23000.0  |
| 广州    | 2019  | 23628.0  |
| 广州    | 2020  | 25019.0  |
+-------+-------+----------+