一、背景
在对一个Hive历史Job优化后需要比对两张结果表的数据,发现有一张表两个字段有一些差异(优化只是加了优化参数没有修改逻辑,所以数据有差异是不正常的)
二、定位
根据差异字段定位到是一张临时表里面的使用了row_num引起的
1、现象
就是以下代码在同一天不同时间段跑,最终数据量不一致
drop table if exists tz0002_v2 purge;
create external table tmp_tz0002_v2 stored as orc tblproperties ('external.table.purge'='true') as
select cast(vip_card as bigint) member17_nbr,date_format(a.operate_time,'yyyy-MM') as auto_renew_month --签约年月
from (select card_no as vip_card,case when status='1' then '签约' when status='-1' then '解约' else status end status, operate_time, -- 操作时间row_number() over (partition by card_no order by operate_time desc) rnfrom auto_renew_sign_d t1 where date_format(t1.operate_time,'yyyy-MM-dd') <= '${yestoday_str}' )awhere rn=1and status='签约'
group by cast(vip_card as bigint), date_format(a.operate_time,'yyyy-MM')
;
select count(1) as cnt , 'v1' as type from tz0002_v1
union all
select count(1) as cnt , 'v2' as type from tz0002_v2
+----------+-------+
| cnt | type |
+----------+-------+
| 2748473 | v1|
| 2748463 | v2|
+----------+-------+
2、找出差异的id
select a.*
from
tz0002_v1 a
left join tz0002_v2 b
on a.member17_nbr=b.member17_nbr
where b.member17_nbr is null;
拿具体差异的id查看数据
select
uid
,card_no
,status
,operate_time
,row_number() over (partition by card_no order by operate_time desc) rn
from cn_ods_dmp_secure.o_bas_auto_renew_sign_d
where card_no='107421002798006351';
;
+--------------+--------------------+---------+----------------------+-----+
| uid | card_no | status | operate_time | rn |
+--------------+--------------------+---------+----------------------+-----+
| 18181734302 | 107421002798006351 | 1 | 2023-05-14 13:56:19 | 1 |
| 18181734302 | 107421002798006351 | -1 | 2023-05-14 13:56:19 | 2 |
| 18181734302 | 107421002798006351 | 1 | 2022-11-29 16:10:29 | 3 |
| 18181734302 | 107421002798006351 | -1 | 2022-04-09 12:06:28 | 4 |
| 18181734302 | 107421002798006351 | 1 | 2021-12-07 13:37:42 | 5 |
+--------------+--------------------+---------+----------------------+-----+
以card_no分组operate_time倒序排序,会发现最新的两条记录 operate_time相同,status不同【那么不同时间跑有可能这次排序号1的是状态等于1的,下次跑是状态等于-1的】,最终会取状态等于1的且rn等于1的数据
最后然后发现每个ID会有一个update_time,而且最新时间两条不同状态的数据update_time不相同,那么在row_num中将operate_time 修改成update_time更加符合具体业务逻辑,得到的数据也更加准确
+--------------------+---------+----------------------+----------------------+
| card_no | status | operate_time | update_time |
+--------------------+---------+----------------------+----------------------+
| 107421002798006351 | 1 | 2021-12-07 13:37:42 | 2021-12-07 13:37:44 |
| 107421002798006351 | -1 | 2022-04-09 12:06:28 | 2022-04-09 12:06:29 |
| 107421002798006351 | 1 | 2022-11-29 16:10:29 | 2022-11-29 16:10:32 |
| 107421002798006351 | 1 | 2023-05-14 13:56:19 | 2023-05-14 13:56:19 |
| 107421002798006351 | -1 | 2023-05-14 13:56:19 | 2023-05-14 13:56:20 |
+--------------------+---------+----------------------+----------------------+
三 、修复
将代码修改成下面
drop table if exists tz0002_v2 purge;
create external table tmp_tz0002_v2 stored as orc tblproperties ('external.table.purge'='true') as
select cast(vip_card as bigint) member17_nbr,date_format(a.operate_time,'yyyy-MM') as auto_renew_month --签约年月
from (select card_no as vip_card,case when status='1' then '签约' when status='-1' then '解约' else status end status, operate_time, -- 操作时间row_number() over (partition by card_no order by update_time desc) rnfrom auto_renew_sign_d t1 where date_format(t1.operate_time,'yyyy-MM-dd') <= '${yestoday_str}' )awhere rn=1and status='签约'
group by cast(vip_card as bigint), date_format(a.operate_time,'yyyy-MM')
;
经过测试不同时间段跑出来的结果都是相同的,以前同事上线的历史代码修改后再比对数据也是正常的