【企业流行新数仓】Day02:DWS层(按日分区的宽表)、DWT层(全量累计表)、ADS层、总结

2023-02-14,,,,

一、DWS层

1、概括

dwd层的数据,每日轻度聚合,建宽表

表名 粒度
dws_uv_detail_daycount 一个设备是一行
dws_user_action_daycount(只统计今天登录的会员) 一个会员是一行
dws_sku_action_daycount(只统计被下单或平均或支付或加购或收藏的商品) 一个商品是一行
dws_coupon_use_daycount(只统计未过期的优惠券) 一个优惠券是一行
dws_activity_info_daycount(统计所有活动) 一个活动是一行
dws_sale_detail_daycount(每日购买数据) 一个用户购买的一款商品是一行

2、dws_uv_detail_daycount(每日设备行为)-一台设备有多个行为,每列的多个行为进行字符串拼接

(1)建表

create external table dws_uv_detail_daycount
(
-- 从启动日志dwd_start_log表取以下字段
`mid_id` string COMMENT '设备唯一标识',
`user_id` string COMMENT '用户标识',
`version_code` string COMMENT '程序版本号',
`version_name` string COMMENT '程序版本名',
`lang` string COMMENT '系统语言',
`source` string COMMENT '渠道号',
`os` string COMMENT '安卓系统版本',
`area` string COMMENT '区域',
`model` string COMMENT '手机型号',
`brand` string COMMENT '手机品牌',
`sdk_version` string COMMENT 'sdkVersion',
`gmail` string COMMENT 'gmail',
`height_width` string COMMENT '屏幕宽高',
`app_time` string COMMENT '客户端日志产生时的时间',
`network` string COMMENT '网络模式',
`lng` string COMMENT '经度',
`lat` string COMMENT '纬度',
-- 从启动日志dwd_start_log表按照mid_id进行聚合,之后count(*)取以下字段
`login_count` bigint COMMENT '活跃次数'
)

(2)数据导入

insert overwrite table dws_uv_detail_daycount PARTITION(dt='2020-05-06')
select
mid_id,
concat_ws('|',collect_set(user_id)),
concat_ws('|',collect_set(version_code)),
concat_ws('|',collect_set(version_name)),
concat_ws('|',collect_set(lang)),
concat_ws('|',collect_set(source)),
concat_ws('|',collect_set(os)),
concat_ws('|',collect_set(area)),
concat_ws('|',collect_set(model)),
concat_ws('|',collect_set(brand)),
concat_ws('|',collect_set(sdk_version)),
concat_ws('|',collect_set(gmail)),
concat_ws('|',collect_set(height_width)),
concat_ws('|',collect_set(app_time)),
concat_ws('|',collect_set(network)),
concat_ws('|',collect_set(lng)),
concat_ws('|',collect_set(lat)),
count(*)
FROM dwd_start_log where dt='2020-05-06'
GROUP by mid_id

3、dws_user_action_daycount(每日会员行为)

用户登录、加购、下单、支付次数及金额

导入数据时,类似于建表/视图操作

witht1 
as (select user_id,count(*) login_count  from dwd_start_log where dt='2020-05-06' and user_id is not NULL GROUP BY user_id),
t3 as (select user_id,count(*) order_count, sum(final_total_amount) order_amount from dwd_fact_order_info where dt='2020-05-06' GROUP by user_id ),
t4 as (select user_id,count(*) payment_count,sum(payment_amount) payment_amount from dwd_fact_payment_info where dt='2020-05-06' GROUP by user_id),
t2 as (select user_id,count(*) cart_count,sum(cart_price*sku_num) cart_amount from dwd_fact_cart_info where dt='2020-05-06' and date_format(create_time,'yyyy-MM-dd')='2020-05-06' GROUP by user_id )
insert overwrite TABLE dws_user_action_daycount PARTITION(dt='2020-05-06')select   t1.user_id,login_count,   nvl(cart_count,0),   nvl(cart_amount,0),   nvl(order_count,0),  
nvl(order_amount,0),   nvl(payment_count,0),   nvl(payment_amount,0)from t1 left join t2 on t1.user_id=t2.user_idleft join t3 on t1.user_id=t3.user_idleft join t4 on t1.user_id=t4.user_id

4、dws_sku_action_daycount(每日商品行为)

被下单次数、被支付、退款、加购、好评、差评次数

with
t1 as
(select sku_id,count(*) order_count,sum(sku_num) order_num,
sum(total_amount) order_amount
from dwd_fact_order_detail where dt='2020-05-06'
GROUP by sku_id),
t2 as
(select
sku_id,
sum(sku_num) payment_num,sum(total_amount) payment_amount,
count(*) payment_count
from
(SELECT order_id,sku_id,sku_num,total_amount from dwd_fact_order_detail
where dt='2020-05-06' or dt=date_sub('2020-05-06',1)) tmp1
join
(select order_id from dwd_fact_payment_info where dt='2020-05-06') tmp2
on tmp1.order_id=tmp2.order_id
GROUP by sku_id),
t3 as
(SELECT sku_id,
count(*) refund_count,sum(refund_num) refund_num,
sum(refund_amount) refund_amount
from dwd_fact_order_refund_info where dt='2020-05-06'
GROUP by sku_id),
t4 as
(select sku_id,
count(*) cart_count,sum(sku_num) cart_num
from dwd_fact_cart_info where dt='2020-05-06' and sku_num>0
GROUP by sku_id),
t5 as
(SELECT sku_id,
count(*) favor_count
from dwd_fact_favor_info where dt='2020-05-06' and is_cancel=0
group by sku_id),
t6 as
(SELECT sku_id,
sum(if(appraise='1201',1,0)) appraise_good_count,
sum(if(appraise='1202',1,0)) appraise_mid_count,
sum(if(appraise='1203',1,0)) appraise_bad_count,
sum(if(appraise='1204',1,0)) appraise_default_count
from dwd_fact_comment_info where dt='2020-05-06'
group by sku_id)
insert overwrite table dws_sku_action_daycount partition(dt='2020-05-06')
SELECT
nvl(nvl(nvl(nvl(nvl(t1.sku_id,t2.sku_id),t3.sku_id),t4.sku_id),t5.sku_id),t6.sku_id),
nvl(order_count,0),
nvl(order_num,0),
nvl(order_amount,0),
nvl(payment_count,0),
nvl(payment_num,0),
nvl(payment_amount,0),
nvl(refund_count,0),
nvl(refund_num,0),
nvl(refund_amount,0),
nvl(cart_count,0),
nvl(cart_num,0),
nvl(favor_count,0),
nvl(appraise_good_count,0),
nvl(appraise_mid_count,0),
nvl(appraise_bad_count,0),
nvl(appraise_default_count,0)
from t1
full join t2 on t1.sku_id=t2.sku_id
full join t3 on t1.sku_id=t3.sku_id
full join t4 on t1.sku_id=t4.sku_id
full join t5 on t1.sku_id=t5.sku_id
full join t6 on t1.sku_id=t6.sku_id

5、dws_coupon_use_daycount(每日优惠券使用行为)

范围、商品id、品牌、品类、领用次数、下单次数

insert overwrite table dws_coupon_use_daycount PARTITION(dt='2020-05-06')
select
t1.id coupon_id,coupon_name, coupon_type, condition_amount,
condition_num, activity_id, benefit_amount, benefit_discount,
create_time, range_type, spu_id, tm_id, category3_id, limit_num,
get_count,using_count, used_count
from
(SELECT *
from dwd_dim_coupon_info
where dt='2020-05-06' and nvl(expire_time,'9999-99-99') >'2020-05-06') t1
left join
(select coupon_id,
sum(if(date_format(get_time,'yyyy-MM-dd')='2020-05-06',1,0)) get_count,
sum(if(date_format(using_time,'yyyy-MM-dd')='2020-05-06',1,0)) using_count,
sum(if(date_format(used_time,'yyyy-MM-dd')='2020-05-06',1,0)) used_count
from dwd_fact_coupon_use
GROUP by coupon_id) t2
on t1.id=t2.coupon_id

6、dws_activity_info_daycount(每日活动行为)

活动类型、时间、下单、支付次数

with
t1 as

(select
id,activity_name,activity_type,
start_time,end_time,create_time
from dwd_dim_activity_info
where dt='2020-05-06'
GROUP by id,activity_name,activity_type,
start_time,end_time,create_time),
t2 as
(select
activity_id,count(*) order_count
from dwd_fact_order_info
where dt='2020-05-06'
GROUP by activity_id),
t5 as
(SELECT
activity_id,count(*) payment_count
from
(SELECT order_id,id from dwd_fact_payment_info where dt='2020-05-06') t3
join
(SELECT id,activity_id from dwd_fact_order_info WHERE dt='2020-05-06' or dt=date_sub('2020-05-06',1)) t4
on t3.order_id=t4.id
GROUP by activity_id)
insert overwrite table dws_activity_info_daycount partition(dt='2020-05-06')
SELECT
t1.id,activity_name, activity_type,
start_time, end_time, create_time,
nvl(order_count,0),
nvl(payment_count,0)
from t1
left join t2 on t1.id=t2.activity_id
left join t5 on t1.id=t5.activity_id

7、dws_sale_detail_daycount(每日用户购买商品详情)

用户、商品、sku、购买次数、下单次数、下单金额

GROUP by user_id,sku_id)
insert overwrite table dws_sale_detail_daycount PARTITION(dt='2020-05-06')
SELECT
t7.user_id, t7.sku_id, user_gender, user_age,
user_level, order_price, sku_name, sku_tm_id,
sku_category3_id, sku_category2_id, sku_category1_id,
sku_category3_name, sku_category2_name,
sku_category1_name, spu_id, sku_num, order_count,
order_amount
from
(select
nvl(t3.user_id,t4.user_id) user_id,
nvl(t3.sku_id,t4.sku_id) sku_id,
nvl(order_count,0) order_count,
nvl(order_amount,0) order_amount,
nvl(sku_num,0) sku_num
FROM t3 full join t4 on
t3.user_id=t4.user_id and t3.sku_id=t4.sku_id) t7
join t1 on t7.user_id=t1.user_id
join t2 on t7.sku_id=t2.sku_id

二、DWT层-当前表及DWS表中按日期的汇总(合并及更新)

1、概述

将DWS层每日聚合的数据进行累积

不是分区表,是一个累积型全量表

累积型全量表: ①查询要改动的旧数据  ②查询新增和变化的新数据  ③新旧关联,以新换旧  ④导入覆盖

2、dwt_uv_topic

create external table dwt_uv_topic

用户及设备信息、首次活跃、当日活跃、末次活跃、累计活跃天数

今天未登录的老用户:new.mid_id is null

老用户:old.mid_id is not null

新用户:old.mid_id is null

今天登录的老用户:new.mid_id is not null and old.mid_id is not null

insert overwrite table gmall.dwt_uv_topic
select
nvl(old.mid_id,new.mid_id),
concat_ws('|',old.user_id,new.user_id),
concat_ws('|',old.version_code,new.version_code),
concat_ws('|',old.version_name,new.version_name),
concat_ws('|',old.lang,new.lang),
concat_ws('|',old.source,new.source),
concat_ws('|',old.os,new.os),
concat_ws('|',old.area,new.area),
concat_ws('|',old.model,new.model),
concat_ws('|',old.brand,new.brand),
concat_ws('|',old.sdk_version,new.sdk_version),
concat_ws('|',old.gmail,new.gmail),
concat_ws('|',old.height_width,new.height_width),
concat_ws('|',old.app_time,new.app_time),
concat_ws('|',old.network,new.network),
concat_ws('|',old.lng,new.lng),
concat_ws('|',old.lat,new.lat),
nvl(old.login_date_first,'2020-05-06') login_date_first,
IF(new.mid_id is null,old.login_date_last,'2020-05-06') login_date_last,
nvl(new.login_count,0) login_day_count,
nvl(old.login_count,0)+if(new.login_count is not null,1,0) login_count
from
dwt_uv_topic old
full join
(select * from dws_uv_detail_daycount where dt='2020-05-06') new
on old.mid_id=new.mid_id

3、dwt_user_topic

用户首末次登录、下单时间、天数,累计、最近30天下单支付金额、次数

insert overwrite table dwt_user_topic
SELECT
t1.user_id,login_date_first,
login_date_last, login_count, nvl(login_last_30d_count,0),
order_date_first, order_date_last, order_count, order_amount,
nvl(order_last_30d_count,0), nvl(order_last_30d_amount,0), payment_date_first,
payment_date_last, payment_count, payment_amount, nvl(payment_last_30d_count,0),
nvl(payment_last_30d_amount,0)
from
(SELECT
nvl(old.user_id,new.user_id) user_id,
nvl(old.login_date_first,'2020-05-06') login_date_first,
nvl(old.order_date_first,if(new.order_count>0,'2020-05-06',null)) order_date_first,
nvl(old.payment_date_first,if(new.payment_count>0,'2020-05-06',null)) payment_date_first,
if(new.user_id is null,old.login_date_last,'2020-05-06') login_date_last,
if(new.order_count>0,'2020-05-06',old.order_date_last) order_date_last,
if(new.payment_count>0,'2020-05-06',old.payment_date_last) payment_date_last,
nvl(old.login_count,0)+if(new.user_id is not null,1,0) login_count,
nvl(old.order_count,0)+nvl(new.order_count,0) order_count,
nvl(old.order_amount,0)+nvl(new.order_amount,0) order_amount,
nvl(old.payment_count,0)+nvl(new.payment_count,0) payment_count,
nvl(old.payment_amount,0)+nvl(new.payment_amount,0) payment_amount
from
dwt_user_topic old
full join (select * from dws_user_action_daycount where dt='2020-05-06') new
on old.user_id=new.user_id) t1
left join
(
SELECT
user_id,
sum(order_count) order_last_30d_count,
sum(order_amount) order_last_30d_amount,
sum(payment_count) payment_last_30d_count,
sum(payment_amount) payment_last_30d_amount,
count(*) login_last_30d_count
FROM dws_user_action_daycount
where dt BETWEEN date_sub('2020-05-06',29) and '2020-05-06'
GROUP by user_id) t2
on t1.user_id=t2.user_id

4、dwt_sku_topic

最近30日及累计下单、支付、退款、加购、收藏、好中差评次数

create external table dwt_sku_topic
(
sku_id string comment 'sku_id',
spu_id string comment 'spu_id',
-- 从dws_sku_action_daycount 取,where 30天之前<=dt<=今天,sum()
order_last_30d_count bigint comment '最近30日被下单次数',
insert overwrite TABLE dwt_sku_topic
SELECT
t2.sku_id, t2.spu_id,
nvl(order_last_30d_count,0),

5、dwt_coupon_topic

优惠券当日及累计领用、下单、支付次数

insert overwrite table dwt_coupon_topic
select
nvl(old.coupon_id,new.coupon_id) coupon_id,
nvl(new.get_count,0) get_day_count,
nvl(new.using_count,0) using_day_count,
nvl(new.used_count,0) used_day_count,
nvl(old.get_count,0)+nvl(new.get_count,0) get_count,
nvl(old.get_count,0)+nvl(new.using_count,0) using_count,
nvl(old.get_count,0)+nvl(new.used_count,0) used_count
from dwt_coupon_topic old
full join (select * from dws_coupon_use_daycount where dt='2020-05-06')new
on old.coupon_id=new.coupon_id

6、dwt_activity_topic

活动当日及累计下单、支付次数

insert overwrite table dwt_activity_topic
select
nvl(old.id,new.id) id,
nvl(old.activity_name,new.activity_name) activity_name,
nvl(new.order_count,0) order_day_count,
nvl(new.payment_count,0) payment_day_count,
nvl(old.order_count,0)+nvl(new.order_count,0) order_count,
nvl(old.payment_count,0)+nvl(new.payment_count,0) payment_count
from dwt_activity_topic old
full join (select * from dws_activity_info_daycount where dt='2020-05-06')new
on old.id=new.id

三、ADS层

1、概述

将需求根据要查询的数据源进行分类

同一类需求创建一张表进行统计

创建的表均为全量表

2、构造数据

将集群的时间,调整到要导入数据的前一天

上传jar包

启动采集通道,启动hive

执行脚本

3、设备主题

(1)活跃设备数(日、周、月)

从dws_uv_daycount 或 dwt_uv_topic 表取数据

日活、周活、月活【字段:是否是周末或月末】:至少活跃一次

create external table ads_uv_count(
`dt` string COMMENT '统计日期',
// 从dws层取当天的,也可以从dwt层取
`day_count` bigint COMMENT '当日用户数量',
// 从dws层取当周的,也可以从dwt层取
`wk_count` bigint COMMENT '当周用户数量',
// 从dws层取当月的,也可以从dwt层取
`mn_count` bigint COMMENT '当月用户数量',
// 借助next_day()
`is_weekend` string COMMENT 'Y,N是否是周末,用于得到本周最终结果',
// 借助last_day()
`is_monthend` string COMMENT 'Y,N是否是月末,用于得到本月最终结果'
) COMMENT '活跃设备数'
insert into table ads_uv_count
SELECT
'2020-05-06',day_count,wk_count,mn_count,
if('2020-05-06'=date_sub(next_day('2020-05-06','MO'),1),'Y','N') is_weekend,
if('2020-05-06'=last_day('2020-05-06'),'Y','N') is_monthend
from
(SELECT '2020-05-06' dt,count(*) day_count

(2)每日新增设备, login_date_first=今天

insert into ads_new_mid_count
SELECT
'2020-05-06' create_date,
count(*) new_mid_count
FROM dwt_uv_topic
where login_date_first='2020-05-06';

(3)沉默用户数

只在安装当天启动过: login_date_first='当天'= login_date_last

启动时间是在7天前: login_date_last< 今天的7天前

insert into table ads_silent_count
SELECT
'2020-05-06',
count(*)
from dwt_uv_topic
where login_date_first=login_date_last
and
login_date_last<date_sub('2020-05-06',7)

(4)本周回流用户数

本周登录过的,没在上周登录过的老用户数

本周活跃与上周活跃,两个结果集使用left join后取差集(本周活跃但上周不活跃):

on t1.mid_id=t2.mid_id
where t2.mid_id is null

login_date_last >= date_sub(next_day('2020-05-06','MO'),7)

(5)流失用户数:连续7天未活跃的设备

login_date_last<date_sub('2020-05-06',7)

(6)留存率:留存用户 占  某天新增用户的 比率

某天新增的用户中,在n天后继续使用的用户称为留存用户

①某一天新增的人数

②留存的天数,留存的日期=新增的天数+留存的天数

③取留存日期当天的留存人数

(7)最近连续三周活跃用户数

用户在这三周中,都至少需要出现一次

(8)最近七天内连续三天活跃用户数

4、会员主题

(1)会员信息

用户新鲜率、活跃率、付费率

cast(sum(if(login_date_last='2020-05-19',1,0)) / count(*) * 100 as decimal(10,2)) day_users2users

(2)转化率

访问/加购,加购/下单,下单/支付

cast( sum(if(payment_count>0,1,0)) / sum(if(order_count>0,1,0)) * 100 as decimal(10,2))

5、商品主题

(1)商品个数信息-各个商品的种类数

insert into table ads_product_info
SELECT
'2020-05-19' dt,
count(*) sku_num,
count(DISTINCT spu_id) spu_num
from dwt_sku_topic

(2)商品累积销量排名

FROM dwt_sku_topic
where payment_num>0
order by payment_num desc
limit 10

(3)商品收藏排名

(4)加入购物车排名

(5)最近30天退款率

(6)差评率排名

6、营销主题

(1)每日下单统计ads_order_daycount

(2)每日支付统计ads_payment_daycount

(3)品牌的月复购率ads_sale_tm_category1_stat_mn

单次、多次复购率

四、总结

1、数据来源

2、各层数据的来源于导入

数据源 建模 如何导入数据 备注
hdfs     采用lzo压缩的格式
ODS 原数据有几个字段是什么类型,就怎么建模 必须指定ODS的表使用能够读取LZO压缩格式的输入格式,为LZO格式创建索引  
用户行为DWD 用户行为数据根据不同类型数据的字段明细,进行建模 启动日志: get_json_object 事件日志: 自定义UDF,UDTF,将事件日志中的每个事件,解析到一个base_event表中,再使用get_json_object展开事件明细。  
业务数据DWD 维度表:维度退化,将多个同一类型维度的字段合并到一张表中。事实表:采取星型模型,基于3w原则,按照选取业务线---确认粒度---选取维度---选取度量进行建模 维度表:多表Join 事实表:选择一张事实表作为主表,通过外键关联维度表,选取维度字段。再选取度量!  
    事务型事实表:选取ods层某一天分区的数据,再关联维度表,选取维度字段,再选取度量!  
    周期型快照事实表:直接从ODS层全量导入(加入购物车,收藏表)  
    累积型快照事实表: 按照事实发生最初的事件作为分区字段!①选择要覆盖的老的分区的所有数据②选取今日新增和变化的新数据③新旧交替,以新换旧④覆盖到指定的分区  
    拉链表(缓慢变化维度):old left join new ,将old中过期的数据的end_date修改为new中start_date的前一天。 再union all new。导入到临时表,再导入到原表  
dws层 紧紧贴合需求。将同一类型的需求,汇总,分类,以某个需求的统计目标为主题(设备,用户,商品,优惠券,活动,购买行为),创建宽表 取dwd层每日最新的分区,进行多表关联  
dwt层 紧紧贴合需求。将同一类型的需求,汇总,分类,以某个需求的统计目标为主题(设备,用户,商品,优惠券,活动,购买行为),创建宽表 dwt full join dws 当日分区的数据①新旧交替,以新换旧②覆盖原表  
ads 紧紧贴合需求。将同一类型的需求,汇总,分类,以某个需求的统计目标为主题(用户,商品,会员,营销), 取某一天的历史切片数据,从dws层取,如果要取当前的数据或累计状态,从dwt层取  
导出mysql   update_mode: allowinsert update-key: dt  

企业流行新数仓】Day02:DWS层(按日分区的宽表)、DWT层(全量累计表)、ADS层、总结的相关教程结束。

《【企业流行新数仓】Day02:DWS层(按日分区的宽表)、DWT层(全量累计表)、ADS层、总结.doc》

下载本文的Word格式文档,以方便收藏与打印。