Pig: a parallel dataflow language (Pig Latin) that runs on Hadoop
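Pig: input and output
Step 1: load the data with load
Once the data flow has run, either store the result (write it to storage) or dump it (print it to the screen)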
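A minimal sketch of the load, then store or dump pattern described above. The input path, output path, and three-field schema are hypothetical and chosen only for illustration.

```pig
-- Hypothetical tab-separated input with an assumed schema.
sales = LOAD 'input/sales.tsv' USING PigStorage('\t')
        AS (sku_id:chararray, dc:chararray, qty:int);

-- Write the relation to a directory (hypothetical output path) ...
STORE sales INTO 'output/sales_copy' USING PigStorage('\t');

-- ... or print it to the console, which is handy while debugging.
DUMP sales;
```

Nothing runs until Pig reaches a STORE or DUMP, so a script normally ends with one of them.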
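Pig: relational operators
foreach: for each record, selects the needed fields and passes them to the next operator (like choosing columns in a SQL SELECT; it can also apply COUNT and SUM, typically on grouped data)
filter: keeps only the records that match a condition (like SQL WHERE)
group: groups records by a field (one or more fields of the relation)
order: sorts records (by one or more fields of the relation)
distinct: removes duplicates by comparing whole records only; it cannot deduplicate on a single field
join: joins two loaded data sets on their join fields. Note that a relation cannot be joined with itself under the same alias; load the data a second time under a different alias instead
(Preferably the two join fields should not share a name; rename one of them with as)
limit: caps the number of records returned.
count: when counting rows in Pig, count over a column that is never null, because COUNT skips tuples whose first field is null.
flatten: un-nests the tuples or bags produced by grouping back into top-level fields.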
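The operators listed above can be chained as in the sketch below. The two input files, their schemas, and all alias names are hypothetical; this illustrates the operators and is not taken from the workflow that follows.

```pig
-- Hypothetical inputs with assumed schemas.
sales  = LOAD 'input/sales.tsv' USING PigStorage('\t')
         AS (sku_id:chararray, dc:chararray, qty:int);
scores = LOAD 'input/scores.tsv' USING PigStorage('\t')
         AS (item_sku_id:chararray, score:int);

-- filter: like SQL WHERE.
nonzero = FILTER sales BY qty > 0;

-- foreach: keep only the columns that are needed.
slim = FOREACH nonzero GENERATE sku_id, qty;

-- distinct compares whole records, so project the field first
-- in order to deduplicate on that field alone.
skus      = FOREACH sales GENERATE sku_id;
uniq_skus = DISTINCT skus;

-- group, then aggregate each group's bag with COUNT and SUM.
by_sku = GROUP slim BY sku_id;
totals = FOREACH by_sku GENERATE group AS sku_id,
                                 COUNT(slim) AS n,
                                 SUM(slim.qty) AS total_qty;

-- flatten: turn each group's bag back into one record per tuple.
rows_again = FOREACH by_sku GENERATE FLATTEN(slim);

-- join: the join keys have different names, so later references
-- to them are unambiguous.
joined = JOIN totals BY sku_id, scores BY item_sku_id;

-- order and limit: sort, then keep the first 10 records.
sorted = ORDER joined BY total_qty DESC;
top10  = LIMIT sorted 10;
DUMP top10;
```

Giving the join keys distinct names (sku_id versus item_sku_id) avoids having to disambiguate them with the `alias::field` prefix after the join.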
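Pig: basic concepts: how relation, bag (comparable to a database table), tuple (comparable to a table row), field, and data relate
A relation is a bag; a bag is a collection of tuples; a tuple is an ordered set of fields; each field holds a piece of data
Note: tuples in the same relation may have different numbers of fields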
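To make the terms above concrete, here is a small sketch. The file names and schema are hypothetical, and the literal values in the comments are made-up samples rather than real output.

```pig
-- Illustrative shapes (made-up values):
--   tuple:    ('1001', 'DC-6', 12)
--   bag:      {('1001','DC-6',12), ('1002','DC-3',5)}
--   relation: the outermost bag that an alias such as "sales" refers to

-- Without a declared schema, fields are addressed by position
-- ($0, $1, ...), and Pig does not require every tuple to have
-- the same number of fields.
raw       = LOAD 'input/ragged.tsv' USING PigStorage('\t');
first_two = FOREACH raw GENERATE $0, $1;

-- With a schema, grouping nests a bag of tuples inside each record;
-- DESCRIBE prints the resulting nested schema.
sales = LOAD 'input/sales.tsv' USING PigStorage('\t')
        AS (sku_id:chararray, dc:chararray, qty:int);
by_dc = GROUP sales BY dc;
DESCRIBE by_dc;
```

Running DESCRIBE on intermediate aliases is a quick way to check which fields are plain values and which are nested bags or tuples.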
<workflow-app xmlns="uri:oozie:workflow:0.2" name="dashboard-money"
half="3000">
<params>
<param key="yestoday">${yestoday}</param>
<param key="targetDir">/user/cmo_ipc/dashboard/store/repertoryExamination/app_dashboard_store_sales_d/${yestoday}</param>
<param key="mysql_jdbc_1" import="/export/App/etl.sone.jd.local/WEB-INF/classes/conf/jdbc.properties"></param>
<param key="mysql_jdbc_2" import="/export/App/etl.sone.jd.local/WEB-INF/classes/conf/important.properties"></param>
</params>
<start to="genBrand" />
<action name="genBrand">
<delete path="${targetDir}" />
<pig>
<script>yuanshi_fact_table = load '/user/cmo_ipc/app/dashboard/app_dashboard_store_sales_d/tx_dt=${yestoday}/*.lzo' as
( data_date,
data_year,
data_month,
data_week ,
data_day,
data_type,
brand_code,
item_sku_id,
sku_status_cd,
shelves_tm,
otc_tm,
utc_tm,
purchaser_erp_acct,
purchaser_name,
saler_erp_acct,
sale_staf_name,
item_first_cate_cd,
item_second_cate_cd,
item_third_cate_cd,
dept_id_3,
band,
free_goods_flag,
delv_center_num,
major_supp_brevity_code,
ky_stock,
num_order_booking,
num_app_booking,
num_stock,
num_purchase_plan,
num_order_transfer,
num_zt_stock,
num_transfer_plan_in,
num_transfer_plan_out,
basestock,
loweststock,
num_nosale,
target_num_stock,
health_num_stock,
health_je_stock,
mkt_prc,
jd_prc ,
stk_prc,
wh_qtn,
into_wh_qtty ,
status ,
sys_reple_qty,
reple_qty,
sales_qtty_1,
sales_mount_1,
sales_qtty_7,
sales_mount_7,
sales_qtty_14,
sales_mount_14,
sales_qtty_28,
sales_mount_28,
sales_qtty_60 ,
sales_mount_60,
sales_qtty_90 ,
sales_mount_90,
is_xhkc_flag ,
is_sg_flag ,
is_xg_flag ,
is_bh_flag ,
zzts_sl ,
zzts_je ,
nosale_days ,
xiagui_kc_sl ,
xiagui_kc_je ,
xiagui_sc ,
is_dh_flag ,
is_qh_flag ,
qh_sl ,
qh_je ,
is_zx_flag ,
zx_sl ,
zx_je ,
zt_sl ,
kc_je ,
pv_xhl ,
is_bdx_flag ,
bdx_sl ,
bdx_je ,
qh_ss_je );
</script>
<script>temp11= filter yuanshi_fact_table by item_first_cate_cd=='737' ;</script>
<script>
temp1 = FOREACH temp11 GENERATE data_date,
data_year,
data_month,
data_week ,
data_day,
data_type,
brand_code,
item_sku_id as sku_id,
sku_status_cd,
shelves_tm,
otc_tm,
utc_tm,
purchaser_erp_acct,
purchaser_name,
saler_erp_acct,
sale_staf_name,
item_first_cate_cd,
item_second_cate_cd,
item_third_cate_cd,
dept_id_3,
band,
free_goods_flag,
delv_center_num,
major_supp_brevity_code,
ky_stock,
num_order_booking,
num_app_booking,
num_stock,
num_purchase_plan,
num_order_transfer,
num_zt_stock,
num_transfer_plan_in,
num_transfer_plan_out,
basestock,
loweststock,
num_nosale,
target_num_stock,
health_num_stock,
health_je_stock,
mkt_prc,
jd_prc ,
stk_prc,
wh_qtn,
into_wh_qtty ,
status ,
sys_reple_qty,
reple_qty,
sales_qtty_1,
sales_mount_1,
sales_qtty_7,
sales_mount_7,
sales_qtty_14,
sales_mount_14,
sales_qtty_28,
sales_mount_28,
sales_qtty_60 ,
sales_mount_60,
sales_qtty_90 ,
sales_mount_90,
is_xhkc_flag ,
is_sg_flag ,
is_xg_flag ,
is_bh_flag ,
zzts_sl ,
zzts_je ,
nosale_days ,
xiagui_kc_sl ,
xiagui_kc_je ,
xiagui_sc ,
is_dh_flag ,
is_qh_flag ,
qh_sl ,
qh_je ,
is_zx_flag ,
zx_sl ,
zx_je ,
zt_sl ,
kc_je ,
pv_xhl ,
is_bdx_flag ,
bdx_sl ,
bdx_je ,
qh_ss_je;
</script>
<script>temp2=load '/user/cmo_ipc/app/dashboard/app_dashboard_store_score_d/tx_dt=${yestoday}/*.lzo' as
(data_date,data_year,data_month,data_week,data_day,item_sku_id,delv_center_num,score_total,score_qh,score_zx,score_bdx,score_xg,gjdj_total,gjdj_qh,gjdj_zx,gjdj_bdx,gjdj_xg);
</script>
<script>
t2 = FOREACH temp2 GENERATE item_sku_id,score_total,score_qh,score_zx,score_bdx,score_xg;
</script>
<script>
t3 = JOIN temp1 BY sku_id, t2 BY item_sku_id;
</script>
<script>
t1= FOREACH t3 GENERATE data_date,data_type,item_third_cate_cd,brand_code,item_sku_id,band,dept_id_3,purchaser_erp_acct,
major_supp_brevity_code,delv_center_num,CONCAT(CONCAT(item_sku_id,'_'),delv_center_num),num_stock,ky_stock,sales_qtty_1,sales_mount_1,
sales_qtty_7,sales_mount_7,sales_qtty_14,sales_mount_14,sales_qtty_28,sales_mount_28,sales_qtty_60,sales_mount_60,sales_qtty_90,
sales_mount_90,zt_sl,num_zt_stock,num_order_booking,pv_xhl,qh_ss_je,qh_sl,is_qh_flag,zzts_sl,zzts_je,zx_sl,is_zx_flag,nosale_days,
bdx_sl,is_bdx_flag,xiagui_kc_sl,xiagui_sc,is_xg_flag,is_xhkc_flag,is_sg_flag,health_num_stock,score_total,score_qh,score_zx,
score_bdx,score_xg,qh_je,xiagui_kc_je,bdx_je,zx_je,health_je_stock;
</script>
<script>store t1 into '${targetDir}' ;</script>
</pig>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail" />
<end name="end" />
</workflow-app>