Tianchi Newcomer Practice Competition, [Offline Round]: An Attempt (Part 1)

ronaldoLY


Blog categories: python, machine learning

Machine learning, Tianchi

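I won't reproduce the problem statement (https://tianchi.aliyun.com/getStart) here. After reading some material found through Baidu, the problem can be reduced to this: does a given U-I (user-item) pair have a purchase on the observation day? (a binary classification problem)

The rest of this post breaks the whole process into several steps:

I. Simple analysis

Load the two data tables, tianchi_fresh_comp_train_item and tianchi_fresh_comp_train_user, into a database.

The corresponding table names are vipfin.tianchi_fresh_comp_train_item and vipfin.tianchi_fresh_comp_train_user.

The first question is how strongly a user's actions on one day (browse, favorite, add to cart) influence purchases on the next day.

The reference blog post https://blog.csdn.net/snoopy_yuan/article/details/72850601 submitted the records that had been added to the cart on the previous day and had not been purchased yet. Let's do a quick check of how viable that is.

First, count the add-to-cart records with no purchase, taking 11.18 as an example: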

select count(1)
from (select * from vipfin.tianchi_fresh_comp_train_user
      where substr(time,1,10)='2014-11-18' and behavior_type=3) a
left join
     (select * from vipfin.tianchi_fresh_comp_train_user
      where substr(time,1,10)>='2014-11-18' and behavior_type=4) b
  on a.user_id=b.user_id
 and a.item_id=b.item_id
where b.user_id is null

14998

Records that were added to the cart on 11.18 and purchased on 11.19:

select count(1)
from (select * from vipfin.tianchi_fresh_comp_train_user
      where substr(time,1,10)='2014-11-18' and behavior_type=3) a
inner join
     (select * from vipfin.tianchi_fresh_comp_train_user
      where substr(time,1,10)='2014-11-19' and behavior_type=4) b
  on a.user_id=b.user_id
 and a.item_id=b.item_id

614

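So roughly 4% of the records added to the cart on one day turn into a purchase on the next day (614 converted against 14998 that did not).

For the approach mentioned in that blog, directly submitting the 12.18 add-to-cart data, the precision is therefore easy to guess: it will certainly not exceed 5%.

If even the previous day's add-to-cart action has a precision below 5%, the conversion rates for browsing and favoriting can be expected to be lower still.

So relying purely on the previous day's actions to predict the next day's purchases will not work.

Further statistics done purely in SQL will not reach a high accuracy either. Records that were added to the cart on one day and purchased on the next make up less than 10% of all purchase records of that next day. So even if predictions based on the previous day's cart data were 100% precise, they would still cover less than 10% of the next day's purchases; in other words, recall stays below 10%.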
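To make the arithmetic above concrete, here is a small Python sketch. It treats the two SQL counts (614 and 14998) as if they split the 11.18 cart records into "bought on 11.19" and "not bought", which is only an approximation because the first query also excludes purchases made on other days. The 10% recall is the upper bound quoted in the text, not a measured value, and F1 is used here only as a common way to combine precision and recall. The function name is hypothetical.

```python
def precision_recall_f1(hits, predicted, recall):
    """Compute precision from counts, then F1 for an assumed recall."""
    precision = hits / predicted
    f1 = 2 * precision * recall / (precision + recall)
    return precision, f1


# Counts taken from the two SQL queries above.
cart_and_bought = 614
cart_not_bought = 14998

precision, f1 = precision_recall_f1(
    hits=cart_and_bought,
    predicted=cart_and_bought + cart_not_bought,  # approximation, see lead-in
    recall=0.10,  # upper bound quoted in the text, not a measured value
)
print(f"precision = {precision:.4f}, F1 at 10% recall = {f1:.4f}")
```

Even with the most optimistic recall, the combined score stays low, which matches the conclusion that a single-rule SQL baseline is not enough.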

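All of this reinforced my conviction that machine learning is needed.

So the next step is to extract, from the daily records in tianchi_fresh_comp_train_user, features that measure user behavior, purchase behavior, and item attributes, and to use them as input to a machine learning model.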

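II. Data preprocessing

A few ideas:

1. The influence of user behavior on purchases weakens over time. According to the analysis, actions from more than a week earlier have very little effect on whether the user buys on the observation day, so only feature data from within one week of the observation (prediction) day is considered.

2. Purchase behavior is somewhat periodic, so the training, validation, and prediction datasets are chosen as follows (the Double 12 data is excluded; the three label days 11.28, 12.05 and 12.19 all fall on a Friday):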

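	Input	Output
Training data	U-I behavior data, 11.22~11.27	U-I purchase records, 11.28
Validation data	U-I behavior data, 11.29~12.04	U-I purchase records, 12.05
Prediction data	U-I behavior data, 12.13~12.18	U-I purchase records, 12.19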

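Train a model on the training data and tune its parameters so that the loss function is minimized and the accuracy is reasonably high.

Then feed in the validation data and compare the predictions with the real 12.05 data to check how well the model generalizes. If the validation results are satisfactory,

use the prediction data directly to make the final prediction.

3. For this business scenario, combine the user and item data to build features along various dimensions.

4. The problem has been framed as a classification problem of whether a U-I pair has a purchase (label in {0, 1}), so every feature has to be built at the U-I level. As for which U-I pairs to predict on: a Cartesian product (all users * all items) would be far too much data, so the first choice is the U-I pairs that had some action within the period before the prediction day.

(This has a problem of its own: the input set is too small. It could be widened to U-I pairs whose item belongs to a category the user has acted on,

or, more strictly, to pairs whose item is in the same category and is also among the most frequently acted-on items (the items most popular with all users). This is left for later exploration.)

Following https://blog.csdn.net/snoopy_yuan/article/details/75105724, a few simple features are extracted along several dimensions.

5. The scope of the dataset is not fixed. Depending on the prediction target and on how the training data is distributed, the data may need to be filtered or otherwise adjusted.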
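The three windows from point 2 can also be derived from the label days instead of being typed by hand. The sketch below is one way to do that with the standard library; the helper name `behavior_window` and the six-day window length are assumptions read off the table, not something the original post defines.

```python
from datetime import date, timedelta


def behavior_window(target_day, days=6):
    """Return (first_day, last_day) of the `days` days just before target_day."""
    last_day = target_day - timedelta(days=1)
    first_day = target_day - timedelta(days=days)
    return first_day, last_day


# Label days taken from the table in point 2.
target_days = {
    "train": date(2014, 11, 28),
    "validation": date(2014, 12, 5),
    "prediction": date(2014, 12, 19),
}

windows = {name: behavior_window(day) for name, day in target_days.items()}
for name, (first, last) in windows.items():
    print(name, first.isoformat(), "to", last.isoformat(),
          "-> label day", target_days[name].isoformat())
```

Keeping the label day as the single input makes it harder for the behavior window and the label day to drift apart when a different week is tried.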

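Feature name	Category	Meaning	Purpose	Count
u_b_count	U	Total number of the user's actions in the period before the observation day	User activity	1
u_bi_count (i=1/2/3/4)	U	Count of each type of action by the user in the previous period	User activity (per action type)	4
u_b4_rate	U	User's purchase conversion rate	User's purchasing habits	1
i_u_count	I	Number of actions on the item within the period	Item popularity	1
i_b4_rate	I	Item's click-to-purchase conversion rate	Reflects how purchase decisions are made for the item	1
c_u_count	C	Number of actions on the category within the period	Reflects the popularity of the item_category	1
c_b4_rate	C	Category's click-to-purchase conversion rate	Reflects how purchase decisions are made for the item_category	1
ui_b_count	UI	Total number of actions of the user-item pair within the period	Reflects how active the U-I pair is	1
uc_b_count	UC	Total number of actions of the user-category pair within the period	Reflects how active the U-C pair is	1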

The features above can be extracted in python pandas (the original blog seems to have computed them in Excel) or with SQL. I use SQL here because I am more familiar with it.
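For readers who prefer to see the counting logic outside SQL, here is a minimal standard-library sketch of a few of the features in the table. The toy rows and IDs are made up for illustration. The codes 3 (add to cart) and 4 (purchase) follow the queries in this post; 1 and 2 are assumed to be browse and favorite.

```python
from collections import Counter

# Toy behavior log: (user_id, item_id, item_category, behavior_type).
rows = [
    ("u1", "i1", "c1", 1),
    ("u1", "i1", "c1", 3),
    ("u1", "i1", "c1", 4),
    ("u1", "i2", "c1", 1),
    ("u2", "i1", "c1", 1),
    ("u2", "i3", "c2", 2),
]

# Counts per user, per user-item pair and per user-category pair.
u_b_count = Counter(u for u, _, _, _ in rows)
u_b4_count = Counter(u for u, _, _, b in rows if b == 4)
ui_b_count = Counter((u, i) for u, i, _, _ in rows)
uc_b_count = Counter((u, c) for u, _, c, _ in rows)

# Purchase rate = purchases / all actions. A Counter returns 0 for a
# missing key, which plays the role of COALESCE(..., 0) in the SQL version.
u_b4_rate = {u: u_b4_count[u] / n for u, n in u_b_count.items()}

for (u, i), n in sorted(ui_b_count.items()):
    print(u, i, "ui_b_count =", n, "u_b_count =", u_b_count[u],
          "u_b4_rate =", u_b4_rate[u])
```

The item-level and category-level features follow the same pattern with a different grouping key.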

SQL operations

create table temp_fin.temp_tianchi_train1 as 
select a.user_id, a.item_id,a.item_category,1  as  flag
from 
(select *
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27' 
) a 
inner join 
(select *
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) ='2014-11-28' and  behavior_type =4 ) b
on a.user_id=b.user_id
and a.item_id =b.item_id 
union all
select a.user_id, a.item_id,a.item_category,0  as  flag
from 
(select *
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27' 
) a 
left join 
(select *
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) ='2014-11-28' and  behavior_type =4 ) b
on a.user_id=b.user_id
and a.item_id =b.item_id 
where b.user_id is null;

create table temp_fin.temp_tianchi_train1_dist as
select   distinct  * from  temp_fin.temp_tianchi_train1;

-- Feature extraction
create table temp_fin.temp_tianchi_train1_u_b_count as 
select  distinct a.user_id,b.l_count u_b_count from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27' 
 group by user_id
)  b 
on a.user_id=b.user_id;

create table temp_fin.temp_tianchi_train1_u_b1_count  as 
select  distinct  a.user_id,b.l_count u_b_count from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=1
 group by user_id
)  b 
on a.user_id=b.user_id;

create table temp_fin.temp_tianchi_train1_u_b2_count  as 
select distinct a.user_id,b.l_count u_b_count from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=2
 group by user_id
)  b 
on a.user_id=b.user_id;

create table temp_fin.temp_tianchi_train1_u_b3_count  as 
select  distinct a.user_id,b.l_count u_b_count from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=3
 group by user_id
)  b 
on a.user_id=b.user_id;

create table temp_fin.temp_tianchi_train1_u_b4_count  as 
select   distinct a.user_id,b.l_count u_b_count from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=4
 group by user_id
)  b 
on a.user_id=b.user_id;

create table temp_fin.temp_tianchi_train1_u_b4_rate as 
select distinct a.user_id,  d.rate u_b4_rate  from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select   b.user_id , cast(COALESCE(c.l_count,0)  as double)/b.l_count  rate      from 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type in (1,2,3,4)
 group by user_id
)  b 
left join 
(select   user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=4
  group by user_id
)  c
 on b.user_id=c.user_id
 )  d 
 on a.user_id =d.user_id;

create table temp_fin.temp_tianchi_train1_i_u_count	 as 
select  distinct a.item_id,  b.l_count   i_u_count  from
temp_fin.temp_tianchi_train1_dist a 
inner join
(select item_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  
 group by item_id
)  b 
on a.item_id=b.item_id;

create table temp_fin.temp_tianchi_train1_i_b4_rate as 
select  distinct a.item_id,  d.rate i_b4_rate  from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select   b.item_id , cast(COALESCE(c.l_count,0)  as double)/b.l_count  rate      from 
(select item_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type in (1,2,3,4)
 group by item_id
)  b 
left join 
(select   item_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=4
  group by item_id
)  c
 on b.item_id=c.item_id
 )  d 
 on a.item_id =d.item_id;

create table temp_fin.temp_tianchi_train1_c_u_count	 as 
select  distinct a.item_category,  b.l_count   c_u_count  from
temp_fin.temp_tianchi_train1_dist a 
inner  join
(select item_category,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  
 group by item_category
)  b 
on a.item_category=b.item_category;

create table temp_fin.temp_tianchi_train1_c_b4_rate as 
select    distinct a.item_category,  d.rate c_b4_rate  from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select   b.item_category , cast(COALESCE(c.l_count,0)  as double)/b.l_count  rate      from 
(select item_category,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type in (1,2,3,4)
 group by item_category
)  b 
-- left join here, as for u_b4_rate and i_b4_rate, so that a category with no purchases gets a rate of 0 instead of being dropped
left join 
(select   item_category,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=4
  group by item_category
)  c
 on b.item_category=c.item_category
 )  d 
 on a.item_category =d.item_category;
 
 create table temp_fin.temp_tianchi_train1_ui_b_count	 as 
select   distinct a.user_id, a.item_id,  b.l_count   ui_b_count  from
temp_fin.temp_tianchi_train1_dist a 
inner join
(select user_id,item_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  
 group by user_id,item_id
)  b 
on a.user_id=b.user_id
and a.item_id=b.item_id;

create table temp_fin.temp_tianchi_train1_uc_b_count  as 
select distinct  a.user_id,a.item_category  ,b.l_count   uc_b_count  from
temp_fin.temp_tianchi_train1_dist a 
inner join
(select user_id,item_category,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  
 group by user_id,item_category
)  b 
on a.user_id=b.user_id
and a.item_category=b.item_category;

create table temp_fin.temp_tianchi_train1_data as 
select a.user_id, a.item_id,a.item_category
,u_b_count_table.u_b_count
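-- a user with no action of a given type has no row in that table, so default the count to 0 instead of NULL
,COALESCE(u_b1_count.u_b_count,0) u_b1_count
,COALESCE(u_b2_count.u_b_count,0) u_b2_count
,COALESCE(u_b3_count.u_b_count,0) u_b3_count
,COALESCE(u_b4_count.u_b_count,0) u_b4_count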
,u_b4_rate.u_b4_rate
,i_u_count.i_u_count
,i_b4_rate.i_b4_rate
,c_u_count.c_u_count
,c_b4_rate.c_b4_rate
,ui_b_count.ui_b_count
,uc_b_count.uc_b_count
,a.flag
from temp_fin.temp_tianchi_train1_dist a 
left join temp_fin.temp_tianchi_train1_u_b_count u_b_count_table
on a.user_id =u_b_count_table.user_Id 
left join  temp_fin.temp_tianchi_train1_u_b1_count  u_b1_count 
on a.user_id =u_b1_count.user_Id 
left join  temp_fin.temp_tianchi_train1_u_b2_count  u_b2_count 
on a.user_id =u_b2_count.user_Id 
left join  temp_fin.temp_tianchi_train1_u_b3_count  u_b3_count 
on a.user_id =u_b3_count.user_Id 
left join  temp_fin.temp_tianchi_train1_u_b4_count  u_b4_count 
on a.user_id =u_b4_count.user_Id 
left join  temp_fin.temp_tianchi_train1_u_b4_rate  u_b4_rate 
on a.user_id =u_b4_rate.user_Id 
left join  temp_fin.temp_tianchi_train1_i_u_count i_u_count
on a.item_id =i_u_count.item_id
left join  temp_fin.temp_tianchi_train1_i_b4_rate  i_b4_rate 
on a.item_id =i_b4_rate.item_id
left join  temp_fin.temp_tianchi_train1_c_u_count c_u_count
on a.item_category=c_u_count.item_category
left join  temp_fin.temp_tianchi_train1_c_b4_rate c_b4_rate
on a.item_category=c_b4_rate.item_category
left join  temp_fin.temp_tianchi_train1_ui_b_count ui_b_count
on a.user_id =ui_b_count.user_Id and a.item_id=ui_b_count.item_id
left join  temp_fin.temp_tianchi_train1_uc_b_count uc_b_count
on a.user_id =uc_b_count.user_Id and a.item_category=uc_b_count.item_category;

The other two datasets are computed in the same way.
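Since the three datasets differ only in their dates, the labeling step can be parameterized. The sketch below shows the idea with Python's built-in sqlite3 module on a tiny in-memory table. It is a variant, not the queries above: it uses a single left join with a CASE expression instead of the union all, and the table name `train_user`, the toy rows and the function name are all hypothetical stand-ins.

```python
import sqlite3


def build_labeled_pairs(conn, first_day, last_day, label_day):
    """Label each U-I pair seen in [first_day, last_day] with 1 if it was
    purchased on label_day, else 0."""
    sql = """
        select distinct a.user_id, a.item_id, a.item_category,
               case when b.user_id is null then 0 else 1 end as flag
        from (select * from train_user
              where substr(time, 1, 10) between ? and ?) a
        left join
             (select distinct user_id, item_id from train_user
              where substr(time, 1, 10) = ? and behavior_type = 4) b
          on a.user_id = b.user_id and a.item_id = b.item_id
        order by a.user_id, a.item_id
    """
    return conn.execute(sql, (first_day, last_day, label_day)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("create table train_user "
             "(user_id, item_id, behavior_type, item_category, time)")
conn.executemany("insert into train_user values (?, ?, ?, ?, ?)", [
    ("u1", "i1", 3, "c1", "2014-11-27 20"),  # carted inside the window
    ("u1", "i1", 4, "c1", "2014-11-28 09"),  # bought on the label day
    ("u2", "i2", 1, "c1", "2014-11-25 13"),  # browsed, never bought
    ("u3", "i3", 1, "c2", "2014-11-20 08"),  # outside the window
])

train = build_labeled_pairs(conn, "2014-11-22", "2014-11-27", "2014-11-28")
validation = build_labeled_pairs(conn, "2014-11-29", "2014-12-04", "2014-12-05")
print(train)
print(validation)
```

Passing the dates as query parameters keeps one query text for all three datasets, so a fix to the logic only has to be made once.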

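III. Feature processing

After processing there are still three datasets, each with roughly these columns:

user_id, item_id, category, feature values (u_b_count ... uc_b_count), label (whether a purchase happened on the observation day)

With this data in hand, feature processing is done with the pyspark.ml.feature package. It offers methods that turn multiple feature columns into one multi-dimensional vector,

such as VectorAssembler, as well as methods for scaling feature values and handling zero values, such as MaxAbsScaler and MinMaxScaler.

The two steps of feature processing:

multiple feature columns => one column of multi-dimensional vectors => scaling of the vector values

(Something to think about: can the first step incorporate the idea of feature weights? With so many feature dimensions, some matter more than others; for example, user activity matters more than item activity. A user has to be active to be likely to buy anything; if a best-selling item meets a user who hardly does anything, it is all for nothing.)

Note: if the sklearn API is used for model training, the input features are a single array, so all the feature values can simply be merged and processed together. The process is omitted here.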
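To illustrate what the two steps do to the numbers, here is a plain-Python sketch. It is not the pyspark or sklearn API; it only mimics "assemble the columns into one vector" followed by min-max scaling to [0, 1]. The function names and toy rows are assumptions for illustration, and mapping a constant column to 0.5 is a choice made in this sketch.

```python
def assemble(rows, feature_names):
    """Turn each dict of named features into one list (the 'vector')."""
    return [[float(row[name]) for name in feature_names] for row in rows]


def min_max_scale(vectors):
    """Rescale every column to [0, 1]; a constant column is mapped to 0.5."""
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for vec in vectors:
        scaled.append([
            (x - lo) / (hi - lo) if hi > lo else 0.5
            for x, lo, hi in zip(vec, lows, highs)
        ])
    return scaled


rows = [
    {"u_b_count": 4, "u_b4_rate": 0.25, "ui_b_count": 3},
    {"u_b_count": 2, "u_b4_rate": 0.0, "ui_b_count": 1},
    {"u_b_count": 10, "u_b4_rate": 0.5, "ui_b_count": 1},
]
features = ["u_b_count", "u_b4_rate", "ui_b_count"]
scaled = min_max_scale(assemble(rows, features))
for vec in scaled:
    print(vec)
```

In a real pipeline the minimum and maximum would be learned on the training set and reused for the validation and prediction sets, so that all three are scaled consistently.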

Code for this step to be added...

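IV. Model building

The features are now vectors the model can consume, so different algorithms from pyspark.ml can be picked and fed with the data directly. Tune the hyperparameters according to accuracy, and use the validation data to check how reliable the model is.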
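One tuning knob that applies to any classifier producing a probability is the decision threshold. Because purchases are rare, the sketch below scores each candidate threshold on the validation set with F1 rather than plain accuracy; that choice of metric is an assumption here. The labels and scores are hypothetical, and the function names are made up for illustration.

```python
def f1_score(labels, predictions):
    """F1 for binary labels; returns 0.0 when there is no true positive."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(labels, scores, candidates):
    """Return (threshold, f1) for the candidate with the highest F1."""
    results = []
    for t in candidates:
        preds = [1 if s >= t else 0 for s in scores]
        results.append((t, f1_score(labels, preds)))
    return max(results, key=lambda r: r[1])


# Hypothetical validation labels and model scores.
val_labels = [0, 0, 1, 0, 1, 0, 0, 1]
val_scores = [0.05, 0.20, 0.35, 0.40, 0.70, 0.10, 0.30, 0.90]

threshold, f1 = best_threshold(val_labels, val_scores, [0.1, 0.3, 0.5, 0.7])
print("best threshold:", threshold, "F1:", round(f1, 3))
```

The threshold chosen on the validation set would then be applied unchanged to the prediction set.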

Code for this step to be added...

Closing: the reference blog is at https://blog.csdn.net/snoopy_yuan


2018-04-09 16:00