Kaggle Beginner Series Translation (2): Expedia

I originally planned to do the Titanic competition, but there is already so much material on it online that repeating it felt pointless, so here is a hotel-recommendation one instead:

https://www.kaggle.com/c/expedia-hotel-recommendations

I initially intended to follow this kernel: https://www.kaggle.com/zfturbo/leakage-solution

However, the author never wrote a notebook, only raw code. After a quick look, it turns out his solution exploits a leakage in the data, so it has limited value as a learning reference.

So I ended up following the second-ranked kernel instead:

https://www.kaggle.com/dvasyukova/predict-hotel-type-with-pandas

Basic introduction:

The goal is to predict which hotel a user will book in the future. Based on the available data (user behaviour and the events associated with it), we recommend five hotel clusters per user; a prediction counts as correct if the cluster the user actually booked is among the five recommended. The training data is a random sample of interactions from 2013-2014, and the test set is a random sample from 2015.
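Concretely, the competition is scored with MAP@5 (Mean Average Precision at 5). Since each row has exactly one true hotel_cluster, AP@5 reduces to the reciprocal of the rank at which the correct cluster appears among your five guesses. A minimal sketch (the function names here are my own, not from any official scoring kit):

```python
import numpy as np

def average_precision_at_5(actual, predicted):
    """AP@5 for a single row: `actual` is the true hotel_cluster,
    `predicted` is an ordered list of up to 5 cluster ids."""
    for rank, p in enumerate(predicted[:5], start=1):
        if p == actual:
            return 1.0 / rank
    return 0.0

def map_at_5(actuals, predictions):
    # mean of per-row AP@5 over the whole test set
    return np.mean([average_precision_at_5(a, p)
                    for a, p in zip(actuals, predictions)])

# toy example: row 1 hits at rank 2 (AP = 0.5), row 2 at rank 1 (AP = 1.0)
print(map_at_5([42, 7], [[5, 42, 1, 2, 3], [7, 0, 0, 0, 0]]))  # 0.75
```

This is why the solutions below always emit exactly five guesses, ordered from strongest to weakest signal: an early hit is worth more than a late one.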

The data looks roughly like this:

train/test.csv

Column name Description Data type
date_time Timestamp string
site_name ID of the Expedia point of sale int
posa_continent ID of continent associated with site_name int
user_location_country The ID of the country the customer is located int
user_location_region The ID of the region the customer is located int
user_location_city The ID of the city the customer is located int
orig_destination_distance Physical distance between a hotel and a customer at the time of search. A null means the distance could not be calculated double
user_id ID of user int
is_mobile 1 when a user connected from a mobile device, 0 otherwise tinyint
is_package 1 if the click/booking was generated as a part of a package (i.e. combined with a flight), 0 otherwise int
channel ID of a marketing channel int
srch_ci Checkin date string
srch_co Checkout date string
srch_adults_cnt The number of adults specified in the hotel room int
srch_children_cnt The number of (extra occupancy) children specified in the hotel room int
srch_rm_cnt The number of hotel rooms specified in the search int
srch_destination_id ID of the destination where the hotel search was performed int
srch_destination_type_id Type of destination int
hotel_continent Hotel continent int
hotel_country Hotel country int
hotel_market Hotel market int
is_booking 1 if a booking, 0 if a click tinyint
cnt Number of similar events in the context of the same user session bigint
hotel_cluster ID of a hotel cluster int

destinations.csv

Column name Description Data type
srch_destination_id ID of the destination where the hotel search was performed int
d1-d149 latent description of search regions double

The first thing ZFTurbo found, looking at the data, was a leakage:

The pair user_location_city and orig_destination_distance identifies the correct hotel with nearly 100% accuracy, and in most cases the same hotel maps to the same hotel_cluster. So whenever a pair in the test set also appears in the training set, you can output that pair's hotel_cluster directly for the test row — that is the data leak. The second thing computed is the most popular hotel_clusters for each srch_destination_id (the community later added hotel_country, hotel_market and booking year to improve accuracy). The third, least important, component is the most popular clusters per hotel_country, plus the most popular clusters overall.

Since we may output five answers, we start from the strongest signal:

1) the (user_location_city, orig_destination_distance) pair;

2) if the training set has no such pair, fill the remaining slots using srch_destination_id;

3) repeat step 2) with hotel_country, and finally with the overall most popular clusters.

The code is as follows:

# coding: utf-8
__author__ = 'ZFTurbo: https://kaggle.com/zfturbo'

import datetime
from heapq import nlargest
from operator import itemgetter
from collections import defaultdict


def run_solution():
    print('Preparing arrays...')
    f = open("train.csv", "r")
    f.readline()
    best_hotels_od_ulc = defaultdict(lambda: defaultdict(int))
    best_hotels_search_dest = defaultdict(lambda: defaultdict(int))
    best_hotels_search_dest1 = defaultdict(lambda: defaultdict(int))
    best_hotel_country = defaultdict(lambda: defaultdict(int))
    popular_hotel_cluster = defaultdict(int)
    total = 0

    # Calc counts
    while 1:
        line = f.readline().strip()
        total += 1

        if total % 10000000 == 0:
            print('Read {} lines...'.format(total))

        if line == '':
            break

        arr = line.split(",")
        book_year = int(arr[0][:4])
        user_location_city = arr[5]
        orig_destination_distance = arr[6]
        srch_destination_id = arr[16]
        is_booking = int(arr[18])
        hotel_country = arr[21]
        hotel_market = arr[22]
        hotel_cluster = arr[23]

        append_1 = 3 + 17*is_booking
        append_2 = 1 + 5*is_booking

        if user_location_city != '' and orig_destination_distance != '':
            best_hotels_od_ulc[(user_location_city, orig_destination_distance)][hotel_cluster] += 1

        if srch_destination_id != '' and hotel_country != '' and hotel_market != '' and book_year == 2014:
            best_hotels_search_dest[(srch_destination_id, hotel_country, hotel_market)][hotel_cluster] += append_1
        
        if srch_destination_id != '':
            best_hotels_search_dest1[srch_destination_id][hotel_cluster] += append_1
        
        if hotel_country != '':
            best_hotel_country[hotel_country][hotel_cluster] += append_2
        
        popular_hotel_cluster[hotel_cluster] += 1
    
    f.close()

    print('Generate submission...')
    now = datetime.datetime.now()
    path = 'submission_' + str(now.strftime("%Y-%m-%d-%H-%M")) + '.csv'
    out = open(path, "w")
    f = open("test.csv", "r")
    f.readline()
    total = 0
    out.write("id,hotel_cluster\n")
    topclasters = nlargest(5, sorted(popular_hotel_cluster.items()), key=itemgetter(1))

    while 1:
        line = f.readline().strip()
        total += 1

        if total % 1000000 == 0:
            print('Write {} lines...'.format(total))

        if line == '':
            break

        arr = line.split(",")
        id = arr[0]
        user_location_city = arr[6]
        orig_destination_distance = arr[7]
        srch_destination_id = arr[17]
        hotel_country = arr[20]
        hotel_market = arr[21]

        out.write(str(id) + ',')
        filled = []

        s1 = (user_location_city, orig_destination_distance)
        if s1 in best_hotels_od_ulc:
            d = best_hotels_od_ulc[s1]
            topitems = nlargest(5, sorted(d.items()), key=itemgetter(1))
            for i in range(len(topitems)):
                if topitems[i][0] in filled:
                    continue
                if len(filled) == 5:
                    break
                out.write(' ' + topitems[i][0])
                filled.append(topitems[i][0])

        s2 = (srch_destination_id, hotel_country, hotel_market)
        if s2 in best_hotels_search_dest:
            d = best_hotels_search_dest[s2]
            topitems = nlargest(5, d.items(), key=itemgetter(1))
            for i in range(len(topitems)):
                if topitems[i][0] in filled:
                    continue
                if len(filled) == 5:
                    break
                out.write(' ' + topitems[i][0])
                filled.append(topitems[i][0])
        elif srch_destination_id in best_hotels_search_dest1:
            d = best_hotels_search_dest1[srch_destination_id]
            topitems = nlargest(5, d.items(), key=itemgetter(1))
            for i in range(len(topitems)):
                if topitems[i][0] in filled:
                    continue
                if len(filled) == 5:
                    break
                out.write(' ' + topitems[i][0])
                filled.append(topitems[i][0])

        if hotel_country in best_hotel_country:
            d = best_hotel_country[hotel_country]
            topitems = nlargest(5, d.items(), key=itemgetter(1))
            for i in range(len(topitems)):
                if topitems[i][0] in filled:
                    continue
                if len(filled) == 5:
                    break
                out.write(' ' + topitems[i][0])
                filled.append(topitems[i][0])

        for i in range(len(topclasters)):
            if topclasters[i][0] in filled:
                continue
            if len(filled) == 5:
                break
            out.write(' ' + topclasters[i][0])
            filled.append(topclasters[i][0])

        out.write("\n")
    out.close()
    print('Completed!')

run_solution()

The rest of this post walks through the second kernel's approach. First, import the basic libraries:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output

Then take a look at the data. Since the file is large, it is read in chunks of 1,000,000 rows each, and only the columns judged useful for the final result are loaded.

train = pd.read_csv('../input/train.csv',
                    dtype={'is_booking':bool,'srch_destination_id':np.int32, 'hotel_cluster':np.int32},
                    usecols=['srch_destination_id','is_booking','hotel_cluster'],
                    chunksize=1000000)
aggs = []
print('-'*38)
for chunk in train:
    agg = chunk.groupby(['srch_destination_id',
                         'hotel_cluster'])['is_booking'].agg(['sum','count'])
    agg.reset_index(inplace=True)
    aggs.append(agg)
    print('.',end='')
print('')
aggs = pd.concat(aggs, axis=0)
aggs.head()
srch_destination_id hotel_cluster sum count
0 1 20 0.0 2
1 1 30 0.0 1
2 1 60 0.0 2
3 4 22 1.0 2
4 4 25 1.0 2

Next, aggregate again to get total bookings across all chunks. Clicks are obtained by subtracting bookings from the total row count. A weighted sum of bookings and clicks then gives each hotel cluster's "relevance".

CLICK_WEIGHT = 0.05
agg = aggs.groupby(['srch_destination_id','hotel_cluster']).sum().reset_index()
agg['count'] -= agg['sum']
agg = agg.rename(columns={'sum':'bookings','count':'clicks'})
agg['relevance'] = agg['bookings'] + CLICK_WEIGHT * agg['clicks']
agg.head()
srch_destination_id hotel_cluster bookings clicks relevance
0 0 3 0.0 2.0 0.10
1 1 20 4.0 22.0 5.10
2 1 30 2.0 20.0 3.00
3 1 57 0.0 1.0 0.05
4 1 60 0.0 17.0 0.85

Now find the most popular hotel clusters per destination, by defining a function that picks the top clusters for each group. An earlier version of the kernel used the nlargest() Series method to get the indices of the largest elements, but that turned out to be quite slow; the author replaced it with the faster version below.

def most_popular(group, n_max=5):
    relevance = group['relevance'].values
    hotel_cluster = group['hotel_cluster'].values
    most_popular = hotel_cluster[np.argsort(relevance)[::-1]][:n_max]
    return np.array_str(most_popular)[1:-1] # remove square brackets
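To see why the two approaches agree, here is a toy comparison between the argsort version above and the slower nlargest-based variant it replaced (the relevance numbers below are made up for illustration):

```python
import numpy as np
import pandas as pd

# toy group: relevance scores for four hotel clusters (made-up numbers)
group = pd.DataFrame({'hotel_cluster': [20, 30, 57, 60],
                      'relevance':     [5.10, 3.00, 0.05, 0.85]})

# fast: argsort gives ascending indices; reverse for descending, take top 5
fast = group['hotel_cluster'].values[np.argsort(group['relevance'].values)[::-1]][:5]

# slow: the nlargest-based equivalent the author moved away from
slow = group.nlargest(5, 'relevance')['hotel_cluster'].values

print(fast)  # [20 30 60 57]
print(slow)  # [20 30 60 57]
```

Both yield the clusters ordered by descending relevance; the argsort version avoids the per-group overhead of building a sorted DataFrame, which matters when apply() runs it over tens of thousands of destination groups.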

Get the most popular hotel clusters for all destinations:

most_pop = agg.groupby(['srch_destination_id']).apply(most_popular)
most_pop = pd.DataFrame(most_pop).rename(columns={0:'hotel_cluster'})
most_pop.head()
hotel_cluster
srch_destination_id  
0 3
1 20 30 60 57
2 20 30 53 46 41
3 53 60
4 82 25 32 58 78

Predicting on the test set:

Read the test set and merge in the most popular hotel clusters.

test = pd.read_csv('../input/test.csv',
                    dtype={'srch_destination_id':np.int32},
                    usecols=['srch_destination_id'],)

test = test.merge(most_pop, how='left',left_on='srch_destination_id',right_index=True)
test.head()
srch_destination_id hotel_cluster
0 12243 5 55 37 11 22
1 14474 5
2 11353 0 31 77 91 96
3 8250 1 45 79 24 54
4 11812 91 42 2 48 59

Check for null values in the test set's hotel_cluster column.

test.hotel_cluster.isnull().sum()

14036

It looks like about 14k destinations in the test set never appear in training. Let's fill those NAs with the overall most popular hotel clusters.

most_pop_all = agg.groupby('hotel_cluster')['relevance'].sum().nlargest(5).index
most_pop_all = np.array_str(most_pop_all)[1:-1]
test.hotel_cluster.fillna(most_pop_all,inplace=True)
test.hotel_cluster.to_csv('predicted_with_pandas.csv',header=True, index_label='id')


Reposted from blog.csdn.net/liutianheng654/article/details/81135950