User portrait series - user portrait data modeling method

As our understanding of people gradually deepens, a concept has quietly emerged: the user portrait (User Profile). It abstracts the full picture of a user's information and can be regarded as the foundation on which enterprises apply big data.

1. What is a user portrait?

Male, 31 years old, married, with an income of more than 10,000 yuan, loves food, is a group buying expert, and likes red wine with cigarettes.

A string of descriptions like this is a typical user portrait. Summed up in one phrase, a user portrait is the labeling of user information.

If you use a picture to show it, that is:

[Figure: data modeling]

2. Why do you need user portraits?

The core work of user portraits is to label users. An important purpose of labeling is to make user information easy for people to understand and easy for computers to process. For example, you can compute classification statistics: How many users like red wine? What is the ratio of males to females among the people who like red wine?
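As a minimal sketch of such classification statistics, assume each user is stored as a record with a gender field and a set of labels. The record layout, the sample users, and the function names below are invented for illustration and are not from the original article.

```python
from collections import Counter

# Hypothetical sample records: each user has a gender and a set of labels.
users = [
    {"id": "u1", "gender": "male", "labels": {"red wine", "food"}},
    {"id": "u2", "gender": "female", "labels": {"red wine"}},
    {"id": "u3", "gender": "male", "labels": {"sports"}},
    {"id": "u4", "gender": "male", "labels": {"red wine", "group buying"}},
]


def users_with_label(records, label):
    """Return the records whose label set contains the given label."""
    return [r for r in records if label in r["labels"]]


def gender_counts(records):
    """Count records per gender value."""
    return Counter(r["gender"] for r in records)


wine_lovers = users_with_label(users, "red wine")
wine_gender = gender_counts(wine_lovers)

print(len(wine_lovers))   # how many users like red wine
print(dict(wine_gender))  # gender breakdown among them
```

Because every label is a short, standardized string, both questions reduce to a filter followed by a count.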

You can also do data mining: use association rules to find which sports brands people who like red wine usually prefer, or use a clustering algorithm to analyze the age distribution of people who like red wine.

Big data processing is inseparable from computation. Labels give computers a convenient way to process information about people programmatically, and even to "understand" people through algorithms and models. Once computers have this capability, search engines, recommendation engines, advertising, and other applications can become more precise and make information retrieval more efficient.

3. How to build user portraits

A label is usually a highly condensed feature identifier defined by people, such as the age-group label "25-35 years old" or the geographic label "Beijing". Labels have two important characteristics. First, they are semantic: people can easily understand what each label means. This makes the user portrait model meaningful in practice and lets it serve business needs well, for example when judging user preferences. Second, they are short text: each label usually carries only one meaning, and the label itself needs no preprocessing such as text analysis, which makes it convenient for machines to extract standardized information.

People define the labeling rules and can quickly read information from the labels, while machines can easily extract labels and aggregate them for analysis. User portraits, that is, user labels, therefore give us a simple and concise way to describe user information.

3.1 Data Source Analysis

The purpose of building user portraits is to reconstruct a user's information, so the data source is all user-related data.

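To classify user-related data, we introduce an important idea: closed (exhaustive) classification. For example, the world has two kinds of people, those who study English and those who do not; customers fall into three classes, high-value, medium-value, and low-value; the product life cycle is divided into introduction, growth, maturity, and decline. In each case the subcategories together make up the entire category space.

Classifying this way makes it easier to keep enumerating dimensions later and to fill in missing ones iteratively. You do not have to worry that an incompletely considered layer of the classification will omit dimensions and leave a scalability risk in the architecture. Different classification schemes may each be reasonable depending on the application scenario and business needs, so divide the data as required.

This article divides user data into two broad categories: static information data and dynamic information data.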

[Figure: user data divided into static information and dynamic information]

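Static information data

Static information is relatively stable information about a user. As shown in the figure, it mainly covers demographic attributes, commercial attributes, and the like. Such information is a label in its own right: if the enterprise holds the real information, little modeling or prediction is needed and the work is mostly data cleaning, so data modeling for this kind of information is not the focus of this article.

Dynamic information data

Dynamic information is a user's constantly changing behavior. If God existed, everyone's behavior would be watched at every moment by his invisible eyes. Broadly speaking, a user opening a web page and buying a cup is, in God's eyes, user behavior just as much as that user walking the dog in the evening, withdrawing cash during the day, or yawning. Once behavior is narrowed to the Internet, and further to e-commerce, user behavior becomes much more focused. As shown in the figure above: browsing the VANCL (凡客) home page, browsing a casual-shoe product page, searching for canvas shoes, posting a Weibo message about shoe quality, or liking a Weibo post that says the Double 11 promotion is great. All of these count as Internet user behavior.

This article takes Internet e-commerce users as its main subject and sets aside offline behavior data for now (the analysis method is the same; only the means of data collection and user identification differ somewhat).

On the Internet, user behavior can be regarded as the only data source for a user's dynamic information. How to build a data model on user behavior data and derive user labels from it is the focus of this article.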

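3.2 Goal Analysis

The goal of a user portrait is to analyze user behavior and ultimately give each user labels, each with a weight. For example: red wine 0.8, Li-Ning 0.6.

The label represents content: the user has an interest in, a preference for, or a need for that content.

The weight represents an index: the degree of the user's interest or preference, or possibly the degree of need. It can be loosely understood as a confidence or a probability.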
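One simple way to hold this output is a mapping from label to weight per user. The sketch below assumes a plain dictionary is enough; the `strong_labels` helper and its threshold are hypothetical and only show how such a structure might be queried.

```python
# A user portrait as a mapping from label to weight,
# using the two example values given in the text.
profile = {"red wine": 0.8, "Li-Ning": 0.6}


def strong_labels(label_weights, threshold):
    """Return (label, weight) pairs with weight >= threshold, highest first."""
    kept = [(label, w) for label, w in label_weights.items() if w >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


print(strong_labels(profile, 0.7))
print(strong_labels(profile, 0.0))
```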

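3.3 Data Modeling Method

The following describes in detail how to build a model from user behavior that outputs labels and weights. An event model includes three elements: time, place, and person. Each user action is in essence a random event that can be described as: which user, at what time, in what place, did what.

Which user: the key is identifying the user. The purpose of user identification is to tell users apart and to pinpoint a single user.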

[Figure: main user identification methods on the Internet]

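The figure above lists the main user identification methods on the Internet, ordered from easiest to hardest to obtain. Which identifiers an enterprise can obtain depends on how sticky its users are.

What time: time carries two important pieces of information, the timestamp and the duration. The timestamp marks the moment of a user action, for example 1395121950 (second precision) or 1395121950.083612 (microsecond precision). A second-precision timestamp is usually sufficient, because microsecond precision is not reliable: browser clocks are accurate to the millisecond at best. The duration records how long the user stays on a given page.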
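The two time fields can be sketched with Python's standard library. The timestamp value comes from the text; the leave time and the helper names are assumptions made for illustration.

```python
from datetime import datetime, timezone


def to_utc(ts):
    """Convert a Unix timestamp in seconds (possibly fractional) to a UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


def dwell_seconds(enter_ts, leave_ts):
    """Dwell time on a page in whole seconds, dropping sub-second precision."""
    return int(leave_ts) - int(enter_ts)


enter = 1395121950        # second-precision timestamp from the text
leave = enter + 42        # hypothetical: the user leaves 42 seconds later

print(to_utc(enter).isoformat())
print(dwell_seconds(enter, leave))
print(int(1395121950.083612))  # truncating microseconds to second precision
```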

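What place: the user touchpoint (Touch Point). Each touchpoint implicitly carries two layers of information: URL + content. URL: each URL (page/screen) locates an Internet page or a specific page of a product. It can be the URL of a page on an e-commerce site on a PC, a particular feature page of a mobile app such as Weibo or WeChat, or a specific screen of some product app. Examples: the product page for Great Wall red wine, a WeChat subscription account page, the level-complete page of a game.

Content: the content of each URL (page/screen). This can be information about a single product: category, brand, description, attributes, site information, and so on. For example: red wine, Great Wall, dry red. For each Internet touchpoint, the URL determines the weight and the content determines the label.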

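Note: a touchpoint can be a URL or a specific feature screen of a product. For example, the same bottle of mineral water sells for 1 yuan in a supermarket, 3 yuan on a train, and 5 yuan at a scenic spot. The selling value of a product depends less on its cost than on where it is sold. The label is mineral water in every case, but the different touchpoints produce different weights. The weight here can be understood as how much the user needs the mineral water, that is, how much the user is willing to pay.

Label            Weight
Mineral water    1    // supermarket
Mineral water    3    // train
Mineral water    5    // scenic spot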

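Similarly, a user browsing red wine on JD.com (京东商城) and a user browsing red wine on the Pinshang red wine site (品尚红酒网) show different degrees of liking for red wine. The point is that different URLs carry different weights, and the weight model has to be built according to each business's needs.

So the URL itself expresses the weight of the user's label preference, and the content behind the URL provides the label information.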

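What action: the type of user behavior. Typical actions in e-commerce include browsing, adding to cart, searching, commenting, purchasing, liking, favoriting, and so on.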

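Different action types give the labels derived from a touchpoint's content different weights. For example, a purchase counts as weight 5 and a browse as weight 1:

Label       Weight
Red wine    1    // browsing red wine
Red wine    5    // purchasing red wine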

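Putting the above together, the data model of a user portrait can be summarized as: user ID + time + action type + touchpoint (URL + content). A user did something at a certain time and place, and is therefore given the corresponding label.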
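The four elements of this model can be written down as a record type. This is only a sketch: the class names, field names, dwell time, and placeholder URL below are assumptions, not part of the original article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TouchPoint:
    url: str        # where the action happened; drives the weight
    content: tuple  # what the page is about; drives the labels


@dataclass(frozen=True)
class BehaviorEvent:
    user_id: str             # which user
    timestamp: int           # what time (Unix seconds)
    dwell_seconds: int       # how long the user stayed
    action: str              # what action, e.g. "browse" or "purchase"
    touch_point: TouchPoint  # what place


event = BehaviorEvent(
    user_id="user-A",
    timestamp=1395121950,
    dwell_seconds=30,  # hypothetical value
    action="browse",
    touch_point=TouchPoint(
        url="https://wine.example/item/1",  # placeholder URL
        content=("red wine", "Great Wall"),
    ),
)

print(event.action, event.touch_point.content)
```

Keeping the URL and the content together in the touchpoint mirrors the split described above: one side feeds the weight, the other feeds the labels.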

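A label's weight may decay as time passes, so time is expressed as a decay factor r. The action type and the URL determine the weight, and the content determines the label. This gives the formula:

label weight = decay factor × action weight × URL sub-weight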

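Example: yesterday, user A browsed a bottle of Great Wall dry red wine priced at 238 yuan on the Pinshang red wine site.

Labels: red wine, Great Wall
Time: the action happened yesterday, so assume a decay factor of r = 0.95
Action type: a browse action has weight 1
Place: the URL sub-weight of a Pinshang product page is set to 0.9 (compared with 0.7 for a JD.com red wine product page)
The assumption is that a user who genuinely likes red wine will shop at a specialist red wine site rather than at a general-purpose mall.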

Then the user preference label is red wine, and with the Pinshang sub-weight of 0.9 the weight is 0.95 × 1 × 0.9 = 0.855. That is, user A: red wine 0.855, Great Wall 0.855.
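The calculation can be sketched in a few lines. The action and site weights are the example values from the text; the dictionary keys and the exponential form of the decay factor (0.95 per day) are assumptions, since the article only states r = 0.95 for an action that happened yesterday. With the 0.9 sub-weight given for the Pinshang page the product is 0.855, while the 0.7 sub-weight given for JD.com would give 0.665.

```python
# Illustrative weights only; the values follow the worked example in the text.
ACTION_WEIGHT = {"browse": 1.0, "purchase": 5.0}
SITE_WEIGHT = {"pinshang": 0.9, "jd": 0.7}  # keys are hypothetical site names


def decay_factor(days_ago, daily_rate=0.95):
    """Assumed exponential decay: r = daily_rate ** days_ago."""
    return daily_rate ** days_ago


def label_weight(days_ago, action, site):
    """label weight = decay factor x action weight x URL sub-weight."""
    return decay_factor(days_ago) * ACTION_WEIGHT[action] * SITE_WEIGHT[site]


w = label_weight(days_ago=1, action="browse", site="pinshang")

# The page content supplies the labels; each label gets the same weight.
profile_a = {label: round(w, 3) for label in ("red wine", "Great Wall")}
print(profile_a)
```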

The weight values chosen in the model above are only examples for reference. Actual weight values need a second round of modeling based on business requirements. The emphasis here is on how to think about building a user portrait model as a whole, and then refine it step by step.

4. Summary

This article does not cover specific algorithms. Instead it describes an analytical approach that offers a systematic framework for thinking when you plan to build user portraits.

The core lies in understanding user touchpoints: the content of a touchpoint directly determines the label information. The content's address (URL), the action type, and time decay determine the weight, and this weight model is the key; further modeling of the weight values themselves is a natural next step. The example model focuses on e-commerce, but touchpoints can be redefined for different products.

For example, with film and television products: if I watched the movie "A Better Tomorrow" (英雄本色), the possible labels are Chow Yun-fat 0.6, gunfight 0.5, Hong Kong and Taiwan 0.3.

Finally, a touchpoint does not necessarily have content. It can also be generalized into a threshold, such as a certain action exceeding some number of occurrences or reaching some duration.

For example, for game products, typical touchpoints may be key tasks, key indices (scores), and so on. If a user's points exceed 10,000, the user is marked as a diamond user: Diamond User 1.0.
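A threshold touchpoint of this kind can be sketched as a rule table. The 10,000-point rule and the weight of 1.0 come from the text; the function name and rule layout are hypothetical.

```python
def threshold_labels(points, rules):
    """Return {label: 1.0} for every rule whose point threshold is exceeded."""
    return {label: 1.0 for label, minimum in rules.items() if points > minimum}


# From the text: more than 10,000 points marks a diamond user.
RULES = {"Diamond User": 10_000}

print(threshold_labels(12_500, RULES))
print(threshold_labels(800, RULES))
```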

 

http://www.36dsj.com/archives/13324
