Technical know-how for data product managers: Python

I. Introduction to Python

Xiao Nai: So do people who write code actually fall into different camps?

Daren: You mean programming languages? Let me walk you through it. Look at the data from GitHub (the site where programmers socialize and host code), specifically pull requests by programming language. JavaScript has the highest volume, reflecting the heyday of the front end. Python is rising quickly and has a lot of potential. Java has stayed very stable, holding its place year after year as the leading mainstream back-end language.

II. Why does Python's popularity keep rising?

What can Python be used for?

Back-end development, with Django as a common framework;
Data analysis, with pandas as a common library;
Web crawling, with Scrapy;
Artificial intelligence, with TensorFlow.

Demand for artificial intelligence and data analysis has kept rising in recent years, and so have salaries for people with these skills. Since Python handles both, its popularity has naturally climbed quickly.

III. Web crawlers

Any discussion of data analysis has to start with the data source. This is usually internal data, but there is external data as well. There are many ways to obtain external data, and the most common is a web crawler.

A crawler that follows the robots protocol can fetch publicly available information from the web.

1. How crawlers work

Python has mature crawling tools (the Scrapy framework, the bs4 parsing library). Give a crawler a website address and it will go and fetch it, much like entering a URL in a browser. The difference is that the crawler extracts the useful information from the HTML files it gets back, and also follows links to other related pages, like a site directory that links to abc, 123, 456 and so on.
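
To make the "extract useful information and follow links" idea concrete, here is a minimal sketch. It uses only Python's standard html.parser rather than Scrapy or bs4, and it parses a hard-coded HTML string instead of fetching a real page; the class name, sample HTML, and example.com URLs are assumptions made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the page title and the absolute URL of every <a href>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page address
                self.links.append(urljoin(self.base_url, href))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical page content standing in for a downloaded HTML file
SAMPLE_HTML = """<html><head><title>Site directory</title></head>
<body><a href="/abc">abc</a> <a href="https://example.com/123">123</a></body></html>"""

extractor = LinkExtractor("https://example.com/")
extractor.feed(SAMPLE_HTML)
print(extractor.title)
print(extractor.links)  # the next pages a crawler would visit
```

A real crawler would download each page, run this kind of extraction, and push the discovered links onto a queue of pages to visit next.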

Xiao Nai: Isn't crawling other people's information against the law?

Daren: It depends on how you crawl. There is a crawler protocol (robots) through which each site can declare which paths may be crawled and which may not. Take Taobao's robots.txt as an example:

User-agent: Baiduspider

Allow: /article

Allow: /oshtml

Disallow: /product/

Disallow: /

As long as you follow the robots protocol and do not use the crawled data commercially, you are basically fine. Commercial use is still a gray area, in a chaotic and unregulated phase. (I am only raising the question here and would welcome an answer from a professional.)
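
To see how a well-behaved crawler might apply these rules, here is a minimal sketch using Python's standard urllib.robotparser on the rules quoted above. The www.taobao.com URLs and the "/product/1" path are only illustrative, and the rules are parsed from a string rather than downloaded.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the article, as a robots.txt string
RULES = """\
User-agent: Baiduspider
Allow: /article
Allow: /oshtml
Disallow: /product/
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Ask whether the named crawler may fetch a few illustrative URLs
for url in (
    "https://www.taobao.com/article",
    "https://www.taobao.com/product/1",
    "https://www.taobao.com/",
):
    print(url, parser.can_fetch("Baiduspider", url))
```

Python's parser applies the first rule that matches, in file order, so /article is allowed while /product/ and everything else is refused for Baiduspider. Other crawlers may resolve overlapping rules differently.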

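2. Taobao blocks Baidu

Back then you could still find Taobao product listings through Baidu. Later Taobao decided to block search engines to varying degrees. Taobao was not yet that big at the time, and blocking Baidu meant giving up a lot of off-site traffic.

But this pivotal decision unified users' mental model (products are searched for inside Taobao), to say nothing of the cash-flow-generating Taobao advertising business that followed. At the time, it was a real test of the product decision-makers.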

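3. Search engines

Crawlers seem closely tied to search engines. They are, so it is time for a quick primer on how a search engine works.

Suppose you type "product manager" into a search engine called "JackSearch". When you click search, the server looks in its database and returns the relevant documents. Where do those documents come from, you might ask?
Crawlers fetched them from the world of web pages.

Of course, a search engine is far more complex than this. The information crawlers bring back also has to be stored and indexed. On that topic I recommend a book, "Lucene".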
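
To give a feel for what "building an index" means, here is a toy inverted index. It is a sketch of the general idea only, not how Lucene or any real search engine is implemented; the sample documents and function names are made up for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical pages a crawler might have brought back
docs = {
    1: "product manager interview tips",
    2: "python crawler tutorial",
    3: "data product manager and python",
}
index = build_index(docs)
print(search(index, "product manager"))
```

The point of the index is that a query only touches the entries for its own terms, instead of rescanning every stored page.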

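4. Common tools for a data product manager (ex-developer)

In ancient times Unix systems had no graphical interface at all, and programmers wrote code in vi, in a dense, pitch-black command line. Even today some geeks write code only in the terminal. But living conditions have improved, and most programmers now use an integrated development environment (IDE), which improves efficiency and saves a lot of mental effort.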

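5. PyCharm (an integrated development environment for Python)

Database tools: Navicat (for MySQL), Robomongo (for MongoDB)
Back-end development tools: the JetBrains family, WebStorm (for JavaScript), PyCharm (for Python)
Version control tools: SourceTree (for Git), SVN
Prototyping tools: Axure, Modao (墨刀), etc.

A quick introduction to PyCharm. It looks roughly like this: the project files (1.py, 2.py) are on the left, the main area is the code editor, and the debug window is at the bottom.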

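6. The Scrapy crawling framework

Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing.

It is worth mentioning that Scrapy is often used for distributed crawling (on its own it runs as a single process; spreading it across machines takes an extension such as scrapy-redis). How should you think about "distributed"? We covered it last time as well: suppose there are 100 crawlers and today's job is to crawl 100 novels. If there are 100 machines with one crawler on each, and each crawler fetches a different novel, that is distributed.

Distributed crawlers make it easy to scale performance and greatly improve crawl throughput.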
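
The work-splitting idea can be sketched on a single machine. The snippet below only shows how 100 novels might be divided among workers round-robin; it is not Scrapy code, and the function name and novel identifiers are made up. Real distributed crawlers usually coordinate through a shared queue rather than a fixed split.

```python
def partition(tasks, num_workers):
    """Assign tasks to workers round-robin; returns one task list per worker."""
    buckets = [[] for _ in range(num_workers)]
    for i, task in enumerate(tasks):
        buckets[i % num_workers].append(task)
    return buckets


# Hypothetical identifiers for the 100 novels to crawl
novels = [f"novel-{n:03d}" for n in range(1, 101)]

# 100 workers: each one gets exactly one novel, as in the example above
one_each = partition(novels, 100)

# Fewer machines: each worker simply gets a longer list
eight_workers = partition(novels, 8)
print([len(bucket) for bucket in eight_workers])
```

Because every novel lands in exactly one bucket, no two workers fetch the same one, which is the property that makes adding machines pay off.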

7. Crawler in practice

Create a project: scrapy startproject tutorial;
Create a spider: scrapy genspider [-t template] <name> xxx.com;
Edit settings.py: set DEFAULT_REQUEST_HEADERS and USER_AGENT;
Define the item: class DemoItem(scrapy.Item) with the fields name = scrapy.Field(), title = scrapy.Field(), link = scrapy.Field(), info = scrapy.Field();
Write the spider's crawling logic;
Store the results in a database (MySQL, MongoDB, etc.).
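
For the settings.py step, a fragment might look like the sketch below. The setting names are standard Scrapy settings, but every value (the user-agent string, the headers, the delay) is an assumption chosen for illustration, not the configuration used in this article.

```python
# settings.py (fragment): all values below are illustrative assumptions

BOT_NAME = "tutorial"

# Respect each site's robots.txt, as discussed earlier
ROBOTSTXT_OBEY = True

# Identify the crawler; a hypothetical user-agent string
USER_AGENT = "Mozilla/5.0 (compatible; tutorial-bot/0.1)"

# Headers sent with every request unless a spider overrides them
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}

# Wait one second between requests to avoid hammering the site
DOWNLOAD_DELAY = 1
```

Keeping ROBOTSTXT_OBEY on and setting a delay are the polite defaults; they keep the crawler within the bounds a site has declared.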

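8. Want the data without writing a crawler?

That is certainly possible. Common tools include Octoparse (八爪鱼) and LocoySpider (火车头). Octoparse has several advantages: a low learning curve, a visual workflow, and quick setup of a collection system; direct export to Excel files or to a database; and lower collection cost, since cloud collection provides 10 nodes, which saves a lot of hassle.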

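IV. Data analysis tells you what Demi-Gods and Semi-Devils (天龙八部) is about

Xiao Nai: So how does Python come into play in data analysis work?

Daren: Data analysis can be done with Python, R, or third-party analysis tools; any of them works. What matters most is tying the analysis to the business and having an analytical approach, and that takes business experience. Let me give an example. I read a lot of novels, so let's take Demi-Gods and Semi-Devils (天龙八部).

Which eight groups does the title 天龙八部 refer to? All eight are "non-human": eight kinds of divine and monstrous beings. Because the devas (天) and the nagas (龙) come first, they are called 天龙八部.
The eight are: first, devas; second, nagas; third, yakshas; fourth, gandharvas; fifth, asuras; sixth, garudas; seventh, kinnaras; eighth, mahoragas.

If that introduction still does not make sense, no matter. Today is mainly about using data to analyze the novel's high-frequency words and its character relationships (the relationships really are complicated; my computer ran hot and trembled), and to work out what the story is actually about.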

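1. "自己" (oneself)?

Looking at the word cloud below, why is the word "自己" (oneself) so frequent? It probably has to do with the narrative voice, perhaps an omniscient point of view? (I am a bit puzzled; fans of the novel, please enlighten me.)

At first glance Duan Yu (段誉) has the highest word frequency (1551), but you have to factor in the "business" context: Qiao Feng (乔峰) is the real lead. It goes back to his origins. At the start of the novel Qiao Feng is chief of the Beggars' Sect. Later his origins are exposed, he turns out to be a Khitan, and he takes the name Xiao Feng (萧峰).

So Qiao Feng's word frequency (1900+) = Qiao Feng (963) + Xiao Feng (966).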
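
This alias merging is easy to do in code. The sketch below uses the three counts quoted above; the alias table and function name are assumptions for illustration, and a fuller analysis would need an entry for every character who goes by more than one name.

```python
from collections import Counter

# Raw word counts quoted in the article
raw_counts = Counter({"段誉": 1551, "乔峰": 963, "萧峰": 966})

# Hypothetical alias table: later name -> canonical name
ALIASES = {"萧峰": "乔峰"}


def merge_aliases(counts, aliases):
    """Add each alias's count to its canonical name."""
    merged = Counter()
    for name, n in counts.items():
        merged[aliases.get(name, name)] += n
    return merged


merged = merge_aliases(raw_counts, ALIASES)
print(merged.most_common())
```

After merging, 乔峰 totals 963 + 966 = 1929 and moves ahead of 段誉 at 1551, which matches the "business" reading that he is the real lead.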

The words also reveal the writing technique: Qiao Feng (or Duan Yu) listens / laughs / is stunned / moves, that is, character + verb.

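2. Character relationship graph

The story has many main threads.

(1) Revenge:

Why are Xuzhu and Qiao Feng the most closely connected? Because Xuzhu's father was the "Leading Big Brother" (带头大哥) who led the ambush on Qiao Feng's father, and revenge is one of the novel's main threads.

(2) Duan Zhengchun's love life:

Seen from another angle, this is the romantic history of Duan Zhengchun, Prince of Zhennan of Dali. He has affairs with several women, every child born of them is a daughter, and the daughters one by one fall in love with Duan Yu. This torments Duan Yu, until he finally discovers that he is not Duan Zhengchun's biological son.

In summary: the story is set in motion by two "big bosses", Murong Bo and Duan Zhengchun, each responsible for one main thread:

Murong Bo wants to restore the state of Yan, so he plots the killing of the Xiao family, hoping to provoke war between the two countries, and this sets Xiao Feng's father on the path of revenge;
The Prince of Zhennan is responsible for the philandering: a trail of romantic debts, great fun for one man and painful for many others, ending in his suicide.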

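V. Hands-on practice

1. Word cloud

This mainly uses three libraries: jieba for word segmentation, wordcloud for the word cloud, and matplotlib for display.

Download the novel as a txt file;
Prepare a mask image;
Prepare a font file;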

# coding: utf-8
import csv
from os import path
from collections import Counter

import jieba
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

if __name__ == '__main__':
    # Read the novel text from the script's directory
    d = path.dirname(path.abspath(__file__))
    pardir = path.dirname(d)
    pardir2 = path.dirname(pardir)
    cyqf = path.join(pardir2, 'tlbbqf/')
    with open(path.join(d, 'tlbb.txt'), encoding='utf-8', errors='surrogateescape') as f:
        text = f.read()

    # Segment the text into words; cut_all=False selects jieba's precise mode
    data = list(jieba.cut(text, cut_all=False))

    # Count word frequencies and write them out; csv.writer quotes tokens
    # that themselves contain commas or line breaks
    dataDict = Counter(data)
    with open('./词频统计.csv', 'w', encoding='utf-8', newline='') as fw:
        csv.writer(fw).writerows(dataDict.items())

    # Load the mask image and the font (the font must contain Chinese glyphs)
    mask = np.array(Image.open(path.join(d, 'mask.png')))
    font_path = path.join(d, 'font.ttf')
    stopwords = set(STOPWORDS)
    wc = WordCloud(background_color='white',
                   max_words=2000,
                   mask=mask,
                   stopwords=stopwords,
                   font_path=font_path)

    # Generate the word cloud from the segmented words joined by spaces, so
    # that WordCloud counts words rather than unsegmented runs of Chinese text
    wc.generate(' '.join(data))

    # Save the generated word cloud image locally
    wc.to_file(path.join(d, 'wordcloud.png'))

    # Display the image
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.show()

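2. Character relationship graph

(1) Count word frequencies

    with open(path.join(d, 'tlbb.txt'), encoding="utf-8", errors="surrogateescape") as f:
        text = f.read()
    # cut_all selects the segmentation mode; False means precise mode
    jieba_word = jieba.cut(text, cut_all=False)
    data = []
    for word in jieba_word:
        data.append(word)
    dataDict = Counter(data)

(2) Compute the relationship matrix between characters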