Statistics on cnblogs article publishing times

I just wanted to see when everyone publishes their blog posts!

Step one:

First, crawl the article list from https://www.cnblogs.com/. Only 200 pages are available, covering a time range of a month and a few days. I don't know whether that is everything, but so be it

The code is simple: https://github.com/dytttf/little_spider/blob/master/cnblogs/blog_index_spider.py

The data format is as follows:

data = {
  "https://www.cnblogs.com/xxx.html": {
    "url": "https://www.cnblogs.com/xxx.html",
    "title": "xxx",
    "summary": "xxx",
    "author": "xxx",
    "author_url": "https://www.cnblogs.com/xxx/",
    "ctime": "2019-11-11 11:11"
  }
}

Step two:

Load the data with pandas

import pandas as pd

data_list = list(data.values())
df = pd.DataFrame(data_list)
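To make the loading step concrete, here is a minimal sketch with two made-up records in the same shape as the crawler output. The URLs, titles, and author names are placeholders, not real data.

```python
import pandas as pd

# Toy records in the same shape as the crawler output (all values are made up).
data = {
    "https://www.cnblogs.com/a.html": {
        "url": "https://www.cnblogs.com/a.html",
        "title": "post a",
        "summary": "summary a",
        "author": "alice",
        "author_url": "https://www.cnblogs.com/alice/",
        "ctime": "2019-11-11 11:11",
    },
    "https://www.cnblogs.com/b.html": {
        "url": "https://www.cnblogs.com/b.html",
        "title": "post b",
        "summary": "summary b",
        "author": "bob",
        "author_url": "https://www.cnblogs.com/bob/",
        "ctime": "2019-11-12 08:30",
    },
}

# The dict keys duplicate the "url" field, so only the values are needed.
df = pd.DataFrame(list(data.values()))
print(df.columns.tolist())
# "ctime" is still text at this point, not a datetime column.
print(df.dtypes["ctime"])
```

Each record becomes one row and each field becomes one column, which is why the later steps can address columns such as "ctime" and "author" directly.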

Step three:

Convert the publish time to a datetime, then drop the data older than 30 days

import datetime

# Convert the time format
df["ctime"] = pd.to_datetime(df["ctime"])
# Drop posts older than 30 days
df = df[df["ctime"] > df["ctime"].max() - datetime.timedelta(days=30)]
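Here is a small sketch of the 30-day filter on made-up timestamps, to show that the cutoff is measured from the newest post in the data rather than from today. It uses pd.Timedelta, which behaves the same as datetime.timedelta here; the names cutoff and recent are only for illustration.

```python
import pandas as pd

# Made-up publish times: the newest post, one inside the window, one outside.
df = pd.DataFrame(
    {"ctime": ["2019-10-22 09:00", "2019-09-25 08:30", "2019-09-01 23:15"]}
)
df["ctime"] = pd.to_datetime(df["ctime"])

# The cutoff is relative to the newest post in the data, not to the current date.
cutoff = df["ctime"].max() - pd.Timedelta(days=30)
recent = df[df["ctime"] > cutoff]
print(recent)
```

Anchoring on the newest post means the result stays the same no matter when the analysis is rerun on the saved data.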

Step four:

Get the distribution by hour and plot it

import matplotlib.pyplot as plt

df["hour"] = df["ctime"].apply(lambda x: x.hour)
hour_counter = df["hour"].value_counts().sort_index()
hour_counter.plot.bar()
plt.show()
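One thing to watch with the hourly count: an hour with no posts simply does not appear in value_counts(), so the bar chart would skip it. A possible variant, sketched on made-up timestamps, uses the vectorized .dt.hour accessor and reindexes over all 24 hours.

```python
import pandas as pd

# Made-up publish times.
ctime = pd.to_datetime(
    pd.Series(["2019-10-21 08:05", "2019-10-21 08:40", "2019-10-21 23:10"])
)

# .dt.hour is the vectorized equivalent of apply(lambda x: x.hour).
hour_counter = ctime.dt.hour.value_counts()
# Reindex over all 24 hours so that hours without posts show up as 0.
hour_counter = hour_counter.reindex(range(24), fill_value=0)
print(hour_counter[hour_counter > 0])
```

The reindexed series can be plotted with plot.bar() exactly like the original one.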

A little surprising: so many posts go out in the morning. I always thought the evening would be the peak, but it seems everyone is quite busy during the day. Then again, the posts may have been written the night before and just not published in time

I was going to stop here, but now that I've started plotting, I might as well draw a few more

Step five:

I wonder whether some days of the month see more posts than others. Let's take a look

df["day"] = df["ctime"].apply(lambda x: x.day)
day_counter = df["day"].value_counts().sort_index()
day_counter.plot.bar()
plt.show()
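Since the 30-day window usually spans two calendar months, sorting by day of month does not give a chronological order. A possible alternative, sketched on made-up timestamps, counts by calendar date, and also by weekday, which might show more of a pattern.

```python
import pandas as pd

# Made-up publish times that cross a month boundary.
ctime = pd.to_datetime(
    pd.Series(
        ["2019-09-30 10:00", "2019-10-01 09:00", "2019-10-01 21:00", "2019-10-02 08:00"]
    )
)

# Normalizing to midnight keeps the month, so the index sorts chronologically.
date_counter = ctime.dt.normalize().value_counts().sort_index()
print(date_counter)

# Weekday view: 0 is Monday, 6 is Sunday.
weekday_counter = ctime.dt.dayofweek.value_counts().sort_index()
print(weekday_counter)
```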

Well, this looks fairly even, nothing much to see

Step six:

Let's see whether there are any blogging maniacs

author_counter = df["author"].value_counts()[:100]
author_counter.plot.bar()
# Print the top five
print(author_counter[:5])
plt.show()
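Besides the ranking itself, the same value_counts() result can answer how concentrated the posting is. This is only a sketch on made-up author names; top_share and single_post_authors are illustrative names, not part of the original script.

```python
import pandas as pd

# Made-up author column: "a" posted 3 times, "b" twice, "c" and "d" once each.
authors = pd.Series(["a", "a", "a", "b", "b", "c", "d"], name="author")

posts_per_author = authors.value_counts()
# Share of all posts written by the two most active authors.
top_share = posts_per_author.head(2).sum() / len(authors)
# Number of authors who published exactly once.
single_post_authors = int((posts_per_author == 1).sum())
print(round(top_share, 2), single_post_authors)
```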

Nobody seems to have posted an especially large number. Let's see who the top 5 are

"""
阿里巴巴云原生     22
极客挖掘机       22
程序新视界       20
chen_hao    16
赐我白日梦       13
"""

Step seven:

Let's see what kinds of posts everyone is publishing: tokenize the titles and build a word cloud

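import re
from collections import Counter

import jieba
from wordcloud import WordCloud

for cat in ["title", "summary"]:
    # Replace every non-word character with a space (raw string avoids an invalid escape)
    text = re.sub(r"[^\w]", " ", " ".join(list(df[cat])))
    # Drop whitespace and single-character tokens
    word_counter = Counter([x for x in jieba.cut(text) if len(x.strip()) > 1])
    top_30 = word_counter.most_common(30)
    # Word cloud; font_path must point to a font that can render Chinese
    title_word_cloud = WordCloud(
        background_color="white", font_path="simsun.ttf"
    ).generate_from_frequencies(dict(top_30))
    plt.imshow(title_word_cloud)
    plt.axis("off")
    plt.show()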
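The counting part of this step can be tried without jieba or wordcloud installed. The sketch below uses made-up English titles and swaps jieba.cut for a plain str.split(), which only works for space-separated text and is not a substitute for real Chinese word segmentation.

```python
import re
from collections import Counter

# Made-up English titles standing in for the real title column.
titles = [
    "How to read Spring source code",
    "Spring Boot source code analysis",
    "How to learn Python",
]

# Replace every non-word character with a space, then lowercase.
text = re.sub(r"[^\w]", " ", " ".join(titles)).lower()
# str.split() stands in for jieba.cut; single-character tokens are dropped.
words = [w for w in text.split() if len(w) > 1]
word_counter = Counter(words)
top = word_counter.most_common(3)
print(top)
```

The resulting list of (word, count) pairs has the same shape as top_30, so dict(top) is what generate_from_frequencies would receive.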

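Title word cloud:

PS:

1. As a Pythoneer, I refuse to concede to Java

2. "How to"? Does everyone really like using questions as titles?

3. A whole series of implementations, source code, analyses, frameworks, patterns, and principles. It feels like too much to keep up with

Summary word cloud:

PS:

1. "We use a what? It can implement what?" Bloggers all love saying "we" to feel closer to the reader :)

Word clouds are pretty fun...

Finally, here is the code: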

https://github.com/dytttf/little_spider/blob/master/cnblogs/analysis.py

Origin www.cnblogs.com/dyfblog/p/11723571.html