Using Python to analyze 1.4 billion rows of data! A senior programmer walks you through it hands-on!

The challenge

The 1-gram dataset expands to 27 GB of data on disk, which is a lot of data to read into Python. As one lump, Python can handle gigabytes of data easily, but once the data is destructured and processed, things get much slower and less memory efficient.

In total, there are 1.4 billion rows of data (1,430,727,243) spread across 38 source files, covering 24 million (24,359,460) words (and part-of-speech-tagged words, see below), counted between 1505 and 2008.

Loading the data

All code/examples below were run on a 2016 MacBook Pro with 8 GB of RAM. Hardware or a cloud instance with more RAM will perform better.

The 1-gram data is stored in tab-separated files and looks like this:

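Schematically, each line carries four tab-separated columns (this is the published Google Books 1-gram layout, shown here abstractly rather than with real counts):

    ngram <TAB> year <TAB> match_count <TAB> volume_count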

Each row contains the following fields:

1. The word (the 1-gram itself)
2. The year of publication
3. The total number of times the word was seen that year
4. The total number of distinct publications containing the word that year

To generate the chart we want, we only need to know three things:

1. Is this word the one we are interested in?
2. The year of publication
3. The total number of times the word was used that year

By extracting just this information, the extra cost of handling variable-length string data is avoided, but we still need to compare string values to work out which rows refer to the field we are interested in. This is what pytubes can do:

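As a rough stand-in for the pytubes pipeline (this is plain Python, not the pytubes API), the following sketch shows what gets extracted from every row: whether column 0 equals the word we care about, the year, and the match count. The file glob and the WORD constant are assumptions about where the data lives.

    import glob
    from os import path
    import numpy as np

    # Assumed location of the decompressed 1-gram files -- adjust to your setup.
    FILES = glob.glob(path.expanduser("~/data/ngrams/1gram/googlebooks*"))
    WORD = b"Python"

    def iter_rows(files):
        # Yield (is_word, year, match_count) for every tab-separated row.
        for filename in files:
            with open(filename, "rb") as f:
                for line in f:
                    cols = line.rstrip(b"\n").split(b"\t")
                    yield cols[0] == WORD, int(cols[1]), int(cols[2])

    # Stream the rows straight into a structured numpy array
    # (np.fromiter accepts structured dtypes from NumPy 1.23 onwards).
    one_grams = np.fromiter(
        iter_rows(FILES),
        dtype=[("is_word", np.bool_), ("year", np.int64), ("count", np.int64)],
    )

This pure-Python loop is far slower than the pytubes pipeline the article is describing; it is only meant to show which three values survive the extraction.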

After about 170 seconds (roughly 3 minutes), one_grams is a numpy array with almost 1.4 billion rows, which looks like this (header added for illustration):

╒═══════════╤════════╤═════════╕
│ Is_Word   │ Year   │ Count   │
╞═══════════╪════════╪═════════╡
│ 0         │ 1799   │ 2       │
├───────────┼────────┼─────────┤
│ 0         │ 1804   │ 1       │
├───────────┼────────┼─────────┤
│ 0         │ 1805   │ 1       │
├───────────┼────────┼─────────┤
│ 0         │ 1811   │ 1       │
├───────────┼────────┼─────────┤
│ 0         │ 1820   │ ...     │
╘═══════════╧════════╧═════════╛

From here on, it is just a matter of using numpy methods to compute a few things:

Total word usage per year

Google shows the percentage of occurrences of each word (the number of times a word appears in a given year divided by the total number of all words in that year), which is more useful than the raw counts alone. To calculate this percentage, we need to know the total number of words per year.

Fortunately, numpy makes this very easy:

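As a minimal sketch of that step (not necessarily the article's exact code), assuming the one_grams structured array built above, np.bincount with the per-row counts as weights gives the total usage for every year in a single call:

    # Total word usage per year: bucket by year, weighting each row by its count.
    # year_totals[y] is the total number of 1-gram occurrences counted in year y.
    year_totals = np.bincount(one_grams["year"], weights=one_grams["count"])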

Plotting this shows how many words Google has collected per year:

It is clear that before 1800 the total amount of data drops off steeply, which would distort the final results and hide the patterns we are interested in. To avoid this, we only include data from 1800 onwards:

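A one-line sketch of that filter, again assuming the structured array from above:

    # Keep only rows from 1800 onwards.
    one_grams = one_grams[one_grams["year"] > 1799]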

This leaves 1.3 billion rows of data (only 3.7% of the rows are from before 1800).

Python's percentage per year

Getting the per-year percentage for "python" is now remarkably simple.

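A sketch of the calculation, assuming the one_grams array and the year_totals computed earlier: select the rows where is_word is set, sum their counts per year, and divide by the yearly totals.

    # Rows that actually match the word we care about.
    word_rows = one_grams[one_grams["is_word"]]

    # Per-year counts for the word, padded to the same length as year_totals.
    word_counts = np.bincount(
        word_rows["year"], weights=word_rows["count"], minlength=len(year_totals)
    )

    # Fraction of all 1-grams in each year that are our word; years with a
    # zero total are left at 0 instead of raising a divide-by-zero warning.
    word_percent = np.divide(
        word_counts, year_totals,
        out=np.zeros_like(word_counts), where=year_totals > 0,
    )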

Plotting the resulting word_counts:

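A minimal matplotlib sketch, assuming the word_percent series from the previous step (the raw word_counts could be plotted the same way); the axis labels and year range are illustrative choices:

    import numpy as np
    import matplotlib.pyplot as plt

    years = np.arange(len(word_percent))   # index i corresponds to year i
    plt.plot(years[1800:], word_percent[1800:] * 100)
    plt.xlabel("Year")
    plt.ylabel("% of all 1-grams")
    plt.title("'Python' mentions per year")
    plt.show()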

The shape looks pretty similar to Google's version.

Performance

Google generates its chart in about 1 second, compared with roughly 8 minutes for this script, which is understandable: Google's word-count backend works from clearly pre-prepared views of the dataset.

For example, pre-computing the total word usage per year and storing it in a separate lookup table would save significant time. Likewise, storing the word usage in a separate database/file and indexing the first column would eliminate almost all of the processing time.
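As an illustration of that second idea, here is a hypothetical SQLite sketch (the database file, table, and column names are invented for the example): after bulk-loading the rows, an index on the word column lets a single word's history be fetched without scanning all 1.4 billion rows.

    import sqlite3

    conn = sqlite3.connect("ngrams.db")   # hypothetical database file
    conn.execute(
        "CREATE TABLE IF NOT EXISTS one_grams "
        "(word TEXT, year INTEGER, match_count INTEGER)"
    )
    # ... bulk-insert the parsed rows here ...

    # Index the first column so one word's rows can be found without a full scan.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_word ON one_grams (word)")
    conn.commit()

    # Fetching one word's per-year counts is now a cheap indexed query.
    rows = conn.execute(
        "SELECT year, match_count FROM one_grams WHERE word = ?", ("Python",)
    ).fetchall()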

This exploration really shows that, using numpy, the fledgling pytubes, standard commodity hardware, and Python, it is possible to load, process, and extract arbitrary statistics from billion-row datasets in a reasonable amount of time.

The result:

Compared to Google (without any baseline adjustments):

There are a couple of ways pytubes itself could still be improved:

More filtering logic - Tube.skip_unless() is a fairly simple way to filter rows, but it lacks the ability to combine conditions (AND/OR/NOT). For some use cases, this could shrink the size of the loaded data much faster.

Better string matching - simple tests such as startswith, endswith, contains, and is_one_of could easily be added to significantly improve the efficiency of loading string data.

Thanks for reading! Pretty impressive, right? 1.4 billion is not a small number!
