Analyzing 1.4 billion rows of data with Python is so exciting!

What's the largest dataset you've processed with Python? I'd guess it probably hasn't gone past a hundred million rows. Today I'll share a case study of processing 1.4 billion rows of data in Python.

The 1.4 billion rows of data come from Google Books, generated by the Google Ngram Viewer. For each year of printed books, it records how many times a particular word or phrase was used in Google Books. The dataset covers tens of millions of books, spanning from the 16th century to 2008. You can plot how a word's usage changes over time, for example the frequency with which "Python" appears throughout history.

Below, we load the dataset with the Python pytubes library and analyze it to produce a chart like the one above. PyTubes is a library designed for loading large data sources.

On disk, the 1-gram dataset expands to 27 GB, which is a lot of data to read into Python. Python can handle a gigabyte of data easily, but once the data grows to many gigabytes, processing slows down and memory efficiency suffers.

Overall, there are 1.4 billion rows (1,430,727,243) of data spread across 38 source files, covering 24 million (24,359,460) words (plus part-of-speech tags, see below), counted from 1505 to 2008.

When dealing with a billion rows of data, things slow down quickly, and plain Python isn't optimized for this kind of work. Fortunately, numpy is really good at handling large volumes of data. With a few simple tricks, we can use numpy to make this analysis feasible.

String handling in Python/numpy is complicated. Python strings carry significant memory overhead, and numpy can only handle strings of known, fixed length. Since most words have different lengths, this is not ideal.
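To illustrate (my own example, not from the article): a short Python str carries tens of bytes of object overhead, and numpy's fixed-width string dtype silently truncates longer words:

import sys
import numpy as np

# Per-object overhead of a short Python string (CPython 3: ~49 bytes + 1 per char)
print(sys.getsizeof("Python"))  # roughly 55 bytes for 6 characters

# Fixed-width numpy strings: dtype 'S6' cuts off anything longer than 6 bytes
words = np.array([b"Python", b"Pascal", b"programming"], dtype="S6")
print(words)  # [b'Python' b'Pascal' b'progra']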

Loading the dataset

All of the following code/examples were run on a 2016 MacBook Pro with 8 GB of RAM. Hardware or cloud instances with more RAM will perform better.

The 1-gram data is stored as tab-separated files and looks like this:


Python 1587 4 2
Python 1621 1 1
Python 1651 2 2
Python 1659 1 1

Each row of data contains the following fields (a minimal parsing sketch follows the list):


1. Word
2. Year of publication
3. Total number of times the word was seen
4. Total number of books containing the word
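For reference, a naive pure-Python parse of one of these files might look like the sketch below (the file name is a made-up example); this line-by-line approach is exactly what becomes painfully slow at 27 GB:

# Minimal sketch, not the article's code: parse one tab-separated 1-gram file
rows = []
with open("googlebooks-sample-1gram.tsv", "rb") as f:
    for line in f:
        word, year, match_count, volume_count = line.rstrip(b"\n").split(b"\t")
        rows.append((word, int(year), int(match_count), int(volume_count)))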


To generate the chart as required, we only need to know the following information:


1. Is this the word we are interested in?
2. Year of publication
3. Total number of times the word was used

By extracting only this information from each row, the extra cost of handling strings of different lengths is avoided, but we still need to compare string values to find the word we're interested in. This is the job pytubes can do:


import glob
from os import path

import tubes

FILES = glob.glob(path.expanduser("~/src/data/ngrams/1gram/googlebooks*"))
WORD = "Python"
one_grams_tube = (tubes.Each(FILES)
    .read_files()
    .split()
    .tsv(headers=False)
    .multi(lambda row: (
        row.get(0).equals(WORD.encode('utf-8')),
        row.get(1).to(int),
        row.get(2).to(int)
    ))
)
# Materialise the tube into a numpy array for the analysis below
one_grams = one_grams_tube.ndarray()

After almost 170 seconds (3 minutes), one_grams is a numpy array containing nearly 1.4 billion rows of data, and looks like this (table headers added for illustration):


╒═══════════╤════════╤═════════╕
│   Is_Word │   Year │   Count │
╞═══════════╪════════╪═════════╡
│         0 │   1799 │       2 │
├───────────┼────────┼─────────┤
│         0 │   1804 │       1 │
├───────────┼────────┼─────────┤
│         0 │   1805 │       1 │
├───────────┼────────┼─────────┤
│         0 │   1811 │       1 │
├───────────┼────────┼─────────┤
│         0 │   1820 │     ... │
╘═══════════╧════════╧═════════╛

From here we can begin to analyze the data.

Total word counts per year

Google shows each word's occurrence as a percentage (the number of times the word appears in a given year divided by the total number of words that year), which is more useful than the raw count alone. To calculate this percentage, we need to know the total number of words per year.

Fortunately, numpy makes this very simple:


import numpy as np

last_year = 2008
YEAR_COL = '1'
COUNT_COL = '2'
year_totals, bins = np.histogram(
    one_grams[YEAR_COL],
    density=False,
    range=(0, last_year + 1),
    bins=last_year + 1,
    weights=one_grams[COUNT_COL]
)
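As a quick sanity check (a toy example of mine, not from the article), the weighted histogram simply sums the count column into one bin per year:

# Toy data: three rows for the year 2000 and one for 2001, each with a count
toy_years = np.array([2000, 2000, 2001, 2000])
toy_counts = np.array([5, 1, 7, 2])
toy_totals, _ = np.histogram(
    toy_years, range=(0, 2002), bins=2002, weights=toy_counts
)
print(toy_totals[2000], toy_totals[2001])  # sums to 8 and 7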

Plotting this shows how many words Google collected each year:

It's clear that before 1800 the total volume of data falls off quickly, which would distort the final results and hide the patterns we're interested in. To avoid this, we only import data from 1800 onwards:


one_grams_tube = (tubes.Each(FILES)
    .read_files()
    .split()
    .tsv(headers=False)
    .skip_unless(lambda row: row.get(1).to(int).gt(1799))
    .multi(lambda row: (
        row.get(0).equals(WORD.encode('utf-8')),
        row.get(1).to(int),
        row.get(2).to(int)
    ))
)

This returned 1.3 billion rows of data (only 3.7% of rows are from before 1800).
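The article quotes that share in terms of rows; as a quick, related check (my own, not from the article), the share of word occurrences before 1800 can be read off the unfiltered per-year totals computed earlier:

pre_1800_pct = 100 * year_totals[:1800].sum() / year_totals.sum()
print(f"{pre_1800_pct:.1f}% of word occurrences are from before 1800")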

Python's percentage per year

Getting the percentage of "Python" per year is now especially simple.

Using a simple trick, we create an array indexed by year, long enough to have an element for every year up to 2008, so that each year's index equals the year itself; for example, getting the value for 1995 is just a matter of looking up element 1995.

This isn't even worth doing with numpy:


IS_WORD_COL = '0'  # assumed column name, following YEAR_COL='1' and COUNT_COL='2' above
word_rows = one_grams[IS_WORD_COL]
word_counts = np.zeros(last_year + 1)
for _, year, count in one_grams[word_rows]:
    word_counts[year] += (100 * count) / year_totals[year]

Plotting the word_counts results:
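The article doesn't include the plotting code; a minimal matplotlib sketch (my own, reusing last_year and word_counts from above) might look like this:

import numpy as np
import matplotlib.pyplot as plt

# Plot the post-1800 portion of the per-year percentages
years = np.arange(last_year + 1)
plt.plot(years[1800:], word_counts[1800:])
plt.xlabel("Year")
plt.ylabel("% of all words")
plt.title("Mentions of 'Python' in Google Books")
plt.show()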

The shape looks pretty similar to Google's version.

The actual percentages don't match, which I think is because the downloaded dataset includes words tagged in different ways (for example: Python_VERB). The Google page doesn't explain this dataset very well, and it raises a few questions:

  • How is "Python" used as a verb?

  • Does the total count for "Python" include "Python_VERB"? And so on.

Fortunately, it's clear that the method I'm using produces a chart very similar to Google's, and the relevant trends are unaffected, so for this exploration I'm not going to try to fix that.

Performance

Google generates its chart in about a second, compared to roughly eight minutes for this script, but that's reasonable: Google's word-count backend works from clearly pre-prepared views of the dataset.

For example, pre-computing the total word usage for each year and storing it in a separate lookup table would save a significant amount of time. Likewise, storing the word usage in a separate database/file and indexing the first column would eliminate almost all of the processing time.
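As a sketch of the first idea (my own illustration; the file name is assumed), the per-year totals computed above could be saved once and reloaded on later runs:

import numpy as np

# One-off precomputation: persist the per-year totals to disk...
np.save("year_totals.npy", year_totals)

# ...so later runs can skip the histogram pass over 1.4 billion rows
year_totals = np.load("year_totals.npy")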

This exploration really does show that, using numpy and the fledgling pytubes together with standard commodity hardware and Python, it is feasible to load, process and extract arbitrary statistics from a billion-row dataset in a reasonable amount of time.

Comparing Python, Pascal and Perl

To demonstrate the concept with a slightly more complex example, I decided to compare mentions of three related programming languages: Python, Pascal and Perl.

The source data is fairly noisy (it contains every English word ever used, not just mentions of programming languages, and "python", for example, also has non-technical meanings!). To adjust for this, we do two things:

  1. Only the capitalised form of each name is matched (Python, not python)

  2. Each language's total mentions have been converted to a percentage of its average mentions from 1800 to 1960; given that Pascal was first mentioned in 1970, this should give a reasonable baseline (a sketch of this normalisation follows below).
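A rough sketch of that baseline normalisation (my own illustration; the counts dict below is placeholder data, standing in for three word_counts-style series built with the pipeline above):

import numpy as np

def baseline_normalise(percentages, start=1800, end=1960):
    # Scale a per-year percentage series by its mean over the baseline years
    baseline = percentages[start:end + 1].mean()
    return percentages / baseline

# Placeholder series, one per language (real ones come from the ngram pipeline)
counts = {lang: np.random.rand(2009) + 0.1 for lang in ("Python", "Pascal", "Perl")}
normalised = {lang: baseline_normalise(series) for lang, series in counts.items()}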

The results:

Compared with Google (without any baseline adjustment):

Running time: just over 10 minutes.
