Upgrading the Python version on CentOS, installing and using elasticsearch-py, and skipping bad documents in bulk


Upgrading the server

(CentOS 6.9) from Python 2.6.6 to 2.7.15

  1. python -V

  2. cd /opt

  3. wget --no-check-certificate https://www.python.org/ftp/python/2.7.15/Python-2.7.15.tar.xz
  4. tar -xf ./Python-2.7.15.tar.xz

  5. [Error]: tar (child): xz: Cannot exec: No such file or directory

  6. [Fix]: yum -y install xz

  7. tar xf Python-2.7.15.tar.xz

  8. rm Python-2.7.15.tar.xz

  9. mkdir -p /usr/local/py27

  10. cd ./Python-2.7.15

11. ./configure --prefix=/usr/local/py27

The --prefix option sets the installation path. If you omit it, executables are installed to /usr/local/bin by default, libraries to /usr/local/lib, configuration files to /usr/local/etc, and other resources to /usr/local/share, which can make later removal more tedious.

In the actual build, however, nothing was installed according to the description above:
ls /usr/local/py27/ showed no files at all. (Perhaps I misunderstood or mis-executed something? A likely cause: step 12 re-runs ./configure without --prefix, which resets the prefix to the default /usr/local, since the last configure run wins.) The install log indeed shows files going to the default location:

/usr/bin/install -c -m 644 ./Lib/email/_parseaddr.py /usr/local/lib/python2.7/email
/usr/bin/install -c -m 644 ./Lib/email/parser.py /usr/local/lib/python2.7/email
/usr/bin/install -c -m 644 ./Lib/email/quoprimime.py /usr/local/lib/python2.7/email
/usr/bin/install -c -m 644 ./Lib/email/utils.py /usr/local/lib/python2.7/email
/usr/bin/install -c -m 644 ./Lib/email/mime/application.py /usr/local/lib/python2.7/email/mime

12. ./configure --enable-optimizations

What does --enable-optimizations do when compiling Python? It enables profile-guided optimization (PGO): the build runs a profiling workload first and then recompiles using the collected profiles, so the build takes considerably longer but the resulting interpreter is faster.

13. [Compile and install; this can take 7-8 minutes]
make && make altinstall

About make altinstall:
make altinstall installs only the versioned binary (python2.7) and does not create an unversioned python executable. If you used make install instead, you would end up with two different versions of Python both answering to python on the system, which causes many problems and is hard to untangle.

14. /usr/local/bin/python2.7 [check that Python 2.7.15 was installed successfully]

Python 2.7.15 (default, Jun 27 2018, 21:53:07)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-18)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

15. [Back up the old binary and repoint the symlink]
mv /usr/bin/python /usr/bin/python.bak.2.6.6
ln -s /usr/local/bin/python2.7 /usr/bin/python
python -V

16. [Fix yum, which breaks after the upgrade]

There was a problem importing one of the Python modules
required to run yum. The error leading to this problem was:

   No module named yum

vi /usr/bin/yum
[Edit the file so that the shebang at the top points at the old interpreter:]
#!/usr/bin/python.bak.2.6.6

17. [Verify yum]
yum update

pip install

  1. [Install easy_install]
    wget https://bootstrap.pypa.io/ez_setup.py -O - | python

  2. [Install pip]
    cd /usr/local/bin/
    ./easy_install-2.7 pip

    # output printed on success
    Installed /usr/local/lib/python2.7/site-packages/pip-10.0.1-py2.7.egg

  3. ls -l /usr/local/bin/ [you will find three different commands: pip, pip2, and pip2.7]
    The difference, as explained in [3]:

    The idea is, if you have multiple python installations on your path,
    “pip” will find the first one, “pip2” will find the first python2 one
    (so if the first one was python3, or vice versa), and pip2.7 will find
    2.7 even if e.g. 2.6 was before it on the path.

    It makes a bit more sense on Unix where all of these are in /usr/bin and
    only the primary “pip” [of whatever your main python installation]
    actually exists.

  4. [Install elasticsearch. By default pip pulls the latest release; although it is compatible, pinning the major version is recommended, e.g. in setup.py, as sketched below.]
    pip2.7 install elasticsearch
    Successfully installed elasticsearch-6.3.0 urllib3-1.23
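
    A minimal sketch of pinning the client's major version in setup.py (the package name and version here are hypothetical):

    from setuptools import setup

    setup(
        name="myproject",  # hypothetical package name
        version="0.1",     # hypothetical version
        install_requires=[
            # stay on the 6.x client line to match a 6.x cluster
            "elasticsearch>=6.0.0,<7.0.0",
        ],
    )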

  5. [Transport parameters available when connecting to elasticsearch:]

    class elasticsearch.Transport(hosts, connection_class=Urllib3HttpConnection, connection_pool_class=ConnectionPool, host_info_callback=construct_hosts_list, sniff_on_start=False, sniffer_timeout=None, sniff_on_connection_fail=False, serializer=JSONSerializer(), max_retries=3, **kwargs)
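
    You normally do not construct a Transport yourself; the Elasticsearch client forwards any extra keyword arguments to its Transport. A minimal sketch (the host addresses are placeholders):

    from elasticsearch import Elasticsearch

    # extra keyword arguments are forwarded to the underlying Transport
    es = Elasticsearch(
        ["es-node1:9200", "es-node2:9200"],  # placeholder hosts
        sniff_on_start=True,                 # sniff the cluster on startup
        sniff_on_connection_fail=True,       # re-sniff when a node drops
        sniffer_timeout=60,                  # re-sniff at most once a minute
        max_retries=5,                       # retry failed requests
    )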

  6. [What to do when helpers.bulk encounters documents that fail to index]

    With stats_only set to True, bulk returns the number of successes and the number of errors; otherwise it returns the list of errors.
    Note, however, that by default a BulkIndexError is raised when errors occur; the stats_only setting only applies when raise_on_error is set to False.

The relevant source:

    success, failed = 0, 0

    # list of errors to be collected, unless stats_only is set
    errors = []

    # make streaming_bulk yield successful results so we can count them
    kwargs['yield_ok'] = True
    for ok, item in streaming_bulk(client, actions, **kwargs):
        # go through request-response pairs and detect failures
        if not ok:
            if not stats_only:
                errors.append(item)
            failed += 1
        else:
            success += 1

    return success, failed if stats_only else errors

7. The bulk function itself only consumes the client, actions, and stats_only parameters; any other parameter, such as the raise_on_error mentioned above, is passed through to and handled by streaming_bulk.

Any additional keyword arguments will be passed to
:func:`~elasticsearch.helpers.streaming_bulk` which is used to execute
the operation, see :func:`~elasticsearch.helpers.streaming_bulk` for more
accepted parameters.

def streaming_bulk(client, actions, chunk_size=500, max_chunk_bytes=100 * 1024 * 1024,
                   raise_on_error=True, expand_action_callback=expand_action,
                   raise_on_exception=True, max_retries=0, initial_backoff=2,
                   max_backoff=600, yield_ok=True, **kwargs):

Parameters:
client – instance of Elasticsearch to use
actions – iterator containing the actions
chunk_size – number of docs in one chunk sent to es (default: 500)
max_chunk_bytes – the maximum size of the request in bytes (default: 100MB)
raise_on_error – raise BulkIndexError containing errors (as .errors) from the execution of the last chunk when some occur. By default we raise.
raise_on_exception – if False then don't propagate exceptions from call to bulk and just report the items that failed as failed.
expand_action_callback – callback executed on each action passed in, should return a tuple containing the action line and the data line (None if data line should be omitted).
thread_count – size of the threadpool to use for the bulk requests (note: this belongs to parallel_bulk, not streaming_bulk)
queue_size – size of the task queue between the main thread (producing chunks to send) and the processing threads (likewise a parallel_bulk parameter)
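
For finer-grained control you can call streaming_bulk directly instead of bulk. A minimal sketch, assuming es and actions are set up as in the complete example below; with yield_ok=False only the failed items are yielded:

from elasticsearch import helpers

# stream the actions and inspect only the failures
for ok, item in helpers.streaming_bulk(es, actions,
                                       raise_on_error=False,
                                       yield_ok=False):
    # each yielded item is the response for a failed action
    print("failed: %r" % (item,))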

8. [Example]

errors, success = 0, 0
suc, err = helpers.bulk(es, actions, raise_on_error=False, stats_only=True)
errors += err
success += suc
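
Conversely, with stats_only left at its default of False (and raise_on_error=False), bulk returns the failed items themselves rather than a count, so they can be inspected or logged:

suc, error_items = helpers.bulk(es, actions, raise_on_error=False)
for item in error_items:
    # each item describes one document that failed to index
    print(item)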

9. [Serialization error while importing data:]

elasticsearch.exceptions.SerializationError:

[Fix:] change "_source": {line} to "_source": line. Each line read from the file is already a JSON string; wrapping it in braces turns it into a Python set literal, which the JSON serializer cannot handle.
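
A quick illustration of why the braces break serialization (the sample line is made up):

import json

line = '{"title": "some paper"}'      # hypothetical line from an input file
print(json.dumps({"_source": line}))  # works: line is a plain string
try:
    json.dumps({"_source": {line}})   # {line} is a set literal, not a dict
except TypeError as e:
    print("serialization fails: %s" % e)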

10. [Complete example code:]

#!/Users/wanghai/anaconda2/bin/python
# -*- coding: utf-8 -*-
# @Time    : 2018/6/28 9:17 AM
# @Author  : wanghai
# @Site    :
# @File    : import_es.py
# @Software: PyCharm

import time
import os
from elasticsearch import Elasticsearch
from elasticsearch import helpers


def get_files_to_import(path):
    f_list = os.listdir(path)
    files_ = []
    for i in f_list:
        if os.path.splitext(i)[1] == '.json':
            print(i)
            files_.append(i)
    return files_


# es = Elasticsearch('http://elastic:[email protected]:443')
if __name__ == '__main__':

    # sniffing is disabled by default: es = Elasticsearch()
    es = Elasticsearch(["***:9200", "***:9200"],
                       sniff_on_start=True,
                       sniff_on_connection_fail=True,
                       sniffer_timeout=60)
    actions = []
    workspace = u'./files/'
    files = get_files_to_import(workspace)

    id_num, errors, success = 0, 0, 0
    for json in files:
        json = workspace + json
        print(time.strftime('%y-%m-%d %H:%M:%S', time.localtime()))
        this_file = open(json)
        for line in this_file:
            action = {
                "_index": "idglab",
                "_type": "pappers",
                "_id": id_num,
                "_source": line
            }
            id_num += 1
            # if id_num == 900000:
            #     print("++++++++++++++++++++++")
            actions.append(action)
            if len(actions) == 2000:
                # print("======================")
                suc, err = helpers.bulk(es, actions, chunk_size=2000, raise_on_error=False, stats_only=True)
                errors += err
                success += suc
                del actions[0:len(actions)]

        this_file.close()
        if len(actions) > 0:
            suc, err = helpers.bulk(es, actions, chunk_size=2000, raise_on_error=False, stats_only=True)
            errors += err
            success += suc
            del actions[0:len(actions)]
        print("finish process file:%s" % json)

    print(" down!\n success_num:\t %d" % success + " \n errors_num:\t %d" % errors)

TODO

1. How can one inspect how the data is distributed across the ES cluster? Is there anything like Hadoop's directory-browsing view?
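
One possible starting point (my note, not part of the original workflow): the _cat APIs report index and shard placement across the cluster, and the Python client exposes them:

from elasticsearch import Elasticsearch

es = Elasticsearch()
print(es.cat.indices(v=True))  # one row per index: health, doc count, size
print(es.cat.shards(v=True))   # one row per shard: which node it lives on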

References

[1]. python-pip
[2]. elasticsearch-py documentation
[3]. pip & pip2 & pip2.7
