Dependencies to install:

1. Python
2. setuptools
3. twisted
4. zope.interface
5. w3lib
6. libxml2
7. libxslt
8. lxml
9. scrapy
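To see at a glance which of the Python-level dependencies are already present, something like the following sketch could be used. It assumes a Python 3 interpreter (the article itself targets Python 2.7), and the import names in MODULES are my assumption: libxml2 and libxslt are C libraries, so they are only checked indirectly through lxml.

```python
import importlib.util

# Import names are assumptions; they can differ from the package names in the
# list above. libxml2/libxslt are C libraries reached from Python through lxml.
MODULES = ["setuptools", "twisted", "zope.interface", "w3lib", "lxml", "scrapy"]


def is_importable(name):
    """Return True if `name` can be located without fully importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package (e.g. "zope") raises instead of returning None.
        return False


def missing_modules(names):
    """Return the subset of `names` that cannot be found."""
    return [name for name in names if not is_importable(name)]


if __name__ == "__main__":
    for name in MODULES:
        print(name, "ok" if is_importable(name) else "MISSING")
```

A module reported as found can still fail at import time if a shared library it links against is missing, so this is only a first check.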

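Scrapy is an open-source, single-machine crawler for Python built on the Twisted framework. It bundles most of the tools needed for web scraping, covering both the download side and the extraction side of a crawler.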

yum install gcc python-devel

http://www.cnblogs.com/xiaoruoen/archive/2013/02/27/2933854.html

http://www.coder4.com/archives/3660

 
vim ~/.bashrc

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/liheyuan/env/lib

 
If the build stops with a gcc exit error, install the packages with the following command:

yum install gcc libffi-devel python-devel openssl-devel

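The following article comes from http://www.cnblogs.com/xiaoruoen/archive/2013/02/27/2933854.html

Installing Scrapy on CentOS

Scrapy is an open-source, single-machine crawler for Python built on the Twisted framework. It bundles most of the tools needed for web scraping, covering both the download side and the extraction side of a crawler.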

Installation environment:

CentOS 5.4
Python 2.7.3

Installation steps:

1. Download Python 2.7: http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tgz

[root@zxy-websgs ~]# wget http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tgz -P /opt

[root@zxy-websgs opt]# tar xvf Python-2.7.3.tgz 

[root@zxy-websgs Python-2.7.3]# ./configure 

[root@zxy-websgs Python-2.7.3]# make && make install

Verify the Python 2.7 installation:

[root@zxy-websgs Python-2.7.3]# python2.7
Python 2.7.3 (default, Feb 28 2013, 03:08:43) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()
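The same check can be scripted instead of read off the interpreter banner. This is a minimal sketch; the function name and the (2, 7) default are my own choices for illustration.

```python
import sys


def meets_minimum(version_info, minimum=(2, 7)):
    """Compare the (major, minor) part of a version against a minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    # sys.version starts with the version number, e.g. "2.7.3 (default, ...)".
    print(sys.version.split()[0], meets_minimum(sys.version_info))
```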

2. Install setuptools: http://pypi.python.org/packages/source/s/setuptools/setuptools-0.6c11.tar.gz

[root@zxy-websgs ~]# wget http://pypi.python.org/packages/source/s/setuptools/setuptools-0.6c11.tar.gz -P /opt/
[root@zxy-websgs opt]# tar zxvf setuptools-0.6c11.tar.gz 
[root@zxy-websgs setuptools-0.6c11]# python2.7 setup.py  install

3. Install Twisted

[root@zxy-websgs setuptools-0.6c11]# easy_install Twisted
......
Installed /usr/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg
......
Installed /usr/local/lib/python2.7/site-packages/zope.interface-4.0.4-py2.7-linux-x86_64.egg

Twisted requires zope.interface. Both can also be downloaded from the following addresses:

zope.interface:http://pypi.python.org/packages/source/z/zope.interface/zope.interface-4.0.1.tar.gz

twisted:http://twistedmatrix.com/Releases/Twisted/12.1/Twisted-12.1.0.tar.bz2

5. Install w3lib

[root@zxy-websgs setuptools-0.6c11]# easy_install -U w3lib
Searching for w3lib
Reading http://pypi.python.org/simple/w3lib/
Reading http://github.com/scrapy/w3lib
Best match: w3lib 1.2
Downloading http://pypi.python.org/packages/source/w/w3lib/w3lib-1.2.tar.gz#md5=f929d5973a9fda59587b09a72f185a9e
Processing w3lib-1.2.tar.gz
Running w3lib-1.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-wm_1BB/w3lib-1.2/egg-dist-tmp-2DQHY_
zip_safe flag not set; analyzing archive contents...
Adding w3lib 1.2 to easy-install.pth file

Installed /usr/local/lib/python2.7/site-packages/w3lib-1.2-py2.7.egg
Processing dependencies for w3lib
Finished processing dependencies for w3lib

w3lib:http://pypi.python.org/packages/source/w/w3lib/w3lib-1.2.tar.gz
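The download URL in the log above carries an `#md5=...` fragment. When fetching a tarball by hand, the archive can be checked against that value with a sketch like the one below. It assumes Python 3, and the function names are mine, not part of easy_install.

```python
import hashlib
from urllib.parse import urlsplit


def expected_md5(url):
    """Extract the hex digest from a '#md5=<hex>' URL fragment, or None."""
    fragment = urlsplit(url).fragment
    if fragment.startswith("md5="):
        return fragment[len("md5="):].lower()
    return None


def file_md5(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(url, path):
    """True if the file at `path` has the MD5 named in the URL fragment."""
    wanted = expected_md5(url)
    return wanted is not None and file_md5(path) == wanted


if __name__ == "__main__":
    url = ("http://pypi.python.org/packages/source/w/w3lib/"
           "w3lib-1.2.tar.gz#md5=f929d5973a9fda59587b09a72f185a9e")
    print(expected_md5(url))
```

MD5 only guards against a corrupted download here; it is not a security check.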

6. Install libxml2, or install lxml with easy_install

[root@zxy-websgs lxml-3.1.0]# easy_install lxml

Verify the lxml installation:

[root@zxy-websgs lxml-3.1.0]# python2.7
Python 2.7.3 (default, Feb 28 2013, 03:08:43) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml
>>> exit()

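Alternatively, install libxml2. The official site recommends version 2.6.28 or later, but I could not find that version there, so I first installed 2.6.9. Running scrapy then failed with the following error: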

Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.14.4', 'scrapy')
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 489, in run_script
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1207, in run_script
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 112, in execute
    cmds = _get_commands_dict(inproject)
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 37, in _get_commands_dict
    cmds = _get_commands_from_module('scrapy.commands', inproject)
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 30, in _get_commands_from_module
    for cmd in _iter_command_classes(module):
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 21, in _iter_command_classes
    for module in walk_modules(module_name):
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/utils/misc.py", line 65, in walk_modules
    submod = __import__(fullpath, {}, {}, [''])
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/commands/shell.py", line 8, in <module>
    from scrapy.shell import Shell
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/shell.py", line 14, in <module>
    from scrapy.selector import XPathSelector, XmlXPathSelector, HtmlXPathSelector
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/selector/__init__.py", line 30, in <module>
    from scrapy.selector.libxml2sel import *
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/selector/libxml2sel.py", line 12, in <module>
    from .factories import xmlDoc_from_html, xmlDoc_from_xml
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/selector/factories.py", line 14, in <module>
    libxml2.HTML_PARSE_NOERROR + \
AttributeError: 'module' object has no attribute 'HTML_PARSE_RECOVER'

Upgrading to version 2.6.21 fixed the problem.

libxml2 2.6.21: ftp://xmlsoft.org/libxml2/python/libxml2-python-2.6.21.tar.gz
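Two small checks would have caught this earlier: comparing version numbers numerically, and probing the binding for the constant the traceback complains about. The sketch below assumes Python 3. The `old_binding` object is a hypothetical stand-in, not the real libxml2 module, and its placeholder value is made up; the traceback only tells us that HTML_PARSE_RECOVER was absent.

```python
import types


def parse_version(text):
    """Turn '2.6.21' into (2, 6, 21) so that comparison is numeric."""
    return tuple(int(part) for part in text.split("."))


def missing_attributes(module, names):
    """Return the attribute names that `module` does not provide."""
    return [name for name in names if not hasattr(module, name)]


# Hypothetical stand-in for an old binding (placeholder value, not the real one).
old_binding = types.SimpleNamespace(HTML_PARSE_NOERROR=0)

if __name__ == "__main__":
    # Numeric tuples order the versions mentioned in the text as intended.
    print(parse_version("2.6.9") < parse_version("2.6.21") < parse_version("2.6.28"))
    # Plain string comparison would place "2.6.9" after "2.6.21".
    print("2.6.9" < "2.6.21")
    print(missing_attributes(old_binding, ["HTML_PARSE_RECOVER", "HTML_PARSE_NOERROR"]))
```

With the real binding one would pass the imported `libxml2` module to `missing_attributes` instead of the stand-in.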

7. Install pyOpenSSL (optional; mainly so that Scrapy can support HTTPS)

easy_install pyOpenSSL fetches pyOpenSSL-0.13, which failed to install, so I downloaded version 0.11 manually and installed that instead.

[root@zxy-websgs opt]# wget http://launchpadlibrarian.net/58498441/pyOpenSSL-0.11.tar.gz -P /opt
[root@zxy-websgs opt]# tar zxvf pyOpenSSL-0.11.tar.gz 
[root@zxy-websgs pyOpenSSL-0.11]# python2.7 setup.py install

pyOpenSSL:http://launchpadlibrarian.net/58498441/pyOpenSSL-0.11.tar.gz

8. Install Scrapy

[root@zxy-websgs pyOpenSSL-0.11]# easy_install -U Scrapy

Verify the installation:

[root@zxy-websgs pyOpenSSL-0.11]# scrapy
Scrapy 0.16.4 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  fetch         Fetch a URL using the Scrapy downloader
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

scrapy:http://pypi.python.org/packages/source/S/Scrapy/Scrapy-0.14.4.tar.gz
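For a scripted check, the `version` subcommand listed in the output above can be called from Python. This is a sketch that assumes Python 3.5 or later for `subprocess.run`; the function name is my own.

```python
import shutil
import subprocess


def scrapy_version(executable="scrapy"):
    """Return the output of `<executable> version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(scrapy_version() or "scrapy not found on PATH")
```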

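Summary:

If installing pyOpenSSL on its own fails, you can first download and install pyOpenSSL 0.11, and then run easy_install -U Scrapy to carry out the rest of the installation.

Installing lxml for Python (building its dependent libraries from source)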


About the dependencies:

lxml is the Python binding for libxml2 and libxslt, so it depends on both of them. libxslt in turn depends on libxml2.
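These relationships fix the build order. As an illustration, a small depth-first ordering over the dependency table gives the sequence the steps in this article follow. The table and function names are mine.

```python
# Dependency table taken from the paragraph above.
DEPENDS_ON = {
    "lxml": ["libxml2", "libxslt"],
    "libxslt": ["libxml2"],
    "libxml2": [],
}


def build_order(depends_on):
    """Return package names so that every package comes after its dependencies."""
    order = []
    visiting = set()

    def visit(name):
        if name in order:
            return
        if name in visiting:
            raise ValueError("dependency cycle at " + name)
        visiting.add(name)
        for dep in depends_on.get(name, []):
            visit(dep)
        visiting.discard(name)
        order.append(name)

    for name in sorted(depends_on):
        visit(name)
    return order


if __name__ == "__main__":
    print(build_order(DEPENDS_ON))
```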

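Because our program may be distributed to other machines, we want a self-contained runtime environment that can be copied as a whole. Assume that directory is /home/liheyuan/env.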

1. Install libxml2

wget http://xmlsoft.org/sources/libxml2-2.9.0.tar.gz
tar -xzvf libxml2-2.9.0.tar.gz
cd libxml2-2.9.0
./configure --prefix=/home/liheyuan/env --without-python
make
make install

2. Install libxslt

wget http://xmlsoft.org/sources/libxslt-1.1.27.tar.gz
tar -xzvf libxslt-1.1.27.tar.gz
cd libxslt-1.1.27
./configure --prefix=/home/liheyuan/env --without-crypto --without-python --with-libxml-prefix=/home/liheyuan/env/
make
make install
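Both libraries go through the same download, unpack, configure, make cycle with the same prefix, so the command lists can be generated rather than typed twice. The helper below is only a sketch that prints the commands without running them; its name and parameters are hypothetical, and it assumes the tarball unpacks into a directory named `<name>-<version>`.

```python
PREFIX = "/home/liheyuan/env"  # prefix directory used in this article


def build_commands(name, version, base_url, configure_flags, prefix=PREFIX):
    """Return the shell commands for one download/configure/make/install cycle."""
    source_dir = "{0}-{1}".format(name, version)
    tarball = source_dir + ".tar.gz"
    configure = ["./configure", "--prefix=" + prefix] + list(configure_flags)
    return [
        "wget {0}/{1}".format(base_url, tarball),
        "tar -xzvf " + tarball,
        "cd " + source_dir,
        " ".join(configure),
        "make",
        "make install",
    ]


if __name__ == "__main__":
    for line in build_commands("libxml2", "2.9.0", "http://xmlsoft.org/sources",
                               ["--without-python"]):
        print(line)
```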

3. Install lxml

With the dependent libraries in place, it is finally time for the Python binding.

We assume Python has already been built from source and installed into the same directory: /home/liheyuan/env

# Download and unpack
wget http://pypi.python.org/packages/source/l/lxml/lxml-3.0.1.tar.gz#md5=0f2b1a063ab3b6b0944cbc4a9a85dcfa
tar -xzvf lxml-3.0.1.tar.gz
cd lxml-3.0.1
# Build and install
/home/liheyuan/env/bin/python ./setup.py build --with-xslt-config=/home/liheyuan/env/bin/xslt-config
/home/liheyuan/env/bin/python ./setup.py install

Finally, check the result:

/home/liheyuan/env/bin/python
Python 2.7.3 (default, Oct 22 2012, 13:32:03)
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml
>>> import lxml.html
>>> (error) .....

An error appears, reporting a dependency problem with etree.so!
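To capture the exact message rather than reading it off the interactive session, the imports can be wrapped as in the sketch below. It relies on a failed shared-library load surfacing as ImportError; the helper name is my own.

```python
import importlib


def try_import(name):
    """Import `name`; return (True, None) on success or (False, message) on failure."""
    try:
        importlib.import_module(name)
    except ImportError as exc:
        return False, str(exc)
    return True, None


if __name__ == "__main__":
    for module_name in ("lxml", "lxml.html", "lxml.etree"):
        ok, message = try_import(module_name)
        print(module_name, "ok" if ok else "FAILED: " + message)
```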

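Because we built the shared objects (.so files) ourselves and they are not on the system's default library search path, we need to add their directory to the shared-library search path environment variable, as follows:

vim ~/.bashrc

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/liheyuan/env/lib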

The setting takes effect the next time you log in to a terminal!
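If the export line is generated by a script, it helps to avoid duplicate entries and the empty leading entry that appears when LD_LIBRARY_PATH was previously unset. The sketch below only prints the line to put in ~/.bashrc; the function name is my own.

```python
import os


def append_path(current, new_dir):
    """Append `new_dir` to a colon-separated path list unless already present."""
    parts = [part for part in current.split(":") if part]
    if new_dir not in parts:
        parts.append(new_dir)
    return ":".join(parts)


if __name__ == "__main__":
    lib_dir = "/home/liheyuan/env/lib"
    value = append_path(os.environ.get("LD_LIBRARY_PATH", ""), lib_dir)
    print("export LD_LIBRARY_PATH=" + value)
```

As far as I know, the dynamic linker treats an empty entry in this variable as the current directory, which is why the sketch drops empty parts.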
