Crawler Projects Explained, Case 1: Getting Started

XPath: locating an element in the document structure
/ selects from the root node
// selects matching nodes anywhere in the document, regardless of their position
. selects the current node
.. selects the parent of the current node
@ selects attributes

/html selects the root html element
body/div selects all div elements that are children of body
//div selects all div elements, regardless of their position in the HTML document
//@lang selects all attributes named lang

Wildcards
* matches any element node
@* matches any attribute node
//* selects all elements in the document
//title[@*] selects all title elements that have at least one attribute

|
In a path expression, | combines several paths and selects the nodes matched by any of them (a union).
//body/div | //body/li selects all div elements and all li elements under body
//div | //li selects all div and li elements in the document
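
The expressions above can be tried out without a crawler. The following is a minimal sketch using Python's standard-library xml.etree.ElementTree, which supports only a subset of XPath (relative paths, *, .., and [@attr] predicates); Scrapy's selectors accept the full syntax shown in these notes. The sample document and its ids are made up for illustration, and because ElementTree has no | operator, the union is imitated by joining two result lists.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from any real site.
doc = """<html lang="en">
  <body>
    <div id="a">first</div>
    <div id="b"><li>nested item</li></div>
    <ul><li lang="fr">item</li></ul>
  </body>
</html>"""

root = ET.fromstring(doc)  # root is the <html> element

# body/div: div elements that are direct children of body
child_divs = root.findall("body/div")

# body/*: the wildcard matches any child element of body
body_children = root.findall("body/*")

# .//li: li elements anywhere below the current node
all_li = root.findall(".//li")

# .//*[@lang]: descendants that carry a lang attribute
with_lang = root.findall(".//*[@lang]")

# .//li/..: the parent of each li element
li_parents = root.findall(".//li/..")

# ElementTree has no "|", so a union is imitated by concatenating results
div_or_li = root.findall(".//div") + root.findall(".//li")

print([d.get("id") for d in child_divs])
print([e.tag for e in body_children])
print([p.tag for p in li_parents])
```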

BeautifulSoup: what is BeautifulSoup?
It is a Python library that can extract data from HTML or XML files.
When a page comes back garbled, appending &enc=utf-8 to the end of the URL makes it display properly.
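
As a small illustration of extracting data with BeautifulSoup, here is a sketch that assumes the bs4 package is installed and uses Python's built-in html.parser backend. The HTML snippet, the company name, and the link are invented for the example; only the id ml_001 echoes the spider later in these notes.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, standing in for a downloaded page.
html = """
<html><body>
  <div id="ml_001">
    <table><tr><td><a href="/600004/">Sample Co.</a></td></tr></table>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the div by its id, then the first link inside it.
link = soup.find("div", id="ml_001").find("a")

print(link.get_text())  # text inside the <a> tag
print(link["href"])     # value of the href attribute
```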

Common commands:
cd Part6 (go to the directory for the next project)
scrapy startproject [name 1]
cd [name 1]
scrapy genspider tonghuashun [address] (tonghuashun is the spider name)


Inside the generated package:
items defines the fields (variables) that hold the data the spider crawls
pipelines processes the crawled data
settings holds the configuration
The spider's .py file is where the crawling code goes, written as follows:
# -*- coding: utf-8 -*-
import scrapy


class TonghuashunSpider(scrapy.Spider):
    name = 'tonghuashun'
    allowed_domains = ['stockpage.10jqka.com.cn']
    start_urls = ['http://basic.10jqka.com.cn/600004/company.html']

    def parse(self, response):
        # XPath copied from the browser: //*[@id="ml_001"]/table/tbody/tr[1]/td[1]/a
        res_selector = response.xpath('//*[@id="ml_001"]/table/tbody/tr[1]/td[1]/a/text()')
        # extract() returns the matched text nodes as a list of strings
        name = res_selector.extract()
        print(name)
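
To make the role of the pipelines module more concrete, here is a minimal sketch of an item pipeline. The class name JsonLinesPipeline and the output file items.jl are hypothetical; the method names open_spider, close_spider, and process_item are the hooks Scrapy calls on a pipeline. It only receives data if parse yields items (for example a dict such as {"name": name}) rather than just printing them.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: write each crawled item as one JSON line."""

    def __init__(self, path="items.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields; keep non-ASCII text readable.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to any later pipeline
```

A pipeline like this would be switched on through the ITEM_PIPELINES dictionary in settings, with the class path adjusted to the actual project name.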

For the debugging phase, create your own main.py file:
from scrapy.cmdline import execute
import sys
import os
# add the project directory to the import path so the spider can be debugged
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(["scrapy", "crawl", "tonghuashun"])  # the first two items are fixed; the last is the name of the spider you created


Ways to get past anti-crawler checks:
Write the following in settings.
Method 1:
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
Method 2:
DOWNLOADER_MIDDLEWARES = {
    'books.middlewares.RandomUserAgent': 1,
}

USER_AGENTS = [
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]
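
The settings above refer to a RandomUserAgent class in books/middlewares.py, but the notes do not show it. The following is one possible sketch of such a downloader middleware, not the original author's code: it reads the custom USER_AGENTS list from settings and puts a randomly chosen entry on every outgoing request. from_crawler and process_request are the hooks Scrapy calls on downloader middlewares; everything else here is an assumption.

```python
import random


class RandomUserAgent:
    """Sketch of a downloader middleware that rotates the User-Agent header."""

    def __init__(self, agents):
        self.agents = list(agents)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to build the middleware; USER_AGENTS is the
        # custom list defined in settings.
        return cls(crawler.settings.getlist("USER_AGENTS"))

    def process_request(self, request, spider):
        # Pick a different User-Agent for each request, if any are configured.
        if self.agents:
            request.headers["User-Agent"] = random.choice(self.agents)
        return None  # returning None lets Scrapy continue processing the request
```

Choosing the agent per request, rather than once at start-up, means consecutive requests from the same spider can present different browser identities.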

 


Origin www.cnblogs.com/jxxgg/p/11666827.html