Crawler: scraping Baidu and Sogou images with persistent storage

1. Downloading images

 

# Baidu Images: http://image.baidu.com/
# Sogou Images: https://pic.sogou.com/

 

 

# Image crawling workflow (a minimal sketch of steps 4 and 5 follows this list):
# 1) Find the image download URL: inspect the page elements and capture the network traffic.
# 2) Visit the URL in a browser to verify it.
# 3) Write code to extract the URL.
# 4) Request the URL to obtain the binary stream.
# 5) Write the binary stream to a file.
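A minimal sketch of steps 4 and 5, assuming the image URL has already been found and verified (the URL and file name below are placeholders for illustration, not taken from the original notes):

import requests

img_url = 'https://example.com/some_image.jpg'   # placeholder URL
headers = {'User-Agent': 'Mozilla/5.0'}
# Step 4: request the URL and obtain the binary stream
content = requests.get(url=img_url, headers=headers).content
# Step 5: write the binary stream to a file
with open('./some_image.jpg', 'wb') as f:
    f.write(content)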

 

 

# Baidu Images:
import time
import requests
from lxml import etree
from selenium import webdriver

# Instantiate a browser object
browser = webdriver.Chrome('./chromedriver.exe')

# Visit the page and interact with its elements to get search results
browser.get('http://image.baidu.com/')
input_tag = browser.find_element_by_id('kw')
input_tag.send_keys('Qiao Biluo')
search_button = browser.find_element_by_class_name('s_search')
search_button.click()

# Scroll down the page via JS so that more results are loaded into the source
js = 'window.scrollTo(0, document.body.scrollHeight)'
for i in range(3):
    browser.execute_script(js)
    time.sleep(3)
html = browser.page_source

# Parse the data and extract the image links:
tree = etree.HTML(html)
url_list = tree.xpath('//div[@id="imgid"]/div/ul/li/@data-objurl')
for img_url in url_list:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
    }
    content = requests.get(url=img_url, headers=headers).content
    if 'token' not in img_url:
        with open('./baidupics/%s' % img_url.split('/')[-1], 'wb') as f:
            f.write(content)
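Note (an addition, not in the original snippet): the './baidupics/' directory must already exist, otherwise the open() call raises FileNotFoundError; a small guard can create it first. The same applies to './sougoupics/' in the Sogou example below.

import os
os.makedirs('./baidupics', exist_ok=True)   # create the output directory if it is missing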

 

 

 

# Sogou Images:
import requests
import re

url = 'http://pic.sogou.com/pics?'
params = {
    'query': '韩美娟'
}
res = requests.get(url=url, params=params).text
url_list = re.findall(r',"(https://i\d+piccdn\.sogoucdn.com/.*?)"]', res)
for img_url in url_list:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
    }
    print(img_url)
    content = requests.get(url=img_url, headers=headers).content
    name = img_url.split('/')[-1] + '.jpg'
    with open('./sougoupics/%s' % name, 'wb') as f:   # name already carries the .jpg suffix
        f.write(content)

 

2. JS dynamic rendering

1) Selenium crawling: Selenium is a testing framework that fully simulates a human operating a browser; the rendered source is taken from page_source.
2) Basic syntax:
from selenium import webdriver
# Instantiate a browser object:
browser = webdriver.Chrome('path to the browser driver')   # e.g. './chromedriver.exe' in the current path
# Visit a URL:
browser.get(url)
# Locate page elements:
find_element_by_id(): by the id attribute value
find_element_by_name(): by the name attribute value
find_element_by_class_name(): by the class attribute value
find_element_by_xpath(): locate elements with an xpath expression
find_element_by_css_selector(): locate elements with a CSS selector
# Example: get the input box whose id is kw
input_tag = browser.find_element_by_id('kw')
# Type text into it:
input_tag.clear()
input_tag.send_keys('Qiao Biluo Highness')
# Click a button:
button.click()
# Execute JS code:
js = 'window.scrollTo(0, document.body.scrollHeight)'
for i in range(3):
    browser.execute_script(js)

# Get the HTML source (note: page_source is a property, no parentheses):
html = browser.page_source   # str type
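Side note (an addition, not in the original notes): in Selenium 4 the find_element_by_* helpers were removed in favour of find_element with By locators, so the same lookups would be written roughly as:

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()                              # newer Selenium can resolve the driver automatically
input_tag = browser.find_element(By.ID, 'kw')             # replaces find_element_by_id('kw')
button = browser.find_element(By.CLASS_NAME, 's_search')  # replaces find_element_by_class_name('s_search')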

# Parsing the data:
1) xpath: extract data with xpath expressions
2) Regular expressions: write a regex and extract with the re module
3) BeautifulSoup: CSS selectors -> (node selectors, method selectors, CSS selectors)

# Media types: video, images, archives, software installers
1) Find the download link.
2) Request it: requests -> response.content is the binary stream
   scrapy framework: response.body -> binary stream
3) Write the file (see the streaming sketch below for large files):
with open('./jdkfj/name', 'wb') as f:
    f.write(res.content)   # or response.body in scrapy
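For large files, requests can also stream the response in chunks instead of loading the whole body into memory (a sketch; the URL and file name are placeholders for illustration):

import requests

video_url = 'https://example.com/video.mp4'   # placeholder URL
res = requests.get(video_url, stream=True)
with open('./jdkfj/video.mp4', 'wb') as f:
    for chunk in res.iter_content(chunk_size=1024 * 1024):   # 1 MB chunks
        if chunk:
            f.write(chunk)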

3. Data parsing

1. XPath
# Coding workflow
from lxml import etree
# Instantiate an etree object
tree = etree.HTML(res.text)
# Call an xpath expression to extract data
li_list = tree.xpath('xpath expression')   # xpath returns the extracted data as a list
# Nested xpath
for li in li_list:
    li.xpath('xpath expression')   # relative expressions start with ./ or .//

# Basic syntax:
./  : match downward from the current node
.// : match from anywhere below the current node
nodename: locate by node name
nodename[@attributename="value"]: locate by attribute
multi-valued attribute matching: contains -> div[contains(@class, "item")]
multiple attribute matching: and -> div[@class="item" and @name="divtag"]
@attributename: extract an attribute value
text(): extract the text

# Positional selection (a runnable example follows this list):
1) Index: indexing starts from 1
   tree.xpath('//div/ul/li[1]/text()'): locates the first li tag
   (reminder, the response object of the requests module:
    res.text -> text
    res.json() -> basic Python data types -> dictionary
    res.content -> binary stream)
2) last() function: locates the last one; second to last: last()-1
   tree.xpath('//div/ul/li[last()]'): locates the last li
   tree.xpath('//div/ul/li[last()-1]'): locates the second to last li
3) position() function: filter by position
   tree.xpath('//div/ul/li[position()<4]'): locates the first three li tags
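A small self-contained check of the positional selectors above (the HTML string is made up for illustration):

from lxml import etree

html = '<div><ul><li>a</li><li>b</li><li>c</li><li>d</li></ul></div>'
tree = etree.HTML(html)
print(tree.xpath('//div/ul/li[1]/text()'))             # ['a'] -> first li
print(tree.xpath('//div/ul/li[last()]/text()'))        # ['d'] -> last li
print(tree.xpath('//div/ul/li[last()-1]/text()'))      # ['c'] -> second to last
print(tree.xpath('//div/ul/li[position()<4]/text()'))  # ['a', 'b', 'c']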

2. BS4 basic syntax:
# Coding workflow:
from bs4 import BeautifulSoup
# Instantiate a soup object
soup = BeautifulSoup(res.text, 'lxml')
# Locate nodes
soup.select('CSS selector')
# CSS selector syntax:
id: #
class: .
soup.select('div > ul > li')   # single-level (child) selector
soup.select('div li')          # multi-level (descendant) selector
# Get a node's text or attributes (a short example follows):
tag.string: direct text -> fails (returns None) when the tag contains other tags besides its text
tag.get_text(): get all of the text
tag['attributename']: get an attribute value (returns a list when the attribute has two or more values)
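A short runnable illustration of these BS4 calls (the HTML string is made up for illustration):

from bs4 import BeautifulSoup

html = '<div><ul><li class="item first">one <a href="/1.html">link</a></li><li class="item">two</li></ul></div>'
soup = BeautifulSoup(html, 'lxml')
li_list = soup.select('div > ul > li')
print(li_list[1].string)      # 'two'  -> the tag contains only text
print(li_list[0].string)      # None   -> the tag mixes text and another tag
print(li_list[0].get_text())  # 'one link'
print(li_list[0]['class'])    # ['item', 'first'] -> multi-valued attribute returns a list
print(li_list[0].a['href'])   # '/1.html'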
3. re module & regular expressions
Grouping & non-greedy matching: () captures a group, .*? matches as little as possible
<a href="https://www.baidu.com/kdjfkdjf.jpg">this is a tag</a> ->
'<a href="(https://www.baidu.com/.*?\.jpg)">'
Quantifiers:
+: match 1 or more times
*: match 0 or more times
{m}: match exactly m times
{m,n}: match m to n times
{m,}: match at least m times
{,n}: match at most n times
re module (a quick example follows):
re.findall('regular expression', res.text) -> returns a list
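A quick check of the grouping / non-greedy pattern above with re.findall:

import re

text = '<a href="https://www.baidu.com/kdjfkdjf.jpg">this is a tag</a>'
url_list = re.findall(r'<a href="(https://www.baidu.com/.*?\.jpg)">', text)
print(url_list)   # ['https://www.baidu.com/kdjfkdjf.jpg']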

4. Persistent storage

1. txt
############# Write to a txt file ###############
if title and joke and comment:
    with open('qbtxt.txt', 'a', encoding='utf-8') as txtfile:
        txtfile.write('&'.join([title[0], joke[0], comment[0]]))
        txtfile.write('\n')
        txtfile.write('********************************************\n')


2. json
############# Write to a json file ################
import json

dic = {'title': title[0], 'joke': joke[0], 'comment': comment[0]}
with open('jsnfile.json', 'a', encoding='utf-8') as jsonfile:
    jsonfile.write(json.dumps(dic, indent=4, ensure_ascii=False))
    jsonfile.write(',' + '\n')

3. csv
############# Write to a CSV file ##################
import csv

with open('csvfile.csv', 'a', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ')
    writer.writerow([title[0], joke[0], comment[0]])
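A small aside (not in the original notes): the csv module documentation recommends opening the file with newline='' so that extra blank rows are not written on Windows, i.e.:

with open('csvfile.csv', 'a', encoding='utf-8', newline='') as csvfile:
    ...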
############# scrapy framework ##################
FEED_URI = 'file:///home/eli/Desktop/qtw.csv'
FEED_FORMAT = 'CSV'
4. mongodb
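The original leaves this item empty; a minimal pymongo sketch, reusing the title/joke/comment fields from the txt/json examples above (the database and collection names are made up, and a local MongoDB instance on the default port is assumed):

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017/')   # assumes MongoDB runs locally
collection = client['spiderdb']['jokes']                     # hypothetical database and collection names
collection.insert_one({'title': title[0], 'joke': joke[0], 'comment': comment[0]})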

5. mysql
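Likewise empty in the original; a minimal pymysql sketch, assuming a jokes table with title/joke/comment columns already exists (connection parameters are placeholders):

import pymysql

conn = pymysql.connect(host='localhost', user='root', password='***',
                       database='spiderdb', charset='utf8mb4')   # placeholder credentials
try:
    with conn.cursor() as cursor:
        sql = 'INSERT INTO jokes (title, joke, comment) VALUES (%s, %s, %s)'
        cursor.execute(sql, (title[0], joke[0], comment[0]))
    conn.commit()
finally:
    conn.close()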

 
