Notes on some of the pitfalls I ran into while writing web scrapers

1. Checking whether a variable is None in Python

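There are three common ways to write the check:
First: if X is None
Second: if not X
When X is None, False, the empty string "", 0, an empty list [], an empty dict {}, or an empty tuple (), not X is true, so this form cannot tell these values apart.
Third: if not X is None
This parses as if not (X is None), the opposite test from the first form; it is usually written if X is not None.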

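In a boolean context, None, an empty list [], an empty dict {}, an empty tuple (), the empty string "", 0 and similar "empty" or "nothing" values are treated as False. Almost every other object is treated as True (the exceptions are objects whose class defines __bool__ or __len__ to say otherwise).
In the statement if not 1, the 1 is converted to the bool True. not is the logical negation operator, so not 1 is always False, and the body under if not 1 never runs.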
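
To make the difference between the checks concrete, here is a minimal sketch. The helper name describe is made up for this illustration; it simply applies the is None test first and the truthiness test second.

```python
def describe(x):
    """Classify x using the identity check first, then the truthiness check."""
    if x is None:
        return "None"
    if not x:
        # Reached for False, 0, "", [], {}, () and other empty values.
        return "falsy but not None"
    return "truthy"

for value in (None, False, 0, "", [], {}, (), 1, "a", [0]):
    print(repr(value), "->", describe(value))
```

Only the first value is reported as None; the next six are all falsy, which is exactly the group that if not X lumps together with None.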

2. Downloading images with Python

import os
os.makedirs('./image/', exist_ok=True)
IMAGE_URL = "http://image.nationalgeographic.com.cn/2017/1122/20171122113404332.jpg"

def urllib_download():
    # Standard library only: fetch the URL straight into a file.
    from urllib.request import urlretrieve
    urlretrieve(IMAGE_URL, './image/img1.jpg')

def request_download():
    # Read the whole response body into memory, then write it out.
    import requests
    r = requests.get(IMAGE_URL)
    with open('./image/img2.jpg', 'wb') as f:
        f.write(r.content)

def chunk_download():
    # stream=True defers the body so it can be written chunk by chunk.
    import requests
    r = requests.get(IMAGE_URL, stream=True)
    with open('./image/img3.jpg', 'wb') as f:
        for chunk in r.iter_content(chunk_size=32):
            f.write(chunk)

urllib_download()
print('download img1')
request_download()
print('download img2')
chunk_download()
print('download img3')
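
The three functions above assume the request always succeeds. As a rough sketch of a more defensive chunked download, the version below adds a timeout and an HTTP status check. The names save_chunks and download, and the default timeout and chunk size, are assumptions made for this example and are not part of the original code.

```python
import os
import tempfile

def save_chunks(chunks, path):
    """Write an iterable of bytes chunks to path and return the byte count."""
    total = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total

def download(url, path, timeout=10, chunk_size=8192):
    """Stream url to path; raise on HTTP error statuses instead of saving an error page."""
    import requests  # third-party; imported here so save_chunks works without it
    with requests.get(url, stream=True, timeout=timeout) as r:
        r.raise_for_status()
        return save_chunks(r.iter_content(chunk_size=chunk_size), path)

# Offline demonstration of the writing step, using in-memory chunks.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
print(save_chunks([b'abc', b'', b'de'], demo_path))
```

With network access, download(IMAGE_URL, './image/img4.jpg') would be the equivalent of chunk_download(), except that a 404 or a hung connection raises an exception rather than silently producing a broken file.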

3. Fixing the simulated-click plugin (chromedriver) setup

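from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# Point Selenium at chromedriver explicitly when it is not found on PATH.
chrome_driver = r"D:\Python\Anaconda\Lib\site-packages\selenium\webdriver\chrome\chromedriver.exe"
# Selenium 4 style; on Selenium 3 pass executable_path=chrome_driver instead.
driver = webdriver.Chrome(service=Service(chrome_driver))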

4. Finding an element by id with BeautifulSoup

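import urllib.request
from bs4 import BeautifulSoup

def get_title(url):
    resp = urllib.request.urlopen(url)
    html = resp.read()
    bs = BeautifulSoup(html, "html.parser")
    # Find the <th> whose id is 'DetailTilte', then read the text of its <h1>.
    title = bs.find('th', id='DetailTilte').h1.get_text()
    return title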

5. Clicking an element in Selenium: ElementClickInterceptedException

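Method 1: click through JavaScript, which bypasses whatever element is covering the target.

from selenium.webdriver.common.by import By

element = driver.find_element(By.CSS_SELECTOR, 'div[class*="loadingWhiteBox"]')

driver.execute_script("arguments[0].click();", element)

Method 2: move the pointer to the element with ActionChains, then click it.

element = driver.find_element(By.CSS_SELECTOR, 'div[class*="loadingWhiteBox"]')

webdriver.ActionChains(driver).move_to_element(element).click(element).perform()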

6. Converting a string to URL encoding

from urllib.parse import quote
poet_name = "李白"
url_code_name = quote(poet_name)
print(url_code_name)
# Output:
# %E6%9D%8E%E7%99%BD

Converting URL-encoded text back to a string:
from urllib.parse import unquote
url_code_name = "%E6%9D%8E%E7%99%BD"
name = unquote(url_code_name)
print(name)
# Output:
# 李白

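In Python 3, quote lives in urllib.parse and accepts a safe argument listing the characters that should be left unescaped:
import urllib.parse
url_new = urllib.parse.quote(url, safe='')  # safe='' also escapes '/', which is left alone by default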
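
As a small self-contained illustration of the round trip and of what safe changes, consider the sketch below. The sample path "a/b c" is made up for this example.

```python
from urllib.parse import quote, unquote

name = "李白"
encoded = quote(name)          # non-ASCII text is encoded as UTF-8, then percent-escaped
print(encoded)                 # %E6%9D%8E%E7%99%BD
print(unquote(encoded))        # 李白

path = "a/b c"
default_quoted = quote(path)          # '/' is in the default safe set
strict_quoted = quote(path, safe='')  # nothing is exempt, so '/' is escaped too
print(default_quoted)          # a/b%20c
print(strict_quoted)           # a%2Fb%20c
```

Use the default when quoting a whole path, and safe='' when quoting a single value (a query parameter or one path segment) that must not introduce extra slashes.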

7. UnicodeDecodeError when reading a file in Python: 'gbk' codec can't decode byte 0x80 in position 205: illegal multibyte sequence

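When no encoding is given, open() falls back to the locale's default encoding (gbk on a Chinese-language Windows system), so a UTF-8 file can fail to decode. Pass the encoding explicitly:
FILE_OBJECT = open('order.log', 'r', encoding='UTF-8')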
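
The fix can be tried out without the original log file. The sketch below writes a small UTF-8 file to a temporary directory (the file content is invented for the demonstration) and reads it back twice: once with the correct encoding, and once with a deliberately wrong one plus errors='replace', which is a fallback to consider only when the real encoding is unknown.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'order.log')
with open(path, 'w', encoding='utf-8') as f:
    f.write('订单 order 1\n')

# Explicit encoding: decodes the same way on every platform.
with open(path, 'r', encoding='utf-8') as f:
    text = f.read()

# Wrong codec on purpose: errors='replace' avoids the exception,
# but each undecodable byte becomes U+FFFD, so data is lost.
with open(path, 'r', encoding='ascii', errors='replace') as f:
    lossy = f.read()

print(text.strip())
print('\ufffd' in lossy)
```

Using a with block also closes the file automatically, which the bare open() call above does not do.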

8. Matching http links in a string with a Python regular expression

The goal is to pull http links out of a string with a regular expression. The main difficulty is expressing the shape of an http link as a pattern.

import re
pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')    # URL pattern

string = 'Its after 12 noon, do you know where your rooftops are? http://tinyurl.com/NYCRooftops '
url = re.findall(pattern, string)
print(url)

# Output:
# ['http://tinyurl.com/NYCRooftops']
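
One thing to watch with this pattern is that it accepts characters such as "," and "." inside a link, so punctuation that directly follows a URL in running text is captured as part of it. A possible workaround, sketched below, is to trim trailing punctuation after matching. The helper name find_urls and the sample sentence are assumptions made for this example.

```python
import re

URL_RE = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')

def find_urls(text):
    """Return the URLs found in text with trailing sentence punctuation removed."""
    return [u.rstrip('.,;:!?') for u in URL_RE.findall(text)]

sample = 'Docs: https://example.com/a?b=1, mirror at http://example.org/page.'
raw_matches = URL_RE.findall(sample)
print(raw_matches)         # ['https://example.com/a?b=1,', 'http://example.org/page.']
print(find_urls(sample))   # ['https://example.com/a?b=1', 'http://example.org/page']
```

The trade-off is that a URL that genuinely ends in one of those characters would be shortened, which is rare in practice.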

9. String slicing in Python

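The examples below use Python 3, where print is a function; in Python 2, print is a statement written without parentheses.

s = '0123456789'
print(s[0:3])    # the first through third characters
print(s[:])      # the whole string
print(s[6:])     # from the seventh character to the end
print(s[:-3])    # from the start up to, but not including, the third character from the end
print(s[2])      # the third character
print(s[-1])     # the last character
print(s[::-1])   # a new string in reverse order
print(s[-3:-1])  # from the third-from-last character up to, but not including, the last
print(s[-3:])    # from the third-from-last character to the end
print(s[:-5:-3]) # walk backwards from the end in steps of 3, stopping before index -5
The output is:
012
0123456789
6789
0123456
2
9
9876543210
78
789
96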

10. Removing spaces

Leading and trailing whitespace: strip()
Spaces inside the string: replace(" ", '')
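
A quick sketch of how the two calls differ, using a made-up sample string. It also shows a split/join variant, which is one way to handle tabs and newlines that replace(" ", '') leaves behind.

```python
raw = "  hello   world \n"

stripped = raw.strip()             # trims whitespace at both ends only
no_spaces = raw.replace(" ", "")   # removes space characters everywhere, keeps the newline
no_whitespace = "".join(raw.split())   # removes every kind of whitespace
collapsed = " ".join(raw.split())      # collapses runs of whitespace to one space

print(repr(stripped))       # 'hello   world'
print(repr(no_spaces))      # 'helloworld\n'
print(repr(no_whitespace))  # 'helloworld'
print(repr(collapsed))      # 'hello world'
```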

GertHONG
