Selenium2 + Python automation 37 - grabbing the page source code (page_source)

Foreword

Sometimes it is not easy to locate an element on a page through its attributes. In that case, the information you want can be extracted directly from the page source, and Selenium's page_source attribute returns exactly that: the source code of the current page.

page_source is rarely used. I stumbled across it recently while browsing the API and had an idea: combine it with Python's re module and use a regular expression to pull out every url address on the page. You can then request those urls in batches to check whether any of them return an error such as 404 (a sketch of that check appears after the reference code).

1. page_source

1. Selenium's page_source attribute directly returns the source code of the current page.

2. Assign it to a variable and print it out (a minimal sketch follows this list).
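
A minimal sketch of just this step, using the same blog url as the reference code further down:

# coding:utf-8
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.cnblogs.com/yoyoketang/")

# page_source is an attribute of the driver, not a method call
page = driver.page_source
print(page)

driver.quit()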

2. re non-greedy mode

1. First, import the re module.

2. Use re to do the match with a non-greedy regular expression (a small greedy vs. non-greedy demo follows this list).

3. The findall method returns a list.

4. After matching you will find that some of the captured entries are not real urls; those can be filtered out.
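
To see what the non-greedy ? actually does, here is a small standalone demo; the two-link html string is made up purely for illustration:

# coding:utf-8
import re

# a made-up snippet with two links, only used to show the difference
html = '<a href="http://example.com/a">a</a> <a href="http://example.com/b">b</a>'

# greedy: '.*' runs on to the last '"' and swallows both links in one match
print(re.findall('href="(.*)"', html))
# ['http://example.com/a">a</a> <a href="http://example.com/b']

# non-greedy: '.*?' stops at the first '"', so each link is matched separately
print(re.findall('href="(.*?)"', html))
# ['http://example.com/a', 'http://example.com/b']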

3. Filter the url addresses

1. Add an if statement as the check: 'http' in url means the entry looks like a normal url address.

2. Collect all the matching url addresses into one list; that list is the result we want (a sketch of just this filtering step follows this list).
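
A minimal sketch of just the filtering step; the url_list values here are made up, and the set() de-duplication is an optional extra that the reference code below does not do:

# coding:utf-8
# made-up examples standing in for the result of re.findall
url_list = [
    "http://www.cnblogs.com/yoyoketang/",
    "/css/blog-common.css",                 # not a full url, will be dropped
    "http://www.cnblogs.com/yoyoketang/",   # duplicate
]

url_all = []
for url in url_list:
    # 'http' in url means it looks like a normal, absolute url address
    if "http" in url:
        url_all.append(url)

# optional: drop duplicates by passing the list through set()
print(list(set(url_all)))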

4. Reference code

# coding:utf-8
from selenium import webdriver
import re

driver = webdriver.Firefox()
driver.get("http://www.cnblogs.com/yoyoketang/")

# page_source returns the full source code of the current page
page = driver.page_source
# print(page)

# non-greedy match; re.S makes '.' also match newline characters
url_list = re.findall('href="(.*?)"', page, re.S)

url_all = []
for url in url_list:
    # keep only entries that look like real url addresses
    if "http" in url:
        print(url)
        url_all.append(url)

# the final list of urls
print(url_all)
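
The foreword mentions requesting the collected urls in batches to check for errors such as 404. A minimal sketch of that follow-up step, assuming the requests library is installed and url_all is the list produced by the code above:

# coding:utf-8
import requests

for url in url_all:
    try:
        r = requests.get(url, timeout=10)
        # anything >= 400 (404, 500, ...) is reported as a problem url
        if r.status_code >= 400:
            print("%s -> %s" % (url, r.status_code))
    except requests.RequestException as e:
        print("%s -> request failed: %s" % (url, e))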

 
 

