A Python example of crawling rental data: the classic beginner's crawler case!

First, what is a crawler?

A crawler, also known as a "web spider" or "web crawler", is a program that can automatically access the Internet and download web content. It is also the foundation of search engines: Baidu and Google use powerful web crawlers to retrieve the vast amounts of information on the Internet and store it in the cloud, so that they can offer users a high-quality search service.

Second, what are crawlers good for?

You might ask: apart from companies that build search engines, what use is learning to write crawlers? Haha, someone finally asked the right question. Take an analogy: company A runs a user forum, where many users post messages about their experiences and so on. A now needs to understand user needs and analyze user preferences to prepare for the next product iteration. So how do they get that data? Of course, they need a web crawler to collect it from the forum. So besides Baidu and Google, many companies hire crawler engineers at high salaries. Search for "crawler engineer" on any job site and look at the number of openings and the salary ranges, and you will see how much demand there is for crawling.

Third, how a crawler works

Initiate a request: send a request (Request) to the target site over the HTTP protocol, then wait for the target site's server to respond.

Get the response content: if the server can respond normally, you get a Response. The body of the response is the content of the page you want to capture; it may be HTML, a JSON string, binary data (such as images), and so on.

Parse the content: if the content is HTML, it can be parsed with regular expressions or a page-parsing library; if it is JSON, it can be converted directly into a JSON object and parsed; if it is binary data, it can be saved or processed further.

Store the data: after parsing is complete, the data is saved. It can be saved as a text file or stored in a database.
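To make these four steps concrete, here is a minimal, self-contained sketch (httpbin.org is just a stand-in target that happens to return JSON; any URL would do):

import json
import requests

# 1. initiate a request (httpbin.org is a stand-in target used for illustration)
response = requests.get('https://httpbin.org/get')

# 2. get the response content, if the server responded normally
if response.status_code == 200:
    # 3. parse the content (JSON here, so it converts directly to a dict)
    data = response.json()
    # 4. store the data (a plain text file in this sketch)
    with open('result.txt', 'w') as f:
        f.write(json.dumps(data, indent=2))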

Fourth, a Python crawler example

The sections above introduced what a crawler is, what it is used for, and how it works; I believe many readers have become interested and are ready to give it a try. Now for the practical part: below is a simple piece of Python crawler code.

1. Preparation: install the Python environment, install the PyCharm IDE, install the MySQL database, create a new database exam, and inside exam create a table house to store the crawler's results [SQL statement: create table house (price varchar(88), unit varchar(88), area varchar(88));]

2. Crawler goal: crawl the price, unit, and area from every listing linked on the Lianjia rental home page, then save the crawled results to the database.

3. Crawler source code, as follows:

import requests  # requests the URL page content
from bs4 import BeautifulSoup  # parses page elements
import pymysql  # connects to the database
import time  # time functions (used here to pause between requests)
import lxml  # parsing library (supports HTML/XML parsing as well as XPath)

# get_page: fetch the content of the url with requests' get method, then wrap
# it in a BeautifulSoup object so the page can be parsed.
def get_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    return soup

# get_links: get the links of all rental listings on the list page.
def get_links(link_url):
    soup = get_page(link_url)
    links_div = soup.find_all('div', class_='pic-panel')
    links = [div.a.get('href') for div in links_div]
    return links

# get_house_info: get one listing page's rental information: price, unit, area, etc.
def get_house_info(house_url):
    soup = get_page(house_url)
    price = soup.find('span', class_='total').text
    unit = soup.find('span', class_='unit').text.strip()
    area = 'test'  # for the area we use a fixed custom string, for testing
    info = {
        'price': price,
        'unit': unit,
        'area': area
    }
    return info

# write the database connection settings into a dictionary
database = {
    'host': '127.0.0.1',
    'database': 'exam',
    'user': 'root',
    'password': 'root',
    'charset': 'utf8mb4'}

# connect to the database
def get_db(setting):
    return pymysql.connect(**setting)

# insert the crawled data into the database
def insert(db, house):
    values = "'{}', " * 2 + "'{}'"
    sql_values = values.format(house['price'], house['unit'], house['area'])
    sql = """
    insert into house (price, unit, area) values ({})
    """.format(sql_values)
    cursor = db.cursor()
    cursor.execute(sql)
    db.commit()

# Main flow: 1. connect to the database  2. get the list of listing URLs from
# the home page  3. loop over the URLs and fetch each listing's details
# (price, etc.)  4. insert each result into the database
db = get_db(database)
links = get_links('https://bj.lianjia.com/zufang/')
for link in links:
    time.sleep(2)  # pause between requests
    house = get_house_info(link)
    insert(db, house)

 

First, the "工欲善其事必先利其器", written in Python crawler is the same reason, the process of writing reptiles need to import a variety of file libraries, and it is these useful libraries to help us complete the reptiles most of the work, we just need an excuse for the transfer of the relevant function can be. Import format is the import library file name. It should be noted here is to install libraries in PYCHARM, the cursor can be placed on the library file name, press ctrl + alt key way to install, can also command line (Pip install the library file name) are installed, if the installation fails or is not installed, then the follow-up program will certainly be an error of reptiles. In this code, the five elements are introduced before the program associated libraries: requests for requesting the URL page content; the BeautifulSoup used to parse the page elements; pymysql connect to a database; time include various time function; lxml is a parsing library for parsing file HTML, XML format, but it also supports XPATH parsing.

Secondly, let's look at the whole crawling process starting from the main program at the end of the code:

Connect to the database through the get_db function. Looking inside get_db, you can see that the connection is made by calling pymysql's connect function, where **setting is Python's way of collecting keyword arguments: we write the database connection information into a dictionary named database, and the dictionary's entries are passed to connect as the arguments.
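As a small illustration of that ** unpacking (greet and its arguments are made up for this example):

def greet(name, greeting):
    print(greeting + ', ' + name)

settings = {'name': 'Python', 'greeting': 'Hello'}
greet(**settings)  # equivalent to greet(name='Python', greeting='Hello')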

Get the links of all rental listings on the Lianjia home page through the get_links function. All the listing links are stored, in list form, in links. get_links fetches the Lianjia home page content with requests, then has BeautifulSoup reorganize it into a format it can handle. It then uses the find_all function to find all the divs that contain listing pictures, and a list comprehension extracts the hyperlink (the content of the href attribute of each div's a tag) from every one of those divs; all the hyperlinks are stored in the list links.
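To see what find_all is doing without touching the network, here is a self-contained sketch on a hard-coded HTML fragment (the fragment is made up; Lianjia's real markup may differ or change):

from bs4 import BeautifulSoup

html = '''
<div class="pic-panel"><a href="https://example.com/listing/1">listing 1</a></div>
<div class="pic-panel"><a href="https://example.com/listing/2">listing 2</a></div>
'''
soup = BeautifulSoup(html, 'lxml')
links_div = soup.find_all('div', class_='pic-panel')
links = [div.a.get('href') for div in links_div]
print(links)  # ['https://example.com/listing/1', 'https://example.com/listing/2']
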
A for loop then walks through every link in links one by one (one such link, for example, is: https://bj.lianjia.com/zufang/101101570737.html).

Using the same method as in step 2, step 3 obtains the price, unit, and area from each link by locating the elements with the find function, and writes this information into a dictionary info.
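One detail worth knowing: find returns None when an element is missing, and None.text raises an AttributeError. A slightly more defensive helper could look like this (safe_text is my own name, not part of the article's code):

def safe_text(soup, tag, css_class, default=''):
    # return the element's stripped text, or a default if the element is missing
    element = soup.find(tag, class_=css_class)
    return element.text.strip() if element else default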

Step 4 calls the insert function to write the info obtained from each link into the database table house. Looking inside insert, we can see that it works by executing a SQL statement through the database cursor returned by db.cursor() and then committing the operation with db.commit(). The SQL statement is written in a somewhat special way: the format function is used to assemble it, which makes the function easier to reuse.
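Building SQL with format works here, but it breaks as soon as scraped text contains a quote character, and it is unsafe in general (SQL injection). A sketch of the same insert using pymysql's parameterized execute instead:

def insert(db, house):
    # %s placeholders let pymysql escape the values itself,
    # so quotes in scraped text are handled safely
    sql = "insert into house (price, unit, area) values (%s, %s, %s)"
    cursor = db.cursor()
    cursor.execute(sql, (house['price'], house['unit'], house['area']))
    db.commit()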

Finally, run the crawler code, and you can see that all the listings on the Lianjia home page have been written into the database. (Note: 'test' is the test string I specified manually for the area field.)
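To confirm the writes from Python rather than a MySQL client, a quick check could look like this (a sketch reusing the database settings dictionary from the code above):

import pymysql

# count the rows the crawler wrote into the house table
db = pymysql.connect(**database)
cursor = db.cursor()
cursor.execute('select count(*) from house')
print(cursor.fetchone()[0], 'rows in house')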

 

 

Postscript: Python crawlers are really not difficult. Once you are familiar with the whole crawling process, what remains is paying attention to details, such as how to get page elements and how to build SQL statements. Don't panic when problems appear: follow the IDE's hints to wipe out the bugs one by one, and you will end up with the result you expected.


Source: www.cnblogs.com/qingdeng123/p/11299528.html