This study uses Python scripts to fetch information from web pages, save it to a database, and finally export it as an Excel spreadsheet.
Download MongoDB and install the third-party modules
First we need to do some preparation:
install the third-party modules (requests, pymongo, bs4) and download the MongoDB package:
https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-rhel70-3.2.5.tgz
The approach is as follows:
1. Visit the website to get the HTML page
Get the request headers:
Script 1:
Before running the script, start mongod:
./mongod &
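The body of Script 1 is not shown here, so the following is only a minimal sketch of step 1 and not the original script. The URL and the User-Agent value are placeholders, and the helper name fetch_html is hypothetical.

```python
import requests

# Placeholder values (assumptions): the real target URL and headers
# are not given in the text.
START_URL = "https://example.com/athletes"
HEADERS = {
    # Copy the User-Agent from your own browser's developer tools.
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
}


def fetch_html(url, headers=None, timeout=10):
    """Send a GET request and return the page body as text."""
    response = requests.get(url, headers=headers or HEADERS, timeout=timeout)
    # Raise an exception for 4xx/5xx answers instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html(START_URL) would return the HTML string that the next step parses.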
2. Extract the content we want from the HTML
Script 2:
"Long Jump" and "View Graph" are the link labels used to navigate to the information we want.
This script does not need to be run on its own; the third script imports the URLs from it.
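As a rough illustration of locating links by their label, the sketch below uses bs4 to collect the href of every link whose text contains a given label. The sample markup, the example.com base URL, and the helper name find_links_by_text are assumptions made for the example; the real page structure is not shown in the text.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_links_by_text(html, label, base_url=""):
    """Return the URLs of all <a> tags whose text contains `label`."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for a in soup.find_all("a", href=True):
        if label in a.get_text():
            # Resolve relative links against the page they came from.
            urls.append(urljoin(base_url, a["href"]))
    return urls


# Hypothetical markup, only to illustrate the lookup.
SAMPLE_HTML = """
<ul>
  <li><a href="/disciplines/long-jump">Long Jump</a></li>
  <li><a href="/disciplines/high-jump">High Jump</a></li>
  <li><a href="/graph?event=long-jump">View Graph</a></li>
</ul>
"""

long_jump_urls = find_links_by_text(SAMPLE_HTML, "Long Jump", "https://example.com")
graph_urls = find_links_by_text(SAMPLE_HTML, "View Graph", "https://example.com")
```

Returning plain lists of URLs keeps the result easy to import from another script.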
3. Store the scraped content in the database
Script 3:
Before running it, check that mongod is running. After it finishes, we can look at the stored information by going into the bin directory under MongoDB and opening the shell:
./mongo
use iaaf
db.athletes.find()
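The storing step could look roughly like the sketch below. It reuses the iaaf database and athletes collection named in the shell commands above; the connection URI is MongoDB's default local address, and the helper names and record fields are hypothetical.

```python
def get_collection(uri="mongodb://localhost:27017/"):
    """Return the `athletes` collection of the `iaaf` database."""
    # Imported here so save_athletes() can be used and tested
    # even where the pymongo driver is not installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    return client["iaaf"]["athletes"]


def save_athletes(collection, athletes):
    """Insert the scraped records and return how many were stored."""
    if not athletes:
        # insert_many() rejects an empty list, so skip the call.
        return 0
    result = collection.insert_many(athletes)
    return len(result.inserted_ids)
```

With mongod running, save_athletes(get_collection(), records) would store a list of dicts, which db.athletes.find() then shows in the shell.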
4. Convert to an Excel spreadsheet
Script 4:
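The text does not say which Excel library Script 4 uses, so the sketch below assumes openpyxl; it is one possible way to do the export, not the original script. The field names and sample documents are made-up placeholders.

```python
def rows_from_documents(documents, fields):
    """Build a header row followed by one row per document."""
    rows = [list(fields)]
    for doc in documents:
        # Missing keys become empty cells; extra keys such as _id are dropped.
        rows.append([doc.get(field, "") for field in fields])
    return rows


def write_xlsx(rows, path):
    """Write the rows to the first sheet of a new .xlsx workbook."""
    from openpyxl import Workbook  # assumed library, not named in the text

    workbook = Workbook()
    sheet = workbook.active
    for row in rows:
        sheet.append(row)
    workbook.save(path)


# Hypothetical field names; use the keys your documents actually contain.
FIELDS = ["name", "country", "mark"]
sample_docs = [
    {"_id": 1, "name": "Athlete A", "country": "AAA", "mark": "8.00"},
    {"_id": 2, "name": "Athlete B", "country": "BBB"},
]
rows = rows_from_documents(sample_docs, FIELDS)
```

With a running database, passing the result of a pymongo collection's find() to rows_from_documents and then calling write_xlsx(rows, "athletes.xlsx") would produce the spreadsheet.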
5. Usage summary of requests, pymongo, and bs4
requests is a very useful Python HTTP client library, often used when writing crawlers and when testing server responses. It can fairly be said that requests meets the needs of today's web.
1. Role: send a request and get the response data
Why requests?
1) requests is built on top of urllib3
2) requests works in both Python 2 and Python 3, in exactly the same way
3) requests is easy to use (in keeping with Python's style)
4) requests helps us handle the response content (it decompresses the body automatically, fills in default request headers, and picks up cookies automatically)
- Send a simple GET request and fetch the response: response = requests.get(url)
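To make the points above concrete, here is a small sketch of what a response object exposes. The helper names describe_response and fetch_and_describe are hypothetical; the attributes they read (status_code, encoding, cookies, text) are part of the requests API.

```python
import requests


def describe_response(response):
    """Collect the parts of a response that a scraper usually needs."""
    return {
        "status": response.status_code,          # HTTP status code
        "encoding": response.encoding,           # charset used to decode .text
        "cookies": response.cookies.get_dict(),  # cookies set by the server
        "length": len(response.text),            # size of the decoded body
    }


def fetch_and_describe(url):
    """Fetch a URL with a session and summarize the response."""
    # A Session keeps cookies between requests automatically.
    with requests.Session() as session:
        response = session.get(url, timeout=10)
        return describe_response(response)
```

Using a Session instead of bare requests.get is a design choice for pages that need cookies from an earlier request; for a single page, requests.get(url) is enough.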
pymongo is the Python toolkit for working with MongoDB
1. The bs4 concept:
bs4 is a library for parsing, traversing, and maintaining the "tag tree".
Put plainly, bs4 reformats the HTML source so that we can conveniently operate on its nodes, tags, attributes, and so on.
2. The four bs4 objects
① Tag object: an HTML tag. BeautifulSoup can parse out a tag together with its content; the access form is 'soup.name', where name is the name of a tag in the HTML.
② BeautifulSoup object: represents the entire HTML document and can be used like a Tag object
③ NavigableString object: the text inside a tag
④ Comment object: a special NavigableString. If a tag contains a comment, it holds the comment text with the comment markers stripped.
The most commonly used are the BeautifulSoup and Tag objects.
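The four objects can be seen side by side in a short sketch. The HTML string is a made-up example; the classes come from bs4 itself.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up markup with one ordinary tag and one tag holding a comment.
html = "<html><body><p class='intro'>Hello</p><b><!-- a note --></b></body></html>"
soup = BeautifulSoup(html, "html.parser")

tag = soup.p             # Tag: the first <p> element, reached as soup.<name>
text = soup.p.string     # NavigableString: the text inside the tag
comment = soup.b.string  # Comment: the comment text without <!-- and -->

# Each object reports its own type; Comment is also a NavigableString,
# so check for Comment first when the two must be told apart.
kinds = [type(obj).__name__ for obj in (soup, tag, text, comment)]
```

Because a Comment also passes an isinstance check for NavigableString, code that extracts visible text usually tests for Comment explicitly and skips it.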