Web crawler - Toutiao (Today's Headlines)

1. Analyzing Toutiao (Today's Headlines)

  If you view the page source of Toutiao, you will find that the pages consist mostly of JS and CSS code rather than the headline data itself. At this point you need to consider whether access to the real data depends on a cookie.

  Looking at the cookies, you can find a field named s_v_web_id. If you request the page again with that cookie, you get the real source of the current page. So we can conclude that the real data is obtained by sending the cookie to the web server together with the request.
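
  A minimal sketch of that idea, using only the Python standard library. The URL, the User-Agent string, and the helper names build_request and fetch are my own placeholders and are not taken from the linked repository; the cookie value has to be copied from your own browser session.

```python
import urllib.request

# Placeholder URL: replace it with the page or endpoint you actually inspected.
FEED_URL = "https://www.toutiao.com/"


def build_request(url: str, s_v_web_id: str) -> urllib.request.Request:
    """Build a GET request that carries the s_v_web_id cookie."""
    headers = {
        # A browser-like User-Agent is an assumption; the post only mentions the cookie.
        "User-Agent": "Mozilla/5.0",
        "Cookie": f"s_v_web_id={s_v_web_id}",
    }
    return urllib.request.Request(url, headers=headers)


def fetch(url: str, s_v_web_id: str, timeout: float = 10.0) -> str:
    """Send the request and return the response body as text."""
    request = build_request(url, s_v_web_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

  Calling fetch(FEED_URL, "<value copied from the browser>") should then return the real page source, provided the server accepts the cookie.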

2. Choosing an appropriate crawling method

  Once we have the real data, the next step is to parse its content. Looking more closely, I found that all the information I wanted sits under data in the returned dictionary, so I loop over data and take the title and the id from each item (remember that the id has to be spliced into a URL, and that splicing has to be done manually).
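
  The loop can be sketched roughly as follows. I am assuming the response is JSON with a top-level "data" list whose items carry "title" and "id" keys, and that the article URL is a fixed prefix plus the id; the exact key names and the prefix are assumptions, so check them against the response you actually receive.

```python
import json

# Assumed URL pattern for an article page; verify it against real links.
ARTICLE_URL_PREFIX = "https://www.toutiao.com/a"


def extract_articles(payload: str) -> list[dict]:
    """Return title, id and spliced URL for every usable item under "data"."""
    parsed = json.loads(payload)
    articles = []
    for item in parsed.get("data") or []:
        title = item.get("title")
        article_id = item.get("id")
        # Skip entries (ads, placeholders) that lack a title or an id.
        if not title or article_id is None:
            continue
        articles.append(
            {
                "title": title,
                "id": str(article_id),
                # Manual splicing of the id into a full URL.
                "url": f"{ARTICLE_URL_PREFIX}{article_id}",
            }
        )
    return articles


# Made-up sample payload, only to show the shape the function expects.
sample = json.dumps(
    {"data": [{"title": "Example headline", "id": 123}, {"ad_label": "promo"}]}
)
print(extract_articles(sample))
```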

3. Choosing storage

  I did not write any storage code here, but I mostly use MongoDB, so you can save the data directly to MongoDB.
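
  Since the post contains no storage code, the following is only a sketch of how saving to MongoDB could look with pymongo. The connection URI, the database name "toutiao", the collection name "articles", and both helper functions are hypothetical. Upserting on id is my own choice so that re-running the crawler does not create duplicates.

```python
from typing import Any, Iterable


def get_collection(
    uri: str = "mongodb://localhost:27017",  # assumed local MongoDB instance
    db_name: str = "toutiao",                # hypothetical database name
    coll_name: str = "articles",             # hypothetical collection name
) -> Any:
    """Connect to MongoDB and return the target collection."""
    # Imported lazily so the rest of the sketch works without pymongo installed.
    from pymongo import MongoClient

    return MongoClient(uri)[db_name][coll_name]


def save_articles(collection: Any, articles: Iterable[dict]) -> int:
    """Upsert each article by its id and return how many were processed."""
    saved = 0
    for article in articles:
        collection.update_one(
            {"id": article["id"]},  # match an existing document by id
            {"$set": article},      # overwrite or fill in its fields
            upsert=True,            # insert when no document matches
        )
        saved += 1
    return saved
```

  With a running MongoDB instance, save_articles(get_collection(), articles) would store a list of dictionaries that each contain an "id" key.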

 

Full code: https://github.com/1213William/toutiao_spider

Origin www.cnblogs.com/tulintao/p/11486268.html