1. Analyzing Toutiao (Today's Headlines)
If you look at the page source that Toutiao returns, you will find that it is mostly packed JS or CSS code rather than the headline data itself, so at this point you need to consider whether the real page data depends on a cookie.
Going back to check the cookies, you can find a field named s_v_web_id. Requesting the page again with this cookie returns the real source code, so the idea is to send the cookie along with the request to the web server in order to obtain the real data.
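As a minimal sketch of this idea, the snippet below attaches the s_v_web_id cookie to a request using only the Python standard library. The function names, the User-Agent string, and the placeholder cookie value are assumptions for illustration; the real cookie value has to be copied from your own browser session, and the site may require further headers.

```python
import urllib.request


def build_request(url, s_v_web_id):
    """Build a request that carries the s_v_web_id cookie (hypothetical helper)."""
    headers = {
        # The cookie field observed in the browser; the value is supplied by the caller.
        "Cookie": f"s_v_web_id={s_v_web_id}",
        # Assumed browser-like User-Agent; adjust as needed.
        "User-Agent": "Mozilla/5.0",
    }
    return urllib.request.Request(url, headers=headers)


def fetch_page(request, timeout=10):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Building the request does not touch the network; call fetch_page(request) to send it.
request = build_request("https://www.toutiao.com/", "placeholder-value")
```

If the response still contains only JS or CSS, the cookie has probably expired and needs to be copied from the browser again.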
2. Choosing an appropriate crawling method
Once we have the real data, the next step is to parse its content. Looking more closely, I found that all the information I need sits under the data key of the returned dictionary, so I loop over data and pull out each title and id (remember that the id has to be joined into a URL, and this can only be done by manual string concatenation).
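The loop described above could look roughly like the sketch below. It assumes the response is JSON with a top-level data list whose entries carry title and id keys; the exact key names and the URL prefix are assumptions here and should be checked against the real response.

```python
import json
from typing import Dict, List

# Assumed URL prefix for an article page; verify it against the live site.
ARTICLE_URL_PREFIX = "https://www.toutiao.com/a"


def extract_articles(payload: str) -> List[Dict[str, str]]:
    """Return the title, id, and spliced URL for every entry under "data"."""
    parsed = json.loads(payload)
    articles = []
    for item in parsed.get("data", []):
        title = item.get("title")
        item_id = item.get("id")
        # Skip entries that lack a title or an id (for example ads or placeholders).
        if not title or item_id is None:
            continue
        articles.append({
            "title": title,
            "id": str(item_id),
            # The id alone is not a link, so the URL is spliced together by hand.
            "url": f"{ARTICLE_URL_PREFIX}{item_id}",
        })
    return articles


# Made-up sample payload, only to show the expected shape.
sample = json.dumps({"data": [{"title": "Example headline", "id": 123}, {"id": 456}]})
articles = extract_articles(sample)
```

Entries without a title are skipped rather than raising an error, since a feed response may mix in items of other kinds.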
3. Choosing storage
I did not write any storage code here, but I generally use MongoDB, so you can save the data directly to MongoDB.
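Since the storage step is not part of my code, the following is only a sketch of how it might be done. It assumes pymongo is installed and that each article is a dict with an id key; the helper name, database name, and collection name are all hypothetical. The helper takes the collection as a parameter, so it can be tried out without a running MongoDB.

```python
def save_articles(collection, articles):
    """Upsert each article by id so that re-running the crawler does not create duplicates.

    `collection` is expected to behave like a pymongo Collection,
    i.e. to provide update_one(filter, update, upsert=...).
    """
    saved = 0
    for article in articles:
        collection.update_one(
            {"id": article["id"]},  # match an existing document by article id
            {"$set": article},      # overwrite its fields with the fresh data
            upsert=True,            # insert the document if no match exists
        )
        saved += 1
    return saved


# Usage with a real MongoDB instance (assumed local server, hypothetical names):
# from pymongo import MongoClient
# client = MongoClient("mongodb://localhost:27017")
# save_articles(client["toutiao"]["articles"], articles)
```

Upserting on the id is a design choice for this sketch; a plain insert would also work if duplicates are not a concern.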
Full code: https://github.com/1213William/toutiao_spider