python web crawler - 2019: how I cracked enterprise registration data + trademark data + construction tender data - crawler techniques shared


Recently, while studying the "deep learning" side of artificial intelligence with TensorFlow, I have been applying AI techniques to data crawling and data mining: building a crawler model, deep-training the robot, validating the model against sample data, and finally letting the robot do the heavy crawling work for us like a tireless laborer. Along the way I have accumulated plenty of industry experience solving captchas, blocked IPs, and encryption problems. You could say that big data truly took off in 2019, and many friends have asked me to help solve the technical problems of crawling their data sources. As we all know, the precondition for both big data and artificial intelligence is having data. That is why the data industry is so hot right now, and why so many far-sighted people are moving into the big data business!

Second, the industrial and commercial registry - enterprise data - Python data mining techniques shared (46 dimensions of national enterprise registration data)

Because I have long been researching cutting-edge deep-crawler technology, a few friends recently asked me to help their startups build enterprise databases and policy databases. A while ago I had just finished developing a distributed crawler system for the enterprise data libraries of Tianyancha ("sky eye check") and Qichacha ("enterprise check"), solving IP blocking, captcha cracking, and simulated VIP login along the way. The enterprise registration data I mined covers 46 data dimensions, with each dimension stored in its own table.
Mining this much enterprise data off the network with crawler technology requires a sufficient number of servers and a big-data search-engine architecture. First I partitioned the enterprise database by city, so that every city in the country has its own database. Then, combining my own Python crawler technology with an enterprise keyword dictionary + a proxy IP pool + a distributed multi-process architecture, I developed a "deep enterprise big data mining system". The most popular development language for data mining right now is Python, because Python has a very complete set of libraries that can be used directly, for example image recognition libraries, the requests library, and so on.
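As a minimal sketch of the city-based partitioning described above (the city names, queue names, and database names are my own hypothetical examples, not the real schema), each city gets its own task queue and its own database:

    # Sketch: partition crawl tasks by city, one task queue and one
    # database per city (all names here are hypothetical).
    import redis

    CITIES = ["beijing", "shanghai", "guangzhou", "shenzhen"]

    def db_name(city):
        # Every city gets its own database, e.g. "enterprise_beijing".
        return f"enterprise_{city}"

    def dispatch(company_names_by_city):
        # Push every company keyword into its own city's task queue.
        r = redis.Redis(host="127.0.0.1", port=6379)
        for city, names in company_names_by_city.items():
            for name in names:
                r.lpush(f"tasks:{city}", name)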

2.1 The big data problems the system has to solve:
In fact, when I cracked Tianyancha and Qichacha and built this deep enterprise data mining system, I needed to solve the following three problems:
1. Crack the captcha and automate the VIP login, so the complete VIP-only data can be collected.
2. Structure and cleanse the data, then flush the cleaned data into the database.
3. Batch crawl in a distributed, multi-process, multi-task fashion, to solve the speed problem of crawling the full data volume.
2.2 Solutions shared:
Sites like Tianyancha and Qichacha hide the important fields behind a paid VIP login: without VIP you cannot see phone numbers or email addresses. So to get the complete data you first have to crack the VIP login; once the VIP login is cracked, it is like walking through an open door, and whatever data you want to take is yours. The next question is how to solve full-volume crawling: can all that enterprise data really be collected in a month? From a technical point of view it is 100% possible. As long as the program has a distributed + multi-process + multi-task architecture, you can crawl as much as you want, but there is one catch: you need enough servers. Opening 100 processes on a single computer is useless, the speed gain is tiny. Deploy the distributed crawler across 10 computers; if you really have no computers, go to an internet cafe, let the crawler clients run there for a few days, and the whole set will probably be crawled. That is how the full-volume crawling problem gets solved quickly. So when I developed the deep enterprise big data mining system, I split the architecture into: crawler clients + proxy IP pool + cookie pool + enterprise keyword dictionary + database + management backend, a complete solution covering data mining, data cleansing, data storage, and data management.
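Here is a minimal sketch of that distributed multi-process idea, assuming (my assumption, not the author's published design) a shared Redis task queue that every machine pulls from; the queue name, worker count, and search URL are placeholders:

    # Sketch: worker processes on many machines share one Redis task queue.
    import multiprocessing
    import redis
    import requests

    def worker():
        r = redis.Redis(host="127.0.0.1", port=6379, decode_responses=True)
        while True:
            task = r.brpop("tasks:beijing", timeout=30)  # wait for a company name
            if task is None:
                break  # queue drained, worker exits
            _, company = task
            resp = requests.get("https://example.com/search",  # placeholder URL
                                params={"key": company}, timeout=20)
            r.lpush("results:beijing", resp.text)

    if __name__ == "__main__":
        procs = [multiprocessing.Process(target=worker) for _ in range(8)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()

Because every machine talks to the same queue, adding more computers (the 10 machines mentioned above) just means starting the same script on each of them.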
(Friends interested in crawler and big data mining technology are welcome to add my QQ: 2779571288)

Third, the trademark network - Python crawler cracking techniques shared:
Having long been engaged in deep research on network data mining and crawler technology, I have built a one-click Taobao product mover for an e-commerce company (copying Taobao listings onto your own site with crawler technology), done competitor analysis with AI image recognition, helped a friend do public opinion monitoring with web crawlers, done deep mining of enterprise registration data, and done trademark big data mining, so I have probed the anti-crawling mechanisms of most of the larger domestic sites one by one. Different sites use different anti-crawling techniques: Tianyancha, for example, uses VIP login plus captcha technology, while Taobao uses IP blocking but lets you search product data without logging in. Among all these sites, the trademark network is one of the hardest to crawl. Saying the trademark network is hard to crawl does not mean its anti-crawling technology is more impressive; in fact its anti-crawling is nowhere near as strong as Tianyancha's. Instead, the trademark network sacrifices user experience in order to intercept large-scale crawler collection. Its anti-crawling consists of the following two parts:
3.1 The trademark network - encrypted, tracked URL access paths:

At the expense of user experience, the trademark network requires that every trademark registration number be queried through the search page before its records can be crawled. If you skip the home page and the result list and request a detail-page URL directly, the request is blocked outright, because a detail-page URL is only valid together with the home-page cookie, the list-page cookie, and a time-limited encrypted tracking token. So to crawl its data I had to simulate searching for the registration number, then simulate clicking through to the detail page, and only then crawl the detail page and the trademark process page. This way the data can be crawled, but it is a bit slow, because every record has to go through the simulated search and the simulated click into the detail page, so the speed is much slower than direct access would be.
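Here is a minimal sketch of that simulated search-then-click flow, using a requests.Session to carry the cookies along; the URLs and parameter names are placeholders I made up, not the trademark network's real endpoints:

    # Sketch: carry cookies from home page -> search -> detail page in one session.
    import requests

    session = requests.Session()
    session.headers.update({"User-Agent": "Mozilla/5.0"})

    # Step 1: visit the home page so the session picks up its cookie.
    session.get("https://example.com/", timeout=20)

    # Step 2: simulate searching for the registration number (list-page cookie).
    reg_no = "12345678"
    session.get("https://example.com/search", params={"regno": reg_no}, timeout=20)

    # Step 3: only now request the detail page; the session sends all the
    # cookies collected above, which is what makes the detail URL valid.
    detail = session.get(f"https://example.com/detail/{reg_no}", timeout=20)
    print(detail.status_code)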
3.2 The trademark network - IP blocking anti-crawling:
Besides tracking the URL access path, the trademark network also blocks IPs. That is, when you keep simulating registration-number searches or company searches for detail-page data, it monitors your IP; if it finds your IP visiting too frequently, it simply pulls it onto the blacklist and blocks it, and you cannot even open the trademark site anymore. Only after a day, or some longer period, is the IP released again. Solving this IP blocking is actually very simple. When I architected the trademark deep data mining system, the architecture consisted of: crawler clients + a proxy IP pool + AI-based simulation of human behavior + multi-processing, which solves both the anti-crawling problem and the crawling speed problem. Below I explain the proxy IP pool:
Proxy IP pool: you may ask what a proxy IP pool is for, and how to implement one. When we crawl a site, we must not crawl with our own computer's IP. If you run the crawler code directly on your own machine with your own IP, and the other side detects frequent collection of its site from that IP, it will blacklist your IP and you will never be able to collect from it again. So all of my batch-collection crawlers crawl through proxy IPs. How does Python use a proxy IP? It is actually very simple: one line of code solves it:
    resp = requests.get(url, headers=self.headers, timeout=20, proxies=proxy)
Here we call the requests library's get method, passing the url, the headers, and the proxies parameter that sets the proxy IP.
url: the address of the target website we are collecting.
headers: the request headers we send when simulating a browser visit to the target site (getting these parameters is actually very simple: open the target website in Firefox, look at the Network panel, and copy the request headers from there).
proxies: the proxy IP we set. What does a proxy IP mean? A proxy server works much like the agents we often talk about in daily life. Suppose your machine is machine A, you want the data served by machine B, and the proxy server is machine C. The connection then goes like this: first A establishes a connection to C and sends C the request for B's data; C immediately establishes a connection to B, downloads the data A requested onto its own machine, and finally sends that data back to A, completing the proxy task. The target site only ever sees the proxy server downloading its data, and since the proxy IP changes randomly, it cannot tell who is actually collecting.

Now that we know what a proxy IP is, what is a proxy IP pool? When our Python program sends, say, one HTTP request per second to the target site, each request needs its own IP; where do all these IPs come from? You can buy a third-party online IP interface, for example one that returns a new IP every 10 seconds. But if the crawler had to call that proxy interface for a fresh IP before every single request, the efficiency and code quality would be poor, because the vendor only hands out one IP per 10 seconds, so your program's speed would be throttled directly by the proxy interface. This is where you improve the proxy IP code architecture: a separate process reads the proxy IP interface every 10 seconds and caches each IP into Redis with a 60-second expiry. Redis then forms the proxy IP pool, and when your crawler code requests the target site it reads an IP straight from Redis. That way the speed is high and the architecture is optimized.
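A minimal sketch of that two-part design, assuming a hypothetical proxy vendor endpoint that returns one IP per call (the vendor URL and key names are my own placeholders): one process refills Redis every 10 seconds with a 60-second TTL per IP, and the crawler draws a random live IP from the pool.

    # Sketch: refill process + crawler-side random draw from a Redis proxy pool.
    import random
    import time
    import redis
    import requests

    r = redis.Redis(host="127.0.0.1", port=6379, decode_responses=True)

    def refill_forever():
        # Every 10 seconds, fetch one fresh IP from the (hypothetical)
        # vendor interface and cache it with a 60-second expiry.
        while True:
            ip = requests.get("https://proxy-vendor.example.com/get",
                              timeout=10).text.strip()
            r.set(f"proxy:{ip}", ip, ex=60)
            time.sleep(10)

    def random_proxy():
        # Draw any still-alive IP from the pool; expired keys vanish on their own.
        keys = r.keys("proxy:*")
        if not keys:
            raise RuntimeError("proxy pool is empty")
        ip = r.get(random.choice(keys))
        return {"http": f"http://{ip}", "https": f"http://{ip}"}

    # Crawler side, per request:
    # resp = requests.get(url, headers=headers, timeout=20, proxies=random_proxy())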


1. The collection rate is too frequent and the IP gets blocked: how do we solve it?

When we send an HTTP request to the Tianyancha site, under normal circumstances it returns status 200, meaning the request was accepted as legitimate, and we see the returned data. But Tianyancha has its own set of anti-crawling algorithms: if it detects the same IP continuously collecting data from its site, it puts that IP on an anomaly blacklist, and the next time you go to collect its data you may be blocked forever. Solving this problem is very simple: go through proxy IPs, use a proxy IP for every single request, and make the proxy IP change randomly so that each request comes from a different IP. This proxy IP technique solves the blocking problem.
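As a sketch of that per-request rotation, building on the random_proxy() helper from the pool sketch above (again my own illustrative code, not the author's): every attempt draws a fresh random IP and simply retries on failure.

    # Sketch: a new random proxy IP for every request, with simple retries.
    import requests

    def fetch(url, headers, retries=5):
        for _ in range(retries):
            try:
                resp = requests.get(url, headers=headers, timeout=20,
                                    proxies=random_proxy())  # fresh IP each attempt
                if resp.status_code == 200:
                    return resp
            except requests.RequestException:
                pass  # bad proxy: draw another one and try again
        raise RuntimeError("all retries failed")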

2. Build your own proxy IP pool

Everyone who does crawler work knows that the quality of the proxy IPs determines the crawler's efficiency. To crawl a bit faster, you must choose proxies that are high quality, highly anonymous, non-repeating, and long-lived. When I was choosing proxy IPs, the market price of good proxy IPs was generally around 6000 a month, so at the time, to avoid that overhead, I set up my own proxy IP pool and used technology to save a large portion of the cost.

3. How do we use proxy IPs when crawling Tianyancha?

To sort out the proxy IP question, we must first clearly understand what "transparent", "anonymous", and "high anonymous" (elite) mean for a proxy IP:
Transparent proxy IP: if we collect Tianyancha through this kind of proxy, it does not hide our own IP; our real IP leaks straight through. Tianyancha will quickly recognize the same client IP repeatedly visiting its site to collect data, pull that IP into the blacklist and flag it, and block it outright the next time you come to collect.
Ordinary anonymous proxy IP: this can hide the client's real IP, but it has a downside: it modifies our request, so Tianyancha is likely to detect that we are using a proxy. With this kind of proxy the visited site cannot learn your real IP address, but it still knows you are using a proxy, and some pages with IP-detection scripts can still discover your real IP. So this kind of IP is not suitable for collecting Tianyancha data either.
High anonymous (elite) proxy IP: this kind of IP does not alter the client's request at all, so to the server it looks exactly like a real customer visiting with a real browser, while the customer's real IP stays hidden. The server side (Tianyancha) does not realize we are using a proxy. For collecting Tianyancha we should use this kind of elite proxy IP; as for where to find such proxy IPs, the summary below will tell you. (A quick way to test which class a given proxy falls into is sketched right below.)
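One rough way to check which class a proxy falls into is to request a header-echo service through it and look for proxy giveaways such as the Via or X-Forwarded-For headers. The sketch below uses the public httpbin.org/headers echo service purely for illustration; it is a heuristic, not a definitive test:

    # Sketch: classify a proxy by echoing back the headers the server sees.
    import requests

    def anonymity(proxy):
        proxies = {"http": proxy, "https": proxy}
        headers = requests.get("https://httpbin.org/headers",
                               proxies=proxies, timeout=20).json()["headers"]
        # Elite proxies forward neither header; ordinary anonymous proxies
        # usually add Via or X-Forwarded-For; transparent ones leak the
        # real client IP inside them.
        leaks = [h for h in ("Via", "X-Forwarded-For") if h in headers]
        return "high anonymous" if not leaks else f"not elite, leaks: {leaks}"

    # print(anonymity("http://1.2.3.4:8080"))  # hypothetical proxy address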
To get around Tianyancha's captcha and crawl its data, we first need to analyze under what circumstances the captcha appears and how Tianyancha recognizes whether a visit comes from a browser or from a crawler:
How do we know Tianyancha has blocked our IP?
When you crawl Tianyancha, whether your IP has been blocked shows up in the response: once an IP is blocked, Tianyancha returns its login page instead. If the login page appears, it means your IP has been blocked or flagged as anomalous. So while crawling the data, use a regular expression to check whether the returned HTML contains the login and register markers; if the login page appears, switch to a different IP and send the request again, looping until you land on a normal, usable IP.
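A sketch of that detect-and-retry loop; the "login"/"register" pattern below stands in for whatever strings actually identify the login page, and random_proxy() is again the hypothetical pool helper from earlier:

    # Sketch: loop over proxy IPs until the response is not the login page.
    import re
    import requests

    LOGIN_PAT = re.compile(r"login|register", re.IGNORECASE)  # placeholder markers

    def fetch_until_clean(url, headers, max_tries=20):
        for _ in range(max_tries):
            try:
                resp = requests.get(url, headers=headers, timeout=20,
                                    proxies=random_proxy())
            except requests.RequestException:
                continue  # dead proxy, draw another
            if not LOGIN_PAT.search(resp.text):
                return resp  # a normal page came back, this IP works
            # The login page means this IP is blocked: loop and change IP.
        raise RuntimeError("no usable IP found")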
Why are we still blocked even after using proxy IPs?
When doing crawler work we cannot use our own computer's IP to collect the Tianyancha site in a loop: either the IP is blocked outright or a captcha appears. Solving this is very simple: use dynamic proxy IPs. But many people still get blocked even when using proxy IPs, usually because the proxy IPs they use are of poor quality, or because they are using some kind of free proxy IP. The reasons a proxy IP still gets you blocked are the following (a sketch of a health check that screens out the bad ones follows this list):
(1) Your proxy IP's lifetime is too short: it expires before a single HTTP request can finish, so you keep getting blocked or failing.
(2) Your proxy IP's network is unreachable or too slow, so requests cannot get through.
(3) Your proxy IP is not a high anonymous (elite) proxy, so it is directly identified as a proxy.
(4) Your proxy IP has already been used by other people to crawl Tianyancha and has long since been blacklisted, so with it you are permanently blocked from the start.
(5) Are you cycling through the same few proxy IPs? The correct approach is to switch to a random, different IP on every request, so you should choose a provider that does not limit the number of proxy IPs rather than one that gives you a fixed handful.
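Here is a minimal sketch of such a health check, screening a candidate IP against reasons (1)-(3) before it enters the pool; the test URL is a placeholder, and a real system would test against the target site itself:

    # Sketch: keep only proxies that are alive, fast enough, and elite.
    import requests

    def healthy(proxy, test_url="https://httpbin.org/ip"):
        proxies = {"http": proxy, "https": proxy}
        try:
            # Reasons (1)/(2): the proxy must answer within a few seconds.
            resp = requests.get(test_url, proxies=proxies, timeout=5)
            resp.raise_for_status()
        except requests.RequestException:
            return False
        # Reason (3): combine with the anonymity() check sketched earlier
        # to reject proxies that are not high anonymous.
        return True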
4. What kind of proxy IP should we choose for crawling Tianyancha?
I recommend that your proxy IPs meet the following requirements:
(1) High anonymous (elite): ordinary anonymous will not do, it must be high anonymous.
(2) Long validity: each IP should stay valid for at least 2 minutes.
(3) Non-repeating: no IP should repeat within at least 30 days.
(4) Unlimited quantity: no daily cap on the number of IPs, and the IPs should change randomly.

Time is limited, so I will share up to here; I have to get back to writing code...

