Crawler tools: HtmlUnit, Selenium, BeautifulSoup

   I needed a crawler and tried these three tools. I used HtmlUnit and Selenium from Java; BeautifulSoup is a Python library.
   BeautifulSoup works on the page's HTML source and can find the corresponding tags from the markup, but I found its search methods relatively rigid and hard to use. I then looked at other Python-based approaches, and what I read also suggested that BeautifulSoup is not the easiest option.
   HtmlUnit is the one I liked best after using it. Its lookups by id, tag and attribute let you target an element through some unique feature of the tag and extract exactly the data you want. It also lets you modify the request headers, which helps with sites that use a token to keep crawlers out. Selenium's distinguishing feature is that it simulates a user operating the browser, much like a key-macro tool or the Robot class in Java. It is worth considering for sites with aggressive anti-crawler measures. It drives real browsers such as Google Chrome and IE.
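   As a rough illustration of the tag lookups described above, here is a minimal BeautifulSoup sketch. The HTML fragment, the id "price" and the data-kind attribute are made up for the example; they do not come from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, used only for illustration.
HTML = """
<html><body>
  <div id="price" class="box"> 42 </div>
  <a href="/a" data-kind="item">first</a>
  <a href="/b" data-kind="ad">second</a>
</body></html>
"""

# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(HTML, "html.parser")

# Look up a single element by its id.
price = soup.find(id="price").get_text(strip=True)

# Collect every <a> tag and read its href attribute.
links = [a["href"] for a in soup.find_all("a")]

# Filter tags by an arbitrary attribute value.
items = [a.get_text() for a in soup.find_all("a", attrs={"data-kind": "item"})]

# The same filter written as a CSS selector.
item_links = [a["href"] for a in soup.select('a[data-kind="item"]')]

print(price, links, items, item_links)
```

   find and find_all cover lookups by id, tag name and attribute; select accepts CSS selectors, which may feel less rigid for nested structures.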

   I have not studied the efficiency and applicability of each crawler tool in depth. A note to myself: when you run into a token-based anti-crawler check, modify HtmlUnit's request headers, cookies and browser version. (The Selenium package was too large to upload; it is available online.)
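   The same idea (sending a token, a cookie and a browser-like User-Agent with the request) can be sketched with Python's standard library. This is not HtmlUnit code; in HtmlUnit the corresponding settings are made on the WebClient. The header name X-Token, the cookie name sessionid and the URL are assumptions for illustration, since the real names depend on the target site.

```python
import urllib.request

# A browser-like User-Agent string; any realistic value can be substituted.
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"


def build_request(url, token, session_cookie, user_agent=BROWSER_UA):
    """Build a request that carries a token, a cookie and a browser User-Agent.

    "X-Token" and "sessionid" are hypothetical names; inspect the site's
    own traffic to find the ones it actually expects.
    """
    headers = {
        "User-Agent": user_agent,
        "X-Token": token,
        "Cookie": f"sessionid={session_cookie}",
    }
    return urllib.request.Request(url, headers=headers)


def fetch(request, timeout=10):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Building the request does not touch the network; fetch() does.
req = build_request("https://example.com/data", "TOKEN", "abc123")
```

   Calling fetch(req) would perform the actual download. Keeping request construction separate makes it easy to check the headers before sending anything.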

