Currency station log 2 -- Is my "currency station" illegal? Where are the legal boundaries for crawlers?

Case Studies

I found a project on GitHub that summarizes crawler-related legal cases in China ( https://github.com/HiddenStrawberry/Crawler_Illegal_Cases_In_China ). It was only updated on the 10th; its last commit was 10 days ago. Reading through it raised the question below.

Where are the legal boundaries for crawlers?

Software matters a great deal in real life, and this is an era in which data is king. However thorough the law tries to be, it will inevitably have gaps.

But a missing legal provision does not mean software can do whatever it wants. Whether we are writing rules for the real world, for the digital world of software, or for the virtual world of a game, there are a few principles we should all abide by:

  • 1. Do not damage other people's property.

For example, suppose I see a lot of data in your house and would like to use it. Can I? If you do not object, I think so. It is like a painting on a wall of your house that faces the street: you cannot stop me from looking at it, but I cannot carry the wall away with me. For crawlers, this means that when you crawl someone's data you must not hurt their server's uptime, and you must not use a *** bot to delete other people's data on their site. So when we write crawlers, we should not over-emphasize efficiency. Slow is fine. You are taking someone else's data, so show them some respect. Do not launch hundreds of threads and bring their server down.

  • 2. If others do not let you see the data, do not crawl it.

For example, if some data is clearly meant for VIP members only, you must not crawl it and put it out for everyone to see. To continue the wall painting example: once the owner has covered the painting with a cloth, you are not allowed to tear the cloth off to look, any more than you may tear at a stranger's clothes in the street. That is the kind of thing that gets you a criminal charge.

  • 3. Personal data must never be crawled!

Gongxinbao (公信宝), once a big name in the coin circle, got taken in over exactly this. There may be two specific reasons:
1. Helping illegal P2P lenders collect users' credit information; in practice, it seems they wrote a plugin that crawled credit data.
2. Illegally crawling users' private data.
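
To make the first rule above (do not overload someone else's server) concrete, here is a minimal sketch of a single-threaded crawl loop that enforces a minimum pause between requests. The names `PoliteThrottle` and `crawl_slowly`, and the `fetch` callable that is passed in, are my own illustrative assumptions and do not come from any particular library.

```python
import time


class PoliteThrottle:
    """Enforce a minimum delay between consecutive requests to one site."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock   # injectable, so the class can be tested without waiting
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Block until at least min_interval_s has passed since the last call."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_interval_s - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now


def crawl_slowly(urls, fetch, throttle):
    """Fetch URLs one at a time in a single thread, pausing between requests."""
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results
```

Something like `crawl_slowly(urls, my_fetch, PoliteThrottle(2.0))` would then send at most one request every two seconds. What interval is "slow enough" is still a judgment call and depends on the site.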

Where are the gray areas for crawlers?

Although a lot has been said above and it all sounds clear-cut, in practice it is quite vague. For example:

  • What level of load counts as not affecting someone else's server?
  • How should openly published, copyrighted articles be treated?

How do a crawler and the site being crawled finally reach an understanding?

Yes, there are many gray areas, and although it is not written into any law, there is one convention we have all agreed to follow: robots.txt.
In it, the site owner states which data crawlers are allowed to fetch and which they are not.
Let's look at CSDN's robots.txt:

    
    User-agent: * 
    Disallow: /scripts 
    Disallow: /public 
    Disallow: /css/ 
    Disallow: /images/ 
    Disallow: /content/ 
    Disallow: /ui/ 
    Disallow: /js/ 
    Disallow: /scripts/ 
    Disallow: /article_preview.html* 
    Disallow: /tag/
    Disallow: /*?*
    Disallow: /link/

    Sitemap: http://www.csdn.net/article/sitemap.txt

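It clearly states that you must not crawl its resource paths or the uncategorized preview pages. Anything else is unrestricted, so in theory you may crawl it.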
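
A crawler can check these rules before fetching anything. The sketch below uses Python's standard `urllib.robotparser` on an excerpt of the CSDN rules quoted above. I left out the wildcard rules because, as far as I know, the standard-library parser only does plain prefix matching. The user agent `MyNewsBot` and the two URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the CSDN rules quoted above (wildcard rules omitted).
CSDN_RULES = """\
User-agent: *
Disallow: /scripts
Disallow: /public
Disallow: /css/
Disallow: /images/
Disallow: /content/
Disallow: /ui/
Disallow: /js/
Disallow: /tag/
Disallow: /link/
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


csdn = build_parser(CSDN_RULES)

# A resource path listed under Disallow is refused.
print(csdn.can_fetch("MyNewsBot", "http://www.csdn.net/images/logo.png"))    # False
# A path that no rule mentions is allowed.
print(csdn.can_fetch("MyNewsBot", "http://www.csdn.net/article/example"))    # True
```

In a real crawler you would more likely call `set_url()` and `read()` on the parser to download the live file instead of pasting the rules in.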

Now let's look at the ancestors of all crawlers: the search engines.
First, Baidu's:

User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: MSNBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Baiduspider-image
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: YoudaoBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sogou web spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sogou inst spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sogou spider2
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sogou blog
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sogou News Spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sogou Orion spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: ChinasoSpider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sosospider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: yisouspider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: EasouSpider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: *
Disallow: /

Let's focus on one entry:

User-agent: *
Disallow: /

If you are not one of the search engines it names, you are not allowed to crawl a single piece of data!
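
The effect of that catch-all entry is easy to see with a shortened excerpt of the Baidu rules above and the same standard-library parser. This is only a sketch: `MyNewsBot` and the `/more/` path are hypothetical, chosen just to show how a named crawler and an unnamed one are treated differently.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt: one named crawler plus the catch-all entry.
BAIDU_EXCERPT = """\
User-agent: Baiduspider
Disallow: /baidu
Disallow: /home/news/data/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(BAIDU_EXCERPT.splitlines())


def allowed(agent, url):
    """True if the rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


# The named crawler is only blocked from the paths listed for it.
print(allowed("Baiduspider", "https://www.baidu.com/more/"))   # True
print(allowed("Baiduspider", "https://www.baidu.com/baidu"))   # False
# Any crawler that is not named falls through to "Disallow: /".
print(allowed("MyNewsBot", "https://www.baidu.com/more/"))     # False
```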
Now let's look at Google's:

User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Allow: /?hl=*&gws_rd=ssl$
Disallow: /?hl=*&*&gws_rd=ssl
Allow: /?gws_rd=ssl$
Allow: /?pt1=true$
Disallow: /imgres
Disallow: /u/
Disallow: /preferences
Disallow: /setprefs
Disallow: /default
Disallow: /m?
Disallow: /m/
Allow: /m/finance
Disallow: /wml?
Disallow: /wml/?
Disallow: /wml/search?
Disallow: /xhtml?
Disallow: /xhtml/?
Disallow: /xhtml/search?
Disallow: /xml?
Disallow: /imode?
Disallow: /imode/?
Disallow: /imode/search?
Disallow: /jsky?
Disallow: /jsky/?
Disallow: /jsky/search?
Disallow: /pda?
Disallow: /pda/?
Disallow: /pda/search?
Disallow: /sprint_xhtml
Disallow: /sprint_wml
Disallow: /pqa
Disallow: /palm
Disallow: /gwt/
Disallow: /purchases
Disallow: /local?
Disallow: /localurl
Disallow: /shihui?
Disallow: /shihui/
Disallow: /products?
Disallow: /product

Disallow: /products_
Disallow: /products;
Disallow: /print
Disallow: /books/
Disallow: /bkshp?*q=*
Disallow: /books?*q=*
Disallow: /books?*output=*
Disallow: /books?*pg=*
Disallow: /books?*jtp=*
Disallow: /books?*jscmd=*
Disallow: /books?*buy=*
Disallow: /books?*zoom=*
Allow: /books?*q=related:*
Allow: /books?*q=editions:*
Allow: /books?*q=subject:*
Allow: /books/about
Allow: /booksrightsholders
Allow: /books?*zoom=1*
Allow: /books?*zoom=5*
Allow: /books/content?*zoom=1*
Allow: /books/content?*zoom=5*
Disallow: /ebooks/
Disallow: /ebooks?*q=*
Disallow: /ebooks?*output=*
Disallow: /ebooks?*pg=*
Disallow: /ebooks?*jscmd=*
Disallow: /ebooks?*buy=*
Disallow: /ebooks?*zoom=*
Allow: /ebooks?*q=related:*
Allow: /ebooks?*q=editions:*
Allow: /ebooks?*q=subject:*
Allow: /ebooks?*zoom=1*
Allow: /ebooks?*zoom=5*
Disallow: /patents?
Disallow: /patents/download/
Disallow: /patents/pdf/
Disallow: /patents/related/
Disallow: /scholar
Disallow: /citations?
Allow: /citations?user=
Disallow: /citations?*cstart=
Allow: /citations?view_op=new_profile
Allow: /citations?view_op=top_venues
Allow: /scholarshare
Disallow: /s?
Allow: /maps?output=classic
Allow: /maps?*file=
Allow: /maps/d/
Disallow: /maps?
Disallow: /mapstt?
Disallow: /mapslt?
Disallow: /maps/stk/
Disallow: /maps/br?
Disallow: /mapabcpoi?
Disallow: /maphp?
Disallow: /mapprint?
Disallow: /maps/api/js/
Allow: /maps/api/js
Disallow: /maps/api/place/js/
Disallow: /maps/api/staticmap
Disallow: /maps/api/streetview
Disallow: /maps/
/sw/manifest.json
Disallow: /mld?
Disallow: /staticmap?
Disallow: /maps/preview
Disallow: /maps/place
Disallow: /maps/timeline/
Disallow: /help/maps/streetview/partners/welcome/
Disallow: /help/maps/indoormaps/partners/
Disallow: /lochp?
Disallow: /center
Disallow: /ie?
Disallow: /blogsearch/
Disallow: /blogsearch_feeds
Disallow: /advanced_blog_search
Disallow: /uds/
Disallow: /chart?
Disallow: /transit?
Allow: /calendar$
Allow: /calendar/about/
Disallow: /calendar/
Disallow: /cl2/feeds/
Disallow: /cl2/ical/
Disallow: /coop/directory
Disallow: /coop/manage
Disallow: /trends?
Disallow: /trends/music?
Disallow: /trends/hottrends?
Disallow: /trends/viz?
Disallow: /trends/embed.js?
Disallow: /trends/fetchComponent?
Disallow: /trends/beta
Disallow: /trends/topics
Disallow: /musica
Disallow: /musicad
Disallow: /musicas
Disallow: /musicl
Disallow: /musics
Disallow: /musicsearch
Disallow: /musicsp
Disallow: /musiclp
Disallow: /urchin_test/
Disallow: /movies?
Disallow: /wapsearch?
Allow: /safebrowsing/diagnostic
Allow: /safebrowsing/report_badware/
Allow: /safebrowsing/report_error/
Allow: /safebrowsing/report_phish/
Disallow: /reviews/search?
Disallow: /orkut/albums
Disallow: /cbk
Disallow: /recharge/dashboard/car
Disallow: /recharge/dashboard/static/
Disallow: /profiles/me
Allow: /profiles
Disallow: /s2/profiles/me
Allow: /s2/profiles
Allow: /s2/oz
Allow: /s2/photos
Allow: /s2/search/social
Allow: /s2/static
Disallow: /s2
Disallow: /transconsole/portal/
Disallow: /gcc/
Disallow: /aclk
Disallow: /cse?
Disallow: /cse/home
Disallow: /cse/panel
Disallow: /cse/manage
Disallow: /tbproxy/
Disallow: /imesync/
Disallow: /shenghuo/search?
Disallow: /support/forum/search?
Disallow: /reviews/polls/
Disallow: /hosted/images/
Disallow: /ppob/?
Disallow: /ppob?
Disallow: /accounts/ClientLogin
Disallow: /accounts/ClientAuth
Disallow: /accounts/o8
Allow: /accounts/o8/id
Disallow: /topicsearch?q=
Disallow: /xfx7/
Disallow: /squared/api
Disallow: /squared/search
Disallow: /squared/table
Disallow: /qnasearch?
Disallow: /app/updates
Disallow: /sidewiki/entry/
Disallow: /quality_form?
Disallow: /labs/popgadget/search
Disallow: /buzz/post
Disallow: /compressiontest/
Disallow: /analytics/feeds/
Disallow: /analytics/partners/comments/
Disallow: /analytics/portal/
Disallow: /analytics/uploads/
Allow: /alerts/manage
Allow: /alerts/remove
Disallow: /alerts/
Allow: /alerts/$
Disallow: /ads/search?
Disallow: /ads/plan/action_plan?
Disallow: /ads/plan/api/
Disallow: /ads/hotels/partners
Disallow: /phone/compare/?
Disallow: /travel/clk
Disallow: /travel/hotelier/terms/
Disallow: /hotelfinder/rpc
Disallow: /hotels/rpc
Disallow: /commercesearch/services/
Disallow: /evaluation/
Disallow: /chrome/browser/mobile/tour
Disallow: /compare//apply
Disallow: /forms/perks/
Disallow: /shopping/suppliers/search
Disallow: /ct/
Disallow: /edu/cs4hs/
Disallow: /trustedstores/s/
Disallow: /trustedstores/tm2
Disallow: /trustedstores/verify
Disallow: /adwords/proposal
Disallow: /shopping/product/
Disallow: /shopping/seller
Disallow: /shopping/ratings/account/metrics
Disallow: /shopping/reviewer
Disallow: /about/careers/applications/
Disallow: /landing/signout.html
Disallow: /webmasters/sitemaps/ping?
Disallow: /ping?
Disallow: /gallery/
Disallow: /landing/now/ontap/
Allow: /searchhistory/
Allow: /maps/reserve
Allow: /maps/reserve/partners
Disallow: /maps/reserve/api/
Disallow: /maps/reserve/search
Disallow: /maps/reserve/bookings
Disallow: /maps/reserve/settings
Disallow: /maps/reserve/manage
Disallow: /maps/reserve/payment
Disallow: /maps/reserve/receipt
Disallow: /maps/reserve/sellersignup
Disallow: /maps/reserve/payments
Disallow: /maps/reserve/feedback
Disallow: /maps/reserve/terms
Disallow: /maps/reserve/m/
Disallow: /maps/reserve/b/
Disallow: /maps/reserve/partner-dashboard
Disallow: /about/views/
Disallow: /intl/*/about/views/
Disallow: /local/dining/
Disallow: /local/place/products/
Disallow: /local/place/reviews/
Disallow: /local/place/rap/
Disallow: /local/tab/
Allow: /finance
Allow: /js/

# AdsBot

User-agent: AdsBot-Google
Disallow: /maps/api/js/
Allow: /maps/api/js
Disallow: /maps/api/place/js/
Disallow: /maps/api/staticmap
Disallow: /maps/api/streetview

# Certain social media sites are whitelisted to allow crawlers to access page markup when links to google.com/imgres* are shared. To learn more, please contact [email protected].

User-agent: Twitterbot
Allow: /imgres

User-agent: facebookexternalhit
Allow: /imgres

Sitemap: https://www.google.com/sitemap.xml


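Google's is fairly long, but we only need to pay attention to one line:
User-agent: *
This means that anything it allows, any of us may crawl!
## Finally, is the little site I wrote illegal?
- 1. I have not brought down the other sites' servers, and there is no real chance that I add load to them
My crawler runs once every 10 minutes and fetches 3 to 5 news items each time, and it does not load their images, JS, or similar assets
- 2. The robots.txt files of these sites do not restrict my crawling
- 3. I have not harmed the interests of these sites
Although I do obtain the news content from these sites, I do not display it directly; readers have to open the article at its original location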
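
As a rough sketch of the limits described in the list above (one run every 10 minutes, at most 5 items per run, linking back to the original article), the per-run selection could look something like this. `NewsLink`, `pick_new_items`, and `requests_per_hour` are hypothetical names for illustration, not my site's actual code, and keeping only a title and URL is a simplification of "do not display the content directly".

```python
from dataclasses import dataclass

CRAWL_INTERVAL_S = 10 * 60   # one run every 10 minutes
MAX_ITEMS_PER_RUN = 5        # at most 5 news items per run


@dataclass(frozen=True)
class NewsLink:
    title: str
    url: str   # readers are sent to the original article


def pick_new_items(candidates, seen_urls, limit=MAX_ITEMS_PER_RUN):
    """Return up to `limit` unseen items and record their URLs as seen."""
    picked = []
    for item in candidates:
        if len(picked) >= limit:
            break
        if item.url in seen_urls:
            continue
        seen_urls.add(item.url)
        picked.append(item)
    return picked


def requests_per_hour(items_per_run=MAX_ITEMS_PER_RUN, interval_s=CRAWL_INTERVAL_S):
    """Upper bound on article requests sent to one site per hour."""
    return items_per_run * 3600 / interval_s
```

With these numbers the upper bound is 30 article requests per hour per site, which is far below what a single human reader clicking around could generate.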

Origin blog.51cto.com/14633800/2456580