Site Crawl Instructions

Website crawling:
1. Baidu keyword crawling (e.g. "Beijing company", "Beijing enterprise").
2. Baidu enterprise-name crawling (searching by company name).
The initial crawl goes through Baidu: collect the title and the bd_url (not the site's own URL, but the redirect URL that Baidu jumps to). Only the first results page needs to be grabbed; the second page can be skipped.
3. Keep only the rows whose title contains keywords such as bd_name like '%Official website%' or bd_name like '%Group%' or bd_name like '%Home%'; these hits are relatively accurate, and otherwise the volume is too large.
4. Following the bd_url of each kept row, fetch the site itself and save web_url, web_title, and the page text web_context (referred to as web_text in the SQL below). A staging-table and filter sketch is given right after this list.

Still to be worked out: which region does each website belong to? A site without an ICP record number may be problematic and needs further study.
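As a concrete anchor for the field names above, here is a minimal sketch of the staging table and of the step-3 filter. The table name crawl_result, the column types, and the Oracle-flavored syntax (matching the three-argument instr used below) are assumptions made for illustration; the original notes only name the columns.

-- Minimal sketch: staging table for the crawled rows (assumed name and types).
create table crawl_result (
    bd_name     varchar2(500),   -- title captured from the Baidu results page
    bd_url      varchar2(1000),  -- redirect URL that Baidu jumps to
    web_url     varchar2(1000),  -- the site's own URL, resolved from bd_url
    web_title   varchar2(500),   -- title of the site itself
    web_text    clob,            -- plain-text page content (no HTML)
    ent_icp     varchar2(100),   -- ICP record number, filled in later
    ent_address varchar2(200),   -- address, filled in later
    tellphone   varchar2(100),   -- phone / hotline, filled in later
    ent_name    varchar2(200)    -- company name, filled in later
);

-- Step 3: keep only the relatively accurate hits.
delete from crawl_result
 where bd_name not like '%Official website%'
   and bd_name not like '%Group%'
   and bd_name not like '%Home%';

The keyword strings mirror the filter fragment quoted in step 3; against real Baidu titles they would most likely be the corresponding Chinese terms.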
-----------------
SQL for extracting the ICP number, address and other information from the page content (the relevant fragments are cut out of the site text; store plain text, do not store HTML). In the statements below, "table" stands for the actual crawl-results table.

1. ICP

-- Beijing records: take 30 characters starting just before '京ICP'.
update table tt set tt.ent_icp=substr(tt.web_text,instr(tt.web_text,'京ICP',1)-1,30)
where tt.web_text like '%京ICP%';

-- Other provinces: take 30 characters starting just before 'ICP' (the leading character is the province abbreviation).
update table tt set tt.ent_icp=substr(tt.web_text,instr(tt.web_text,'ICP',1)-1,30)
where tt.web_text like '%ICP%' and tt.ent_icp is null;
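
The notes above flag sites without an ICP record number as potentially problematic. A quick review query for those rows, using the assumed crawl_result table from the sketch above, could be:

-- Sites where no ICP record number was extracted; candidates for manual review.
select web_url, web_title
  from crawl_result
 where ent_icp is null;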
2. Address

-- Take 50 characters starting at '地址' ("address").
update table tt set tt.ent_address=substr(tt.web_text,instr(tt.web_text,'地址',1),50)
where tt.web_text like '%地址%';
3. Phone

-- Take 20 characters starting at '电话' ("telephone").
update table tt set tt.tellphone=substr(tt.web_text,instr(tt.web_text,'电话',1),20)
where tt.web_text like '%电话%';

-- Fallback: take 20 characters starting at '热线' ("hotline").
update table tt set tt.tellphone=substr(tt.web_text,instr(tt.web_text,'热线',1),20)
where tt.web_text like '%热线%' and tt.tellphone is null;
4. Company name

-- Take the 30 characters immediately before 'copyright', where the company name usually appears.
update table tt set tt.ent_name=substr(tt.web_text,instr(tt.web_text,'copyright',1)-30,30)
where tt.web_text like '%copyright%';
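
If 'copyright' happens to appear within the first 30 characters of web_text, instr(...)-30 drops to zero or below. A guarded variant (a sketch added here, not part of the original notes) clamps the start position and also matches 'Copyright' regardless of case:

-- Sketch: clamp the start position to 1 and match case-insensitively.
update table tt set tt.ent_name=substr(tt.web_text,
       greatest(instr(lower(tt.web_text),'copyright',1)-30,1),30)
where lower(tt.web_text) like '%copyright%';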
