ScienceDirect website crawling process

 

 

Development environment

C# + SQLite

Software Tutorial:

Settings page

1. First enter the keywords to query. To restrict the query by year, check the corresponding years; multiple years can be selected. Click the [Set keywords] button to add the keywords to the queue of queries to be run.

2. Modify the paging-query delay and the article-query delay, then click the [Modify] button for the new delays to take effect.

3. Click the [Start / Pause] button to control the query operation.

4. The bottom of the page shows the number of result pages still to be queried, the number of articles still to be queried, the amount of data waiting to be stored, and the amount of data already stored.

 

 

 

 

Run Log

Every step that is executed writes a corresponding description to the log, including paging queries, article queries, operation errors, data storage, data validation, and all other log information.

 

 

 

 

 

Data Preview

All data is stored in a SQLite database in real time and is saved permanently. The main functions of this page are paged preview of the data and data export.

If the data is no longer needed, delete the data.db file in the software directory.

 

 

 

 

 

Summary of problems during development

Data Capture

The difficulty with any crawler has never been the technology but the analysis of the site's data: the data displayed on the surface may differ from what you imagine. For example, the author information on the article details page is rendered by JavaScript from JSON data. To find a specific piece of data, the entire JSON document has to be parsed.

The JSON data itself is not the hardest part either; parsing the JSON data is the hardest part.

The key/value data uses keys such as "$", "$$", "_", and "Get-Text". In short, whatever is incompatible with C# is what gets used.

I can think of two solutions for this kind of data:

1. Traverse all the key/value pairs, then pick out the required data by matching on the corresponding key or on the value of the name.

2. Since C# supports dynamic typing through dynamic, the names can be hard-coded during processing, as long as each key can be used as a variable name. To make the names comply with the identifier rules, the keys first have to be rewritten with Replace.
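The first solution can be sketched roughly as follows. This is only an illustration under assumptions, not the code used in the tool: it uses System.Text.Json rather than whatever JSON library the tool actually uses, the JsonWalker class is a hypothetical helper, and the sample payload is invented (it reuses the "$", "$$" and "_" keys mentioned above, but the "name" key and the overall shape are guesses).

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Invented sample for illustration only; the real payload is much larger
// and its exact shape is not reproduced here.
const string sample = @"{
  ""name"": ""author"",
  ""$"": { ""id"": ""au1"" },
  ""$$"": [
    { ""name"": ""given-name"", ""_"": ""Ada"" },
    { ""name"": ""surname"", ""_"": ""Lovelace"" }
  ]
}";

using JsonDocument doc = JsonDocument.Parse(sample);

// Keys such as "_" are not valid C# identifiers, but as plain strings
// they can be matched without any renaming.
foreach (var (key, value) in JsonWalker.Walk(doc.RootElement))
{
    if (key == "_")
        Console.WriteLine(value.GetString());
}

static class JsonWalker
{
    // Depth-first traversal that yields every (key, value) pair in the tree.
    public static IEnumerable<(string Key, JsonElement Value)> Walk(JsonElement element)
    {
        if (element.ValueKind == JsonValueKind.Object)
        {
            foreach (JsonProperty property in element.EnumerateObject())
            {
                yield return (property.Name, property.Value);
                foreach (var pair in Walk(property.Value))
                    yield return pair;
            }
        }
        else if (element.ValueKind == JsonValueKind.Array)
        {
            foreach (JsonElement item in element.EnumerateArray())
                foreach (var pair in Walk(item))
                    yield return pair;
        }
    }
}
```

With the invented sample, the loop prints the two "_" text values, Ada and Lovelace. Because the keys are only ever compared as strings, no Replace step is needed with this approach.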

 

 

I was not very smart, ha ha.

About IP restrictions

IP restriction is undoubtedly an effective measure for the site. Against IP restrictions, the only option is to slow down the queries.

Once again I visit one page, finish with it, and only then visit the next one, with a simple random delay in between to keep the IP from being blocked.
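A minimal sketch of that "one page at a time, with a random pause" idea is shown below. The PoliteFetcher class, its method names, and the 2 to 5 second default range are assumptions made for illustration; the tool reads its delays from the settings page instead.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

// Shows one randomly chosen pause; no network access happens here.
Console.WriteLine(PoliteFetcher.NextDelay(2000, 5000));

static class PoliteFetcher
{
    private static readonly HttpClient Client = new HttpClient();
    private static readonly Random Rng = new Random();

    // Random pause between minMs and maxMs milliseconds, both inclusive.
    public static TimeSpan NextDelay(int minMs, int maxMs) =>
        TimeSpan.FromMilliseconds(Rng.Next(minMs, maxMs + 1));

    // Downloads the pages strictly one after another, pausing after each one.
    public static async Task<List<string>> FetchSequentiallyAsync(
        IEnumerable<string> urls, int minMs = 2000, int maxMs = 5000)
    {
        var pages = new List<string>();
        foreach (string url in urls)
        {
            pages.Add(await Client.GetStringAsync(url));
            await Task.Delay(NextDelay(minMs, maxMs));
        }
        return pages;
    }
}
```

A randomized pause makes the request pattern less regular than a fixed interval, at the cost of a slower overall crawl.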

About the future of the site

To better match the site's query conditions, such as the year, the years from 1996 up to the current year are displayed.

Run Log

To make the log information stand out, successful operations are marked in blue and failed ones in red.

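About Dapper

When I first came across Dapper, I used it as if it were a perfect DbHelper. Later I found that transactions, checking whether data already exists, and query-then-insert all have to be completed by myself. So much for my perfect Dapper.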

 

 

 

Still, SQLite does have some advantages over SQL Server,

such as Create Table If Not Exists TableName

and Replace Into, which saves a lot of code.
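The two SQLite statements can be combined with Dapper roughly like this. It is a sketch under assumptions: the Article table and its columns are hypothetical, the Microsoft.Data.Sqlite provider is an assumption (the tool may use a different SQLite provider), and an in-memory database stands in for data.db.

```csharp
using System;
using Dapper;                 // NuGet package: Dapper
using Microsoft.Data.Sqlite;  // NuGet package: Microsoft.Data.Sqlite (assumed provider)

// In-memory database for the demo; the data lives as long as the connection is open.
using var connection = new SqliteConnection("Data Source=:memory:");
connection.Open();

// Safe to run on every start-up: does nothing if the table already exists.
connection.Execute(@"CREATE TABLE IF NOT EXISTS Article (
    Url   TEXT PRIMARY KEY,
    Title TEXT NOT NULL)");

// REPLACE INTO inserts the row, or replaces an existing row with the same
// primary key, so no separate "query first, then insert or update" is needed.
const string upsert = "REPLACE INTO Article (Url, Title) VALUES (@Url, @Title)";
connection.Execute(upsert, new { Url = "https://example.org/a1", Title = "First title" });
connection.Execute(upsert, new { Url = "https://example.org/a1", Title = "Corrected title" });

long count = connection.ExecuteScalar<long>("SELECT COUNT(*) FROM Article");
string title = connection.ExecuteScalar<string>("SELECT Title FROM Article");
Console.WriteLine($"{count} row(s), title = {title}");
```

Running the second REPLACE INTO with the same Url leaves a single row holding the corrected title. Note that REPLACE INTO removes the conflicting row and inserts a new one, so any column not listed in the statement falls back to its default value.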

Data export

For exporting data to Excel, NPOI is always the tool of choice.

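Pending issues

If the data contains superscripts or subscripts, I still do not know how to handle and save them. The almighty Baidu did not help me: I do not know what the superscript of a is in Unicode, and subscripts did not display successfully either. Any pointers from the experts would be appreciated…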
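One possible direction, offered only as a sketch: Unicode does define dedicated superscript and subscript characters for the digits and for some letters, for example U+1D43 for a superscript a and U+2090 for a subscript a. The ScriptText helper below is hypothetical and maps only the digits and the letter a; many letters have no subscript form at all, so keeping the original markup in the stored text may be the more general option. Whether the characters display correctly also depends on the font.

```csharp
using System;
using System.Linq;
using System.Text;

Console.OutputEncoding = Encoding.UTF8;
// Example: x squared and the formula for water.
Console.WriteLine("x" + ScriptText.ToSuperscript("2") + "  H" + ScriptText.ToSubscript("2") + "O");

static class ScriptText
{
    // Superscript digits 0-9 are not contiguous in Unicode, so list them explicitly.
    private const string SuperDigits =
        "\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079";

    // Maps digits and 'a' to superscript forms; other characters pass through unchanged.
    public static string ToSuperscript(string text) =>
        string.Concat(text.Select(c => c switch
        {
            >= '0' and <= '9' => SuperDigits[c - '0'],
            'a' => '\u1D43',
            _ => c
        }));

    // Subscript digits 0-9 are contiguous, starting at U+2080.
    public static string ToSubscript(string text) =>
        string.Concat(text.Select(c => c switch
        {
            >= '0' and <= '9' => (char)('\u2080' + (c - '0')),
            'a' => '\u2090',
            _ => c
        }));
}
```

Since these are ordinary Unicode characters, they can be saved in a SQLite TEXT column like any other string.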

 

 

Origin www.cnblogs.com/wenqingluomo/p/12003588.html