Crawler series (1): Clarifying the business requirements

1. Project background

The business department asked us to download files from the State Grid website. After detailed discussion, the requirements were gradually clarified as follows:

  • Data source
    http://ecp.sgcc.com.cn/ecp1.0/project_list.jsp?site=global&column_code=014001001&project_type=1
  • Crawl all bidding announcements on the State Grid e-commerce platform and download the "Project Announcement File" of each one
  • After decompressing each downloaded project announcement file, find the Excel sheet whose name contains the words "list of goods"
  • Merge all goods lists into a single csv file
  • Extract the 15 columns specified for the goods list Excel sheet:
    'package number', 'provincial purchase application number', 'item unit', 'demand unit', 'item name',
    'range voltage level', 'Material name', 'Material description', 'Unit', 'Quantity', 'Delivery date',
    'Delivery place', 'Remarks', 'Technical specification ID', 'Status'
  • The 'Status' column is not part of the Excel sheet; it is appended afterwards and records the current status of the bidding project, for example 'bidding closed', 'under evaluation', or 'evaluation finished'
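
The merge requirement above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code used later in the series: the function name `merge_goods_lists` is hypothetical, the rows are assumed to have already been read from the Excel sheets as 14-item sequences, and the standard `csv` module is used here even though it does not appear in the tool list below.

```python
import csv
import io

# Column names as listed in the requirements (translated). The last one,
# 'Status', is not in the Excel sheets and is appended during the merge.
COLUMNS = [
    "package number", "provincial purchase application number", "item unit",
    "demand unit", "item name", "range voltage level", "Material name",
    "Material description", "Unit", "Quantity", "Delivery date",
    "Delivery place", "Remarks", "Technical specification ID", "Status",
]


def merge_goods_lists(goods_lists, out):
    """Write all goods lists to one CSV stream and return the row count.

    goods_lists: iterable of (status, rows) pairs, where every row holds the
    14 values read from one line of a goods list sheet (hypothetical input).
    out: any text file object opened for writing.
    """
    writer = csv.writer(out)
    writer.writerow(COLUMNS)
    count = 0
    for status, rows in goods_lists:
        for row in rows:
            if len(row) != len(COLUMNS) - 1:
                raise ValueError("expected 14 values, got %d" % len(row))
            # Append the project status as the 15th column.
            writer.writerow(list(row) + [status])
            count += 1
    return count


if __name__ == "__main__":
    # Made-up placeholder values, only to show the call.
    sample = [("under evaluation", [["value %d" % i for i in range(14)]])]
    buffer = io.StringIO()
    print(merge_goods_lists(sample, buffer), "row(s) written")
```

Checking the row length before writing makes a sheet with an unexpected layout fail loudly instead of silently shifting columns in the merged file.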

The requirements were organized in an XMind mind map:
[Figure: Clarify business needs]

2. Process steps

  • Download the files
  • Unzip the files
  • Fix the encoding
  • Find the "goods list" sheet
  • Merge the data
  • Append the "Project Status" column to the merged data
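
As a rough illustration of the unzip and search steps, here is a minimal sketch using only `zipfile`, `shutil` and `os` from the tool list below. It rests on several assumptions that the article does not confirm: that the third step is about repairing the encoding of extracted file names, that archives without the UTF-8 flag store their names in GBK, and that the goods list can be recognized by a keyword in the file name. The names `fix_name` and `extract_and_find` are hypothetical.

```python
import os
import shutil
import zipfile


def fix_name(info):
    """Return a readable member name for a ZipInfo entry."""
    # Bit 11 of flag_bits means the name was stored as UTF-8, in which case
    # zipfile has already decoded it correctly.
    if info.flag_bits & 0x800:
        return info.filename
    # Otherwise zipfile decoded the raw bytes as cp437. Assumption: the
    # archive was created on a Chinese system, so the bytes are really GBK.
    try:
        return info.filename.encode("cp437").decode("gbk")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return info.filename


def extract_and_find(zip_path, dest_dir, keyword):
    """Extract every file into dest_dir and return the Excel files whose
    name contains keyword."""
    os.makedirs(dest_dir, exist_ok=True)
    matches = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Keep only the base name, so members cannot escape dest_dir.
            # Files with the same name in different folders overwrite each other.
            name = os.path.basename(fix_name(info))
            if not name:
                continue
            target = os.path.join(dest_dir, name)
            with zf.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            if keyword in name and name.lower().endswith((".xls", ".xlsx")):
                matches.append(target)
    return matches
```

A call such as `extract_and_find("announcement.zip", "unzipped", "goods list")` would return the paths of the matching Excel files; the archive name, folder and keyword here are placeholders.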

3. Tools and technology

  • requests
  • lxml
  • time
  • urllib
  • re, os
  • shutil
  • zipfile
  • threading

The above is the breakdown of the requirements and the overall approach.
Next, each step is implemented in code, starting with: Crawler series (2): Page analysis and information acquisition

Origin blog.csdn.net/weixin_42961082/article/details/114404089