Crawling all stock news reports with Python

Foreword

Since my graduation project performs sentiment analysis on stock news reports, I need to crawl the news reports of every individual stock. At first I planned to crawl all of the stock information directly from
[Eastmoney](http://quote.eastmoney.com/stocklist.html), but getting the full stock information list there requires simulating page events, so I only take the stock code list from that page and crawl the news reports themselves from Sina Finance.

Prerequisite work

1. Install python3

([download address](https://www.python.org/downloads/)). During installation, check the option to add Python to the environment variables. If you forget, you can add the Python 3 installation path manually via [right-click My Computer] -> [Properties] -> [Advanced system settings] -> [Environment Variables] -> [Path].

2. Install the requests library via the command line:

>pip install requests


3. Install lxml

>pip install lxml


4. Install pyquery

>pip install pyquery

For details on PyQuery usage, see [静觅 (cuiqingcai.com)](https://cuiqingcai.com/) » [Usage of PyQuery in Python Crawler Tool Six](https://cuiqingcai.com/2636.html); a quick usage check follows below.
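
As a quick check that PyQuery is installed and works the way the crawler below uses it, here is a minimal sketch (the HTML fragment is made up for illustration, in the same shape as the Eastmoney stock list):

    from pyquery import PyQuery as pq

    # made-up fragment, just to confirm the selector syntax used later works
    html = '<ul><li><a target="_blank" href="http://quote.eastmoney.com/sh600000.html">浦发银行(600000)</a></li></ul>'
    doc = pq(html)
    for a in doc("li a[target='_blank']").items():
        print(a.text(), a.attr("href"))
    # -> 浦发银行(600000) http://quote.eastmoney.com/sh600000.html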


5. Install pymysql

This assumes MySQL itself is already installed; then, as with the other packages, run
>pip install pymysql


Crawl data

1. Crawl all stock symbols

The URL is the stock code list of Eastmoney (http://quote.eastmoney.com/stocklist.html). It is a static web page, so crawling it is relatively simple.

Analyzing the page shows that each stock is an a tag with target="_blank": the stock code is the text inside the parentheses of the tag text, and the stock name is the text before the parentheses. So the text of each tag is processed with the split function to obtain the stock code and stock name.


    
    import requests
    from pyquery import PyQuery as pq

    def getCodes():
        codes = []
        url = 'http://quote.eastmoney.com/stocklist.html'
        req = requests.get(url, timeout=30)
        html = pq(req.text)
        stock_a_list = html("#quotesearch ul li a[target='_blank']").items()
        for stock_a in stock_a_list:
            num = stock_a.text().split('(')[1].strip(')')
            # skip codes starting with 1/5/2 (funds, bonds, B shares); keep 0/3/6
            if num.startswith('1') or num.startswith('5') or num.startswith('2'):
                continue
            sname = stock_a.text().split('(')[0]
            # transcode: the page is served as iso-8859-1 but the text is really gbk
            sname = sname.encode("iso-8859-1").decode('gbk')
            print(sname)
            record = {}  # holds the code and name of one stock
            record["sname"] = sname
            record["num"] = num
            codes.append(record)
        return codes  # return after the loop, not inside it
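
A quick usage check of getCodes (the printed output here is illustrative):

    codes = getCodes()
    print(len(codes))   # number of stocks kept after filtering
    print(codes[:2])    # e.g. [{'sname': '平安银行', 'num': '000001'}, ...]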


2. Crawl the news reports of each stock

After getting the stock codes, we need the detail pages of each stock's news reports; these are relatively easy to obtain from Sina Finance.

* Analyze the address

Looking at the company news pages of a few stocks, for example http://vip.stock.finance.sina.com.cn/corp/view/vCB_AllNewsStock.php?symbol=sz000725&Page=3, the pattern is **http://vip.stock.finance.sina.com.cn/corp/view/vCB_AllNewsStock.php?symbol=** + (sz or sh) + code + **&Page=** + page number. The prefix is sz for stocks listed in Shenzhen and sh for stocks listed on the Shanghai Stock Exchange; a small sketch of building this URL follows below.
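
A minimal sketch of assembling that address (build_report_url is a hypothetical helper; the sz/sh choice assumes the usual A-share convention that codes starting with 6 are Shanghai and codes starting with 0 or 3 are Shenzhen):

    def build_report_url(code, page):
        # assumed rule: Shanghai codes start with 6, Shenzhen codes with 0 or 3
        prefix = 'sh' if code.startswith('6') else 'sz'
        base = 'http://vip.stock.finance.sina.com.cn/corp/view/vCB_AllNewsStock.php'
        return '{}?symbol={}{}&Page={}'.format(base, prefix, code, page)

    print(build_report_url('000725', 3))
    # -> http://vip.stock.finance.sina.com.cn/corp/view/vCB_AllNewsStock.php?symbol=sz000725&Page=3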

* Analyze web pages & remove delisted and unlisted stocks

After fetching the HTML with requests and parsing it with pyquery, analyze the structure of the web page. First filter out stocks that have been delisted or are not yet listed.
            

          
        
        
    # the element with id "closed" marks the listing status of the stock
    isclose = report_list_wrap("#closed").text()
    if isclose == "Delisted" or isclose == "Not listed":  # translated page markers
        flag = False
        continue


* Analyze the report structure of the webpage

The news items are all the a tags in the div with class=datelist. Getting the href of each a tag gives the link to a news report, and the content of the report is then fetched through that link.
                             
     
    report_list = report_list_wrap("#con02-7 .datelist ul a").items()
    # iterate over the news items of this stock
    for r in report_list:
        try:
            # transcode the title: the page is served as iso-8859-1 but is really gbk
            report_title = r.text().encode("iso-8859-1").decode('gbk')
            print("Title: " + report_title)
            # get the article link and fetch the article body
            report_url = r.attr("href")
            req = requests.get(report_url, timeout=30)
            reporthtml = pq(req.text)
            content = reporthtml("#artibody").text()
        except:
            flag = False
            print(code['num'] + report_title + " error")


* Analyze the web page & solve the encoding problem

The report content obtained in the previous step turns out to be garbled, because requests falls back to ISO-8859-1 when the response headers do not declare a charset, while the page itself uses a different encoding. The fix is as follows:


                         
                       
    # solve the encoding problem: requests falls back to ISO-8859-1
    # when the response headers do not declare a charset
    # print(req.encoding)
    if req.encoding == 'ISO-8859-1':
        encodings = requests.utils.get_encodings_from_content(req.text)
        if encodings:
            encoding = encodings[0]
        else:
            encoding = req.apparent_encoding
        # decode with the real encoding instead of the naive transcoding above
        reporthtml = req.content.decode(encoding, 'replace')






Tips:

The bytes of utf-8 (or gbk) encoded text can always be decoded as iso-8859-1, but not the other way around: iso-8859-1 is a single-byte encoding that assigns a character to every byte value, while utf-8 is a variable-length multi-byte encoding. Converting Chinese text into iso-8859-1 is like converting high precision to low precision, it loses information and is irreversible, because Chinese characters have no matching position in iso-8859-1. That is why the garbled text can be recovered by re-encoding it as iso-8859-1 and decoding with the real encoding.
Reference:

* https://blog.csdn.net/kelindame/article/details/75014485
* https://www.cnblogs.com/GUIDAO/p/6679574.html
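
A small sketch of the round trip described in the tip above:

    s = '股票'                                        # some Chinese text
    gbk_bytes = s.encode('gbk')                       # the bytes the server really sends
    mojibake = gbk_bytes.decode('iso-8859-1')         # what the ISO-8859-1 fallback produces
    recovered = mojibake.encode('iso-8859-1').decode('gbk')  # re-encode to recover the bytes, then decode correctly
    print(recovered == s)                             # True

    # the reverse direction is lossy: Chinese has no position in iso-8859-1
    try:
        s.encode('iso-8859-1')
    except UnicodeEncodeError as e:
        print('cannot encode:', e)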

Insert into database

The last step is to insert the obtained data into the database.


                   
    sql = ("insert into stock(rid,scode,sname,rdate,rtitle,report,emotion) VALUES (%s,%s,%s,%s,%s,%s,%s)")
    data_report = (str(id), code['num'], code['sname'], rdate, report_title, content, '-2')
    id = id + 1
    try:
        # execute the sql statement
        cursor.execute(sql, data_report)
        print(code['sname'] + " inserted successfully")
        # commit to the database
        db.commit()
    except Exception as e:
        print('probably a timeout:', e)
        db.rollback()
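
For completeness, db and cursor in the snippet above come from a pymysql connection. A minimal sketch, assuming a local MySQL instance; the connection parameters and column types here are illustrative assumptions, not the original schema:

    import pymysql

    # illustrative connection parameters; adjust to your own MySQL setup
    db = pymysql.connect(host='localhost', user='root', password='your_password',
                         database='stock_db', charset='utf8mb4')
    cursor = db.cursor()

    # one possible layout for the stock table used above (assumed column types)
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS stock (
            rid     VARCHAR(20) PRIMARY KEY,
            scode   VARCHAR(10),
            sname   VARCHAR(50),
            rdate   VARCHAR(20),
            rtitle  VARCHAR(255),
            report  TEXT,
            emotion VARCHAR(5)
        )
    """)
    db.commit()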

Alright, this completes our crawler work.

I'm c6j, a programmer with a sense of ceremony.
 




