Double your stocks and win at life: crawling fund holdings with Python to screen stocks

Preface:

I hear you want to get rich? Then get rich slowly, on your own. You know the saying: "The poor are not willing to get rich slowly." We all daydream about lottery tickets, but who is really that lucky? You can't all be like me, winning 780,000 on a lottery ticket and then quietly spending it myself.

I, for one, want to get rich slowly; sound money management is all it takes: save a little money, earn a lot of money! I used to agonize over how to achieve financial freedom, which gave me the idea of studying personal finance. And talking about finance means talking about financial products: gold, futures, stocks, funds, and so on. Since this article crawls stocks and funds, here is a quick primer on both;

Stock (English: stock), or capital stock (English: capital stock), is a kind of security through which a joint-stock company divides up its ownership. Because a joint-stock company needs to raise funds, it issues stock to investors as proof of ownership of part of the company's capital; shareholders receive dividends and share in the profits from the company's growth or market swings, but they also share the operational risks brought by the company's mistakes. ------ from Wikipedia

Fund: let me give an example. You have money in hand and want to buy stocks, but you know nothing about them; I have no money, but a wealth of financial knowledge and trading experience, a real finance player. So we strike a deal: you hand your money to me, I invest it, and we split the profits. The "I" here is the fund; ------ from the author's own understanding

Overall: stocks are high-yield, high-risk; funds are low-yield, low-risk, because a fund holds many stocks, and it basically never happens that all of them rise or fall at once, so a fund is far more resistant to risk. Having said that, I don't want to buy some middling fund. I want to buy stocks, and good stocks at that, but I don't understand stocks. What to do? Er, well, where there's a will, there's a way. Think about it: the funds holding all these stocks are institutions, staffed by the sharpest financial minds. We can see which stocks they bought, so why not just buy whatever the funds buy? After all, they don't want to lose their own money, so they will have picked stocks with potential.

Main text

This article uses Python to crawl a fund finance website, fetching the stock holdings of 5000+ funds, and then processes the results.

Since there have been plenty of cases where crawling data got a whole company arrested, a disclaimer first: never use crawlers for anything illegal; be resolutely patriotic, do good deeds without leaving a name, help more grannies across the road, and please, officer, don't take me away over this article.

Knowledge this article touches on:

1. Python strings: splitting, concatenation, detecting Chinese characters;

2. Python regular expressions;

3. Crawling with the requests library, extracting data with XPath, using a proxy server;

4. selenium usage: headless browser, locating elements, explicit waits for data to load;

5. Operating MongoDB from Python.

Site analysis

The code and data will be posted further down; first, let's analyze the target site, which will make the crawling process clearer;

Target site: http://fund.eastmoney.com/data/fundranking.html#tall;c0;r;szzf;pn50;ddesc;qsd20181126;qed20191126;qdii;zq;gg;gzbd;gzfs;bbzt;sfbb

We are crawling the data under [Open-end Funds]:
Click into any fund at random and you reach its detail page. Have you noticed? The URL of a fund's detail page is just the site root http://fund.eastmoney.com/ combined with the fund's code, for example:

040011 --- Huaan Core Preferred Mixed, URL: http://fund.eastmoney.com/040011.html

005660 --- Harvest Resources Select Equity A (嘉实资源精选股票A), URL: http://fund.eastmoney.com/005660.html

OK, good. Scroll down on the fund's detail page and you will find its stock holdings, i.e. which stocks the fund has bought:
Then click 更多 (More) to enter the holdings detail page; scroll down and you will see the fund's stock holdings for the past three quarters:
Right, this is the target data, the data to crawl;

OK, let's hold off on crawling and analyze this holdings detail page first. Its URL also follows a pattern: it is http://fundf10.eastmoney.com/ccmx_ combined with the fund's code, for example:

005660, Harvest Resources Select Equity A (嘉实资源精选股票A), holdings detail page URL: http://fundf10.eastmoney.com/ccmx_005660.html

006921, China Southern Zhicheng Mixed (南方智诚混合), holdings detail page URL: http://fundf10.eastmoney.com/ccmx_006921.html
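In other words, both page URLs can be derived from the six-digit fund code alone. A minimal sketch (the helper names here are mine, not the author's):

def fund_detail_url(code):
    # fund detail page: site root + fund code
    return "http://fund.eastmoney.com/%s.html" % code

def holdings_url(code):
    # holdings detail page: ccmx_ prefix + fund code
    return "http://fundf10.eastmoney.com/ccmx_%s.html" % code

print(fund_detail_url("005660"))  # http://fund.eastmoney.com/005660.html
print(holdings_url("005660"))     # http://fundf10.eastmoney.com/ccmx_005660.html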


Because this data is loaded dynamically by JavaScript, crawling it with requests seems very hard; in such cases the usual move is to use selenium to simulate browser behavior, but selenium crawls noticeably slowly. In fact we can still use requests: dynamic loading just means that JS code in the HTML page runs and fetches the data from the server itself, which is why the data is invisible in the page as initially crawled. Only particularly stubborn data requires selenium, whose slogan is: if you can see it, you can get it; after all, selenium mimics a human driving a browser. Here we analyze the dynamic JS request and crawl it with requests, then fall back to selenium for the second-stage crawl.

On the home page, press F12 to open the developer tools, then refresh the page.
See the data in the blue box on the right? That is the data returned by the dynamic JS request, which the page then renders. We only need to grab this data; there is no need to crawl the home page itself;

Now click Headers: the Request URL is the URL the JS requests. Try pasting it straight into a browser and hitting Enter; it returns a big pile of data. We analyzed the makeup of the holdings page URL above, so all we need from this data is each six-digit fund code. The code in this article extracts the six digits with a Python regular expression and assembles the holdings page URL, then crawls and stores the stocks each fund holds;

Crawl flow:

1. First request the URL that the home page's JS uses to load its data, and extract the six-digit fund codes from the response; then http://fundf10.eastmoney.com/ccmx_ + fund code + .html gives each fund's holdings detail page URL;

2. Crawl each holdings detail page URL. Since this page is also JS-loaded (though it loads fairly fast), and we must check whether the fund holds any stocks at all (some funds hold none; who knows what they are up to), we use selenium here, with an explicit wait for the data to finish loading;

3. Tidy the data and store it in MongoDB.

Code walkthrough --- crawling the data:

This time the code is posted in segments and explained segment by segment;

Required libraries:

import requests
import re
from lxml import etree
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pymongo

Some helper methods we will need:

#Check whether a string contains Chinese characters
def is_contain_chinese(check_str):
    """
    Check whether the string contains Chinese characters
    :param check_str: {str} the string to check
    :return: {bool} True if it contains Chinese, False otherwise
    """
    for ch in check_str:
        if u'\u4e00' <= ch <= u'\u9fff':
            return True
    return False
#Use selenium to check, by class name, whether an element exists; used to decide whether a fund's holdings detail page actually lists any stocks
def is_element(driver,element_class):
    try:
        WebDriverWait(driver,2).until(EC.presence_of_element_located((By.CLASS_NAME,element_class)))
    except:
        return False
    else:
        return True
#Request a URL with requests and return the processed response text
def get_one_page(url):
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
    }
    proxies = {
        "http": "http://XXX.XXX.XXX.XXX:XXXX"
    }

    response = requests.get(url,headers=headers,proxies=proxies)
    response.encoding = 'utf-8'
    if response.status_code == 200:
        return response.text
    else:
        print("Status code != 200, bad url.")
        return None
#Request the home page data directly, process it, and assemble the holdings page urls and fund names into two lists
def page_url():
    stock_url = []      #list for the holdings detail page url of each fund
    stock_name = []     #list for the name of each fund
    url = "http://fund.eastmoney.com/data/rankhandler.aspx?op=ph&dt=kf&ft=all&rs=&gs=0&sc=zzf&st=desc&sd=2018-11-26&ed=2019-11-26&qdii=&tabSubtype=,,,,,&pi=1&pn=10000&dx=1&v=0.234190661250681"
    result_text = get_one_page(url)
    # print(result_text.replace('\"',','))    #replace " with ,
    # print(result_text.replace('\"',',').split(','))    #split on ,
    # print(re.findall(r"\d{6}",result_text))     #print the six-digit fund codes as a list
    for i in result_text.replace('\"',',').split(','):  #replace " with , then split on , and keep the items containing Chinese characters (the fund names)
        result_chinese = is_contain_chinese(i)
        if result_chinese == True:
            stock_name.append(i)
    for numbers in re.findall(r"\d{6}",result_text):
        stock_url.append("http://fundf10.eastmoney.com/ccmx_%s.html" % (numbers))    #append the assembled url to the list
    return stock_url,stock_name
#Request a [holdings detail page url] with selenium and crawl the names of the stocks the fund holds
def hold_a_position(url):
    driver.get(url)  # request the fund's holdings page
    element_result = is_element(driver, "tol")  # does this element exist? used to decide whether there are any holdings
    if element_result == True:  # if there are holdings, crawl them
        wait = WebDriverWait(driver, 3)  # set a wait timeout
        element = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'tol')))  # wait for this class to appear
        ccmx_page = driver.page_source  # grab the page source
        ccmx_xpath = etree.HTML(ccmx_page)  # parse into an xpath-able tree
        ccmx_result = ccmx_xpath.xpath("//div[@class='txt_cont']//div[@id='cctable']//div[@class='box'][1]//td[3]//text()")
        return ccmx_result
    else:   # no holdings: return the string "null"
        return "null"

Note the page_url() method: its url is the one found above when analyzing the JS-loaded data. Pay attention to the parameters at the end: pi is the page index, pn the number of records per page. Here pi=1, pn=10000 means page one with 10000 records, so all the data comes back in a single request (there is nowhere near that much; the site only lists 5000+);
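For illustration, a minimal sketch of calling the same endpoint with an explicit params dict instead of a hand-assembled query string (the parameter names are taken from the url above; the abbreviated User-Agent is just for the sketch):

import requests

base = "http://fund.eastmoney.com/data/rankhandler.aspx"
params = {
    "op": "ph", "dt": "kf", "ft": "all", "sc": "zzf", "st": "desc",
    "sd": "2018-11-26", "ed": "2019-11-26",
    "pi": 1,      # page index: first page
    "pn": 10000,  # records per page: large enough to return everything at once
}
headers = {"User-Agent": "Mozilla/5.0"}
result_text = requests.get(base, params=params, headers=headers).text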

Program entry point:

if __name__ == '__main__':
    # connect to the mongodb database
    client = pymongo.MongoClient(host='XXX.XXX.XXX.XXX', port=XXXXX)  # connect to mongodb; host is the ip, port is the port
    db = client.db_spider  # use (create) the database
    db.authenticate("username", "password")  # authenticate with the mongodb username and password
    collection = db.tb_stock  # use (create) a collection (table)

    stock_url, stock_name = page_url()     # fetch the home page data; returns the list of fund urls and the list of fund names

    # browser setup
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    driver = webdriver.Chrome(options=chrome_options)    # start the browser, with no visible window

    if len(stock_url) == len(stock_name):       # check that the fund url and fund name counts match
        for i in range(len(stock_url)):
            return_result = hold_a_position(stock_url[i])  # crawl each fund's holdings; returns the held stock names as a list
            dic_data = {
                'fund_url':stock_url[i],
                'fund_name':stock_name[i],
                'stock_name':return_result
            }        # dic_data is the assembled dictionary, ready for storage in mongodb
            print(dic_data)
            collection.insert_one(dic_data)     # insert dic_data into mongodb
    else:
        print("Fund url and fund name counts differ, exiting.")
        exit()

    driver.close()              # close the browser

    # query: filter out the non-null records
    find_stock = collection.find({'stock_name': {'$ne': 'null'}})  # find records whose stock_name is not "null" (exclude funds with no stock holdings)
    for i in find_stock:
        print(i)

Good. That is all of the crawling code; run it and just wait;

The project runs as a single process, so crawling is a bit slow, and network speed also plays a part; a later version will be improved to use multiple threads.
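For reference, a minimal sketch of what a threaded version might look like, assuming hold_a_position is first refactored to accept the driver as a parameter (each worker needs its own headless Chrome; a single driver instance is not safe to share across threads):

from concurrent.futures import ThreadPoolExecutor
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def crawl_one(url):
    # each task gets its own headless browser instance
    opts = Options()
    opts.add_argument('--headless')
    local_driver = webdriver.Chrome(options=opts)
    try:
        return hold_a_position(local_driver, url)  # hypothetical refactor: driver passed in
    finally:
        local_driver.quit()

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(crawl_one, stock_url))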

Code walkthrough --- processing the data:

The data crawled above is already stored in the database; now we process it into something usable;

First, the idea:

1. We need the combined stock holdings of all these funds, duplicates included;

2. We need to know which stocks repeat, and how many times each one repeats.

This way, the stocks with the highest repeat counts are surely the best picks, because it proves that many funds bought them.

The code below, with its comments, spells everything out clearly:

import pymongo

#1. Database: connect, select the database and collections#
client = pymongo.MongoClient(host='XXX.XXX.XXX.XXX',port=XXXXX)  #connect to the mongodb database

db = client.db_spider       #use (create) the database
db.authenticate("username","password")      #authenticate with username and password

collection = db.tb_stock    #use (create) the collection (table) that already holds the data crawled above
tb_result = db.tb_data      #use (create) a collection (table) to store the final processed data

#find records whose stock_name is not "null", i.e. exclude funds with no stock holdings
find_stock = collection.find({'stock_name':{'$ne':'null'}})

#2. Process the data: concatenate all the stock lists into one list --- list_stock_all#
list_stock_all = []     #a list holding every stock name, duplicates included
for i in find_stock:
    print(i['stock_name'])    #print the fund's held stocks (a list)
    list_stock_all = list_stock_all + i['stock_name']   #merge all the stock lists into one
print("Total stocks: " + str(len(list_stock_all)))

#3. Process the data: deduplicate the stocks#
list_stock_repetition = []  #a list for the deduplicated stocks
for n in list_stock_all:
    if n not in list_stock_repetition:        #if not seen yet
        list_stock_repetition.append(n)        #append it, deduplicating as we go
print("Stocks after dedup: " + str(len(list_stock_repetition)))

#4. Combine the two lists from steps 2 and 3 to filter the data#
for u in list_stock_repetition:        #iterate over the deduplicated stocks
    if list_stock_all.count(u) > 10:   #count the stock's occurrences in the full list; keep it if it repeats more than 10 times
        #assemble a dictionary for storage in mongodb
        data_stock = {
            "name":u,
            "numbers":list_stock_all.count(u)
        }
        insert_result = tb_result.insert_one(data_stock)    #store in mongodb
        print("Stock: " + u + " , count: " + str(list_stock_all.count(u)))

And with that, the processed data has made its way into the tb_data collection;

Only part of the processed data is disclosed below:

{'_id': ObjectId('5e0b5ecc7479db5ac2ec62c9'), 'name': '水晶光电', 'numbers': 61}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62ca'), 'name': '老百姓', 'numbers': 77}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62cb'), 'name': '北方华创', 'numbers': 52}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62cc'), 'name': '金风科技', 'numbers': 84}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62cd'), 'name': '天顺风能', 'numbers': 39}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62ce'), 'name': '石大胜华', 'numbers': 13}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62cf'), 'name': '国投电力', 'numbers': 55}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62d0'), 'name': '中国石化', 'numbers': 99}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62d1'), 'name': '中国石油', 'numbers': 54}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62d2'), 'name': '中国平安', 'numbers': 1517}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62d3'), 'name': '贵州茅台', 'numbers': 1573}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62d4'), 'name': '招商银行', 'numbers': 910}

Note that the data has not been sorted yet;
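If sorted output is wanted, pymongo can sort the cursor server-side; a small sketch, assuming the same tb_data collection (tb_result) as above:

for doc in tb_result.find().sort("numbers", pymongo.DESCENDING):
    print(doc)  # processed records, highest repeat count first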

Reading the data:

China Petrochemical (中国石化) has numbers = 54, meaning 54 of the 5000+ funds bought China Petrochemical shares;

China Merchants Bank (招商银行) has numbers = 910, meaning 910 of the 5000+ funds bought China Merchants Bank shares;

......

Er, well, there isn't much more to say about this;

Finally: investors must be careful, the stock market carries risk. This article is for learning only; invest at your own discretion, and the author takes no responsibility;


Origin: blog.51cto.com/13577495/2466127