Question: Given a text file in which each line holds the information for one stock, write a program to extract all stock codes. A stock code consists of 6 digits followed by .SH or .SZ.
Example of file content:
2020-08-08; Ping An Bank (000001.SZ); 15.55; 294.00 billion
2020-08-08; Hengrui Medicine (600276.SH); 95.32; 495.65 billion (including non-tradable market value)
... ...
2020-08-08; CATL (300750.SZ); 205.32; 465.7 billion
Output:
['000001.SZ', '600276.SH', ..., '300750.SZ']
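Answer:
import re
# Open the file and read all lines
with open('file.txt', 'r') as f:
    lines = f.readlines()
# Define the regex: 6 digits followed by .SH or .SZ
# (the character class [SZ]{2} would never match "SH", so use an alternation)
pattern = re.compile(r'\d{6}\.(?:SH|SZ)')
# Extract the stock code from each line
codes = []
for line in lines:
    match = pattern.search(line)
    if match:
        codes.append(match.group())
# Print all stock codes
print(codes)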
In this sample code, we first open the file and read all the lines into a list. Then we define a regex that matches stock codes. Next, we loop over the lines and search each one for a code that matches the regex. Finally, we collect all matching codes in a list and print it.
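The same steps can also be packaged as a small function. The following is only a sketch: the names CODE_PATTERN and extract_codes are hypothetical, and the lookbehind is an extra assumption not stated in the question (it stops the pattern from matching the tail of a longer run of digits). It uses re.findall, so a line that contains more than one code yields all of them.

```python
import re

# Six digits followed by .SH or .SZ, per the rule in the question.
# The (?<!\d) lookbehind is an added assumption: it rejects a match
# that starts in the middle of a longer digit run.
CODE_PATTERN = re.compile(r'(?<!\d)\d{6}\.(?:SH|SZ)')

def extract_codes(text):
    """Return every stock code found in text, in order of appearance."""
    return CODE_PATTERN.findall(text)

# Sample lines taken from the example in the question.
sample = (
    "2020-08-08; Ping An Bank (000001.SZ); 15.55; 294.00 billion\n"
    "2020-08-08; Hengrui Medicine (600276.SH); 95.32; 495.65 billion\n"
    "2020-08-08; CATL (300750.SZ); 205.32; 465.7 billion\n"
)
print(extract_codes(sample))  # ['000001.SZ', '600276.SH', '300750.SZ']
```

To run it on a real file, pass the file's contents, for example extract_codes(f.read()).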
Note that the filename here should be changed to your actual filename, not 'file.txt'.
However, running this code may raise an error:
content = f.readlines()
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa1 in position 21: illegal multibyte sequence
Reason:
This error usually means that the file's encoding does not match the encoding Python uses to read it. If no encoding is given, open() uses the system default, which here is gbk. You can specify the file's encoding explicitly, for example:
with open('file.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
So when this problem occurs, just change
with open('file.txt', 'r') as f:
    lines = f.readlines()
to
with open('file.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
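If you do not know the file's encoding in advance, one possible approach is to try a few candidates in order. The helper below is a sketch only: the name read_lines and the candidate list ('utf-8', 'gbk') are assumptions, not part of the original answer.

```python
def read_lines(path, encodings=('utf-8', 'gbk')):
    """Return the file's lines, decoded with the first encoding that works."""
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for enc in encodings:
        try:
            with open(path, 'r', encoding=enc) as f:
                return f.readlines()
        except UnicodeDecodeError as err:
            # This candidate could not decode the file; try the next one.
            last_error = err
    # No candidate worked, so re-raise the last decoding error.
    raise last_error

# Usage (replace 'file.txt' with your actual filename):
# lines = read_lines('file.txt')
```

UTF-8 is tried first because it is strict: bytes written in another encoding will usually fail to decode as UTF-8, whereas gbk may accept some UTF-8 bytes without error and produce garbled text.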