Preface
Today I would like to share something very strange: I ran into an invisible character while writing code!
1. Cause
Today I suddenly got an error when using pipreqs to export the libraries my project depends on:
pipreqs . --encoding=utf-8 --force
# The error output follows
ERROR: Failed on file: ./build.py
Traceback (most recent call last):
  File "/usr/local/bin/pipreqs", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/pipreqs/pipreqs.py", line 528, in main
    init(args)
  File "/usr/local/lib/python3.8/dist-packages/pipreqs/pipreqs.py", line 455, in init
    candidates = get_all_imports(input_path,
  File "/usr/local/lib/python3.8/dist-packages/pipreqs/pipreqs.py", line 131, in get_all_imports
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/pipreqs/pipreqs.py", line 117, in get_all_imports
    tree = ast.parse(contents)
  File "/usr/lib/python3.8/ast.py", line 47, in parse
    return compile(source, filename, mode, flags,
  File "<unknown>", line 1
    # -*- coding:utf-8 -*-
    ^
SyntaxError: invalid character in identifier
It failed immediately with a SyntaxError, and # was reported as an invalid character. The # character is surely innocent, and I was shocked! This seemed utterly absurd: it is just a one-line code comment, so how could it be wrong?
2. Investigation
This was the first time I had run into something this bizarre, so I checked the pipreqs source code. The code is very simple, so here is an excerpt of the part that reported the error:
# pipreqs/pipreqs.py line 112
for file_name in files:
    file_name = os.path.join(root, file_name)
    with open(file_name, "r", encoding=encoding) as f:
        contents = f.read()
    try:
        tree = ast.parse(contents)  # the error is raised here
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for subnode in node.names:
                    raw_imports.add(subnode.name)
            elif isinstance(node, ast.ImportFrom):
                raw_imports.add(node.module)
    except Exception as exc:
        ...
The logic is easy to follow: pipreqs reads every python file under the current project, then uses the ast library to parse each one and collect the names of the libraries that file depends on. Since this was the part that reported the error, I pulled it out to test on its own. At this point I strongly suspected that python's ast library had a bug!
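The import-collection step described above can be sketched on its own. This is a minimal illustration modeled on the excerpt, not the actual pipreqs implementation; the function name `collect_imports` and the sample source are assumptions made up for the example.

```python
import ast
from typing import Set


def collect_imports(source: str) -> Set[str]:
    """Return the module names imported by a piece of Python source."""
    found: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b as c" contributes "a.b"
            for alias in node.names:
                found.add(alias.name)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # "from a.b import c" contributes "a.b"; a bare relative import
            # such as "from . import c" has module=None and is skipped here
            found.add(node.module)
    return found


sample = "import os\nfrom json import loads\nimport numpy.linalg as la\n"
print(sorted(collect_imports(sample)))  # ['json', 'numpy.linalg', 'os']
```

Note that ast.parse only analyzes the text and never imports anything, so the modules named in the source do not need to be installed.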
3. Plot twist
To confirm that ast had a bug when parsing files, I tested all the python files in the current project one by one. However, things did not go as I expected: the second file (process_data.py) parsed without any problem!
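Testing the files one by one can be automated with a small loop. The following is a rough sketch under the assumption that every file is read as utf-8, as in the pipreqs command above; `parse_each_file` is a hypothetical helper name, not part of pipreqs or the standard library.

```python
import ast
from pathlib import Path
from typing import Dict, Optional


def parse_each_file(root: str, encoding: str = "utf-8") -> Dict[str, Optional[str]]:
    """Map every .py file under root to None (parsed) or an error message."""
    results: Dict[str, Optional[str]] = {}
    for path in sorted(Path(root).rglob("*.py")):
        source = path.read_text(encoding=encoding)
        try:
            ast.parse(source, filename=str(path))
            results[str(path)] = None
        except SyntaxError as exc:
            # Keep going so that every failing file is reported, not just the first
            results[str(path)] = f"line {exc.lineno}: {exc.msg}"
    return results
```

Calling `parse_each_file(".")` from the project root returns one entry per file, which makes it easy to see that only some files fail.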
I looked at this file: it has the same comment at the beginning, yet no error was reported. Was something wrong with the encoding? I opened the failing file in Pycharm and took a look, and nothing seemed wrong.
This was too weird, so I stepped through it again in the debugger and looked at the contents of the file. Nothing looked wrong there either.
I couldn't figure it out, so I asked ChatGPT.
It suggested four possible causes, and I ruled most of them out one by one: the interpreter was python 3.8, the files were utf-8 encoded, there were no syntax errors, and there was nothing wrong with the comments. But there was one item I did not understand: invisible special characters?
Invisible? If it is a character, it must occupy a position even if it cannot be seen. Then the weirdest thing happened: there really was an empty-looking character in the first position of the file. Can an empty character still take up space?
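After printing its value with ord(), I found it was 65279, so the empty-looking character really did have a value. Strictly speaking, 65279 is a Unicode code point (U+FEFF) rather than an ASCII value, since ASCII only covers 0 to 127. I suddenly felt that this problem was not simple. Is it really invisible?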
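One way to make such characters visible is to print the code point and Unicode name of anything that is not an ordinary printable character. This is only a sketch: the helper name `find_invisible` is made up for illustration, and treating the Unicode categories Cf (format) and Cc (control) as "suspicious" is an assumption that covers the BOM and zero-width characters but not every possible oddity.

```python
import unicodedata
from typing import List, Tuple


def find_invisible(text: str) -> List[Tuple[int, str, str]]:
    """List (index, code point, Unicode name) for format/control characters."""
    hits = []
    for index, ch in enumerate(text):
        if ch in "\n\r\t":
            continue  # ordinary whitespace, not suspicious
        # Cf = format characters (BOM, zero-width spaces), Cc = control characters
        if unicodedata.category(ch) in ("Cf", "Cc"):
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((index, f"U+{ord(ch):04X}", name))
    return hits


first_line = "\ufeff# -*- coding:utf-8 -*-\n"
print(ord(first_line[0]))          # 65279
print(find_invisible(first_line))  # [(0, 'U+FEFF', 'ZERO WIDTH NO-BREAK SPACE')]
```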
4. Clearing up the confusion
A search on Baidu showed that the value 65279 (U+FEFF) appears when a file is saved in the UTF-8 BOM encoding. This is a common encoding for files created on Windows, where some editors write a BOM when saving as UTF-8. I checked this specifically, and it really is the case.
Pycharm has a setting for this as well. By default, Pycharm creates new files as UTF-8 with NO BOM, which is what people usually mean by UTF-8. The reason ast could parse some files but not others is that some of the files were probably not created in Pycharm, which led to this weird incident. I also read the file in binary mode and found that the first three bytes were \xEF\xBB\xBF, which is the BOM that is added automatically when a file is saved as UTF-8 BOM.
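py_file = './build.py'

# Text mode: decoded as plain utf-8, the BOM survives as the first character
with open(py_file, 'r', encoding='utf-8') as f:
    contents = f.read()
if ord(contents[0]) == 65279:  # 65279 == 0xFEFF
    print('UTF-8 BOM')

# Binary mode: the BOM is the three bytes EF BB BF
with open(py_file, 'rb') as f:
    contents = f.read(3)
if contents == b'\xEF\xBB\xBF':
    print('UTF-8 BOM')

# Output:
# UTF-8 BOM
# UTF-8 BOM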
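The same binary check can be applied to a whole project to see which files carry a BOM. A minimal sketch, assuming only .py files matter; `files_with_bom` is a hypothetical helper name, while `codecs.BOM_UTF8` is the standard-library constant for these three bytes.

```python
import codecs
from pathlib import Path
from typing import List


def files_with_bom(root: str, pattern: str = "*.py") -> List[Path]:
    """Return files under root whose first bytes are the UTF-8 BOM."""
    flagged = []
    for path in sorted(Path(root).rglob(pattern)):
        with path.open("rb") as f:
            # codecs.BOM_UTF8 is b"\xef\xbb\xbf"
            if f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8:
                flagged.append(path)
    return flagged
```

Reading only the first three bytes keeps the scan cheap even on large files.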
So I changed the file encoding from UTF-8 BOM to UTF-8, and the problem was solved!
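The conversion can also be done in code instead of through an editor. The sketch below is one possible approach, not necessarily how the file was converted here: `strip_bom` is a made-up helper name, and it relies on python's standard "utf-8-sig" codec, which drops a leading BOM while decoding.

```python
from pathlib import Path


def strip_bom(path: str) -> bool:
    """Rewrite a file as plain UTF-8; return True if a BOM was removed."""
    file = Path(path)
    raw = file.read_bytes()
    had_bom = raw.startswith(b"\xef\xbb\xbf")
    if had_bom:
        # "utf-8-sig" removes the leading BOM; everything else is unchanged
        text = raw.decode("utf-8-sig")
        file.write_bytes(text.encode("utf-8"))
    return had_bom
```

If the files cannot be modified, reading them with encoding='utf-8-sig' handles files both with and without a BOM. Since the excerpted pipreqs code passes its encoding argument straight to open(), running pipreqs with --encoding=utf-8-sig might also avoid the error, but that is an untested assumption.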
Follow the WeChat public account 夏小悠 for more articles, papers, PPT slides, and other material ^_^