Weird things in Python: invisible characters!

Preface

  Today I would like to share something very strange: I ran into invisible characters while writing code!

1. Cause

  Today I suddenly got an error while using pipreqs to export the libraries my project depends on:

pipreqs . --encoding=utf-8 --force

# Error output follows
ERROR: Failed on file: ./build.py
Traceback (most recent call last):
  File "/usr/local/bin/pipreqs", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/pipreqs/pipreqs.py", line 528, in main
    init(args)
  File "/usr/local/lib/python3.8/dist-packages/pipreqs/pipreqs.py", line 455, in init
    candidates = get_all_imports(input_path,
  File "/usr/local/lib/python3.8/dist-packages/pipreqs/pipreqs.py", line 131, in get_all_imports
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/pipreqs/pipreqs.py", line 117, in get_all_imports
    tree = ast.parse(contents)
  File "/usr/lib/python3.8/ast.py", line 47, in parse
    return compile(source, filename, mode, flags,
  File "<unknown>", line 1
    # -*- coding:utf-8 -*-
    ^
SyntaxError: invalid character in identifier

  The traceback ended in a SyntaxError, and the offending character turned out to be #. The # character is surely innocent, and I was honestly shocked! Isn't this the height of absurdity? This is just a one-line code comment, how could it go wrong?


2. Investigation

  This was the first time I had run into something so bizarre, so I checked the pipreqs source code. The code is very simple, so here is an excerpt of the part that raised the error:

# pipreqs/pipreqs.py line 112
for file_name in files:
    file_name = os.path.join(root, file_name)
    with open(file_name, "r", encoding=encoding) as f:
        contents = f.read()
    try:
        tree = ast.parse(contents)  # the error is raised here
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for subnode in node.names:
                    raw_imports.add(subnode.name)
            elif isinstance(node, ast.ImportFrom):
                raw_imports.add(node.module)
    except Exception as exc:
        ...

  The logic is easy to follow: pipreqs reads every Python file in the current project, then uses Python's ast library to parse each file and collect the names of the libraries it imports. Since this was the part that reported the error, I pulled it out to test on its own. At this point I strongly suspected a serious bug in Python's ast library!
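
  The idea behind that excerpt can be sketched in a few lines. This is only a minimal illustration and not the actual pipreqs implementation: the helper name collect_imports and the sample source string are made up for the example.

```python
import ast


def collect_imports(source: str) -> set:
    """Return the top-level package names imported by `source`.

    Hypothetical helper for illustration; not part of pipreqs.
    """
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import os.path" depends on the top-level package "os"
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module:
            # node.module is None for relative imports such as "from . import x"
            names.add(node.module.split(".")[0])
    return names


sample = "import os.path\nfrom collections import OrderedDict\nimport numpy as np\n"
print(sorted(collect_imports(sample)))  # ['collections', 'numpy', 'os']
```

  Note that ast.parse only parses the text and never runs it, so numpy does not need to be installed for this to work.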

3. Plot twist

  To confirm that ast had a bug when parsing files, I tested all the Python files in the current project one by one. However, things did not go as I expected: the second file (process_data.py) parsed without any problem!
  I looked at this file and it had the same comment on its first line, yet no error was reported. Was there a problem with the encoding? I opened it in PyCharm and took a look, and nothing seemed wrong.
  This was too weird, so I stepped through the code in the debugger and inspected the contents of the file. Nothing looked wrong there either.
  I couldn't figure it out, so I asked ChatGPT.
  It suggested four possible causes, and I could rule most of them out one by one: I was on Python 3.8, the files were UTF-8 encoded, there were no syntax errors, and there was nothing wrong with the comments. But there was one thing I didn't understand: invisible special characters?
  Invisible? If it is a character, it must occupy a position even when it cannot be seen. Then the weirdest thing happened.
  There really was an empty-looking character in the first position. Can an empty character still take up space?
  After printing its character code with ord(), I found that the value was 65279. Strictly speaking this is a Unicode code point (U+FEFF) rather than an ASCII value, since ASCII only goes up to 127. So this seemingly empty character actually had a value, and I suddenly felt that the problem was not so simple. Was it really invisible?
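
  A quick way to make such a character visible is to ask Python to describe it instead of printing it. The snippet below is a small sketch using only the standard library; the variable name first stands in for contents[0] from the file being inspected.

```python
import unicodedata

first = "\ufeff"  # stand-in for contents[0], the character found at index 0

print(repr(first))                   # '\ufeff' (repr escapes non-printable characters)
print(ord(first), hex(ord(first)))   # 65279 0xfeff
print(unicodedata.name(first))       # ZERO WIDTH NO-BREAK SPACE
print(first.isprintable())           # False
```

  Using repr() or ord() on a suspicious string is usually more reliable than looking at it in an editor, because many editors render U+FEFF as nothing at all.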


4. Clearing up the confusion

  A search on Baidu showed that the value 65279 comes from the file being encoded as UTF-8 with BOM. Some Windows programs use this encoding by default when creating files. I checked my file specifically, and it really was encoded that way.
  PyCharm has a setting for this too. By default, PyCharm creates new files as UTF-8 with NO BOM, which is what people usually mean by UTF-8. The reason ast could parse some files and not others is that some of the files were probably not created in PyCharm, which led to this weird incident. I also read the file in binary mode and found that its first three bytes were \xEF\xBB\xBF, the UTF-8 BOM that was added automatically when the file was saved.

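py_file = './build.py'

# Text mode: the utf-8 codec keeps the BOM as the first character, U+FEFF (65279)
with open(py_file, 'r', encoding='utf-8') as f:
    contents = f.read()
    if contents and ord(contents[0]) == 65279:
        print('UTF-8 BOM')

# Binary mode: the BOM is the three bytes EF BB BF at the start of the file
with open(py_file, 'rb') as f:
    contents = f.read(3)
    if contents == b'\xEF\xBB\xBF':
        print('UTF-8 BOM')

# Output:
# UTF-8 BOM
# UTF-8 BOM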
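
  This also explains why the same comment line parsed in one file and failed in another. The sketch below builds the two cases in memory instead of reading real files. The byte string raw and the helper parses are made up for the example. Decoding with utf-8-sig is a standard-library option that drops a leading BOM if one is present, shown here as one possible way to read such files rather than as what pipreqs does.

```python
import ast
import codecs

# In-memory stand-in for a file saved as UTF-8 with BOM
raw = codecs.BOM_UTF8 + b"# -*- coding:utf-8 -*-\nimport os\n"

with_bom = raw.decode("utf-8")         # keeps U+FEFF as the first character
without_bom = raw.decode("utf-8-sig")  # drops a leading BOM if present


def parses(source: str) -> bool:
    """Hypothetical helper: True if ast.parse accepts the source text."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


print(with_bom[0] == "\ufeff")  # True
print(parses(with_bom))         # False: the parser rejects the leading U+FEFF
print(parses(without_bom))      # True
```

  The same option works when opening a file directly, for example open(py_file, 'r', encoding='utf-8-sig'), and it reads files without a BOM just as well.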

  So I changed the file encoding from UTF-8 with BOM to UTF-8, and the problem was solved!
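
  I made the change through the editor, but the same conversion can be scripted when many files are affected. The following is a rough sketch under that assumption: strip_utf8_bom is a hypothetical helper, and the demo runs on a throwaway file in a temporary directory so that nothing real is modified.

```python
import codecs
import tempfile
from pathlib import Path


def strip_utf8_bom(path) -> bool:
    """Rewrite `path` without a leading UTF-8 BOM.

    Hypothetical helper; returns True if a BOM was found and removed.
    """
    p = Path(path)
    data = p.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    p.write_bytes(data[len(codecs.BOM_UTF8):])
    return True


# Demo on a temporary file that imitates a BOM-prefixed build.py
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "build.py"
    demo.write_bytes(codecs.BOM_UTF8 + b"import os\n")
    print(strip_utf8_bom(demo))   # True: the BOM was removed
    print(demo.read_bytes())      # b'import os\n'
    print(strip_utf8_bom(demo))   # False: nothing left to remove
```

  Working on raw bytes avoids re-encoding the rest of the file, so only the first three bytes change.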



  Follow the WeChat public account 夏小悠 for more articles, papers, slides, and other material ^_^


Origin blog.csdn.net/qq_42730750/article/details/132249961