[python] file operation (11)

Reference: the serialized "Python from scratch" series, by Wang Dawei, Python enthusiast community

Reference: Hellobi Live | 1-hour icebreaker introduction to Python

Reference: "Concise Python Tutorial"

Note: for more installments in this series, see [python]




file operations

1 What is a file

A file is a collection of data stored on an external medium, where it can usually be kept for a long time (provided the medium is not damaged). In other
words, a file is a place to store data

2 Absolute path and relative path

Working with a file takes 3 steps:

1. Find the path where the file is stored and open the file
2. Read or modify the file
3. Close the file

To find where a file is stored, we must understand the concepts of absolute path and relative path

2.1 Absolute path

An absolute path starts from the root of the file system (on Windows, the drive letter) and leads all the way to the file's location,
e.g.:

E:\Programming Learning Materials\Crawling HD Big Picture.py

E:/Programming Learning Materials/Crawling HD Big Picture.py

The following example opens an image by its absolute path on an Ubuntu system

open('/root/userfolder/0.jpg')

The result is

<_io.TextIOWrapper name='/root/userfolder/0.jpg' mode='r' encoding='ANSI_X3.4-1968'>

2.2 Relative paths

A relative path starts from the current working directory and leads to the file's location

open('0.jpg')

The result is

<_io.TextIOWrapper name='0.jpg' mode='r' encoding='ANSI_X3.4-1968'>
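As a small sketch of the difference, the standard os.path module can tell the two kinds of path apart and turn a relative path into an absolute one. The file name 0.jpg is reused from the example above; the file does not need to exist for these calls.

```python
import os

relative = '0.jpg'
# abspath() resolves the relative path against the current working directory
absolute = os.path.abspath(relative)

print(os.getcwd())               # the current working directory
print(os.path.isabs(relative))   # False
print(os.path.isabs(absolute))   # True
print(absolute)
```

The last line printed depends on the directory the script is run from, which is exactly why a relative path only makes sense together with the current working directory.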

You can use the OpenCV library (provided it is installed) to display the picture. If OpenCV is not installed, you can use the Image module in PIL. The following uses OpenCV as an example; the imshow(), axis() and show() calls come from matplotlib.

import cv2
import matplotlib.pyplot as plt

im = cv2.imread('0.jpg')  # relative path
plt.imshow(cv2.cvtColor(im, cv2.COLOR_BGR2RGB))  # OpenCV loads BGR; matplotlib expects RGB
plt.axis('off')
plt.show()

The picture is displayed


3 encoding of the file

Depending on how its content is represented, a file's data can be viewed as text characters or as binary bytes

  • Text characters, such as Chinese characters, English letters, digits and punctuation, are what is displayed to the reader

  • Binary bytes are the form in which the computer stores data. Inside the computer, all data is bytes made up of strings of 0s and 1s

  When we open a text file we see characters, but what is actually saved is binary bytes. The encoding used to turn the characters into bytes can be chosen when saving in the Notepad that comes with Windows.


Unicode is a "character set"
UTF-8 is an "encoding rule"
Here:
Character set: assigns a unique ID to each "character" (formally called a code point)
Encoding rule: the rule for converting a "code point" into a byte sequence (encoding/decoding can loosely be compared to the process of encryption/decryption)

To make a simple analogy, Unicode is like the Chinese language, while UTF-8, UTF-16 and so on are like different ways of writing it, such as running script, regular script and cursive script. As for the details, the key question is: why is there UTF-8 as well as UTF-16? Why so many complicated encodings? Wouldn't one encoding be enough? The answer is to save bandwidth, because early Internet bandwidth was very expensive. (Excerpt from Zhihu)
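The distinction between a code point and its encoded bytes can be sketched in a few lines. This is only an illustration, using the character 莫 from the examples in this article; 'utf-16-le' (little-endian, no byte-order mark) is chosen here so that the byte counts are easy to compare.

```python
ch = '莫'
code_point = ord(ch)              # the character's ID in the Unicode character set
utf8 = ch.encode('utf-8')         # the same code point under the UTF-8 rule
utf16 = ch.encode('utf-16-le')    # ... and under the UTF-16 rule

print(hex(code_point))            # 0x83ab
print(utf8)                       # b'\xe8\x8e\xab'
print(len(utf8), len(utf16))      # 3 2

# An ASCII letter is smaller in UTF-8 than in UTF-16
ascii_sizes = (len('A'.encode('utf-8')), len('A'.encode('utf-16-le')))
print(ascii_sizes)                # (1, 2)
```

One code point, different byte sequences: which encoding is more compact depends on the text being stored.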

In Python 3, all text such as strings (str) is stored as Unicode characters.
Use encode() to encode a string into bytes (UTF-8 by default), and
use decode() to decode bytes back into text characters


s1 = '莫莫,你好'
s2 = s1.encode()
print (s2)
type(s2)

Here the string is encoded into bytes using the default UTF-8 encoding, and the
result is

b'\xe8\x8e\xab\xe8\x8e\xab,\xe4\xbd\xa0\xe5\xa5\xbd'
bytes

Before encoding, the value is of type str (string)

After encoding, the value is of type bytes


Of course, in addition to utf-8 encoding, there are many other encodings, such as gbk encoding

s3 = s1.encode('gbk')
print (s3)
type(s3)

The result is

b'\xc4\xaa\xc4\xaa,\xc4\xe3\xba\xc3'
bytes

We decode the UTF-8 bytes back into a Unicode string.

s2.decode()

The result is

'莫莫,你好'

However, if we decode the UTF-8 bytes using gbk, an error is raised

s2.decode('gbk')

The result is

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-29-8f3ab4505d84> in <module>()
----> 1 s2.decode('gbk')

UnicodeDecodeError: 'gbk' codec can't decode byte 0xab in position 2: illegal multibyte sequence

The error message says that the byte at a certain position cannot be decoded

Why? You can think of it like this

A Chinese sentence can be translated into English or into Korean

Translator A, who understands only Chinese and English, can translate (encode) Chinese into English, or translate (decode) English back into Chinese

But if translator A is asked to translate (decode) Korean into Chinese, he cannot do it, because he does not understand Korean!


Now we decode the GBK-encoded content without specifying an encoding

s3.decode()

The result is

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-30-c1fbb3df3ed5> in <module>()
----> 1 s3.decode()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 5: invalid continuation byte

An error again! Because we passed no argument to decode(), it used the default, UTF-8


So we have to decode with gbk:

s3.decode('gbk')

The result is

'莫莫,你好'

To sum up, bytes encoded with UTF-8 should be decoded with UTF-8, and bytes encoded with GBK should be decoded with GBK! encode() and decode() both default to UTF-8.
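When the encoding of some bytes is uncertain, one possible approach is to catch the UnicodeDecodeError and fall back to another codec. The sketch below assumes the only two candidates are UTF-8 and GBK, as in this section; it is not a general way of detecting encodings.

```python
text = '莫莫,你好'
data = text.encode('gbk')          # pretend we received these bytes from somewhere

try:
    decoded = data.decode('utf-8')   # first guess: the default encoding
except UnicodeDecodeError as exc:
    print('utf-8 failed:', exc.reason)
    decoded = data.decode('gbk')     # fall back to the second candidate

print(decoded)
```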

4 Opening, writing and closing of files

Usually, our file operations in Python include opening a file, reading its content, modifying it, and closing it.

4.1 File open

Use open() to open a file:

fileobject = open(filename, 'mode')

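mode is an optional parameter, usually one of the following:

w     open for writing; the file is created if it does not exist (and emptied if it does)

r      open read-only

a     open for writing; content is appended to the end of the file (like a list's append())

b     binary file

+     open for updating; supports both reading and writing

r+    open for reading and writing

w+   open for reading and writing (see w)

a+    open for reading and writing (see a)

rb     open for reading in binary mode

wb    open for writing in binary mode (see w)

ab     open for appending in binary mode (see a)

rb+   open for reading and writing in binary mode (see r+)

wb+  open for reading and writing in binary mode (see w+)

ab+  open for reading and writing in binary mode (see a+)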

How to remember?

w = write

r = read

b = bytes (binary)

a = append

These letters can then be combined

If no mode is given, the default is r
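The modes can be tried out on a scratch file. In this sketch the file demo.txt and the temporary folder are hypothetical names created only for the demonstration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')  # hypothetical scratch file

f = open(path, 'w', encoding='utf-8')   # w: create the file (or empty it) and write
f.write('first\n')
f.close()

f = open(path, 'a', encoding='utf-8')   # a: keep the content and append at the end
f.write('second\n')
f.close()

f = open(path, encoding='utf-8')        # no mode given, so the default r is used
content = f.read()
f.close()

f = open(path, 'rb')                    # rb: read the raw bytes instead of text
raw = f.read()
f.close()

print(repr(content))   # 'first\nsecond\n'
print(type(raw))       # <class 'bytes'>
```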

When you have finished processing a file, close it

4.2 File close

fileobject.close()

Take a look at the example below

f = open('/root/userfolder/1.txt')
type(f)

The result is

_io.TextIOWrapper

Checking the type shows a text file object (TextIOWrapper),
which indicates that the file has been opened.
Opening a file creates a file object in memory that is connected to the file on external storage (the hard disk). As we learned earlier,
every object has an id number

id(f)

The result is

139835160270384

OK.
Next, close the file:

f.close()
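Forgetting close() is easy. As an alternative sketch (not the style used in the rest of this article), Python's with statement closes the file automatically when the indented block ends. The scratch file name below is hypothetical.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')  # hypothetical scratch file

with open(path, 'w', encoding='utf-8') as f:
    f.write('hello')
    inside = f.closed    # still open inside the block

after = f.closed         # closed automatically once the block ends
print(inside, after)     # False True
```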

If we try to open a file that does not exist,
the default r mode raises an error:

f = open('/root/userfolder/2.txt')

The result is

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-38-ba3d87886db4> in <module>()
----> 1 f = open('/root/userfolder/2.txt')

FileNotFoundError: [Errno 2] No such file or directory: '/root/userfolder/2.txt'

Opening it in w mode works, because w creates the file when it does not exist

f = open('/root/userfolder/2.txt','w')

You will find that a 2.txt file is created in the directory
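If a program should keep running when the file is missing, the error can be caught. A minimal sketch, assuming that creating an empty file is the desired reaction; the temporary folder stands in for /root/userfolder/.

```python
import os
import tempfile

missing = os.path.join(tempfile.mkdtemp(), '2.txt')  # hypothetical path, does not exist yet

try:
    f = open(missing)            # default r mode: fails for a missing file
except FileNotFoundError:
    print('not found, creating it')
    f = open(missing, 'w')       # w mode creates the file
f.close()

created = os.path.exists(missing)
print(created)   # True
```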


4.3 Reading and writing files

Earlier we said that open() with the absolute or relative path of a file opens the file

Here is a more convenient approach

We import the os module, which provides functions related to the operating system

The os.chdir() method changes the current working directory to the specified path, so that files in that directory can be opened by name alone.

First, in the 1.txt file in the /root/userfolder/ directory, write "莫莫,你好" ("Momo, hello") and save it
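The effect of os.chdir() can be checked with os.getcwd(). The sketch below uses a temporary folder as a stand-in for /root/userfolder/ and switches back to the original directory at the end; both choices are assumptions made so the example runs anywhere.

```python
import os
import tempfile

original = os.getcwd()
workdir = tempfile.mkdtemp()       # stand-in for /root/userfolder/

os.chdir(workdir)                  # the current working directory is now workdir
f = open('1.txt', 'w', encoding='gbk')   # a bare file name is resolved against workdir
f.write('莫莫,你好')
f.close()

exists_in_workdir = os.path.exists(os.path.join(workdir, '1.txt'))
print(exists_in_workdir)           # True

os.chdir(original)                 # restore the previous working directory
```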

4.3.1 read

We use the read() method to read the text content

import os
os.chdir('/root/userfolder/')
f = open('1.txt',encoding='gbk')
f.read()

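The file here was saved in GBK, but the Ubuntu system used here does not default to GBK (the earlier output shows its default encoding is ANSI_X3.4-1968, i.e. ASCII), so encoding='gbk' is passed explicitly. Without it, read() raises an error when reading the Chinese text. On a Windows system whose default encoding is GBK, the argument can be omitted. The
result is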

'莫莫,你好'

4.3.2 write

We can also write content using the write() method:

f.write('莫莫,我爱你')

The result is

---------------------------------------------------------------------------
UnsupportedOperation                      Traceback (most recent call last)
<ipython-input-57-0a3613f20217> in <module>()
----> 1 f.write('莫莫,我爱你')

UnsupportedOperation: not writable

An error is raised! The default open mode is r (read-only), so writing is not possible.


Close the file first

f.close()

Then try again with a different open mode

import os
os.chdir('/root/userfolder/')
f = open('1.txt','a',encoding='gbk')
f.write('\n莫莫,我爱你')
f.close()

Read it back

f = open('1.txt',encoding='gbk')
f.read()

The result is

'莫莫,你好\n莫莫,我爱你'

Read again

f.read()

The result is

''

What happened? The content read this time is empty!
The explanation: read() reads the entire content. Once everything has been read, the position marker (like a bookmark) is already at the end of the file, so reading again naturally returns nothing.
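The "bookmark" can be moved back to the beginning with the file object's seek() method, so the file does not have to be reopened. A short sketch on a hypothetical scratch file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'seek_demo.txt')  # hypothetical scratch file
f = open(path, 'w', encoding='utf-8')
f.write('line one\nline two')
f.close()

f = open(path, encoding='utf-8')
first = f.read()      # reads everything; the position is now at the end
second = f.read()     # nothing left to read
f.seek(0)             # move the position back to the start of the file
third = f.read()      # the whole content again
f.close()

print(repr(second))      # ''
print(third == first)    # True
```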


What if I only want to read one line?

f.close()
f = open('1.txt',encoding='gbk')
f.readline()

The result is

'莫莫,你好\n'

Read the second line

f.readline()

The result is

'莫莫,我爱你'

Read the third line

f.readline()

The result is

''

The example above has only two lines. With more lines we need to read repeatedly; a first attempt that checks readline() before printing looks like this

f.close()
f = open('1.txt',encoding='gbk')
if f.readline()!='':
    print(f.readline())

The result is

莫莫,我爱你

Hmm? Why is only the second line printed?
Because the condition in the if statement has already called readline() once, so the readline() inside print() reads the second line


Let's write it differently

f.close()
f = open('1.txt',encoding='gbk')
for i in range(0,2):
    print(f.readline())

The result is

莫莫,你好

莫莫,我爱你

A new problem: here we know there are two lines, but what if we don't know how many lines there are?


You can use the readlines() method

f.close()
f = open('1.txt',encoding='gbk')
f.readlines()

The result is

['莫莫,你好\n', '莫莫,我爱你']

The readlines() method returns a list in which each line of the file is one element,
which is not very pleasant to read


Let's tidy it up

f.close()
f = open('1.txt',encoding='gbk')
for i in f.readlines():
    print(i)

The result is

莫莫,你好

莫莫,我爱你
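The blank line in the output appears because each line already ends with '\n' and print() adds another newline. A minimal sketch, assuming a scratch file with the same two lines, that iterates over the file object directly (which also works when the number of lines is unknown) and strips the trailing newline:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), '1.txt')  # hypothetical copy of the example file
f = open(path, 'w', encoding='gbk')
f.write('莫莫,你好\n莫莫,我爱你')
f.close()

lines = []
f = open(path, encoding='gbk')
for line in f:                       # a file object yields one line per iteration
    lines.append(line.rstrip('\n'))  # drop the newline that each line carries
f.close()

for line in lines:
    print(line)
```

Iterating this way reads one line at a time instead of building the whole list first, which can matter for large files.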

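We can also write it more compactly as a list comprehension (although a plain for loop is the more idiomatic choice when you only want the side effect of printing)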

f.close()
f = open('1.txt',encoding='gbk')
g = [print(i) for i in f.readlines()]

The result is

莫莫,你好

莫莫,我爱你

Why add g =?


Let's look at what happens without g =

f.close()
f = open('1.txt',encoding='gbk')
[print(i) for i in f.readlines()]

The result is

莫莫,你好

莫莫,我爱你
[None, None]

It returns a list whose elements are all None, because print(), which produces each element of the list, does not return a value.
A simpler example makes this easy to understand.

i = print('莫莫,我爱你')

The result is

莫莫,我爱你

Now print i

print(i)

The result is

None

We assigned the result of print() to i;
when we print i we find that it is None


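5 Traversing files

Get the names of all files under a folder (including its subfolders) in Python

import os
def file_name(file_dir):
    for root, dirs, files in os.walk(file_dir):
        print(root)   # path of the directory currently being visited
        print(dirs)   # subdirectories directly inside root
        print(files)  # non-directory files directly inside root

file_name('.')  # walk the current folder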
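Printing is fine for a quick look, but often the file names are needed as data. The following sketch collects full paths into a list instead; list_files and the small folder tree it walks are hypothetical names made up for this illustration.

```python
import os
import tempfile

def list_files(file_dir):
    """Return the full paths of all files under file_dir, including subfolders."""
    paths = []
    for root, dirs, files in os.walk(file_dir):
        for name in files:
            paths.append(os.path.join(root, name))  # root + file name = full path
    return paths

# Build a small hypothetical folder tree to walk
top = tempfile.mkdtemp()
os.mkdir(os.path.join(top, 'sub'))
for p in (os.path.join(top, 'a.txt'), os.path.join(top, 'sub', 'b.txt')):
    open(p, 'w').close()

found = sorted(os.path.basename(p) for p in list_files(top))
print(found)   # ['a.txt', 'b.txt']
```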
