[Liao Xuefeng] python_IO programming: StringIO and BytesIO, operating files and directories, serialization

IO programming

IO refers to Input/Output in computing. Since programs and runtime data reside in memory and are executed by the ultra-fast CPU, an IO interface is required wherever data is exchanged with the outside, usually disks, networks, and so on.

In IO programming, Stream is a very important concept. You can think of the stream as a water pipe, and the data is the water in the pipe, but it can only flow in one direction. Input Stream means data flows into the memory from the outside (disk, network), and Output Stream means data flows from the memory to the outside. For browsing the web, at least two water pipes need to be established between the browser and the Sina server to be able to both send and receive data.

Since the CPU and memory are much faster than peripherals, IO programming has a serious speed-mismatch problem. For example, to write 100 MB of data to disk, the CPU may need only 0.01 seconds to output the data, but the disk may need 10 seconds to receive it. What can be done? There are two ways:

The first is that the CPU waits: the program pauses execution of subsequent code, waits the 10 seconds until the 100 MB of data has been written to disk, and then continues. This mode is called synchronous IO.

The other is that the CPU does not wait, but just tells the disk, "Take your time writing, I will go and do other things." Subsequent code can therefore be executed immediately. This mode is called asynchronous IO.

The difference between synchronous and asynchronous IO is whether the program waits for the result of the IO operation. It is like ordering at McDonald's: you say "a burger," and the server tells you the burger has to be cooked fresh and will take 5 minutes, so you stand at the counter for 5 minutes, get the burger, and then go to the mall. This is synchronous IO.

In the other case, you say "a burger," and the server tells you the burger will take 5 minutes, so you can go to the mall first and you will be notified when it is ready. You can go and do other things (go to the mall) immediately. This is asynchronous IO.

Obviously, programs written with asynchronous IO can perform much better than those using synchronous IO, but the disadvantage of asynchronous IO is that the programming model is complex. Think about it: you need to find out when "the burger is ready," and the ways of notifying you vary. If the server comes to find you, that is the callback mode. If the server sends you a text message, you have to keep checking your phone, which is the polling mode. In short, asynchronous IO is far more complex than synchronous IO.
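
As a rough illustration of the two notification styles, the sketch below uses a thread pool from the standard library's concurrent.futures module as a stand-in for asynchronous IO. The cook_burger function, the 0.2-second delay, and all variable names are made up for this example; it is only meant to show what "callback" and "polling" look like in code, not how real asynchronous IO is implemented.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cook_burger():
    """Hypothetical slow operation standing in for a disk or network write."""
    time.sleep(0.2)
    return "burger"

notifications = []

def on_ready(future):
    # Callback mode: the executor calls this function when the task finishes.
    notifications.append(future.result())

with ThreadPoolExecutor(max_workers=1) as pool:
    # Callback mode: register a function and carry on.
    first_order = pool.submit(cook_burger)
    first_order.add_done_callback(on_ready)

    # Polling mode: keep checking whether the result is ready.
    second_order = pool.submit(cook_burger)
    checks = 0
    while not second_order.done():
        checks += 1          # "go to the mall": other work would happen here
        time.sleep(0.05)
    polled_result = second_order.result()

print(notifications, polled_result, checks)
```

In both cases the main program keeps running while the burger is being cooked; the difference is only in who does the checking.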

The ability to perform IO is provided by the operating system. Every programming language wraps the low-level C interfaces provided by the operating system for ease of use, and Python is no exception. All IO programming in this chapter uses the synchronous mode.

StringIO and BytesIO

StringIO

Often, data is not read from or written to a file; it can also be read and written in memory.

StringIO, as the name suggests, reads and writes str in memory.

To write str to a StringIO, we first create a StringIO and then write to it like a file:

>>> from io import StringIO
>>> f = StringIO()
>>> f.write('hello')
5
>>> f.write(' ')
1
>>> f.write('world!')
6
>>> print(f.getvalue())  # get the str written so far
hello world!

The getvalue() method returns the str that has been written.

To read from a StringIO, initialize it with a str and then read it like a file:

>>> from io import StringIO
>>> f = StringIO('Hello!\nHi!\nGoodbye!')
>>> while True:
...     s = f.readline()
...     if s == '':
...         break
...     print(s.strip())
...
Hello!
Hi!
Goodbye!

BytesIO

StringIO can only operate on str. If you want to operate binary data, you need to use BytesIO.

BytesIO implements reading and writing bytes in memory. We create a BytesIO and then write some bytes:

>>> from io import BytesIO
>>> f = BytesIO()
>>> f.write('中文'.encode('utf-8'))
6
>>> print(f.getvalue())  # get the bytes written so far
b'\xe4\xb8\xad\xe6\x96\x87'

Please note that what is written is not str, but UTF-8 encoded bytes.

Similar to StringIO, a BytesIO can be initialized with a bytes object and then read like a file:

>>> from io import BytesIO
>>> f = BytesIO(b'\xe4\xb8\xad\xe6\x96\x87')
>>> f.read()
b'\xe4\xb8\xad\xe6\x96\x87'
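
One detail worth keeping in mind is that these in-memory objects track a current position, just like a real file. The following small sketch (variable names are only for illustration) shows that calling read() right after write() returns nothing, because the position is at the end of the buffer, and that seek(0) moves it back to the start:

```python
from io import BytesIO

f = BytesIO()
f.write('中文'.encode('utf-8'))

after_write = f.read()   # position is at the end, so nothing is left to read
f.seek(0)                # move back to the start of the buffer
from_start = f.read()

print(after_write)                  # b''
print(from_start.decode('utf-8'))   # 中文
```

getvalue() is not affected by the position and always returns the whole buffer.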

Summary

StringIO and BytesIO operate on str and bytes in memory, with the same interface as reading and writing files.
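
To make the "same interface" point concrete, here is a minimal sketch of a function that only relies on the file-like interface. The name count_lines is hypothetical; the same function would also accept an object returned by open().

```python
from io import BytesIO, StringIO

def count_lines(stream):
    """Count the lines in any file-like object opened for reading."""
    return sum(1 for _ in stream)

text_lines = count_lines(StringIO('Hello!\nHi!\nGoodbye!'))
byte_lines = count_lines(BytesIO(b'a\nb\n'))

print(text_lines, byte_lines)   # 3 2
```

This is also why StringIO and BytesIO are handy in tests: code written for files can be exercised without touching the disk.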


Manipulate files and directories

To operate on files and directories, we can enter various commands provided by the operating system on the command line, such as dir and cp.

What if you want to perform these operations on directories and files in a Python program? In fact, the commands provided by the operating system simply call interface functions provided by the operating system, and Python's built-in os module can call those interface functions directly.

Open the Python interactive command line and let's look at the basic functions of the os module:

>>> import os
>>> os.name
'nt'

If it is posix, the system is Linux, Unix, or Mac OS X; if it is nt, the system is Windows.

To get detailed system information, call the uname() function. Note, however, that uname() is not provided on Windows; in other words, some functions of the os module depend on the operating system. (cry!!!)
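
One possible way to cope with this, sketched below, is to check whether os.uname exists and otherwise fall back to platform.uname() from the standard library, which is also available on Windows. The helper name system_info is made up for this example.

```python
import os
import platform

def system_info():
    """Return (system name, release) on both POSIX and Windows."""
    # os.uname() exists only on POSIX systems, so check before calling it.
    if hasattr(os, 'uname'):
        info = os.uname()
        return info.sysname, info.release
    # platform.uname() is part of the standard library and works on Windows too.
    info = platform.uname()
    return info.system, info.release

print(os.name, system_info())
```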

Environment variables

All environment variables defined in the operating system are stored in the os.environ variable and can be viewed directly:

>>> os.environ
[Squeezed text(147 lines)]  # what my session displayed

To get the value of an environment variable, you can call os.environ.get('key'):

>>> os.environ.get('PATH')
[Squeezed text(50 lines)]  # what my session displayed
>>> os.environ.get('x', 'default')
'default'

Manipulate files and directories

Some of the functions that operate on files and directories are in the os module and some are in the os.path module. Keep this in mind. Viewing, creating, and deleting directories can be done like this:

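# View the absolute path of the current directory:
>>> os.path.abspath('.')
'C:\\Program Files\\Python39'

# To create a new directory inside some directory, first build the full path of the new directory:
>>> os.path.join('E:/', 'testdir')
'E:/testdir'

# Then create the directory:
>>> os.mkdir('E:/testdir')

# Delete the directory:
>>> os.rmdir('E:/testdir')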

To split a path, don't split the string directly; use the os.path.split() function. It splits a path into two parts, and the second part is always the last-level directory or file name:

>>> os.path.split('E:\\submit\\submit\\README.md')
('E:\\submit\\submit', 'README.md')

os.path.splitext() gives you the file extension directly, which is very convenient in many cases:

>>> os.path.splitext('E:\\submit\\submit\\README.md')
('E:\\submit\\submit\\README', '.md')

These functions for merging and splitting paths do not require that the directories and files actually exist. They only operate on strings.
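
A quick way to convince yourself of this is to run the functions on a path that does not exist. The path below is invented for illustration:

```python
import os

# This path is made up and does not need to exist on disk.
path = os.path.join('no_such_dir', 'sub', 'report.tar.gz')

head, tail = os.path.split(path)
root, ext = os.path.splitext(tail)

print(head)        # no_such_dir/sub on POSIX, no_such_dir\sub on Windows
print(tail)        # report.tar.gz
print(root, ext)   # report.tar .gz  (only the last extension is split off)
```

Using os.path.join() instead of concatenating strings by hand also means the correct separator for the current operating system is chosen automatically.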

File operations use the following functions. Assume there is a test.txt file in the current directory:

# Rename the file:
>>> os.rename('test.txt', 'test.py')
# Delete the file:
>>> os.remove('test.py')

But a function for copying files does not exist in the os module! The reason is that copying files is not a system call provided by the operating system. In theory, we could copy a file using the file reading and writing covered in the previous section, but that takes a lot more code.

Fortunately, the shutil module provides a copyfile() function. You can find many other practical functions in the shutil module; they can be regarded as a supplement to the os module.
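
A minimal sketch of copyfile() in use is shown below. It works inside a temporary directory so that nothing is left behind; the file names are placeholders chosen for this example.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'test.txt')
    dst = os.path.join(tmp, 'test_copy.txt')

    with open(src, 'w', encoding='utf-8') as f:
        f.write('hello world!')

    shutil.copyfile(src, dst)   # copies the file contents only, not metadata

    with open(dst, encoding='utf-8') as f:
        copied_text = f.read()
    both_exist = os.path.isfile(src) and os.path.isfile(dst)

print(copied_text)   # hello world!
```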

Finally, let's see how to use Python's features to filter files. For example, to list all directories in the current directory, one line of code is enough:

>>> [x for x in os.listdir('.') if os.path.isdir(x)]
['DLLs', 'Doc', 'include', 'Lib', 'libs', 'Scripts', 'share', 'tcl', 'Tools']

To list all .py files, it is also just one line of code:

>>> [x for x in os.listdir('.') if os.path.isfile(x) and os.path.splitext(x)[1]=='.py']
[]

Summary

Python's os module wraps the operating system's directory and file operations. Note that some of these functions are in the os module and some are in the os.path module.

Practice

1. Use the os module to write a program that produces output like dir -l.

>>> [x for x in os.listdir('E:/submit/submit/')]
['.DS_Store', '.idea', 'config1.json', 'config2.json', 'config3.json', 'data.py', 'demo', 'inference.py', 'model.py', 'README.html', 'README.md', 'results', 'run.sh.rjejmartr', 'train.py', '__pycache__']
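
The listing above only shows names. A fuller answer might also print the modification time, a directory marker, and the size, roughly like dir or ls -l. The sketch below is one possible approach, not a reference solution; the function name list_dir and the column layout are my own choices, and the demo runs on a throwaway directory so the result is predictable.

```python
import os
import tempfile
import time

def list_dir(path='.'):
    """Return one line per entry: modified time, <DIR> marker, size, name."""
    lines = []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        info = os.lstat(full)  # lstat does not follow symlinks, so broken links are fine
        marker = '<DIR>' if os.path.isdir(full) else ''
        mtime = time.strftime('%Y-%m-%d %H:%M', time.localtime(info.st_mtime))
        lines.append(f'{mtime}  {marker:<5}  {info.st_size:>10}  {name}')
    return lines

# Demo on a throwaway directory containing one file and one subdirectory.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, 'sub'))
    with open(os.path.join(tmp, 'a.txt'), 'w') as f:
        f.write('abc')
    listing = list_dir(tmp)

for line in listing:
    print(line)
```

Calling list_dir('.') lists the current directory instead.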

2. Write a program that searches the current directory and all of its subdirectories for files whose names contain a specified string, and prints their relative paths.

import os

def search_file(path, keyword):
    # os.listdir(path) lists only the entries directly under path
    # (files and subdirectories), so subdirectories are searched recursively.
    for name in os.listdir(path):
        this_path = os.path.join(path, name)
        if os.path.isfile(this_path):  # a file: check whether its name matches
            if keyword in name:
                print(this_path)
        elif os.path.isdir(this_path):  # a directory: search inside it
            search_file(this_path, keyword)


if __name__ == "__main__":
    search_file("F:/7788/", "王")
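
As an alternative to writing the recursion by hand, the standard library's os.walk() visits every subdirectory for you, and os.path.relpath() turns each match into a path relative to the starting directory, which is what the exercise asks for. The sketch below is one way to combine them; find_files and the demo directory layout are invented for this example.

```python
import os
import tempfile

def find_files(root, keyword):
    """Return relative paths of files under root whose names contain keyword."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if keyword in name:
                full = os.path.join(dirpath, name)
                matches.append(os.path.relpath(full, root))
    return matches

# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'a', 'b'))
    for rel in ('test.txt',
                os.path.join('a', 'mytest.py'),
                os.path.join('a', 'b', 'other.md')):
        open(os.path.join(tmp, rel), 'w').close()
    found = sorted(find_files(tmp, 'test'))

print(found)
```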

Serialization

While a program is running, all variables are in memory. For example, define a dict:

>>> d = dict(name='Bob', age=20, score=88)
>>> d
{'name': 'Bob', 'age': 20, 'score': 88}

Variables can be modified at any time, for example changing name to 'Bill', but once the program ends, all memory occupied by the variables is reclaimed by the operating system. If the modified 'Bill' is not stored on disk, the variable will be initialized to 'Bob' again the next time the program runs.

The process of turning variables in memory into something that can be stored or transmitted is called serialization. It is called pickling in Python, and serialization, marshalling, flattening, and so on in other languages. They all mean the same thing.

After serialization, the serialized content can be written to disk or transmitted to other machines through the network.

Conversely, reading the variable content from the serialized object back into memory is called deserialization, that is, unpickling.

Python provides the pickle module to implement serialization.

First, we try to serialize an object and write it to a file:

>>> import pickle
>>> d = dict(name='Bob', age=20, score=88)
>>> pickle.dumps(d)
b'\x80\x04\x95$\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x04name\x94\x8c\x03Bob\x94\x8c\x03age\x94K\x14\x8c\x05score\x94KXu.'

The pickle.dumps() function serializes an object into a bytes object, which can then be written to a file. Alternatively, use pickle.dump() to serialize the object and write it directly to a file-like object:

>>> f = open(r'E:\csdn\practice\dump.txt', 'wb')
>>> pickle.dump(d, f)
>>> f.close()

Look at the dump.txt file that was written: it is a jumble of content, which is the object's internal information as saved by Python.

To read an object from disk into memory, we can first read the content into a bytes object and then deserialize it with pickle.loads(), or we can use pickle.load() to deserialize the object directly from a file-like object. We open another Python command line to deserialize the object we just saved:

# loads() deserializes a bytes object
>>> pickle_b = b'\x80\x04\x95$\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x04name\x94\x8c\x03Bob\x94\x8c\x03age\x94K\x14\x8c\x05score\x94KXu.'
>>> pickle.loads(pickle_b)
{'name': 'Bob', 'age': 20, 'score': 88}

# load() deserializes the object directly from a file-like object
>>> f = open(r'E:\csdn\practice\dump.txt', 'rb')
>>> d = pickle.load(f)
>>> f.close()
>>> d
{'name': 'Bob', 'age': 20, 'score': 88}

The contents of the variable are back!

Of course, this variable and the original variable are completely unrelated objects, they just have the same content.
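
The round trip can also be written with the with statement, which closes the file automatically, and the "same content, different object" point can be checked directly. This is a small sketch; the file name dump.pkl and the temporary directory are placeholders.

```python
import os
import pickle
import tempfile

d = dict(name='Bob', age=20, score=88)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'dump.pkl')   # hypothetical file name
    with open(path, 'wb') as f:            # the file is closed automatically
        pickle.dump(d, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == d)   # True: same content
print(restored is d)   # False: a different object
```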

The problem with pickle, as with the language-specific serialization of every other programming language, is that it can only be used with Python, and different versions of Python may even be incompatible with each other. Therefore, pickle should only be used to save unimportant data, where it does not matter if deserialization fails.

JSON

To pass objects between different programming languages, we must serialize them into a standard format such as XML. A better choice is JSON, because JSON is represented as a string that can be read by all languages and can easily be stored on disk or transmitted over a network. JSON is not only a standard format and faster than XML, it can also be read directly in web pages, which is very convenient.

The objects represented by JSON are standard JavaScript language objects. The correspondence between JSON and Python’s built-in data types is as follows:

JSON type     Python type
{}            dict
[]            list
"string"      str
1234.56       int or float
true/false    True/False
null          None
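
The mapping in the table can be checked by decoding a small JSON document and looking at the resulting Python types. The JSON text and key names below are made up for this check:

```python
import json

text = ('{"obj": {}, "arr": [], "s": "string", "i": 1234, '
        '"f": 1234.56, "t": true, "n": null}')
data = json.loads(text)

# Map each key to the name of the Python type it was decoded to.
types = {key: type(value).__name__ for key, value in data.items()}
print(types)
```

A JSON number without a fractional part becomes an int, and one with a fractional part becomes a float.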

Python's built-in json module provides very complete support for converting Python objects to JSON. Let's first look at how to turn a Python object into JSON:

>>> import json
>>> d = dict(name='Bob', age=20, score=88)
>>> json.dumps(d)
'{"name": "Bob", "age": 20, "score": 88}'  # the return value is a str

The dumps() function returns a str whose content is standard JSON. Similarly, the dump() function writes JSON directly to a file-like object.

>>> f = open(r'E:\csdn\practice\json_dump.txt', 'w')
>>> json.dump(d, f)  # writes the JSON text itself, without surrounding quotes
>>> f.close()

To deserialize JSON into a Python object, use loads() or the corresponding load(). The former deserializes a JSON string, and the latter reads the string from a file-like object and deserializes it:

# loads() deserializes a JSON string
>>> json_str = '{"name": "Bob", "age": 20, "score": 88}'
>>> json.loads(json_str)
{'name': 'Bob', 'age': 20, 'score': 88}

# load() reads a string from a file-like object and deserializes it
>>> f = open(r'E:\csdn\practice\json_dump.txt', 'r')
>>> d = json.load(f)
>>> f.close()
>>> d
{'name': 'Bob', 'age': 20, 'score': 88}

Since the JSON standard specifies that JSON is encoded as UTF-8, we can always convert correctly between Python str and JSON strings.

JSON Advanced

Python dict objects can be serialized directly into a JSON {}. However, we often prefer to represent objects with a class, for example by defining a Student class, and then serialize its instances.

By default, dumps() does not know how to convert a Student instance into a JSON {} object. The optional parameter default turns any object into one that can be serialized to JSON. We only need to write a conversion function for Student and pass it in:

import json

class Student(object):
    def __init__(self, name, age, score):
        self.name = name
        self.age = age
        self.score = score

# A conversion function written specifically for Student
def student2dict(std):
    return {
        'name': std.name,
        'age': std.age,
        'score': std.score
    }

s = Student('Bob', 20, 88)
print(json.dumps(s, default=student2dict))
# The optional parameter default turns any object into one that can be serialized to JSON

# Result: {"name": "Bob", "age": 20, "score": 88}

However, if we next encounter an instance of a Teacher class, it still cannot be serialized to JSON. We can take a shortcut and turn an instance of any class into a dict:

print(json.dumps(s, default=lambda obj: obj.__dict__))
# {"name": "Bob", "age": 20, "score": 88}

This works because class instances usually have a __dict__ attribute, which is a dict that stores the instance variables. There are a few exceptions, such as classes that define __slots__.
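
To see the __slots__ exception in practice, the sketch below defines a hypothetical Point class with __slots__ and a fallback conversion function, to_dict, that collects the slot attributes when __dict__ is missing. Both names are invented for this example, and the fallback assumes __slots__ is a tuple of attribute names.

```python
import json

class Point(object):
    __slots__ = ('x', 'y')   # hypothetical class, for illustration only

    def __init__(self, x, y):
        self.x = x
        self.y = y

def to_dict(obj):
    # Prefer __dict__ when it exists; otherwise collect the slot attributes.
    if hasattr(obj, '__dict__'):
        return obj.__dict__
    return {name: getattr(obj, name) for name in obj.__slots__}

p = Point(1, 2)
has_dict = hasattr(p, '__dict__')
encoded = json.dumps(p, default=to_dict)

print(has_dict)   # False
print(encoded)    # {"x": 1, "y": 2}
```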

By the same token, to deserialize JSON into a Student instance, loads() first produces a dict object, and then the object_hook function we pass in converts that dict into a Student instance:

def dict2student(d):
    return Student(d['name'], d['age'], d['score'])

The running results are as follows:

json_str = '{"name": "Bob", "age": 20, "score": 88}'
print(json.loads(json_str, object_hook=dict2student))
# Result: <__main__.Student object at 0x000001D56BD649B0>

What is printed is the deserialized Student instance.
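
Putting both directions together gives a complete round trip. The sketch below repeats the Student class so that it runs on its own, and adds a __repr__ method (my addition, not part of the original class) so the restored instance prints readably instead of as a memory address.

```python
import json

class Student(object):
    def __init__(self, name, age, score):
        self.name = name
        self.age = age
        self.score = score

    def __repr__(self):
        # Added for this example so that print() shows the contents.
        return f'Student({self.name!r}, {self.age!r}, {self.score!r})'

def dict2student(d):
    return Student(d['name'], d['age'], d['score'])

original = Student('Bob', 20, 88)
json_str = json.dumps(original, default=lambda obj: obj.__dict__)
restored = json.loads(json_str, object_hook=dict2student)

print(json_str)   # {"name": "Bob", "age": 20, "score": 88}
print(restored)   # Student('Bob', 20, 88)
```

Note that object_hook is called for every JSON object in the input, including nested ones, so a hook like dict2student only suits documents in which every {} describes a Student.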

Practice

When serializing Chinese text to JSON, json.dumps() provides an ensure_ascii parameter. Observe the effect of this parameter on the result:

import json
obj = dict(name='小明', age=20)
s = json.dumps(obj, ensure_ascii=True)
print(s)
# Result: {"name": "\u5c0f\u660e", "age": 20}
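
For comparison, the same object can be serialized with ensure_ascii=False. The following sketch puts the two results side by side; the variable names are only for illustration.

```python
import json

obj = dict(name='小明', age=20)

escaped = json.dumps(obj, ensure_ascii=True)    # non-ASCII characters become \uXXXX escapes
readable = json.dumps(obj, ensure_ascii=False)  # non-ASCII characters are kept as-is

print(escaped)    # {"name": "\u5c0f\u660e", "age": 20}
print(readable)   # {"name": "小明", "age": 20}

# Both forms decode back to the same Python object.
same = json.loads(escaped) == json.loads(readable)
print(same)       # True
```

When writing the readable form to a file, open the file with an explicit encoding such as encoding='utf-8' so the characters are stored correctly.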

Summary

Python's language-specific serialization module is pickle, but to make serialization more general and more in line with web standards, use the json module.

The dumps() and loads() functions of the json module are good examples of well-designed interfaces. In normal use we only need to pass one required argument. When the default serialization or deserialization mechanism does not meet our needs, we can pass additional arguments to customize the rules. This keeps the interface simple and easy to use while still providing ample extensibility and flexibility.

Reference link: IO Programming-Liao Xuefeng’s official website (liaoxuefeng.com)

Origin blog.csdn.net/qq_45670134/article/details/127203316