Python string and bytes types

String (str)

Defining strings

  • A string is a sequence of characters; it is the data type a programming language uses to represent text.
  • In Python, a string can be defined with either double quotes (") or single quotes (').
  • An index retrieves the character at a given position in a string; indexes start at 0.
  • A for loop can iterate over each character of a string.
#!/usr/bin/env python3
# -*-coding:utf-8-*-

"""
@author:fyh
@time:2019/5/31
"""

str1 = "hello python"

for c in str1:
    print(c, end='\t')
    
# Output: h    e    l    l    o         p    y    t    h    o    n
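The script above covers looping; indexing can be sketched in the same way. The variable names below are only illustrative.

```python
text = "hello python"

first = text[0]       # index 0 is the first character
seventh = text[6]     # index 5 is the space, so index 6 is 'p'
last = text[-1]       # negative indexes count from the end
print(first, seventh, last)   # h p n

# enumerate() pairs each character with its index.
pairs = list(enumerate(text[:5]))
print(pairs)   # [(0, 'h'), (1, 'e'), (2, 'l'), (3, 'l'), (4, 'o')]

# An index past the end raises IndexError.
try:
    text[len(text)]
    out_of_range = False
except IndexError:
    out_of_range = True
print(out_of_range)   # True
```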

Common string operations

#!/usr/bin/env python3
# -*-coding:utf-8-*-

"""
@author:fyh
@time:2019/5/31
"""

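# 1. *  repeats a string
print('hello' * 2)

# 2. [] and [:]  get characters from a string by index or slice
print('helloworld'[2:])

# 3. in  membership operator: returns True if the string contains the given substring
print('el' in 'hello')

# 4. %  string formatting
print('alex is a good teacher')
print('%s is a good teacher' % 'alex')

# 5. +  string concatenation
a = '123'
b = 'abc'
c = '789'
d1 = a + b + c
print(d1)  # repeated + is inefficient for many pieces; prefer join

# join is more efficient
d2 = ''.join([a, b, c])
print(d2)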

String methods

Common methods

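# string.upper() converts the lowercase letters in string to uppercase
# string.lower() converts all uppercase letters in string to lowercase
# string.startswith(obj, beg=0, end=len(string)) returns True if string starts with obj, otherwise False; if beg and end are given, only that range is checked
# string.endswith(obj, beg=0, end=len(string)) returns True if string ends with obj, otherwise False; if beg or end is given, only that range is checked
# string.replace(str1, str2, num=string.count(str1)) replaces str1 with str2 in string; if num is given, at most num replacements are made
# string.strip([obj]) performs lstrip() and rstrip() on string, removing whitespace (or the characters in obj) from both ends
# string.split(sep=None, num=-1) splits string on sep (on runs of whitespace when sep is omitted); if num is given, at most num splits are made
# string.find(str, beg=0, end=len(string)) checks whether str occurs in string (within beg and end if given); returns the starting index if found, otherwise -1
# string.encode(encoding='UTF-8', errors='strict') encodes string with the given encoding and returns bytes; on failure it raises UnicodeError (a subclass of ValueError) unless errors is 'ignore' or 'replace'
# bytes.decode(encoding='UTF-8', errors='strict') decodes bytes with the given encoding and returns a str (in Python 3 this method belongs to bytes, not str); on failure it raises UnicodeError (a subclass of ValueError) unless errors is 'ignore' or 'replace'
# string.join(seq) joins the elements of seq (which must all be strings) into a new string, using string as the separator
# string.format() formats output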

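Other methods

# string.capitalize() uppercases the first character of the string and lowercases the rest
# string.center(width) returns the string centered in a new string of length width, padded with spaces
# string.count(str, beg=0, end=len(string)) returns the number of times str occurs in string (within beg and end if given)
# string.expandtabs(tabsize=8) replaces the tab characters in string with spaces; the default tab size is 8
# string.index(str, beg=0, end=len(string)) same as find(), but raises ValueError if str is not in string
# string.isalnum() returns True if string has at least one character and all characters are letters or digits, otherwise False
# string.isalpha() returns True if string has at least one character and all characters are letters, otherwise False
# string.isdecimal() returns True if string contains only decimal digits, otherwise False
# string.isdigit() returns True if string contains only digits, otherwise False
# string.islower() returns True if string contains at least one cased character and all cased characters are lowercase, otherwise False
# string.isnumeric() returns True if string contains only numeric characters, otherwise False
# string.isspace() returns True if string contains only whitespace, otherwise False
# string.istitle() returns True if string is title-cased (see title()), otherwise False
# string.isupper() returns True if string contains at least one cased character and all cased characters are uppercase, otherwise False
# string.ljust(width) returns the string left-aligned in a new string of length width, padded with spaces
# string.lstrip() removes whitespace from the left end of string
# str.maketrans(intab, outtab) creates a character translation table; in the simplest two-argument form, the first argument is a string of characters to convert and the second is a string of the same length giving their replacements
# max(str) returns the largest character in str (by code point)
# min(str) returns the smallest character in str (by code point)
# string.partition(str) a bit like a combination of find() and split(): splits string at the first occurrence of str into a 3-tuple (string_pre_str, str, string_post_str); if str is not in string, string_pre_str == string
# string.rfind(str, beg=0, end=len(string)) like find(), but searches from the right
# string.rindex(str, beg=0, end=len(string)) like index(), but searches from the right
# string.rjust(width) returns the string right-aligned in a new string of length width, padded with spaces
# string.rpartition(str) like partition(), but searches from the right
# string.rstrip() removes whitespace from the end of string
# string.splitlines(keepends=False) splits string at line boundaries and returns a list of the lines; line breaks are kept only if keepends is True
# string.swapcase() swaps the case of the letters in string
# string.title() returns a title-cased version of string: every word starts with an uppercase letter and the remaining letters are lowercase (see istitle())
# string.translate(table) translates the characters of string using table (typically built with str.maketrans()); characters mapped to None are removed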

Examples:

#!/usr/bin/env python3
# -*-coding:utf-8-*-

"""
@author:fyh
@time:2019/5/31
"""

name = "hello world"
# Convert lowercase letters to uppercase
print(name.upper())    # HELLO WORLD
print(name)            # hello world

# Convert uppercase letters to lowercase
name2 = "HELLO WORLD"
print(name2.lower())    # hello world
print(name2)            # HELLO WORLD

# Check whether the string starts with ...
name3 = "hello"
print(name3.startswith("he"))   # True
# Check whether the string ends with ...
print(name3.endswith("lo"))     # True

# replace: substitute a substring
name4 = "python"
print(name4.replace('th', 'aa'))    # pyaaon

# Strip surrounding whitespace
name5 = "  bbcc  "
print(name5.strip())    # bbcc

# Split
str1 = "aa|bb|cc|dd"
print(str1.split('|'))  # ['aa', 'bb', 'cc', 'dd']

# find: returns the starting index
str2 = "we are family"
print(str2.find("are"))     # 3

# join: concatenate with a separator
lst1 = ['aa', 'bb', 'cc', 'dd']
print("-".join(lst1))       # aa-bb-cc-dd
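A few of the less common methods from the list can be sketched the same way. The sample strings below are made up for illustration; every call is a standard Python 3 str method.

```python
sentence = "we are family"

# count(): number of non-overlapping occurrences
a_count = sentence.count("a")              # 2

# find() returns -1 when the substring is missing; index() raises ValueError
missing = sentence.find("xyz")             # -1
try:
    sentence.index("xyz")
    index_raised = False
except ValueError:
    index_raised = True

# partition() / rpartition() split once, from the left or from the right
left_split = "key=value=x".partition("=")      # ('key', '=', 'value=x')
right_split = "key=value=x".rpartition("=")    # ('key=value', '=', 'x')

# str.maketrans() builds a table that translate() applies character by character
table = str.maketrans("abc", "123")
translated = "aabbcc".translate(table)     # '112233'

# isdigit() is False as soon as one character is not a digit
digit_checks = ("2019".isdigit(), "20.19".isdigit())   # (True, False)

# title() and center() return new strings; the original is unchanged
titled = sentence.title()                  # 'We Are Family'
centered = "hi".center(6, "*")             # '**hi**'

print(a_count, missing, index_raised)
print(left_split, right_split)
print(translated, digit_checks, titled, centered)
```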

String slicing

  • A slice uses index values to bound a range and takes elements from the original sequence at the given step to form a new sequence.
  • Slicing works on strings, lists, and tuples.
  • The syntax of a slice expression is [start_index:end_index:step], where:
    • start_index: the index to start from (included)
    • end_index: the index to stop at (excluded)
    • step: the step size; a negative step walks from right to left

A slice takes elements from start_index up to, but not including, end_index, advancing by step each time; with a positive step, the last element it can include is the one at end_index - 1.

Slicing does not change the original object; it creates a new one.

#!/usr/bin/env python3
# -*-coding:utf-8-*-

"""
@author:fyh
@time:2019/5/31
"""

str1 = "hello world"


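print(str1[1:6])
# Result: 'ello ' (index 5 is the space, so the result ends with a space)

print(str1[1:6:2])
# Result: 'el ' (indexes 1, 3, and 5)

print(str1[2:])
# Result: llo world
# start_index is given and end_index is omitted, so the slice runs from start_index through the last element

print(str1[:5])
# Result: hello
# start_index is omitted and end_index is given, so the slice runs from the first element through index end_index - 1

print(str1[-1:-6:-1])
# Result: dlrow

print(str1[1:6:-1])
# Result: empty string
# The direction from start_index to end_index must match the direction of step; otherwise the slice is empty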
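As a further sketch, the same slice syntax applied to a list and a tuple, plus two common idioms (reversing with a step of -1 and copying with [:]). The names are illustrative only.

```python
text = "hello world"

reversed_text = text[::-1]      # 'dlrow olleh'
every_other = text[::2]         # 'hlowrd'
clipped = text[3:100]           # 'lo world' (an end index past the end is clipped)

numbers = [1, 2, 3, 4]
point = (1, 2, 3, 4)
middle = numbers[1:3]           # [2, 3]
head = point[:2]                # (1, 2)

# A slice is a new object: changing the copy leaves the original alone.
copied = numbers[:]
copied.append(5)

print(reversed_text, every_other, clipped)
print(middle, head)
print(numbers, copied)          # [1, 2, 3, 4] [1, 2, 3, 4, 5]
```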

String formatting

Python offers three ways to format strings: the percent (%) operator, str.format(), and f-strings.

Percent style

Format: %[(name)][flags][width].[precision]typecode

  • (name): optional; selects a value by key from a dict

  • flags: optional; possible values are:

    • +: right-aligned; positive numbers get a plus sign, negative numbers a minus sign
    • -: left-aligned; positive numbers get no sign, negative numbers a minus sign
    • space: right-aligned; positive numbers get a leading space, negative numbers a minus sign
    • 0: right-aligned; positive numbers get no sign, negative numbers a minus sign; the remaining width is padded with 0
  • width: optional; the minimum field width

  • precision: optional; the number of digits kept after the decimal point

  • typecode: required

    • s: takes the return value of the object's __str__ method and formats it into the specified position
    • r: takes the return value of the object's __repr__ method and formats it into the specified position
    • c: for an integer, inserts the character with that Unicode code point (decimal range 0 <= i <= 1114111; Python 2.7 supports only 0-255); for a character, inserts the character itself
    • o: converts an integer to octal notation and formats it into the specified position
    • x: converts an integer to hexadecimal notation and formats it into the specified position
    • d: converts an integer or float to a decimal integer and formats it into the specified position
    • e: converts an integer or float to scientific notation (lowercase e) and formats it into the specified position
    • E: converts an integer or float to scientific notation (uppercase E) and formats it into the specified position
    • f: converts an integer or float to fixed-point notation (6 decimal places by default) and formats it into the specified position
    • F: same as f
    • g: converts an integer or float, switching automatically between fixed-point and scientific notation; scientific notation (lowercase e) is used when the exponent is less than -4 or not less than the precision
    • G: same as g, but scientific notation uses an uppercase E
    • %: when the string is being formatted, a literal percent sign must be written as %%

Note: percent formatting has no typecode that converts an integer to its binary representation.

Examples:

tpl = "i am %s" % "alex"
  
tpl = "i am %s age %d" % ("alex", 18)
  
tpl = "i am %(name)s age %(age)d" % {"name": "alex", "age": 18}
  
tpl = "percent %.2f" % 99.97623
  
tpl = "i am %(pp).2f" % {"pp": 123.425556, }
  
tpl = "i am %.2f %%" % 123.425556
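The examples above do not exercise the flags or most typecodes. The following sketch does, using arbitrary sample values; since there is no binary typecode, the last line falls back on the built-in bin().

```python
width_right = "[%5d]" % 42        # '[   42]'  right-aligned by default
width_left = "[%-5d]" % 42        # '[42   ]'  - flag: left-aligned
plus_sign = "[%+d]" % 42          # '[+42]'    + flag: explicit plus sign
space_sign = "[% d]" % 42         # '[ 42]'    space flag
zero_pad = "[%05d]" % 42          # '[00042]'  0 flag: pad with zeros

bases = "%x %o" % (255, 8)        # 'ff 10'
scientific = "%e" % 12345.678     # '1.234568e+04'
chars = "%c%c" % (65, "B")        # 'AB'
repr_vs_str = "%r / %s" % ("hi", "hi")   # "'hi' / hi"
truncated = "%d" % 3.7            # '3'  (%d drops the fractional part)

# No binary typecode exists, so convert first and insert with %s.
binary = "%s" % bin(5)            # '0b101'

for value in (width_right, width_left, plus_sign, space_sign, zero_pad,
              bases, scientific, chars, repr_vs_str, truncated, binary):
    print(value)
```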

str.format() style

Format: [[fill]align][sign][#][0][width][,][.precision][type]

    • fill: optional; the character used to pad empty space

    • align: optional; the alignment (used together with width)

      • <: left-aligned (the default for strings)
      • >: right-aligned (the default for numbers)
      • =: right-aligned, with the padding placed between the sign and the digits; valid only for numeric types, giving sign + padding + digits
      • ^: centered
    • sign: optional; controls the sign shown on numbers

      • +: plus sign on positive numbers, minus sign on negative numbers
      • -: no sign on positive numbers, minus sign on negative numbers
      • space: leading space on positive numbers, minus sign on negative numbers
    • #: optional; for binary, octal, and hexadecimal output, adds the 0b / 0o / 0x prefix, which is otherwise omitted

    • ,: optional; adds a thousands separator, e.g. 1,000,000

    • width: optional; the minimum field width

    • .precision: optional; the number of digits kept after the decimal point

    • type: optional; the presentation type

      • For string arguments

        • s: formats the value as a string
        • blank: no type given; behaves the same as s
      • For integer arguments

        • b: converts the decimal integer to binary, then formats it
        • c: converts the decimal integer to the corresponding Unicode character
        • d: decimal integer
        • o: converts the decimal integer to octal, then formats it
        • x: converts the decimal integer to hexadecimal (lowercase), then formats it
        • X: converts the decimal integer to hexadecimal (uppercase), then formats it
      • For float or decimal arguments
        • e: converts to scientific notation (lowercase e), then formats it
        • E: converts to scientific notation (uppercase E), then formats it
        • f: converts to fixed-point notation (6 decimal places by default), then formats it
        • F: converts to fixed-point notation (6 decimal places by default), then formats it
        • g: switches automatically between e and f
        • G: switches automatically between E and F
        • %: multiplies the number by 100 and shows it in fixed-point notation followed by a percent sign (6 decimal places by default)

Examples:

tpl = "i am {}, age {}, {}".format("seven", 18, 'alex')
   
tpl = "i am {}, age {}, {}".format(*["seven", 18, 'alex'])
   
tpl = "i am {0}, age {1}, really {0}".format("seven", 18)
   
tpl = "i am {0}, age {1}, really {0}".format(*["seven", 18])
   
tpl = "i am {name}, age {age}, really {name}".format(name="seven", age=18)
   
tpl = "i am {name}, age {age}, really {name}".format(**{"name": "seven", "age": 18})
   
tpl = "i am {0[0]}, age {0[1]}, really {0[2]}".format([1, 2, 3], [11, 22, 33])
   
tpl = "i am {:s}, age {:d}, money {:f}".format("seven", 18, 88888.1)
   
tpl = "i am {:s}, age {:d}".format(*["seven", 18])
   
tpl = "i am {name:s}, age {age:d}".format(name="seven", age=18)
   
tpl = "i am {name:s}, age {age:d}".format(**{"name": "seven", "age": 18})
  
tpl = "numbers: {:b},{:o},{:d},{:x},{:X}, {:%}".format(15, 15, 15, 15, 15, 15.87623, 2)
  
tpl = "numbers: {0:b},{0:o},{0:d},{0:x},{0:X}, {0:%}".format(15)
  
tpl = "numbers: {num:b},{num:o},{num:d},{num:x},{num:X}, {num:%}".format(num=15)
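The examples above mostly use the type field. A short sketch of the remaining fields (fill, align, sign, #, the comma, and precision), with arbitrary sample values:

```python
fill_right = "{:*>8}".format(42)       # '******42'
fill_left = "{:*<8}".format(42)        # '42******'
fill_center = "{:*^8}".format(42)      # '***42***'
sign_pad = "{:0=8}".format(-42)        # '-0000042'  padding after the sign

plus = "{:+d} {:+d}".format(42, -42)   # '+42 -42'
prefixed = "{:#b} {:#o} {:#x}".format(5, 8, 255)   # '0b101 0o10 0xff'
grouped = "{:,}".format(1000000)       # '1,000,000'
rounded = "{:.2f}".format(3.14159)     # '3.14'
percent = "{:.1%}".format(0.1587623)   # '15.9%'  (value is multiplied by 100)

# With only a width, strings default to left and numbers to right alignment.
default_align = "[{:6}] [{:6}]".format("ab", 42)   # '[ab    ] [    42]'

for value in (fill_right, fill_left, fill_center, sign_pad, plus,
              prefixed, grouped, rounded, percent, default_align):
    print(value)
```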

f-strings

f-strings (available since Python 3.6) provide a concise, readable way to embed Python expressions inside a string. An f-string is prefixed with the letter 'f' or 'F', and the string itself can be delimited by single quotes, double quotes, triple single quotes, or triple double quotes.

#!/usr/bin/env python3
# -*-coding:utf-8-*-

"""
@author:fyh
@time:2019/6/5
"""
name = '豪仔'
age = 26
format_string1 = f'My name is {name}, my age is {age}'
print(format_string1)  # My name is 豪仔, my age is 26

format_string2 = f"My name is {name}, my age is {age}"
print(format_string2)   # My name is 豪仔, my age is 26

format_string3 = F'''My name is {name}, my age is {age}'''
print(format_string3)  # My name is 豪仔, my age is 26

format_string4 = F"""My name is {name}, my age is {age}"""
print(format_string4)   # My name is 豪仔, my age is 26

# Expressions inside the braces are evaluated first
format_string5 = f'3 + 5 = {3 + 5}'
print(format_string5)   # 3 + 5 = 8

a = 10
b = 20
format_string6 = f'a + b = {a + b}'
print(format_string6)   # a + b = 30

# Doubled braces are replaced by a single brace; note that {{}} does not denote an expression
format_string7 = F'My name is {{name}}, my age is {{age}}'
print(format_string7)   # My name is {name}, my age is {age}
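f-strings accept the same format specification as str.format() after a colon inside the braces. A minimal sketch with made-up values:

```python
name = "alex"
price = 1234.5
width = 10

money = f"{price:,.2f}"          # '1,234.50'  thousands separator, 2 decimals
padded = f"[{name:>8}]"          # '[    alex]'
quoted = f"{name!r}"             # "'alex'"    !r applies repr()
shouted = f"{name.upper()}"      # 'ALEX'      any expression is allowed
hex_value = f"{255:#x}"          # '0xff'
dynamic = f"[{name:^{width}}]"   # '[   alex   ]'  the width itself is an expression

for value in (money, padded, quoted, shouted, hex_value, dynamic):
    print(value)
```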

Bytes (bytes)

One of the most important changes in Python 3 is the clear distinction between text strings and binary data. Text is always Unicode and is represented by the str type; binary data is represented by the bytes type. Python 3 never mixes str and bytes implicitly: you cannot concatenate a string with a byte string, you cannot search for a string inside a byte string (or vice versa), and you cannot pass a string to a function that expects bytes (or vice versa).
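A small sketch of what happens when the two types are mixed: Python 3 raises TypeError rather than converting implicitly, so one side has to be converted explicitly first. The variable names are illustrative.

```python
text = "abc"
data = b"abc"

errors = []

# Concatenating str and bytes fails.
try:
    text + data
except TypeError as exc:
    errors.append(("concatenate", str(exc)))

# Searching for a str inside bytes fails as well.
try:
    text in data
except TypeError as exc:
    errors.append(("search", str(exc)))

for operation, message in errors:
    print(operation, "->", message)

# Convert one side explicitly and both operations work.
joined = text + data.decode("ascii")     # 'abcabc'
found = text.encode("ascii") in data     # True
print(joined, found)
```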

A brief history of character encodings

In the early days of computing, the industry was led by the United States and other English-speaking countries, where the 26 English letters are enough to form every word, sentence, and article. The first character encoding was therefore ASCII, a standard that stores each character in 1 byte (8 bits, of which 7 are used) and covers everything needed to write English.

What is an encoding? A character encoding represents characters as binary numbers. Everything, whether English, Chinese, or symbols, is ultimately stored on disk as a sequence like 01010101. Inside the computer, reading and storing data comes down to handling streams of 0 and 1 bits. The problem is that humans cannot read these bit streams, so how do they become readable? That is the job of a character encoding: it acts as a translator inside the computer, transparently turning bit streams into text that people can understand. Ordinary users do not need to know how this works, but programmers must understand it.

Take ASCII as an example. It specifies that one byte (8 bits) represents one character: the decoder reads a span as wide as "00000000" and interprets each byte on its own. For example, 01000001 is the capital letter A; for convenience we sometimes write it as the decimal number 65. Eight bits can represent at most 2 to the power of 8 (256) distinct values, and ASCII itself defines 128 characters.

Later, as computers spread, Chinese, Japanese, Korean, and other scripts also had to be represented, and ASCII's range was far too small. Standards bodies therefore created a universal character set called Unicode. Its early fixed-width form used at least two bytes for every character, whatever the language, and more when needed, so even an English letter took 2 bytes. This met every requirement, but it was not compatible with ASCII and took more disk space and memory: most characters in the computer world are English letters, which fit in one byte but now needed two.

Then UTF-8 appeared. It encodes English letters in 1 byte and Chinese characters in 3 bytes, among other lengths. It is therefore compatible with ASCII and can decode older documents. UTF-8 quickly came into wide use.

Along the way, encodings for Chinese were also created, such as GBK, GB2312, and BIG5. They are used mainly regionally and are not widely recognized elsewhere. In GBK, a Chinese character occupies 2 bytes.
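The byte counts mentioned in this history can be checked directly. This is only a sketch: it uses 'utf-16-le' as a stand-in for the two-byte Unicode form, which is an assumption made for illustration, and the two sample characters are arbitrary.

```python
# 'A' is code point 65, i.e. the bit pattern 01000001.
code_point = ord("A")                  # 65
bits = format(code_point, "08b")       # '01000001'
print(code_point, bits)

# Number of bytes each character needs in several encodings.
sizes = {}
for char in ("A", "中"):
    for encoding in ("utf-8", "utf-16-le", "gbk"):
        sizes[(char, encoding)] = len(char.encode(encoding))

for (char, encoding), size in sizes.items():
    print(char, encoding, size)

# ASCII has no code for Chinese characters, so encoding fails.
try:
    "中".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)   # False
```

The letter takes 1 byte in UTF-8 and GBK but 2 in UTF-16, while the Chinese character takes 3 bytes in UTF-8 and 2 in both UTF-16 and GBK.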

Similarities and differences between bytes and str

bytes and str are two sides of the same coin. bytes is a bit stream, something like 01010001110. Whether we are writing code or reading an article, nobody reads such a bit stream directly; it takes an encoding to give the bits meaning instead of leaving them an opaque pile of 0s and 1s. Because different encodings interpret the same bit stream differently, this causes a lot of trouble in practice. Let's see how Python handles these encoding problems:

>>> s = "中文"
>>> s
'中文'
>>> type(s)
<class 'str'>
>>> b = bytes(s, encoding='utf-8')
>>> b
b'\xe4\xb8\xad\xe6\x96\x87'
>>> type(b)
<class 'bytes'>

As the example shows, s is a string. The built-in function bytes() converts a str to the bytes type. b is really just a sequence of 0s and 1s, but to make it readable in the interactive environment it is displayed as b'\xe4\xb8\xad\xe6\x96\x87', where the leading b marks the bytes type. \xe4 is hexadecimal notation for one byte, so encoding "中文" as UTF-8 yields 6 bytes in total, 3 per character, which confirms the discussion above. When calling bytes() on a str, you must specify the encoding argument; it cannot be omitted.

As we know, str has an encode() method, which encodes a string into bytes. The bytes type has the matching decode() method, which decodes bytes back into a string. In addition, if you look at the Python source you will find that bytes and str have almost the same list of methods; the biggest difference is encode versus decode.

In essence, a string is also stored on disk as a sequence of 0s and 1s, so it too has to be encoded and decoded.

If the description above has not made the difference clear, keep the following three points in mind:

  1. When a string is written to disk or read back from disk in text mode, Python performs the encoding and decoding for you, so you do not need to manage the process yourself.
  2. Using the bytes type tells Python that it should not encode or decode automatically; you do it yourself and specify the encoding.
  3. Python strictly distinguishes bytes from str: you cannot pass a str where bytes is required, and vice versa. You are most likely to run into this when reading and writing files on disk.
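The three points can be sketched with a temporary file. The file name demo.txt is hypothetical, and the encoding is passed explicitly here so the result does not depend on the platform default.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")   # hypothetical file

# Point 1: in text mode you pass str and Python encodes it on write.
with open(path, "w", encoding="utf-8") as f:
    f.write("中文")

# Point 2: in binary mode you get the raw bytes and decode them yourself.
with open(path, "rb") as f:
    raw = f.read()
print(raw)                     # b'\xe4\xb8\xad\xe6\x96\x87'
print(raw.decode("utf-8"))     # 中文

# Text mode on read decodes for you.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Point 3: a binary file rejects str; it accepts only bytes.
with open(path, "wb") as f:
    try:
        f.write("中文")
        write_failed = False
    except TypeError:
        write_failed = True
    f.write("中文".encode("utf-8"))
print(text, write_failed)      # 中文 True
```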

Converting between str and bytes is an encoding or decoding step, so the encoding should be stated explicitly (bytes() requires it when given a str).

#!/usr/bin/env python3
# -*-coding:utf-8-*-

"""
@author:fyh
@time:2019/5/31
"""
# bytes encoded as GBK
b = bytes('中国', encoding='gbk')
print(b)        # b'\xd6\xd0\xb9\xfa'
print(type(b))  # <class 'bytes'>

str1 = str(b)
print(str1)     # b'\xd6\xd0\xb9\xfa'
print(type(str1))   # <class 'str'>

# Specify the encoding
str2 = str(b, encoding='gbk')
print(str2)     # 中国
print(type(str2))   # <class 'str'>

# Convert the string str2 to bytes encoded as UTF-8:
b2 = bytes(str2, encoding='utf8')
print(b2)       # b'\xe4\xb8\xad\xe5\x9b\xbd'
print(type(b2))     # <class 'bytes'>

encode and decode

As mentioned above, Python 3 strings are Unicode text (str), while binary data is represented by the bytes type.

  • encode: converts str to bytes; this is the encoding step
  • decode: converts bytes to str; this is the decoding step
#!/usr/bin/env python3
# -*-coding:utf-8-*-

"""
@author:fyh
@time:2019/5/31
"""

"""
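The decode() and encode() methods accept arguments; their signatures are:
    bytes.decode(encoding="utf-8", errors="strict")
    str.encode(encoding="utf-8", errors="strict")

        encoding is the character encoding (noun) used when decoding/encoding (verb)

        errors is the error-handling scheme; its default value, strict, means a failed encode or decode raises UnicodeError

        To ignore encoding and decoding errors, set errors to ignore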
"""


# Encode
str1 = "中国"
 
str1_utf8 = str1.encode(encoding='utf8')
print(str1_utf8)        # b'\xe4\xb8\xad\xe5\x9b\xbd'
print(type(str1_utf8))  # <class 'bytes'>

str1_gbk = str1.encode(encoding="gbk")
print(str1_gbk)         # b'\xd6\xd0\xb9\xfa'
print(type(str1_gbk))   # <class 'bytes'>
 

# Decode
str2 = str1_utf8.decode(encoding="utf8")
print(str2)     # 中国
print(type(str2))   # <class 'str'>

str3 = str1_gbk.decode(encoding="gbk")
print(str3)         # 中国
print(type(str3))   # <class 'str'>

The encoding used to decode must match the encoding that was used to encode.

encode() and decode() deliver their result as a return value; the original object is not modified.
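A short sketch of what a mismatched encoding looks like, and of how the errors argument changes the outcome. The sample strings are arbitrary.

```python
text = "中国"
gbk_bytes = text.encode("gbk")

# Decoding with a different encoding than the one used to encode fails.
try:
    gbk_bytes.decode("utf-8")
    mismatch_raised = False
except UnicodeDecodeError:
    mismatch_raised = True

# Decoding with the matching encoding restores the text.
round_trip = gbk_bytes.decode("gbk")                  # '中国'

# errors="ignore" drops what cannot be encoded; errors="replace" substitutes it.
ignored = "a中b".encode("ascii", errors="ignore")      # b'ab'
replaced = "a中b".encode("ascii", errors="replace")    # b'a?b'
decoded_replaced = b"a\xffb".decode("ascii", errors="replace")   # 'a\ufffdb'

# The original objects are untouched; each call returned a new value.
print(mismatch_raised, round_trip, text)
print(ignored, replaced, decoded_replaced)
```

Note that ignore and replace lose information silently, so they are best reserved for cases where a damaged result is acceptable.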

Origin www.cnblogs.com/fengyuhao/p/11697516.html