Python 3 standard library: string, general string operations

1. string: General String Operations

Python's string module dates from the earliest versions of the language. Many of the functions it once provided have been moved to methods of str objects, but the module still retains several useful constants and classes for working with str objects.

1.1 Constants

string.ascii_letters

  The concatenation of the ascii_lowercase and ascii_uppercase constants described below. This value is not locale-dependent.

string.ascii_lowercase

  The lowercase letters 'abcdefghijklmnopqrstuvwxyz'. This value is not locale-dependent and will not change.

string.ascii_uppercase

  The uppercase letters 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'. This value is not locale-dependent and will not change.

string.digits

  String '0123456789'.

string.hexdigits

  String '0123456789abcdefABCDEF'.

string.octdigits

  String '01234567'.

string.punctuation

  String of ASCII characters which are considered punctuation characters in the C locale.

string.printable

  String of ASCII characters which are considered printable. This is a combination of digits, ascii_letters, punctuation, and whitespace.

string.whitespace

  A string containing all ASCII characters that are considered whitespace. This includes the characters space, tab, linefeed, return, formfeed, and vertical tab.

Print out the string module constants:

import inspect
import string

def is_str(value):
    return isinstance(value, str)

for name, value in inspect.getmembers(string, is_str):
    if name.startswith('_'):
        continue
    print('%s=%r\n' % (name, value))

These constants are useful when working with ASCII data, but because non-ASCII text in some form of Unicode is increasingly common, their application is limited.
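As a small illustration of how these constants tend to be used, the sketch below checks whether a string consists only of hexadecimal digits and confirms how ascii_letters and printable are assembled. The helper name is_hex is hypothetical and is not part of the string module.

```python
import string


def is_hex(text):
    """Return True if text is non-empty and every character is a hex digit."""
    # is_hex is an illustrative helper, not a function from the string module.
    return bool(text) and all(ch in string.hexdigits for ch in text)


# ascii_letters is the lowercase constant followed by the uppercase one.
assert string.ascii_letters == string.ascii_lowercase + string.ascii_uppercase

# printable is made up of digits, ascii_letters, punctuation and whitespace.
assert set(string.printable) == set(
    string.digits + string.ascii_letters + string.punctuation + string.whitespace
)

print(is_hex('DeadBeef'))  # True
print(is_hex('xyz'))       # False
```

Membership tests like this only cover ASCII, which is the limitation noted above.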

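1.2 Custom String Formatting

The Formatter class implements the same layout specification language as the format() method of str. Its features include type coercion, alignment, attribute and field references, named and positional template arguments, and type-specific formatting options. In most cases the format() method is a more convenient interface to these features, but Formatter can be subclassed for cases where variations are needed.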

class string.Formatter

The Formatter class has the following public methods:

format(format_string, /, *args, **kwargs)
   The primary API method. It takes a format string and an arbitrary set of positional and keyword arguments. It is just a wrapper that calls vformat().
   Changed in version 3.7: The format string argument is now positional-only.
vformat(format_string, args, kwargs)
   This function does the actual work of formatting. It is exposed as a separate function for cases where you want to pass in a predefined dictionary of arguments, rather than unpacking and repacking the dictionary as individual arguments using the *args and **kwargs syntax. vformat() does the work of breaking up the format string into character data and replacement fields. It calls the various methods described below.
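The relationship between the two public methods can be sketched as follows. The format string and the data dictionary are made up for illustration; the point is only that format() receives unpacked arguments while vformat() receives the tuple and the dictionary directly.

```python
import string

fmt = string.Formatter()
data = {'name': 'world', 'count': 3}

# format() takes individual positional and keyword arguments.
a = fmt.format('{greeting}, {name}! x{count}', greeting='Hello', **data)

# vformat() takes the positional tuple and the keyword dictionary as they are.
b = fmt.vformat('{greeting}, {name}! x{count}', (), dict(data, greeting='Hello'))

print(a)       # Hello, world! x3
print(a == b)  # True
```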

In addition, Formatter defines a number of methods that are intended to be replaced by subclasses:

parse(format_string)
   Loop over format_string and return an iterable of tuples (literal_text, field_name, format_spec, conversion). This is used by vformat() to break the string into either literal text or replacement fields.
   The values in the tuple conceptually represent a span of literal text followed by a single replacement field. If there is no literal text (which can happen if two replacement fields occur consecutively), then literal_text will be a zero-length string. If there is no replacement field, then the values of field_name, format_spec and conversion will be None.
get_field(field_name, args, kwargs)
   Given field_name as returned by parse(), convert it to an object to be formatted. Returns a tuple (obj, used_key). The default version takes strings of the form defined in PEP 3101, such as "0[name]" or "label.title". args and kwargs are as passed in to vformat(). The return value used_key has the same meaning as the key parameter to get_value().
get_value(key, args, kwargs)
   Retrieve a given field value. The key argument will be either an integer or a string. If it is an integer, it represents the index of the positional argument in args; if it is a string, it represents a named argument in kwargs.
   The args parameter is set to the list of positional arguments to vformat(), and the kwargs parameter is set to the dictionary of keyword arguments.
   For compound field names, this function is only called for the first component of the field name; subsequent components are handled through normal attribute and indexing operations.
   So, for example, the field expression '0.name' would cause get_value() to be called with a key argument of 0. The name attribute is looked up after get_value() returns, by calling the built-in getattr() function.
   If the index or keyword refers to an item that does not exist, then an IndexError or KeyError should be raised.
check_unused_args(used_args, args, kwargs)
   Implement checking for unused arguments if desired. The arguments to this function are the set of all argument keys that were actually referred to in the format string (integers for positional arguments, and strings for named arguments), and a reference to the args and kwargs that were passed to vformat(). The set of unused arguments can be calculated from these parameters. check_unused_args() is assumed to raise an exception if the check fails.
format_field(value, format_spec)
   format_field() simply calls the global format() built-in. The method is provided so that subclasses can override it.
convert_field(value, conversion)
   Converts the value (returned by get_field()) given a conversion type (as in the tuple returned by the parse() method). The default version understands the 's' (str), 'r' (repr) and 'a' (ascii) conversion types.
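To make the overridable methods above more concrete, here is a minimal sketch of two subclasses. Both class names (StrictFormatter, DefaultFormatter) and the '<missing>' marker are hypothetical choices for illustration; only the overridden method names and their signatures come from the Formatter class itself.

```python
import string


class StrictFormatter(string.Formatter):
    """Hypothetical subclass: rejects arguments the format string never uses."""

    def check_unused_args(self, used_args, args, kwargs):
        # used_args holds the integer indexes and string names that were referenced.
        unused = [i for i in range(len(args)) if i not in used_args]
        unused += [k for k in kwargs if k not in used_args]
        if unused:
            raise ValueError('unused arguments: %r' % (unused,))


class DefaultFormatter(string.Formatter):
    """Hypothetical subclass: substitutes a marker for missing keyword names."""

    def get_value(self, key, args, kwargs):
        if isinstance(key, str) and key not in kwargs:
            return '<missing>'
        return super().get_value(key, args, kwargs)


# parse() yields (literal_text, field_name, format_spec, conversion) tuples.
print(list(string.Formatter().parse('a{0!r:>5}b')))

print(DefaultFormatter().format('{x} and {y}', x=1))  # 1 and <missing>

try:
    StrictFormatter().format('{0}', 'used', 'extra')
except ValueError as err:
    print('ERROR:', err)  # ERROR: unused arguments: [1]
```

Overriding get_value() changes how the first component of a field name is resolved, while check_unused_args() runs once after the whole string has been processed.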

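1.3 Template Strings

String templates were added as an alternative to the built-in interpolation syntax. With string.Template interpolation, variables are identified by prefixing the name with $ (e.g., $var) or, when they need to be set off from the surrounding text, by also wrapping the name in curly braces (e.g., ${var}).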

Template strings support $-based substitutions, using the following rules:

$$ is an escape; it is replaced with a single $.

$identifier names a substitution placeholder matching a mapping key of "identifier". By default, "identifier" is restricted to any case-insensitive ASCII alphanumeric string (including underscores) that starts with an underscore or ASCII letter. The first non-identifier character after the $ character terminates this placeholder specification.

${identifier} is equivalent to $identifier. It is required when valid identifier characters follow the placeholder but are not part of the placeholder, such as "${noun}ification".

Any other appearance of $ in the string will result in a ValueError being raised.
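The rules above can be exercised with a short sketch. The template texts and values are invented for illustration.

```python
from string import Template

# $$ is an escape for a literal dollar sign; $item is a placeholder.
print(Template('$$5 for $item').substitute(item='tea'))      # $5 for tea

# Braces separate the placeholder from the identifier characters that follow.
print(Template('${noun}ification').substitute(noun='gam'))   # gamification

# A $ that starts neither an escape nor a valid placeholder is rejected.
try:
    Template('cost: $ 5').substitute()
except ValueError as err:
    print('ValueError:', err)
```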

class string.Template(template)

The constructor takes a single argument, which is the template string.

substitute(mapping={}, /, **kwds)
   Performs the template substitution, returning a new string. mapping is any dictionary-like object with keys that match the placeholders in the template. Alternatively, you can provide keyword arguments, where the keywords are the placeholders. When both mapping and kwds are given and there are duplicates, the placeholders from kwds take precedence.
safe_substitute(mapping={}, /, **kwds)
   Like substitute(), except that if placeholders are missing from mapping and kwds, the original placeholder appears in the resulting string intact instead of a KeyError being raised. Also, unlike with substitute(), any other appearance of $ simply returns $ instead of raising ValueError.
   While other exceptions may still occur, this method is called "safe" because it always tries to return a usable string instead of raising an exception. In another sense, safe_substitute() may be anything other than safe, since it silently ignores malformed templates containing dangling delimiters, unmatched braces, or placeholders that are not valid Python identifiers.
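A brief sketch of the two behaviors described above: keyword arguments overriding the mapping, and safe_substitute() passing through what it cannot resolve. The template texts are illustrative only.

```python
from string import Template

t = Template('$who likes $what')

# Both sources supply "what"; the keyword argument takes precedence.
print(t.substitute({'who': 'tim', 'what': 'kung pao'}, what='pie'))
# tim likes pie

# safe_substitute() leaves unknown placeholders and stray $ signs untouched.
print(Template('$who owes $ 5 to $other').safe_substitute(who='tim'))
# tim owes $ 5 to $other
```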

The following example compares a simple template with similar string interpolation using the % operator and with the newer format string syntax of str.format().

import string

values = {'var': 'foo'}

t = string.Template("""
Variable        : $var
Escape          : $$
Variable in text: ${var}iable
""")

print('TEMPLATE:', t.substitute(values))

s = """
Variable        : %(var)s
Escape          : %%
Variable in text: %(var)siable
"""

print('INTERPOLATION:', s % values)

s = """
Variable        : {var}
Escape          : {{}}
Variable in text: {var}iable
"""

print('FORMAT:', s.format(**values))

In the first two cases, the trigger character ($ or %) is escaped by repeating it twice. For the format syntax, both { and } need to be escaped by repeating them.

One key difference between templates and string interpolation or formatting is that the type of the arguments is not considered. The values are converted to strings, and the strings are inserted into the result. No formatting options are available. For example, there is no way to control the number of digits used to represent a floating-point value.
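The lack of formatting options can be seen in a small comparison. The variable name value is arbitrary.

```python
from string import Template

value = 3.14159

# Template only converts the value with str(); precision cannot be specified.
print(Template('pi is $value').substitute(value=value))  # pi is 3.14159

# str.format() and the % operator both accept a format specification.
print('pi is {value:.2f}'.format(value=value))           # pi is 3.14
print('pi is %(value).2f' % {'value': value})            # pi is 3.14
```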

One benefit of templates, though, is that the safe_substitute() method makes it possible to avoid exceptions when not all of the values needed by the template are provided.

import string

values = {'var': 'foo'}

t = string.Template("$var is here but $missing is not provided")

try:
    print('substitute()     :', t.substitute(values))
except KeyError as err:
    print('ERROR:', str(err))

print('safe_substitute():', t.safe_substitute(values))

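Since there is no value for missing in the values dictionary, substitute() raises a KeyError. Instead of raising the error, safe_substitute() catches it and leaves the variable expression alone in the text.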

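Advanced usage: you can derive subclasses of Template to customize the placeholder syntax, the delimiter character, or the entire regular expression used to parse template strings. To do this, you can override these class attributes:

delimiter -- This is the literal string describing a placeholder-introducing delimiter. The default value is $. Note that this should not be a regular expression, as the implementation will call re.escape() on this string as needed. Note further that you cannot change the delimiter after class creation (i.e. a different delimiter must be set in the subclass's class namespace).

idpattern -- This is the regular expression describing the pattern for non-braced placeholders. The default value is the regular expression (?a:[_a-z][_a-z0-9]*). If this is given and braceidpattern is None, this pattern will also apply to braced placeholders.

Note: Since the default flags is re.IGNORECASE, the pattern [a-z] can match some non-ASCII characters. That is why the local a flag is used here.

Changed in version 3.7: braceidpattern can be used to define separate patterns used inside and outside the braces.

braceidpattern -- This is like idpattern but describes the pattern for braced placeholders. It defaults to None, which means to fall back to idpattern (i.e. the same pattern is used both inside and outside braces). If given, this allows you to define different patterns for braced and unbraced placeholders.

New in version 3.7.

flags -- The regular expression flags that will be applied when compiling the regular expression used for recognizing substitutions. The default value is re.IGNORECASE. Note that re.VERBOSE will always be added to the flags, so a custom idpattern must follow the conventions for verbose regular expressions.

New in version 3.2.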
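As a hedged sketch of the braceidpattern attribute described above (Python 3.7 or later), the hypothetical subclass below allows dots inside braced placeholders while unbraced placeholders keep the default pattern. The class name DottedTemplate and the sample keys are assumptions made for illustration.

```python
import string


class DottedTemplate(string.Template):
    # Hypothetical subclass: dots are accepted inside ${...} only.
    braceidpattern = r'(?a:[_a-z][_a-z0-9.]*)'


t = DottedTemplate('${user.name} vs $user.name')

# The braced form looks up 'user.name'; the unbraced form stops at the dot.
print(t.substitute({'user.name': 'Ann', 'user': 'Bob'}))  # Ann vs Bob.name
```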

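Alternatively, you can provide the entire regular expression pattern by overriding the class attribute pattern. If you do this, the value must be a regular expression object with four named capturing groups. The capturing groups correspond to the rules given above, along with the invalid placeholder rule:

escaped -- This group matches the escape sequence, e.g. $$, in the default pattern.

named -- This group matches the unbraced placeholder name; it should not include the delimiter in the capturing group.

braced -- This group matches the brace-enclosed placeholder name; it should not include either the delimiter or the braces in the capturing group.

invalid -- This group matches any other delimiter pattern (usually a single delimiter), and it should appear last in the regular expression.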

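The default syntax of string.Template can be changed by adjusting the regular expression patterns it uses to find variable names in the template body. A simple way to do that is to change the delimiter and idpattern class attributes.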

import string

class MyTemplate(string.Template):
    delimiter = '%'
    idpattern = '[a-z]+_[a-z]+'

template_text = '''
  Delimiter : %%
  Replaced  : %with_underscore
  Ignored   : %notunderscored
'''

d = {
    'with_underscore': 'replaced',
    'notunderscored': 'not replaced',
}

t = MyTemplate(template_text)
print('Modified ID pattern:')
print(t.safe_substitute(d))

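In this example, the substitution rules are changed so that the delimiter is % instead of $ and variable names must include an underscore somewhere in the middle. The pattern %notunderscored is not replaced by anything, because it does not include an underscore character.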

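For even more complex changes, it is possible to override the pattern attribute and define an entirely new regular expression. The pattern provided must contain four named groups for capturing the escaped delimiter, the named variable, a braced version of the variable name, and invalid delimiter patterns.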

import string

t = string.Template('$var')
print(t.pattern.pattern)

The value of t.pattern is a compiled regular expression, but the original string is available via its pattern attribute.
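Because the class attribute is an ordinary compiled regular expression, it can also be used directly. The sketch below lists the placeholder names found in a template by reading the named and braced groups; the template text is invented for illustration.

```python
import string

t = string.Template('$greeting, ${name}! That is $$5.')

# Each match fills at most one of the groups; escapes fill only "escaped".
names = [
    m.group('named') or m.group('braced')
    for m in t.pattern.finditer(t.template)
    if m.group('named') or m.group('braced')
]
print(names)  # ['greeting', 'name']
```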

This example defines a new pattern to create a new type of template, using {{var}} as the variable syntax.

import re
import string


class MyTemplate(string.Template):
    delimiter = '{{'
    pattern = r'''
    \{\{(?:
    (?P<escaped>\{\{)|
    (?P<named>[_a-z][_a-z0-9]*)\}\}|
    (?P<braced>[_a-z][_a-z0-9]*)\}\}|
    (?P<invalid>)
    )
    '''

t = MyTemplate('''
{{{{
{{var}}
''')

print('MATCHES:', t.pattern.findall(t.template))
print('SUBSTITUTED:', t.safe_substitute(var='replacement'))

Both the named and braced patterns must be provided separately, even though they are the same.

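1.4 Helper Functions

string.capwords(s, sep=None)

   Split the argument into words using str.split(), capitalize each word using str.capitalize(), and join the capitalized words using str.join(). If the optional second argument sep is absent or None, runs of whitespace characters are replaced by a single space and leading and trailing whitespace is removed; otherwise sep is used to split and join the words.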
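The three steps in this description can be sketched as a one-line function. The name capwords_sketch is hypothetical; it is shown only to mirror the description and to illustrate the effect of the sep argument.

```python
import string


def capwords_sketch(s, sep=None):
    # Split, capitalize each word, then join, as the description states.
    return (sep or ' ').join(word.capitalize() for word in s.split(sep))


text = '  the   quick brown  fox '

# Without sep, runs of whitespace collapse and the ends are stripped.
print(repr(string.capwords(text)))                     # 'The Quick Brown Fox'
print(capwords_sketch(text) == string.capwords(text))  # True

# With sep, only that separator is used to split and join the words.
print(string.capwords('north-west coast', '-'))        # North-West coast
```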

import string

s = 'The quick brown fox jumped over the lazy dog.'

print(s)
print(string.capwords(s))

Origin www.cnblogs.com/liuhui0308/p/12312755.html