A study of char and string in C# (Part 1)

1. System.Char

char is an alias for System.Char.

A System.Char occupies two bytes (16 bits).

System.Char is used to represent and store a Unicode character (more precisely, one UTF-16 code unit).

System.Char covers the range U+0000 to U+FFFF; the default value of char is \0, i.e. U+0000.

Unicode code points are conventionally written in the form U+____, i.e. U+ followed by a group of hexadecimal digits.

There are four ways to assign a char:

            char a = 'j';
            char b = '\u006A';
            char c = '\x006A';
            char d = (char) 106;
            Console.WriteLine($"{a} | {b} | {c} | {d}");

Output

j | j | j | j

\u begins a Unicode escape sequence; it must be followed by exactly four hexadecimal digits.

\u006A    valid
\u06A     invalid
\u6A      invalid

\x begins a hexadecimal escape sequence made up of one to four hexadecimal digits. Leading zeros may be omitted, so the following examples all represent the same character.

\x006A
\x06A
\x6A
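Because \x accepts a variable number of digits, it can absorb hex-looking characters that follow it inside a string literal. The sketch below illustrates the difference from \u; the variable names are only illustrative, and the snippet assumes C# 9 or later so it can run as top-level statements.

```csharp
using System;

// \u always consumes exactly four hex digits: 'j' followed by 'b'.
string viaU = "\u006Ab";
// \x consumes up to four hex digits, so the trailing 'b' is absorbed: U+06AB.
string viaX = "\x6Ab";

Console.WriteLine(viaU.Length);            // 2
Console.WriteLine(viaX.Length);            // 1
Console.WriteLine((int)viaX[0] == 0x06AB); // True
```

For this reason \u is usually the safer choice inside strings.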

A char can be implicitly converted to the integral types ushort, int, uint, long, and ulong, and to float, double, and decimal.

A char can be explicitly converted to sbyte, byte, and short.

No other type can be implicitly converted to char, but any integral or floating-point type can be explicitly converted to char.
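The conversion rules above can be sketched as follows. This is only an illustration with made-up variable names, written as top-level statements (C# 9 or later).

```csharp
using System;

char c = 'j';

// Implicit (widening) conversions from char.
int i = c;       // 106
double d = c;    // 106
decimal m = c;   // 106

// Narrowing conversions from char need a cast.
byte b = (byte)c;    // 106
short s = (short)c;  // 106

// Converting to char always needs a cast, even from int.
char fromInt = (char)106;          // 'j'
double x = 106.9;
char fromDouble = (char)x;         // fractional part is truncated: 'j'

Console.WriteLine($"{i} {d} {m} {b} {s} {fromInt} {fromDouble}");
```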

2. Character Handling

System.Char has many static methods that help recognize and process characters.

One very important enumeration is UnicodeCategory:

  public enum UnicodeCategory
  {
    UppercaseLetter,
    LowercaseLetter,
    TitlecaseLetter,
    ModifierLetter,
    OtherLetter,
    NonSpacingMark,
    SpacingCombiningMark,
    EnclosingMark,
    DecimalDigitNumber,
    LetterNumber,
    OtherNumber,
    SpaceSeparator,
    LineSeparator,
    ParagraphSeparator,
    Control,
    Format,
    Surrogate,
    PrivateUse,
    ConnectorPunctuation,
    DashPunctuation,
    OpenPunctuation,
    ClosePunctuation,
    InitialQuotePunctuation,
    FinalQuotePunctuation,
    OtherPunctuation,
    MathSymbol,
    CurrencySymbol,
    ModifierSymbol,
    OtherSymbol,
    OtherNotAssigned,
  }

System.Char has a static GetUnicodeCategory() method that returns the category of a character, i.e. one of the enumeration values above.

Besides GetUnicodeCategory(), we can also test for a particular kind of character with more specific static methods.

The table below lists these static methods and the enumeration values they correspond to.

| Static method | Description | UnicodeCategory values |
| --- | --- | --- |
| IsControl | Non-printable control characters such as \r, \n, \t, \0 (below 0x20, plus 0x7F to 0x9F) | Control |
| IsDigit | 0-9 and the decimal digits of other scripts | DecimalDigitNumber |
| IsLetter | A-Z, a-z, and the letters of other scripts | UppercaseLetter, LowercaseLetter, TitlecaseLetter, ModifierLetter, OtherLetter |
| IsLetterOrDigit | Letters and digits | See IsLetter and IsDigit |
| IsLower | Lowercase letters | LowercaseLetter |
| IsNumber | Digits, plus Unicode fractions and Roman numerals | DecimalDigitNumber, LetterNumber, OtherNumber |
| IsPunctuation | Punctuation of Western and other scripts | ConnectorPunctuation, DashPunctuation, OpenPunctuation, ClosePunctuation, InitialQuotePunctuation, FinalQuotePunctuation, OtherPunctuation |
| IsSeparator | Spaces and all Unicode separators | SpaceSeparator, LineSeparator, ParagraphSeparator |
| IsSurrogate | UTF-16 surrogate code units, 0xD800 to 0xDFFF, which are used in pairs to encode code points from 0x10000 to 0x10FFFF | Surrogate |
| IsSymbol | Symbol characters (math, currency, modifier, and other symbols) | MathSymbol, CurrencySymbol, ModifierSymbol, OtherSymbol |
| IsUpper | Uppercase letters | UppercaseLetter |
| IsWhiteSpace | All separators, plus \t, \n, \r, \v, \f | SpaceSeparator, LineSeparator, ParagraphSeparator |
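As a quick check of how GetUnicodeCategory() relates to the Is* methods, here is a minimal sketch that prints the category of a few sample characters. The sample set is arbitrary, and the snippet assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Globalization;

char[] samples = { 'A', 'a', '1', ' ', '+', '$', '_', '\t' };

foreach (char ch in samples)
{
    // Prints the code point and its UnicodeCategory, e.g. "U+0041 -> UppercaseLetter".
    UnicodeCategory category = Char.GetUnicodeCategory(ch);
    Console.WriteLine($"U+{(int)ch:X4} -> {category}");
}

// '$' is a CurrencySymbol, so IsSymbol reports it as a symbol.
Console.WriteLine(Char.IsSymbol('$'));      // True
// '_' is ConnectorPunctuation, so IsPunctuation reports it as punctuation.
Console.WriteLine(Char.IsPunctuation('_')); // True
```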

Examples

        char chA = 'A';
        char ch1 = '1';
        string str = "test string";

        Console.WriteLine(chA.CompareTo('B'));          // Output: "-1" ('A' is 1 less than 'B')
        Console.WriteLine(chA.Equals('A'));             // Output: "True"
        Console.WriteLine(Char.GetNumericValue(ch1));   // Output: "1"
        Console.WriteLine(Char.IsControl('\t'));        // Output: "True"
        Console.WriteLine(Char.IsDigit(ch1));           // Output: "True"
        Console.WriteLine(Char.IsLetter(','));          // Output: "False"
        Console.WriteLine(Char.IsLower('u'));           // Output: "True"
        Console.WriteLine(Char.IsNumber(ch1));          // Output: "True"
        Console.WriteLine(Char.IsPunctuation('.'));     // Output: "True"
        Console.WriteLine(Char.IsSeparator(str, 4));    // Output: "True" (index 4 is the space)
        Console.WriteLine(Char.IsSymbol('+'));          // Output: "True"
        Console.WriteLine(Char.IsWhiteSpace(str, 4));   // Output: "True"
        Console.WriteLine(Char.Parse("S"));             // Output: "S"
        Console.WriteLine(Char.ToLower('M'));           // Output: "m"
        Console.WriteLine('x'.ToString());              // Output: "x"
        // U+10F00 does not fit in a char literal, so test a position in a string instead.
        Console.WriteLine(Char.IsSurrogate("\U00010F00", 0)); // Output: "True" (index 0 is a high surrogate)
        char test = '\xDFFF';                           // a lone low surrogate
        Console.WriteLine(test);                        // Output: typically "?" (not displayable on its own)
        Console.WriteLine(Char.GetUnicodeCategory(test)); // Output: "Surrogate"
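Since a char only covers U+0000 to U+FFFF, a character outside that range is stored as two chars (a surrogate pair). The following sketch, using U+1F600 as an arbitrary sample code point, shows how the related System.Char methods behave. It assumes C# 9 or later for top-level statements.

```csharp
using System;

// One code point outside the BMP, stored as two UTF-16 code units.
string s = "\U0001F600";

Console.WriteLine(s.Length);                                 // 2
Console.WriteLine(Char.IsHighSurrogate(s[0]));               // True
Console.WriteLine(Char.IsLowSurrogate(s[1]));                // True
Console.WriteLine(Char.IsSurrogatePair(s[0], s[1]));         // True
// Recombine the pair into the original code point.
Console.WriteLine(Char.ConvertToUtf32(s, 0).ToString("X"));  // 1F600
```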

For the full list of System.Char members, see http://www1.cs.columbia.edu/~lok/csharp/refdocs/System/types/Char.html

3. Globalization

In C#, System.Char offers a rich set of methods for processing characters, such as the commonly used ToUpper and ToLower.

However, the result of character processing can be affected by the user's locale.

When using the System.Char methods, you can call the variant with the Invariant suffix, or pass CultureInfo.InvariantCulture, to process characters independently of the language environment.

Examples

            Console.WriteLine(Char.ToUpper('i',CultureInfo.InvariantCulture));
            Console.WriteLine(Char.ToUpperInvariant('i'));
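A classic case where the locale matters is the Turkish letter i. The sketch below compares the invariant result with the Turkish one. It assumes the "tr-TR" culture data is available on the machine (it may not be when the runtime is in invariant globalization mode), which is why the second part is wrapped in a try/catch. Top-level statements require C# 9 or later.

```csharp
using System;
using System.Globalization;

// Invariant upper-casing does not depend on the machine's settings.
char invariant = Char.ToUpperInvariant('i');
Console.WriteLine(invariant == 'I'); // True

try
{
    // Assumption: culture data for "tr-TR" is installed.
    var turkish = new CultureInfo("tr-TR");
    char upper = Char.ToUpper('i', turkish);
    // With Turkish rules this is typically U+0130 (capital I with a dot above).
    Console.WriteLine($"U+{(int)upper:X4}");
}
catch (CultureNotFoundException)
{
    Console.WriteLine("tr-TR culture data is not available here.");
}
```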

Many character and string methods have overloads that take the following types; see the descriptions below.

StringComparison

| Member | Value | Description |
| --- | --- | --- |
| CurrentCulture | 0 | Compare strings using culture-sensitive sort rules and the current culture |
| CurrentCultureIgnoreCase | 1 | Compare strings using culture-sensitive sort rules and the current culture, ignoring case |
| InvariantCulture | 2 | Compare strings using culture-sensitive sort rules and the invariant culture |
| InvariantCultureIgnoreCase | 3 | Compare strings using culture-sensitive sort rules and the invariant culture, ignoring case |
| Ordinal | 4 | Compare strings using ordinal (binary) sort rules |
| OrdinalIgnoreCase | 5 | Compare strings using ordinal (binary) sort rules, ignoring case |
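To make the difference between the comparison modes concrete, here is a small sketch with arbitrary sample strings. The last line is culture-sensitive, so its result depends on the globalization data available to the runtime and no fixed output is claimed for it. Top-level statements require C# 9 or later.

```csharp
using System;

string a = "README.md";
string b = "readme.MD";

Console.WriteLine(string.Equals(a, b, StringComparison.Ordinal));           // False
Console.WriteLine(string.Equals(a, b, StringComparison.OrdinalIgnoreCase)); // True

// Ordinal compares UTF-16 code units: 'a' (97) sorts after 'B' (66).
Console.WriteLine(string.Compare("a", "B", StringComparison.Ordinal) > 0);  // True

// Culture-sensitive rules usually sort "a" before "B" instead.
Console.WriteLine(string.Compare("a", "B", StringComparison.InvariantCulture) < 0);
```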

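CultureInfo

| Property | Description |
| --- | --- |
| CurrentCulture | Gets the CultureInfo object that represents the culture used by the current thread |
| CurrentUICulture | Gets or sets the CultureInfo object that represents the current user interface culture used by the Resource Manager to look up culture-specific resources at run time |
| InstalledUICulture | Gets the CultureInfo that represents the culture installed with the operating system |
| InvariantCulture | Gets the CultureInfo object that is culture-independent (invariant) |
| IsNeutralCulture | Gets a value indicating whether the current CultureInfo represents a neutral culture |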
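A short sketch of reading these properties follows. The first two lines print whatever the machine is configured with, so no output is shown for them; the number formatting line is only there to illustrate what "invariant" means in practice. Top-level statements require C# 9 or later.

```csharp
using System;
using System.Globalization;

Console.WriteLine(CultureInfo.CurrentCulture.Name);   // depends on the machine
Console.WriteLine(CultureInfo.CurrentUICulture.Name); // depends on the machine

// The invariant culture has an empty name.
Console.WriteLine(CultureInfo.InvariantCulture.Name == ""); // True

// Formatting with the invariant culture gives the same text on every machine.
string formatted = (1234.5).ToString("N1", CultureInfo.InvariantCulture);
Console.WriteLine(formatted); // 1,234.5
```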

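4. System.String

4.1 Searching strings

Strings offer several search methods: StartsWith(), EndsWith(), Contains(), and IndexOf().

StartsWith() and EndsWith() accept a StringComparison mode or a CultureInfo to control culture-related rules.

StartsWith(): whether the string begins with the given string

EndsWith(): whether the string ends with the given string

Contains(): whether the given string occurs anywhere in the string

IndexOf(): the index of the first occurrence of a string or character; a return value of -1 means there is no match.

Usage example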

            string a = "痴者工良(高级程序员劝退师)";
            Console.WriteLine(a.StartsWith("高级"));
            Console.WriteLine(a.StartsWith("高级",StringComparison.CurrentCulture));
            Console.WriteLine(a.StartsWith("高级",true, CultureInfo.CurrentCulture));
            Console.WriteLine(a.StartsWith("痴者",StringComparison.CurrentCulture));
            Console.WriteLine(a.EndsWith("劝退师)",true, CultureInfo.CurrentCulture));
            Console.WriteLine(a.IndexOf("高级",StringComparison.CurrentCulture));

Output

False
False
False
True
True
5

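Apart from Contains() (which only gained a StringComparison overload in .NET Core 2.1), the other three methods have several overloads, for example:

| Overload | Description |
| --- | --- |
| (String) | Whether the specified string matches |
| (String, StringComparison) | Match the specified string using the given comparison mode |
| (String, Boolean, CultureInfo) | Match the specified string with explicit control over case and culture rules (StartsWith() and EndsWith() only) |

These globalization and case-matching rules are covered in a later chapter.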
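For identifiers such as file names, an explicit ordinal comparison is often the least surprising choice. The sketch below uses an invented file name to show the (String, StringComparison) overloads. Top-level statements require C# 9 or later.

```csharp
using System;

string path = "Report-2019.PDF";

// The single-argument overload is case-sensitive.
Console.WriteLine(path.EndsWith(".pdf"));                                     // False
Console.WriteLine(path.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase)); // True
Console.WriteLine(path.StartsWith("report", StringComparison.Ordinal));       // False

// IndexOf returns the position of the first match, or -1 when there is none.
Console.WriteLine(path.IndexOf("2019", StringComparison.Ordinal));            // 7
Console.WriteLine(path.IndexOf("2020", StringComparison.Ordinal));            // -1
```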

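4.2 Extracting, inserting, removing, and replacing

4.2.1 Extracting

The Substring() method extracts N characters starting at a given index, or all of the remaining characters.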

            string a = "痴者工良(高级程序员劝退师)";
            Console.WriteLine(a.Substring(startIndex: 1, length: 3));
            // 者工良
            Console.WriteLine(a.Substring(startIndex: 5));
            // 高级程序员劝退师)
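Substring() throws ArgumentOutOfRangeException when the requested range runs past the end of the string. One way to guard against that is a small clamping helper; SafeSubstring below is a hypothetical function written for this sketch, not part of the .NET class library. Top-level statements require C# 9 or later.

```csharp
using System;

// Hypothetical helper (not a BCL method): clamps the requested range
// instead of throwing ArgumentOutOfRangeException.
static string SafeSubstring(string s, int startIndex, int length)
{
    if (startIndex < 0) startIndex = 0;
    if (startIndex >= s.Length || length <= 0) return string.Empty;
    return s.Substring(startIndex, Math.Min(length, s.Length - startIndex));
}

string text = "abcdef";

Console.WriteLine(text.Substring(2, 3));             // cde
Console.WriteLine(SafeSubstring(text, 4, 10));       // ef
Console.WriteLine(SafeSubstring(text, 9, 2).Length); // 0

try
{
    text.Substring(4, 10); // range exceeds the string length
}
catch (ArgumentOutOfRangeException)
{
    Console.WriteLine("out of range");
}
```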

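4.2.2 Inserting, removing, and replacing

Insert(): inserts a character or string at the specified index

Remove(): removes all characters from the specified index onward, or a given number of characters starting there

PadLeft(): pads the string on the left with a given character until it is N characters long

PadRight(): pads the string on the right with a given character until it is N characters long

TrimStart(): removes the given characters from the left side of the string, stopping at the first character that does not match.

TrimEnd(): removes the given characters from the right side of the string, stopping at the first character that does not match.

Replace(): replaces every occurrence of a character or substring with a new character or string.

Example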

            string a = "痴者工良(高级程序员劝退师)"; // length = 14

            Console.WriteLine("\n  -  Remove Insert   - \n");

            Console.WriteLine(a.Insert(startIndex: 4, value: "我是"));
            Console.WriteLine(a.Remove(startIndex: 5));
            Console.WriteLine(a.Remove(startIndex: 5, count: 3));

            Console.WriteLine("\n  -  PadLeft PadRight  -  \n");

            Console.WriteLine(a.PadLeft(totalWidth: 20, paddingChar: '*'));
            Console.WriteLine(a.PadRight(totalWidth: 20, paddingChar: '#'));
            Console.WriteLine(a.PadLeft(totalWidth: 20, paddingChar: '\u0023'));
            Console.WriteLine(a.PadRight(totalWidth: 20, paddingChar: '\u002a'));
            Console.WriteLine(a.PadLeft(totalWidth: 18, paddingChar: '.'));
            Console.WriteLine(a.PadRight(totalWidth: 18, paddingChar: '.'));

            Console.WriteLine("\n  -  Trim  -  \n");

            Console.WriteLine("|Hello | World|".Trim('|'));
            Console.WriteLine("|||Hello | World|||".Trim('|'));
            Console.WriteLine("|Hello | World!|".TrimStart('|'));
            Console.WriteLine("|||Hello | World!|||".TrimStart('|'));
            Console.WriteLine("|Hello | World!|".TrimEnd('|'));
            Console.WriteLine("|||Hello | World!|||".TrimEnd('|'));
            Console.WriteLine("||||||||||||||||||||||||".TrimEnd('|'));
            

            Console.WriteLine("*#&abc ABC&#*".TrimStart(new char[] {'*', '#', '&'}));
            Console.WriteLine("*#&abc ABC&#*".TrimStart(new char[] {'#', '*', '&'}));

            Console.WriteLine("\n  -  Replace  -  \n");

            Console.WriteLine("abcdABCDabcdABCD".Replace(oldChar: 'a', newChar: 'A'));

Output

  -  Remove Insert   -

痴者工良我是(高级程序员劝退师)
痴者工良(
痴者工良(序员劝退师)

  -  PadLeft PadRight  -

******痴者工良(高级程序员劝退师)
痴者工良(高级程序员劝退师)######
######痴者工良(高级程序员劝退师)
痴者工良(高级程序员劝退师)******
....痴者工良(高级程序员劝退师)
痴者工良(高级程序员劝退师)....

  -  Trim  -

Hello | World
Hello | World
Hello | World!|
Hello | World!|||
|Hello | World!
|||Hello | World!

abc ABC&#*
abc ABC&#*

  -  Replace  -

AbcdABCDAbcdABCD
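Note that none of these methods modifies the string it is called on; each one returns a new string. A minimal sketch with arbitrary sample values, which also uses the (String, String) overload of Replace(), follows. Top-level statements require C# 9 or later.

```csharp
using System;

string original = "abc";

// Each call returns a new string; 'original' itself never changes.
string padded = original.PadLeft(5, '*');
string replaced = original.Replace("b", "BB");

Console.WriteLine(original); // abc
Console.WriteLine(padded);   // **abc
Console.WriteLine(replaced); // aBBc
```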

5. The string intern pool

The following is the author's personal summary and is not guaranteed to be correct.

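String interning is done at the application domain (AppDomain) level, and the intern pool is shared by all assemblies in the domain.

The CLR maintains a table called the intern pool.

This table records references to every string instance that is declared as a literal in code.

When literals are concatenated with each other (which the compiler folds into a single constant), the resulting string also enters the intern pool.

Only string instances declared with literals refer to strings in the intern pool.

It does not matter whether the literal appears in a field, a property, a string variable declared inside a method, or even the default value of a method parameter: it enters the intern pool in every case.

For example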

        static string test = "一个测试";

        static void Main(string[] args)
        {
            string a = "a";

            Console.WriteLine("test:" + test.GetHashCode());
            
            TestOne(test);
            TestTwo(test);
            TestThree("一个测试");
        }

        public static void TestOne(string a)
        {
            Console.WriteLine("----TestOne-----");
            Console.WriteLine("a:" + a.GetHashCode());
            string b = a;
            Console.WriteLine("b:" + b.GetHashCode());
            Console.WriteLine("test - a :" + Object.ReferenceEquals(test, a));
        }

        public static void TestTwo(string a = "一个测试")
        {
            Console.WriteLine("----TestTwo-----");
            Console.WriteLine("a:" + a.GetHashCode());
            string b = a;
            Console.WriteLine("b:" + b.GetHashCode());
            Console.WriteLine("test - a :" + Object.ReferenceEquals(test, a));
        }

        public static void TestThree(string a)
        {
            Console.WriteLine("----TestThree-----");
            Console.WriteLine("a:" + a.GetHashCode());
            string b = a;
            Console.WriteLine("b:" + b.GetHashCode());
            Console.WriteLine("test - a :" + Object.ReferenceEquals(test, a));
        }

Output

test:-407145577
----TestOne-----
a:-407145577
b:-407145577
test - a :True
----TestTwo-----
a:-407145577
b:-407145577
test - a :True
----TestThree-----
a:-407145577
b:-407145577
test - a :True

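You can use the static method Object.ReferenceEquals(s1, s2) to check whether two strings are the same reference. Note that a string's GetHashCode() is computed from its content, so equal hash codes only show that the contents are likely the same, not that the references are identical.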
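The difference between equal content and the same reference can be sketched with a string built at run time and string.Intern(). The variable names are illustrative. The last line compares against a literal; it is usually True because literals are normally already in the pool, but no fixed output is claimed for it. Top-level statements require C# 9 or later.

```csharp
using System;

string literal = "hello";
// Built at run time, so it is a separate instance with the same content.
string built = new string(new[] { 'h', 'e', 'l', 'l', 'o' });

Console.WriteLine(literal == built);                             // True (same content)
Console.WriteLine(literal.GetHashCode() == built.GetHashCode()); // True (hash is content-based)
Console.WriteLine(Object.ReferenceEquals(literal, built));       // False (different instances)

// string.Intern returns the pooled instance for a given content.
string internedA = string.Intern(built);
string internedB = string.Intern(new string('h', 1) + "ello");
Console.WriteLine(Object.ReferenceEquals(internedA, internedB)); // True

// Usually True as well, since the literal is normally interned already.
Console.WriteLine(Object.ReferenceEquals(literal, internedA));
```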

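Unsafe code can be used to modify a string directly in memory.

Reference: https://blog.benoitblanchon.fr/modify-intern-pool/

            // Requires an unsafe context (compile with AllowUnsafeBlocks).
            string a = "Test";

            unsafe
            {
                // Pin the string and overwrite its second character in place.
                fixed (char* p = a)
                {
                    p[1] = '3';
                }
            }

            Console.WriteLine(a);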

*Microsoft.Diagnostics.Runtime* can be used to obtain information about the CLR.

However, after consulting a lot of material, the author found that .NET provides no API for inspecting the hash table inside the string intern pool.

For how C# strings and the intern pool work, see

http://community.bartdesmet.net/blogs/bart/archive/2006/09/27/4472.aspx

Getting a list of the string literals in an assembly

https://stackoverflow.com/questions/22172175/read-the-content-of-the-string-intern-pool

Overview of the low-level .NET Profiling API

https://docs.microsoft.com/en-us/dotnet/framework/unmanaged-api/profiling/profiling-overview?redirectedfrom=MSDN

.NET string interning and improving string comparison performance

http://benhall.io/net-string-interning-to-improve-performance/

Study articles on the C# string intern pool

https://www.cnblogs.com/mingxuantongxue/p/3782391.html

https://www.xuebuyuan.com/189297.html

If anything in this summary is wrong, corrections are welcome.

Origin www.cnblogs.com/whuanle/p/11967014.html