CSV file reading and writing in C#

Foreword

Projects often need to read and write CSV files, and the hard part is mainly parsing them.

This article introduces three ways to parse CSV files (CsvHelper, TextFieldParser, and regular expressions) and also covers how to write CSV files along the way.

CSV file standard

Before introducing the method of reading and writing CSV files, we need to understand the format of CSV files.

File example

A simple CSV file:

Test1,Test2,Test3,Test4,Test5,Test6
str1,str2,str3,str4,str5,str6
str1,str2,str3,str4,str5,str6

A non-trivial CSV file:

"Test1
"",""","Test2
"",""","Test3
"",""","Test4
"",""","Test5
"",""","Test6
"","""
" 中文,D23 ","3DFD4234""""""1232""1S2","ASD1"",""23,,,,213
23F32","
",,asd
" 中文,D23 ","3DFD4234""""""1232""1S2","ASD1"",""23,,,,213
23F32","
",,asd

You read that right: both of the above are valid CSV files, and each contains only 3 CSV records (one header record and two data records). The second file is painful to look at, but files like this cannot be avoided in real projects.

RFC 4180

There is no official standard for CSV files, but most projects follow RFC 4180. It is not a formal standard, and its format rules read as follows:

  1. Each record is located on a separate line, delimited by a line break (CRLF).

  2. The last record in the file may or may not have an ending line break.

  3. There maybe an optional header line appearing as the first line of the file with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file (the presence or absence of the header line should be indicated via the optional "header" parameter of this MIME type).

  4. Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma.

  5. Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.

  6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.

  7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.
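
Rules 5 to 7 can be condensed into a small quoting helper. The following is only a minimal sketch: the class name Rfc4180Quoting and the method EscapeField are hypothetical names chosen for illustration and are not part of any library or of the code later in this article.

```csharp
using System;

public static class Rfc4180Quoting
{
    // Characters that force a field to be quoted (rule 6).
    private static readonly char[] Special = { ',', '"', '\r', '\n' };

    // Returns the field as it should appear inside a CSV record.
    public static string EscapeField(string field)
    {
        if (field == null) return string.Empty;
        // Rule 5: a field without special characters may stay unquoted.
        if (field.IndexOfAny(Special) < 0) return field;
        // Rule 7: double every embedded quote, then enclose the field (rule 6).
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }
}
```

For example, EscapeField applied to a,b yields "a,b" with the surrounding quotes, while a plain value such as abc is returned unchanged.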

Simplified standard

The rules above are a mouthful, so we simplify them a little. Note that simplifying here does not mean dropping rules; it means merging similar ones so they are easier to understand. The code later in this article also follows the simplified standard, which is as follows:

1. Each record is on a separate line, delimited by a line break (CRLF). Note: a line here is not a line in the ordinary text sense but a record conforming to the CSV format (hereinafter a CSV line), which may span multiple lines of text.

2. The last record in the file must end with a line break, and the first line of the file is a header line (it contains the names of the fields, and the number of names equals the number of fields in each record). Note: items that are optional in the original standard are made mandatory here, which simplifies later parsing; besides, a file without a header line is hard for other people to read.

3. Within the header and each record there may be one or more fields, separated by commas. Every line in the file should contain the same number of fields. Spaces are part of a field and must not be ignored. The last field in a record must not be followed by a comma. Note: this rule is not simplified. Other conventions separate fields with spaces, tabs, and so on, but a file that is not separated by commas can hardly be called a comma-separated values file.

4. Every field is enclosed in double quotes, and a double quote inside a field must be escaped by preceding it with another double quote. Note: the original standard has cases where double quotes are required and cases where they are optional; quoting every field covers both, so always quoting cannot go wrong.
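
The note on rule 1 (a CSV line may span several text lines) can be illustrated with a small record splitter. This is a sketch under the assumption that the input is well formed, meaning double quotes appear only as field delimiters or as doubled escapes; CsvRecordSplitter and SplitRecords are hypothetical names used only here.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvRecordSplitter
{
    // Splits raw CSV text into CSV lines (records). A CRLF ends a record only
    // when it appears outside double quotes; inside quotes it is field data.
    public static List<string> SplitRecords(string text)
    {
        var records = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            // An escaped quote ("") toggles the flag twice, leaving it unchanged.
            if (c == '"') inQuotes = !inQuotes;
            if (!inQuotes && c == '\r' && i + 1 < text.Length && text[i + 1] == '\n')
            {
                records.Add(current.ToString());
                current.Clear();
                i++; // skip the '\n' of the CRLF pair
                continue;
            }
            current.Append(c);
        }
        // A last record without a trailing CRLF is still a record.
        if (current.Length > 0) records.Add(current.ToString());
        return records;
    }
}
```

Fed a text whose first field contains a CRLF inside quotes followed by one ordinary record, the splitter returns two records rather than three text lines.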

Read and write CSV files

Before reading and writing CSV files, we define a Test class to test with.

The code is as follows:

class Test
{
    public string Test1 { get; set; }
    public string Test2 { get; set; }
    public string Test3 { get; set; }
    public string Test4 { get; set; }
    public string Test5 { get; set; }
    public string Test6 { get; set; }

    // Parse is used later by the custom CSV reading code
    public static Test Parse(string[] fields)
    {
        try
        {
            Test ret = new Test();
            ret.Test1 = fields[0];
            ret.Test2 = fields[1];
            ret.Test3 = fields[2];
            ret.Test4 = fields[3];
            ret.Test5 = fields[4];
            ret.Test6 = fields[5];
            return ret;
        }
        catch (Exception)
        {
            // Handle the error here, for example by writing a log entry
            return null;
        }
    }
}

Generate some test data with the following code:

static void Main(string[] args)
{
    // File save path
    string path = "tset.csv";
    // Remove the test file left over from a previous run
    File.Delete(path);
      
    Test test = new Test();
    test.Test1 = " 中文,D23 ";
    test.Test2 = "3DFD4234\"\"\"1232\"1S2";
    test.Test3 = "ASD1\",\"23,,,,213\r23F32";
    test.Test4 = "\r";
    test.Test5 = string.Empty;
    test.Test6 = "asd";

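    // Test data
    var records = new List<Test> { test, test };

    // Write the CSV file
    /*
    * Paste the CSV writing code from the sections below here
    */

    // Read the CSV file
    /*
    * Paste the CSV reading code from the sections below here
    */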
   
    Console.ReadLine();
}

Using CsvHelper

CsvHelper is a library for reading and writing CSV files that supports reading and writing custom class objects.

It is the most-starred C# library for reading and writing CSV files on GitHub and is released under the MS-PL and Apache 2.0 open source licenses.

Install CsvHelper from NuGet; the code for reading and writing CSV files is as follows:

// Requires: using CsvHelper; using System.Globalization; using System.IO; using System.Linq;
// Write the CSV file
using (var writer = new StreamWriter(path))
using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
{
    csv.WriteRecords(records);
}

using (var writer = new StreamWriter(path, true))
using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
{
    // Append; the header was already written above
    foreach (var record in records)
    {
        csv.WriteRecord(record);
        // WriteRecord does not end the row; NextRecord writes the line break
        csv.NextRecord();
    }
}

// Read the CSV file
using (var reader = new StreamReader(path))
using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
{
    records = csv.GetRecords<Test>().ToList();
    // Read record by record
    //records.Add(csv.GetRecord<Test>());
}

If all you want is a ready-made library, you can basically stop reading here.

Use a custom method

To keep it separate from CsvHelper, we create a new CsvFile class to hold the custom CSV reading and writing code; the complete source code of the class is given at the end.

The CsvFile class is defined as follows:

/// <summary>
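/// CSV file read/write utility class
/// </summary>
public class CsvFile
{
    #region Write CSV file
    // implementation...
    #endregion

    #region Read CSV file (using TextFieldParser)
    // implementation...
    #endregion

    #region Read CSV file (using regular expressions)
    // implementation...
    #endregion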

}

Writing CSV files based on the simplified standard

Following the simplified standard (described above), the code for writing CSV files is as follows:

#region Write CSV file
// Convert a sequence of fields into one CSV record line
private static string FieldsToLine(IEnumerable<string> fields)
{
    if (fields == null) return string.Empty;
    fields = fields.Select(field =>
    {
        if (field == null) field = string.Empty;
        // Simplified standard: wrap every field in double quotes
        field = string.Format("\"{0}\"", field.Replace("\"", "\"\""));

        // Non-simplified standard: quote only when needed
        //field = field.Replace("\"", "\"\"");
        //if (field.IndexOfAny(new char[] { ',', '"', ' ', '\r', '\n' }) != -1)
        //{
        //    field = string.Format("\"{0}\"", field);
        //}
        return field;
    });
    string line = string.Format("{0}{1}", string.Join(",", fields), Environment.NewLine);
    return line;
}

// Default field conversion method
private static IEnumerable<string> GetObjFields<T>(T obj, bool isTitle) where T : class
{
    IEnumerable<string> fields;
    if (isTitle)
    {
        fields = obj.GetType().GetProperties().Select(pro => pro.Name);
    }
    else
    {
        fields = obj.GetType().GetProperties().Select(pro => pro.GetValue(obj)?.ToString());
    }
    return fields;
}

/// <summary>
/// Writes a CSV file; the first line is the header by default
/// </summary>
/// <typeparam name="T">Record type</typeparam>
/// <param name="list">Data list</param>
/// <param name="path">File path</param>
/// <param name="append">Append records</param>
/// <param name="func">Field conversion method</param>
/// <param name="defaultEncoding">File encoding (UTF-8 when null)</param>
public static void Write<T>(List<T> list, string path,bool append=true, Func<T, bool, IEnumerable<string>> func = null, Encoding defaultEncoding = null) where T : class
{
    if (list == null || list.Count == 0) return;
    if (defaultEncoding == null)
    {
        defaultEncoding = Encoding.UTF8;
    }
    if (func == null)
    {
        func = GetObjFields;
    }
    if (!File.Exists(path)|| !append)
    {
        var fields = func(list[0], true);
        string title = FieldsToLine(fields);
        File.WriteAllText(path, title, defaultEncoding);
    }
    using (StreamWriter sw = new StreamWriter(path, true, defaultEncoding))
    {
        list.ForEach(obj =>
        {
            var fields = func(obj, false);
            string line = FieldsToLine(fields);
            sw.Write(line);
        });
    }
}
#endregion

It is used as follows:

// Write the CSV file
// Use a custom field conversion method; this is the one that produced the complex CSV file at the start of the article
CsvFile.Write(records, path, true, new Func<Test, bool, IEnumerable<string>>((obj, isTitle) =>
{
    IEnumerable<string> fields;
    if (isTitle)
    {
        fields = obj.GetType().GetProperties().Select(pro => pro.Name + Environment.NewLine + "\",\"");
    }
    else
    {
        fields = obj.GetType().GetProperties().Select(pro => pro.GetValue(obj)?.ToString());
    }
    return fields;
}));

// Use the default field conversion method
//CsvFile.Write(records, path);

You can also use the default field conversion method, as follows:

CsvFile.Write(records, path);

Use TextFieldParser to parse CSV files

TextFieldParser is a class from the Visual Basic library (namespace Microsoft.VisualBasic.FileIO) for parsing delimited text such as CSV files. C# has no built-in class with similar functionality, but C# code can call the VB TextFieldParser directly.

The code for parsing CSV files with TextFieldParser is as follows:

#region Read CSV file (using TextFieldParser)
/// <summary>
/// Reads a CSV file; the first line is treated as the header by default
/// </summary>
/// <typeparam name="T">Record type</typeparam>
/// <param name="path">File path</param>
/// <param name="func">Field parsing rule</param>
/// <param name="defaultEncoding">File encoding</param>
/// <returns>Parsed records</returns>
public static List<T> Read<T>(string path, Func<string[], T> func, Encoding defaultEncoding = null) where T : class
{
    if (defaultEncoding == null)
    {
        defaultEncoding = Encoding.UTF8;
    }
    List<T> list = new List<T>();
    using (TextFieldParser parser = new TextFieldParser(path, defaultEncoding))
    {
        parser.TextFieldType = FieldType.Delimited;
        // Use a comma as the delimiter
        parser.SetDelimiters(",");
        // Do not trim whitespace around fields
        parser.TrimWhiteSpace = false;
        bool isLine = false;
        while (!parser.EndOfData)
        {
            string[] fields = parser.ReadFields();
            if (isLine)
            {
                var obj = func(fields);
                if (obj != null) list.Add(obj);
            }
            else
            {
                // Skip the header line
                isLine = true;
            }
        }
    }
    return list;
}
#endregion

It is used as follows:

// Read the CSV file
records = CsvFile.Read(path, Test.Parse);
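
To try TextFieldParser without touching the file system, it can also be constructed over a TextReader. The sketch below assumes the Microsoft.VisualBasic assembly is available to the project; TextFieldParserSketch and ParseText are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class TextFieldParserSketch
{
    // Parses CSV text held in memory and returns one string[] per record.
    public static List<string[]> ParseText(string csvText)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(new StringReader(csvText)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            // Treat double quotes as field enclosures
            parser.HasFieldsEnclosedInQuotes = true;
            // Keep leading and trailing spaces, as the CSV rules require
            parser.TrimWhiteSpace = false;
            while (!parser.EndOfData)
            {
                rows.Add(parser.ReadFields());
            }
        }
        return rows;
    }
}
```

Unlike CsvFile.Read, this sketch does not skip the header line, so the first element of the returned list is the header record.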

Parse CSV file using regex

If you have a problem and decide to solve it with regular expressions, you now have two problems.

Regular expressions have a learning curve, and they are easily forgotten when not used regularly. Most of the problems they solve rarely change, so a stable, working regular expression can be handed down for generations. The regular expression in this section comes from the loop-unrolling example in Chapter 6, "Crafting an Efficient Expression", of "Mastering Regular Expressions (3rd Edition)"; look it up if you are interested. The expression is:

\G(?:^|,)(?:"((?>[^"]*)(?>""[^"]*)*)"|([^",]*))

Note: the final CSV-parsing expression in the book is a Java version that uses possessive quantifiers instead of atomic groups, and it is also the version most often found through Baidu. I could not get possessive quantifiers to work in C# and, with my limited ability, could not resolve the problem, so I used the atomic-group version above. There is no performance difference between the two versions.

The code for parsing CSV files with regular expressions is as follows:
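
To see the field expression at work on a single record, here is a minimal sketch. It uses the same pattern as the code in this section, written as a C# verbatim string; CsvRegexSketch and SplitFields are hypothetical names, and the input is assumed to be one complete CSV line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CsvRegexSketch
{
    // Group 1 captures the body of a quoted field, group 2 an unquoted field.
    // \G anchors each match to the end of the previous one, so nothing is skipped.
    private static readonly Regex FieldReg = new Regex(
        @"\G(?:^|,)(?:""((?>[^""]*)(?>""""[^""]*)*)""|([^"",]*))");

    // Splits one complete CSV line into its fields.
    public static List<string> SplitFields(string record)
    {
        var fields = new List<string>();
        for (Match m = FieldReg.Match(record); m.Success; m = m.NextMatch())
        {
            fields.Add(m.Groups[1].Success
                ? m.Groups[1].Value.Replace("\"\"", "\"") // undo the quote doubling
                : m.Groups[2].Value);
        }
        return fields;
    }
}
```

For a record made of the quoted field a,b, an empty field, the quoted field c""d, and the plain field e, SplitFields returns four fields, with the doubled quote collapsed to a single one.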

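#region Read CSV file (using regular expressions)
/// <summary>
/// Reads a CSV file; the first line is treated as the header by default
/// </summary>
/// <typeparam name="T">Record type</typeparam>
/// <param name="path">File path</param>
/// <param name="func">Field parsing rule</param>
/// <param name="defaultEncoding">File encoding</param>
/// <returns>Parsed records</returns>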
public static List<T> Read_Regex<T>(string path, Func<string[], T> func, Encoding defaultEncoding = null) where T : class
{
    if (defaultEncoding == null)
    {
        defaultEncoding = Encoding.UTF8;
    }
    List<T> list = new List<T>();
    StringBuilder sbr = new StringBuilder(100);
    Regex lineReg = new Regex("\"");
    Regex fieldReg = new Regex("\\G(?:^|,)(?:\"((?>[^\"]*)(?>\"\"[^\"]*)*)\"|([^\",]*))");
    Regex quotesReg = new Regex("\"\"");

    bool isLine = false;
    string line = string.Empty;
    using (StreamReader sr = new StreamReader(path, defaultEncoding))
    {
        while (null != (line = ReadLine(sr)))
        {
            sbr.Append(line);
            // A complete CSV record always contains an even number of double quotes
            if (lineReg.Matches(sbr.ToString()).Count % 2 == 0)
            {
                if (isLine)
                {
                    var fields = ParseCsvLine(sbr.ToString(), fieldReg, quotesReg).ToArray();
                    var obj = func(fields);
                    if (obj != null) list.Add(obj);
                }
                else
                {
                    // Skip the header line
                    isLine = true;
                }
                sbr.Clear();
            }
            else
            {
                // The record continues: restore the CRLF that ReadLine removed
                sbr.Append("\r\n");
            }
        }
    }
    if (sbr.Length > 0)
    {
        // Some text could not be parsed; report an error or ignore it
    }
    return list;
}

// Custom ReadLine: only \r\n counts as a line break
private static string ReadLine(StreamReader sr) 
{
    StringBuilder sbr = new StringBuilder();
    char c;
    int cInt;
    while (-1 != (cInt =sr.Read()))
    {
        c = (char)cInt;
        if (c == '\n' && sbr.Length > 0 && sbr[sbr.Length - 1] == '\r')
        {
            sbr.Remove(sbr.Length - 1, 1);
            return sbr.ToString();
        }
        else 
        {
            sbr.Append(c);
        }
    }
    return sbr.Length>0?sbr.ToString():null;
}

private static List<string> ParseCsvLine(string line, Regex fieldReg, Regex quotesReg)
{
    var fieldMath = fieldReg.Match(line);
    List<string> fields = new List<string>();
    while (fieldMath.Success)
    {
        string field;
        if (fieldMath.Groups[1].Success)
        {
            field = quotesReg.Replace(fieldMath.Groups[1].Value, "\"");
        }
        else
        {
            field = fieldMath.Groups[2].Value;
        }
        fields.Add(field);
        fieldMath = fieldMath.NextMatch();
    }
    return fields;
}
#endregion

It is used as follows:

// Read the CSV file
records = CsvFile.Read_Regex(path, Test.Parse);

No bugs have been found in the regular expression parser so far, but it is still not recommended.

Complete CsvFile tool class

The complete CsvFile class code is as follows:

using Microsoft.VisualBasic.FileIO;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;


namespace ConsoleApp4
{
    /// <summary>
    /// CSV file read/write utility class
    /// </summary>
    public class CsvFile
    {
        #region Write CSV file
        // Convert a sequence of fields into one CSV record line
        private static string FieldsToLine(IEnumerable<string> fields)
        {
            if (fields == null) return string.Empty;
            fields = fields.Select(field =>
            {
                if (field == null) field = string.Empty;
                // Simplified standard: wrap every field in double quotes
                field = string.Format("\"{0}\"", field.Replace("\"", "\"\""));

                // Non-simplified standard: quote only when needed
                //field = field.Replace("\"", "\"\"");
                //if (field.IndexOfAny(new char[] { ',', '"', ' ', '\r', '\n' }) != -1)
                //{
                //    field = string.Format("\"{0}\"", field);
                //}
                return field;
            });
            string line = string.Format("{0}{1}", string.Join(",", fields), Environment.NewLine);
            return line;
        }

        // Default field conversion method
        private static IEnumerable<string> GetObjFields<T>(T obj, bool isTitle) where T : class
        {
            IEnumerable<string> fields;
            if (isTitle)
            {
                fields = obj.GetType().GetProperties().Select(pro => pro.Name);
            }
            else
            {
                fields = obj.GetType().GetProperties().Select(pro => pro.GetValue(obj)?.ToString());
            }
            return fields;
        }

        /// <summary>
        /// Writes a CSV file; the first line is the header by default
        /// </summary>
        /// <typeparam name="T">Record type</typeparam>
        /// <param name="list">Data list</param>
        /// <param name="path">File path</param>
        /// <param name="append">Append records</param>
        /// <param name="func">Field conversion method</param>
        /// <param name="defaultEncoding">File encoding (UTF-8 when null)</param>
        public static void Write<T>(List<T> list, string path,bool append=true, Func<T, bool, IEnumerable<string>> func = null, Encoding defaultEncoding = null) where T : class
        {
            if (list == null || list.Count == 0) return;
            if (defaultEncoding == null)
            {
                defaultEncoding = Encoding.UTF8;
            }
            if (func == null)
            {
                func = GetObjFields;
            }
            if (!File.Exists(path)|| !append)
            {
                var fields = func(list[0], true);
                string title = FieldsToLine(fields);
                File.WriteAllText(path, title, defaultEncoding);
            }
            using (StreamWriter sw = new StreamWriter(path, true, defaultEncoding))
            {
                list.ForEach(obj =>
                {
                    var fields = func(obj, false);
                    string line = FieldsToLine(fields);
                    sw.Write(line);
                });
            }
        }
        #endregion

        #region Read CSV file (using TextFieldParser)
        /// <summary>
        /// Reads a CSV file; the first line is treated as the header by default
        /// </summary>
        /// <typeparam name="T">Record type</typeparam>
        /// <param name="path">File path</param>
        /// <param name="func">Field parsing rule</param>
        /// <param name="defaultEncoding">File encoding</param>
        /// <returns>Parsed records</returns>
        public static List<T> Read<T>(string path, Func<string[], T> func, Encoding defaultEncoding = null) where T : class
        {
            if (defaultEncoding == null)
            {
                defaultEncoding = Encoding.UTF8;
            }
            List<T> list = new List<T>();
            using (TextFieldParser parser = new TextFieldParser(path, defaultEncoding))
            {
                parser.TextFieldType = FieldType.Delimited;
                // Use a comma as the delimiter
                parser.SetDelimiters(",");
                // Do not trim whitespace around fields
                parser.TrimWhiteSpace = false;
                bool isLine = false;
                while (!parser.EndOfData)
                {
                    string[] fields = parser.ReadFields();
                    if (isLine)
                    {
                        var obj = func(fields);
                        if (obj != null) list.Add(obj);
                    }
                    else
                    {
                        // Skip the header line
                        isLine = true;
                    }
                }
            }
            return list;
        }
        #endregion

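        #region Read CSV file (using regular expressions)
        /// <summary>
        /// Reads a CSV file; the first line is treated as the header by default
        /// </summary>
        /// <typeparam name="T">Record type</typeparam>
        /// <param name="path">File path</param>
        /// <param name="func">Field parsing rule</param>
        /// <param name="defaultEncoding">File encoding</param>
        /// <returns>Parsed records</returns>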
        public static List<T> Read_Regex<T>(string path, Func<string[], T> func, Encoding defaultEncoding = null) where T : class
        {
            if (defaultEncoding == null)
            {
                defaultEncoding = Encoding.UTF8;
            }
            List<T> list = new List<T>();
            StringBuilder sbr = new StringBuilder(100);
            Regex lineReg = new Regex("\"");
            Regex fieldReg = new Regex("\\G(?:^|,)(?:\"((?>[^\"]*)(?>\"\"[^\"]*)*)\"|([^\",]*))");
            Regex quotesReg = new Regex("\"\"");

            bool isLine = false;
            string line = string.Empty;
            using (StreamReader sr = new StreamReader(path, defaultEncoding))
            {
                while (null != (line = ReadLine(sr)))
                {
                    sbr.Append(line);
                    // A complete CSV record always contains an even number of double quotes
                    if (lineReg.Matches(sbr.ToString()).Count % 2 == 0)
                    {
                        if (isLine)
                        {
                            var fields = ParseCsvLine(sbr.ToString(), fieldReg, quotesReg).ToArray();
                            var obj = func(fields);
                            if (obj != null) list.Add(obj);
                        }
                        else
                        {
                            // Skip the header line
                            isLine = true;
                        }
                        sbr.Clear();
                    }
                    else
                    {
                        // The record continues: restore the CRLF that ReadLine removed
                        sbr.Append("\r\n");
                    }
                }
            }
            if (sbr.Length > 0)
            {
                // Some text could not be parsed; report an error or ignore it
            }
            return list;
        }

        // Custom ReadLine: only \r\n counts as a line break
        private static string ReadLine(StreamReader sr) 
        {
            StringBuilder sbr = new StringBuilder();
            char c;
            int cInt;
            while (-1 != (cInt =sr.Read()))
            {
                c = (char)cInt;
                if (c == '\n' && sbr.Length > 0 && sbr[sbr.Length - 1] == '\r')
                {
                    sbr.Remove(sbr.Length - 1, 1);
                    return sbr.ToString();
                }
                else 
                {
                    sbr.Append(c);
                }
            }
            return sbr.Length>0?sbr.ToString():null;
        }
       
        private static List<string> ParseCsvLine(string line, Regex fieldReg, Regex quotesReg)
        {
            var fieldMath = fieldReg.Match(line);
            List<string> fields = new List<string>();
            while (fieldMath.Success)
            {
                string field;
                if (fieldMath.Groups[1].Success)
                {
                    field = quotesReg.Replace(fieldMath.Groups[1].Value, "\"");
                }
                else
                {
                    field = fieldMath.Groups[2].Value;
                }
                fields.Add(field);
                fieldMath = fieldMath.NextMatch();
            }
            return fields;
        }
        #endregion

    }
}

It is used as follows:

// Write the CSV file
CsvFile.Write(records, path, true, new Func<Test, bool, IEnumerable<string>>((obj, isTitle) =>
{
    IEnumerable<string> fields;
    if (isTitle)
    {
        fields = obj.GetType().GetProperties().Select(pro => pro.Name + Environment.NewLine + "\",\"");
    }
    else
    {
        fields = obj.GetType().GetProperties().Select(pro => pro.GetValue(obj)?.ToString());
    }
    return fields;
}));

// Read the CSV file (TextFieldParser version)
records = CsvFile.Read(path, Test.Parse);

// Read the CSV file (regular expression version)
records = CsvFile.Read_Regex(path, Test.Parse);

Summary

This article introduced the RFC 4180 standard for CSV files and a simplified version of it.

It also introduced three ways to parse CSV files: CsvHelper, TextFieldParser, and regular expressions. CsvHelper is recommended for projects. If you do not want to pull in too many open source components, use TextFieldParser. Regular expressions are not recommended.

Appendix

CsvHelper GitHub: https://github.com/JoshClose/CsvHelper

CsvHelper project backup: https://pan.baidu.com/s/1xDOGgJuw5YaxPZwf8vGyrw Extraction code: 33j7

RFC 4180 standard: https://datatracker.ietf.org/doc/html/rfc4180

Reposted from: time-flies

Link: cnblogs.com/timefiles/p/CsvReadWrite.html

Origin blog.csdn.net/zls365365/article/details/129415069