genfromtxt function and Numpy function learning

　　Today I encountered the genfromtxt function
　　
　　world_alcohol = np.genfromtxt("world_alcohol.txt", delimiter=",", dtype=str)
　　
　　What is genfromtxt?
　　
　　The genfromtxt function creates arrays from tabular data.
　　
　　genfromtxt mainly performs two loop operations. The first loop converts each line of the file into a sequence of strings. The second loop converts each string sequence to the corresponding data type.
　　
　　genfromtxt can take missing data into account, whereas faster and simpler functions such as loadtxt cannot.
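　　
　　The difference in missing-data handling can be seen with a small sketch. The sample text is made up for illustration; it assumes a comma-separated input in which one field is empty.

```python
from io import StringIO

import numpy as np

text = "1,2,3\n4,,6"  # illustrative data: the second row has an empty middle field

# genfromtxt treats the empty field as missing and fills it with nan.
with_gaps = np.genfromtxt(StringIO(text), delimiter=",")

# loadtxt has no notion of missing values, so the empty field fails to parse.
try:
    np.loadtxt(StringIO(text), delimiter=",")
    loadtxt_failed = False
except ValueError:
    loadtxt_failed = True

print(with_gaps)
print("loadtxt failed:", loadtxt_failed)
```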
　　
　　Before using it, you need to import the corresponding module (Python 3). The examples below also use StringIO:
　　
　　import numpy as np
　　from io import StringIO
　　
　　Defining the input
　　
　　The only mandatory parameter of genfromtxt is the source of the data. It can be a string naming a local or remote file, or a file-like object with a read method (such as an open file or an io.StringIO object). If the parameter is the URL of a remote file, the file is automatically downloaded to the current directory. The input file can be a text file or an archive. Currently, the function recognizes gzip and bz2 (bzip2) archives. The archive type is determined from the file extension: a name ending in ".gz" is treated as a gzip archive, and a name ending in ".bz2" as a bzip2 archive.
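　　
　　As a rough sketch of two of these input kinds, the snippet below reads the same text once from a file-like object and once from a gzip archive. The file name demo.txt.gz and the sample data are assumptions made only for this example.

```python
import gzip
import os
import tempfile
from io import StringIO

import numpy as np

text = "1,2\n3,4\n"  # illustrative data

# 1) A file-like object with a read method.
from_buffer = np.genfromtxt(StringIO(text), delimiter=",")

# 2) A path to a gzip archive; the ".gz" extension tells genfromtxt to decompress it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt.gz")  # hypothetical file name
    with gzip.open(path, "wt") as fh:
        fh.write(text)
    from_gzip = np.genfromtxt(path, delimiter=",")

print(from_buffer)
print(from_gzip)
```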
　　
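　　Splitting the lines into columns
　　
　　genfromtxt splits each non-blank line into a sequence of strings. Empty lines and comment lines are skipped. The delimiter parameter defines how the splitting takes place; by default (delimiter=None), any run of whitespace separates the columns.
　　
　　The delimiter is not limited to a single character; any string will do.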
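　　
　　A minimal sketch of the three delimiter styles described above, using made-up sample strings (the "::" separator is just an arbitrary multi-character choice):

```python
from io import StringIO

import numpy as np

# Default delimiter=None: any run of whitespace (spaces or tabs) separates columns.
by_space = np.genfromtxt(StringIO("1 2   3\n4\t5 6"))

# A single-character delimiter.
by_comma = np.genfromtxt(StringIO("1,2,3\n4,5,6"), delimiter=",")

# A multi-character string also works as a delimiter.
by_string = np.genfromtxt(StringIO("1::2::3\n4::5::6"), delimiter="::")

print(by_space)
print(by_comma)
print(by_string)
```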
　　
　　For a fixed-width file, each column is defined as a given number of characters. In this case, we set delimiter to a single integer (if all columns have the same width) or a sequence of integers (if the columns have different widths).
　　
　　data = "123456789\n   4  7 9\n   4567 9"
　　
　　np.genfromtxt(StringIO(data), delimiter=(4, 3, 2))
　　
　　array([[1234.,  567.,   89.],
　　       [   4.,    7.,    9.],
　　       [   4.,  567.,    9.]])
　　
　　genfromtxt parameter details
　　
　　autostrip parameter
　　
　　When a line is split into a sequence of strings, the leading and trailing whitespace of each item is kept. Set the autostrip parameter to True to remove it.
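　　
　　The effect of autostrip is easiest to see on string columns. This sketch uses an invented comma-separated sample and a fixed-width Unicode dtype ("U5") chosen only for the example.

```python
from io import StringIO

import numpy as np

data = "1, abc , 2\n3, xxx, 4"  # illustrative data with padded fields

# Without autostrip, the spaces around each field are preserved.
raw = np.genfromtxt(StringIO(data), delimiter=",", dtype="U5")

# With autostrip=True, each field is stripped before conversion.
stripped = np.genfromtxt(StringIO(data), delimiter=",", dtype="U5", autostrip=True)

print(repr(raw[0, 1]))
print(stripped)
```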
　　
　　comments parameter
　　
　　The comments parameter is a string that marks the beginning of a comment. The default is "#". The comment marker may appear anywhere on a line, and any characters after it are ignored.
　　
　　Note: There is one exception to this behavior: if the optional parameter names=True, the first commented line is examined for the column names.
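　　
　　A short sketch of comment handling, with made-up data that mixes full comment lines and a trailing comment:

```python
from io import StringIO

import numpy as np

# Illustrative data: comment lines and a trailing comment on a data line.
data = """# header comment
1, 2
3, 4  # trailing comment
# a full comment line
5, 6
"""

# Everything from "#" to the end of a line is dropped before splitting.
arr = np.genfromtxt(StringIO(data), comments="#", delimiter=",")
print(arr)
```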
　　
　　skip_header and skip_footer parameters
　　
　　A header in the file can get in the way of processing it. In this case, we use the optional skip_header parameter. Its value must be an integer: that many lines are skipped at the beginning of the file before any other operation is performed. Similarly, we can skip the last n lines of the file by setting skip_footer to n. Both parameters default to 0.
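　　
　　For instance, with ten lines holding the numbers 0 to 9 (sample data invented for this sketch), skipping three lines at the top and five at the bottom leaves only two values:

```python
from io import StringIO

import numpy as np

data = "\n".join(str(i) for i in range(10))  # ten lines: "0" ... "9"

everything = np.genfromtxt(StringIO(data))

# Drop the first 3 lines and the last 5 lines.
middle = np.genfromtxt(StringIO(data), skip_header=3, skip_footer=5)

print(everything)
print(middle)
```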
　　
　　usecols parameter
　　
　　In some cases, we are interested only in certain columns of the data. We can select them with usecols. This parameter accepts a single integer or a sequence of integers as column indices. Remember that, by convention, the first column has index 0, and negative indices count from the end, so -1 is the last column. If the columns have names, we can also pass those names to usecols, either as a sequence of strings or as a comma-separated string.
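　　
　　Both selection styles can be sketched as follows. The column names a, b, c are chosen just for this example.

```python
from io import StringIO

import numpy as np

data = "1 2 3\n4 5 6"  # illustrative data

# Select the first and last columns by index.
first_last = np.genfromtxt(StringIO(data), usecols=(0, -1))

# Name the columns, then select by name; the result is a structured array.
by_name = np.genfromtxt(StringIO(data), names="a, b, c", usecols=("a", "c"))

print(first_last)
print(by_name.dtype.names, by_name["c"])
```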
　　
　　dtype parameter
　　
　　The dtype parameter controls how the strings read from the file are converted to other data types. The default is float. dtype accepts the following values:
　　
　　1. A single type, such as dtype = float.
　　
　　2. A sequence type, such as dtype = (int, float, float).
　　
　　3. A comma-separated string, such as dtype="i4,f8,|S3".
　　
　　4. A dictionary with the two keys 'names' and 'formats'.
　　
　　5. A sequence of tuples, for example dtype = [('A', int), ('B', float)]
　　
　　6. An existing numpy.dtype object
　　
　　7. The special value None. In this case, the type of each column is determined from its own data. Setting dtype=None is slower than specifying the type explicitly, because each column is tested as a Boolean first, then as an integer, a float, and a complex number, and finally as a string, stopping at the first type that fits.
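　　
　　To compare automatic and explicit typing, here is a small sketch on invented mixed-type data. The field names n, x, s and the encoding="utf-8" argument (used so that text columns come back as str) are choices made for this example.

```python
from io import StringIO

import numpy as np

data = "1,2.5,abc\n2,3.5,def"  # illustrative data: int, float and text columns

# dtype=None: each column's type is guessed from its contents.
guessed = np.genfromtxt(StringIO(data), delimiter=",", dtype=None, encoding="utf-8")

# A sequence of (name, type) tuples: types and field names are fixed up front.
explicit = np.genfromtxt(
    StringIO(data),
    delimiter=",",
    dtype=[("n", int), ("x", float), ("s", "U3")],
    encoding="utf-8",
)

print(guessed.dtype)
print(explicit)
```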
　　
　　names parameter
　　
　　When names=True, genfromtxt reads the column names from the first line after any lines skipped by skip_header, even if that line is commented out. Names can also be defined through a structured dtype, and a value passed to names overrides them. The default is names=None. When names=None but a structured dtype is expected, NumPy generates the standard default names "f%i" (f0, f1, and so on). The format of these default names can be changed with defaultfmt.
　　
　　data = StringIO("1 2 3\n 4 5 6")
　　
　　ndtype = [('a', int), ('b', float), ('c', int)]
　　
　　names = ["A", "B", "C"]
　　
　　np.genfromtxt(data, names=names, dtype=ndtype)
　　
　　array([(1, 2.0, 3), (4, 5.0, 6)],
　　      dtype=[('A', '<i8'), ('B', '<f8'), ('C', '<i8')])
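　　
　　Two other ways of obtaining names can be sketched like this: reading them from a commented header line with names=True, and letting defaultfmt generate them. The sample text and the "var_%02i" pattern are illustrative choices.

```python
from io import StringIO

import numpy as np

# names=True: names come from the first line after skip_header, even if commented out.
data = "So it goes\n#a b c\n1 2 3\n4 5 6"
from_header = np.genfromtxt(StringIO(data), skip_header=1, names=True)

# No names given, but a structured dtype is expected: defaultfmt builds the names.
auto_named = np.genfromtxt(
    StringIO("1 2 3\n4 5 6"),
    dtype=(int, float, int),
    defaultfmt="var_%02i",
)

print(from_header.dtype.names)
print(auto_named.dtype.names)
```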
　　
　　Note: Remember that defaultfmt is used only when names are expected but not defined.
　　
　　Validating names
　　
　　A NumPy array with a structured dtype can also be viewed as a recarray, in which a field can be accessed as an attribute. For this reason, we may need to make sure that field names contain no spaces or invalid characters and do not clash with the name of a standard attribute (such as size or shape). genfromtxt accepts three optional parameters that give finer control over the names:
　　
　　deletechars: a string combining all the characters that must be deleted from a name. By default, the invalid characters are ~!@#$%^&*()-=+~\|]}[{';: /?.>,<.
　　
　　excludelist: a list of names to exclude, such as return, file, print. If one of the input names appears in this list, an underscore ("_") is appended to it.
　　
　　case_sensitive: whether the names should be case sensitive (case_sensitive=True), converted to upper case (case_sensitive=False or case_sensitive='upper'), or converted to lower case (case_sensitive='lower').
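　　
　　The default validation can be sketched with a few deliberately awkward names. The names below are invented for the example, and the sketch relies on the default settings, under which a space inside a name is replaced by an underscore before invalid characters are removed.

```python
from io import StringIO

import numpy as np

data = "1 2 3\n4 5 6"  # illustrative data

# "return" is in the default exclude list, "my col" contains a space,
# and "x%" contains a character from the default deletechars set.
raw_names = ["return", "my col", "x%"]
cleaned = np.genfromtxt(StringIO(data), names=raw_names)

# case_sensitive="lower" converts every name to lower case.
lowered = np.genfromtxt(
    StringIO(data),
    names=["Alpha", "Beta", "Gamma"],
    case_sensitive="lower",
)

print(cleaned.dtype.names)
print(lowered.dtype.names)
```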
　　

　　
　　converters parameter
　　
　　When we want a date in the format MM/DD/YYYY to be converted to a datetime object, or a string such as xx% to be converted to a floating-point number between 0 and 1, we need the converters parameter to define the conversion functions.
　　
　　The value of this parameter is usually a dictionary with column indices or column names as keys and conversion functions as values. These can be regular functions or lambda functions. In either case, each one should accept a single string as input and return a single element of the desired type.
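　　
　　The percentage case can be sketched as follows. The helper percent_to_fraction, the field names, and the sample data are all made up for illustration; encoding="utf-8" is passed so that the converter receives str rather than bytes.

```python
from io import StringIO

import numpy as np


def percent_to_fraction(text):
    """Turn a string such as ' 2.3%' into the float 0.023 (hypothetical helper)."""
    return float(text.strip().rstrip("%")) / 100.0


data = "1, 2.3%, 45.\n6, 78.9%, 0"  # illustrative data: column 1 holds percentages

arr = np.genfromtxt(
    StringIO(data),
    delimiter=",",
    names=("i", "p", "n"),
    converters={1: percent_to_fraction},  # key 1 = second column
    encoding="utf-8",
)

print(arr["p"])
```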
　　
　　missing_values parameter
　　
　　By default, an empty string marks a missing value. We can also use more complex strings, such as 'N/A' or '???', to mark missing data. missing_values accepts three types of values:
　　
　　A string or a comma-separated string: the marker for missing data in all columns.
　　
　　A sequence of strings: each item is associated with the corresponding column, in order.
　　
　　A dictionary: the values are strings or sequences of strings, and the keys are column indices (integers) or column names (strings). In addition, the special key None defines a default that applies to all columns.
　　
　　filling_values parameter
　　
　　When there are missing values, the system fills the default value:
　　
　　Expected type    Default
　　bool             False
　　int              -1
　　float            np.nan
　　complex          np.nan+0j
　　string           '???'
　　
　　As with missing_values, we can customize these fill values. filling_values accepts three types of values:
　　
　　- A single value: the default for all columns.
　　
　　- A sequence of values: each item is the default for the corresponding column, in order.
　　
　　- A dictionary: the values are single objects, and the keys are column indices (integers) or column names (strings). In addition, the special key None defines a default that applies to all columns.
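　　
　　The dictionary forms of missing_values and filling_values can be combined as in this sketch. The markers, fill values, and column names are illustrative; keys mix column indices and a column name to show that both are accepted.

```python
from io import StringIO

import numpy as np

data = "N/A, 2, 3\n4, ,???"  # illustrative data with three different missing markers

arr = np.genfromtxt(
    StringIO(data),
    delimiter=",",
    dtype=int,
    names="a,b,c",
    # Per-column markers: by index (0, 2) or by name ("b").
    missing_values={0: "N/A", "b": " ", 2: "???"},
    # Per-column replacements for the values marked as missing.
    filling_values={0: 0, "b": 0, 2: -999},
)

print(arr)
```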
　　
　　usemask parameter
　　
　　We may also want to track where data is missing by building a Boolean mask that is True where data is missing and False otherwise. To do this, we set the optional parameter usemask=True (the default is False). The result is then a MaskedArray.
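　　
　　A minimal sketch of the mask, assuming comma-separated sample data in which two fields are empty:

```python
from io import StringIO

import numpy as np

data = "1,,3\n4,5,"  # illustrative data: one empty field in each row

masked = np.genfromtxt(StringIO(data), delimiter=",", usemask=True)

# The mask is True exactly where a field was missing.
print(masked.mask)

# Masked entries can be replaced afterwards, here with 0.0.
print(masked.filled(0.0))
```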
　　
　　In addition to genfromtxt, the numpy.lib.npyio module has provided several convenience functions derived from genfromtxt. They work in the same way but have different defaults. Note that recent NumPy releases deprecate or remove these helpers, so check the version you are using.
　　
　　ndfromtxt: always sets usemask=False. The output is always a standard numpy.ndarray.
　　
　　mafromtxt: always sets usemask=True. The output is always a MaskedArray.
　　
　　recfromtxt: returns a standard numpy.recarray (if usemask=False) or a MaskedRecords array (if usemask=True). The default is dtype=None, which means the type of each column is determined automatically.
　　
　　recfromcsv: similar to recfromtxt, but with delimiter="," by default.
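　　
　　Because these helpers mainly change defaults, a similar result can be sketched with genfromtxt alone. The snippet below roughly mimics the CSV record helper by passing delimiter=",", names=True, dtype=None, and lower-cased names, then viewing the result as a recarray; the sample data is invented, and encoding="utf-8" is an extra choice for this example.

```python
from io import StringIO

import numpy as np

data = "A,B\n1,2.5\n3,4.5"  # illustrative CSV text with a header line

arr = np.genfromtxt(
    StringIO(data),
    delimiter=",",
    names=True,              # take field names from the header line
    dtype=None,              # guess each column's type
    case_sensitive="lower",  # lower-case the field names
    encoding="utf-8",
)

# Viewing as a recarray allows attribute-style field access.
rec = arr.view(np.recarray)
print(rec.a, rec.b)
```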
　　