Python Data Analysis with pandas, Basics 2 (Format Conversion, Sorting, Statistics, Pivot Tables)

//2019.07.18
Python pandas Data Analysis Learning Notes - Part II
2.1 Data format conversion
1. Viewing and converting the data type of a column:
(1) View the data type of a column: df["column name"].dtype
(2) Convert the data type of a column with the type conversion function: df["column name"] = df["column name"].astype("new data type")
Code example:
import numpy as np
import pandas as pd

df = pd.read_excel("D:/Byrbt2018/Study/Python data analysis courses+exercises+explanations/Python data analysis courses+exercises+explanations/Job/Job 4/Job 4/hotel data 1.xlsx")
print(df)
print(df["score"].dtype)  # view the data type of this column
print(df["score"])
df["score"] = df["score"].astype("int")  # convert a column's data type with df["column name"].astype("type name")
print(df["score"])
print(df["score"].dtype)
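Because the Excel file above lives on the author's machine, here is a minimal self-contained sketch of the same .dtype / .astype() calls on a small made-up DataFrame (the "score" column name mirrors the hotel data; the values are invented):

import pandas as pd

df_demo = pd.DataFrame({"score": [4.5, 3.8, 4.9]})   # made-up ratings
print(df_demo["score"].dtype)                        # float64
df_demo["score"] = df_demo["score"].astype("int")    # casting float to int truncates: 4.5 -> 4
print(df_demo["score"].dtype)                        # int64 (int32 on some Windows builds)
print(df_demo["score"])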

2.2 Sorting data
1. To sort the table by a single column, use the function:
df.sort_values(by="column name", ascending=True (ascending) / False (descending))
print(df.sort_values(by="score", ascending=False)["score"])
2. Sorting by multiple columns:
The main issue with multi-column sorting is ties: when values in one column are equal, the rows are further ordered by the next column (see the example below). The function and its usage are as follows:
df.sort_values(by=["column 1", "column 2", ...], ascending=[True, False, ...]), where by=[...] gives the priority order of the sort columns and ascending=[...] gives the corresponding sort direction (ascending/descending) for each column.
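A quick illustration of the tie-breaking behaviour, using a small made-up DataFrame rather than the hotel data:

import pandas as pd

df_demo = pd.DataFrame({"score": [4, 4, 5, 3], "price": [300, 250, 400, 150]})
# score is sorted in descending order; the two rows with score 4 are then ordered by price ascending
print(df_demo.sort_values(by=["score", "price"], ascending=[False, True]))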

2.3 Basic statistical analysis of the data
1. For numeric columns we usually need some common statistics. The most common descriptive-statistics function is .describe(), which reports several statistical indicators at once.
2. For statistics on individual columns of the table, the main functions are the following (a small runnable sketch follows this list):
(1) Extremes: maximum and minimum, df["column"].max() and df["column"].min()
(2) Median: df["column"].median()
(3) Mean: df["column"].mean()
(4) Variance: df["column"].var()
(5) Standard deviation: df["column"].std()
(6) Sum: df["column"].sum()
(7) Correlation coefficient and covariance:
correlation coefficient: df[["column 1", "column 2", ...]].corr()
covariance: df[["column 1", "column 2", ...]].cov()
(8) Counting:
1) The unique values appearing in a column can be queried with df["column"].unique(); wrapping the call in len() gives the number of unique values.
2) Replacing data in the table: df["column"].replace(a, b, inplace=True) replaces every occurrence of value a in that column with value b.
3) Counting how often each unique value occurs: df["column"].value_counts(), which by default returns the counts of the different values sorted in descending order.
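A self-contained sketch of these calls on made-up data (the real hotel data is used in the full script at the end of this post):

import pandas as pd

df_demo = pd.DataFrame({"price": [320, 280, 450, 280],
                        "area": ["east", "west", "east", "south"]})
print(df_demo.describe())                  # count, mean, std, min, quartiles, max of numeric columns
print(df_demo["price"].mean(), df_demo["price"].median(), df_demo["price"].std())
print(df_demo["area"].unique())            # distinct values in the column
print(len(df_demo["area"].unique()))       # number of distinct values
print(df_demo["area"].value_counts())      # occurrences of each value, descending by count
df_demo["price"] = df_demo["price"].replace(280, 285)  # same effect as .replace(280, 285, inplace=True), but also safe under newer pandas copy-on-write
print(df_demo["price"])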

2.4 Pivot table operations and functions
1. The pivot table is a widely used and powerful data-summarization tool; the corresponding pandas function is pd.pivot_table().
2. Usage of the pivot-table function parameters. The function has the form:
pd.pivot_table(df, index=["column 1", "column 2", ...], values=["other column 1", "other column 2", ...], aggfunc=np.sum ..., fill_value=0 (value used to fill missing entries in the result), margins=True (adds an "All" row/column of overall aggregates), columns=["column 1", "column 2", ...] (optional; layers the result along the column direction, analogous to the hierarchical row index))
Specific example code:
pd.set_option("display.max_columns", 1000)
pd.set_option("display.max_rows", 1000)
# set the maximum number of columns and rows pandas prints (an ellipsis is shown beyond these limits)
print(df)
print(pd.pivot_table(df, index="area"))  # mean of every numeric column, grouped by area
print(pd.pivot_table(df, index=["area", "type"]))  # area as the first index level, type as the second; mean of the other numeric columns
print(pd.pivot_table(df, index=["area", "type"], values=["price"]))
print(pd.pivot_table(df, index=["area", "type"], values=["price"], aggfunc=[np.sum, np.mean]))
print(pd.pivot_table(df, index=["area"], values=["score", "price"], columns=["type"], aggfunc={"score": np.mean, "price": np.sum}, fill_value=0))
table = pd.pivot_table(df, index=["area", "type"], values=["price"])
# print(table.sort_values(by="score", ascending=False))  # sort by score in descending order (this table only contains "price", so the line stays commented out)
print(table.index)
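Because the example above depends on the local Excel file, here is a runnable pivot-table sketch on made-up data that reuses the same column names (the data itself is invented):

import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "area":  ["east", "east", "west", "west", "south"],
    "type":  ["budget", "luxury", "budget", "budget", "luxury"],
    "price": [300, 800, 250, 280, 900],
    "score": [4.0, 4.8, 3.9, 4.1, 4.7],
})
# mean price per (area, type) pair; margins=True appends an "All" row with the overall mean
print(pd.pivot_table(df_demo, index=["area", "type"], values=["price"], margins=True))
# a different aggregation per value column; absent (area, type) combinations are filled with 0
print(pd.pivot_table(df_demo, index=["area"], columns=["type"],
                     values=["score", "price"],
                     aggfunc={"score": np.mean, "price": np.sum}, fill_value=0))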

The complete code for this lesson is shown below (it can be copied and run directly, and the detailed comments should make it easy to follow):

import numpy as np
import pandas as pd

df = pd.read_excel("D:/Byrbt2018/Study/Python data analysis courses+exercises+explanations/Python data analysis courses+exercises+explanations/Job/Job 4/Job 4/hotel data 1.xlsx")
print(df)
print(df.index)
print(df.columns)
print(df[:5])  # output the first 5 rows of data

# data type conversion
print(df["score"].dtype)  # view the data type of this column
print(df["score"])
df["score"] = df["score"].astype("int")  # convert a column's data type with df["column name"].astype("type name")
print(df["score"])
print(df["score"].dtype)
print(df["area"].dtype)
df["area"] = df["area"].astype("str")  # convert the "area" column to string data
print(df["area"].dtype)
print(df["area"])

# sorting the table data
print(df.sort_values(by="score", ascending=False)["score"])  # sort with df.sort_values(by="column name", ascending=True (ascending) / False (descending))
print(df.sort_values(by=["score", "price"], ascending=False))
print(df.sort_values(by=["score", "price"], ascending=[False, True])[["score", "price"]])  # multi-column sorting: by=[...] gives the priority order of the columns, ascending=[...] gives the sort direction (ascending/descending) for each column

# descriptive statistical analysis of the data
print(df.describe())  # statistical indicators for every numeric column (count, mean, std, min, quartiles, max)
print(df["price"].mean())    # mean of the price column
print(df["price"].var())     # variance of the price column
print(df["price"].max())     # maximum of the price column
print(df["price"].min())     # minimum of the price column
print(df["price"].std())     # standard deviation of the price column
print(df["price"].median())  # median of the price column
print(df[["price", "score"]].corr())  # correlation coefficient between price and score
print(df[["price", "score"]].cov())   # covariance between price and score
print(len(df))  # number of rows in the table
print(df["score"].unique())        # all unique values of the score column
print(len(df["score"].unique()))   # number of distinct score values
df["score"].replace(4, 4.1, inplace=True)  # replace every score of 4 with 4.1
print(df["score"])
print(df["area"].unique())        # all unique values of the area column
print(len(df["area"].unique()))   # number of distinct areas
print(df["area"].value_counts())  # number of occurrences of each area value
print(df["area"].value_counts()[:5])  # the five most frequent areas and their counts

# pivot tables: pd.pivot_table(df, index=["column 1", "column 2", ...], values=["other column 1", "other column 2", ...], aggfunc=np.sum ..., fill_value=0 (fill value for missing entries), margins=True (adds an "All" row/column of overall aggregates), columns=["column 1", "column 2", ...] (optional; layers the result along the column direction, analogous to the hierarchical row index))
pd.set_option("display.max_columns", 1000)  # maximum number of columns and rows pandas prints (an ellipsis is shown beyond these limits)
pd.set_option("display.max_rows", 1000)
print(df)
print(pd.pivot_table(df, index="area"))  # mean of every numeric column, grouped by area
print(pd.pivot_table(df, index=["area", "type"]))  # area as the first index level, type as the second; mean of the other numeric columns
print(pd.pivot_table(df, index=["area", "type"], values=["price"]))
print(pd.pivot_table(df, index=["area", "type"], values=["price"], aggfunc=[np.sum, np.mean]))
print(pd.pivot_table(df, index=["area"], values=["score", "price"], columns=["type"], aggfunc={"score": np.mean, "price": np.sum}, fill_value=0))
table = pd.pivot_table(df, index=["area", "type"], values=["price"])
# print(table.sort_values(by="score", ascending=False))  # sort by score in descending order (this table only contains "price", so the line stays commented out)
print(table.index)

The results of the run are shown in the output screenshots (omitted here).
Source: www.cnblogs.com/Yanjy-OnlyOne/p/11207278.html