Python pandas Data Analysis Basics (learn pandas in one article)

//2017.07.17

An introduction to data analysis with Python pandas (learn pandas in one article).

A quick-start guide to the pandas data analysis module. The complete code at the end can be copied and run directly, and it includes detailed comments to help you get started.

1.1 pandas Module Introduction

Before using pandas, import the module:

import pandas as pd
import numpy as np  # import the numpy and pandas modules

1. pandas has two commonly used data structures:
(1) Series
A one-dimensional array or list (a column vector). It is similar to a numpy array, but it can store values of many different data types.
(2) DataFrame
A two-dimensional data structure, comparable to an Excel spreadsheet. It can be understood as a container of Series.

1.2 Using the pandas Series type
1. Defining a Series:
s = pd.Series([1, 2, 3, np.nan, 2, 3, 1, ...], index=["a", "b", "c", ...])
Here index assigns a label to each row.
2. The index of a Series is, in essence, the list of row labels; it can be queried and printed with s.index.
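As a minimal sketch of the points above (the values and labels are made up for illustration), a labelled Series can be created and inspected like this:

```python
import numpy as np
import pandas as pd

# A Series with explicit row labels; np.nan marks a missing value.
s = pd.Series([1, 2, 3, np.nan], index=["a", "b", "c", "d"])

print(s.index)     # the row labels
print(s.values)    # the underlying values as a NumPy array
print(s["b"])      # look up one value by its label
print(s["a":"c"])  # label slices include both endpoints
```

Because one value is missing, pandas stores the whole Series as floating point numbers.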

1.3 Common operations on the two-dimensional DataFrame
(a) Constructing a DataFrame
1. A DataFrame is mainly built in two ways: by passing in a two-dimensional array, or by passing in a dictionary.
Way 1, passing in a two-dimensional array: df = pd.DataFrame(np.random.randn(6, 4))
Way 2, passing in a dictionary: df = pd.DataFrame({"A": [...], "B": [...]})
2. The labels of the table are set through index (the name of each row) and columns (the name of each column):
df = pd.DataFrame(np.random.randn(6, 4), index=pd.date_range("20180701", periods=6, freq="M"), columns=list("ABCD"))
3. When a DataFrame is built from a dictionary, each key is a column name (an entry of columns). The values of a column can be supplied in six ways, including as an array:
df = pd.DataFrame({"A": 1.0, "B": np.array([3] * 4, dtype=int), ...})
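The two construction routes can be sketched side by side as follows. The names df_a and df_b are chosen here only for illustration, and a daily frequency is used for the date index to keep the example short:

```python
import numpy as np
import pandas as pd

# Way 1: a 6x4 array of random numbers with explicit row and column labels.
dates = pd.date_range("20180701", periods=6, freq="D")
df_a = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

# Way 2: a dictionary; each key becomes a column name and a scalar
# value is repeated to fill the whole column.
df_b = pd.DataFrame({
    "A": 1.0,
    "B": np.array([3] * 4, dtype=int),
    "C": pd.Categorical(["a", "b", "c", "d"]),
})

print(df_a.shape)
print(df_b)
```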
(b) Querying the data in a DataFrame
1. First and last rows: .head(x) and .tail(x) return the first or last x rows.
2. Data types: df.dtypes shows the data type of each column.
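A short sketch of these queries on a small made-up table (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "score": [4.1, 4.7, 3.9, 4.8, 4.4],
    "visitors": [120, 340, 90, 560, 210],
})

print(df.head(2))  # the first two rows
print(df.tail(2))  # the last two rows
print(df.dtypes)   # one data type per column
```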

1.4 Reading and manipulating data with pandas
1. pandas reads table data with its built-in functions pd.read_excel(file path) and pd.read_csv(file path), for example:
df = pd.read_excel("D:/Byrbt2018/Study/Python data analysis courses + practice + explain/Python data analysis courses + practice + explain/jobs/job 3/3 job/Hong Kong hotel data.xlsx")  # reading a table relies mainly on pd.read_excel / pd.read_csv plus the file path
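Since the Excel file above exists only on the author's machine, the following sketch reads CSV text from memory instead. The data is invented for illustration; with a real file, the path string would be passed to pd.read_csv in place of the io.StringIO object:

```python
import io
import pandas as pd

# Stand-in for a file on disk; pd.read_csv also accepts a path string.
csv_text = "name,score,visitors\nHotel A,4.6,1200\nHotel B,4.2,800\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)  # column types are inferred from the text
```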
2. Ways to extract certain rows and columns from a table:
(1) By label:
df.loc[row label a:row label b, column label a:column label b]
(2) By position:
df.iloc[row number:row number, column number:column number]
(3) Directly by column list and row slice:
df[[column 1, column 2, column 3, ...]][row number:row number]
(4) The standard form:
df.loc[[index, ...], [columns, ...]]  # rows first, then columns
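The four forms can be compared on a small made-up table. The row labels r1 to r4 are hypothetical and are strings on purpose, so that label-based and position-based slicing are easy to tell apart:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["a", "b", "c", "d"],
        "type": ["x", "y", "x", "y"],
        "score": [4.1, 4.7, 3.9, 4.8],
    },
    index=["r1", "r2", "r3", "r4"],
)

by_label = df.loc["r1":"r3", "name":"type"]         # labels: both endpoints included
by_position = df.iloc[0:2, 0:2]                     # positions: the end is excluded
by_columns = df[["name", "score"]][0:2]             # pick columns, then slice rows
by_lists = df.loc[["r1", "r4"], ["name", "score"]]  # explicit row and column lists

print(by_label)
print(by_position)
print(by_columns)
print(by_lists)
```

Note the difference between the first two forms: a .loc slice includes its end label, while an .iloc slice stops before its end position.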
3. Adding and deleting table rows
(1) Adding a row:
First define the new row as a dictionary (its keys are the column names of the table), then convert it to a one-dimensional Series. The row is then added with the append function df.append(s); note that DataFrame.append was removed in pandas 2.0, where pd.concat does the same job. The s.name attribute sets the row label of the newly added row.
s = {0: "Tianshui Hotel", 1: "recreation", 2: "Tianshui", 3: "Gangu County", 4: "Zhongguancun Street", 5: 4.5, 6: 11000, 7: 345}  # define a new row of data as a dictionary
s1 = pd.Series(s)  # convert the dictionary to a one-dimensional Series
s1.name = 420  # set the row label
(2) Deleting a row:
Call the pandas drop function df.drop([row label]) to remove the corresponding row.
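A minimal sketch of adding and then deleting a row, on invented data. It uses pd.concat rather than df.append so that it also runs on pandas 2.0 and later; the names added and removed are chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "score": [4.1, 4.7]})

# Define the new row as a dictionary keyed by column name, then make it a Series.
row = pd.Series({"name": "c", "score": 4.5})
row.name = 420  # the Series name becomes the row label

# Turn the Series into a one-row table and concatenate it onto df.
added = pd.concat([df, row.to_frame().T])
print(added)

# Delete the row again by its label.
removed = added.drop([420])
print(removed)
```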
4. Column operations
(1) Adding a column:
A column can be added directly by assigning to a new column name:
df["number"] = range(1, len(df) + 1)  # this adds a new column called number
(2) Deleting a column:
A column is deleted with df.drop("column name", axis=1). Here axis=1 must be set: it means a column is deleted rather than a row. If axis is omitted or set to axis=0, a row is deleted instead:
df.drop("column name", axis=1)
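The two column operations can be sketched on a small invented table. The copy named with_number exists only so that the intermediate state can be inspected:

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"], "score": [4.1, 4.7, 3.9]})

# Add a running number column by assigning to a new column name.
df["number"] = range(1, len(df) + 1)
with_number = df.copy()
print(with_number)

# axis=1 drops a column; the default axis=0 would look for a row label instead.
df = df.drop("number", axis=1)
print(df)
```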
5. Selecting data by condition:
Data is selected by condition with df[(selection condition)], for example:
print(df[df.score > 4.5])  # select the rows with a score above 4.5
print(df[(df.score > 4.5) & (df.type == "romantic couple")])  # select the rows with a score above 4.5 and the type "romantic couple"
print(df[((df.type == "Hong Kong") | (df["the number of daily"] > 1000)) & (df.score > 4.5)])  # select the rows whose type is "Hong Kong" or whose daily number exceeds 1,000, and whose score is above 4.5
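Conditional selection can be tried on a small made-up table. The column names and values are hypothetical; each condition is wrapped in parentheses because & and | bind more tightly than the comparisons:

```python
import pandas as pd

df = pd.DataFrame({
    "type": ["romantic couple", "business", "romantic couple", "family"],
    "score": [4.8, 4.6, 4.2, 3.9],
    "visitors": [500, 1500, 800, 2000],
})

# One condition.
high = df[df["score"] > 4.5]

# Two conditions that must both hold.
high_couple = df[(df["score"] > 4.5) & (df["type"] == "romantic couple")]

# Either of two conditions, combined with a third that must hold.
mixed = df[((df["type"] == "family") | (df["visitors"] > 1000)) & (df["score"] > 4.5)]

print(high)
print(high_couple)
print(mixed)
```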
6. Handling missing values and outliers:
(1) Missing values are handled mainly with the following four functions:
isnull (returns Boolean values indicating whether each value is missing)
notnull (the opposite of isnull: indicates whether each value is not missing)
fillna (fills in missing values)
dropna (filters out the rows or columns containing missing values)
(2) Rules for deleting missing values:
dropna() has three main parameters: how="all" (delete only the rows or columns in which every value is missing), inplace=True (update the table in place), and axis=0 or 1 (whether rows or columns are deleted)
# handling missing values
print(df.isnull())
print(df[df["score"].isnull()])  # find and print the rows with a missing score
# fill in missing values
print(df[df["the number of evaluation"].isnull()])  # first check which values are missing
df["the number of evaluation"] = df["the number of evaluation"].fillna(np.mean(df["the number of evaluation"]))  # fill missing values with the column mean and assign the result back
print(df)
print(len(df[df["the number of evaluation"].isnull()]))
print(len(df))
print(len(df.dropna()))
df.dropna(inplace=True)
print(len(df))
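The same steps can be followed on a tiny invented table, which makes the effect of each call easy to check by hand. The column names score and ratings are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "score": [4.5, np.nan, 4.0, 3.5],
    "ratings": [100.0, 200.0, np.nan, 300.0],
})

print(df.isnull())               # True wherever a value is missing
print(df[df["score"].isnull()])  # the rows whose score is missing

# Fill missing rating counts with the column mean and assign the result back.
df["ratings"] = df["ratings"].fillna(df["ratings"].mean())

before = len(df)
df = df.dropna()  # drop the rows that still contain a missing value
print(before, len(df))
```

Filling is done before dropping here, so only the row with a missing score is removed.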

# handling outliers
Outliers are generally identified by checking whether the data in a column matches what that column represents (for example, a column that counts people should not contain a fractional or negative value). The table is then filtered on that check and updated.
print(len(df[df["the number of daily"] % 1 != 0]))  # count the outliers
df = df[(df["the number of daily"] % 1 == 0) & (df["the number of daily"] > 0)]  # keep only the rows that satisfy the conditions
print(df)
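A small sketch of this check on invented data, assuming a hypothetical visitors column that should only hold positive whole numbers:

```python
import pandas as pd

df = pd.DataFrame({"visitors": [120, 35.5, -10, 400]})

# A head count should be a positive whole number.
is_whole = df["visitors"] % 1 == 0
is_positive = df["visitors"] > 0

print(len(df[~is_whole]))        # number of rows with a fractional count
df = df[is_whole & is_positive]  # keep only the plausible rows
print(df)
```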

The complete runnable code is shown below (it can be copied and run directly, and contains detailed comments to help you get started):

import pandas as pd
import numpy as np  # import the numpy and pandas modules

# Series: operations on the one-dimensional list
s = pd.Series([1, 2, 3, 4, np.nan, 2, 3, 4, 6, 7])
print(s)
print(s.index)  # print the row labels (index) of the Series
print(s.values)  # print the values of the Series
print(s[2:9:2])  # print every second value (slice operation)
s.index.name = "attributes"  # give the index of the Series a name
print(s)
s.index = list("abcdefghij")  # redefine the label of each row
print(s)
print(s["a":"h":2])  # extract part of the data with a label slice

# DataFrame: operations on the two-dimensional table
date = pd.date_range("20180101", periods=6, freq="D")  # generate a time series
print(date)
df = pd.DataFrame(np.random.randn(6, 4), index=date, columns=list("ABCD"))  # define a 6x4 table of random numbers, then name each row and column (index and columns)
df.index.name = "date"
print(df)
df1 = pd.DataFrame({"A": 1.0, "B": np.array([3] * 4, dtype=int), "C": pd.Timestamp("20190701"), "D": pd.Series([1.21, 2.21, 3.24, 4.26], dtype=float), "E": pd.Categorical(["a", "b", "c", "d"]), "F": "ABC"})
print(df1)
print(df1.values)  # view the data as a two-dimensional array
print(df1.index)  # view the row labels
print(df1.head(3))  # view the first three rows
print(df1.tail(3))  # view the last three rows
print(df1.dtypes)  # view the data type of each column

# reading table data
df = pd.read_excel("D:/Byrbt2018/Study/Python data analysis courses + practice + explain/Python data analysis courses + practice + explain/jobs/job 3/3 job/Hong Kong hotel data.xlsx")  # reading a table relies mainly on pd.read_excel / pd.read_csv plus the file path
print(df)

# row operations
print(df.iloc[0])  # query data by position: df.iloc[:, :]
print(df.iloc[:5])  # query data by position: df.iloc[:, :]
print(df.loc[0:5, 0:3])  # query data by row and column labels: df.loc[:, :]
# add a row
s = {0: "Tianshui Hotel", 1: "recreation", 2: "Tianshui", 3: "Gangu County", 4: "Zhongguancun Street", 5: 4.5, 6: 11000, 7: 345}  # define a new row of data as a dictionary
s1 = pd.Series(s)  # convert the dictionary to a one-dimensional Series
s1.name = 420  # set the row label
print(s1)
df = pd.concat([df, s1.to_frame().T.infer_objects()])  # append the row and rebind df; this replaces df.append(s1), which was removed in pandas 2.0
print(df)
print(df[-5:])
df = df.drop([420])  # delete a row and rebind df
print(df)
df.columns = ["name", "type", "city", "China", "street", "score", "the number of evaluation", "the number of daily"]  # rename every column of the table
print(df)
print(type(df.index))
print(df.columns)
print(df["name"])
# three ways to extract certain rows and columns (df.loc[:, :], df.iloc[:, :], df[[...]][:])
print(df[["name", "type"]][:5])  # extract certain columns of the first rows
print(df.iloc[0:5, 0:3])  # extract by position (rows first, then columns)
print(df.loc[1:40, "type":"score"])
print(df[["name", "score", "city"]][0:400:20])  # columns first, then rows

# column operations
df["number"] = range(1, len(df) + 1)  # add a column
print(df[:5])
df = df.drop("number", axis=1)  # delete a column with drop(x, axis=1): x is the column name and axis=1 means a column is deleted; if axis is omitted it is 0, which deletes a row by default
print(df)
print(df.loc[[1, 2, 3, 5, 10], ["type", "score"]])  # standard form for extracting table data

# selecting data by condition
print(df[df.score > 4.5])
print(df[(df.score > 4.5) & (df.type == "romantic couple")])
print(df[((df.type == "Hong Kong") | (df["the number of daily"] > 1000)) & (df.score > 4.5)])

# handling outliers and missing values
print(df.isnull())
print(df[df["score"].isnull()])  # find and print the rows with a missing score
# fill in missing values
print(df[df["the number of evaluation"].isnull()])  # first check which values are missing
df["the number of evaluation"] = df["the number of evaluation"].fillna(np.mean(df["the number of evaluation"]))  # fill missing values with the column mean and assign the result back
print(df)
print(len(df[df["the number of evaluation"].isnull()]))
print(len(df))
print(len(df.dropna()))
df.dropna(inplace=True)
print(len(df))
print(len(df[df["the number of daily"] % 1 != 0]))  # count the outliers
df = df[(df["the number of daily"] % 1 == 0) & (df["the number of daily"] > 0)]  # keep only the rows that satisfy the conditions
print(df)

Origin www.cnblogs.com/Yanjy-OnlyOne/p/11201273.html