NumPy
Basic NumPy data structures
The np.array() function accepts a list, which may be nested, and returns an array with the corresponding number of dimensions.
import numpy as np
vector = np.array([1, 2, 3, 4])
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
Special arrays:
np.zeros((size of the first dimension, size of the second dimension, ...)) creates an array filled with zeros. It takes a tuple giving the size of each dimension.
np.ones((size of the first dimension, size of the second dimension, ...)) creates an array filled with ones. It takes the same kind of tuple.
np.arange(start, stop, step) creates a sequence from start up to, but not including, stop.
np.eye(size) creates a size * size identity matrix.
np.linspace(start, stop, num) returns num evenly spaced values from start to stop, with both endpoints included.
np.logspace(start exponent, stop exponent, num, base=base) returns num values forming a geometric sequence from base**start to base**stop.
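As a quick illustration of the constructors listed above, here is a minimal sketch. The shapes and ranges are arbitrary values chosen for the example, not taken from any dataset.

```python
import numpy as np

zeros = np.zeros((2, 3))                 # 2x3 array of 0.0
ones = np.ones((2, 3))                   # 2x3 array of 1.0
steps = np.arange(0, 10, 2)              # 0, 2, 4, 6, 8 (the stop value 10 is excluded)
identity = np.eye(3)                     # 3x3 identity matrix
linear = np.linspace(0, 1, 5)            # 0, 0.25, 0.5, 0.75, 1 (both endpoints included)
powers = np.logspace(0, 3, 4, base=2)    # 1, 2, 4, 8, i.e. 2**0 through 2**3

print(zeros.shape, ones.dtype)
print(steps, linear, powers)
```

Note that np.zeros and np.ones produce floating-point arrays by default, while np.arange with integer arguments produces integers.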
Basic NumPy operations and properties
All elements stored in an array must have the same data type; the dtype attribute returns that type.
Use the astype(type) method to convert the elements to another data type.
vector = np.array(["1", "2", "3", "4"]) # ['1' '2' '3' '4']
vector = vector.astype(float) # [1. 2. 3. 4.]
The shape attribute of an array returns the size of each dimension.
Use the reshape((size of the first dimension, size of the second dimension, ...)) method to change the shape of an array. If -1 is passed for one dimension, its size is inferred from the total number of elements and the sizes of the other dimensions.
matrix = np.arange(6).reshape(-1, 3) # gives [[0 1 2] [3 4 5]]
Use the ravel() method to flatten a multi-dimensional array into a one-dimensional vector.
matrix = np.arange(6).reshape(-1, 3)
matrix = matrix.ravel() # gives [0 1 2 3 4 5]
NumPy arrays support indexing and slicing, much like Python lists.
matrix = np.array([[5,10,15], [20,25,30], [35,40,45]])
matrix[:, 1] # gives [10 25 40]
matrix[:, 0:2] # gives [[5 10] [20 25] [35 40]]
matrix[1:3, :2] # gives [[20 25] [35 40]]
reshape() and slicing generally do not create a new array. They return a view of the original data, so modifications made through the view also change the original array.
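The view behaviour can be checked directly. The following is a small sketch with made-up values; np.shares_memory reports whether two arrays overlap in memory, and copy() is the usual way to get independent data.

```python
import numpy as np

original = np.arange(6)             # [0 1 2 3 4 5]
reshaped = original.reshape(2, 3)   # a view on the same data
column = reshaped[:, 0]             # a slice is also a view

column[0] = 100                     # writing through the slice...
print(original[0])                  # ...changes the original: 100

independent = reshaped.copy()       # an explicit copy owns its own data
independent[0, 1] = -1
print(original[1])                  # still 1

print(np.shares_memory(original, reshaped))     # True
print(np.shares_memory(original, independent))  # False
```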
NumPy arrays support comparison operators. A comparison returns a bool array of the same shape as the original, holding the result for each element.
print(matrix == 10) # gives [[False True False] [False False False] [False False False]]
print(matrix > 10) # gives [[False False True] [ True True True] [ True True True]]
As in MATLAB, the result of a comparison can be used to index the array.
matrix[matrix > 10] = 10 # matrix becomes [[5 10 10] [10 10 10] [10 10 10]]
Use min(axis=dimension), max(axis=dimension) and sum(axis=dimension) to get the minimum, maximum and sum of the array along a given dimension.
matrix = np.arange(9).reshape((3, 3))
matrix.min(axis=1) # array([0, 3, 6])
matrix.max(axis=1) # array([2, 5, 8])
matrix.sum(axis=1) # array([3, 12, 21])
Sorting
np.sort(matrix, axis=dimension) returns the array sorted along the given dimension. Note that it returns a new array and leaves the original unchanged.
np.argsort(matrix, axis=dimension) returns, for each position in the sorted array, the index of that element in the original array.
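A short sketch of both functions on a small made-up array. The last step uses np.take_along_axis, which is not mentioned above, only to show that the indices from argsort reproduce the sorted result.

```python
import numpy as np

matrix = np.array([[3, 1, 2],
                   [9, 7, 8]])

sorted_rows = np.sort(matrix, axis=1)   # new array with each row sorted
order = np.argsort(matrix, axis=1)      # where each sorted element sat in the original row

print(sorted_rows)   # [[1 2 3] [7 8 9]]
print(order)         # [[1 2 0] [1 2 0]]
print(matrix)        # the original is unchanged

# Picking elements by the argsort indices rebuilds the sorted array.
rebuilt = np.take_along_axis(matrix, order, axis=1)
print(np.array_equal(rebuilt, sorted_rows))  # True
```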
NumPy array operations
Addition and subtraction:
Adding or subtracting two arrays of the same shape adds or subtracts the elements at corresponding positions.
a1 = np.array([20,30,40,50]) # gives [20 30 40 50]
a2 = np.arange(4) # gives [0 1 2 3]
a3 = a1 - a2 # gives [20 29 38 47]
Adding or subtracting a scalar applies the operation to every element.
a1 = np.array([20,30,40,50]) # gives [20 30 40 50]
a2 = a1 - 1 # gives [19 29 39 49]
Multiplication:
The * operator on two arrays of the same shape returns a new array of that shape holding the products of the elements at corresponding positions.
Use matrix1.dot(matrix2) or np.dot(matrix1, matrix2) to compute the matrix product.
A = np.array([[1,1],[0,1]])
B = np.array([[2,0],[3,4]])
# Element-wise multiplication
print(A * B) # gives [[2 0] [0 4]]
# Two ways to write the matrix product
print(A.dot(B)) # gives [[5 4] [3 4]]
print(np.dot(A, B)) # gives [[5 4] [3 4]]
Power: the ** operator raises each element of the array to the given power.
A = np.arange(5)
A = A ** 2 # gives [0 1 4 9 16]
Tiling:
Use np.tile(matrix, (repetitions along the first dimension, repetitions along the second dimension, ...)) to repeat the array that many times along each dimension.
matrix = np.arange(2)
row_tiled = np.tile(matrix, (1,3)) # gives [[0 1 0 1 0 1]]
col_tiled = np.tile(matrix, (3,1)) # gives [[0 1] [0 1] [0 1]]
Transpose, determinant and inverse of a matrix
Use the .T attribute to transpose a matrix.
Use np.linalg.inv(matrix) to compute the inverse of a matrix.
Use np.linalg.det(matrix) to compute the determinant of a matrix.
matrix = np.arange(1,5).reshape(2,2)
# Transpose
print(matrix.T) # gives [[1 3] [2 4]]
# Inverse
print(np.linalg.inv(matrix)) # gives [[-2. 1.] [ 1.5 -0.5]]
# Determinant
print(np.linalg.det(matrix)) # gives -2.0000000000000004
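Because the inverse and determinant are computed in floating point, results such as -2.0000000000000004 are normal. A minimal sketch of how one might verify them with a tolerance instead of exact equality:

```python
import numpy as np

matrix = np.arange(1, 5).reshape(2, 2)   # [[1 2] [3 4]]
inverse = np.linalg.inv(matrix)
determinant = np.linalg.det(matrix)

# A matrix times its inverse should be (approximately) the identity.
print(np.allclose(matrix.dot(inverse), np.eye(2)))  # True
# Compare floats with a tolerance rather than ==.
print(np.isclose(determinant, -2.0))                # True
```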
Pandas
Reading data with pandas
Use the pandas read_csv() function to read CSV data. The data is loaded into a DataFrame object.
import pandas as pd
food_info = pd.read_csv("food_info.csv")
type(food_info) # pandas.core.frame.DataFrame
food_info.dtypes # data type of each column; these are NumPy dtypes
food_info.columns.tolist() # all column names as a list
food_info.values.tolist() # the data as nested lists (food_info.values itself is an np.ndarray)
Displaying data with pandas
Call head(number of rows) on a DataFrame to show its first rows and tail(number of rows) to show its last rows. Both default to 5 rows.
food_info.head() # first 5 rows
food_info.head(3) # first 3 rows
food_info.tail() # last 5 rows
The shape attribute of a DataFrame (an attribute, not a method) returns its shape as (rows, columns).
food_info.shape # (8618, 36)
Use the loc[row labels] indexer of a DataFrame to select rows by label; with the default index the labels are the row numbers. The argument can be a single int or a list of ints.
If a single int is passed, a Series object is returned.
If a list is passed, a DataFrame object is returned.
type(food_info.loc[[0]]) # pandas.core.frame.DataFrame
type(food_info.loc[0]) # pandas.core.series.Series
food_info.loc[3:5]
food_info.loc[[2,5,10]]
Indexing a DataFrame with square brackets selects columns.
If a single column name is passed, a Series object is returned.
If a list of column names is passed, a DataFrame object is returned.
ndb_col = food_info["NDB_No"]
zinc_copper_col = food_info[["Zinc_(mg)", "Copper_(mg)"]]
type(ndb_col) # pandas.core.series.Series
type(zinc_copper_col) # pandas.core.frame.DataFrame
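The row and column selection rules above can be tried without the CSV file. The sketch below builds a tiny DataFrame by hand; the column names mirror those used above, but the values are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for food_info.csv
foods = pd.DataFrame({
    "NDB_No": [1001, 1002, 1003, 1004],
    "Zinc_(mg)": [0.09, 0.05, 0.01, 2.66],
    "Copper_(mg)": [0.0, 0.016, 0.001, 0.04],
})

one_row = foods.loc[0]        # Series: one row, indexed by column name
row_frame = foods.loc[[0]]    # DataFrame with a single row
label_slice = foods.loc[1:2]  # label slices include both ends: rows 1 and 2
one_col = foods["NDB_No"]     # Series
two_cols = foods[["Zinc_(mg)", "Copper_(mg)"]]  # DataFrame

print(type(one_row).__name__, type(row_frame).__name__)
print(label_slice.shape, two_cols.shape)
```

Unlike Python list slicing, a loc slice such as 1:2 includes the end label.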
Data processing with pandas
Addition, subtraction, multiplication and division between a DataFrame column and a scalar are applied to each element of the column.
div_100 = food_info["Iron_(mg)"] / 100
add_100 = food_info["Iron_(mg)"] + 100
sub_100 = food_info["Iron_(mg)"] - 100
mult_100 = food_info["Iron_(mg)"] * 100
Example: computing a weighted score
# Score = 2 * Protein_(g) - 0.75 * Lipid_Tot_(g)
weighted_protein = food_info["Protein_(g)"] * 2
weighted_fat = -0.75 * food_info["Lipid_Tot_(g)"]
initial_rating = weighted_protein + weighted_fat
Extremes and averages: max(), min() and mean(). pandas automatically skips missing values (NaN) in these calculations.
# max() returns the maximum value
max_calories = food_info["Energ_Kcal"].max()
mean_calories = food_info["Energ_Kcal"].mean()
Example: normalizing data and adding the result as new columns
normalized_protein = food_info["Protein_(g)"] / food_info["Protein_(g)"].max()
normalized_fat = food_info["Lipid_Tot_(g)"] / food_info["Lipid_Tot_(g)"].max()
food_info["Normalized_Protein"] = normalized_protein
food_info["Normalized_Fat"] = normalized_fat
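To see how missing values are skipped, here is a small sketch on an invented column containing one NaN. The numbers are illustrative only.

```python
import numpy as np
import pandas as pd

foods = pd.DataFrame({"Protein_(g)": [10.0, np.nan, 30.0, 20.0]})

print(foods["Protein_(g)"].max())    # 30.0, the NaN is ignored
print(foods["Protein_(g)"].mean())   # 20.0, the mean of the three present values

# Normalizing keeps the NaN in place; it is not replaced by a number.
foods["Normalized_Protein"] = foods["Protein_(g)"] / foods["Protein_(g)"].max()
print(foods["Normalized_Protein"].isna().sum())  # 1
```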
Sorting
The sort_values() method sorts all rows by the values of a column.
The sort_index() method sorts all rows by index.
# First argument: the column to sort by
# inplace: whether to modify the original object directly
# ascending: whether to sort in ascending order (the default)
food_info.sort_values("Sodium_(mg)", inplace=True)
food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)
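For comparison, the sketch below sorts without inplace=True, so each call returns a new DataFrame and the original stays untouched. The food names and sodium values are hypothetical.

```python
import pandas as pd

# Hypothetical values, indexed by food name
foods = pd.DataFrame({"Sodium_(mg)": [714, 11, 827, 395]},
                     index=["butter", "oil", "cheese", "milk"])

ascending = foods.sort_values("Sodium_(mg)")                    # new DataFrame, smallest first
descending = foods.sort_values("Sodium_(mg)", ascending=False)  # largest first
by_label = foods.sort_index()                                   # alphabetical by index label

print(ascending.index.tolist())    # ['oil', 'milk', 'butter', 'cheese']
print(descending.index.tolist())   # ['cheese', 'butter', 'milk', 'oil']
print(by_label.index.tolist())     # ['butter', 'cheese', 'milk', 'oil']
```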
Pivot tables
titanic_surival = pd.read_csv("titanic_train.csv")
The pivot_table() method generates a pivot table. Its main parameters are:
index: the column whose values define the groups
values: the column to aggregate; either a single column name or a list of column names
aggfunc: the aggregation applied to values; defaults to the mean, i.e. np.mean()
# Survival rate for each passenger class
passenger_survival = titanic_surival.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)
# Mean age for each passenger class
passenger_age = titanic_surival.pivot_table(index="Pclass", values="Age")
# Total fare and total number of survivors for each port of embarkation
port_stats = titanic_surival.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc=np.sum)
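The same calls can be tried on a handful of invented passengers. In this sketch the aggregation is given by the string names "mean" and "sum", which pivot_table also accepts; that choice is an assumption of the example and differs from the np.mean and np.sum calls above.

```python
import pandas as pd

# Hypothetical miniature stand-in for titanic_train.csv
passengers = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3],
    "Survived": [1, 0, 1, 1, 0, 0],
    "Fare":     [80.0, 60.0, 20.0, 30.0, 8.0, 7.0],
})

# Mean of Survived per class is the survival rate.
survival_rate = passengers.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
# Several value columns can be aggregated at once.
totals = passengers.pivot_table(index="Pclass", values=["Fare", "Survived"], aggfunc="sum")

print(survival_rate)
print(totals)
```

Here class 1 has a survival rate of 0.5 and a fare total of 140.0, which is easy to confirm by hand from the six rows.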
Removing null values
Use dropna() to delete rows or columns that contain null values. Its main parameters are:
axis: 0 drops rows that contain null values, 1 drops columns that contain null values
subset: only look for null values in the listed columns when dropping rows
# Drop every column that contains a null value (axis=1)
drop_na_columns = titanic_surival.dropna(axis=1)
# Drop every row whose "Age" or "Sex" field is null
new_titanic_survival = titanic_surival.dropna(axis=0, subset=["Age", "Sex"])
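A small sketch of both forms of dropna() on invented data, so the effect on rows and columns can be checked by eye:

```python
import numpy as np
import pandas as pd

# Hypothetical passengers with some missing fields
passengers = pd.DataFrame({
    "Age":   [22.0, np.nan, 35.0, 40.0],
    "Sex":   ["male", "female", None, "male"],
    "Cabin": [None, "C85", None, None],
    "Fare":  [7.25, 71.28, 8.05, 13.0],
})

complete_columns = passengers.dropna(axis=1)                      # keep only columns with no nulls
known_age_sex = passengers.dropna(axis=0, subset=["Age", "Sex"])  # keep rows where Age and Sex are both present

print(complete_columns.columns.tolist())   # ['Fare']
print(known_age_sex.index.tolist())        # [0, 3]
```

The Cabin column still contains nulls in known_age_sex, because subset restricts the check to Age and Sex.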
Resetting the index
Use the reset_index() method to rebuild the index.
new_titanic_survival = titanic_surival.sort_values("Age", ascending=False)
# Re-index the data in its current order; drop=True discards the old index
titanic_reindexed = new_titanic_survival.reset_index(drop=True)
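To make the effect of drop=True visible, here is a minimal sketch on three invented ages:

```python
import pandas as pd

passengers = pd.DataFrame({"Age": [22.0, 38.0, 26.0]})

by_age = passengers.sort_values("Age", ascending=False)
print(by_age.index.tolist())       # [1, 2, 0], the old labels travel with the rows

reindexed = by_age.reset_index(drop=True)
print(reindexed.index.tolist())    # [0, 1, 2]

kept = by_age.reset_index()        # without drop=True the old index becomes a column
print(kept.columns.tolist())       # ['index', 'Age']
```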
The apply(function) method of a DataFrame calls a custom function on each column and collects the results into a Series object.
# Return the 100th item of each column
def hundredth_row(column):
    # Extract the hundredth item
    hundredth_item = column.loc[99]
    return hundredth_item
# Returns the 100th item of every column; equivalent to titanic_surival.loc[99]
hundredth_row = titanic_surival.apply(hundredth_row)
# Count the number of null values in each column
def null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)
# Returns the number of null values in each column
column_null_count = titanic_surival.apply(null_count)
# Encode the Pclass field of each row as a label
def which_class(row):
    pclass = row["Pclass"]
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    elif pclass == 3:
        return "Third Class"
# Returns the label for the Pclass field of each row; axis=1 applies the function to rows
classes = titanic_surival.apply(which_class, axis=1)
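The difference between column-wise and row-wise apply() is easier to see on a tiny table. The sketch below uses invented data and two helper functions, count_nulls and class_label, which are hypothetical names for this example.

```python
import numpy as np
import pandas as pd

passengers = pd.DataFrame({
    "Pclass": [1.0, 3.0, np.nan],
    "Age":    [22.0, np.nan, np.nan],
})

LABELS = {1: "First Class", 2: "Second Class", 3: "Third Class"}

def count_nulls(column):
    # Receives one column (a Series) at a time
    return int(column.isnull().sum())

def class_label(row):
    # Receives one row (a Series indexed by column name) at a time
    pclass = row["Pclass"]
    if pd.isnull(pclass):
        return "Unknown"
    return LABELS[int(pclass)]

per_column = passengers.apply(count_nulls)          # default axis=0: one result per column
per_row = passengers.apply(class_label, axis=1)     # axis=1: one result per row

print(per_column.to_dict())   # {'Pclass': 1, 'Age': 2}
print(per_row.tolist())       # ['First Class', 'Third Class', 'Unknown']
```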
Basic pandas data structures: DataFrame and Series
DataFrame and Series are the two most important structures in pandas. A Series is similar to a one-dimensional vector, and a DataFrame is similar to a two-dimensional matrix.
A Series can be seen as a collection of NumPy values, and a DataFrame can be seen as a collection of Series.
fandango = pd.read_csv('fandango_score_comparison.csv')
# Indexing a DataFrame by column name returns a Series
series_film = fandango['FILM']
type(series_film) # pandas.core.series.Series
The Series() constructor creates a Series object; the index parameter specifies the index.
from pandas import Series
film_names = fandango['FILM'].values # all film names
# Assumes the scores are stored in the 'RottenTomatoes' column
series_rt = fandango['RottenTomatoes']
rt_scores = series_rt.values # all scores
# Build a Series of film scores indexed by film name
series_custom = Series(rt_scores, index=film_names)
# Films can now be looked up by name
series_custom[['Minions (2015)', 'Leviathan (2014)']]
A Series is backed by an np.ndarray, so a Series can be passed as an argument wherever an ndarray is expected.
np.add(series_custom, series_custom)
np.sin(series_custom)
np.max(series_custom)
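The following sketch shows the same ideas without the CSV file. The film titles come from the example above, but the scores are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, indexed by film name
scores = pd.Series([54, 99, 74],
                   index=["Minions (2015)", "Leviathan (2014)", "Ant-Man (2015)"])

print(scores["Minions (2015)"])                           # lookup by label: 54
subset = scores[["Minions (2015)", "Leviathan (2014)"]]   # a list of labels returns a Series

doubled = np.add(scores, scores)   # NumPy functions accept a Series and keep its index
print(type(doubled).__name__, doubled["Leviathan (2014)"])  # Series 198
print(np.max(scores))              # 99
print(type(scores.values).__name__)  # ndarray, the underlying data
```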