Getting started with NumPy and Pandas

  Numpy

  The basic NumPy data structure

  The np.array() function accepts a (possibly nested) list and returns an array with the corresponding number of dimensions

  import numpy as np

  vector = np.array([1, 2, 3, 4])

  matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

  Special arrays:

  np.zeros((size of the first dimension, size of the second dimension, ...)) creates an all-zeros array. The shape must be passed as a tuple giving the size of each dimension.

  np.ones((size of the first dimension, size of the second dimension, ...)) creates an all-ones array. The shape must be passed as a tuple giving the size of each dimension.

  np.arange(start, end, step) creates a sequence from start up to, but not including, end

  np.eye(size) creates an identity matrix of size * size

  np.linspace(start, end, number of values) returns the given number of evenly spaced values from start to end, both included

  np.logspace(start exponent, end exponent, number of values, base=base) returns the given number of values forming a geometric sequence from base ** start exponent to base ** end exponent
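
  The constructors above can be compared side by side. This is a minimal sketch; the shapes and arguments are arbitrary values chosen for illustration.

```python
import numpy as np

zeros = np.zeros((2, 3))              # 2x3 array filled with 0.0
ones = np.ones((2, 3))                # 2x3 array filled with 1.0
seq = np.arange(0, 10, 2)             # values 0, 2, 4, 6, 8 (the end value 10 is excluded)
identity = np.eye(3)                  # 3x3 identity matrix
lin = np.linspace(0, 1, 5)            # values 0, 0.25, 0.5, 0.75, 1 (both endpoints included)
log = np.logspace(0, 3, 4, base=10)   # values 10**0, 10**1, 10**2, 10**3

print(zeros.shape, ones.shape)
print(seq, lin, log)
```

  Note that np.arange excludes its end value while np.linspace includes it, which is a common source of off-by-one surprises.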

  Basic NumPy operations and attributes

  All elements stored in an array must have the same data type; the dtype attribute returns the data type of the elements

  Use the astype(type) method to convert the elements to another data type.

  vector = np.array(["1", "2", "3", "4"]) # ['1' '2' '3' '4']

  vector = vector.astype(float) # [1. 2. 3. 4.]

  The shape attribute of an array returns the size of each dimension.

  Use the reshape((size of the first dimension, size of the second dimension, ...)) method to change the shape of an array. If -1 is passed for one dimension, its size is inferred from the sizes of the other dimensions

  matrix = np.arange(6).reshape(-1, 3) # gives [[0 1 2] [3 4 5]]

  Use the ravel() method to flatten a multi-dimensional array into a one-dimensional vector

  matrix = np.arange(6).reshape(-1, 3)

  matrix = matrix.ravel() # gives [0 1 2 3 4 5]

  NumPy arrays support indexing and slicing, much like Python lists

  matrix = np.array([[5,10,15], [20,25,30], [35,40,45]])

  matrix[:, 1] # gives [10 25 40]

  matrix[:, 0:2] # gives [[5 10] [20 25] [35 40]]

  matrix[1:3, :2] # gives [[20 25] [35 40]]

  reshape() and slicing do not return an independent new array. Slicing returns a view of the original array, and reshape() returns a view whenever the memory layout allows it, so modifications made through the view are applied to the original array
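
  The view behaviour is easy to check directly. This is a small sketch with made-up values; np.shares_memory reports whether two arrays use the same underlying data, and copy() is the usual way to get an independent array.

```python
import numpy as np

original = np.arange(6)              # [0 1 2 3 4 5]
reshaped = original.reshape(2, 3)    # a view onto the same data
sliced = original[1:3]               # also a view

reshaped[0, 0] = 100                 # writes through to original[0]
sliced[0] = 200                      # writes through to original[1]
print(original)                      # [100 200   2   3   4   5]

independent = original[1:3].copy()   # copy() breaks the link
independent[0] = -1
print(original[1])                   # 200 (unchanged)

print(np.shares_memory(original, reshaped))      # True
print(np.shares_memory(original, independent))   # False
```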

  NumPy arrays support comparison operators, which return a bool array of the same shape as the original, holding the result of the comparison for each element

  print(matrix == 10) # gives [[False True False] [False False False] [False False False]]

  print(matrix > 10) # gives [[False False True] [ True True True] [ True True True]]

  As in MATLAB, the result of a comparison can be used to index the array

  matrix[matrix > 10] = 10 # gives [[5 10 10] [10 10 10] [10 10 10]]

  Use min(axis=dimension), max(axis=dimension), and sum(axis=dimension) to return the minimum, the maximum, and the sum of the array along the given dimension

  matrix = np.arange(9).reshape((3, 3))

  matrix.min(axis=1) # array([0, 3, 6])

  matrix.max(axis=1) # array([2, 5, 8])

  matrix.sum(axis=1) # array([3, 12, 21])

  Sorting

  np.sort(matrix, axis=dimension) returns the array sorted along the given dimension. Note that it returns a new array and does not change the original

  np.argsort(matrix, axis=dimension) returns, for each position in the sorted array, the index that element had in the original array.
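
  The relationship between the two functions can be illustrated with a short sketch. The array values are invented for the example; np.take_along_axis is used here only to show that indexing with the argsort result reproduces the sorted array.

```python
import numpy as np

scores = np.array([[3, 1, 2],
                   [9, 7, 8]])

sorted_rows = np.sort(scores, axis=1)    # each row sorted: [[1 2 3] [7 8 9]]
order = np.argsort(scores, axis=1)       # original positions: [[1 2 0] [1 2 0]]

# Picking elements in argsort order gives the same result as np.sort.
rebuilt = np.take_along_axis(scores, order, axis=1)

print(sorted_rows)
print(order)
print(scores)    # the original array is left untouched
```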

  NumPy matrix operations

  Addition and subtraction:

  Adding or subtracting two arrays of the same shape adds or subtracts the elements at corresponding positions

  a1 = np.array([20, 30, 40, 50]) # gives [20 30 40 50]

  a2 = np.arange(4) # gives [0 1 2 3]

  a3 = a1 - a2 # gives [20 29 38 47]

  Adding a scalar to an array, or subtracting one from it, applies the operation to every element

  a1 = np.array([20, 30, 40, 50]) # gives [20 30 40 50]

  a2 = a1 - 1 # gives [19 29 39 49]

  Matrix multiplication:

  The * operator on two arrays of the same shape returns a new array of that shape, holding the products of the elements at corresponding positions

  Use matrix1.dot(matrix2) or np.dot(matrix1, matrix2) to compute the matrix product

  A = np.array([[1,1],[0,1]])

  B = np.array([[2,0],[3,4]])

  # Element-wise multiplication

  print(A * B) # gives [[2 0] [0 4]]

  # Two ways to write the matrix product

  print(A.dot(B)) # gives [[5 4] [3 4]]

  print(np.dot(A, B)) # gives [[5 4] [3 4]]

  Power operation: raises each element of the array to the given power

  A = np.arange(5)

  A = A ** 2 # gives [0 1 4 9 16]

  Tiling arrays:

  Use np.tile(matrix, (repeats along the first dimension, repeats along the second dimension, ...)) to repeat an array the given number of times along each dimension

  matrix = np.arange(2)

  tiled_columns = np.tile(matrix, (1, 3)) # gives [[0 1 0 1 0 1]]

  tiled_rows = np.tile(matrix, (3, 1)) # gives [[0 1] [0 1] [0 1]]

  Transpose, inverse, and determinant of a matrix

  Use the .T attribute to get the transpose

  Use np.linalg.inv(matrix) to compute the inverse

  Use np.linalg.det(matrix) to compute the determinant

  matrix = np.arange(1,5).reshape(2,2)

  # Matrix transpose

  print(matrix.T) # gives [[1 3] [2 4]]

  # Matrix inverse

  print(np.linalg.inv(matrix)) # gives approximately [[-2. 1.] [1.5 -0.5]]

  # Matrix determinant

  print(np.linalg.det(matrix)) # gives -2.0000000000000004, i.e. -2 up to floating-point error
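
  Because these results are floating-point numbers, it is safer to check them with a tolerance than with ==. The sketch below reuses the 2x2 matrix from above and verifies that the matrix times its inverse is numerically the identity.

```python
import numpy as np

matrix = np.arange(1, 5).reshape(2, 2)    # [[1 2] [3 4]]
inverse = np.linalg.inv(matrix)

# A matrix multiplied by its inverse should be (numerically) the identity.
product = matrix.dot(inverse)
print(np.allclose(product, np.eye(2)))    # True

# The determinant is -2 only up to rounding, so compare with a tolerance.
print(np.isclose(np.linalg.det(matrix), -2.0))    # True
```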

  Pandas

  Reading data with pandas

  Use the pandas read_csv() function to read CSV data; the data is loaded into a DataFrame object.

  import pandas as pd

  food_info = pd.read_csv("food_info.csv")

  type(food_info) # pandas.core.frame.DataFrame

  food_info.dtypes # data type of each column; the contained values are treated as NumPy objects

  food_info.columns.tolist() # get all the column names

  food_info.values.tolist() # .values returns the data as an np.ndarray; tolist() converts it to a nested list

  Displaying data with pandas

  Call the head(number of rows) method of a DataFrame to display the first rows, and the tail(number of rows) method to display the last rows.

  food_info.head() # first 5 rows

  food_info.head(3) # first 3 rows

  food_info.tail() # last 5 rows

  The shape attribute of a DataFrame returns its shape as (rows, columns)

  food_info.shape # (8618, 36)

  Use the loc[rows] indexer of a DataFrame to select rows by index label; rows may be a single int, a list of ints, or a slice

  If rows is a single int, a Series object is returned

  If rows is a list, a DataFrame object is returned

  type(food_info.loc[[0]]) # pandas.core.frame.DataFrame

  type(food_info.loc[0]) # pandas.core.series.Series

  food_info.loc[3:5]

  food_info.loc[[2,5,10]]
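
  The different forms of loc can be tried on a small table. The DataFrame below is a hypothetical stand-in for food_info with made-up column names and values. One detail worth noticing is that a loc slice includes its end label, unlike a Python list slice.

```python
import pandas as pd

# Hypothetical stand-in for food_info, with the default 0..n-1 index
foods = pd.DataFrame({
    "name": ["butter", "cheese", "milk", "egg", "rice", "tofu"],
    "kcal": [717, 353, 61, 143, 130, 76],
})

one_row = foods.loc[0]            # Series: a single row
one_row_frame = foods.loc[[0]]    # DataFrame with one row
span = foods.loc[3:5]             # rows labelled 3, 4 and 5 (end label included)
picked = foods.loc[[1, 4]]        # rows labelled 1 and 4

print(type(one_row).__name__, type(one_row_frame).__name__)   # Series DataFrame
print(len(span), picked["name"].tolist())                     # 3 ['cheese', 'rice']
```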

  Indexing a DataFrame with square brackets returns columns

  If a single column name is passed, a Series object is returned

  If a list of column names is passed, a DataFrame object is returned

  ndb_col = food_info["NDB_No"]

  zinc_copper_col = food_info[["Zinc_(mg)", "Copper_(mg)"]]

  type(ndb_col) # pandas.core.series.Series

  type(zinc_copper_col) # pandas.core.frame.DataFrame

  Pandas data processing

  Addition, subtraction, multiplication, and division on a DataFrame column are applied to each of its elements

  div_100 = food_info["Iron_(mg)"] / 100

  add_100 = food_info["Iron_(mg)"] + 100

  sub_100 = food_info["Iron_(mg)"] - 100

  mult_100 = food_info["Iron_(mg)"] * 100

  Example: calculating a weighted score

  # Score = 2*(Protein_(g)) - 0.75*(Lipid_Tot_(g))

  weighted_protein = food_info["Protein_(g)"] * 2

  weighted_fat = -0.75 * food_info["Lipid_Tot_(g)"]

  initial_rating = weighted_protein + weighted_fat

  Maximum, minimum, and mean: max(), min(), mean(). Pandas automatically skips missing values when computing them

  # Use max() to get the maximum value

  max_calories = food_info["Energ_Kcal"].max()

  mean_calories = food_info["Energ_Kcal"].mean()
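
  The handling of missing values can be seen on a tiny column. The Series below is made up for illustration; skipna is the parameter that controls this behaviour and defaults to True.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
kcal = pd.Series([100.0, np.nan, 300.0])

print(kcal.max())                 # 300.0
print(kcal.mean())                # 200.0, the NaN is skipped: (100 + 300) / 2
print(kcal.mean(skipna=False))    # nan, the missing value propagates
```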

  Example: adding normalized data as new columns

  normalized_protein = food_info["Protein_(g)"] / food_info["Protein_(g)"].max()

  normalized_fat = food_info["Lipid_Tot_(g)"] / food_info["Lipid_Tot_(g)"].max()

  food_info["Normalized_Protein"] = normalized_protein

  food_info["Normalized_Fat"] = normalized_fat

  Sorting

  The sort_values() method sorts all rows by value

  The sort_index() method sorts all rows by index

  # First argument: the column to sort by

  # inplace: whether to modify the original object directly

  # ascending: whether to sort in ascending order

  food_info.sort_values("Sodium_(mg)", inplace=True)

  food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)

  Pivot tables

  titanic_surival = pd.read_csv("titanic_train.csv")

  Use the pivot_table() method to generate a pivot table. Its main parameters are:

  index: the column whose values are used to group the rows

  values: the column or columns to aggregate; either a single column name or a list of column names

  aggfunc: the aggregation function applied to the values columns; defaults to the mean, i.e. np.mean()

  # Survival rate for each passenger class

  passenger_survival = titanic_surival.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)

  # Mean age for each passenger class

  passenger_age = titanic_surival.pivot_table(index="Pclass", values="Age")

  # Total fare and total number of survivors for each port of embarkation

  port_stats = titanic_surival.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc=np.sum)
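
  To see what the resulting tables contain, the same calls can be run on a miniature table. The data below is invented and only mimics the Titanic column names. The aggregations are given as the strings "mean" and "sum", which pandas also accepts; newer pandas releases may emit a warning when NumPy functions are passed instead.

```python
import pandas as pd

# Hypothetical miniature of the Titanic table
passengers = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3],
    "Survived": [1, 0, 1, 1, 0, 0],
    "Fare":     [80.0, 60.0, 20.0, 30.0, 8.0, 7.0],
})

# Mean of Survived per class, i.e. the survival rate
survival_rate = passengers.pivot_table(index="Pclass", values="Survived", aggfunc="mean")

# Sum of two columns per class
totals = passengers.pivot_table(index="Pclass", values=["Fare", "Survived"], aggfunc="sum")

print(survival_rate)
print(totals)
```

  The result is itself a DataFrame indexed by the grouping column, so a single cell can be read with, for example, survival_rate.loc[1, "Survived"].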

  Removing missing values

  Use dropna() to delete rows or columns that contain null values. Its main parameters are:

  axis: the dimension to drop along; 0 drops rows that contain null values, 1 drops columns that contain null values

  subset: when dropping rows, only consider null values in the listed columns

  # Drop all columns that contain null values (axis=1)

  drop_na_columns = titanic_surival.dropna(axis=1)

  # Drop all rows that have a null value in the "Age" or "Sex" column

  new_titanic_survival = titanic_surival.dropna(axis=0, subset=["Age", "Sex"])
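
  The effect of axis and subset is easiest to see on a table small enough to read at a glance. The names and values below are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical table with gaps
people = pd.DataFrame({
    "Name":  ["Ann", "Bob", "Cid"],
    "Age":   [30.0, np.nan, 25.0],
    "Cabin": [np.nan, "C85", np.nan],
})

no_null_columns = people.dropna(axis=1)              # keeps only "Name"
known_age = people.dropna(axis=0, subset=["Age"])    # drops only Bob's row

print(no_null_columns.columns.tolist())    # ['Name']
print(known_age["Name"].tolist())          # ['Ann', 'Cid']
print(len(people))                         # 3, the original is not modified
```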

  Resetting the index

  Use the reset_index() method to renumber the index

  new_titanic_survival = titanic_surival.sort_values("Age", ascending=False)

  # Renumber the data set in its current order; drop=True discards the original index

  titanic_reindexed = new_titanic_survival.reset_index(drop=True)
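
  Sorting keeps each row's original index label, which is why the index needs resetting afterwards. The following sketch, on invented data, shows the labels before and after reset_index, and how sort_index restores the original order.

```python
import pandas as pd

# Hypothetical data
ages = pd.DataFrame({"Name": ["Ann", "Bob", "Cid"], "Age": [30, 41, 25]})

by_age = ages.sort_values("Age", ascending=False)
print(by_age.index.tolist())         # [1, 0, 2], the old labels travel with the rows

renumbered = by_age.reset_index(drop=True)
print(renumbered.index.tolist())     # [0, 1, 2]
print(renumbered.loc[0, "Name"])     # Bob

restored = by_age.sort_index()       # back to the original row order
print(restored["Name"].tolist())     # ['Ann', 'Bob', 'Cid']
```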

  The apply(function) method of a DataFrame runs a custom function on each column (or on each row when axis=1) and collects the results into a Series object.

  # Return the 100th item of each column

  def hundredth_row(column):

      # Extract the hundredth item

      hundredth_item = column.loc[99]

      return hundredth_item

  # Returns the 100th item of every column; equivalent to titanic_surival.loc[99]

  hundredth_row = titanic_surival.apply(hundredth_row)

  # Count the number of null values in each column

  def null_count(column):

      column_null = pd.isnull(column)

      null = column[column_null]

      return len(null)

  # Returns the number of null values in each column

  column_null_count = titanic_surival.apply(null_count)

  # Encode the Pclass field of each row

  def which_class(row):

      pclass = row["Pclass"]

      if pd.isnull(pclass):

          return "Unknown"

      elif pclass == 1:

          return "First Class"

      elif pclass == 2:

          return "Second Class"

      elif pclass == 3:

          return "Third Class"

  # Returns the encoded Pclass field of each row

  classes = titanic_surival.apply(which_class, axis=1)
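
  The difference between column-wise and row-wise apply can be sketched on a table with made-up values. The helper fare_band and its threshold of 50 are hypothetical, chosen only for the example.

```python
import pandas as pd

# Hypothetical data
table = pd.DataFrame({
    "Pclass": [1, 3, None],
    "Fare":   [80.0, 8.0, 20.0],
})

# axis=0 (the default): the function receives one column at a time
null_counts = table.apply(lambda column: column.isnull().sum())

# axis=1: the function receives one row at a time
def fare_band(row):
    return "high" if row["Fare"] >= 50 else "low"

bands = table.apply(fare_band, axis=1)

print(null_counts.tolist())    # [1, 0]
print(bands.tolist())          # ['high', 'low', 'low']
```

  With the default axis the result has one entry per column; with axis=1 it has one entry per row.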

  The basic Pandas data structures: DataFrame and Series

  DataFrame and Series are the two most important structures in Pandas. A Series is similar to a one-dimensional vector, and a DataFrame is similar to a two-dimensional matrix.

  A Series can be seen as a collection of NumPy objects, and a DataFrame can be seen as a collection of Series

  fandango = pd.read_csv('fandango_score_comparison.csv')

  # Indexing a DataFrame by column name returns a Series

  series_film = fandango['FILM']

  type(series_film) # pandas.core.series.Series

  The Series() constructor can be used to create a Series object; the index parameter specifies the index

  from pandas import Series

  film_names = fandango['FILM'].values # get all the movie names

  rt_scores = series_rt.values # get all the scores (series_rt is assumed to be a score column taken from fandango in the same way as series_film)

  # Build a Series of movie scores indexed by movie name

  series_custom = Series(rt_scores, index=film_names)

  # Movies can now be looked up by name

  series_custom[['Minions (2015)', 'Leviathan (2014)']]

  Under the hood a Series is backed by an np.ndarray, so a Series can be passed to functions that expect an ndarray

  np.add(series_custom, series_custom)

  np.sin(series_custom)

  np.max(series_custom)
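
  The same ideas can be tried without the CSV file. The film titles and scores below are placeholders, not values from the fandango data set. The sketch builds a Series with a custom index, looks entries up by label, and passes the Series to NumPy functions.

```python
import numpy as np
import pandas as pd

# Hypothetical film titles and scores
scores = pd.Series([74, 85, 80], index=["Film A", "Film B", "Film C"])

print(scores["Film B"])                         # 85
print(scores[["Film A", "Film C"]].tolist())    # [74, 80]

doubled = np.add(scores, scores)                # still a Series, index preserved
print(type(doubled).__name__, doubled["Film A"])    # Series 148
print(np.max(scores))                           # 85
```

  Element-wise NumPy functions such as np.add return a Series that keeps the original index, so label-based lookups continue to work on the result.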

Origin www.cnblogs.com/djw12333/p/11627591.html