12 Great Pandas and NumPy Functions: Do More Data Analysis with Less Python

We all know that Pandas and NumPy are great libraries that play an important role in daily data analysis. Without these two, we would be lost in the vast world of data analysis and scientific computing.

Today I will share 12 great Pandas and NumPy functions that will make your life more convenient and your analysis more effective.

At the end of this article, readers can find the Jupyter Notebook containing the code mentioned in the text.

Let's start with NumPy:

NumPy is the fundamental package for scientific computing with Python. It contains:

  • A powerful N-dimensional array object
  • Sophisticated (broadcasting) functions
  • Tools for integrating C/C++ and Fortran code
  • Useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also serve as an efficient multi-dimensional container of generic data, and arbitrary data types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
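As a quick illustration of the broadcasting mentioned above, here is a minimal sketch (the array values are my own, not from the article): a length-3 vector is automatically "stretched" to match a 3x3 matrix during addition.

```python
import numpy as np

# A 3x3 matrix and a length-3 vector
matrix = np.arange(9).reshape(3, 3)   # rows: [0 1 2], [3 4 5], [6 7 8]
row = np.array([10, 20, 30])

# Broadcasting applies `row` to every row of `matrix`, no loop needed
result = matrix + row
print(result)
# [[10 21 32]
#  [13 24 35]
#  [16 27 38]]
```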

1. allclose()

allclose() checks whether two arrays are element-wise equal within a tolerance and returns a Boolean. If any pair of entries differs by more than the tolerance, it returns False. This is a good way to check whether two arrays are approximately the same, which is actually quite tedious to do manually.

import numpy as np

array1 = np.array([0.12, 0.17, 0.24, 0.29])
array2 = np.array([0.13, 0.19, 0.26, 0.31])

# With a tolerance of 0.1, it should return False:
np.allclose(array1, array2, 0.1)
# False

# With a tolerance of 0.2, it should return True:
np.allclose(array1, array2, 0.2)
# True

2. argpartition()

This NumPy function is very useful: it finds the indices of the N largest elements. It outputs those indices, and the corresponding values can then be sorted if needed.

x = np.array([12, 10, 12, 0, 6, 8, 9, 1, 16, 4, 6, 0])

index_val = np.argpartition(x, -4)[-4:]
index_val
# array([1, 8, 2, 0], dtype=int64)

np.sort(x[index_val])
# array([10, 12, 12, 16])

3. clip()

clip() is used to keep the values of an array within an interval. Sometimes we need to keep values between an upper and a lower bound, and NumPy's clip() does exactly that: given an interval, values outside it are clipped to the interval edges.

x = np.array([3, 17, 14, 23, 2, 2, 6, 8, 1, 2, 16, 0])

np.clip(x, 2, 5)
# array([3, 5, 5, 5, 2, 2, 5, 5, 2, 2, 5, 2])

4. extract()

As the name suggests, extract() extracts specific elements from an array according to a condition. The condition can also combine clauses with "and" and "or" operators.

# Random integers
array = np.random.randint(20, size=12)
array
# array([ 0,  1,  8, 19, 16, 18, 10, 11,  2, 13, 14,  3])

# Divide by 2 and check if the remainder is 1
cond = np.mod(array, 2) == 1
cond
# array([False,  True, False,  True, False, False, False,  True, False,
#         True, False,  True])

# Use extract to get the values
np.extract(cond, array)
# array([ 1, 19, 11, 13,  3])

# Apply a condition on extract directly
np.extract((array < 3) | (array > 15), array)
# array([ 0,  1, 19, 16, 18,  2])

5. percentile()

percentile() computes the n-th percentile of the array elements along a specified axis.

a = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])

print("50th Percentile of a, axis = 0 : ", np.percentile(a, 50, axis=0))
# 50th Percentile of a, axis = 0 :  6.0

b = np.array([[10, 7, 4], [3, 2, 1]])

print("30th Percentile of b, axis = 0 : ", np.percentile(b, 30, axis=0))
# 30th Percentile of b, axis = 0 :  [5.1 3.5 1.9]

6. where()

where() returns the elements of an array that satisfy a given condition; it returns the index positions of the values that meet it. It is almost analogous to the WHERE clause in SQL. Consider the following example.

y = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])

# Where y is greater than 5, return the index positions
np.where(y > 5)
# (array([2, 3, 5, 7, 8], dtype=int64),)

# The first value replaces the elements that match the condition,
# the second replaces those that do not
np.where(y > 5, "Hit", "Miss")
# array(['Miss', 'Miss', 'Hit', 'Hit', 'Miss', 'Hit', 'Miss', 'Hit', 'Hit'],
#       dtype='<U4')

Now, let's talk about the magic of Pandas functions.

Pandas

Pandas is a Python package providing fast, flexible, and expressive data structures, designed to make working with structured (tabular, multidimensional, potentially heterogeneous) data and time-series data both easy and intuitive.

Pandas is well suited for many different kinds of data:

  • Tabular data with heterogeneously typed columns, as in a SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time-series data
  • Arbitrary matrix data (homogeneously or heterogeneously typed) with row and column labels
  • Any other form of observational / statistical data set. In fact, the data need not be labeled at all to be placed into a Pandas data structure.

Pandas has the following advantages:

  • Easy handling of missing data (represented as NaN) in both floating-point and non-floating-point data
  • Size mutability: columns can be inserted into and deleted from DataFrames and higher-dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can ignore the labels and let Series, DataFrame, etc. automatically align the data in computations
  • Powerful, flexible group-by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
  • Easy conversion of ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining of data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes (possible to have multiple labels per tick)
  • Robust IO tools for loading data from flat files (CSV and delimited files), Excel files, and databases, and for saving/loading data in the ultrafast HDF5 format
  • Time series-specific functionality: date-range generation and frequency conversion, moving-window statistics, date shifting and lagging
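The automatic data alignment mentioned in the list above is easy to see in a small sketch (the labels and values here are my own illustration, not from the article): when two Series are added, Pandas matches entries by label, not by position, and fills unmatched labels with NaN.

```python
import pandas as pd

# Two Series with partially overlapping labels
s1 = pd.Series({'a': 1, 'b': 2, 'c': 3})
s2 = pd.Series({'b': 10, 'c': 20, 'd': 30})

# Pandas aligns on labels before adding; unmatched labels become NaN
total = s1 + s2
print(total)
# a     NaN
# b    12.0
# c    23.0
# d     NaN
```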

1. apply()

The apply() function lets the user pass a function and apply it to every value of a Pandas Series, or along an axis of a DataFrame.

import pandas as pd

# Create a sample dataframe
dframe = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                      index=['India', 'USA', 'China', 'Russia'])

# max minus min lambda fn
fn = lambda x: x.max() - x.min()

# Apply this to each column of the dframe we just created above
dframe.apply(fn)

2. copy()

The copy() function is used to create a copy of a Pandas object. When a data frame is simply assigned to another variable, changes made through either name affect the same underlying data. To avoid this problem, use the copy() function.

# Creating a sample series
data = pd.Series(['India', 'Pakistan', 'China', 'Mongolia'])

# Assignment: the issue we face
data1 = data
# Change a value
data1[0] = 'USA'
# This also changes the value in the old series
data

# To prevent that, create a copy of the series
new = data.copy()
# Assigning new values
new[1] = 'Changed value'
# Printing data
print(new)
print(data)

3. read_csv(nrows=n)


Readers probably already know the importance of the read_csv function. Yet even when it is unnecessary, most people still mistakenly read an entire .csv file. Suppose you know nothing about the columns and data of a 10GB .csv file: reading the whole file is not a wise decision, because it wastes both memory and time. Instead, you can import just a few lines from the .csv and then continue as required.

import io
import requests

# I am using this online data set just to make things easier for you guys
url = "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/AirPassengers.csv"
s = requests.get(url).content

# Read only the first 10 rows
df = pd.read_csv(io.StringIO(s.decode('utf-8')), nrows=10, index_col=0)

4. map()

The map() function maps the values of a Series according to an input correspondence. It substitutes each value in the Series with another value, which may be derived from a function, a dict, or another Series.

# Create a sample dataframe
dframe = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                      index=['India', 'USA', 'China', 'Russia'])

# Compute a formatted string from each floating-point value in the frame
changefn = lambda x: '%.2f' % x

# Make changes element-wise
dframe['d'].map(changefn)

5. isin()

The isin() function is used to filter a data frame. It helps select rows whose value in a particular column is among a given set of values. It is one of the most useful functions I've ever seen.

# Using the dataframe we created for read_csv
filter1 = df["value"].isin([112])
filter2 = df["time"].isin([1949.000000])

df[filter1 & filter2]

6. select_dtypes()

The select_dtypes() function returns a subset of the DataFrame's columns based on the column dtypes. Its parameters can be set to include all columns having certain data types, or to exclude all columns having certain data types.

# We'll use the same dataframe that we used for read_csv
framex = df.select_dtypes(include="float64")
# Returns only the time column
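To illustrate the exclude side mentioned above, here is a small self-contained sketch (the sample DataFrame `df2` is my own, not the AirPassengers data from the previous example):

```python
import pandas as pd

# A small frame with one object, one float, and one int column
df2 = pd.DataFrame({'name': ['a', 'b'],
                    'value': [1.5, 2.5],
                    'count': [1, 2]})

# Keep only float columns
floats = df2.select_dtypes(include="float64")
print(list(floats.columns))   # ['value']

# Exclude object (string) columns instead, keeping all numeric ones
numeric = df2.select_dtypes(exclude="object")
print(list(numeric.columns))  # ['value', 'count']
```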

Bonus:

pivot_table()

Pandas' most amazing and useful feature is pivot_table. If you hesitate over whether to use groupby and would like to extend its functionality, try pivot_table. If you understand how pivot tables work in Excel, everything is very simple. The levels in the pivot table are stored as MultiIndex objects (hierarchical indexes) on the index and columns of the resulting DataFrame.

# Create a sample dataframe
school = pd.DataFrame({'A': ['Jay', 'Usher', 'Nicky', 'Romero', 'Will'],
                       'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
                       'C': [26, 22, 20, 23, 24]})

# Let's create a pivot table to segregate students based on age and course
table = pd.pivot_table(school, values='A', index=['B', 'C'],
                       columns=['B'], aggfunc=np.sum,
                       fill_value="Not Available")

table

Origin www.cnblogs.com/7758520lzy/p/12627396.html