[Reading notes] pandas

Getting started

pandas package overview

https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html

 

pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data easy and intuitive. It aims to be the fundamental high-level building block for practical, real-world data analysis in Python. It also has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language, and it is already well on its way toward that goal.

pandas is well suited to many different kinds of data:

- Tabular data with columns of different types, as in an SQL table or an Excel spreadsheet

- Ordered and unordered (not necessarily fixed-frequency) time series data

- Arbitrary matrix data (homogeneously or heterogeneously typed) with row and column labels

- Any other form of observational / statistical data set. The data does not actually need to be labeled at all to be placed into a pandas data structure

The two main data structures of pandas, Series (one-dimensional) and DataFrame (two-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R's data.frame provides and more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other third-party libraries.

Here are a few of the things pandas does well:

- Easy handling of missing data (represented as NaN) in floating-point as well as non-floating-point data

- Size mutability: columns can be inserted into and deleted from DataFrame and higher-dimensional objects

- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. align the data automatically in computations

- Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data

- Easy conversion of ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects

- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets

- Intuitive merging and joining of data sets

- Flexible reshaping and pivoting of data sets

- Hierarchical labeling of axes (it is possible to have multiple labels per tick)

- Robust IO tools for loading data from flat files (CSV and delimited), Excel files, and databases, and for saving / loading data in the ultrafast HDF5 format

- Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regression, date shifting and lagging
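A few of the items above (label alignment, missing data, group by) can be sketched as follows. This is only an illustration: the data and the names a, b, total, filled and per_team are made up for these notes and do not come from the pandas documentation.

```python
import numpy as np
import pandas as pd

# Two Series with partially overlapping labels (hypothetical data).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Automatic alignment: values are matched by label, not by position.
# "x" has no partner in b, so the result for "x" is NaN (missing).
total = a + b

# Missing data handling: sum() skips NaN by default, fillna() replaces it.
total_sum = total.sum()
filled = total.fillna(0.0)

# Split-apply-combine: split rows by team, sum each group, combine results.
df = pd.DataFrame({"team": ["red", "blue", "red", "blue"],
                   "score": [1, 2, 3, 4]})
per_team = df.groupby("team")["score"].sum()

print(total)
print(per_team)
```

Adding a and b never required reordering or trimming either Series by hand; the labels did that work.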

Many of these principles are here to address shortcomings frequently encountered when using other languages / scientific research environments. For data scientists, working with data is typically divided into several stages: munging and cleaning the data, analyzing / modeling it, and then organizing the results of the analysis into a form suitable for plotting or tabular display. pandas is the ideal tool for all of these tasks.

Some other notes:

- pandas is fast. Many of the low-level algorithms have been extensively tuned in Cython code. However, as with anything else, generalization usually sacrifices performance. So if you focus on one feature for your application, you may be able to create a faster specialized tool.

- pandas is a dependency of statsmodels, which makes it an important part of the statistical computing ecosystem in Python.

- pandas has been used extensively in production in financial applications.

Data structures

Dimensions  Name       Description
1           Series     One-dimensional labeled array of a single type
2           DataFrame  General two-dimensional labeled, size-mutable tabular structure with columns of potentially different types
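As a quick check of the two descriptions, here is a small sketch that builds one object of each kind. The values and column names are made up for illustration.

```python
import pandas as pd

# Series: one dimension, one dtype, one label per value.
s = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])

# DataFrame: two dimensions, each column may have its own dtype.
df = pd.DataFrame({
    "name": ["ann", "bob"],    # text column
    "age": [31, 45],           # integer column
    "height": [1.7, 1.8],      # floating-point column
})

print(s.ndim, s.dtype)     # number of dimensions and the single dtype
print(df.ndim)             # number of dimensions
print(df.dtypes)           # one dtype per column
```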

Why should there be more than one kind of data structure?

The best way to think about the pandas data structures is as flexible containers for lower-dimensional data. For example, DataFrame is a container for Series, and Series is a container for scalars. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.
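The dictionary-like behavior can be sketched as below. The frame and its column names are hypothetical; the operations used (item assignment, pop, del, in) are regular DataFrame features.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Insert a column the same way a key is added to a dict.
df["c"] = df["a"] + df["b"]

# Remove a column and get it back as a Series, like dict.pop.
col_b = df.pop("b")

# Delete a column, like del on a dict key.
del df["a"]

# Membership tests look at the column labels, like dict keys.
has_c = "c" in df

# A column is a Series, and a Series in turn holds scalars by label.
first_value = df["c"].loc[0]

print(list(df.columns), has_c, first_value)
```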

We would also like sensible default behaviors for the common API functions, ones that take into account the typical orientation of time series and cross-sectional data sets. When using ndarrays to store two- and three-dimensional data, the user has to consider the orientation of the data set when writing functions; axes are considered more or less equivalent (except when C- or Fortran-contiguousness matters for performance). In pandas, the axes are intended to lend more semantic meaning to the data; that is, for a particular data set there is likely to be a "right" way to orient the data. The goal, then, is to reduce the mental effort required to code up data transformations in downstream functions.

For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows) and the columns rather than axis 0 and axis 1. Iterating through the columns of the DataFrame therefore results in more readable code:

for col in df.columns:
    series = df[col]
    # do something with series

- Meaning: if the data were stored only as a multi-dimensional array, we would have to think about which dimension to traverse. A DataFrame is instead a set of one-dimensional Series laid side by side as columns, so iterating over it naturally goes column by column.
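A runnable version of the column loop, as a sketch. The frame, its column names and the column_means dictionary are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 12.0, 11.0],
                   "volume": [100, 150, 120]})

# Walk the frame column by column; each column comes back as a Series.
column_means = {}
for col in df.columns:
    series = df[col]
    column_means[col] = series.mean()

# The same reduction with the axis named instead of numbered:
# axis="index" collapses the rows and leaves one value per column.
means_by_axis = df.mean(axis="index")

print(column_means)
print(means_by_axis)
```

Writing axis="index" or axis="columns" rather than 0 or 1 keeps the direction of the operation readable.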

Mutability and copying of data

All pandas data structures are value-mutable (the values they contain can be changed), but not always size-mutable. The length of a Series cannot be changed, but, for example, columns can be inserted into a DataFrame. However, the vast majority of methods produce new objects and leave the input data untouched. In general, we like to favor immutability where sensible.

- Meaning: every value can be changed, but the length of a Series cannot (I am not quite sure what this means). Most methods that modify data do not change the original object; to see the updated data, assign the result to a new variable or back to the original one.
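A small sketch of the behavior described here, using made-up data and names (original, result, mutable). It shows a method returning a new object, a value changed in place, and a column inserted into a DataFrame.

```python
import pandas as pd

original = pd.Series([3, 1, 2])

# Most methods return a new object; the input keeps its order.
result = original.sort_values()

# Values themselves can be changed in place.
mutable = pd.Series([3, 1, 2])
mutable.iloc[0] = 30

# A DataFrame can grow: assigning to a new label inserts a column.
df = pd.DataFrame({"a": [1, 2]})
df["b"] = [3, 4]

print(list(original), list(result), list(mutable))
print(list(df.columns))
```

To keep the outcome of a method such as sort_values, assign the return value to a name; calling the method alone leaves the original as it was.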

Getting Help

The first stop for pandas issues and ideas is the GitHub Issue Tracker. If you have a general question, pandas community experts can answer it on Stack Overflow.

10 minutes to pandas

This is a short introduction to pandas, geared mainly toward new users. More complex recipes can be found in the Cookbook.

By convention, we import as follows:

In [1]: import numpy as np

In [2]: import pandas as pd

Object creation

See the data structure introduction section:

 https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dsintro

Create a Series by passing a list of values, letting pandas create a default integer index:
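The snippet for this step did not survive in these notes. A minimal sketch of what the step describes, with illustrative values:

```python
import numpy as np
import pandas as pd

# No index is given, so pandas assigns the default integer index 0..5.
# The NaN entry makes the whole Series floating-point.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

print(s)
```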

 

Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
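The snippet for this step is also missing here. A minimal sketch under assumed inputs: the start date, the shape of the array, the column labels A to D and the random seed are all choices made for this illustration.

```python
import numpy as np
import pandas as pd

# Six consecutive days to use as the row index.
dates = pd.date_range("20130101", periods=6)

# A 6x4 array of random values; the seed only makes the run repeatable.
rng = np.random.default_rng(0)
values = rng.standard_normal((6, 4))

# Rows are labeled by date, columns by the letters A, B, C, D.
df = pd.DataFrame(values, index=dates, columns=list("ABCD"))

print(df)
```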

 


Origin www.cnblogs.com/everda/p/11594522.html