Pandas Data Analysis at a Glance: a guide to learning data analysis quickly (book giveaway at the end of the article)


Preface

After three years as a data analyst at a major company, I can say there are tools you simply must master, especially the three musketeers of data analysis in Python: Pandas, NumPy and Matplotlib. From personal experience, Pandas is essential. It provides easy-to-use data structures and data manipulation tools that make processing structured data in Python easier and more efficient. Whether you are processing common time series or financial data, connecting to various databases, or running algorithms for calculation and analysis, you cannot do without Pandas for the data handling. As a data analyst I deal with Pandas almost every day, so learning it is unavoidable; how to learn it efficiently, however, is a question worth thinking through. In this article I draw on my years of experience with Pandas to share learning tips and a route for students who want to get started with Pandas data analysis.

If you feel uneasy about this recommendation or suspect false advertising, you can read the data analysis column I wrote specifically for this topic:

One-Article Quick Learning series: https://blog.csdn.net/master_hunter/category_11740969.html . The column has been running for three years. "Consolidate the foundations and master the common Pandas functions; use practical projects as the thread to deeply understand the principles of data analysis and the basic methods of processing data with Pandas, until you can confidently use Pandas to handle common business needs and the data preparation for modeling." That was my original intention in creating the column, so I hope everyone can read it with confidence rather than worrying about false publicity.

Book giveaway rules

  1. Follow my blog: followers are the first to receive new posts and event information.
  2. Leave a comment: comment below the article in each issue; see each issue's instructions for the required content.
  3. Announcement of results: a few lucky readers will be chosen in the comments to receive free books, and the list of winners will be pinned to the top of the comment section at 2023/9/17 12:00:00.

The winner selection algorithm is completely transparent.

Each event ends after the winners are announced.

How to participate

Participating in the giveaway is very simple. Just complete the following two steps:

  1. Follow the blogger.
  2. Comment "Learn quickly in one article - Pandas data analysis" below this article.

This issue's giveaway book, "Pandas Data Analysis", is introduced at the end of the article.


1. Pandas learning content

I made a mind map to make the content easier to remember. The modules Pandas offers for data analysis are actually not that complicated; many seemingly complex requirements can be solved with a single function. The hard part is conceiving how to combine those functions to get the result efficiently: requirements such as splitting one column into several, or aggregating many files in batch, demand a certain level of Pandas proficiency. Generally speaking, once you have mastered the functions above, you can handle most business needs; you only need to deepen your proficiency. However, Pandas is rarely used alone: in many scenarios it works together with third-party libraries such as NumPy, Matplotlib, or scikit-learn, so you also need the flexibility to master those.

2. Pandas learning route

Readers who want to learn Pandas have usually already been exposed to general data analysis tools such as Excel or Power BI, and should understand the basics of data analysis. Many people learn Pandas to work more efficiently; many learn it for data processing. In short, you should have some learning ability before starting. I was the same, so the learning route I recommend is consistent with the order of my column, which you can use as a reference:

1. Learn the Pandas data structures first

As the foundation of the tool, the data structures must be understood very clearly. Master Series and DataFrame first. Neither structure is hard to grasp; if you understand structs in C, or basic data structures in Python or Java, you will understand them. The key is how to use functions to process these structures, so focus on the major operations of each one (a short sketch follows the list below). Once the foundation is laid you can move on to complex table operations, because most work still outputs table-structured data.

1. Series

2. DataFrame

  • Creation
    • pd.Series()
    • pd.DataFrame()
  • Conversion operations
    • dictionary conversion, array conversion
  • Index operations
    • rename index, reset index
  • Query operations
    • .at[], .iloc[], .loc[] (indexers, used with square brackets)
  • Slicing operations
  • Splicing operations
    • pd.concat(), pd.merge()
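To make the list concrete, here is a minimal sketch of these basics (the column names and values are made up for illustration):

import pandas as pd

# Create
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
df = pd.DataFrame({'name': ['Tom', 'Amy'], 'old': [18, 21]})

# Conversion: DataFrame <-> dict / array
d = df.to_dict()
arr = df.to_numpy()

# Index operations: rename a column, reset the index
df = df.rename(columns={'old': 'age'}).reset_index(drop=True)

# Query: .loc by label, .iloc by position, .at for a single cell
row = df.loc[0]
cell = df.iloc[0, 1]
cell2 = df.at[0, 'age']

# Slicing and splicing
head = df[:1]
stacked = pd.concat([df, df])         # stack two tables vertically
merged = pd.merge(df, df, on='name')  # SQL-style join on a key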

2. Common I/O operations

One advantage of using Pandas for data analysis is that it is not as rigid as Excel, which only reads and writes xlsx or csv files. You can even use JSON documents or SQL queries as read/write carriers and convert them into table structures, so the data format is no longer a constraint and data can move in and out freely. Therefore the second thing to learn is the common I/O operations. At this point many people will ask: isn't Pandas I/O as simple as pd.read_csv() or pd.read_excel()?

Of course, we usually call a read function and then write extra code afterwards to fix up the data format. In many cases, however, we can set a few parameters of read_csv() and the like to finish part of that processing in advance, which saves a lot of work. Here is an example:

If there is a csv file, I want to change the type of one column to a text type:

df_csv = pd.read_csv("user_info.csv")
df_csv.user_id = df_csv.user_id.astype(str)

Generally we take the approach above, but converting many columns this way is cumbersome; we can do the conversion directly in read_csv():

df_csv = pd.read_csv('user_info.csv', dtype={'user_id': 'str'})

Each function in the read family has roughly twenty or more tunable parameters covering many such cases; you will discover them naturally as you use them.

Pandas ships with many I/O functions, but mastering the following four read/write pairs is enough for everyday work; if another data format comes up, you can learn its function then, as many parameters are similar (a small round-trip sketch follows the list):

  • pd.read_csv() - DataFrame.to_csv()
  • pd.read_excel() - DataFrame.to_excel()
  • pd.read_sql() - DataFrame.to_sql()
  • pd.read_json() - DataFrame.to_json()
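As promised, a minimal round-trip sketch (the file names and data are made up for illustration):

import pandas as pd

df = pd.DataFrame({'user_id': ['001', '002'], 'score': [90, 85]})

# Write and read back CSV; dtype preserves the leading zeros in user_id
df.to_csv('demo.csv', index=False)
df_csv = pd.read_csv('demo.csv', dtype={'user_id': 'str'})

# The same pattern works for JSON
df.to_json('demo.json')
df_json = pd.read_json('demo.json')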

3. Complex operations on table structures

1. Data cleaning

Pandas is often used as the data cleaning tool for machine learning, and anyone who does mathematical modeling and analysis has come into contact with this. Cleaning generally targets three kinds of data: null values, outliers, and duplicates. Missing-value handling covers counting, filtering and filling nulls, each with a corresponding function; duplicate handling includes duplicated() and drop_duplicates(), among others. If you have further data cleaning needs, you can read my column. A minimal sketch of the three tasks follows.
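The data below is made up for illustration:

import pandas as pd
import numpy as np

df = pd.DataFrame({'user': ['a', 'a', 'b', None],
                   'value': [1, 1, np.nan, 300]})

df.isna().sum()                                       # count nulls per column
df = df.dropna(subset=['user'])                       # filter rows with a null key
df['value'] = df['value'].fillna(df['value'].mean())  # fill remaining nulls
df = df.drop_duplicates()                             # drop repeated rows
df = df[df['value'] < 100]                            # drop an outlier by a simple rule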

2. Complex index operations

There are many index operations on a DataFrame. The ones needed in everyday business are relatively simple, but some requirements demand advanced index usage and are harder to get right. The most common operation is resetting the index: reset_index() converts every index level back into ordinary columns, while set_index() promotes a column (here a hypothetical ID column) to the index:

df = df.set_index('ID')
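A runnable sketch of the round trip, with made-up data:

import pandas as pd

df = pd.DataFrame({'ID': ['a01', 'a02', 'a03'], 'score': [90, 85, 88]})

df2 = df.set_index('ID')   # promote the ID column to the row index
df3 = df2.reset_index()    # turn every index level back into columns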

 

More complex index operations include index reshaping, which converts between long and wide table layouts. Understanding and using it well takes some effort. Long-wide conversion appears in many tables in data warehouses such as Hive, and in data such as billing records that carry multiple index levels.

Index reshaping reconstructs the original index. In a DataFrame's tabular structure we can locate a single value through the correspondence between its column name and row name, which you can think of as the x and y axes of the data; for example, locating user2's 2021 value. Reshaping the index is like changing the coordinate system, essentially a change of basis.

This way of pinning down a unique value through two features can be represented not only by a table structure but also by a tree structure:

The tree structure keeps the original row index unchanged and turns the column index into a secondary row index, which amounts to building a hierarchical (MultiIndex) index over the tabular data.

The method used in pandas is stack():
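Here df1 is a table of yearly values per user; a construction consistent with the stacked output below would be:

import pandas as pd

df1 = pd.DataFrame(
    {'sum': [100, 120, 130, 150, 160],
     2020: [30, 30, 40, 50, 40],
     2021: [30, 50, 50, 20, 40],
     2022: [40, 40, 40, 80, 80]},
    index=['user1', 'user2', 'user3', 'user4', 'user5'],
)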

df1.stack()

user1  sum     100
       2020     30
       2021     30
       2022     40
user2  sum     120
       2020     30
       2021     50
       2022     40
user3  sum     130
       2020     40
       2021     50
       2022     40
user4  sum     150
       2020     50
       2021     20
       2022     80
user5  sum     160
       2020     40
       2021     40
       2022     80
dtype: int64
 

With the stack() method provided by Pandas, converting between long and wide tables is easy. The following is a wide table:
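A hypothetical construction of such a wide table (names, cities and values made up for illustration):

import pandas as pd

df1 = pd.DataFrame({
    'name': ['Joe', 'Amy'],
    'city': ['Beijing', 'Shanghai'],
    '2020': [30, 50],
    '2021': [40, 40],
    '2022': [80, 20],
})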

To convert the wide table into a long table, the year information must become part of the row index while name and city stay unchanged. So first set name and city as the index, then call stack() to move the column index into the row index as well, and finally reset the index with reset_index():

df1.set_index(['name', 'city'], inplace=True)
df_long = df1.stack().reset_index()

 

If you are confused, you can read my blog post: Quick Learning in One Article (9) - Pandas Index Reshaping for Data Analysis to Implement Long-Wide Table Data Conversion

There is also the reverse conversion from a long table back to a wide table, which I will not demonstrate in detail here; a one-line sketch is below.
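A minimal sketch, reusing the indexed df1 from above:

# unstack() is the inverse of stack(): it moves the innermost row-index
# level back out into the columns, recovering the wide table
df_wide = df1.stack().unstack()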

3. Numerical operations

Numerical operations count as basic operations; see my mind map for the full set. Here are a few essential ones. Numerical sorting:

df1.sort_values(by=['old', 'weight'], ascending=[True, False])

The code above sorts by old in ascending order first; when rows have equal old values, they are sorted by weight in descending order.

Ranking:

se1 = df1['old'].rank()
df1.insert(0, 'rank', se1)
df1.sort_values(by='old')

 

These operations are very common in SQL, and Pandas was indeed designed with database table operations as a reference: generally speaking, whatever SQL can do to a table is available in Pandas, which essentially reproduces all of the grouping and aggregation functions commonly used in SQL.

4. Data grouping

Data engineers can probably type groupby with their eyes closed. Pandas faithfully reproduces the aggregation functions that databases use to process tables. Grouping takes three steps:

First, the data table is split into groups according to one or more keys. Second, a computation, which may be a custom function, is applied to each group. Third, the results are merged into a new table. The process is very similar to MapReduce in big-data computing frameworks. With these steps we can take a disordered table and produce exactly the data we want. I won't go deeper here; the point to remember is that Pandas matches SQL's grouping capabilities. A minimal split-apply-combine sketch follows.
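The column names and values below are made up for illustration:

import pandas as pd

df = pd.DataFrame({'city': ['bj', 'bj', 'sh', 'sh'],
                   'sales': [10, 20, 30, 40]})

# split by key, apply an aggregation, combine into a new table
total = df.groupby('city')['sales'].sum()

# several aggregations at once, including a custom function
stats = df.groupby('city')['sales'].agg(['sum', 'mean', lambda x: x.max() - x.min()])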

5. Time series processing

Data read from databases or log files usually contains time series. To process time series or do real-time analysis, the series must be classified and archived: columns stored as objects, strings, integers and so on need to be converted to the datetime type that Pandas recognizes, so that time arithmetic and other operations become possible. You will pick up most of these methods naturally; here are the commonly used ones:

asfreq() changes the frequency of a series and assigns a value to every new time slice:

rng = pd.date_range('1/1/2011', periods=2, freq='d')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
converted = ts.asfreq('360Min', method='pad')

 

resample() resamples the hourly series down to daily frequency:

rng = pd.date_range('1/1/2011', periods=5, freq='H')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts.resample('D').mean()

 

Converting a column to the datetime type:

df_csv['collect_date'] = pd.to_datetime(df_csv['collect_date'], format="%Y-%m-%d")
df_csv.dtypes

 

4. Text data processing

Pandas has two dtypes for text data:

  • object: a NumPy object-dtype array, which can hold any Python object, including strings
  • string: a dedicated extension dtype (StringDtype) for text, added in Pandas 1.0

Text is still most commonly stored as object, because before Pandas 1.0 it was the only option, so character and non-character data alike were stored as object and were easy to mix up. Later versions added the string dtype to cleanly separate text from other data; for backward compatibility, however, Pandas still infers object as the default storage type for text.
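A tiny sketch of the difference:

import pandas as pd

pd.Series(['a', 'b']).dtype                  # object, the backward-compatible default
pd.Series(['a', 'b'], dtype='string').dtype  # string (StringDtype)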

There are many methods for handling the string dtype: both Series and Index come equipped with a set of string processing methods that operate conveniently on each element of the array. Most importantly, these methods automatically skip missing/NA values. They are accessed through the .str attribute, and their names generally match the equivalent built-in (scalar) string methods.

Here are a few commonly used cases:

1. Case conversion

s = pd.Series(['A', 'b', 'C', np.nan, 'ABC', 'abc', 'AbC'], dtype='string')

Lowercase conversion with lower():

s.str.lower()

 

Uppercase conversion with upper():

s.str.upper()

 

2. String space removal

Stripping is often used together with regular expressions, so it shows up frequently in crawler code. It comes in three variants: strip(), lstrip() and rstrip() (a sketch of the latter two follows the example). Here we create a data set that covers the test cases:

s = pd.Index([' A', 'A ', ' A ', 'A'], dtype='string')

strip() removes whitespace from both ends:

s.str.strip()
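The other two variants, reusing the s defined above:

s.str.lstrip()   # remove leading whitespace only
s.str.rstrip()   # remove trailing whitespace only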

 

String methods on Index objects are particularly useful for cleaning or transforming DataFrame columns. For example, a DataFrame may have columns with leading or trailing spaces:

df = pd.DataFrame(
    np.random.randn(3, 2), columns=[" Column A ", " Column B "], index=range(3)
)

 

After extracting the column index, we can chain the str methods to transform it; combined with other functions this achieves complex conversions in one pass:

df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

 

3. Split and splice

To split strings we generally use the split() function, which is very convenient:

s2 = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"], dtype="string")
s2.str.split("_")

 

The Series.str.cat() and Index.str.cat() methods can concatenate a Series or Index with itself or with other Series or Index objects.

The values of a Series (or Index) can be concatenated into a single string:

s = pd.Series(["a", "b", "c", "d"], dtype="string")
s.str.cat(sep=",")

 

The first argument to cat() can be a list-like object, provided its length matches that of the calling Series (or Index):

s.str.cat(["A", "B", "C", "D"])

Missing values on either side produce missing values in the result, unless na_rep is specified. (The second operand t is not shown in the original; a plausible definition is included here.)

t = pd.Series(["A", "B", np.nan, "D"], dtype="string")  # hypothetical second operand
s.str.cat(t, na_rep="-")

 

I won’t explain it further. Interested students can read my data analysis column.

5. Quick chart visualization

When we do data mining, data analysis, or pull data from a database in big-data development, we inevitably stare at table data from every angle, always wishing we could immediately turn the view we care about into a chart that presents the data more intuitively. Full data visualization usually means calling many libraries and functions, converting data, and writing a lot of code, which is tedious work. For everyday analysis we do not need engineering-grade visualization; that is what data analysts and professional reporting tools are for. We just want to draw a quick picture on demand, and Pandas has that capability built in. It still relies on the Matplotlib library underneath, but it compresses the code considerably (a minimal sketch follows).
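A minimal sketch, assuming a Matplotlib backend is available:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(100, 2).cumsum(axis=0), columns=['a', 'b'])
df.plot()        # a line chart straight from the DataFrame
df['a'].hist()   # a histogram of a single column
plt.show()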

That is all for this article. The Pandas data analysis column has been updated over a long period and covers essentially every aspect of handling daily business and routine analysis with Pandas, from a step-by-step introduction to the basic data structures, to the processing of each data type, to professional explanations of the common functions. A great deal of time and effort went into it. If you plan to work in data analysis or big data development, I recommend subscribing to the column to learn the most practical and commonly used Pandas knowledge as quickly as possible.


⭐️ Good book recommendations

Tsinghua University Press [Autumn Reading Plan]: claim a coupon and enjoy the discount right away

IT books at 50% off, plus a ¥10 no-minimum coupon: https://u.jd.com/Yqsd9wj

Event time: September 4 to September 17, first come, first served, so grab it quickly

"Pandas Data Analysis"

Brief introduction

"Pandas Data Analysis" details the basic solutions for data analysis with Pandas, covering an introduction to data analysis, working with the Pandas DataFrame, data wrangling with Pandas, aggregating Pandas DataFrames, visualizing data with Pandas and Matplotlib, plotting with Seaborn and customization techniques, financial analysis, rule-based anomaly detection, an introduction to machine learning in Python, making better predictions, and machine learning anomaly detection. The book also provides examples and code to help readers understand how these solutions are implemented.
"Pandas Data Analysis" is suitable as a textbook and teaching reference for computer science and related majors in colleges and universities, and as a self-study book and reference manual for practitioners.

Editor's Choice

Pandas is a powerful and popular library, synonymous with data science in Python. This book shows how to use Pandas to analyze real-world data sets such as stock market data, simulated hacker-attack data, weather trends, earthquake data, wine data, and astronomical data. Pandas lets us work with tabular data efficiently, making data wrangling and visualization easier.
