"Basics of Python Data Analysis" learning record 002:2.4 select consecutive rows, pandas method implementation

1. Data description and requirements:

The sample data is given at the end of the article. To practice, copy it and save it as a csv file.
The first three rows (row labels 0, 1, 2) and the last three rows (row labels 16, 17, 18) of the file are not needed.
Requirement: discard these six rows and save the middle part to another file.

2. Approach with the pandas module

  1. Use the drop method to discard the unneeded rows by their labels.
  2. Use the .iloc[] indexer to select rows or columns by integer position.
  3. Use the reindex method to rebuild the index.
  4. Use to_csv() to write out the rows that need to be kept.
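
The four steps above can be sketched on a small in-memory table. The data here is a hypothetical miniature (two junk lines, a header, three records, two junk lines), not the article's supplier file, and io.StringIO stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical miniature input: junk lines around a header and three records.
raw = """junk a,,
junk b,,
Name,Part,Cost
X,2341,500
Y,7009,250
Z,3321,615
junk c,,
junk d,,
"""

df = pd.read_csv(io.StringIO(raw), header=None)  # rows get labels 0..7
df = df.drop([0, 1, 6, 7])          # step 1: discard junk rows by label
df.columns = df.iloc[0]             # step 2: first remaining row becomes the header
df = df.reindex(df.index.drop(2))   # step 3: keep every label except the header row (2)
csv_text = df.to_csv(index=False)   # step 4: with no path, to_csv returns a string
print(csv_text)
```

Only the header and the three records remain in csv_text.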

Question:

What difference does it make whether reindex is called or not?

3. Involving method learning:

1) pandas.DataFrame.drop

DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
Drops the specified labels from rows or columns.
Rows or columns are removed either by giving label names together with the corresponding axis, or by giving the index or column names directly through the index and columns parameters. With a MultiIndex, labels on a particular level can be removed by specifying that level.

For details, see the pandas documentation for DataFrame.drop.
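
A minimal sketch of the most common ways to call drop. The small table and its labels are made up for illustration.

```python
import pandas as pd

# Hypothetical table with non-default row labels 10, 20, 30.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=[10, 20, 30])

rows_dropped = df.drop([10, 30])          # axis=0 is the default: these are row labels
cols_dropped = df.drop(columns=["b"])     # same as df.drop("b", axis=1)
ignored = df.drop([99], errors="ignore")  # a missing label is skipped instead of raising

print(rows_dropped.index.tolist())    # [20]
print(cols_dropped.columns.tolist())  # ['a']
print(ignored.shape)                  # (3, 2)
```

Because inplace defaults to False, each call returns a new DataFrame and df itself is unchanged.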

2) pandas.DataFrame.reindex

DataFrame.reindex(labels=None, index=None, columns=None, axis=None, method=None, copy=True, level=None, fill_value=nan, limit=None, tolerance=None)
Conforms the DataFrame to a new index, with optional filling logic.
Labels in the new index that did not exist in the previous index get NA/NaN values. A new object is produced unless the new index is equivalent to the current one and copy=False.
For details, see the pandas documentation for DataFrame.reindex.
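
A short sketch of how reindex behaves, using a made-up two-row table: a subset of the existing labels acts as a selection, while a label that did not exist before produces a NaN row unless fill_value is given.

```python
import pandas as pd

# Hypothetical table with row labels 4 and 5.
df = pd.DataFrame({"cost": [500, 250]}, index=[4, 5])

kept = df.reindex([5])                        # only existing labels: acts as a selection
extended = df.reindex([4, 5, 6])              # label 6 is new, so its row is NaN
filled = df.reindex([4, 5, 6], fill_value=0)  # new rows get 0 instead of NaN

print(kept.index.tolist())             # [5]
print(pd.isna(extended.loc[6, "cost"]))  # True
print(filled.loc[6, "cost"])           # 0
```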

3) pandas.DataFrame.columns

The column labels of the DataFrame.

4) pandas.DataFrame.index

The index (row labels) of the DataFrame.
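
A small sketch of both attributes, on a made-up table whose first row holds header text. It also shows that assigning a row to columns changes only the labels: the row itself stays in the table, and the new column index takes the row's label as its name.

```python
import pandas as pd

# Hypothetical table: row label 3 holds the header text, row label 4 holds data.
df = pd.DataFrame([["Name", "Cost"], ["X", "500"]], index=[3, 4])

print(df.columns.tolist())  # [0, 1]
print(df.index.tolist())    # [3, 4]

df.columns = df.iloc[0]     # use the first row's values as column labels

print(df.columns.tolist())  # ['Name', 'Cost']
print(df.columns.name)      # 3
print(len(df))              # 2
```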

4. Implementation code:

import pandas as pd

file1 = 'supplier_data1.csv'
file2 = 'output_file.csv'

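# Read every line as data; the rows get integer labels 0..18
data_frame = pd.read_csv(file1, header=None)
# Drop the first three and the last three rows by label
data_frame = data_frame.drop([0, 1, 2, 16, 17, 18])
# Use the first remaining row (label 3) as the column labels
data_frame.columns = data_frame.iloc[0]
# Keep every row except label 3, which still holds the header text
data_frame = data_frame.reindex(data_frame.index.drop(3))
# Write the result without the row index
data_frame.to_csv(file2, index=False)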

1) Analysis:

data_frame = data_frame.drop([0, 1, 2, 16, 17, 18])

Deletes the unneeded first three and last three rows by their row labels.

data_frame.columns = data_frame.iloc[0]

Sets the column labels from the first remaining row (position 0, row label 3), which holds the header text.

data_frame = data_frame.reindex(data_frame.index.drop(3))

Takes the current index with label 3 removed and uses it as the new index, so the header row no longer appears among the data rows.

data_frame.to_csv(file2, index=False)

Writes the processed data to a new csv file, without the row index.
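
Because the rows to keep are consecutive, the same selection can also be written as a single .iloc[] slice. This is an alternative sketch and not the code used in this article; the table below is a made-up stand-in with the same layout (19 rows, header text at label 3).

```python
import pandas as pd

# Hypothetical stand-in: 3 junk rows, 1 header row, 12 records, 3 junk rows.
rows = (
    [["junk"] for _ in range(3)]
    + [["Supplier Name"]]
    + [[f"Supplier {i}"] for i in range(12)]
    + [["junk"] for _ in range(3)]
)
df = pd.DataFrame(rows)           # row labels 0..18

middle = df.iloc[3:-3]            # positions 3..15: header row plus the 12 records
body = middle.iloc[1:].copy()     # the records only
body.columns = middle.iloc[0].tolist()  # header text becomes the column labels

print(body.shape)            # (12, 1)
print(body.columns.tolist()) # ['Supplier Name']
```

The slice works by position, so it does not depend on knowing the labels of the trailing rows in advance.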

2) Visual demonstration

If 1) is already clear, skip 2).
If not, keep reading.
Add a print after each step to see the value it produces:

import pandas as pd

file1 = 'supplier_data1.csv'
file2 = 'output_file.csv'

data_frame = pd.read_csv(file1, header=None)
data_frame = data_frame.drop([0, 1, 2, 16, 17, 18])
# Check whether drop ran correctly by inspecting the result
print(data_frame.values)  # 1
data_frame.columns = data_frame.iloc[0]
# What is data_frame.columns?
# What does the table look like after this assignment?
print(data_frame.columns)  # 2
print(data_frame)  # 3
data_frame = data_frame.reindex(data_frame.index.drop(3))
print(data_frame.index)  # 4
data_frame.to_csv(file2, index=False)

The printed results are as follows:
1.
[['Supplier Name' 'Invoice Number' 'Part Number' 'Cost' 'Purchase Date']
 ['Supplier X' '001-1001' '2341' '$500.00 ' '1/20/14']
 ['Supplier X' '001-1001' '2341' '$500.00 ' '1/20/14']
 ['Supplier X' '001-1001' '5467' '$750.00 ' '1/20/14']
 ['Supplier X' '001-1001' '5467' '$750.00 ' '1/20/14']
 ['Supplier Y' '50-9501' '7009' '$250.00 ' '1/30/14']
 ['Supplier Y' '50-9501' '7009' '$250.00 ' '1/30/14']
 ['Supplier Y' '50-9505' '6650' '$125.00 ' '2002/3/14']
 ['Supplier Y' '50-9505' '6650' '$125.00 ' '2002/3/14']
 ['Supplier Z' '920-4803' '3321' '$615.00 ' '2002/3/14']
 ['Supplier Z' '920-4804' '3321' '$615.00 ' '2002/10/14']
 ['Supplier Z' '920-4805' '3321' '$6,015.00 ' '2/17/14']
 ['Supplier Z' '920-4806' '3321' '$1,006,015.00 ' '2/24/14']]
2.
Index(['Supplier Name', 'Invoice Number', 'Part Number', 'Cost',
       'Purchase Date'],
      dtype='object', name=3)
3.
3   Supplier Name  Invoice Number  Part Number           Cost  Purchase Date
3   Supplier Name  Invoice Number  Part Number           Cost  Purchase Date
4      Supplier X        001-1001         2341        $500.00        1/20/14
5      Supplier X        001-1001         2341        $500.00        1/20/14
6      Supplier X        001-1001         5467        $750.00        1/20/14
7      Supplier X        001-1001         5467        $750.00        1/20/14
8      Supplier Y         50-9501         7009        $250.00        1/30/14
9      Supplier Y         50-9501         7009        $250.00        1/30/14
10     Supplier Y         50-9505         6650        $125.00      2002/3/14
11     Supplier Y         50-9505         6650        $125.00      2002/3/14
12     Supplier Z        920-4803         3321        $615.00      2002/3/14
13     Supplier Z        920-4804         3321        $615.00     2002/10/14
14     Supplier Z        920-4805         3321      $6,015.00        2/17/14
15     Supplier Z        920-4806         3321  $1,006,015.00        2/24/14
4.
Int64Index([4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype='int64')

It can be seen that after the data_frame.columns assignment, the header text appears twice: once as the column labels and once as the data row with label 3.
This is why the following reindex call uses the index with label 3 removed: it discards that leftover row.
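
The effect of the reindex step can be checked directly by writing the table with and without it. A minimal sketch on made-up data (one junk line, a header, two records); data_frame.drop(1) is shown as an equivalent way to remove the leftover row.

```python
import io
import pandas as pd

# Hypothetical miniature input: one junk line, then header and two records.
raw = "junk,\nName,Cost\nX,500\nY,250\n"

df = pd.read_csv(io.StringIO(raw), header=None)  # row labels 0..3
df = df.drop([0])                                # remove the junk row
df.columns = df.iloc[0]                          # header text is at label 1

without_reindex = df.to_csv(index=False)
with_reindex = df.reindex(df.index.drop(1)).to_csv(index=False)
with_drop = df.drop(1).to_csv(index=False)       # same result as the reindex call

print(without_reindex.splitlines())  # ['Name,Cost', 'Name,Cost', 'X,500', 'Y,250']
print(with_reindex.splitlines())     # ['Name,Cost', 'X,500', 'Y,250']
```

Without the reindex step the header line is written twice; with it, the output contains the header once followed by the records.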

Conclusion:

  1. pandas is powerful and has many methods, but the details matter. If the result is not what you expect, add print statements to see what each step returns, then adjust the code.
  2. Learn more, practice more, think more!

Origin blog.csdn.net/Haoyu_xie/article/details/106584373