Python pandas DataFrame

Study notes from the 小象学院 Python AI course
http://www.chinahadoop.cn/bootcamp/course/1276

  • applymap

Code demo:

import pandas as pd


def transfer_odd_even(x):
    # Even numbers are divisible by 2; everything else is odd.
    if x % 2 == 0:
        return 'even'
    else:
        return 'odd'


arr = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6]
})

arr2 = arr.applymap(transfer_odd_even)
print(arr2)

Output:

      a     b
0   odd  even
1  even   odd
2   odd  even

Practice:

import pandas as pd

fruits_sold = pd.DataFrame({
    's1': [20.85746739, 48.69399627, 6.64139183, 43.97206466, 42.24557245, 6.59165588, 18.14644399, 0.51207489,
           4.07669522, 40.2318477],
    's2': [44.12131395, 21.25162529, 10.10155038, 18.26017017, 26.81437566, 22.6139889, 34.7631294, 12.53410469,
           23.19350524, 13.7823773],
    's3': [45.81103321, 6.06090098, 13.54856198, 26.20863107, 23.12541538, 45.61618866, 46.27495742, 14.26893644,
           24.02366712, 0.64639531],
    's4': [6.284111, 21.20035631, 20.64741436, 17.21031833, 1.14764087, 48.34064417, 43.86282243, 28.06504264,
           49.73980011, 24.48708205],
    's5': [33.1696067, 42.65221954, 10.76371411, 30.74131832, 0.37146529, 42.56511105, 42.08960913, 16.59236303,
           6.43504234, 45.85999181],
    's6': [42.05696716, 27.41229091, 11.66774126, 1.85674496, 3.32968606, 40.72835009, 28.8835585, 26.46075757,
           46.33800476, 35.4963476]},
    index=['5-1', '5-2', '5-3', '5-4', '5-5', '5-6', '5-7', '5-8', '5-9', '5-10'],
)


def sold_grade(amount):
    if amount >= 40:
        return 'A'
    elif amount >= 30:
        return 'B'
    else:
        return 'C'


fruits_sold_grade = fruits_sold.applymap(sold_grade)

print(fruits_sold_grade)

Output:

     s1 s2 s3 s4 s5 s6
5-1   C  A  A  C  B  A
5-2   A  C  C  C  A  C
5-3   C  C  C  C  C  C
5-4   A  C  C  C  B  C
5-5   A  C  C  C  C  C
5-6   C  C  A  A  A  A
5-7   C  B  A  A  A  C
5-8   C  C  C  C  C  C
5-9   C  C  C  A  C  A
5-10  A  C  C  C  A  B

Conclusion

The applymap method passes every element of a DataFrame through a function and returns a new DataFrame of the results.
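Newer pandas releases (2.1 and later) deprecate applymap in favor of DataFrame.map, which behaves the same way element by element. The following is a minimal sketch that should run on either version; the helper name elementwise and the function parity are hypothetical names used for illustration, not part of pandas.

```python
import pandas as pd


def parity(x):
    # Label a single integer as 'even' or 'odd'.
    return 'even' if x % 2 == 0 else 'odd'


def elementwise(df, func):
    # Hypothetical helper: prefer DataFrame.map (pandas >= 2.1),
    # fall back to applymap on older versions.
    if hasattr(df, 'map'):
        return df.map(func)
    return df.applymap(func)


arr = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
labels = elementwise(arr, parity)
print(labels)
```

Either branch calls the function once per cell, so it suits logic that cannot be expressed as a whole-column operation.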


  • apply

Code demo

import pandas as pd


demo_arr = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])


def compared_by_5_return_true_false(num):
    # An earlier version used an unnecessary conditional here:
    # return True if num > 5 else False
    return num >= 5


def sum_by_series(series):
    return series.sum()


# Check which values are not less than 5
print(demo_arr.apply(compared_by_5_return_true_false))
print(demo_arr.applymap(compared_by_5_return_true_false))

# Sum of each column
print(demo_arr.apply(sum_by_series))

# Sum of each row
print(demo_arr.apply(sum_by_series, axis=1))

Output

       0      1      2
0  False  False  False
1  False   True   True
2   True   True   True
       0      1      2
0  False  False  False
1  False   True   True
2   True   True   True
0    12
1    15
2    18
dtype: int64
0     6
1    15
2    24
dtype: int64
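
For simple comparisons and sums like these, pandas' built-in vectorized operations give the same results without a Python-level function. A short sketch for comparison, rebuilding the same demo_arr; the variable names are illustrative only.

```python
import pandas as pd

demo_arr = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Element-wise comparison; same result as apply/applymap with num >= 5.
not_less_than_5 = demo_arr >= 5

# Column sums (default axis=0) and row sums (axis=1).
col_sums = demo_arr.sum()
row_sums = demo_arr.sum(axis=1)

print(not_less_than_5)
print(col_sums)
print(row_sums)
```

apply remains the more general tool when the per-row or per-column logic has no built-in equivalent.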

Practice 1

import pandas as pd

df = pd.DataFrame({
    's1': [27.93, 58.08, 38.67, 45.83, 70.26, 46.61, 49.73, 34.02, 56.64, 57.28],
    's2': [28.18, 50.61, 31.73, 31.48, 55.96, 22.73, 40.47, 42.02, 31.39, 64.21],
    's3': [29.39, 51.62, 57.91, 45.94, 53.81, 45.77, 69.13, 28.75, 43.43, 55.7],
    's4': [40.52, 48.55, 59.24, 71.21, 58.48, 63.63, 55.16, 34.9, 54, 68.03],
    's5': [26.26, 54.03, 49.08, 46.53, 43.23, 56.79, 58.71, 26.43, 44.97, 54.16]
}, index=['05-21', '05-22', '05-23', '05-24', '05-25', '05-26', '05-27', '05-28', '05-29', '05-30'])


def cell_larger_than_mean(bool_v):
    if bool_v:
        return 'A'
    else:
        return 'B'


def larger_than_mean(numbers):
    mean = numbers.mean()
    return numbers > mean


print(df.apply(larger_than_mean, axis=1).applymap(cell_larger_than_mean))

Output

      s1 s2 s3 s4 s5
05-21  B  B  B  A  B
05-22  A  B  B  B  A
05-23  B  B  A  A  A
05-24  B  B  B  A  B
05-25  A  B  B  A  B
05-26  B  B  B  A  A
05-27  B  B  A  A  A
05-28  A  A  B  A  B
05-29  A  B  B  A  B
05-30  B  A  B  A  B
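
The same row-mean grading can also be written without apply or applymap. The sketch below is one possible vectorized alternative, not the course's solution; it uses only the first three rows of the data to stay short, and the names above_row_mean and grades are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    's1': [27.93, 58.08, 38.67],
    's2': [28.18, 50.61, 31.73],
    's3': [29.39, 51.62, 57.91],
    's4': [40.52, 48.55, 59.24],
    's5': [26.26, 54.03, 49.08],
}, index=['05-21', '05-22', '05-23'])

# Compare each cell with its row mean; axis=0 aligns the row means on the index.
above_row_mean = df.gt(df.mean(axis=1), axis=0)

# Map True/False to 'A'/'B' while keeping the original row and column labels.
grades = pd.DataFrame(np.where(above_row_mean, 'A', 'B'),
                      index=df.index, columns=df.columns)
print(grades)
```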

Practice 2

import pandas as pd

df = pd.DataFrame({
    's1': [27.93, 58.08, 38.67, 45.83, 70.26, 46.61, 49.73, 34.02, 56.64, 57.28],
    's2': [28.18, 50.61, 31.73, 31.48, 55.96, 22.73, 40.47, 42.02, 31.39, 64.21],
    's3': [29.39, 51.62, 57.91, 45.94, 53.81, 45.77, 69.13, 28.75, 43.43, 55.7],
    's4': [40.52, 48.55, 59.24, 71.21, 58.48, 63.63, 55.16, 34.9, 54, 68.03],
    's5': [26.26, 54.03, 49.08, 46.53, 43.23, 56.79, 58.71, 26.43, 44.97, 54.16]
}, index=['05-21', '05-22', '05-23', '05-24', '05-25', '05-26', '05-27', '05-28', '05-29', '05-30'])


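def gap_between_max_min(numbers):
    # Describe the row as '<max column>-<min column>=<gap>'.
    # The gap is formatted to two decimals instead of slicing the string,
    # which would cut values of different magnitude at different places.
    gap = numbers.max() - numbers.min()
    return f'{numbers.idxmax()}-{numbers.idxmin()}={gap:.2f}'


print(df.apply(gap_between_max_min, axis=1))

Output

05-21    s4-s5=14.26
05-22     s1-s4=9.53
05-23    s4-s2=27.51
05-24    s4-s2=39.73
05-25    s1-s5=27.03
05-26    s4-s2=40.90
05-27    s3-s2=28.66
05-28    s2-s5=15.59
05-29    s1-s2=25.25
05-30    s4-s5=13.87
dtype: object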

Conclusion

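What DataFrame.apply does

  • It passes each row or column (a Series) of the DataFrame through a function and returns a new row or column.
  • It passes each row or column (a Series) of the DataFrame through a function and reduces it to a single value.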
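
The two cases above can be sketched side by side. This is a small illustration on made-up data; the variable names centered, col_range, and row_total are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'s1': [1, 2, 3], 's2': [10, 20, 30]})

# Function returns a Series of the same length: apply yields a DataFrame.
centered = df.apply(lambda col: col - col.mean())

# Function returns a single value: apply yields a Series, one entry per column.
col_range = df.apply(lambda col: col.max() - col.min())

# With axis=1 the function receives each row instead of each column.
row_total = df.apply(lambda row: row.sum(), axis=1)

print(centered)
print(col_range)
print(row_total)
```

The shape of the result depends on what the function returns, not on apply itself.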

  • Defining a DataFrame

Code demo

import pandas as pd

arr_test = pd.DataFrame({'s1': [1, 2], 's2': [10, 20], 's3': [100, 200]}, index=['a', 'b'])
print('arr_test:\n', arr_test)
print('s1:\n', arr_test['s1'])
print("loc['a']:\n", arr_test.loc['a'])
print('iloc[0]:\n', arr_test.iloc[0])

Output

arr_test:
    s1  s2   s3
a   1  10  100
b   2  20  200
s1:
 a    1
b    2
Name: s1, dtype: int64
loc['a']:
 s1      1
s2     10
s3    100
Name: a, dtype: int64
iloc[0]:
 s1      1
s2     10
s3    100
Name: a, dtype: int64

Conclusion

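In 'column_name': [a, b, c], column_name is the column label. For example, in 'column1': [1, 2, 3] the column label is column1, and 1, 2, 3 are the elements of that column.

The 'column_name': [a, b, c] pairs are passed as a dict inside {}, for example pd.DataFrame({'column1': [1, 2, 3], 'column2': [4, 5, 6]}).

Use index=['a', 'b', 'c'] to define the row labels. For example, with arr_test = pd.DataFrame({'s1': [1, 2], 's2': [10, 20], 's3': [100, 200]}, index=['a', 'b']), row a holds 1, 10, 100 and row b holds 2, 20, 200.
If the number of elements in index differs from the length of the column lists, pandas raises an error like the following: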

Traceback (most recent call last):
  File "C:\Users\jianzhang\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\managers.py", line 1667, in create_block_manager_from_arrays
    mgr = BlockManager(blocks, axes)
  File "C:\Users\jianzhang\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\managers.py", line 114, in __init__
    self._verify_integrity()
  File "C:\Users\jianzhang\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\managers.py", line 311, in _verify_integrity
    construction_error(tot_items, block.shape[1:], self.axes)
  File "C:\Users\jianzhang\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\managers.py", line 1691, in construction_error
    passed, implied))
ValueError: Shape of passed values is (2, 3), indices imply (1, 3)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:/code/python_project/AI_Learn/pandas_dataframe_define.py", line 3, in <module>
    arr_test = pd.DataFrame({'s1': [1, 2], 's2': [10, 20], 's3': [100, 200]}, index=['a'])
  File "C:\Users\jianzhang\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\frame.py", line 392, in __init__
    mgr = init_dict(data, index, columns, dtype=dtype)
  File "C:\Users\jianzhang\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\construction.py", line 212, in init_dict
    return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File "C:\Users\jianzhang\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\construction.py", line 61, in arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "C:\Users\jianzhang\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\managers.py", line 1671, in create_block_manager_from_arrays
    construction_error(len(arrays), arrays[0].shape, axes, e)
  File "C:\Users\jianzhang\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\managers.py", line 1691, in construction_error
    passed, implied))
ValueError: Shape of passed values is (2, 3), indices imply (1, 3)

Process finished with exit code 1
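
The mismatch above surfaces as a ValueError, so it can be caught and inspected. A minimal sketch, assuming only that the exception type is ValueError; the exact message text differs between pandas versions, and the names data and mismatch_raised are illustrative.

```python
import pandas as pd

data = {'s1': [1, 2], 's2': [10, 20], 's3': [100, 200]}

try:
    # Two rows of data but only one index label: pandas rejects this.
    pd.DataFrame(data, index=['a'])
    mismatch_raised = False
except ValueError as exc:
    mismatch_raised = True
    print('construction failed:', exc)

# With a matching number of labels the construction succeeds.
ok = pd.DataFrame(data, index=['a', 'b'])
print(ok.shape)
```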

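Accessing a DataFrame

  • Use df['column_label'] to access a column, e.g. arr_test['s1']
  • Use loc[row_label] to access a row by label, e.g. arr_test.loc['a']
  • Use iloc[position] to access a row by integer position, e.g. arr_test.iloc[0]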
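
The three access styles can be compared on the same small DataFrame. A short sketch; the last line, which reads a single cell with loc[row_label, column_label], goes slightly beyond the list above.

```python
import pandas as pd

arr_test = pd.DataFrame({'s1': [1, 2], 's2': [10, 20], 's3': [100, 200]},
                        index=['a', 'b'])

col_s1 = arr_test['s1']         # column by label -> Series indexed by a, b
row_a = arr_test.loc['a']       # row by label -> Series indexed by s1, s2, s3
row_0 = arr_test.iloc[0]        # row by position -> same row as loc['a']
cell = arr_test.loc['b', 's3']  # single cell by row and column label

print(col_s1.tolist(), row_a.tolist(), row_0.tolist(), cell)
```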


Reposted from blog.csdn.net/m0_37549544/article/details/88072073