Using pandas in Python to read xlsx data and save it to a txt file

Use pandas to read data from an xlsx file and store it in a txt file.

Convert a single file:

import pandas as pd
df = pd.read_excel('../data/x/ant_1.5.xlsx', usecols="C,X")  # read only columns C and X with pandas
df['Class'] = df['Class'].str.replace('.', '/', regex=False)  # replace each literal '.' with '/'
print('Writing the txt file...')
df.to_csv('../data/t/ant_1.5.txt', header=None, sep=' ', index=False)  # write to txt, space-separated
print('File written successfully!')
① Read specified columns
df = pd.read_excel("data.xlsx", usecols=[0, 5])  # read only the 1st and 6th columns
# "A,F" can be used instead of [0, 5]

Reference: reading specified Excel columns with pandas

② Replacement: replace()

replace() can replace values in a column, a row, or the entire table.
df.replace() or df[col].replace()

# The parameters are as follows:
df.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')
Parameter description:
  • to_replace: the value to be replaced
  • value: the replacement value
  • inplace: whether to modify the original data; False returns a new object, True modifies the data in place. The default is False
  • limit: the maximum gap size to fill forward or backward
  • regex: whether to interpret to_replace as a regular expression; the default is False
  • method: the fill method used when no value is given; pad and ffill fill forward, bfill fills backward

Note that limit and method are deprecated in recent pandas releases.

df.replace() and df[col].replace() match the entire cell value: if to_replace='asdfg', a cell is replaced only when its whole value equals 'asdfg'.
Often we want to replace just part of a value instead, for example turning '深圳地区' (Shenzhen area) into '深圳市' (Shenzhen city). For that, go through the .str accessor first:

main_copy['city']=main_copy['city'].str.replace('地区','市')  # '地区' (area) -> '市' (city)

Reference: usage of replace
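
As a small illustration of the difference between whole-value and substring replacement, here is a minimal sketch. The column contents and the English strings are made up for the example and are not from the original data.

```python
import pandas as pd

# Hypothetical sample data: one cell contains 'area' as a substring,
# the other cell is exactly 'area'.
df = pd.DataFrame({'city': ['Shenzhen area', 'area']})

# Series.replace() compares whole cell values, so only the exact match changes.
whole = df['city'].replace('area', 'city')

# .str.replace() works inside each string, so every occurrence changes.
partial = df['city'].str.replace('area', 'city', regex=False)

print(whole.tolist())
print(partial.tolist())
```

The first call leaves 'Shenzhen area' untouched, while the second rewrites it to 'Shenzhen city'.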

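To replace the literal . in a column with /, pass regex=False to str.replace(), or escape the dot as r'\.' when using regex=True. In a regular expression an unescaped . matches any character, so '.' with regex=True replaces every character with /. Older pandas 1.x releases treated a single-character pattern as a literal string even with regex=True, which is why '.' with regex=True appeared to work there; pandas 2.0 removed that special case.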
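
The two version-independent ways of replacing a literal dot can be sketched as follows. The class names in the Series are illustrative sample values, not rows from the original spreadsheets.

```python
import pandas as pd

# Hypothetical dotted class names standing in for the 'Class' column.
classes = pd.Series(['org.apache.tools.ant.Main', 'a.b'])

# Option 1: treat the pattern as plain text.
literal = classes.str.replace('.', '/', regex=False)

# Option 2: use a regular expression, with the dot escaped.
escaped = classes.str.replace(r'\.', '/', regex=True)

print(literal.tolist())
print(escaped.tolist())
```

Both options give the same result; regex=False is the simpler choice when no pattern matching is needed.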

Regular-expression replacement

The escape character \ can escape many characters: \n means newline, \t means tab, and the character \ itself must also be escaped, so \\ represents a single \. If a string contains many characters that need escaping, you have to add a lot of \. For simplicity, Python also allows the r'' prefix, which means the string inside '' is not escaped.

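import numpy as np

df.replace(r'\?|\.|\$', np.nan)  # no change: without regex=True the pattern is compared as a plain value
df.replace(r'\?|\.|\$', np.nan, regex=True)  # cells containing ?, . or $ become np.nan
df.replace([r'\?', r'\$'], np.nan, regex=True)  # cells containing ? or $ become np.nan
df.replace([r'\?', r'\$'], [np.nan, 'NA'], regex=True)  # cells containing ? become np.nan; each $ is replaced with 'NA'
df.replace(regex={r'\?': None})  # the same idea, with the patterns passed as a dict through regex=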
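
To make the effect of regex=True concrete, here is a minimal sketch on a made-up table. The column names and values are assumptions chosen for the example. It also shows that a non-string replacement such as np.nan replaces the whole matching cell, while a string replacement substitutes only the matched characters.

```python
import numpy as np
import pandas as pd

# Hypothetical data; object dtype keeps the cells as plain Python strings.
df = pd.DataFrame({'a': ['x?', 'ok', '$5'], 'b': ['1.5', 'n', '?']}, dtype=object)

pattern = r'\?|\.|\$'

# Without regex=True the pattern is compared as a plain value, so nothing matches.
no_regex = df.replace(pattern, np.nan)

# With regex=True, every cell that contains ?, . or $ becomes NaN.
with_nan = df.replace(pattern, np.nan, regex=True)

# With a string replacement, only the matched characters are substituted.
with_str = df.replace(pattern, '', regex=True)

print(with_nan)
print(with_str)
```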

By default replace() returns a new object and leaves df unchanged, so either assign the result back or pass inplace=True. The two lines below are equivalent:

df=df.replace(20,30)
df.replace(20,30,inplace=True)
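
A short sketch of the two styles, using a made-up single-column table, shows that the original is untouched until the result is assigned back or inplace=True is used.

```python
import pandas as pd

# Hypothetical numeric column for the demonstration.
df = pd.DataFrame({'n': [10, 20, 30]})

# replace() returns a new DataFrame; df itself is not modified here.
new_df = df.replace(20, 30)
original_after_copy = df['n'].tolist()

# inplace=True modifies df directly and returns None.
df.replace(20, 30, inplace=True)

print(original_after_copy)
print(new_df['n'].tolist())
print(df['n'].tolist())
```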

Batch processing:

# Simple logic
import pandas as pd
import os
xlsx_path='../data/x/'
txt_path='../data/t/'
for xlsxfile in os.listdir(xlsx_path):
    if xlsxfile == '.DS_Store':  # skip the hidden macOS metadata file
        continue
    df = pd.read_excel(xlsx_path+xlsxfile,usecols="C,X")
    df['Class'] = df['Class'].str.replace('.', '/', regex=False)  # replace each literal '.' with '/'
    df.to_csv(txt_path+xlsxfile.split('.xlsx')[0]+'.txt', header=None, sep=' ', index=False)

os.listdir() returns a list of the names of all files and folders in the specified folder.
.DS_Store is a hidden file that macOS creates in folders; the check in the loop skips it.
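
An alternative to checking for '.DS_Store' by name is to keep only visible files that end in .xlsx. The sketch below is one possible way to do that; list_xlsx is a hypothetical helper, and the file names are created in a temporary folder only for the demonstration.

```python
import os
import tempfile

def list_xlsx(folder):
    """Return the sorted names of visible .xlsx files in folder (hypothetical helper)."""
    return sorted(
        name for name in os.listdir(folder)
        if name.endswith('.xlsx') and not name.startswith('.')
    )

# Demonstrate on a throwaway folder containing a hidden file and a stray file.
with tempfile.TemporaryDirectory() as folder:
    for name in ['ant_1.5.xlsx', 'poi_2.5.1.xlsx', '.DS_Store', 'notes.md']:
        open(os.path.join(folder, name), 'w').close()
    found = list_xlsx(folder)

# os.path.splitext() strips only the final extension, so version dots survive.
stems = [os.path.splitext(name)[0] for name in found]

print(found)
print(stems)
```

Filtering by extension also skips any other non-Excel file that happens to be in the folder.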


2022.3.11


Adding a prefix or suffix to a column of data with pandas

Convert the DataFrame column to str, then use string concatenation.

# Add a prefix
newDF = strs + oldDF[col].astype('str')
# Add a suffix
newDF = oldDF[col].astype('str') + strs

Here oldDF is the original DataFrame, col is the column name, strs is the string to add, and the new column is stored in newDF.
To insert text at another position, slice the strings and concatenate the pieces in the same way.
Reference: adding a prefix and suffix to a column of data
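
The prefix, suffix, and slicing cases can be sketched on a small made-up column. The 'Class' values, the '.java' suffix, and the '_x' marker are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical column of path-like class names.
df = pd.DataFrame({'Class': ['org/apache/Main', 'org/apache/Util']})

# Prefix: a plain string on the left of the column.
prefixed = 'src/main/' + df['Class'].astype('str')

# Suffix: a plain string on the right of the column.
suffixed = df['Class'].astype('str') + '.java'

# Insertion at another position: slice with .str[...] and concatenate the pieces.
inserted = df['Class'].str[:3] + '_x' + df['Class'].str[3:]

print(prefixed.tolist())
print(suffixed.tolist())
print(inserted.tolist())
```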

Add the corresponding prefix according to the condition:
import pandas as pd
import os
xlsx_path='../data/x/'
txt_path='../data/t/'
for xlsxfile in os.listdir(xlsx_path):
    if xlsxfile == '.DS_Store':
        continue
    df = pd.read_excel(xlsx_path+xlsxfile,usecols="C,X")
    df['Class'] = df['Class'].str.replace('.', '/', regex=False)  # replace each literal '.' with '/'
    if (xlsxfile == 'ant_1.5.xlsx' or xlsxfile == 'ant_1.6.xlsx' or xlsxfile == 'ant_1.7.xlsx'):
        df['Class'] = 'src/main/' + df['Class'].astype('str')
    elif (xlsxfile == 'camel_1.2.xlsx' or xlsxfile == 'camel_1.4.xlsx' or xlsxfile == 'camel_1.6.xlsx'):
        df['Class'] = 'camel-core/src/main/java/' + df['Class'].astype('str')
    elif (xlsxfile == 'ivy_1.4.xlsx' or xlsxfile == 'ivy_2.0.xlsx' or xlsxfile == 'log4j_1.0.xlsx' or xlsxfile == 'log4j_1.1.xlsx' or xlsxfile == 'lucene_2.0.xlsx' or xlsxfile == 'lucene_2.2.xlsx' or xlsxfile == 'lucene_2.4.xlsx' or xlsxfile == 'poi_1.5.xlsx' or xlsxfile == 'poi_2.5.1.xlsx' or xlsxfile == 'poi_3.0.xlsx'):
        df['Class'] = 'src/java/' + df['Class'].astype('str')
    elif (xlsxfile == 'synapse_1.0.xlsx' or xlsxfile == 'synapse_1.1.xlsx' or xlsxfile == 'synapse_1.2.xlsx'):
        df['Class'] = 'modules/core/src/main/java/' + df['Class'].astype('str')
    elif (xlsxfile == 'xalan_2.4.0.xlsx' or xlsxfile == 'xalan_2.5.0.xlsx' or xlsxfile == 'xerces_1.2.0.xlsx' or xlsxfile == 'xerces_1.3.0.xlsx'):
        df['Class'] = 'src/' + df['Class'].astype('str')
    df.to_csv(txt_path+xlsxfile.split('.xlsx')[0]+'.txt', header=None, sep=' ', index=False)
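
The long or-chains in the loop above could also be expressed as a lookup table. The sketch below is one possible restructuring, not the original author's code: PREFIXES and prefix_for are hypothetical names, and it assumes the prefix depends only on the project name before the first underscore, so it would also apply to versions the original conditions do not list.

```python
# Path prefix per project, taken from the branches of the loop above.
PREFIXES = {
    'ant': 'src/main/',
    'camel': 'camel-core/src/main/java/',
    'ivy': 'src/java/',
    'log4j': 'src/java/',
    'lucene': 'src/java/',
    'poi': 'src/java/',
    'synapse': 'modules/core/src/main/java/',
    'xalan': 'src/',
    'xerces': 'src/',
}

def prefix_for(xlsxfile):
    """Return the path prefix for a file such as 'ant_1.5.xlsx' (hypothetical helper)."""
    project = xlsxfile.split('_')[0]
    # Unknown projects get no prefix, like a file that matches no branch.
    return PREFIXES.get(project, '')

print(prefix_for('ant_1.5.xlsx'))
print(prefix_for('poi_2.5.1.xlsx'))
```

Inside the loop, the whole if/elif block would then reduce to df['Class'] = prefix_for(xlsxfile) + df['Class'].astype('str').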

2022.3.12


Add the filename as a prefix:
import pandas as pd
import os
xlsx_path='../data/xlsx/'
txt_path='../data/txt/'
for xlsxfile in os.listdir(xlsx_path):
    if xlsxfile == '.DS_Store':
        continue
    df = pd.read_excel(xlsx_path+xlsxfile,usecols="C,X")
    df['Class'] = df['Class'].str.replace('.', '/', regex=False)  # replace each literal '.' with '/'
    if (xlsxfile == 'ant_1.5.xlsx' or xlsxfile == 'ant_1.6.xlsx' or xlsxfile == 'ant_1.7.xlsx'):
        df['Class'] = xlsxfile.split('.xlsx')[0] + '/src/main/' + df['Class'].astype('str')
    elif (xlsxfile == 'camel_1.2.xlsx' or xlsxfile == 'camel_1.4.xlsx' or xlsxfile == 'camel_1.6.xlsx'):
        df['Class'] = xlsxfile.split('.xlsx')[0] + '/camel-core/src/main/java/' + df['Class'].astype('str')
    elif (xlsxfile == 'ivy_1.4.xlsx' or xlsxfile == 'ivy_2.0.xlsx' or xlsxfile == 'log4j_1.0.xlsx' or xlsxfile == 'log4j_1.1.xlsx' or xlsxfile == 'lucene_2.0.xlsx' or xlsxfile == 'lucene_2.2.xlsx' or xlsxfile == 'lucene_2.4.xlsx' or xlsxfile == 'poi_1.5.xlsx' or xlsxfile == 'poi_2.5.1.xlsx' or xlsxfile == 'poi_3.0.xlsx'):
        df['Class'] = xlsxfile.split('.xlsx')[0] + '/src/java/' + df['Class'].astype('str')
    elif (xlsxfile == 'synapse_1.0.xlsx' or xlsxfile == 'synapse_1.1.xlsx' or xlsxfile == 'synapse_1.2.xlsx'):
        df['Class'] = xlsxfile.split('.xlsx')[0] + '/modules/core/src/main/java/' + df['Class'].astype('str')
    elif (xlsxfile == 'xalan_2.4.0.xlsx' or xlsxfile == 'xalan_2.5.0.xlsx' or xlsxfile == 'xerces_1.2.0.xlsx' or xlsxfile == 'xerces_1.3.0.xlsx'):
        df['Class'] = xlsxfile.split('.xlsx')[0] + '/src/' + df['Class'].astype('str')
    df.to_csv(txt_path+xlsxfile.split('.xlsx')[0]+'.txt', header=None, sep=' ', index=False)
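
To see what one output line looks like without needing an Excel file, here is a minimal in-memory sketch of the same steps for a single row. The 'Class' column name comes from the code above; the 'bug' column name and the row values are assumptions, since the name of column X is not given.

```python
import io
import os
import pandas as pd

xlsxfile = 'ant_1.5.xlsx'
stem = os.path.splitext(xlsxfile)[0]  # file name without the .xlsx extension

# Hypothetical stand-in for the two columns read from the spreadsheet.
df = pd.DataFrame({'Class': ['org.apache.tools.ant.Main'], 'bug': [1]})

# Dots to slashes, then file name and source folder as the prefix.
df['Class'] = stem + '/src/main/' + df['Class'].str.replace('.', '/', regex=False)

# Write space-separated text to an in-memory buffer instead of a file.
buffer = io.StringIO()
df.to_csv(buffer, header=False, sep=' ', index=False)
fields = buffer.getvalue().split()

print(fields)
```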

Origin blog.csdn.net/qq_45484237/article/details/123434392