Python in practice: writing statistical data to Excel

Background

The ad visibility rate in the company's project was too low, and we needed to find out why. Statistics tracking points had been added earlier, and the operations team handed over the filtered data: a txt file of more than 500 MB whose contents look very chaotic when opened.


txt.png

I asked the operations team whether they had a tool for this, and they said no. I had been meaning to relearn Python 3 anyway, precisely so that I could write small scripts for tasks like this one.

The data consists mainly of HTTP GET requests. Each request has to be URL-decoded and its parameters extracted; the goal is to write them to an Excel file.

Because I had not used Python for a long time, I could not remember the syntax and kept looking back at my notes. I had also never worked with Excel from Python, so I looked that up online as I went.

Reading the file

Since the original file is too big, I copied part of it into a test.txt file in the current directory.

import os

file = open('./test.txt')  # open the file
for line in file:          # iterate over the file line by line
    ...
file.close()               # close it when finished
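As a minimal sketch, the same loop can be written with a with block, which closes the file even if an exception is raised partway through. The sample file and its contents below are made up so the example is self-contained; count_lines is a hypothetical helper, not part of the original script.

```python
import os
import tempfile

# Create a small throwaway file so the example runs on its own
# (the path and contents are illustrative, not the real log).
sample_path = os.path.join(tempfile.mkdtemp(), 'test.txt')
with open(sample_path, 'w', encoding='utf-8') as f:
    f.write('first line\nsecond line\n')

def count_lines(path):
    """Iterate over a file line by line without loading it all into memory."""
    count = 0
    with open(path, encoding='utf-8') as f:  # closed automatically on exit
        for line in f:
            count += 1
    return count

print(count_lines(sample_path))
```

Iterating over the file object reads one line at a time, which matters for a file of several hundred megabytes.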

After reading each line, decode it, split it, and convert it into a dictionary.

import urllib.parse

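# Define a function that converts one line of data into a dictionary
def getOneDict(line):
    params = dict()  # dictionary that holds the parsed parameters
    pre = 'timeAnalysis?'
    index = line.find(pre)  # the parameters we need come after this marker
    if index > -1:
        line = line[index+len(pre):].split()[0]  # take the substring, then keep the part before the first space
        line = urllib.parse.unquote(line)  # found online: decodes URL-encoded text
        paramList = line.split('&')  # yields items such as s=android
        for item in paramList:
            param = item.split('=')
            if (len(param) > 1):  # only if there is something after the equals sign
                params[param[0]] = param[1]  # add the entry to the dictionary
    return params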
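One alternative worth considering is the standard library's urllib.parse.parse_qsl, which splits on & and = first and decodes each part afterwards. The sketch below assumes the same timeAnalysis? marker; parse_params and the sample log line are hypothetical.

```python
from urllib.parse import parse_qsl

def parse_params(line, marker='timeAnalysis?'):
    """Return the query parameters that follow `marker` in a log line."""
    index = line.find(marker)
    if index == -1:
        return {}
    rest = line[index + len(marker):].split()
    if not rest:  # nothing follows the marker
        return {}
    # parse_qsl splits on '&' and '=' before decoding, so an encoded
    # '&' (%26) inside a value does not break the split.
    return dict(parse_qsl(rest[0]))

sample = 'GET /timeAnalysis?s=android&pid=04&name=a%26b HTTP/1.1'
print(parse_params(sample))
```

Note two differences from unquote followed by a manual split: parse_qsl drops parameters with empty values by default, and it treats + as a space.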

Creating the Excel file

With a dictionary for each line, the next step is to write the contents to an Excel document. First switch to the directory where pip3 lives and run  sudo pip3 install openpyxl  to install the openpyxl module. For the latest syntax, see  http://openpyxl.readthedocs.io/en/stable/

import openpyxl

def createExcel():
    if 'time.xlsx' in os.listdir('./'):  # if the file already exists, do not create it again
        return
    wb = openpyxl.Workbook()  # create the Excel workbook
    wb.active.title = 'Splash'  # a new workbook has one active sheet; rename it to Splash
    wb.create_sheet(title='Home')  # create a second sheet
    wb.save('time.xlsx')  # do not forget to save at the end

As a result, time.xlsx is generated locally; opening it shows two sheets.

 
execl.png

 

Inserting data into Excel

In Excel, columns are labeled A, B, C, ... and rows are numbered 1, 2, 3, ... According to the documentation, a cell can be located by combining the two (for example A1). I did not use this feature here.

The dictionary keys should form the header in the first row. For each dictionary, check every key against the header: if the key is already there, write the value into the new row under that column; if not, first add a new column whose first-row cell is the key, then write the value under it.
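For illustration only, here is a small sketch of how a 1-based column number maps to a column letter. The functions column_letter and cell_name are hypothetical helpers written for this example; openpyxl ships its own helper for this, openpyxl.utils.get_column_letter.

```python
def column_letter(index):
    """Convert a 1-based column number to its Excel letter (1 -> 'A', 27 -> 'AA')."""
    if index < 1:
        raise ValueError('column numbers start at 1')
    letters = ''
    while index > 0:
        # Subtract 1 first because the letters have no zero digit.
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord('A') + remainder) + letters
    return letters

def cell_name(row, column):
    """Combine a row number and a column number into a name such as 'B3'."""
    return column_letter(column) + str(row)

print(cell_name(1, 1))
print(cell_name(3, 28))
```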

def writeToExecl(params):
    wb = openpyxl.load_workbook('time.xlsx')  # open the Excel file
    splashSheet = wb['Splash']  # look up the two sheets
    homeSheet = wb['Home']
    pid = params.get('pid')  # .get avoids a KeyError when a line has no pid
    if pid == '04':  # entries with pid=04 go into the Splash sheet
        writeToSheet(splashSheet, wb, params)
    elif pid == '00':
        writeToSheet(homeSheet, wb, params)

def getValue(t):
    return t.value

def writeToSheet(sheet, wb, params):
    rows = tuple(sheet.rows)  # fetch all rows as a tuple
    index = len(rows) + 1  # insert the data in the row after the existing ones
    print('rows ' + str(index))
    titles = list(map(getValue, rows[0]))  # map over the first row to get each cell's value, then convert to a list
    if titles == [None]:  # testing showed the list is not empty: it has length 1 and contains None
        titles = []
    for k, v in params.items():  # iterate over the dictionary
        if k in titles:  # the header already has this key: write the value directly
            sheet.cell(row = index, column = titles.index(k) + 1).value = v  # row is index, column is the key's position in the header; cell indices start at 1
        else:
            sheet.cell(row = 1, column = len(titles) + 1).value = k  # add the header first, which creates a new column
            sheet.cell(row = index, column = len(titles) + 1).value = v  # then write the value
            titles.append(k)  # the header has grown by one, so update it
    wb.save('time.xlsx')  # save so the changes take effect
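The header-and-row logic can also be tried out in memory, without a workbook. The following is a sketch under that assumption; build_table and the sample records are hypothetical and only mirror the rule described above (known key: reuse its column; new key: add a column).

```python
def build_table(records):
    """Turn a list of dicts into a header row plus aligned value rows."""
    titles = []  # column headers in first-seen order
    rows = []
    for params in records:
        row = [None] * len(titles)
        for key, value in params.items():
            if key not in titles:
                titles.append(key)  # new key: add a column
                row.append(None)
            row[titles.index(key)] = value
        rows.append(row)
    # Rows built before a column existed are shorter; pad them.
    for row in rows:
        row.extend([None] * (len(titles) - len(row)))
    return titles, rows

records = [
    {'pid': '04', 's': 'android'},
    {'pid': '04', 't': '120'},
]
titles, rows = build_table(records)
print(titles)
for row in rows:
    print(row)
```

With openpyxl, the header and each row built this way could then be written with the worksheet's append method and the workbook saved once at the end, instead of loading and saving the file for every line.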

Traversing the files

Put the files to be analyzed into the web folder under the current directory, then iterate over the folder and read each file.

# readAFile must be defined before the loop that calls it
def readAFile(filename):
    file = open(filename)
    for line in file:
        params = getOneDict(line)
        if (len(params) == 0):  # if the dictionary is empty, move on to the next line
            continue
        writeToExecl(params)
    file.close()

for file in os.listdir('./web'):
    print('read file ' + os.path.basename(file))
    readAFile('./web/' + file)
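A slightly more defensive way to walk the folder is sketched below: it builds paths with os.path.join and skips subdirectories. The helper list_files and the temporary folder standing in for ./web are assumptions made for this example.

```python
import os
import tempfile

def list_files(directory):
    """Return the full paths of the regular files in `directory`, sorted by name."""
    paths = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)  # portable alternative to './web/' + name
        if os.path.isfile(path):  # skip subdirectories
            paths.append(path)
    return paths

# Illustrative setup: a temporary folder standing in for ./web
web_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(web_dir, 'subdir'))
for name in ('b.txt', 'a.txt'):
    with open(os.path.join(web_dir, name), 'w') as f:
        f.write('sample\n')

for path in list_files(web_dir):
    print('read file ' + os.path.basename(path))
```

os.listdir returns names in arbitrary order, so sorting makes the processing order predictable.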

Other notes

The script raised an error at runtime:

...
self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte
...

A search online showed that this is an encoding problem. I did not fix it properly: one or two bad lines do not affect the statistical proportions, so I found a way to ignore the error when opening the file.

file = open(filename, errors='ignore') 
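To see what errors='ignore' does, here is a small self-contained sketch that writes a byte sequence that is not valid UTF-8 and reads it back with two different error handlers. The file name and its contents are made up for illustration.

```python
import os
import tempfile

# Write bytes that are not valid UTF-8 (0x80 cannot start a character).
path = os.path.join(tempfile.mkdtemp(), 'bad.txt')
with open(path, 'wb') as f:
    f.write(b'pid=04\x80&s=android\n')

with open(path, encoding='utf-8', errors='ignore') as f:
    ignored = f.read()   # the bad byte is dropped silently

with open(path, encoding='utf-8', errors='replace') as f:
    replaced = f.read()  # the bad byte becomes U+FFFD

print(repr(ignored))
print(repr(replaced))
```

errors='replace' keeps a visible marker where data was lost, which can make damaged lines easier to spot later; errors='ignore' simply discards the offending bytes.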

The name of the generated file, the path of the files to analyze, and so on could also be passed in as command-line arguments, which would make the script more reusable, but I did not bother.

Overall, the functionality is actually very simple, but because I was unfamiliar with the language it took a lot of time. I only knew to search for the higher-order function map because Kotlin and Rx have something similar. There are probably things I do not know about, so parts of the code may be more cumbersome than necessary.

The file was too large and writing took too long (writeToExecl reloads and saves the whole workbook for every line, which is a likely reason). I only ran the script for a while, read part of the data, and worked out the probable cause. In the end I let operations filter the data and analyze the proportions.
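A sketch of what that could look like with the standard library's argparse module follows. The option names --input-dir and --output, and their defaults, are assumptions chosen to match the paths used in this article.

```python
import argparse

def parse_args(argv=None):
    """Parse the command line; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(
        description='Collect request parameters into an Excel file.')
    parser.add_argument('--input-dir', default='./web',
                        help='folder containing the log files')
    parser.add_argument('--output', default='time.xlsx',
                        help='Excel file to write')
    return parser.parse_args(argv)

# Passing an explicit list makes the example independent of the real command line.
args = parse_args(['--input-dir', './logs'])
print(args.input_dir, args.output)
```

The script would then use args.input_dir and args.output wherever './web' and 'time.xlsx' are currently hard-coded.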

Result

Source

import os
import urllib.parse
import openpyxl
import sys

# Convert one line of data into a dictionary
def getOneDict(line):
    params = dict()  # dictionary that holds the parsed parameters
    pre = 'timeAnalysis?'
    index = line.find(pre)
    if index > -1:
        line = line[index+len(pre):].split()[0]
        line = urllib.parse.unquote(line)
        paramList = line.split('&')
        for item in paramList:
            param = item.split('=')
            if (len(param) > 1):
                params[param[0]] = param[1]
    return params

def createExcel():
    if 'time.xlsx' in os.listdir('./'):
        return
    wb = openpyxl.Workbook()  # create the Excel workbook
    wb.active.title = 'Splash'
    wb.create_sheet(title='Home')
    wb.save('time.xlsx')

def writeToExecl(params):
    wb = openpyxl.load_workbook('time.xlsx')
    splashSheet = wb['Splash']
    homeSheet = wb['Home']
    pid = params.get('pid')  # .get avoids a KeyError when a line has no pid
    if pid == '04':  # splash screen
        writeToSheet(splashSheet, wb, params)
    elif pid == '00':  # home page
        writeToSheet(homeSheet, wb, params)

def getValue(t):
    return t.value

def writeToSheet(sheet, wb, params):
    rows = tuple(sheet.rows)
    index = len(rows) + 1  # insert the data starting at this row
    titles = list(map(getValue, rows[0]))  # contents of row 1, used as the header
    if titles == [None]:
        titles = []
    for k, v in params.items():
        if k in titles:
            sheet.cell(row = index, column = titles.index(k) + 1).value = v
        else:
            sheet.cell(row = 1, column = len(titles) + 1).value = k  # add the header first
            sheet.cell(row = index, column = len(titles) + 1).value = v
            titles.append(k)  # update the header list
    wb.save('time.xlsx')

def readAFile(filename):
    file = open(filename, errors='ignore')
    for line in file:
        params = getOneDict(line)
        if (len(params) == 0):
            continue
        writeToExecl(params)
    file.close()

createExcel()
for file in os.listdir('./web'):
    print('read file ' + os.path.basename(file))
    readAFile('./web/' + file)

The generated Excel file looks like this:

 

 


Origin www.cnblogs.com/chengxuyuanaa/p/12499007.html