[Series] 6. Python crawler: local data storage

Section VI: local data storage
6.0 The os module
In everyday file handling, the trouble is often not the operation itself but the path.
For path handling you can use the os module.
The following code checks whether a folder exists and creates it if it does not.

import os

filename = "test"

if not os.path.exists(filename):   # check whether the folder exists
	os.mkdir(filename)             # create it if it does not


Here I created a folder named test.

print(os.getcwd())
print(os.path.join(os.getcwd(),"test"))  # join inserts the \ separator even if you do not add it

Output:

D:\科技\python\爬虫\大师
D:\科技\python\爬虫\大师\test
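# Writing a file:
import os

filename = "test"

if not os.path.exists(filename):   # check whether the folder exists
	os.mkdir(filename)             # create it if it does not


filename = os.getcwd()   # get the current directory
#print(filename)
txt = "test.txt"         # not "test": that name is already taken by the folder

with open(os.path.join(filename,txt),"w") as f:  # use join to build the full path
	f.write("test web")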
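The existence check and the path joining above can be folded into one small helper. This is only a minimal sketch: the helper name `write_text_in_folder` is hypothetical, and a temporary directory is used as the base so the demo does not touch the working directory. `os.makedirs(..., exist_ok=True)` creates the folder only when it is missing.

```python
import os
import tempfile


def write_text_in_folder(base_dir, folder, name, text):
    """Create base_dir/folder if needed, then write text to name inside it."""
    target_dir = os.path.join(base_dir, folder)
    # exist_ok=True makes this a no-op when the folder already exists
    os.makedirs(target_dir, exist_ok=True)
    path = os.path.join(target_dir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path


base = tempfile.mkdtemp()  # throwaway base directory for the demo
saved = write_text_in_folder(base, "test", "test.txt", "test web")
print(saved)
```

Calling the helper a second time with the same folder does not raise, which is the main convenience over a bare `os.mkdir`.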

A simple way to name files: when you need to download images or documents in bulk, you can use UUIDs as file names.

import uuid
from uuid import UUID 

# based on the timestamp
print(uuid.uuid1())
# MD5 hash of a namespace and a name
print(uuid.uuid3(UUID(int=1),"no"))
# based on random numbers (recommended)
print(uuid.uuid4())
# SHA-1 hash of a namespace and a name
print(uuid.uuid5(UUID(int=3),"zss"))

Output:

cf59c5b4-5843-11ea-9270-e86a64061ace
ced504fe-c732-3784-85e8-e4ef35e0834b
0a2f143c-af28-423b-98c8-82243d3b6be7
c442e430-9405-5bd2-aaf2-c880d2ae2655
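For bulk downloads, the random variant can be wrapped in a small naming helper. A minimal sketch, assuming a hypothetical function name `unique_filename`; it only builds names and does not download anything.

```python
import uuid


def unique_filename(extension):
    """Return a random file name with the given extension."""
    # uuid4().hex is 32 hexadecimal characters without dashes
    return "{}.{}".format(uuid.uuid4().hex, extension.lstrip("."))


names = [unique_filename("jpg") for _ in range(3)]
for n in names:
    print(n)
```

Because each name is random, repeated runs print different values; collisions are possible in principle but extremely unlikely.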

6.1 Working with text files in Python
r - read a file
w - create a file; writing again overwrites the existing content
a - append to a file; the file is created if it does not exist
b - binary mode (pictures, music, video; for example wb)
+ - read and write combined
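The difference between w and a is easiest to see by writing to the same file several times. A minimal sketch, assuming a temporary file named modes.txt (a hypothetical name chosen for the demo):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")

with open(path, "w", encoding="utf-8") as f:   # creates the file
    f.write("first\n")
with open(path, "w", encoding="utf-8") as f:   # overwrites the earlier content
    f.write("second\n")
with open(path, "a", encoding="utf-8") as f:   # appends to the end
    f.write("third\n")

with open(path, "r", encoding="utf-8") as f:
    content = f.read()
print(content)
```

After the three writes the file holds only "second" and "third": the second w discarded "first", while a kept what was already there.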
For ordinary text we only need to store it in a txt file, as follows:

with open("test.txt","w",encoding="utf-8") as f:
	f.write("test web \ntest web")

The file is opened and closed for us. The with statement is used because it closes the file automatically, so there is no need to call close(), and file handles are not leaked if an error occurs.
Reading the file is just as simple:

with open("test.txt","r",encoding="utf-8") as f:
	result=f.read()
	print(result)

Output:

test web 
test web

Or read all the lines into a list, as follows:

with open("test.txt","r",encoding="utf-8") as f:
	result=f.readlines()
	print(result)

Output: ['test web \n', 'test web']
Further notes:

  1. readline()
with open("test.txt","r",encoding="utf-8") as f:
	result=f.readline()
	print(result)

Output (only the first line is read):

test web
  2. In everyday use, prefer readlines()
    It reads the txt file line by line into a list, which is convenient to iterate over,
    as follows:
with open("test.txt","r",encoding="utf-8") as f:
	result=f.readlines()
	for r in result:
		print(r.strip()) # strip surrounding whitespace, including the newline

Output:

test web
test web
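For large files there is also a variant that avoids building the whole list first: the file object itself can be iterated, yielding one line at a time. A minimal sketch, assuming a temporary copy of test.txt so the demo is self-contained:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("test web \ntest web")

lines = []
with open(path, "r", encoding="utf-8") as f:
    for line in f:                  # the file object yields one line at a time
        lines.append(line.strip())  # strip the newline and trailing spaces

print(lines)  # ['test web', 'test web']
```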

Storing images, music and other binary streams is also very simple,
as we saw earlier when crawling Baidu images:

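# images and video
import uuid
import requests

def download(url):
	# headers is assumed to be the request-headers dict defined earlier in this series
	# stream=True defers the download so iter_content really reads piece by piece
	img = requests.get(url,headers=headers,stream=True)
	with open("imgs/{}.jpg".format(uuid.uuid4()),"wb") as f:
		chunks = img.iter_content(125)
		for c in chunks:
			f.write(c)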

It is worth mentioning that chunks = img.iter_content(125)
splits the binary stream into small pieces (here 125 bytes each),
so memory is not filled up in one go; this is a big advantage when handling large files. Note that requests only defers the download when get() is called with stream=True.
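The chunking idea itself does not depend on the network. A minimal sketch of the same loop over in-memory streams, assuming a hypothetical helper `copy_in_chunks` and 1024 bytes of stand-in data instead of a real image:

```python
import io


def copy_in_chunks(source, target, chunk_size=125):
    """Copy a binary stream to target chunk by chunk; return the chunk count."""
    count = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:           # empty bytes means the stream is exhausted
            break
        target.write(chunk)
        count += 1
    return count


data = bytes(range(256)) * 4    # 1024 bytes of stand-in "image" data
src, dst = io.BytesIO(data), io.BytesIO()
n = copy_in_chunks(src, dst)
print(n, dst.getvalue() == data)  # 9 True
```

Only one chunk is held in memory at a time, so the peak memory use depends on chunk_size rather than on the size of the file.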

6.2 Working with CSV in Python
For everyday storage txt is fine, but at work data is mostly passed around as csv, Excel or Word files.
Full name: Comma-Separated Values (CSV; sometimes called character-separated values, because the separator does not have to be a comma). A CSV file stores tabular data (numbers and text) as plain text.

CSV uses the comma as its separator and generally looks like this:
Symbol,Price,Date,Time,Change,Volume
"AA",39.48,"6/11/2007","9:36am",-0.18,181800
"AIG",71.38,"6/11/2007","9:36am",-0.15,195500
"AXP",62.58,"6/11/2007","9:36am",-0.46,935000
"BA",98.31,"6/11/2007","9:36am",+0.12,104800
"C",53.08,"6/11/2007","9:36am",-0.25,360900
"CAT",78.29,"6/11/2007","9:36am",-0.23,225400
1) Reading
Unlike txt, reading CSV goes through a dedicated csv.reader() object.
next() takes off the first row (the header); the remaining rows are then iterated.

import csv

def get_csv():
	with open("test.csv",encoding="utf-8") as f:
		f_csv = csv.reader(f)
		header =next(f_csv)
		for row in f_csv:
			print(row)
if __name__ == '__main__':
	get_csv()

Output:

['lilei', '12']
['hanmeimei', '100']

2) Writing
When writing CSV, note that the file object f must first be passed to csv.writer() to obtain a writer handle.
The data to write is a list of tuples (or nested lists): writerow writes one row, writerows writes several rows.

Writing uses the same csv module as reading, in the following form:

import csv

headers=["12522SS","715282","4FB55FE8"]   # one list element per column
rows = [("a",1,300),("b",2,420)]

with open("test1.csv","w",newline="") as f:   # newline="" avoids blank lines on Windows
	f_csv = csv.writer(f)
	f_csv.writerow(headers)
	f_csv.writerows(rows)
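A quick way to check what was written is to read the file straight back. A minimal sketch of the round trip, assuming hypothetical column names (name, rank, score) and a temporary file:

```python
import csv
import os
import tempfile

headers = ["name", "rank", "score"]          # hypothetical column names
rows = [("a", 1, 300), ("b", 2, 420)]
path = os.path.join(tempfile.mkdtemp(), "test1.csv")

# newline="" lets the csv module control line endings itself
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(headers)     # one row
    writer.writerows(rows)       # several rows

with open(path, newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)        # first row is the header
    body = list(reader)          # every value comes back as a string

print(header)
print(body)
```

Note that the numbers come back as strings; CSV stores no type information, so any conversion to int or float has to be done after reading.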

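Wrapped into functions:

import csv
import requests

def write_csv(data):
	# "a" appends, so every call adds one row; the file name is a placeholder
	with open("output_file_name.csv","a",newline="") as f:
		f_csv = csv.writer(f)
		f_csv.writerow(data)

def crawl():
	html = requests.get("url")   # "url" is a placeholder for the page to crawl
	text = html.text
	# one CSV row per line of the response, as a simple stand-in for real parsing
	for line in text.splitlines():
		write_csv([line])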

Writing dictionaries to CSV (an extension; avoid where possible)
Dictionaries are hash-based and were traditionally unordered,
but Python provides an ordered dictionary:
from collections import OrderedDict  (the ordered dictionary type)
OrderedDict([('Symbol', 'AA'),])
Background: why were dictionaries unordered (before version 3.6)?
A hash table takes the key, runs a function called a hash function on it and, based on the result, chooses the address in the data structure where the value is stored. The address of a value therefore depends entirely on its key.

Because these addresses are effectively arbitrary, the hash table keeps no order:
what you have is an unordered collection of data. Since Python 3.7, the built-in dict is guaranteed to preserve insertion order.
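For writing dictionaries, the csv module has csv.DictWriter, where the column order is fixed by the fieldnames argument rather than by the dictionary. A minimal sketch, assuming two records built from the sample symbols above and an in-memory buffer instead of a real file:

```python
import csv
import io

records = [
    {"Symbol": "AA", "Price": 39.48},
    {"Symbol": "AIG", "Price": 71.38},
]

buffer = io.StringIO()           # in-memory stand-in for a file opened with newline=""
writer = csv.DictWriter(buffer, fieldnames=["Symbol", "Price"])
writer.writeheader()             # column order comes from fieldnames
writer.writerows(records)

buffer.seek(0)                   # rewind before reading back
loaded = [dict(row) for row in csv.DictReader(buffer)]
print(loaded)
```

csv.DictReader uses the header row as keys, so each row comes back as a dictionary with string values.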

6.3 Working with JSON files
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is a text format that is completely language independent, which makes it an ideal data exchange language: easy for people to read and write, and easy for machines to parse and generate.

The json module provides simple methods for encoding and decoding the JSON format.
The interface is small: mainly json.dumps() and json.loads(),
while json.dump() and json.load() are used when writing to and reading from files.
To convert a Python data structure to JSON, write:

import json


data = {
	"name":"hanmeimei",
	"score":99
}

res = json.dumps(data)
print(res)

Output:

{"name": "hanmeimei", "score": 99}	

Of course, you can also write to the file:

with open("data.json","w") as f:
	json.dump(data,f)

Parsing a JSON string into a Python structure is just as easy.

import json
data = '{"name":"hanmeimei","score":"100"}'
res = json.loads(data)
print(res)

Output:

{'name': 'hanmeimei', 'score': '100'}

If the JSON is stored in a file rather than held in a string, write:

with open("data.json","r") as f:
	print(json.load(f))

Output:

{'name': 'hanmeimei', 'score': 99}
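When the data contains non-ASCII text, two optional arguments of json.dump are often useful. A minimal sketch, assuming a hypothetical extra field "city" and a temporary file; ensure_ascii=False keeps the characters readable in the file, and indent pretty-prints the output.

```python
import json
import os
import tempfile

# "city" is a made-up field added only to show non-ASCII handling
data = {"name": "hanmeimei", "score": 99, "city": "北京"}

path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w", encoding="utf-8") as f:
    # ensure_ascii=False writes 北京 as-is instead of \uXXXX escapes
    json.dump(data, f, ensure_ascii=False, indent=2)

with open(path, "r", encoding="utf-8") as f:
    restored = json.load(f)

print(restored == data)  # True
```

The round trip restores the same dictionary; the integer 99 stays an integer because JSON distinguishes numbers from strings.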

Note: sections 6.4 and 6.5 are for awareness only, as they are not commonly used.
Interested readers can learn to use Python for office automation.
6.4 Working with Excel in Python
(use csv where possible)
Install the modules: pip install xlrd xlwt

Description:
xlwt is for creating and writing workbooks; xlrd is for reading them:
one for w, one for r.
To get the data in a specified cell, use
sheet.cell_value(1, i)

6.5 Writing Word documents in Python (basic needs):
Install the module: pip install python-docx
It is mostly used to read Word files, though.
Interested readers can follow my blog column, which covers some office automation operations:
https://blog.csdn.net/ai_linnglong/category_9718088.html


Origin blog.csdn.net/AI_LINNGLONG/article/details/104515712