Python Data Analysis and Visualization (1): Notes on Obtaining Data

I. Local data acquisition: files

1. The three steps of a file operation

Open the file -> read/write the file -> close the file

Why do we need to close the file? Because Python may buffer written data; if the program crashes unexpectedly, buffered data may never reach the file. To be safe, the file should be closed explicitly once reading and writing are done.

2. Opening a file

Use the open function: the first argument is the file name (which may include a path), the second indicates the read/write mode, and the third controls buffering

The first argument: required

The second argument: defaults to r (read-only) and may be omitted; the read/write modes include r, w, a, r+, w+, a+, etc.

The third argument: defaults to -1 (the system default buffer size) and may be omitted. In Python, binary files may be opened without a buffer, but text files must use one
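The buffering rule can be checked directly; a minimal sketch (the file name is made up for the demo):

```python
import os, tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.bin')

# buffering=0 (unbuffered) is accepted in binary mode
with open(path, 'wb', buffering=0) as f:
    f.write(b'raw bytes')

# the same request in text mode is rejected with a ValueError
try:
    open(path, 'w', buffering=0)
except ValueError as e:
    print('text mode refused:', e)
```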

The open function returns a file object, which is iterable, so we can traverse each of its items (i.e. its lines)
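For example (hello.txt is created first so the sketch is self-contained), iterating over the object returned by open yields one line per step:

```python
# create a small file to iterate over
with open('hello.txt', 'w') as f:
    f.write('line 1\nline 2\nline 3\n')

# the file object returned by open() is iterable: each item is one line
with open('hello.txt') as f:
    for line in f:
        print(line.rstrip('\n'))
```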

3. Reading and writing files

By convention, let f = open("hello.txt", 'w')

a. Writing a file: f.write('Hello, World')

# A better way to write
with open('hello.txt', 'w') as f:
    f.write('Hello, World')
# Note: the with statement closes the file handle automatically when the
# block finishes, so the program does not need an explicit close() call

b. Reading a file: f.read() takes an optional size argument, meaning read at most size bytes (or characters) of data from the file; it returns a string

By default (no size), the file is read to the end; the result is still a string
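A short sketch of both forms of read(), with hello.txt written first so the contents are known:

```python
with open('hello.txt', 'w') as f:
    f.write('Hello, World')

with open('hello.txt') as f:
    print(f.read(5))        # at most 5 characters: 'Hello'
    print(f.read())         # the rest, up to end of file: ', World'
    print(repr(f.read()))   # at end of file, read() still returns a string: ''
```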

c. Extensions:

f.readlines() reads multiple lines of data, returned as a list

f.readline() reads a single line of data

f.writelines() writes multiple lines of data
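The three extension methods in one sketch (the file name multi.txt is made up):

```python
lines = ['first\n', 'second\n', 'third\n']

# writelines() writes a sequence of lines; note it does NOT add newlines itself
with open('multi.txt', 'w') as f:
    f.writelines(lines)

with open('multi.txt') as f:
    print(f.readline())    # a single line: 'first\n'
    print(f.readlines())   # the remaining lines as a list: ['second\n', 'third\n']
```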

d. Moving the file pointer

f.seek(offset, whence=0) moves the file pointer within the file: it offsets by offset bytes from the position given by whence (0 means the start of the file, 1 the current position, 2 the end of the file). The whence argument is optional and defaults to 0
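A sketch of the three whence values (the file is opened in binary mode, since text-mode files only allow seeking relative to the start):

```python
with open('seek.txt', 'w') as f:
    f.write('0123456789')

with open('seek.txt', 'rb') as f:
    f.seek(3)          # 3 bytes from the start of the file (whence defaults to 0)
    print(f.read(2))   # b'34'
    f.seek(2, 1)       # 2 bytes forward from the current position
    print(f.read(1))   # b'7'
    f.seek(-1, 2)      # 1 byte back from the end of the file
    print(f.read())    # b'9'
```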

e. Further background

Python's three standard streams:

stdin (standard input), stdout (standard output), stderr (standard error)

In Python, the terminal's keyboard and display are in fact treated as files, implemented through the sys module; for example, print('hello, world') is essentially realized as:

import sys
sys.stdout.write('hello, world\n')  # print() also appends the trailing newline
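Because sys.stdout is itself a file object, it can even be swapped for an in-memory one, which makes the equivalence visible (and shows the trailing newline that print() adds):

```python
import io, sys

buf = io.StringIO()          # an in-memory text file
old_stdout = sys.stdout
sys.stdout = buf             # print() now writes into buf
print('hello, world')
sys.stdout = old_stdout      # restore the real standard output

print(repr(buf.getvalue()))  # 'hello, world\n' -- print() appended the '\n'
```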



II. Network data acquisition

1. Workflow

Acquiring network data has two stages: crawling -> parsing

2. Crawling

The requests library

Basic method: requests.get() requests the resource at the specified URL; it corresponds to the GET method of the HTTP protocol

Notes: the get method returns a Response object, which contains both the request information and the server's response information, and requests decodes the information from the server automatically. If the page information comes back in JSON format, it can be decoded with the object's .json() method; if it comes back in binary format, it is available through the object's .content attribute. In particular, the object's .text attribute infers the encoding of the text automatically and decodes it; the encoding in use is exposed through the .encoding attribute and can be modified, commonly to utf-8.

Example:

# Example 1
# Suppose the resource fetched is a binary file; it can be saved like this
import requests

r = requests.get('https://www.baidu.com/img/bd_logo1.png')
with open('baidu.png', 'wb') as fp:
    fp.write(r.content)

# Example 2
# As an anti-crawling measure, some sites check the User-Agent in the request
# headers, so header information must be passed via the headers parameter of
# get(). For example, zhihu.com returns 400 when accessed directly; after
# adding the headers parameter it responds correctly:
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.11 "
                         "(KHTML, like Gecko) Chrome/17.0.963.83 Safari/535.11"}
r = requests.get('https://www.zhihu.com', headers=headers)
print(r.status_code)

3. Parsing

After the source code is obtained, it needs to be parsed

For source code with a regular tag structure, the BeautifulSoup library is suited to parsing; for source code with a complex structure, regular expressions are suited to extraction

# BeautifulSoup parsing example
import requests
from bs4 import BeautifulSoup

r = requests.get('https://movie.douban.com/subject/10533913/')
soup = BeautifulSoup(r.text, 'lxml')
pattern = soup.find_all('span', 'short')
for item in pattern:
    print(item.string)
# Regular-expression parsing example
import requests
from bs4 import BeautifulSoup
import re
s = 0
r = requests.get('https://book.douban.com/subject/1165179/comments/')
soup = BeautifulSoup(r.text, 'lxml')
pattern = soup.find_all('span', 'short')
for item in pattern:
    print(item.string)
pattern_s = re.compile('<span class="user-stars allstar(.*?) rating">')
p = re.findall(pattern_s, r.text)
for star in p:
    s += int(star)
print(s)
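The star-extraction pattern can be exercised offline on a literal snippet (the HTML below is made up to imitate Douban's rating markup):

```python
import re

html = ('<span class="user-stars allstar40 rating"></span>'
        '<span class="user-stars allstar50 rating"></span>')

pattern_s = re.compile('<span class="user-stars allstar(.*?) rating">')
stars = re.findall(pattern_s, html)
print(stars)                        # ['40', '50']
print(sum(int(s) for s in stars))   # 90
```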

Finally, a complete piece of crawler code: crawl the first 50 comments on a book from Douban and compute the average rating

import requests, re, time
from bs4 import BeautifulSoup


count = 0       # comments printed so far
i = 0           # page index
s, count_del = 0, 0
lst_stars = []
while count < 50:
    try:
        r = requests.get('https://book.douban.com/subject/10517238/comments/hot?p=' + str(i + 1))
    except Exception as err:
        print(err)
        break
    soup = BeautifulSoup(r.text, 'lxml')
    comments = soup.find_all('span', 'short')
    pattern = re.compile('<span class="user-stars allstar(.*?) rating"')
    p = re.findall(pattern, r.text)

    for item in comments:
        count += 1
        if count > 50:
            count_del += 1   # comments beyond the first 50 on this page
        else:
            print(count, item.string)
    for star in p:
        lst_stars.append(int(star))
    time.sleep(5)
    i += 1

# drop the ratings belonging to comments beyond the first 50, then average
if count_del:
    lst_stars = lst_stars[:-count_del]
for star in lst_stars:
    s += star
if lst_stars:
    print(s // len(lst_stars))

 

Origin www.cnblogs.com/laideng/p/11440209.html