Getting Started with Python Crawler | 6 Store the crawled data locally



1. Storing data with Python statements

When writing files, we mainly use the with open() statement:

with open(name, mode, encoding=encoding) as file:
    file.write()
    # Note: the statements inside with open() are indented

name: a string containing the file name, e.g. 'xiaozhu.txt'; mode: determines how the file is opened (read-only, write, append, etc.); encoding: the encoding used for the data we write, generally utf-8 or gbk; file: the name by which we refer to the opened file in our code.
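To make name, mode, and encoding concrete, here is a minimal sketch (the file name 'demo.txt' is just a made-up example) of the three most common modes:

with open('demo.txt', 'w', encoding='utf-8') as file:   # 'w': write, creates or overwrites the file
    file.write('first line\n')

with open('demo.txt', 'a', encoding='utf-8') as file:   # 'a': append to the end of the file
    file.write('second line\n')

with open('demo.txt', 'r', encoding='utf-8') as file:   # 'r': read-only
    print(file.read())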

Let's use the Xiaozhu listings we crawled earlier to see what this actually looks like:

from lxml import etree
import requests
import time

with open('/Users/mac/Desktop/xzzf.txt', 'w', encoding='utf-8') as f:
    for a in range(1, 6):
        # Build the URL for each of the first five listing pages
        url = 'http://cd.xiaozhu.com/search-duanzufang-p{}-0/'.format(a)
        data = requests.get(url).text

        s = etree.HTML(data)
        file = s.xpath('//*[@id="page_list"]/ul/li')
        time.sleep(3)   # pause between pages so we don't hammer the site

        for div in file:
            title = div.xpath("./div[2]/div/a/span/text()")[0]
            price = div.xpath("./div[2]/span[1]/i/text()")[0]
            scrible = div.xpath("./div[2]/div/em/text()")[0].strip()
            pic = div.xpath("./a/img/@lazy_src")[0]

            # Write one listing per line, fields separated by spaces
            f.write("{}  {}  {}  {}\n".format(title, price, scrible, pic))

The file xzzf.txt will be written to; if it does not exist, it will be created automatically.

/Users/mac/Desktop/xzzf.txt 

If you put a desktop path in front of the file name, the file will be saved on the desktop; if you do not add a path, it will be saved in your current working directory.

w: write-only mode; if the file does not exist, it will be created automatically;

encoding='utf-8': specifies the encoding of the written file; utf-8 is the usual choice;

f.write("{}  {}  {}  {}\n".format(title,price,scrible,pic))
#将 title,price,scrible,pic 的值写入文件
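If str.format() is new to you, here is a quick sketch with made-up values showing how each placeholder is filled in order:

# Each {} is replaced by the corresponding argument to format()
line = "{}  {}  {}  {}\n".format('Cozy loft', 328, 'Near the subway', 'http://example.com/a.jpg')
print(line)   # Cozy loft  328  Near the subway  http://example.com/a.jpg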

Let's see what the saved data looks like:

If you don't specify a file path, how do you find the file that was written locally? Here are two methods:

1. Open Cortana in Windows 10 and search for your file name


2. Use the software "Everything", which makes searching for files quicker and more convenient.

The program is tiny and a quick Baidu search will turn it up, but it really is a gem. You will come back to thank me~

So it is still recommended that you put the path where you want to store the file in front of the file name when writing your code. What, you don't even know how to write the path? Well, suppose I want to save the file on the desktop; how do I find out the path?

Pick any document, for example one on the desktop, right-click it and choose "Properties"; the information after "Location" is the path where that document lives.
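A third option, not from the original write-up: ask Python itself where it is working, since a bare file name with no path ends up in the current working directory.

import os

# The directory that open('xzzf.txt', 'w') writes into when no path is given
print(os.getcwd())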

2. Save the file in CSV format

Of course, you can also save the file in .csv format: simply change the file extension in the with open() statement.

from lxml import etree
import requests
import time

with open('/Users/mac/Desktop/xiaozhu.csv', 'w', encoding='utf-8') as f:
    for a in range(1, 6):
        url = 'http://cd.xiaozhu.com/search-duanzufang-p{}-0/'.format(a)
        data = requests.get(url).text

        s = etree.HTML(data)
        file = s.xpath('//*[@id="page_list"]/ul/li')
        time.sleep(3)

        for div in file:
            title = div.xpath("./div[2]/div/a/span/text()")[0]
            price = div.xpath("./div[2]/span[1]/i/text()")[0]
            scrible = div.xpath("./div[2]/div/em/text()")[0].strip()
            pic = div.xpath("./a/img/@lazy_src")[0]

            # CSV fields are separated by commas
            f.write("{},{},{},{}\n".format(title, price, scrible, pic))

In addition, note that the fields of a CSV file must be separated by commas, which is why the spaces used earlier are replaced with commas here.

How do you open a CSV file?

Normally you can open it directly with Notepad. If you open it directly with Excel, garbled characters are very likely to appear, like the following:

What should I do if garbled characters appear when Excel opens CSV?

  1. Open the file in Notepad
  2. Choose Save As and select "ANSI" as the encoding
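Alternatively, you can sidestep the Notepad step by writing the file with Python's built-in csv module and the 'utf-8-sig' encoding, which Excel reads correctly. This is only a sketch of the idea with made-up rows, not what the crawler code above does:

import csv

# Made-up example rows; in the crawler these would be the scraped fields
rows = [
    ['title', 'price', 'scrible', 'pic'],
    ['Cozy loft', '328', 'Near the subway', 'http://example.com/a.jpg'],
]

# newline='' avoids blank lines between rows on Windows; 'utf-8-sig' adds a
# BOM so Excel detects the encoding and displays Chinese text correctly.
with open('xiaozhu.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)   # csv.writer also quotes fields that contain commas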

Now let's look at writing the Douban Top 250 books data from before to a file:

from lxml import etree
import requests
import time

with open('/Users/mac/Desktop/top250.csv', 'w', encoding='utf-8') as f:
    for a in range(10):
        # Each page shows 25 books, so the start parameter goes up in steps of 25
        url = 'https://book.douban.com/top250?start={}'.format(a * 25)
        data = requests.get(url).text

        s = etree.HTML(data)
        file = s.xpath('//*[@id="content"]/div/div[1]/div/table')
        time.sleep(3)

        for div in file:
            title = div.xpath("./tr/td[2]/div[1]/a/@title")[0]
            href = div.xpath("./tr/td[2]/div[1]/a/@href")[0]
            score = div.xpath("./tr/td[2]/div[2]/span[2]/text()")[0]
            # Strip the surrounding parentheses and whitespace from the rating count
            num = div.xpath("./tr/td[2]/div[2]/span[3]/text()")[0].strip("(").strip().strip(")").strip()
            scrible = div.xpath("./tr/td[2]/p[2]/span/text()")

            # Some books have no one-line description, so only write it when present
            if len(scrible) > 0:
                f.write("{},{},{},{},{}\n".format(title, href, score, num, scrible[0]))
            else:
                f.write("{},{},{},{}\n".format(title, href, score, num))

The data that finally gets saved looks like this:

OK, that's it for this installment! If you like Python and want to learn it, you're welcome to join my Python learning group: 663033228

