5. xlwt package learning (reading and writing Excel tables)

Learning how xlwt reads and writes Excel lets us persist the useful data we crawl so it can be used at any time, and provides data for the later visualization stage of the crawler.

For example, create a new workbook and write some given data into it.

Fixed steps
1. Import the package
import xlwt
2. Create a workbook object
workbook = xlwt.Workbook(encoding="utf-8")
3. Create a worksheet
sheet1 = workbook.add_sheet("sheet1")

A worksheet is a sheet inside the Excel file; in Excel you can add or rename sheets by hand, and the code operations follow the same idea.

4. Write data
sheet1.write(0, 0, "hello") 

The parameters of write correspond to row, column, and value; just note that both the row and the column indices start from 0 (see the sketch after these steps).

5. Save the form
workbook.save('student.xls')

Open Excel to check whether the data was written successfully.
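To make the 0-based indexing concrete, here is a minimal sketch of all five steps in one place (demo.xls and the cell values are just example choices):

    import xlwt

    workbook = xlwt.Workbook(encoding="utf-8")
    sheet1 = workbook.add_sheet("sheet1")
    sheet1.write(0, 0, "hello")   # row 0, column 0 -> cell A1 in Excel
    sheet1.write(0, 1, "world")   # row 0, column 1 -> cell B1
    sheet1.write(2, 3, "corner")  # row 2, column 3 -> cell D3
    workbook.save('demo.xls')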

A small exercise: write the 9×9 multiplication table to an Excel sheet
    multiplicationBook = xlwt.Workbook(encoding="utf-8")
    multipsheet = multiplicationBook.add_sheet("sheet1")
    for i in range(1, 10):
        for j in range(1, i+1):
            multipsheet.write(i-1, j-1, str(i)+"*"+str(j)+"="+str(i*j))
    multiplicationBook.save('multip.xls')

Read Excel data

Take the 9×9 multiplication table generated above as the object to read.

1. Import the package
import xlrd
2. Open the Excel workbook

The parameter can be either an absolute or a relative path to the Excel file.

data = xlrd.open_workbook('multip.xls')

3. Get a worksheet

Two ways: get by index order, get by name

  • Get by index order
    table = data.sheets()[0]
    table = data.sheet_by_index(0)
  • Get by name
    table = data.sheet_by_name(u'sheet1')
4. Get an entire row or column, get a cell, and get the number of rows and columns
  • Get the value of an entire row or an entire column (returns a list)
    table.row_values(0)
    table.col_values(0)

The parameters of row_values and col_values are the row index and the column index respectively.

  • Get cell
    table.cell(0,0).value

cell parameters: row, column

  • Get the number of rows and columns
    print(table.nrows)
    print(table.ncols)
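
Putting the four reading steps together, here is a minimal sketch that opens the multip.xls file generated above and prints each row of the multiplication table (the two-space separator is just a formatting choice):

    import xlrd

    data = xlrd.open_workbook('multip.xls')
    table = data.sheet_by_index(0)
    for r in range(table.nrows):
        # join the non-empty cells of row r into a single printed line
        print("  ".join(str(v) for v in table.row_values(r) if v))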

Change Excel data

To modify data, you need to read all of the table's data into memory, change it there, and then write the modified data back out to Excel, as the sketch below shows.
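
Here is a minimal sketch of that read-modify-rewrite pattern, sticking to the xlrd and xlwt packages used in this article (cell (0, 0) and the output file name are just example choices; the third-party xlutils package also offers a copy helper for this pattern):

    import xlrd
    import xlwt

    # read the whole sheet into memory
    old = xlrd.open_workbook('multip.xls')
    table = old.sheet_by_index(0)
    rows = [table.row_values(r) for r in range(table.nrows)]

    # modify the data in memory
    rows[0][0] = "modified"  # example change

    # write everything back out as a new workbook
    new_book = xlwt.Workbook(encoding="utf-8")
    new_sheet = new_book.add_sheet("sheet1")
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            new_sheet.write(r, c, value)
    new_book.save('multip_modified.xls')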


Going further

Using Python's xlwt module to operate on Excel


Task-driven development: store the data crawled from Douban's Top 250 movies in a spreadsheet

import re  # regular expressions, for text matching
# import bs4  # only BeautifulSoup from bs4 is needed, so it can be imported as follows:
from bs4 import BeautifulSoup  # web page parsing, data extraction
import xlwt  # Excel operations
import sqlite3  # SQLite database operations
import urllib.request, urllib.error  # build the url, fetch the page data


def main():
    # the pages to crawl
    baseurl = "https://movie.douban.com/top250?start="
    # # save path
    savepath = ".\\豆瓣电影Top250.xls"  # use \\ in the path, or prefix the whole string with r: r".\豆瓣电影Top250"
    savepath2Db = "movies.db"
    # # 1. crawl the pages
    # print(askURL(baseurl))
    datalist = getData(baseurl)
    print(datalist)
    # # 3. save the data (to Excel)
    saveData(datalist, savepath)


# rule for the movie detail link
findLink = re.compile(r'<a href="(.*?)">')  # create a regular expression object
# rule for the movie poster image link
findImgSrc = re.compile(r'<img alt=".*src="(.*?)"', re.S)  # re.S lets . match newlines
# movie title
findTitle = re.compile(r'<span class="title">(.*)</span>')
# movie rating
findRating = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')
# number of ratings
# findJudge = re.compile(r'<span>(\d*)(.*)人评价</span>')
findJudge = re.compile(r'<span>(\d*)人评价</span>')
# one-line summary
findInq = re.compile(r'<span class="inq">(.*)</span>')
# related movie details
findBd = re.compile(r'<p class="">(.*?)</p>', re.S)  # the text contains <br/>, so newlines must be matched


# crawl the web pages
def getData(baseurl):
    datalist = []
    for i in range(0, 10):  # 25 movies per page, 10 pages in total
        url = baseurl + str(i*25)
        html = askURL(url)  # save the fetched page source
        # print(html)
        # 2. parse the data (page by page)
        soup = BeautifulSoup(html, "html.parser")  # parse the html document into a tree structure using html.parser
        for item in soup.find_all("div", class_="item"):  # find all matching tags, producing a list
            # print(item)
            data = []  # holds the information of one movie
            item = str(item)
            # movie detail link
            link = re.findall(findLink, item)[0]
            data.append(link)
            # poster image
            img = re.findall(findImgSrc, item)[0]
            data.append(img)
            # titles
            titles = re.findall(findTitle, item)
            if len(titles) == 2:
                ctitle = titles[0]  # Chinese title
                data.append(ctitle)
                otitle = titles[1].replace("/", "")
                data.append(otitle)  # foreign title
            else:
                data.append(titles[0])
                data.append(' ')  # leave the foreign title blank
            # data.append(title)
            # rating
            rating = re.findall(findRating, item)[0]
            data.append(rating)
            # number of ratings
            judgeNum = re.findall(findJudge, item)[0]
            # print(judgeNum)
            data.append(judgeNum)
            # add the one-line summary
            inq = re.findall(findInq, item)
            if len(inq) == 0:
                data.append(" ")
            else:
                data.append(inq[0].replace("。", ""))  # drop the trailing Chinese full stop
            # related movie details
            bd = re.findall(findBd, item)[0]
            bd = re.sub(r'<br(\s+)?/>(\s+)?', " ", bd)  # remove <br/>
            bd = re.sub('/', " ", bd)  # replace / with a space
            data.append(bd.strip())  # strip leading and trailing whitespace

            datalist.append(data)  # save the processed information of one movie
        # for it in datalist:
        #     print(it)
    return datalist


# fetch the page content of the given url
def askURL(url):
    # request headers; the user agent disguises the request as a browser visit
    head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/87.0.4280.88 Safari/537.36"}
    req = urllib.request.Request(url, headers=head)
    html = ""  # the fetched page source
    try:
        response = urllib.request.urlopen(req)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):  # has attribute
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html


def saveData(datalist, savepath):
    book = xlwt.Workbook(encoding="utf-8", style_compression=0)  # style_compression: style compression level
    sheet = book.add_sheet("豆瓣电影top250", cell_overwrite_ok=True)  # cell contents may be overwritten
    col = ("电影详情链接", "图片链接", "影片中文名", "影片外文名", "评分", "评价数", "概述", "相关信息")  # header row as a tuple
    for i in range(8):  # write the header (column names)
        sheet.write(0, i, col[i])

    for i in range(1, len(datalist)+1):
        for j in range(len(datalist[i-1])):
            sheet.write(i, j, datalist[i-1][j])
    book.save(savepath)




if __name__ == '__main__':
    main()

The data was successfully saved to the Excel sheet.

Origin blog.csdn.net/qq_43808700/article/details/113593341