Scraping bilibili user information

URL

https://space.bilibili.com/

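Opening it may redirect you to the login page. Log in and inspect the page; your own profile page looks like this:

Then open another user's profile and compare the URLs:

The only difference is the number at the end of the URL. Try opening a few more profiles to confirm.

Next, build the request headers.

The code is as follows: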
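Because a profile URL is just the base address followed by a numeric user id, converting between the two is plain string handling. The following is a minimal sketch; the helper names `profile_url` and `user_id_from_url` are hypothetical and do not appear in the original code.

```python
from urllib.parse import urlparse

BASE_URL = 'https://space.bilibili.com/'


def profile_url(user_id):
    """Build the profile URL for a numeric user id."""
    return BASE_URL + str(user_id)


def user_id_from_url(url):
    """Extract the numeric user id from a profile URL, ignoring any query string."""
    path = urlparse(url).path  # for example '/4899781/'
    return int(path.strip('/').split('/')[0])
```

For instance, `user_id_from_url('https://space.bilibili.com/4899781/')` gives the integer id that the rest of the scraper works with.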

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    'Referer': 'https://space.bilibili.com/4899781/',
    'Origin': 'http://space.bilibili.com',
    'Host': 'space.bilibili.com',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
}

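Set up the proxy IPs:

The code is as follows:

import random

# A dict keeps one value per key, so a second 'http' entry would replace the first.
proxy_pool = [
    'http://118.190.95.35:9001',
    'http://121.49.110.65:8888',
]
proxies = {'http': random.choice(proxy_pool)}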

Build the URL list:

The code is as follows:

urls = []
for x in range(1,2):
    for i in range(x * 100,(x+1) * 100):
        url = 'https://space.bilibili.com/' + str(i)
        print(url)
        urls.append(url)
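The nested loop walks the user ids in batches of 100: with `range(1,2)` the outer loop runs once, so the list covers ids 100 through 199. A sketch of the same logic wrapped in a function may make the batch bounds easier to change; the name `build_urls` and its parameters are assumptions for illustration only.

```python
def build_urls(first_batch, last_batch, batch_size=100):
    """Return profile URLs for ids in [first_batch * batch_size, last_batch * batch_size)."""
    urls = []
    for x in range(first_batch, last_batch):
        # Each batch x covers ids x * batch_size up to (x + 1) * batch_size - 1.
        for i in range(x * batch_size, (x + 1) * batch_size):
            urls.append('https://space.bilibili.com/' + str(i))
    return urls
```

Calling `build_urls(1, 2)` reproduces the list built above; widening the range, for example `build_urls(1, 10)`, covers ids 100 through 999.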

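Fetch the data:

import random
import time

import requests

# User-Agent strings to rotate through; add more entries as needed.
uas = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
]

def getSource(url):
    # Assumption: the endpoint takes the user id (the number ending the URL) as form field 'mid'.
    payload = {'mid': url.replace('https://space.bilibili.com/', '')}

    ua = random.choice(uas)

    headers = {
        'User-Agent': ua,
        # Referer with a randomly generated seid
        'Referer': 'https://space.bilibili.com/' + str(1) + '?from=search&seid=' + str(random.randint(10000, 50000))
    }

    time1 = time.time()  # start time, used later to report how long the request took
    jscontent = requests.session().post('http://space.bilibili.com/ajax/member/GetInfo',
                headers=headers,
                data=payload,
                proxies=proxies).text
    time2 = time.time()
    # The parsing and storage steps below use jscontent, time1, time2 and url,
    # so they run inside this function.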

Parse the data:

The response is a dictionary, which we parse as follows:

import json  # belongs with the other imports at the top of the script

try:
    jsDict = json.loads(jscontent)
    statusJson = jsDict['status'] if 'status' in jsDict.keys() else False
    if statusJson == True:
        if 'data' in jsDict.keys():
            jsData = jsDict['data']
            mid = jsData['mid']
            name = jsData['name']
            sex = jsData['sex']
            rank = jsData['rank']
            face = jsData['face']
            # Convert the registration timestamp to a formatted time string
            regtimestamp = jsData['regtime']
            regtime_local = time.localtime(regtimestamp)
            regtime = time.strftime("%Y-%m-%d %H:%M:%S", regtime_local)
            spacesta = jsData['spacesta']
            # Fall back to a placeholder when the birthday field is missing
            birthday = jsData['birthday'] if 'birthday' in jsData.keys() else 'nobirthday'
            sign = jsData['sign']
            level = jsData['level_info']['current_level']
            OfficialVerifyType = jsData['official_verify']['type']
            OfficialVerifyDesc = jsData['official_verify']['desc']
            vipType = jsData['vip']['vipType']
            vipStatus = jsData['vip']['vipStatus']
            toutu = jsData['toutu']
            toutuId = jsData['toutuId']
            coins = jsData['coins']
            print("Succeed get user info:" + str(mid) + '\t' + str(time2 - time1))
except Exception as e:
    print(e)
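The parsing step can also be written as a small function that returns a dictionary, which makes it possible to check the logic without touching the network. This is only a sketch: the function name `parse_user` is hypothetical, it extracts just a few of the fields above, and it assumes the response has the `status`/`data` layout that the parsing code relies on.

```python
import json
import time


def parse_user(jscontent):
    """Return a dict of selected user fields, or None if the response is unusable."""
    try:
        js_dict = json.loads(jscontent)
    except ValueError:
        # json.loads raises a ValueError subclass on malformed input.
        return None
    if not isinstance(js_dict, dict):
        return None
    if js_dict.get('status') is not True or 'data' not in js_dict:
        return None
    data = js_dict['data']
    # Convert the registration timestamp to a formatted local-time string.
    regtime = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(data['regtime']))
    return {
        'mid': data['mid'],
        'name': data['name'],
        'regtime': regtime,
        # Fall back to a placeholder when the birthday field is missing.
        'birthday': data.get('birthday', 'nobirthday'),
        'level': data['level_info']['current_level'],
    }
```

Returning `None` for a bad response lets the caller skip that user instead of relying on a broad `except` around every field access.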

Next, store the data in a database.

Here we use MySQL. First we need to create a table; the statements for that are as follows:

DROP TABLE IF EXISTS `bilibili_user_info`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `bilibili_user_info` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `mid` int(20) unsigned NOT NULL,
  `name` varchar(45) NOT NULL,
  `sex` varchar(45) NOT NULL,
  `rank` varchar(45) NOT NULL,
  `face` varchar(200) NOT NULL,
  `regtime` varchar(45) NOT NULL,
  `spacesta` varchar(45) NOT NULL,
  `birthday` varchar(45) NOT NULL,
  `sign` varchar(300) NOT NULL,
  `level` varchar(45) NOT NULL,
  `OfficialVerifyType` varchar(45) NOT NULL,
  `OfficialVerifyDesc` varchar(100) NOT NULL,
  `vipType` varchar(45) NOT NULL,
  `vipStatus` varchar(45) NOT NULL,
  `toutu` varchar(200) NOT NULL,
  `toutuId` int(20) unsigned NOT NULL,
  `coins` int(20) unsigned NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
/*!40101 SET character_set_client = @saved_cs_client */;

--
-- Dumping data for table `bilibili_user_info`
--

LOCK TABLES `bilibili_user_info` WRITE;
UNLOCK TABLES;

Insert the record into the database. The code is as follows:

import pymysql

try:
    # Fill in your own MySQL connection details.
    conn = pymysql.connect(host='localhost', user='root', password='123456', database='weixin', charset='utf8')
    cur = conn.cursor()
    # Pass the values separately so the driver escapes them, and insert NULL
    # for id so AUTO_INCREMENT assigns it (a fixed id of 1 fails on the second row).
    cur.execute('INSERT INTO bilibili_user_info VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)',
                (None, mid, name, sex, rank, face, regtime, spacesta, birthday, sign, level,
                 OfficialVerifyType, OfficialVerifyDesc, vipType, vipStatus, toutu, toutuId, coins))
    conn.commit()
    conn.close()

except Exception as e:
    print('Failed to write to the database', e, url)
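Naming the columns explicitly keeps the statement valid even if the table later gains or reorders columns, and leaving out `id` lets AUTO_INCREMENT fill it in. The sketch below builds such a statement from a dictionary of values; `COLUMNS` and `build_insert` are hypothetical names, and the column list is copied from the table definition above.

```python
COLUMNS = [
    'mid', 'name', 'sex', 'rank', 'face', 'regtime', 'spacesta', 'birthday',
    'sign', 'level', 'OfficialVerifyType', 'OfficialVerifyDesc', 'vipType',
    'vipStatus', 'toutu', 'toutuId', 'coins',
]


def build_insert(user):
    """Return (sql, params) for a parameterized INSERT into bilibili_user_info."""
    # Backticks are needed because names such as `rank` can clash with keywords.
    column_sql = ', '.join('`' + c + '`' for c in COLUMNS)
    placeholders = ', '.join(['%s'] * len(COLUMNS))
    sql = ('INSERT INTO bilibili_user_info (' + column_sql + ') '
           'VALUES (' + placeholders + ')')
    params = tuple(user[c] for c in COLUMNS)
    return sql, params
```

The pair can then be passed to a pymysql cursor as `cur.execute(sql, params)`, which escapes each value instead of splicing it into the SQL string.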

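That completes each step of the task.

Combining the code above into a single script makes it possible to crawl a large number of users.

The resulting rows in the database look like this:

With that, we have scraped the information of registered users.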
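One simple way to tie the pieces together is a loop that calls the fetch function for each URL, pauses between requests, and records which URLs failed. This is a sketch under stated assumptions, not necessarily how the original script was combined: the name `crawl` and the `delay` parameter are hypothetical, and `fetch` stands for any function that takes a URL, such as `getSource`.

```python
import time


def crawl(urls, fetch, delay=0.0):
    """Call fetch(url) for every URL; return (succeeded, failed) lists of URLs."""
    succeeded, failed = [], []
    for url in urls:
        try:
            fetch(url)
            succeeded.append(url)
        except Exception as e:
            # Keep going after a failure so one bad user does not stop the run.
            print('failed:', url, e)
            failed.append(url)
        time.sleep(delay)  # pause between requests to limit the request rate
    return succeeded, failed
```

A call such as `crawl(urls, getSource, delay=1)` would process the URL list built earlier with a one-second pause between requests; the failed list can be retried afterwards.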


Reposted from blog.csdn.net/qq_39138295/article/details/83350412