Scraping Course Data from NetEase Cloud Classroom (网易云课堂) and NetEase Open Course (网易公开课)

Without further ado, here's the code:
import requests
import json

def getdata(index):
    input("getdata() called, press Enter to continue")  # debug pause
    print(f"Fetching page {index} ...")
    payload = {"pageIndex": index,
               "pageSize": 700,
               "relativeOffset": 50,
               "frontCategoryId": 400000001295013,
               "searchTimeType": -1,
               "orderType": 50,
               "priceType": -1,
               "activityId": 0,
               "keyword": ""
    }
    payload = json.dumps(payload)  # the endpoint expects a raw JSON body
    headers = {"Accept": "application/json",
               "Host": "study.163.com",
               "Origin": "https://study.163.com",
               "Content-Type": "application/json",
               "Referer": "https://study.163.com/courses",
               "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36"
    }
    req = requests.post("https://study.163.com/p/search/studycourse.json", data=payload, headers=headers)
    input("POST done, press Enter to continue")  # debug pause
    print(type(req))
    res_json = json.loads(req.text)
    print(type(res_json))
    with open("C:/Users/Administrator/Desktop/wangyiCloud.json", "w") as f:
        json.dump(res_json, f)
        print("Finished writing to file...")

getdata(1)
input("Reached the end of the script")

     

That's the code for scraping NetEase Cloud Classroom. I normally write PHP, so if anything in the Python above is off, corrections are welcome.
 
Let me start with the business background: my team lead asked me to scrape the course data from all of the mainstream online learning sites.
When I first got the task I had no idea where to begin, since I'd never done anything like it.
I started by searching for how to scrape web data, and learned there are two common routes: one is to simulate the request headers and hit the data interface directly; the other is to fetch the whole page's HTML and then pick out the data you want with selectors.
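As an aside, here is a minimal sketch of that second route (fetch the HTML, then extract with selectors). The URL and the CSS selector are placeholders, not taken from any of the sites mentioned below:

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Placeholder URL and selector: for a real site you would inspect the page
# and write a selector that matches the elements you want.
html = requests.get("https://example.com/courses").text
soup = BeautifulSoup(html, "html.parser")

for node in soup.select(".course-title"):  # hypothetical CSS class
    print(node.get_text(strip=True))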
 
The first thing I reached for was the Scrapy framework, which I planned to set up on Windows. Sure enough, installing it on Windows was anything but painless; if you hit problems there, have a look at my other post: the various problems of installing Scrapy on Windows.
 
Once it was installed I followed its tutorial, and quickly scraped CSDN, Geek (极客), and Tencent Classroom (腾讯课堂).
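For reference, the kind of spider that tutorial walks you through looks roughly like this; it's only a sketch, and the spider name, start URL, and selectors are hypothetical rather than the actual CSDN/Tencent Classroom spiders:

import scrapy

class CourseSpider(scrapy.Spider):
    # Hypothetical spider: the name, start_urls, and selectors are placeholders.
    name = "courses"
    start_urls = ["https://example.com/courses"]

    def parse(self, response):
        for card in response.css(".course-card"):
            yield {
                "title": card.css(".title::text").get(),
                "price": card.css(".price::text").get(),
            }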
 
Then came NetEase Cloud Classroom, where the downloaded HTML turned out to contain no actual course data. Watching how the site loads made it clear the data is pulled in by JavaScript.
You can see that the data all comes from studycourse.json, which makes things simple: replicate the headers and the POST body and you can fetch it directly.
 
The data is fetched via POST, submitted as a request payload in JSON format.
Picking apart the POST fields: frontCategoryId reads literally as "front category", so it's presumably the id of the top-level course category, and keyword probably only matters when searching.
pageSize is how many records to load, and pageIndex is which page to load.
 
Since I write PHP, my first instinct was to simulate the POST with cURL.
The code:
   
 //cURL: simulate the POST to fetch the NetEase Cloud Classroom data
    public function wangyiDataAction(){
        $url = "https://study.163.com/p/search/studycourse.json";
        $headers = array(
            "Accept"    =>"application/json",
            "Host"        =>"study.163.com",
            "Origin"    =>"https://study.163.com",
            "Content-Type"=>"application/json",
            "Referer"    =>"https://study.163.com/courses",
            "User-Agent"=>"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
        );
        $payload = array(
            "pageIndex"        =>1,
            "pageSize"        =>700,
            "relativeOffset"=>50,
            "frontCategoryId"=>400000001295013,
            "searchTimeType"=>-1,
            "orderType"        =>50,
            "priceType"        =>-1,
            "activityId"    =>0,
            "keyword"        =>"",
        );
        $payload = json_encode($payload);
        $curl = curl_init();
        curl_setopt($curl, CURLOPT_URL, $url);
        curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
        curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, FALSE);
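        // NOTE: CURLOPT_HEADER only toggles whether response headers appear in the output;
        // it does not send request headers. To actually send the $headers above they would
        // need to be "Name: value" strings passed via CURLOPT_HTTPHEADER, which is likely
        // why the output below looked so strange.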
        curl_setopt($curl, CURLOPT_HEADER, $headers);
        curl_setopt($curl, CURLOPT_POST, 1);
        curl_setopt($curl, CURLOPT_POSTFIELDS, $payload);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
        $output = curl_exec($curl);
        curl_close($curl);
        echo"<pre>";print_r($output);
        return $output;
    }
 
When I ran it, though, what came back was this:

I can't make sense of what that is; if you know, please enlighten me~
 
No way around it, so I wrote it again in Python.

The code:
import requests
import json

def getdata(index):
    input("getdata() called, press Enter to continue")  # debug pause
    print(f"Fetching page {index} ...")
    payload = {"pageIndex": index,
               "pageSize": 700,
               "relativeOffset": 50,
               "frontCategoryId": 400000001295013,
               "searchTimeType": -1,
               "orderType": 50,
               "priceType": -1,
               "activityId": 0,
               "keyword": ""
    }
    print(type(payload))   # dict
    payload = json.dumps(payload)
    print(type(payload))   # str -- the endpoint expects a raw JSON body
    headers = {"Accept": "application/json",
               "Host": "study.163.com",
               "Origin": "https://study.163.com",
               "Content-Type": "application/json",
               "Referer": "https://study.163.com/courses",
               "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36"
    }
    print(type(headers))   # dict
    req = requests.post("https://study.163.com/p/search/studycourse.json", data=payload, headers=headers)
    input("POST done, press Enter to continue")  # debug pause
    print(type(req))       # requests.models.Response
    res_json = json.loads(req.text)
    print(type(res_json))  # dict
    with open("C:/Users/Administrator/Desktop/wangyiPublic.json", "w") as f:
        json.dump(res_json, f)
        print("Finished writing to file...")

getdata(1)
input("Reached the end of the script")
 
Since I don't really know Python, there's a lot of debug printing in there.
The run output looks like this:
The point worth noting is the type of req, which prints as requests.models.Response.
A quick Baidu search told me:
The object it returns carries a lot of information; text is the part we want, so we grab it and save it to a local file.
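To make that concrete, here's a minimal sketch of what a Response object carries, using only standard requests attributes (the URL is just the course page referenced above):

import requests

resp = requests.get("https://study.163.com/courses")

print(type(resp))                        # <class 'requests.models.Response'>
print(resp.status_code)                  # HTTP status code, e.g. 200
print(resp.headers.get("Content-Type"))  # response headers
print(resp.text[:200])                   # the body as a string -- this is the part saved to disk
# For JSON endpoints, resp.json() parses the body directly,
# equivalent to the json.loads(req.text) used in the script.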
 
Two points in the code worth calling out:
1. frontCategoryId: the course category. NetEase Cloud Classroom won't list every course at once, only all the courses under one top-level category, and frontCategoryId is that top-level category's id (you can look the ids up on the site yourself).
    The id has to be correct or you won't get the matching courses' data.
2. pageSize: the number of records returned per request. NetEase defaults to 50 because it displays 50 courses per page, but there's no need to page through like that; just set it much larger, say 2000, since each top-level category only has a few hundred to a couple of thousand courses, comfortably under 2K (see the sketch below).
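Putting those two points together, here's a rough sketch of covering several categories in one run. The endpoint, headers, and payload fields are the ones shown above; the category id list is a placeholder to fill in from the site (400000001295013 is the only id that appears in this post):

import json
import requests

# Placeholder: fill in the top-level frontCategoryId values read off the site.
CATEGORY_IDS = [400000001295013]

HEADERS = {
    "Accept": "application/json",
    "Host": "study.163.com",
    "Origin": "https://study.163.com",
    "Content-Type": "application/json",
    "Referer": "https://study.163.com/courses",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
}

def fetch_category(category_id):
    """Fetch every course under one top-level category with an oversized pageSize."""
    payload = {
        "pageIndex": 1,
        "pageSize": 2000,  # large enough to cover the whole category in one request
        "relativeOffset": 50,
        "frontCategoryId": category_id,
        "searchTimeType": -1,
        "orderType": 50,
        "priceType": -1,
        "activityId": 0,
        "keyword": "",
    }
    resp = requests.post("https://study.163.com/p/search/studycourse.json",
                         data=json.dumps(payload), headers=HEADERS)
    return resp.json()["result"]["list"]  # same structure the PHP below reads via ->result->list

courses = []
for cid in CATEGORY_IDS:
    courses.extend(fetch_category(cid))
print(f"Fetched {len(courses)} courses in total")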
 
Here's the data that came back. Really the code should have gone straight on to produce a CSV file, but my Python isn't up to it, so I handled that part in PHP (a Python sketch of the same step follows after the PHP and its result).
 
    //The data from https://study.163.com/p/search/studycourse.json is fetched by the Python POST above and saved to a file, then processed here with PHP
    public function readJsonAction(){
        $wangyi = file_get_contents("C:/Users/Administrator/Desktop/wangyi.json");
        $wangyi = json_decode($wangyi);
        $wangyi = $wangyi->result->list;
        $size = sizeof($wangyi);print_r($size);
        for ($i=0; $i < $size; $i++) {
            $courseInfo = $wangyi[$i];
            $courseInfo = (array)$courseInfo;
            $insertData = array(
                'title'          => $courseInfo['title'],
                'productName'    => $courseInfo['productName'],
                'lectorName'     => $courseInfo['lectorName'],
                'learnerCount'   => $courseInfo['learnerCount'],
                'lessonCount'    => $courseInfo['lessonCount'],
                'description'    => $courseInfo['description'],
                'score'          => $courseInfo['score'],
                'type'           => $courseInfo['type'],
                'imgUrl'         => $courseInfo['imgUrl'],
                'addtime'        => date("Y-m-d H:i:s",time())
            );
            $this->addCsvFile($insertData);
            echo"<pre>{$insertData['title']}写入成功";
        }
    }
The result:
NetEase Cloud Classroom has a little over 3,600 courses in total.
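For completeness, the flattening-to-CSV step could also be done in Python; here's a rough sketch, assuming the JSON was saved to the wangyi.json path read by the PHP above. The field names are the ones used in that PHP, addtime is filled in at write time just as it is there, and the output CSV path is hypothetical:

import csv
import json
from datetime import datetime

# Read the JSON dumped by the crawler.
with open("C:/Users/Administrator/Desktop/wangyi.json", encoding="utf-8") as f:
    data = json.load(f)

courses = data["result"]["list"]
fields = ["title", "productName", "lectorName", "learnerCount", "lessonCount",
          "description", "score", "type", "imgUrl", "addtime"]

# Hypothetical output path.
with open("C:/Users/Administrator/Desktop/wangyi.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    for course in courses:
        row = {key: course.get(key) for key in fields if key != "addtime"}
        row["addtime"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        writer.writerow(row)
        print(f"{row['title']} written")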
 
After that came NetEase Open Course. Fetching it through scrapy shell likewise returned no concrete course data.
Poking around in the browser's developer tools turned up this:
Simulating the request with cURL and bumping size up to 1000 pulled back every course in one go.
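In Python terms this is just a single GET against that URL; here's a minimal sketch that reuses the endpoint and query parameters from the PHP below (the output path is hypothetical):

import json
import requests

# listByClassify.do returns every Open Course item once `size` is large enough.
url = ("https://vip.open.163.com/open/trade/pc/course/listByClassify.do"
       "?classifyId=-1&type=2&page=1&size=1032")
resp = requests.get(url)
items = resp.json()["data"]["items"]  # same structure the PHP reads via ->data->items
print(f"Fetched {len(items)} courses")

# Hypothetical output path.
with open("C:/Users/Administrator/Desktop/wangyiOpen.json", "w", encoding="utf-8") as f:
    json.dump(resp.json(), f, ensure_ascii=False)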
 
The PHP I actually used is below:
 
   
 //NetEase Open Course data: it sits behind the URL below; fetch it with GET, then process it
    public function wangyiPublicAction(){
        $url = "https://vip.open.163.com/open/trade/pc/course/listByClassify.do?classifyId=-1&type=2&page=1&size=1032";
        $res = $this->https_request($url);
        $wangyiPublic = json_decode($res);
        $wangyiPublic = $wangyiPublic->data->items;
        $size = sizeof($wangyiPublic);print_r($size);
        for ($i=0; $i < $size; $i++) {
            $courseInfo = $wangyiPublic[$i];
            $courseInfo = (array)$courseInfo;
            $insertData = array(
                'title'        => $courseInfo['title'],
                'subtitle'    => $courseInfo['subtitle'],
                'authorName'=> $courseInfo['authorName'],
                'authorDesc'=> $courseInfo['authorDescription'],
                'price'        => $courseInfo['originPrice']/100,
                'chapter'    => $courseInfo['contentCount'],
                'purchase'    => $courseInfo['purchaseCount'],
                'interest'    => $courseInfo['interestCount'],
            );
            $this->addCsvFile($insertData);
            echo"<pre>{$insertData['title']}写入成功";
        }
    }
Part of the data:
 
And with that, scraping the NetEase courses is basically done.
 
 
 

Reposted from www.cnblogs.com/little-orangeaaa/p/10259802.html