## Motivation

I have been looking for a place to rent recently, so I analyzed the Lianjia website.

## Code

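                // Imports needed: java.io.IOException, org.jsoup.Jsoup, org.jsoup.nodes.Document,
                // org.jsoup.nodes.Element, org.jsoup.select.Elements
                // Crawl list pages pg1..pg50 (the first 50 pages).
                for (int i = 1; i <= 50; i++) {
                    String everypageurl = "https://sh.lianjia.com/zufang/pg" + i + "rco11l1rp6/#contentList";
                    Document document;
                    try {
                        document = Jsoup.connect(everypageurl).timeout(500000).get();
                    } catch (IOException e) {
                        e.printStackTrace();
                        continue; // skip this page instead of hitting a NullPointerException below
                    }
                    Elements elements = document.select("div[class=content__list--item]");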
                    for (Element element : elements) {
                        zufang zu = new zufang();
                        // Relative link to the listing's detail (second-level) page
                        String url = element.select("a[class=content__list--item--aside]").attr("href");
                        String allUrl = "https://sh.lianjia.com" + url;
                        Document doc;
                        try {
                            doc = Jsoup.connect(allUrl).get();
                        } catch (IOException e) {
                            e.printStackTrace();
                            continue; // skip this listing instead of hitting a NullPointerException below
                        }
                        String title = doc.select("p[class=content__title]").text(); // title
                        zu.setTitle(title);
                        String description = doc.select("p[class=content__aside--tags]").text(); // feature tags
                        zu.setDescription(description);
                        String brand = "链家"; // brand (Lianjia)
                        zu.setBrand(brand);
                        String time = doc.select("div[class=content__subtitle]").text(); // publication time
                        zu.setTime(time);
                        String price = doc.select("p[class=content__aside--title]").select("span").text(); // price
                        zu.setPrice(price);
                        String feature = doc.select("p[class=content__article__table]").text(); // feature table
                        zu.setFeature(feature);
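                        // Entries 7 and 8 of the info list are used as the floor details; leave it empty if the list is shorter.
                        java.util.List<String> info = doc.select("li[class=fl oneline]").eachText();
                        String floor = info.size() > 8 ? info.get(7) + "-------" + info.get(8) : "";
                        zu.setFloor(floor);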
                        String around = doc.select("div[id=around]").select("ul").text();
                        zu.setAround(around);
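                        // Select the <p data-el="houseComment"> node and read its data-desc attribute.
                        // (Calling attr("data-el", "houseComment") would set the attribute rather than filter by it.)
                        String houseComent = doc.select("div[class=content__article__info3]")
                                .select("p[data-el=houseComment]").attr("data-desc");
                        zu.setHouseComent(houseComent);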
                        // Contact (lxr): name and phone number of the first agent in the list
                        Elements agent = doc.select("ul[id=agentList]").select("li:nth-child(1)").select("div[class=desc]");
                        String lxr = agent.select("div[class=title]").select("a[class=name]").text() + "--------"
                                + agent.select("div[class=phone]").text();
                        zu.setLxr(lxr);
                    }


                }

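I did not read the total page count, because Lianjia only displays 100 pages. Here I crawled the first 50 pages, sorted by publication time.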
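
The paging scheme can be sketched as a small helper that builds the list-page URLs up front. This is only an illustration: the class name `ListPageUrls`, the method `buildPageUrls`, and the constant `MAX_VISIBLE_PAGES` are hypothetical names, the URL pattern is copied from the code above, and the sketch assumes page numbers start at 1.

```java
import java.util.ArrayList;
import java.util.List;

class ListPageUrls {
    // Lianjia displays at most 100 list pages, so asking for more is pointless.
    static final int MAX_VISIBLE_PAGES = 100;

    // Build the list-page URLs for pages 1..pageCount, capped at the visible maximum.
    static List<String> buildPageUrls(int pageCount) {
        int last = Math.min(pageCount, MAX_VISIBLE_PAGES);
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= last; page++) {
            urls.add("https://sh.lianjia.com/zufang/pg" + page + "rco11l1rp6/#contentList");
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> urls = buildPageUrls(50);
        System.out.println(urls.size() + " pages, first: " + urls.get(0));
    }
}
```

Keeping the cap in one place means the page count can be changed without touching the crawl loop.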

## Main difficulty

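For most crawlers, the main work is analyzing the page structure. For this one, however, the key question is how to fetch the second-level (detail) pages. The solution is doc = Jsoup.connect(allUrl).get();
that is, simply issue the request with jsoup itself rather than with HttpClient.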
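
The two-level approach can be sketched in isolation. This is a minimal sketch, not the full crawler: the class `TwoLevelCrawlSketch` and its method names are hypothetical, the selectors are taken from the code above, and the HTML in `main` is made-up sample markup so that the parsing steps run without network access. One deliberate difference from the code above: the sketch resolves links with jsoup's `abs:href` attribute prefix instead of concatenating the host by hand.

```java
import java.io.IOException;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class TwoLevelCrawlSketch {
    // Fetch any page (list or detail) with jsoup's own HTTP request.
    static Document fetch(String url) throws IOException {
        return Jsoup.connect(url).get();
    }

    // Level 1: collect absolute detail-page URLs from a list page.
    static List<String> detailUrls(Document listPage) {
        return listPage.select("div[class=content__list--item]")
                .select("a[class=content__list--item--aside]")
                .eachAttr("abs:href");
    }

    // Level 2: read the listing title from a detail page.
    static String title(Document detailPage) {
        return detailPage.select("p[class=content__title]").text();
    }

    public static void main(String[] args) {
        // Made-up sample markup standing in for real responses.
        Document list = Jsoup.parse(
                "<div class=\"content__list--item\">"
              + "<a class=\"content__list--item--aside\" href=\"/zufang/SAMPLE.html\"></a></div>",
                "https://sh.lianjia.com");
        Document detail = Jsoup.parse("<p class=\"content__title\">Sample listing</p>");
        System.out.println(detailUrls(list));
        System.out.println(title(detail));
        // Against the live site, both levels would go through fetch():
        // for each u in detailUrls(fetch(listUrl)), call title(fetch(u)).
    }
}
```

Because both levels return a jsoup `Document`, the same selector-based extraction works on the list page and on each detail page, with no second HTTP library involved.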

Multi-level scraping of Lianjia rental data with jsoup
