An example of synchronizing crawl results in a multithreaded crawler

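I have been writing crawlers in Java these past few days. The earlier jobs were fairly simple, mostly single-page crawls. This time I need to crawl more than 100 pages, so running several threads is a must.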

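However, the pages contain overlapping information, and our leader required us to deduplicate it, because the number of database modifications is limited. It took me a few days to get this program written.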

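First, the idea: use ConcurrentHashMap, a thread-safe collection that ships with Java, to do the deduplication. Under multithreading, though, you must pay attention to synchronization. The thread safety of a collection class only means that each individual operation (a single put, for example) is safe. It does not mean that a condition check followed by a write is safe as a whole, because another thread can slip in between the two steps. So we have to lock the code block that needs synchronization ourselves.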
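To make the check-then-write problem concrete, here is a minimal sketch, not the crawler itself. The class name `SynchronizedDedupSketch`, the methods `acceptOnce` and `run`, and the `shop-N` ids are all made up for illustration. Several threads offer the same ids, and the check plus the insert run under one lock so each id is accepted exactly once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; none of these names come from the crawler code.
class SynchronizedDedupSketch {

    // Accepts shopId only the first time it is offered.
    // The get() and the put() are each thread-safe on their own, but the pair is not,
    // so both steps (and the list update) run while holding the same lock.
    static boolean acceptOnce(Object lock, ConcurrentHashMap<String, Integer> seen,
                              List<String> accepted, String shopId) {
        synchronized (lock) {
            if (seen.get(shopId) == null) {
                seen.put(shopId, 1);
                accepted.add(shopId);
                return true;
            }
            return false;
        }
    }

    // Every thread offers the same idsPerThread ids; returns how many were accepted.
    static int run(int threads, int idsPerThread) {
        Object lock = new Object();
        ConcurrentHashMap<String, Integer> seen = new ConcurrentHashMap<>();
        List<String> accepted = new ArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < idsPerThread; i++) {
                    acceptOnce(lock, seen, accepted, "shop-" + i);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return accepted.size();
    }

    public static void main(String[] args) {
        // 8 threads each offer the same 1000 ids
        System.out.println("accepted = " + run(8, 1000));
    }
}
```

If the `synchronized` block is removed, two threads can both see `null` for the same id and both add it, which is exactly the duplicate the lock is there to prevent.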

public static ConcurrentHashMap<String,Integer>  hashMap=new ConcurrentHashMap<String, Integer>();

AtomicInteger integer = new AtomicInteger();
public void process(Page page)
{

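    // Compare string contents with equals(); == only compares object references in Java
    if("https://www.toryburch.com/stores-viewal".equals(page.getUrl().toString()))
    {
        // url is assumed to be the collection of store page URLs, defined elsewhere in the class
        for (String a : url) {

            page.addTargetRequest(a);
        }
    }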

    else
    {

        String shopId = "";
        String branchName = "";
        StringBuffer address = new StringBuffer();
        String crawlUrl = "";
        String region = "";
        String city = "";
        String country = "the United States";
        String shopName = "LanCome";
        String sourceName = "LanCome";
        String phone = "";
        String openTime = "";
        double lat = 0.0;
        double lng = 0.0;


        List<Selectable> infoList=page.getHtml().xpath("poi").nodes();
        List<BrandPoiDto> brandPoiDtos =new ArrayList<BrandPoiDto>();
        // Guard against pages where no store information was extracted
        if(infoList.size()>=1)
        {
            for(Selectable b:infoList)
            {
                phone=b.xpath("//phone/text()").toString();
                shopId=b.xpath("//uid/text()").toString();
                city=b.xpath("//city/text()").toString();
                branchName=b.xpath("//name/text()").toString();
                lat=Double.parseDouble(b.xpath("//latitude/text()").toString());
                lng=Double.parseDouble(b.xpath("//longitude/text()").toString());

                // Build the address from address1 and address2
                address=address.append(b.xpath("//address1/text()").toString()).append(b.xpath("//address2/text()").toString());


                BrandPoiDto dto = new BrandPoiDto();

                dto.setBranchName(branchName);
                dto.setAddress(address.toString());
                dto.setCrawlUrl(page.getUrl().toString());
                dto.setCity(city);
                dto.setCountry(country);
                dto.setPhone(phone);
                dto.setLat(lat);
                dto.setLng(lng);
                dto.setShopName(shopName);
                dto.setSourceName(sourceName);
                dto.setShopId(shopId);




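                // The check and the put must happen as one step, so both run under one lock.
                // Locking on this assumes all crawler threads share a single processor instance.
                synchronized (this)
                {
                    if(hashMap.get(shopId)==null)
                    {
                        hashMap.put(shopId,1);
                        brandPoiDtos.add(dto);

                        System.out.println(dto);
                        integer.incrementAndGet();

                    }

                }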

                // Reset the per-record fields before the next iteration
                address.delete(0,address.length());
                phone="";
                lat=0.0;
                lng=0.0;
                shopId="";
                branchName="";
                city="";


            }
        }
        // The records would be written to the database after this point
        System.out.println("Total records crawled: "+integer);

    }
}

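The code is shown above. I used a synchronized block: it first checks whether hashMap already contains the entry (the key is the store uid, which is unique). If it does not, the record is stored and the flag value 1 is set. If it does, the record is neither stored nor printed. The synchronized block together with the hash map solved the problem of deduplicating repeated data when several threads are crawling.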
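As a possible alternative to the explicit lock, `ConcurrentHashMap.putIfAbsent` does the check and the insert as one atomic operation, and returns `null` only to the caller that actually inserted the key. The sketch below is an illustration under that assumption and is not the code used in the crawler. The class name `PutIfAbsentDedupSketch`, the methods `firstSighting` and `run`, and the `shop-N` ids are made up.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; none of these names come from the crawler code.
class PutIfAbsentDedupSketch {

    // putIfAbsent returns the previous value, or null if the key was absent,
    // so exactly one thread per shopId gets true here.
    static boolean firstSighting(ConcurrentHashMap<String, Integer> seen, String shopId) {
        return seen.putIfAbsent(shopId, 1) == null;
    }

    // Every thread offers the same idsPerThread ids; returns how many were accepted.
    static int run(int threads, int idsPerThread) {
        ConcurrentHashMap<String, Integer> seen = new ConcurrentHashMap<>();
        AtomicInteger accepted = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < idsPerThread; i++) {
                    if (firstSighting(seen, "shop-" + i)) {
                        // Only the thread that won the insert counts the record
                        accepted.incrementAndGet();
                    }
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return accepted.get();
    }

    public static void main(String[] args) {
        // 8 threads each offer the same 1000 ids
        System.out.println("accepted = " + run(8, 1000));
    }
}
```

The trade-off is that only the map update is atomic. Anything else that must happen together with it, such as adding to a shared non-thread-safe list, would still need its own protection, which is one reason a synchronized block can be the simpler choice.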
