Using Selenium + Java to crawl HowNet data: solving the verification code problem

How to handle the image verification code (CAPTCHA) encountered while crawling HowNet data

Detailed description and approach:
1: While crawling HowNet data with Selenium, I first relied on thread sleeps to slow the crawler down and avoid detection, but this stopped working once the crawl went beyond roughly one hundred pages. I then switched to solving the image verification code directly. My first idea was to run OCR on it myself, but the results were poor. In the end I settled on calling a third-party interface to recognize the verification code. The idea is as follows:
1) Take a screenshot, crop the verification code out of it, and save the crop to a known location.
2) Call the third-party interface to recognize the verification code in the saved image and type the result into the verification code input box. Recognition is not correct 100% of the time, so repeat the screenshot-and-recognize cycle until the code is accepted.
Note: The third-party interface is Baidu's text recognition (OCR) interface. For details, see:
http://ai.baidu.com/tech/ocr
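
The two steps above amount to a capture, recognize, submit loop. As a minimal sketch (not the author's code), the loop can be bounded so that a verification code the OCR never reads correctly does not retry forever. `CaptchaRetry`, `solve`, and `maxAttempts` are hypothetical names introduced for illustration; the `recognize` and `submit` callbacks stand in for the Selenium screenshot plus OCR call and for the form submission.

```java
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical helper: retries "recognize, then submit" a bounded number of times. */
public final class CaptchaRetry {

    private CaptchaRetry() {}

    /**
     * @param recognize   captures the verification code and returns the OCR guess
     *                    (empty if the OCR found no text)
     * @param submit      types the guess, submits it, and returns true once the
     *                    verification prompt is gone
     * @param maxAttempts upper bound on attempts, so a persistent OCR failure
     *                    cannot loop forever
     * @return the accepted text, or empty if every attempt failed
     */
    public static Optional<String> solve(Supplier<Optional<String>> recognize,
                                         Predicate<String> submit,
                                         int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Each attempt takes a fresh capture, since the image may change after a failure.
            Optional<String> guess = recognize.get();
            if (guess.isPresent() && submit.test(guess.get())) {
                return guess;
            }
        }
        return Optional.empty();
    }
}
```

In the crawler, `recognize` would wrap the screenshot, crop, and `client.basicGeneral(...)` call, and `submit` would type into the `CheckCode` input, click the submit button, and report whether the prompt label is still present.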

Part of the core code is as follows:

  // if (i % 15 == 0) {
                // Thread.sleep(20000);
                WebElement bodyEle = driver.findElement(By.tagName("body"));
                List<WebElement> list = bodyEle.findElements(By.tagName("input"));
                /////////////////////////////////////////////////////
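                //  Note: taking the screenshot and recognizing it form one continuous process. If recognition fails, we cannot move on to the next step, so we must keep capturing and recognizing.
                //  Java screenshot
                //  Get the position of the verification code on screen  //*[@id="CheckCodeImg"]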
                // WebElement checkImage = driver.findElement(By.id("CheckCodeImg"));
                // int x = checkImage.getLocation().getX();
                // int y = checkImage.getLocation().getY();
                // System.out.println(x);
                //  System.out.println(y);
                /*WebElement checkImage2 = driver.findElement(By.id("CheckCodeImg"));
                String message = checkImage2.getAttribute("src");
                System.out.println("X:" + checkImage2.getLocation().getX());
                System.out.println("Y:" + checkImage2.getLocation().getY());
*/
                //WebElement checkImage2 = driver.findElement(By.id("CheckCodeImg"));
                //  /html/body/p[1]/label
                WebElement text1 = driver.findElement(By.xpath("/html/body/p[1]/label"));
                String text1Str = text1.getText();
                System.out.println(text1.getText());
                // while (text1Str.equals("请输入验证码")) {   //  Keep capturing and verifying for as long as the "please enter the verification code" prompt is shown
                do {
                    pageCount++;
                    WebElement checkImage2 = driver.findElement(By.id("CheckCodeImg"));
                    // String message = checkImage2.getAttribute("src");
                    int X = checkImage2.getLocation().getX();
                    int Y = checkImage2.getLocation().getY();
                    System.out.println(X + "    " + Y);
                    // System.out.println("X:" + checkImage2.getLocation().getX());
                    //  System.out.println("Y:" + checkImage2.getLocation().getY());
                    File scrFile = ((RemoteWebDriver) driver).getScreenshotAs(OutputType.FILE);
                    byte[] bytes = File2byte(scrFile);
                    ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
                    BufferedImage image = ImageIO.read(bais);

                    //  Change where the image is stored
                    File pathCheck = new File(ResourceUtils.getURL("classpath:").getPath());
                    if (!pathCheck.exists()) pathCheck = new File("");
                    // System.out.println("path:"+path1.getAbsolutePath());
                    File uploadCheck = new File(pathCheck.getAbsolutePath(), "src/main/webapp/checkCodeImage");
                    if (!uploadCheck.exists()) uploadCheck.mkdirs();
                    String pathKey = uploadCheck.getAbsolutePath() + File.separator + "screenfile.png";   //  Path of the cropped verification code image
                    // Path file
                  //  File filekey = new File(pathKey);
                    File screenFile = new File(pathKey);

                    // Create the parent directory if it does not exist
                    if (!screenFile.getParentFile().exists()) {
                        screenFile.getParentFile().mkdirs();
                    }
                    //  The image is 63 x 22 pixels, located at (469, 39)
                    //  Cropping is the tricky part: if the first attempt is wrong, the position of the image changes
                    //  So read the image position first. Positions seen on two attempts, e.g.:   412, 40, 63, 22
                    //  First attempt:  X:469  Y:39
                    //  Second attempt: X:469  Y:76
                    BufferedImage subimage = image.getSubimage((X - 63), Y, 63, 22);
                    ImageIO.write(subimage, "png", screenFile);
                    // ImageIO.write(image, "png", screenFile);
                    // Thread.sleep(10000);
                    //   Verification code recognition
                    // Initialize an AipOcr client
                    AipOcr client = new AipOcr(APP_ID, API_KEY, SECRET_KEY);
                    // Optional: set network connection parameters
                    client.setConnectionTimeoutInMillis(2000);
                    client.setSocketTimeoutInMillis(60000);
                    // Optional: set the log4j output format; the default configuration is used if this is not set
                    // This property can also be set directly through a JVM startup argument
                    System.setProperty("aip.log4j.conf", "path/to/your/log4j.properties");
                    // Call the interface
                    //String path = "E:\\Images\\screenfile.png";
                    org.json.JSONObject res = client.basicGeneral(pathKey, new HashMap<String, String>());
                    net.sf.json.JSONObject myJson = net.sf.json.JSONObject.fromObject(res.toString());
                    Map m = myJson;
                    Object object = m.get("words_result");
                    JSONArray json = JSONArray.fromObject(object);
                    // No text recognized: skip this attempt instead of failing on get(0)
                    if (json.isEmpty()) continue;
                    List<Map<String, Object>> mapListJson = json;
                    Map<String, Object> checkMap = mapListJson.get(0);
                    String key = (String) checkMap.get("words");
                    System.out.println(key);
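                    //  Input box
                    WebElement inputEle = driver.findElement(By.id("CheckCode"));//list.get(0);
                    //inputEle.sendKeys("123");
                    inputEle.sendKeys(key);
                    Thread.sleep(5000);
                    //  Submit button
                    WebElement submitButn = driver.findElement(By.xpath("/html/body/p[1]/input[2]"));//list.get(1);
                    submitButn.click();
                    //  Pause for 12 seconds when the page index is a multiple of 60, 70, or 90
                    if (i % 60 == 0 || i % 70 == 0 || i % 90 == 0){
                        Thread.sleep(12000);
                    }
                    //  Re-read the prompt after submitting; otherwise text1Str never changes and the loop cannot end
                    List<WebElement> prompts = driver.findElements(By.xpath("/html/body/p[1]/label"));
                    text1Str = prompts.isEmpty() ? "" : prompts.get(0).getText();
                } while (text1Str.equals("请输入验证码"));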

                //   }

                System.out.println("Exited the loop; verification passed");

                /*File scrFile = ((RemoteWebDriver) driver).getScreenshotAs(OutputType.FILE);
                byte[] bytes = File2byte(scrFile);
                ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
                BufferedImage image = ImageIO.read(bais);
                File screenFile = new File("E:\\Images\\screenfile.png");
                // Create the parent directory if it does not exist
                if (!screenFile.getParentFile().exists()) {
                    screenFile.getParentFile().mkdirs();
                }
                //  The image is 63 x 22 pixels, located at (469, 39)
                BufferedImage subimage = image.getSubimage(412, 40, 63, 22);
                ImageIO.write(subimage, "png", screenFile);
                // Thread.sleep(10000);
                //   Verification code recognition
                // Initialize an AipOcr client
                AipOcr client = new AipOcr(APP_ID, API_KEY, SECRET_KEY);
                // Optional: set network connection parameters
                client.setConnectionTimeoutInMillis(2000);
                client.setSocketTimeoutInMillis(60000);
                // Optional: set the log4j output format; the default configuration is used if this is not set
                // This property can also be set directly through a JVM startup argument
                System.setProperty("aip.log4j.conf", "path/to/your/log4j.properties");
                // Call the interface
                String path = "E:\\Images\\screenfile.png";
                org.json.JSONObject res = client.basicGeneral(path, new HashMap<String, String>());
                net.sf.json.JSONObject myJson = net.sf.json.JSONObject.fromObject(res.toString());
                Map m = myJson;
                Object object = m.get("words_result");
                JSONArray json = JSONArray.fromObject(object);
                List<Map<String, Object>> mapListJson = (List) json;
                Map<String, Object> checkMap = mapListJson.get(0);
                String key = (String) checkMap.get("words");
                System.out.println(key);
                //  Input box
                WebElement inputEle = list.get(0);
                inputEle.sendKeys(key);
                Thread.sleep(2000);
                //  Submit button
                WebElement submitButn = list.get(1);
                submitButn.click();*/
                //Thread.sleep(5000);

The code above covers only the screenshot and recognition part. I have left the heavily commented sections in place on purpose, to keep a detailed record for later reference. For other functionality, such as simulating a user search, leave a comment below or send me a private message and I will help work it out.
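
One fragile spot in the code above is the `getSubimage` call: it uses a hard-coded 63 x 22 size and an offset from the element position, and `BufferedImage.getSubimage` throws a `RasterFormatException` if the rectangle falls outside the screenshot. The following is a minimal sketch of a safer crop, assuming the rectangle comes from the element's reported location and size. `CaptchaCrop` and `crop` are hypothetical names, not part of Selenium or the author's project.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

/** Hypothetical helper: crops an element's rectangle out of a full-page screenshot. */
public final class CaptchaCrop {

    private CaptchaCrop() {}

    /**
     * Clips the requested rectangle to the screenshot bounds before cropping,
     * so a slightly off position yields a smaller image instead of an exception.
     */
    public static BufferedImage crop(BufferedImage screenshot, int x, int y, int width, int height) {
        Rectangle bounds = new Rectangle(0, 0, screenshot.getWidth(), screenshot.getHeight());
        Rectangle wanted = new Rectangle(x, y, width, height);
        Rectangle clipped = bounds.intersection(wanted);
        if (clipped.isEmpty()) {
            // Nothing of the requested area is visible in the screenshot.
            throw new IllegalArgumentException(
                    "Verification code rectangle " + wanted + " lies outside the screenshot");
        }
        return screenshot.getSubimage(clipped.x, clipped.y, clipped.width, clipped.height);
    }
}
```

With this helper, the crop in the loop could be written as `CaptchaCrop.crop(image, X - 63, Y, 63, 22)`, keeping the author's offset. Whether that offset is right depends on the page layout and possibly on display scaling, so it is worth saving a few crops and checking them by eye.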


Origin blog.csdn.net/qq_31152023/article/details/100066460