Extracting Content from Web Pages - 代码天地

Programming languages 2018-05-12 14:09:55 Views: 2

package com.viewer;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import javax.xml.transform.TransformerException;

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;

import com.sun.org.apache.xpath.internal.XPathAPI;
import com.viewer.common.CommonFileOperator;

public class Test {
    /**
     * Parses the page at the given URL with NekoHTML and prints the src
     * attribute of every element whose id is "Img_a".
     */
    public void caijiNekoFirst(String url) throws Exception {
        DOMParser parser = new DOMParser();
        try {
            // Set the default encoding for the page
            parser.setProperty("http://cyberneko.org/html/properties/default-encoding", "gb2312");
            // Turn off namespace processing
            parser.setFeature("http://xml.org/sax/features/namespaces", false);
            // Parse the HTML at the given URL or path
            parser.parse(url);
        } catch (Exception e) {
            e.printStackTrace();
            // Parsing failed, so there is no document to query
            return;
        }
        Document doc = parser.getDocument();
        String titlexpath = "//*[@id=\"Img_a\"]";
        try {
            org.w3c.dom.NodeList titles = XPathAPI.selectNodeList(doc, titlexpath);
            System.out.println(titles.getLength());
            for (int i = 0; i < titles.getLength(); i++) {
                org.w3c.dom.Node node = titles.item(i);
                // Read the src attribute value; skip elements that have none
                NamedNodeMap namedNodeMap = node.getAttributes();
                org.w3c.dom.Node n = namedNodeMap.getNamedItem("src");
                if (n != null) {
                    System.out.println(n.getNodeValue());
                }
            }
        } catch (TransformerException e) {
            e.printStackTrace();
        }
    }
    /**
     * Reads a saved HTML file and prints every img tag matched by the regex.
     *
     * @param args unused
     */
    public static void main(String[] args) {
        // Alternative: load the page over HTTP and extract with XPath
        // Test t = new Test();
        // try {
        //     t.caijiNekoFirst("http://localhost:9090/PaperViewer/node_2.htm");
        // } catch (Exception e) {
        //     e.printStackTrace();
        // }
        String s = "C:\\Documents and Settings\\Administrator\\桌面\\新建文件夹\\node_2.htm";
        try {
            String content = CommonFileOperator.readFile(s);
            // System.out.println(content);
            // Note: '.' does not match line breaks, so the whole tag must sit on one line
            Pattern p = Pattern.compile("<img useMap=#PagePicMap1.*?id=\"Img_a\" >");
            Matcher m = p.matcher(content);
            while (m.find()) {
                String tmp = m.group();
                System.out.println(tmp);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
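
The listing above shows two ways to reach the same element: an XPath query over a parsed DOM, and a regular expression over the raw text. It relies on `com.sun.org.apache.xpath.internal.XPathAPI`, a JDK-internal class that may not be accessible on JDK 9 and later. The sketch below is one possible variant using only the standard `javax.xml.xpath` and `java.util.regex` APIs. The class name `ImgSrcExtractor`, the method names, and the sample file names are hypothetical and not part of the original. It also assumes well-formed XHTML for the XPath route (NekoHTML is left out to keep the example dependency-free) and double-quoted `id` and `src` attributes for the regex route.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of the original listing.
class ImgSrcExtractor {

    // Matches an <img ...> tag that carries id="Img_a" anywhere inside it
    // (checked by the lookahead) and captures the value of its src attribute.
    // Assumes lower-case names and double-quoted values.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b(?=[^>]*\\bid=\"Img_a\")[^>]*?\\bsrc=\"([^\"]*)\"");

    // XPath route: parse well-formed XHTML and select the src attribute nodes
    // of every element whose id is "Img_a".
    static List<String> srcByXPath(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList attrs = (NodeList) xpath.evaluate(
                    "//*[@id='Img_a']/@src", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                result.add(attrs.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the input", e);
        }
    }

    // Regex route: scan the raw text and collect the captured src values.
    static List<String> srcByRegex(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample inputs made up for this sketch
        String xhtml = "<html><body>"
                + "<img src=\"page_01.jpg\" id=\"Img_a\"/>"
                + "<img src=\"logo.png\" id=\"logo\"/>"
                + "</body></html>";
        String rawHtml = "<img useMap=#PagePicMap1 src=\"page_01.jpg\" id=\"Img_a\" >"
                + "<img src=\"logo.png\" id=\"logo\">";
        System.out.println(srcByXPath(xhtml));   // prints [page_01.jpg]
        System.out.println(srcByRegex(rawHtml)); // prints [page_01.jpg]
    }
}
```

The XPath route is less sensitive to attribute order and spacing but needs input that a parser accepts. The regex route works on tag soup such as the unquoted `useMap=#PagePicMap1`, but it breaks as soon as the markup deviates from the assumed quoting.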

Reposted from jasonwo.iteye.com/blog/1931430
