Extracting HTML with CSS selectors

Reference

https://developer.mozilla.org/zh-CN/docs/Learn/CSS/Building_blocks/Selectors

Element, class, and id selectors

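css	xpath
div	//div
*	//*
.su	//*[@class="su"]	①
div.su	//div[@class="su"]	①
#su	//*[@id="su"]

①: The XPath shown matches only when the class attribute is exactly "su". The CSS selector .su also matches elements that carry several classes, such as class="su big".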
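The XPath column above can be tried with Python's standard library. The following is a minimal sketch, assuming the markup is well-formed XML (ElementTree is not an HTML parser and supports only a subset of XPath); the sample document and variable names are made up for illustration. It also shows how the exact-match XPath for `.su` differs from what the CSS selector would match.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; it must be well-formed XML for ElementTree.
HTML = """
<html><body>
  <div id="su">first</div>
  <div class="su">second</div>
  <p class="su big">third</p>
</body></html>
""".strip()

root = ET.fromstring(HTML)

# div     ->  //div
divs = root.findall(".//div")

# *       ->  //*   (every descendant of the root element)
everything = root.findall(".//*")

# .su     ->  //*[@class="su"]   (exact string comparison on the attribute)
exact_su = root.findall(".//*[@class='su']")

# div.su  ->  //div[@class="su"]
div_su = root.findall(".//div[@class='su']")

# #su     ->  //*[@id="su"]
id_su = root.findall(".//*[@id='su']")

# What the CSS selector .su would match: "su" as one of the class tokens.
token_su = [el for el in root.iter() if "su" in el.get("class", "").split()]

print([el.text for el in exact_su])
print([el.text for el in token_su])
```

For real-world HTML, a dedicated HTML parser with CSS and XPath support is usually a better fit than ElementTree.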

Attribute selectors

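css	xpath	explanation
`a[title]`	`//a[@title]`	a elements that have a title attribute
`a[title="x"]`	`//a[@title="x"]`	a elements whose title equals "x"
`a[href*="x"]`	`//a[contains(@href, "x")]`	a elements whose href contains the string "x"
`a[href$="x"]`	`//a[ends-with(@href, 'x')]`	a elements whose href ends with "x"	①
`a[class~="x"]`	omitted	a elements whose class contains x as one of its space-separated words	②
`a[href^="x"]`	`//a[starts-with(@href, "x")]`	a elements whose href starts with "x"
`a[href|="x"]`	omitted	a elements whose href equals "x" or starts with "x-"	③
`a[href*="x" i]`	`//a[contains(lower-case(@href),"x")]`	case-insensitive match	④
`a[href*="x" s]`	omitted	case-sensitive match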

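①: The table uses the XPath 2.0 form. In XPath 1.0, write: //a[substring(@href, string-length(@href) - string-length('x') + 1) = 'x']
②: For example, <a class="a x b"> matches, but <a class="ax b"> does not. A common XPath 1.0 equivalent is: //a[contains(concat(" ", normalize-space(@class), " "), " x ")]
③: Typical use: matching language codes such as zh-CN and zh-TW.
④: The table uses the XPath 2.0 form. XPath 1.0 is more verbose: //a[contains(translate(@href, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"), "x")]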
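The operators above can be summarized as plain string tests. The following is a rough sketch rather than a real selector engine: `attr_matches` is a hypothetical helper name, and `str.lower()` only approximates the ASCII case-insensitive comparison that the `i` flag performs.

```python
def attr_matches(value, op, target="", ignore_case=False):
    """Return True if an attribute value satisfies a CSS attribute operator.

    value: the attribute value, or None when the attribute is absent.
    op: "" for [attr], or one of "=", "*=", "$=", "^=", "~=", "|=".
    """
    if value is None:
        return False          # the attribute is missing
    if op == "":
        return True           # [attr]: presence is enough
    if ignore_case:           # the "i" flag
        value, target = value.lower(), target.lower()
    if op == "=":
        return value == target
    if op == "*=":            # substring; an empty target never matches
        return target != "" and target in value
    if op == "$=":            # suffix
        return target != "" and value.endswith(target)
    if op == "^=":            # prefix
        return target != "" and value.startswith(target)
    if op == "~=":            # one of the whitespace-separated words
        return target in value.split()
    if op == "|=":            # exact value, or the value followed by "-"
        return value == target or value.startswith(target + "-")
    raise ValueError("unknown operator: " + op)


print(attr_matches("a x b", "~=", "x"))   # class="a x b"
print(attr_matches("ax b", "~=", "x"))    # class="ax b"
print(attr_matches("zh-CN", "|=", "zh"))
```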

Pseudo-class selectors

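css	explanation
`a:link`	a elements with an href that have not been visited
`a:visited`	a elements that have been visited
`a:any-link`	every a element with an href, visited or not
`div:first-child`	a div that is the first child of its parent
`div:first-of-type`	the first div among its siblings
`div:last-child`	a div that is the last child of its parent
`div:last-of-type`	the last div among its siblings
`:not(p)`	any element that is not a p
`:nth-child(an+b)`	more involved; see note ①
`:nth-last-child(an+b)`	same as nth-child but counted from the end, so the last sibling is position 1
`div:nth-of-type(an+b)`	like nth-child, except that siblings that are not div are not counted
`:nth-last-of-type(an+b)`	like nth-of-type, but counted from the end
`p:only-child`	a p with no siblings; same as `:first-child:last-child`
`p:only-of-type`	a p with no sibling p elements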

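①: nth-child(an+b)

a and b are integer constants and n is a literal letter, for example 1n+0 or 2n+1.
Positions are counted from 1, not from 0.
0n+3: matches the third element
1n+0: matches every element
2n+0: matches elements at even positions
2n+1: matches elements at odd positions
3n+4: matches elements 4, 7, 10, 13, ...
an+b: n runs over 0, 1, 2, ...; the elements at positions an+b match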
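The an+b rule can be checked with a few lines of Python. This is an illustrative sketch; `nth_positions` is a made-up helper that lists which 1-based sibling positions a given formula selects among `count` siblings.

```python
def nth_positions(a, b, count):
    """Return the 1-based positions p in 1..count with p == a*n + b for some integer n >= 0."""
    positions = []
    for p in range(1, count + 1):
        if a == 0:
            hit = (p == b)                    # 0n+b selects only position b
        else:
            n, remainder = divmod(p - b, a)   # solve p = a*n + b for n
            hit = (remainder == 0 and n >= 0)
        if hit:
            positions.append(p)
    return positions


print(nth_positions(0, 3, 10))   # [3]
print(nth_positions(2, 0, 6))    # [2, 4, 6]
print(nth_positions(2, 1, 6))    # [1, 3, 5]
print(nth_positions(3, 4, 13))   # [4, 7, 10, 13]
```

The same function also handles a negative a: `nth_positions(-1, 3, 10)` returns `[1, 2, 3]`, which corresponds to `:nth-child(-n+3)` selecting the first three siblings.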

Descendant, child, and sibling selectors

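css	xpath	explanation
`div p`	//div//p	p elements at any depth inside a div
`div > p`	//div/p	p elements that are direct children of a div
`div + img`	//div/following-sibling::*[1][self::img]	an img that is the sibling immediately after a div
`div ~ img`	//div/following-sibling::img	every img sibling that comes after a div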
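The four combinators can be compared side by side in Python. This is a minimal sketch using only the standard library, under the assumption that the markup is well-formed XML; the sample document and the `following_siblings` helper are hypothetical. ElementTree has no following-sibling axis, so the two sibling combinators are emulated by walking each parent's child list.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup for illustration.
HTML = """
<body>
  <div><p>child</p><span><p>grandchild</p></span></div>
  <img src="a.png"/>
  <p>gap</p>
  <img src="b.png"/>
</body>
""".strip()

root = ET.fromstring(HTML)

# div p    ->  //div//p   (any depth below a div)
descendants = root.findall(".//div//p")

# div > p  ->  //div/p    (direct children only)
children = root.findall(".//div/p")


def following_siblings(tree, tag):
    """Yield (element, later_siblings) for every element named `tag`."""
    for parent in tree.iter():
        kids = list(parent)
        for i, kid in enumerate(kids):
            if kid.tag == tag:
                yield kid, kids[i + 1:]


# div + img: the sibling immediately after a div, if it is an img
adjacent = [sibs[0] for _, sibs in following_siblings(root, "div")
            if sibs and sibs[0].tag == "img"]

# div ~ img: every later img sibling of a div
general = [s for _, sibs in following_siblings(root, "div")
           for s in sibs if s.tag == "img"]

print([p.text for p in descendants])
print([p.text for p in children])
print([i.get("src") for i in adjacent])
print([i.get("src") for i in general])
```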
