Extracting the web links from a metalink file with Python - 代码天地

Other 2018-07-02 16:12:44

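I wanted to download TED talks, but what I got was a file in metalink format. I could not find a tool that still supports downloading from it, so I extracted the links myself with Python.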

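(1) The problem

The original file contains several thousand structures like the ones below. The goal is to find every string that runs from https to mp4 and collect them into a list file.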

<files>
<file name="Bren Brown - The power of vulnerability.mp4">
<resources>
<url type="http">https://download.ted.com/talks/BreneBrown_2010X-low-en.mp4</url>
</resources>
</file>
<file name="Isabel Behncke - Evolutions gift of play from bonobo apes to humans.mp4">
<resources>
<url type="http">https://download.ted.com/talks/IsabelBehnckeIzquierdo_2011U-low-en.mp4</url>
</resources>

</file>

(2) The original solution found online

https://zhidao.baidu.com/question/560038575.html

results=re.findall("(?isu)(http\://[a-zA-Z0-9\.\?/&\=\:]+)")
open("urls.txt","wb").write("\r\n".joint(results))

(3) The result after debugging:

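import re
# Read the file as text; UTF-8 is an assumption about its encoding.
with open("TEDEN.TXT", "r", encoding="utf-8") as source:
    s = source.read()
# Earlier attempt, without " ", "_" and "-" in the character class:
# results = re.findall(r"(?isu)(https\://[a-zA-Z0-9\.\?/&\=\:]+)", s)
results = re.findall(r"(?isu)(https\://[a-zA-Z0-9 _\-\.\?/&\=\:]+)", s)
# newline="" keeps the explicit "\r\n" separators from being translated again.
with open("OUTPUT.txt", "w", encoding="utf-8", newline="") as handle:
    handle.write("\r\n".join(results))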

(4) Contents of the output file:

https://download.ted.com/talks/BreneBrown_2010X-low-en.mp4
https://download.ted.com/talks/IsabelBehnckeIzquierdo_2011U-low-en.mp4

...

Debugging succeeded.
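As an alternative to the regular expression, the same links can probably be read by parsing the XML instead. This is only a minimal sketch and not the method used above: it assumes the file is well-formed XML shaped like the sample in section (1), and the function name `extract_urls` is hypothetical. Real metalink files may declare an XML namespace, so the sketch compares only the local part of each tag name.

```python
import xml.etree.ElementTree as ET


def extract_urls(xml_text):
    """Return the text of every <url> element, in document order."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        # Drop a leading "{namespace}" prefix, if present, so that plain
        # and namespaced documents are treated the same way.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "url" and element.text:
            urls.append(element.text.strip())
    return urls


# One entry copied from the sample in section (1), closed so it parses.
sample = """<files>
<file name="Bren Brown - The power of vulnerability.mp4">
<resources>
<url type="http">https://download.ted.com/talks/BreneBrown_2010X-low-en.mp4</url>
</resources>
</file>
</files>"""

print(extract_urls(sample))
```

Compared with the regular expression, this does not depend on which characters a URL may contain, but it fails outright if the file is not valid XML.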

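(5) Review

I learned what the symbols in re mean and how to use a regular expression to match the format you want.

I also learned the difference between .join and .joint: str.join is the real method, while the .joint in the snippet found online is a typo for it and does not exist.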
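To make the role of the character class concrete, here is a small sketch that runs two patterns against one line of the sample file. They are equivalent to the two patterns in section (3), with the redundant escapes dropped; the variable names are illustrative only.

```python
import re

line = '<url type="http">https://download.ted.com/talks/BreneBrown_2010X-low-en.mp4</url>'

# Character class without "_" and "-": matching stops at the first "_".
narrow = re.findall(r"(?i)(https://[a-zA-Z0-9.?/&=:]+)", line)

# Character class with " ", "_" and "-" added: the whole URL is matched,
# and matching stops at "<", which is not in the class.
wide = re.findall(r"(?i)(https://[a-zA-Z0-9 _\-.?/&=:]+)", line)

print(narrow)  # ['https://download.ted.com/talks/BreneBrown']
print(wide)    # ['https://download.ted.com/talks/BreneBrown_2010X-low-en.mp4']
```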
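A short sketch of the point about join: `str.join` is called on the separator and receives the list of strings, while `joint` is not a `str` method, so calling it raises `AttributeError`. The list below holds the two URLs from section (4); `joint_error` is just an illustrative name.

```python
results = [
    "https://download.ted.com/talks/BreneBrown_2010X-low-en.mp4",
    "https://download.ted.com/talks/IsabelBehnckeIzquierdo_2011U-low-en.mp4",
]

# join is called on the separator string and receives the items to glue together.
text = "\r\n".join(results)
print(text.count("\r\n"))  # 1, one separator between two items

# "joint" does not exist on str, so the snippet from section (2) fails here.
try:
    "\r\n".joint(results)
    joint_error = None
except AttributeError as exc:
    joint_error = exc
print(type(joint_error).__name__)  # AttributeError
```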

Reposted from blog.csdn.net/fumingf1/article/details/79691455
