gather all links to *.txt files

2023-02-20 01:53

I need to retrieve all the links to text files in an HTML document. I don't know the best way to do this, but I have tried the following in Ruby:

line.scan(/<a href="([\w+:\/.-]*.txt)/)

I am not sure this expression covers all possible links pointing to a text file. Are there any built-in regular expressions for this? Or does anyone know a better way to retrieve all links to text files in a huge web page?

This will walk through the HTML and find all hrefs with a '.txt' extension:

#!/usr/bin/env ruby

require 'nokogiri'

html = <<EOT
<html>
  <head><title>foo</title></head>
  <body>
    <a href="file.txt">text file</a>
    <a href="file.jpg">jpg file</a>
    <a href="file2.txt">text file 2</a>
  </body>
</html>
EOT
doc = Nokogiri::HTML(html)
# Match only anchors that carry an href, so n['href'] is never nil.
puts doc.search('a[href]').select { |n| n['href'][/\.txt\z/] }.map { |n| n['href'] }

> file.txt
> file2.txt

It uses Nokogiri to parse the content, which is far more robust than a regex on real-world HTML, where attribute order, quoting style, and whitespace can all vary.
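
Once the hrefs have been extracted, the extension check itself can be made a little more tolerant. The following is only a sketch using Ruby's standard uri library; the helper name txt_link? and the sample hrefs are made up for illustration. It assumes you want to ignore query strings and fragments and to treat ".TXT" the same as ".txt", which the original answer does not do.

```ruby
require 'uri'

# Hypothetical helper (not part of Nokogiri): true when the path
# component of +href+ ends in ".txt", ignoring case, query, and fragment.
def txt_link?(href)
  return false if href.nil? || href.strip.empty?

  path = URI.parse(href.strip).path
  File.extname(path.to_s).casecmp?('.txt')
rescue URI::InvalidURIError
  # Hrefs that URI cannot parse (e.g. containing raw spaces) are skipped.
  false
end

# Sample hrefs, invented for this example.
hrefs = [
  'file.txt',
  'file.jpg',
  'docs/NOTES.TXT?download=1',
  'http://example.com/a.txt#top',
  nil,
  'txt'
]

txt_links = hrefs.select { |h| txt_link?(h) }
puts txt_links
```

With Nokogiri, the same predicate could be applied to each n['href'] in place of the /\.txt\z/ check. Note that this sketch silently drops hrefs that are not valid URIs, which may or may not be what you want.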

Try this (it captures every .txt file name in the text, not just links):

html.scan(/[^\s"']+\.txt/)

To capture links to text files only:

html.scan(/<a [^<>\n]*?href=["']([^\s"']+\.txt)["'][^<>\n]*?>.*?<\/a>/m)
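
To see how these patterns behave, here is a small comparison on a made-up HTML fragment. It is a sketch, not an exhaustive test: the sample markup and the constant names are assumptions for illustration, and the three regular expressions are copied from the posts above.

```ruby
# Patterns taken verbatim from the question and the answer above.
QUESTION_RE = /<a href="([\w+:\/.-]*.txt)/
LOOSE_RE    = /[^\s"']+\.txt/
STRICT_RE   = /<a [^<>\n]*?href=["']([^\s"']+\.txt)["'][^<>\n]*?>.*?<\/a>/m

# Invented sample markup.
sample = <<~HTML
  <a href="file.txt">text file</a>
  <a class="x" href='notes/readme.txt' title="t">readme</a>
  <a href="file.jpg">jpg file</a>
  <p>see file3.txt for details</p>
HTML

# The loose pattern also picks up file names in plain text.
loose_hits = sample.scan(LOOSE_RE)

# The strict pattern only returns hrefs inside <a ...>...</a>.
strict_links = sample.scan(STRICT_RE).flatten

# Limitation: the strict pattern requires a quoted href value.
unquoted = '<a href=bare.txt>x</a>'.scan(STRICT_RE).flatten

# The question's pattern leaves the dot before "txt" unescaped,
# so it accepts names that merely end in "txt".
false_hit = '<a href="footxt">x</a>'.scan(QUESTION_RE).flatten

p loose_hits
p strict_links
p unquoted
p false_hit
```

Here the strict pattern finds the two real links, the loose one additionally reports file3.txt from the paragraph text, the unquoted href is missed, and the question's pattern wrongly captures "footxt". Gaps like the unquoted case are the reason a parser is usually the safer choice.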
