
Perl Regex to extract URLs from HTML

This should be a simple regex but I can't seem to figure it out.

Can someone please provide a 1-liner to take any string of arbitrary HTML input and populate an array with all the Facebook URLs (matching http://www.facebook.com) that were in the HTML code?

I don't want to use any CPAN modules and would much prefer a simple regex 1-liner.

Thanks in advance for your help!


Obligatory link explaining why you shouldn't parse HTML using a regular expression.

That being said, try this for a quick and dirty solution:

my $html = '<a href="http://www.facebook.com/">A link!</a>';
my @links = $html =~ /<a[^>]*\shref=['"](https?:\/\/www\.facebook\.com[^"']*)["']/gis;
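To see what the list-context match actually returns, here is a small self-contained demonstration against a string with several links (the sample HTML is invented for illustration):

```perl
use strict;
use warnings;

# Sample HTML: two Facebook links (one uppercase, single-quoted) and one unrelated link.
my $html = <<'HTML';
<a href="http://www.facebook.com/somepage">Page</a>
<a href="http://example.com/">Other</a>
<A HREF='https://www.facebook.com/another'>Another</A>
HTML

# In list context with /g, the match returns every capture from every match,
# so @links receives all the Facebook URLs at once.
my @links = $html =~ /<a[^>]*\shref=['"](https?:\/\/www\.facebook\.com[^"']*)["']/gis;

print "$_\n" for @links;
```

Note the /i flag, which lets the pattern match the uppercase `<A HREF=...>` tag as well.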


See HTML::LinkExtor. There is no point wasting your life energy (nor ours) trying to use regular expressions for these types of tasks.

You can read the documentation for a Perl module installed on your computer by using the perldoc utility. For example, perldoc HTML::LinkExtor. Usually, module documentation begins with an example of how to use the module.

Here is a slightly more modern adaptation of one of the examples in the documentation:

#!/usr/bin/env perl

use v5.20;
use warnings;

use feature 'signatures';
no warnings 'experimental::signatures';

use autouse Carp => qw( croak );

use HTML::LinkExtor qw();
use HTTP::Tiny qw();
use URI qw();

run( $ARGV[0] );

sub run ( $url ) {
    my @images;

    my $parser = HTML::LinkExtor->new(
        sub ( $tag, %attr ) {
            return unless $tag eq 'img';
            push @images, { %attr };
            return;
        }
    );

    my $response = HTTP::Tiny->new->get( $url, {
            data_callback => sub { $parser->parse($_[0]) }
        }
    );

    unless ( $response->{success} ) {
        croak sprintf('%d: %s', $response->{status}, $response->{reason});
    }

    my $base = $response->{url};

    for my $image ( @images ) {
        say URI->new_abs( $image->{src}, $base )->as_string;
    }
}

Output:

$ perl t.pl https://www.perl.com/
https://www.perl.com/images/site/perl-onion_20.png
https://www.perl.com/images/site/twitter_20.png
https://www.perl.com/images/site/rss_20.png
https://www.perl.com/images/site/github_light_20.png
https://www.perl.com/images/site/perl-camel.png
https://www.perl.com/images/site/perl-onion_20.png
https://www.perl.com/images/site/twitter_20.png
https://www.perl.com/images/site/rss_20.png
https://www.perl.com/images/site/github_light_20.png
https://i.creativecommons.org/l/by-nc/3.0/88x31.png


Russell C, have you seen the beginning of the Facebook movie, where Mark Zuckerberg uses Perl to automatically extract all the photos from a college facebook (and then posts them online)? I was like "that's how I'd do it! I'd use Perl too!" (except it would probably take me a few days to work out, not 2 minutes). Anyway, I'd use the module WWW::Mechanize to extract links (or photos):

use strict;
use warnings;
use WWW::Mechanize;

open( my $out, '>', 'out.txt' ) or die "Cannot open out.txt: $!";
my $url  = "http://www.facebook.com";
my $mech = WWW::Mechanize->new();
$mech->get($url);
my @links = $mech->links;
print {$out} $_->url, "\n" for @links;

However, this won't log you in to your Facebook page; it will just take you to the login screen. I'd use HTTP::Cookies to log in. For that, see the documentation. Only joking, just ask. Oh god, the apple strudel is burning!


Maybe this can help you. Note the /g flag and the while loop: a plain if only captures the first match, whereas this collects every Facebook URL in the string:

my @urls;
while ( $input =~ /(http:\/\/www\.facebook\.com\/\S+)/g ) {
    push @urls, $1;
}
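For a runnable version of that loop, here is a sketch with sample input (the string is invented for illustration):

```perl
use strict;
use warnings;

my $input = 'See http://www.facebook.com/pageA and also http://www.facebook.com/pageB today';

my @urls;
# With /g in scalar context, each iteration resumes where the last match
# ended, so the loop walks through every Facebook URL in the string.
while ( $input =~ /(http:\/\/www\.facebook\.com\/\S+)/g ) {
    push @urls, $1;
}

print "$_\n" for @urls;
```

Be aware that \S+ simply runs to the next whitespace, so trailing punctuation (a comma or closing quote stuck to the URL) would be captured too.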
