Perl web scraper: extract content from a DIV that only has a "style" attribute?

Posted 2023-01-07 00:10

I've been stuck on this all day. I'm still fairly new to parsing and scraping in Perl, but I thought I had it down until this. I have tried several Perl modules (HTML::TokeParser, HTML::TokeParser::Simple, Web::Scraper and some others). I have the following string (in reality it is an entire HTML page; this shows only the relevant part). I am trying to extract "test1" and "test1_a", and so on ("test1" etc. are just placeholder values). So I think I first need to extract this from each DIV:

"<span style="float: left;">test1</span>test1_a"

Then I need to parse this to get the two values. I don't know why this is giving me so much trouble; I thought I could just do it with HTML::TokeParser::Simple, but I couldn't get it to return the value inside the DIV. I wonder if it's because the DIV contains another set of tags (the <span> tags).

String (represents the HTML web page):

<div id="dataID" style="font-size: 8.5pt; width: 250px; color: rgb(0, 51, 102); margin-right: 10px; float: right;">
<div style="width: 250px; text-align: right;"><span style="float: left;">test1</span>test1_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test2</span>test2_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test3</span>test3_a</div>

My attempt with the Web::Scraper module:

my $uri  = URI->new($theurl);

my $proxyscraper = scraper {
process 'div[style=~"width: 250px; text-align: right;"]',
'proxiesextracted[]' => scraper {
process '.style',  style => 'TEXT';
};
result 'proxiesextracted';

I'm more or less blindly trying to make sense of the Web::Scraper module, as there is essentially no documentation for it, so I pieced that together from the examples included with the module and one I found on the internet. Any advice is greatly appreciated.

If you want a DOM parser (easier tree browsing, slightly slower), try HTML::TreeBuilder.

From the HTML::Element man page (the module is included with HTML::TreeBuilder):

Note also that look_down considers "" (empty-string) and undef to be
different things, in attribute values. So this:
  $h->look_down("alt", "")
will find elements with an "alt" attribute, but where the value for the "alt" attribute is "".
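To make the distinction concrete, here is a minimal sketch of how look_down treats "", undef and a regex as attribute criteria. The three-image fragment and its id values are made up for illustration and are not from the question.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical fragment: one empty alt, one non-empty alt, one missing alt.
my $tb = HTML::TreeBuilder->new_from_content(
    '<p><img id="a" alt=""><img id="b" alt="logo"><img id="c"></p>'
);

# "" matches only elements whose alt attribute exists and is the empty string.
my @empty = $tb->look_down( _tag => 'img', alt => '' );

# undef matches only elements that have no alt attribute at all.
my @missing = $tb->look_down( _tag => 'img', alt => undef );

# A regex is matched against the attribute's string value.
my @logo = $tb->look_down( _tag => 'img', alt => qr/logo/ );

print "empty:   ", join( ',', map { $_->attr('id') } @empty ),   "\n";
print "missing: ", join( ',', map { $_->attr('id') } @missing ), "\n";
print "regex:   ", join( ',', map { $_->attr('id') } @logo ),    "\n";
```

The regex form is usually the practical choice when an attribute such as style holds a long value and only part of it is known to be stable.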

Which leads us to your answer:

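use HTML::TreeBuilder;

# See the HTML::TreeBuilder POD for the constructors (new_from_file takes a
# file name or filehandle, new_from_content takes an HTML string).
my $tb = HTML::TreeBuilder->new_from_content($html);    # $html holds the page source

# style => '' would only match an empty style attribute, so match with a regex.
# In list context look_down returns every matching element.
for my $div ( $tb->look_down( _tag => 'div', style => qr/text-align:\s*right/ ) ) {
    print $div->as_text, "\n";
}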
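Note that as_text returns the text of the span and the text that follows it run together, so the two values still need to be separated. A minimal sketch of one way to do that with HTML::TreeBuilder follows, assuming the markup from the question with a closing tag added for the outer DIV. The @pairs variable is a hypothetical name introduced here.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup copied from the question; a real page would be fetched first.
my $html = <<'HTML';
<div id="dataID" style="font-size: 8.5pt; width: 250px; color: rgb(0, 51, 102); margin-right: 10px; float: right;">
<div style="width: 250px; text-align: right;"><span style="float: left;">test1</span>test1_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test2</span>test2_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test3</span>test3_a</div>
</div>
HTML

my $tb = HTML::TreeBuilder->new_from_content($html);

# Locate the container by id, then walk its direct children.
my $outer = $tb->look_down( _tag => 'div', id => 'dataID' )
    or die "container div not found\n";

my @pairs;    # hypothetical name: one [span text, trailing text] pair per inner div
for my $div ( grep { ref $_ && $_->tag eq 'div' } $outer->content_list ) {
    my $span = $div->look_down( _tag => 'span' ) or next;

    # content_list returns child elements as objects and text as plain strings,
    # so the non-reference items are the text that sits outside the span.
    my $value = join '', grep { !ref $_ } $div->content_list;

    push @pairs, [ $span->as_text, $value ];
}

printf "%s => %s\n", @$_ for @pairs;
$tb->delete;    # free the tree once done
```

Selecting the container by its id rather than by the style string keeps the match from breaking if the inline styles change.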

Using Web::Scraper, try:

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper::Simple;
use Web::Scraper;

$Data::Dumper::Indent = 1;

my $html = '<div id="dataID" style="font-size: 8.5pt; width: 250px; color: rgb(0, 51, 102); margin-right: 10px; float: right;">
<div style="width: 250px; text-align: right;"><span style="float: left;">test1</span>test1_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test2</span>test2_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test3</span>test3_a</div>';


my $proxyscraper = scraper {
    process '//div[@id="dataID"]/div', 'proxiesextracted[]' => scraper {
       process '//span', 'data1' => 'TEXT';
       process '//text()', 'data2' => 'TEXT';
     }
};

my $results = $proxyscraper->scrape( $html );

print Dumper($results);

It gives:

$results = {
  'proxiesextracted' => [
    {
      'data2' => 'test1_a',
      'data1' => 'test1'
    },
    {
      'data2' => 'test2_a',
      'data1' => 'test2'
    },
    {
      'data2' => 'test3_a',
      'data1' => 'test3'
    }
  ]
};
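The result is a hash reference whose 'proxiesextracted' key holds an array of hashes. A minimal sketch of consuming it follows; the structure is written out by hand here (same shape as the dump above) so the snippet runs without Web::Scraper, and @lines is a hypothetical name.

```perl
use strict;
use warnings;

# Same shape as the dumped $results, typed in by hand for illustration.
my $results = {
    proxiesextracted => [
        { data1 => 'test1', data2 => 'test1_a' },
        { data1 => 'test2', data2 => 'test2_a' },
        { data1 => 'test3', data2 => 'test3_a' },
    ],
};

# Turn each hash in the array into a "data1 => data2" line.
my @lines = map { "$_->{data1} => $_->{data2}" } @{ $results->{proxiesextracted} };

print "$_\n" for @lines;
```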

Hope this helps
