Huge problem parsing this "simple" HTML page
I'm trying to parse http://www.google.com/finance?q=INDEXDJX:.DJI, but I can't get it to work and I can't see why:
symbol_list: ["GOOG" "AAPL" "MSFT" "INDEXDJX:.DJI"]
foreach symbol symbol_list [
url0: rejoin [http://www.google.com/finance/historical?q= symbol]
;stock-data: read/lines url
dir: make-dir/deep to-rebol-file "askpoweruser/stock-download/google/"
either none? filename: find symbol ":" [filename: symbol
url: rejoin [url0 "&output=csv"]
content: read url
out-string: copy rejoin ["Time;Open;High;Low;Close;Volume" newline]
reversed-quotes: reverse parse/all content ",^/"
foreach [v c l h o d] reversed-quotes [
either not (error? try [d: to-date d]) [
d: rejoin [d/year "-" d/month "-" d/day]
append out-string rejoin [d ";" o ";" h ";" l ";" c ";" v newline]
][
]
]
write to-rebol-file rejoin [dir symbol ".csv"] out-string
][
filename: next next filename
out: copy []
for i 0 1 1 [
p: i
url: rejoin [url0 "&start=" (p * 200) "&num=" ((p + 1) * 200)]
content: read url
rule: [to "<table" thru "<table" to ">" thru ">"
to "<table" thru "<table" to ">" thru ">"
to "<table" thru "<table" to ">" thru ">"
copy quotes to </table> to end
]
parse content rule
parse load/markup quotes [
some [set tag tag! (probe tag) | set x string! (
if (not none? tag) [
if ((left-range tag 3) = "<td") [
replace/all (replace/all x "^/" "") "," ""
append out x
]
]
)
]
]
;write/lines to-rebol-file rejoin [dir filename "_" p ".html"] quotes
]
write to-rebol-file rejoin [dir filename "_temp" ".txt"] mold out
remove/part out 2
out-string: copy rejoin ["Time;Open;High;Low;Close;Volume" newline]
out: reverse out
insert/only out "" 1
foreach [x v c l h o d] out [
either not (error? try [d: to-date d]) [
d: rejoin [d/year "-" d/month "-" d/day]
append out-string rejoin [d ";" o ";" h ";" l ";" c ";" v newline]
][
probe d
input
]
]
write/lines to-rebol-file rejoin [dir filename ".csv"] out-string
]
]
In the end I did it another way (see my own answer below), using parse directly instead of load/markup. load/markup looked simpler at first, but Google's HTML is not very well-behaved, so I changed my mind:
parse quotes [
some [to "<td" thru "<td" to ">" thru ">" [copy x to "<" | copy x to end] (append out replace/all x "^/" "")]
to end
]
Sample output:
Time;Open;High;Low;Close;Volume
2009-11-30;10,309.77;10,364.34;10,263.29;10,344.84;223,576,049
2009-12-1;10,343.82;10,501.28;10,343.44;10,471.58;190,219,357
2009-12-2;10,470.44;10,513.52;10,421.47;10,452.68;159,501,469
2009-12-3;10,455.63;10,507.63;10,350.05;10,366.15;243,970,136
2009-12-4;10,368.57;10,516.70;10,311.81;10,388.90;460,658,589
2009-12-7;10,386.86;10,443.16;10,360.18;10,390.11;196,577,978
2009-12-8;10,385.42;10,385.65;10,249.84;10,285.97;221,774,698
2009-12-9;10,282.85;10,342.27;10,235.63;10,337.05;188,605,901
2009-12-10;10,336.00;10,444.60;10,335.77;10,405.83;195,906,049
2009-12-11;10,403.41;10,484.05;10,400.08;10,471.50;179,968,842
2009-12-14;10,471.28;10,514.66;10,471.28;10,501.05;154,359,615
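As a rough sketch, the <td> extraction rule shown earlier can be tried on its own against a small hand-written fragment. The sample markup below is made up for illustration (it is not the real Google Finance page), and the snippet assumes REBOL 2, like the rest of the script:

```rebol
REBOL [Title: "td extraction sketch"]

; Hypothetical sample fragment standing in for the table markup
; that the script copies into 'quotes.
quotes: {<tr>
<td class="lm">Dec 14, 2009
<td class="rgt">10,471.28
<td class="rgt">10,514.66
</tr>}

out: copy []

parse/all quotes [
    any [
        ; Skip to the next cell and past the end of its opening tag.
        thru "<td" thru ">"
        ; The cell text runs up to the next tag, or to the end of input.
        [copy x to "<" | copy x to end]
        ; In REBOL 2 an empty copy yields none, so guard before appending.
        (if x [append out replace/all x "^/" ""])
    ]
    to end
]

probe out
```

With this input, out should end up holding the three cell texts in page order. Rows can then be rebuilt by walking the block in fixed-size groups, as the foreach over six fields does in the script.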
Finally I gave up on load/markup and used parse directly. Now it works:
symbol_list: ["GOOG" "AAPL" "MSFT" "INDEXDJX:.DJI"]
foreach symbol symbol_list [
url0: rejoin [http://www.google.com/finance/historical?q= symbol]
;stock-data: read/lines url
dir: make-dir/deep to-rebol-file "askpoweruser/stock-download/google/"
either none? filename: find symbol ":" [filename: symbol
url: rejoin [url0 "&output=csv"]
content: read url
out-string: copy rejoin ["Time;Open;High;Low;Close;Volume" newline]
reversed-quotes: reverse parse/all content ",^/"
foreach [v c l h o d] reversed-quotes [
either not (error? try [d: to-date d]) [
d: rejoin [d/year "-" d/month "-" d/day]
append out-string rejoin [d ";" o ";" h ";" l ";" c ";" v newline]
][
]
]
write to-rebol-file rejoin [dir symbol ".csv"] out-string
][
filename: next next filename
out: copy []
for i 0 1 1 [
p: i
url: rejoin [url0 "&start=" (p * 200) "&num=" ((p + 1) * 200)]
content: read url
rule: [to "<table" thru "<table" to ">" thru ">"
to "<table" thru "<table" to ">" thru ">"
to "<table" thru "<table" to ">" thru ">"
copy quotes to </table> to end
]
parse content rule
parse quotes [
some [to "<td" thru "<td" to ">" thru ">" [copy x to "<" | copy x to end] (append out replace/all x "^/" "")]
to end
]
;write/lines to-rebol-file rejoin [dir filename "_" p ".html"] quotes
]
write to-rebol-file rejoin [dir filename "_temp" ".txt"] mold out
;remove/part out 2
out-string: copy rejoin ["Time;Open;High;Low;Close;Volume" newline]
out: reverse out
foreach [v c l h o d] out [
d: parse/all d " ,"
d: to-date rejoin [d/4 "-" d/1 "-" d/2]
d: rejoin [d/year "-" d/month "-" d/day]
append out-string rejoin [d ";" o ";" h ";" l ";" c ";" v newline]
]
write to-rebol-file rejoin [dir filename ".csv"] out-string
]
]
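The date handling in the final loop can also be looked at in isolation. This is a minimal sketch, assuming REBOL 2 and a date cell of the form "Dec 14, 2009" (the shape that the parse/all split in the script implies). reformat-date is a hypothetical helper name, not part of the original script:

```rebol
REBOL [Title: "Date cell conversion sketch"]

; reformat-date is a hypothetical helper that mirrors the three
; date lines inside the final foreach loop of the script.
reformat-date: func [cell [string!] /local parts d] [
    ; Split on spaces and commas, as the script does. The script
    ; then reads the month from slot 1, the day from slot 2 and
    ; the year from slot 4.
    parts: parse/all cell " ,"
    d: to-date rejoin [parts/4 "-" parts/1 "-" parts/2]
    ; Rebuild as year-month-day with numeric, unpadded fields.
    rejoin [d/year "-" d/month "-" d/day]
]

print reformat-date "Dec 14, 2009"
```

The month and day come out unpadded (the sample output has 2009-12-1, for example), so a consumer that expects strict ISO dates would need to pad them.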