PowerShell: Screen-scraping HTTP and returning specific lines as variables
I'm relatively new to PowerShell and have reached the limit of my knowledge. I'm writing a script to scrape the backup data from an internal webpage and then extract information from the scrape to manipulate and later display in Excel.
$Yesterday = [DateTime]::Now.AddDays(-1)
$datestr = $Yesterday.ToString("dd-MMM-yyyy")
$WebClient = New-Object System.Net.WebClient
$Results = $WebClient.DownloadString("http://fakeurl")
This results in a large amount of output containing HTML markup as well as the data I'm interested in, all bunched up together. I then do this:
[StringSplitOptions]$option = "None"
[string[]]$separator = "</td>"
$SPL = $Results.Split($separator, $option)
This splits the data up into a more readable format. Here's a snippet of the section I'm interested in from $SPL.
<tr><td headers="HOST_NAME" class="t13dataalt">server01
<td headers="AUTOSYS_JOB" class="t13dataalt">nbu.os.wn.135b.server01
<td headers="START_TIME" class="t13dataalt">01-Aug-2011 21:23
<td headers="END_TIME" class="t13dataalt">01-Aug-2011 21:51
<td headers="BACKUP_TYPE" class="t13dataalt">differential
<td headers="SCHEDULE" class="t13dataalt">daily
<td align="right" headers="SIZE_MB" class="t13dataalt"> 2,091.18
<td headers="IMAGES" class="t13dataalt">1
<td headers="EXIT_STATUS" class="t13dataalt">0
</tr><tr><td headers="HOST_NAME" class="t13data">server02
<td headers="AUTOSYS_JOB" class="t13data">nbu.os.wn.135b.server02
<td headers="START_TIME" class="t13data">31-Jul-2011 21:22
<td headers="END_TIME" class="t13data">31-Jul-2011 21:41
<td headers="BACKUP_TYPE" class="t13data">differential
<td headers="SCHEDULE" class="t13data">daily
<td align="right" headers="SIZE_MB" class="t13data"> 2,496.31
<td headers="IMAGES" class="t13data">1
<td headers="EXIT_STATUS" class="t13data">0
From this I need to extract the Start and End time, to work out the elapsed time, and also return the EXIT_STATUS of the most recent backup. I've tried the following but I feel I may be barking up the wrong tree:
$Position = select-string -inputobject $SPL -pattern $datestr
$Position.matches results in:
PS C:\Scripts> $Position.matches
Groups : {03-Aug-2011}
Success : True
Captures : {03-Aug-2011}
Index : 12056
Length : 11
Value : 03-Aug-2011
My theory was to take a substring starting at the Index plus the Length to extract the time value after the date, but I have no idea how to do that. I also think it's a bit primitive. There must be an easier way of returning the line of information I need from that variable without counting to the spot and then ripping out the rest of the line?
OK, as I'm not sure how to add a section like this at the bottom of the page, I'm going to add it here.
This is my script at the moment and it runs through without any errors but doesn't return any results.
# Get yesterdays date and convert it to the required search format
$Yesterday = [DateTime]::Now.AddDays(-1)
$datestr = $Yesterday.ToString("dd-MMM-yyyy")
# Scrape the webpage
$url = "http://fake-url"
$WebClient = New-Object System.Net.WebClient
$Results = $WebClient.DownloadString($url)
# Determine if the previous day is listed in the backups
$IsDateThere = $Results.Contains($datestr)
If ($IsDateThere){
# split the data into rows
[StringSplitOptions]$option = "None"
[string[]]$separator = "</td>"
$SPL = $Results.Split($separator, $option)
#strip the data into a hash table
$SPL |
Foreach-Object {
where {$_ -match 'headers="(.*)" class.*>(.*)'} |
ForEach-Object {
@{
$matches[1] = ($matches[2]).trim()
}
}
}
}
Else{
Write-Host "Yesterday's date not found"
}
Any ideas? I'm not sure what to do next to get the start time and end time of the most recent backup and the exit code as variables.
I would approach it something like this
$html = @"
<tr><td headers="HOST_NAME" class="t13dataalt">server01
<td headers="AUTOSYS_JOB" class="t13dataalt">nbu.os.wn.135b.server01
<td headers="START_TIME" class="t13dataalt">01-Aug-2011 21:23
<td headers="END_TIME" class="t13dataalt">01-Aug-2011 21:51
<td headers="BACKUP_TYPE" class="t13dataalt">differential
<td headers="SCHEDULE" class="t13dataalt">daily
<td align="right" headers="SIZE_MB" class="t13dataalt"> 2,091.18
<td headers="IMAGES" class="t13dataalt">1
<td headers="EXIT_STATUS" class="t13dataalt">0
</tr><tr><td headers="HOST_NAME" class="t13data">server02
<td headers="AUTOSYS_JOB" class="t13data">nbu.os.wn.135b.server02
<td headers="START_TIME" class="t13data">31-Jul-2011 21:22
<td headers="END_TIME" class="t13data">31-Jul-2011 21:41
<td headers="BACKUP_TYPE" class="t13data">differential
<td headers="SCHEDULE" class="t13data">daily
<td align="right" headers="SIZE_MB" class="t13data"> 2,496.31
<td headers="IMAGES" class="t13data">1
<td headers="EXIT_STATUS" class="t13data">0
"@
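# Split on LF or CRLF so the here-string works however the script file was saved
$html -split '\r?\n' | where {$_ -match 'start_time|end_time'} |
ForEach {
    # The header name starts 9 characters after the start of 'headers="'
    $pos = $_.IndexOf("headers")
    $begin = $pos+9
    $end = $_.IndexOf('"', $begin)
    new-object PSObject -Property @{
        Key = $_.SubString($begin, $end-$begin)
        # Everything after the first '>' is the cell text; Get-Date parses it using the current culture
        Value = Get-Date( $_.SubString( $_.IndexOf(">")+1 ) )
    }
}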
Results
Key Value
--- -----
START_TIME 8/1/2011 9:23:00 PM
END_TIME 8/1/2011 9:51:00 PM
START_TIME 7/31/2011 9:22:00 PM
END_TIME 7/31/2011 9:41:00 PM
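Since the question asks for the elapsed time, here is a minimal sketch of that step. It assumes the page always writes timestamps as dd-MMM-yyyy HH:mm, as in the snippet; the variable names ($startText, $endText, $format, $culture) are my own and not from the original script.

```powershell
# Sample cell text in the page's "dd-MMM-yyyy HH:mm" layout (taken from the question's snippet).
$startText = '01-Aug-2011 21:23'
$endText   = '01-Aug-2011 21:51'

# Parse with an explicit format and the invariant culture so that month
# abbreviations such as "Aug" are read the same way on any machine.
$format  = 'dd-MMM-yyyy HH:mm'
$culture = [System.Globalization.CultureInfo]::InvariantCulture
$start = [DateTime]::ParseExact($startText, $format, $culture)
$end   = [DateTime]::ParseExact($endText,   $format, $culture)

# Subtracting two DateTime values gives a TimeSpan.
$elapsed = $end - $start
$elapsed.TotalMinutes   # 28 for the sample values
```

ParseExact throws if a cell does not match the format, which is probably what you want for a report: a silent misparse would give a wrong duration.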
This isn't an original answer, just an alternate version of Doug's that uses a regex to capture all the data:
$html -split "`n" | where {$_ -match 'headers="(.*)" class.*>(.*)'} |
% {
@{
$matches[1] = ($matches[2]).trim()
}
}
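That gives one small hash table per cell. If you would rather have one object per backup row, something like the following might work. It is only a sketch: it assumes HOST_NAME is always the first cell of a row, and $rows, $current, $name and $value are names I made up. The sample is the question's snippet trimmed to four columns.

```powershell
# Two rows from the question's snippet, trimmed to the columns used here.
$html = @'
<tr><td headers="HOST_NAME" class="t13dataalt">server01
<td headers="START_TIME" class="t13dataalt">01-Aug-2011 21:23
<td headers="END_TIME" class="t13dataalt">01-Aug-2011 21:51
<td headers="EXIT_STATUS" class="t13dataalt">0
</tr><tr><td headers="HOST_NAME" class="t13data">server02
<td headers="START_TIME" class="t13data">31-Jul-2011 21:22
<td headers="END_TIME" class="t13data">31-Jul-2011 21:41
<td headers="EXIT_STATUS" class="t13data">0
'@

$rows = @()
$current = $null
foreach ($line in ($html -split '\r?\n')) {
    if ($line -match 'headers="([^"]+)"[^>]*>(.*)') {
        $name  = $matches[1]
        $value = $matches[2].Trim()
        # Assumption: HOST_NAME is always the first cell of a row,
        # so seeing it means the previous row (if any) is complete.
        if ($name -eq 'HOST_NAME') {
            if ($null -ne $current) { $rows += New-Object PSObject -Property $current }
            $current = @{}
        }
        if ($null -ne $current) { $current[$name] = $value }
    }
}
# Emit the last row, which has no following HOST_NAME to close it.
if ($null -ne $current) { $rows += New-Object PSObject -Property $current }

$rows | Format-Table HOST_NAME, START_TIME, END_TIME, EXIT_STATUS
```

Each element of $rows then has the header names as properties, e.g. $rows[0].START_TIME.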
EDIT: using the code in the question:
$Yesterday = [DateTime]::Now.AddDays(-1)
$datestr = $Yesterday.ToString("dd-MMM-yyyy")
$WebClient = New-Object System.Net.WebClient
$Results = $WebClient.DownloadString("http://fakeurl")
[StringSplitOptions]$option = "None"
[string[]]$separator = "</td>"
$SPL = $Results.Split($separator, $option)
$SPL |
    where {$_ -match 'headers="(.*)" class.*>(.*)'} |
    % {
        # Pipe $SPL straight into where; wrapping where in Foreach-Object gives it no input
        @{
            $matches[1] = ($matches[2]).trim()
        }
    }
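A note on the script posted in the question, which runs without errors but returns nothing: there, where sits inside the Foreach-Object script block. As far as I can tell, that starts a fresh pipeline in which where has no input, so nothing comes out. A small sketch showing the difference, with made-up sample strings ($cells, $nested and $flat are my names):

```powershell
$cells = 'headers="START_TIME" class="x">01-Aug-2011 21:23',
         'no match here',
         'headers="EXIT_STATUS" class="x">0'

# Nested: Where-Object starts a new pipeline inside the script block,
# so it is given no input and emits nothing.
$nested = @($cells | ForEach-Object { Where-Object { $_ -match 'headers="(.*)" class' } })

# Flat: Where-Object receives each element directly from the pipeline.
$flat = @($cells | Where-Object { $_ -match 'headers="(.*)" class' })

"nested: $($nested.Count)  flat: $($flat.Count)"
```

So piping the split array directly into where, without the outer Foreach-Object, is the shape to use.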
EDIT 2:
$SPL |
    where {$_ -match 'headers="(.*)" class.*>(.*)'} |
    % {
        # The cell holds a date followed by a time, so compare on the date prefix
        if (($matches[2]).Trim().StartsWith($datestr)) { "$($matches[1]) is yesterday's backup" }
    }
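To end up with the start time, end time and exit code of the most recent backup as variables, one option is to sort the parsed rows by start time and take the first. This is a rough sketch rather than a drop-in: $backups stands in for whatever per-row objects you build from the page (the two records below are hand-written from the question's snippet), and it assumes the dd-MMM-yyyy HH:mm timestamp layout.

```powershell
# Hypothetical records, one per backup row, using values from the question's snippet.
$backups = @(
    New-Object PSObject -Property @{ HOST_NAME = 'server02'; START_TIME = '31-Jul-2011 21:22'; END_TIME = '31-Jul-2011 21:41'; EXIT_STATUS = '0' }
    New-Object PSObject -Property @{ HOST_NAME = 'server01'; START_TIME = '01-Aug-2011 21:23'; END_TIME = '01-Aug-2011 21:51'; EXIT_STATUS = '0' }
)

$format  = 'dd-MMM-yyyy HH:mm'
$culture = [System.Globalization.CultureInfo]::InvariantCulture

# Sort newest first on the parsed start time and keep the first record.
$latest = $backups |
    Sort-Object { [DateTime]::ParseExact($_.START_TIME, $format, $culture) } -Descending |
    Select-Object -First 1

# The values the question asks for, as plain variables.
$StartTime  = [DateTime]::ParseExact($latest.START_TIME, $format, $culture)
$EndTime    = [DateTime]::ParseExact($latest.END_TIME,   $format, $culture)
$ExitStatus = [int]$latest.EXIT_STATUS
```

Sorting on the parsed DateTime rather than the raw text matters here, because "31-Jul-2011" sorts after "01-Aug-2011" as a string.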