开发者

How to search a HTML page for an item in a given list

I have a list of schools

schools = ['Harvard Law School', 'Stanford Law School', 'Yale Law School', 'Columbia Law School', 'NYU School of Law', 'University of Chicago Law School']

and bios of lawyers that contain one of these schools:

html = "page that contains one of these schools" 

like this

"<strong><em开发者_如何学Python>Education</em></strong><br />JD, Columbia Law School, Harlan Fiske Stone Scholar, Parker School Recognition of Achievement in International and Foreign Law, 2005<br />BM, BM, University of Michigan - Ann Arbor, <EM>summa cum laude</EM>, 1997<br />"

I've been extracting the school info with regex. But I thought it would be better to have a lookup list of schools and search each page for the matching school. I'm new to Python so I was searching about how to do this and I found difflib.SequenceMatcher.

I've been playing with it, and it's fun but I don't think it is the right tool for what I want to do. Can anyone direct me to the right way of doing this?

Thanks!


This is a very basic screen scraping way to achieve what you want

import urllib
html = urllib.urlopen(pageToLawyersBio)

htmlstr=''
for line in html.readlines():
    htmlstr += line.lower()

for school in listOfSchools:
    if school.lower() in htmlstr:
        print "This lawyer went to", school


I don't know nothing about Python but I often create dynamic regex expressions into a string like:

"(school 1|school 2|school 3|school n)"

Then I instantiate a regex object, passing the string.

You can then match your schools, regardless of the form of the document unless a HTML tag is in the middle of a school name.

Mike

EDIT - example (sorry c#): "(" + String.Join("|", arrayOfSchools) + ")"


I hate to rain on your parade, but building a lookup list of law schools and then doing a set membership type of test in the source code probably will not work. The flawed approach:

schools = []
html = page.read()
for school in list:
    if school in html:
        schools.append(school)

The reason why is this: you're assuming law school names are represented uniformly on lawyer websites, but that assumption isn't reliable. For example, I went to a law school called University of California, Hastings College of the Law. Sometimes it appears on lawyer websites as Hastings College of Law, and others it appears as UC Hastings. Often the data about where a lawyer went to school is collected directly from the lawyer, so it will appear verbatim as he or she supplied it. You probably can't assume the data was later normalized.

As a result, any school names that deviate from your lookup list won't be found. To further complicate matters, the shortest version of my school's name--UC Hastings--might even confound a difflib 'get close matches' lookup unless you set the match ratio very low, which inevitably causes the routine to find a number of other false positives as well.

Here's my advice. Spider a list of all law school names and put it in a database table. Create a second table with known deviations from the list. Each time you spider a site, try a basic set membership test in the lookup list (or dynamically generated regex). In the probable event that such a lookup fails, make the script throw an error and print the unmatched school to a console. Add that school the table of known variants and key it to the correct school name in the main lookup table. Repeat this process until you feel confident you have most variants accounted for. From there, add a hack to check unfound school names against a list of the official lookup items and all known variants using

difflib.get_close_matches

Use this kind of method to return the closest valid match any time a school isn't found. It may be the best your clients can ask for. I use django for this kind of thing because the built-in database admin makes it easy to add in known variants.


I would need to know which school matches.

Now I am extracting the school info with regex (I am still testing):

    item = re.search('(JD)(.*?)(\d+)', html)
    if item:
        JD = item.group()
        f = open('test1.txt', 'a')
        f.write(JD)
    else:
        NoJD = ("empty cvs schema goes here")
        f = open('test1.txt', 'a')
        f.write(NoJD) 

This picks up the relevant part from the html:

JD, Columbia Law School, Harlan Fiske Stone Scholar, Parker School Recognition of Achievement in International and Foreign Law, 2005

I still need to parse this to format properly so that I can write it into an items.csv file:

first,initial,last,title,firm,school,year

So, I thought that if i looked for matches for schools, I could pick up the school names (and year graduated) by looking up a school list. But if this is too complicated I'll go ahead with the regex.

Thanks.


You should consider using beautifulSoup to parse the HTML. In regards to your question, you might want to try something like:

for line in html.split("<br \>"):
    # This gives a lot of crap, filter it with
    for values in line.split(", "):
         try: 
             if values[0] in schools:
                  #This line contains a school, write it out.
         except:
             # Ignore badly formatted lines
             pass


inspectorG4dgt: This is great! thanks. I think this better than using regex. Because in some pages "JD" comes before school name, in others, after the school name. Same with the graduation dates.

I have been trying to get the line where the school names occurs but I couldn't do it. Something like this:

htmlstr = ''
for line in html.readlines():
    htmlstr += line.lower()

for school in listOfSchools:
    if school.lower() in htmlstr:
        [ schoolLine = line with the school and date ]

To learn more about this stuff I've been studying this tutorial.

For instance, I tried to use readline() to loop on each line but that didn't work.

Or it may be better to search for listOfSchools and a list of years = [1956, ... 2008]. Since schools and dates are on the same lines. Any suggestions how I can do this? Thanks.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜