开发者

Python Crawler - need help with my algorithm

** Added a summary of the problem at the end of the post **

I've written a crawler that fetches and parses URLs.

in the first version, in order to get the next valid page I was increasing the URL ID and saved non-valid IDs to a file, the valid URLs were moved to a parser that parsed the content I need, after a while I noticed that most valid IDs has a returning subtrahend.

I made some statistics and got to that list of subtrahends - [8,18,7,17,6,16,5,15], ordered by the most repeating to the least.

so I changed my code to -

def checkNextID(ID):
    numOfRuns = 0
    global curRes, lastResult
    while ID < lastResult:
        try:
            numOfRuns += 1
            if numOfRuns % 10 == 0:
                time.sleep(7) # sleep every 10 iterations
                numOfRuns = 0
            if isValid(ID + 8): 
                parseHTML(curRes, ID)
                ID = ID + 8
            elif isValid(ID + 18):
                parseHTML(cu开发者_高级运维rRes, ID)
                ID = ID + 18
            elif isValid(ID + 7):
                parseHTML(curRes, ID)
                ID = ID + 7
            elif isValid(ID + 17):
                parseHTML(curRes, ID)
                ID = ID + 17
            elif isValid(ID+6):
                parseHTML(curRes, ID)
                ID = ID + 6
            elif isValid(ID + 16):
                parseHTML(curRes, ID)
                ID = ID + 16
            elif isValid(ID + 5):
                parseHTML(curRes, ID)
                ID = ID + 5
            elif isValid(ID + 15):
                parseHTML(curRes, ID)
                ID = ID + 15
            else:
                if isValid(ID + 1):
                    parseHTML(curRes, ID)
                ID = ID + 1
        except Exception, e:
            print "something went wrong: " + str(e) 

isValid() is a function that gets an ID + one of the subtrahends and returns True if the url contains what I need and saves a soup object of the url to a global variable named - 'curRes', it returns False if the url doesn't contain the data I need and saves the id to 'badIdFile'.

parseHTML is a function that gets the soup object (curRes), parses the data I need and then saves the data to a csv, then returns True.

in a perfect world that piece of code would be everything I need in order to run on all valid IDs (there are about 400K in a range of 5M), it gave me better results in less time (x50 faster).

BUT, when getting to ranges that doesn't contain any valid URL, my code is very inefficient, it turns that I'm crawling mostly the same URLs over and over in each iteration, this is because I'm increasing the ID by one in order to keep progressing until I'll find the next valid URL and then I'm checking the ID + 8, then 18, 17 etc' which sometimes gives me the same URLs I checked in the previous iteration.

so I went and changed the code so it will keep a a set of non-valid URLs which I'll avoid checking again and I can't get it to work, I'm breaking my head for a few hours now, it's not working like it should.

this is my new function -

def checkNextID(ID):
    runs = 0
    badTries = set()
    diff = [8,18,7,17,6,16,5,15,1]
    global curRes, lastResult
    while ID < lastResult:
        if len(badTries) == 100: #every 100 non-valid IDs I can reset badTries(I think), if I've increased the ID 100 times it means I won't be checking previous IDs and I can start saving the non-valid IDs again.
            badTries = set()
        try:
            for i in diff:
                tryID = ID + i #tryID will be the ID + one of the subtrahends.
                if tryID not in badTries: #if we haven't already tried that ID
                    if runs % 10 == 0:
                        time.sleep(6)  #sleep every 10 requests
                    if isValid(tryID):
                        runs += 1
                        parseHTML(curRes, ID)
                        ID = tryID
                        badTries = set() #I can reset the badTries now, I'm moving forward.
                        break #will move to the next id.
                    else:
                        badTries.add(ID) #save the non-valid ID to the set
                        if i == 1: #it's the last iteration of the for loop, if the subtrahend is 1 and it's not valid I must increase the ID by 1 in order to go forward.
                            ID += 1
                else:
                    ID += 1 #if the ID is not a valid ID and I've already checked it I must increase the ID by one or I'll get into an infinite loop.
        except Exception, e:
            print "something went wrong: " + str(e) + ' ID - ' + str(ID)

I'm saving each non-valid ID to a set, and before every call to isValid() I'm checking if I've already tried that ID, if I didn't, I'm calling isValid(), else, ID is increased by one.

that's how the bad ID file looks like -

513025328
513025317
513025327
513025316
513025326
513025312
513025320
513025330
513025319
513025329
513025318
513025328
513025317
513025327
513025313
513025321
513025331
513025320
513025330
513025319
513025329
513025318
513025328
513025314
513025322
513025332
513025321
513025331
513025320
513025330
513025319
513025329
513025315
513025323
513025333
513025322
513025332
513025321
513025331
513025320
513025330
513025316
513025324
513025334
513025323
513025333
513025322
513025332
513025321
513025331
513025317
513025325
513025335
513025324
513025334
513025323
513025333
513025322
513025332
513025318
513025326
513025336
513025325
513025335
513025324
513025334
513025323
513025333
513025319
513025327
513025337
513025326
513025336
513025325
513025335
513025324
513025334
513025320
513025328
513025338
513025327
513025337
513025326
513025336
513025325
513025335
513025321
513025329
513025339
513025328
513025338
513025327
513025337
513025326
513025336
513025322
513025330
513025340
513025329
513025339
513025328
513025338
513025327
513025337
513025323
513025331
513025341
513025330
513025340
513025329
513025339
513025328
513025338
513025324
513025332
513025342
513025331
513025341
513025330
513025340
513025329
513025339
513025325
513025333
513025343
513025332
513025342
513025331
513025341
513025330
513025340
513025326
513025334
513025344
513025333
513025343
513025332
513025342
513025331
513025341
513025327
513025335
513025345
513025334
513025344
513025333
513025343
513025332
513025342
513025328
513025336
513025346
513025335
513025345
513025334
513025344
513025333
513025343
513025329
513025337
513025347
513025336
513025346
513025335
513025345
513025334
513025344
513025330
513025338
513025348
513025337
513025347
513025336
513025346
513025335
513025345
513025331
513025339
513025349
513025338
513025348
513025337
513025347
513025336
513025346
513025332
513025340
513025350
513025339
513025349
513025338
513025348
513025337
513025347
513025333
513025341
513025351
513025340
513025350
513025339
513025349
513025338
513025348
513025334
513025342
513025352
513025341
513025351
513025340
513025350
513025339
513025349
513025335
513025343
513025353
513025342
513025352
513025341
513025351
513025340
513025350
513025336
513025344
513025354
513025343
513025353
513025342
513025352
513025341
513025351
513025337
513025345
513025355
513025344
513025354
513025343
513025353
513025342
513025352
513025338
513025346
513025356
513025345
513025355
513025344
513025354
513025343
513025353
513025339
513025347
513025357
513025346
513025356
513025345
513025355
513025344
513025354
513025340
513025348
513025358
513025347
513025357
513025346
513025356
513025345
513025355
513025341
513025349
513025359
513025348
513025358
513025347
513025357
513025346
513025356
513025342
513025350
513025360
513025349
513025359
513025348
513025358
513025347
513025357
513025343
513025351
513025361
513025350
513025360
513025349
513025359
513025348
513025358
513025344
513025352
513025362
513025351
513025361
513025350
513025360
513025349
513025359
513025345
513025353
513025363
513025352
513025362
513025351
513025361
513025350
513025360
513025346
513025354
513025364
513025353
513025363
513025352
513025362
513025351
513025361
513025347
513025355
513025365
513025354
513025364
513025353
513025363
513025352
513025362
513025348
513025356
513025366
513025355
513025365
513025354
513025364
513025353
513025363
513025349
513025357
513025367
513025356
513025366
513025355
513025365
513025354
513025364
513025350
513025358
513025368
513025357
513025367
513025356
513025366
513025355
513025365
513025351
513025359
513025369
513025358
513025368
513025357
513025367
513025356
513025366
513025352
513025360
513025370
513025359
513025369
513025358
513025368
513025357
513025367
513025353
513025361
513025371
513025360
513025370
513025359
513025369
513025358
513025368
513025354
513025362
513025372
513025361
513025371
513025360
513025370
513025359
513025369
513025355
513025363
513025373
513025362
513025372
513025361
513025371
513025360
513025370
513025356
513025364
513025374
513025363
513025373
513025362
513025372
513025361
513025371
513025357
513025365
513025375
513025364
513025374
513025363
513025373
513025362
513025372
513025358
513025366
513025376
513025365
513025375
513025364
513025374
513025363
513025373
513025359
513025367
513025377
513025366
513025376
513025365
513025375
513025364
513025374
513025360
513025368
513025378
513025367
513025377
513025366
513025376
513025365
513025375
513025361
513025369
513025379
513025368
513025378
513025367
513025377
513025366
513025376
513025362
513025370
513025380
513025369
513025379
513025368
513025378
513025367
513025377
513025363
513025371
513025381
513025370
513025380
513025369
513025379
513025368
513025378
513025364
513025372
513025382
513025371
513025381
513025370
513025380
513025369
513025379
513025365
513025373
513025383
513025372
513025382
513025371
513025381
513025370
513025380
513025366
513025374
513025384
513025373
513025383
513025372
513025382
513025371
513025381
513025367
513025375
513025385
513025374
513025384
513025373
513025383
513025372
513025382
513025368
513025376
513025386
513025375
513025385
513025374
513025384
513025373
513025383
513025369
513025377
513025387
513025376
513025386
513025375
513025385
513025374
513025384
513025370
513025378
513025388
513025377
513025387
513025376
513025386
513025375
513025385
513025371
513025379
513025389
513025378
513025388
513025377
513025387
513025376
513025386
513025372
513025380
513025390
513025379
513025389
513025378
513025388
513025377
513025387
513025373
513025381
513025391
513025380
513025390
513025379

as you can see, it's not working, I know I have a flaw in the whole design but I can't find it, I would really appreciate your help.

a summary of the problem -

I have a diff list [8,18,7,17,6,16,5,15] the program starts with getting an id, each time I need to check the next id which is - id + diff[i] (i=0) if (id + diff[i]) isn't a valid id I'm checking the next id which is (id + diff[i+1]).

if there isn't a valid id on that iteration (id + diff[i..n]) I'm increasing id by 1, and check if id+1 is valid id, if not I'm checking again with id + diff[i..n], until I'll find the next valid id.

in each iteration I'm checking ids I've already checked in the previous iteration (which costs a lot of time and is inefficient), I need to avoid checking the IDs I've already checked and keep increasing the ID until I'll find the next valid ID.

as for now, if the id = 1 (and it's a valid id) and diff = [8,18,7,17,6,16,5,15].

first iteration will look like (I'm marking with bold the id's which I could avoid checking) - first - id = 1

9, 19, 8, 18, 7, 17, 6, 16, 2

second - id = 2

10, 20, 9, 19, 8, 18, 7, 17, 3

third - id = 3

11, 21, 10, 20, 9, 19, 8, 18, 4

fourth - id = 4

12, 22 - BINGO, next valid ID is 22!

that cost me 29 requests, instead of - 17, and that's a small example, I have ranges that are 300-600 ids from the last valid id.

I can't get my code to avoid checking previously checked ids with a smart and efficient way.

Thanks!


I think I've got it.

First, the code based on your idea:

import time

lastResult = 100

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    runs = 0
    SEEN = set()

    while True:
        if ID>lastResult:
            print ('\n=========================='
                   '\nID==%s'
                   '\n   ID>lastResult is %s : program STOPS')\
                  % (ID,ID>lastResult,)
            break
        runs += 1
        if runs % 10 == 0:  time.sleep(0.5)
        if ID in SEEN:
            print '-----------------\nID=='+str(ID)+'  already seen, not examined'
            ID += 1
        else:
            curRes = isValid(ID)
            if curRes:
                print '--------------------------\nID=='+str(ID)+'  vaaaalid'
                while True:
                    for i in diff:
                        runs += 1
                        if runs % 10 == 0:  time.sleep(0.5)
                        curRes = isValid(ID+i)
                        SEEN.add(ID+i)
                        if curRes:
                            print '   i==%2s   ID+i==%s   valid' % (i,ID+i)
                            ID += i
                            print '--------------------------\nID==%s' % str(ID)
                            break
                        else:
                            print '   i==%2s   ID+i==%s   not valid' % (i,ID+i)
                    else:
                        ID += 1
                        break
            else:
                print '--------------------------\nID==%s  not valid' % ID
                ID += 1


def isValid(ID, valid_ones = (1,9,17,25,50,52,60,83,97,98)):
    return ID in valid_ones


checkNextID(0)

result

--------------------------
ID==0  not valid
--------------------------
ID==1  vaaaalid
   i== 8   ID+i==9   valid
--------------------------
ID==9
   i== 8   ID+i==17   valid
--------------------------
ID==17
   i== 8   ID+i==25   valid
--------------------------
ID==25
   i== 8   ID+i==33   not valid
   i==18   ID+i==43   not valid
   i== 7   ID+i==32   not valid
   i==17   ID+i==42   not valid
   i== 6   ID+i==31   not valid
   i==16   ID+i==41   not valid
   i== 5   ID+i==30   not valid
   i==15   ID+i==40   not valid
--------------------------
ID==26  not valid
--------------------------
ID==27  not valid
--------------------------
ID==28  not valid
--------------------------
ID==29  not valid
-----------------
ID==30  already seen, not examined
-----------------
ID==31  already seen, not examined
-----------------
ID==32  already seen, not examined
-----------------
ID==33  already seen, not examined
--------------------------
ID==34  not valid
--------------------------
ID==35  not valid
--------------------------
ID==36  not valid
--------------------------
ID==37  not valid
--------------------------
ID==38  not valid
--------------------------
ID==39  not valid
-----------------
ID==40  already seen, not examined
-----------------
ID==41  already seen, not examined
-----------------
ID==42  already seen, not examined
-----------------
ID==43  already seen, not examined
--------------------------
ID==44  not valid
--------------------------
ID==45  not valid
--------------------------
ID==46  not valid
--------------------------
ID==47  not valid
--------------------------
ID==48  not valid
--------------------------
ID==49  not valid
--------------------------
ID==50  vaaaalid
   i== 8   ID+i==58   not valid
   i==18   ID+i==68   not valid
   i== 7   ID+i==57   not valid
   i==17   ID+i==67   not valid
   i== 6   ID+i==56   not valid
   i==16   ID+i==66   not valid
   i== 5   ID+i==55   not valid
   i==15   ID+i==65   not valid
--------------------------
ID==51  not valid
--------------------------
ID==52  vaaaalid
   i== 8   ID+i==60   valid
--------------------------
ID==60
   i== 8   ID+i==68   not valid
   i==18   ID+i==78   not valid
   i== 7   ID+i==67   not valid
   i==17   ID+i==77   not valid
   i== 6   ID+i==66   not valid
   i==16   ID+i==76   not valid
   i== 5   ID+i==65   not valid
   i==15   ID+i==75   not valid
--------------------------
ID==61  not valid
--------------------------
ID==62  not valid
--------------------------
ID==63  not valid
--------------------------
ID==64  not valid
-----------------
ID==65  already seen, not examined
-----------------
ID==66  already seen, not examined
-----------------
ID==67  already seen, not examined
-----------------
ID==68  already seen, not examined
--------------------------
ID==69  not valid
--------------------------
ID==70  not valid
--------------------------
ID==71  not valid
--------------------------
ID==72  not valid
--------------------------
ID==73  not valid
--------------------------
ID==74  not valid
-----------------
ID==75  already seen, not examined
-----------------
ID==76  already seen, not examined
-----------------
ID==77  already seen, not examined
-----------------
ID==78  already seen, not examined
--------------------------
ID==79  not valid
--------------------------
ID==80  not valid
--------------------------
ID==81  not valid
--------------------------
ID==82  not valid
--------------------------
ID==83  vaaaalid
   i== 8   ID+i==91   not valid
   i==18   ID+i==101   not valid
   i== 7   ID+i==90   not valid
   i==17   ID+i==100   not valid
   i== 6   ID+i==89   not valid
   i==16   ID+i==99   not valid
   i== 5   ID+i==88   not valid
   i==15   ID+i==98   valid
--------------------------
ID==98
   i== 8   ID+i==106   not valid
   i==18   ID+i==116   not valid
   i== 7   ID+i==105   not valid
   i==17   ID+i==115   not valid
   i== 6   ID+i==104   not valid
   i==16   ID+i==114   not valid
   i== 5   ID+i==103   not valid
   i==15   ID+i==113   not valid
-----------------
ID==99  already seen, not examined
-----------------
ID==100  already seen, not examined

==========================
ID==101
   ID>lastResult is True : program STOPS

.

And here's the code based on my idea:

import time

lastResult = 100

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    runs = 0
    maxdiff = max(diff)
    others = [x for x in xrange(1,maxdiff) if x not in diff]
    lastothers = others[-1]
    SEEN = set()

    while True:
        if ID>lastResult:
            print ('\n=========================='
                   '\nID==%s'
                   '\n   ID>lastResult is %s : program STOPS')\
                  % (ID,ID>lastResult,)
            break
        runs += 1
        if runs % 10 == 0:  time.sleep(0.5)
        if ID in SEEN:
            print '-----------------\nID=='+str(ID)+'  already seen, not examined'
            ID += 1
        else:
            curRes = isValid(ID)
            if curRes:
                print '------------------------------------\nID=='+str(ID)+'  vaaaalid'
                while True:
                    for i in diff:
                        runs += 1
                        if runs % 10 == 0:  time.sleep(0.5)
                        curRes = isValid(ID+i)
                        SEEN.add(ID+i)
                        if curRes:
                            print '   i==%2s   ID+i==%s   valid' % (i,ID+i)
                            ID += i
                            print '--------------------------\nID==%s' % str(ID)
                            break
                        else:
                            print '   i==%2s   ID+i==%s   not valid' % (i,ID+i)
                    else:
                        for j in others:
                            if ID+j>lastResult:
                                print '\n   j==%2s   %s+%s==%s>lastResult==%s is %s' \
                                      % (j,ID,j,ID+j,lastResult,ID+j>lastResult)
                                ID += j
                                print '\n--------------------------\nnow ID==',ID
                                break
                            runs += 1
                            if runs % 10 == 0:  time.sleep(0.5)
                            curRes = isValid(ID+j)
                            SEEN.add(ID+j)
                            if curRes:
                                print '   j==%2s   ID+j==%s   valid' % (j,ID+j)
                                ID += j
                                print '--------------------------\nID=='+str(ID)
                                break
                            else:
                                print '   j==%2s   ID+j==%s   not valid' % (j,ID+j)

                        if j==lastothers:
                            ID += maxdiff + 1
                            print '   ID += %s + 1 ==> ID==%s' % (maxdiff,ID)
                            break
                        elif ID>lastResult:
                            print '   ID>lastResult==%s>%s is %s ==> WILL STOP' % (ID,lastResult,ID>lastResult)
                            break

            else:
                print '-------------------------\nID=='+str(ID)+'  not valid'
                ID += 1




def isValid(ID, valid_ones = (1,9,17,25,50,52,60,83,97,98)):
    return ID in valid_ones


checkNextID(0)

result

-------------------------
ID==0  not valid
------------------------------------
ID==1  vaaaalid
   i== 8   ID+i==9   valid
--------------------------
ID==9
   i== 8   ID+i==17   valid
--------------------------
ID==17
   i== 8   ID+i==25   valid
--------------------------
ID==25
   i== 8   ID+i==33   not valid
   i==18   ID+i==43   not valid
   i== 7   ID+i==32   not valid
   i==17   ID+i==42   not valid
   i== 6   ID+i==31   not valid
   i==16   ID+i==41   not valid
   i== 5   ID+i==30   not valid
   i==15   ID+i==40   not valid
   j== 1   ID+j==26   not valid
   j== 2   ID+j==27   not valid
   j== 3   ID+j==28   not valid
   j== 4   ID+j==29   not valid
   j== 9   ID+j==34   not valid
   j==10   ID+j==35   not valid
   j==11   ID+j==36   not valid
   j==12   ID+j==37   not valid
   j==13   ID+j==38   not valid
   j==14   ID+j==39   not valid
   ID += 18 + 1 ==> ID==44
-------------------------
ID==44  not valid
-------------------------
ID==45  not valid
-------------------------
ID==46  not valid
-------------------------
ID==47  not valid
-------------------------
ID==48  not valid
-------------------------
ID==49  not valid
------------------------------------
ID==50  vaaaalid
   i== 8   ID+i==58   not valid
   i==18   ID+i==68   not valid
   i== 7   ID+i==57   not valid
   i==17   ID+i==67   not valid
   i== 6   ID+i==56   not valid
   i==16   ID+i==66   not valid
   i== 5   ID+i==55   not valid
   i==15   ID+i==65   not valid
   j== 1   ID+j==51   not valid
   j== 2   ID+j==52   valid
--------------------------
ID==52
   i== 8   ID+i==60   valid
--------------------------
ID==60
   i== 8   ID+i==68   not valid
   i==18   ID+i==78   not valid
   i== 7   ID+i==67   not valid
   i==17   ID+i==77   not valid
   i== 6   ID+i==66   not valid
   i==16   ID+i==76   not valid
   i== 5   ID+i==65   not valid
   i==15   ID+i==75   not valid
   j== 1   ID+j==61   not valid
   j== 2   ID+j==62   not valid
   j== 3   ID+j==63   not valid
   j== 4   ID+j==64   not valid
   j== 9   ID+j==69   not valid
   j==10   ID+j==70   not valid
   j==11   ID+j==71   not valid
   j==12   ID+j==72   not valid
   j==13   ID+j==73   not valid
   j==14   ID+j==74   not valid
   ID += 18 + 1 ==> ID==79
-------------------------
ID==79  not valid
-------------------------
ID==80  not valid
-------------------------
ID==81  not valid
-------------------------
ID==82  not valid
------------------------------------
ID==83  vaaaalid
   i== 8   ID+i==91   not valid
   i==18   ID+i==101   not valid
   i== 7   ID+i==90   not valid
   i==17   ID+i==100   not valid
   i== 6   ID+i==89   not valid
   i==16   ID+i==99   not valid
   i== 5   ID+i==88   not valid
   i==15   ID+i==98   valid
--------------------------
ID==98
   i== 8   ID+i==106   not valid
   i==18   ID+i==116   not valid
   i== 7   ID+i==105   not valid
   i==17   ID+i==115   not valid
   i== 6   ID+i==104   not valid
   i==16   ID+i==114   not valid
   i== 5   ID+i==103   not valid
   i==15   ID+i==113   not valid
   j== 1   ID+j==99   not valid
   j== 2   ID+j==100   not valid

   j== 3   98+3==101>lastResult==100 is True

--------------------------
now ID== 101
   ID>lastResult==101>100 is True ==> WILL STOP

==========================
ID==101
   ID>lastResult is True : program STOPS

There is

    if ID in SEEN:
        print '-----------------\nID=='+str(ID)+'  already seen, not examined'
        ID += 1

in this code, but the message 'already seen' is never printed during the execution; however, the detection of valids ID has the same result; that means that the use of a set SEEN isn't necessary in my code.

.

Edit 1

The code #1 with instruction to regularly empty SEEN

import time

lastResult = 100

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    runs = 0
    SEEN = set()
    while True:
        if ID>lastResult:
            print ('\n=========================='
                   '\nID==%s'
                   '\n   ID>lastResult is %s : program STOPS')\
                  % (ID,ID>lastResult,)
            break
        runs += 1
        if runs % 10 == 0:  time.sleep(0.5)
        if ID in SEEN:
            print '-----------------\n%s\nID==%s  already seen, not examined' % (SEEN,ID)
            ID += 1
        else:
            curRes = isValid(ID)
            if curRes:
                print '--------------------------\n%s\nID==%s  vaaaalid'  % (SEEN,ID)
                while True:
                    for i in diff:
                        runs += 1
                        if runs % 10 == 0:  time.sleep(0.5)
                        curRes = isValid(ID+i)
                        print '   '+str(SEEN)
                        if i==diff[0]:
                            SEEN = set([ID+i])
                        else:
                            SEEN.add(ID+i)
                        if curRes:
                            print '   i==%2s   ID+i==%s   valid' % (i,ID+i)
                            ID += i
                            print '--------------------------\nID==%s' % str(ID)
                            break
                        else:
                            print '   i==%2s   ID+i==%s   not valid' % (i,ID+i)
                    else:
                        ID += 1
                        break
            else:
                print '--------------------------\n%s\nID==%s  not vaaaaalid' % (SEEN,ID)
                ID += 1


def isValid(ID, valid_ones = (1,9,17,25,30,50,52,60,83,97,98)):
    return ID in valid_ones


checkNextID(0)

result

--------------------------
set([])
ID==0  not vaaaaalid
--------------------------
set([])
ID==1  vaaaalid
   set([])
   i== 8   ID+i==9   valid
--------------------------
ID==9
   set([9])
   i== 8   ID+i==17   valid
--------------------------
ID==17
   set([17])
   i== 8   ID+i==25   valid
--------------------------
ID==25
   set([25])
   i== 8   ID+i==33   not valid
   set([33])
   i==18   ID+i==43   not valid
   set([33, 43])
   i== 7   ID+i==32   not valid
   set([32, 33, 43])
   i==17   ID+i==42   not valid
   set([32, 33, 42, 43])
   i== 6   ID+i==31   not valid
   set([32, 33, 42, 43, 31])
   i==16   ID+i==41   not valid
   set([32, 33, 41, 42, 43, 31])
   i== 5   ID+i==30   valid
--------------------------
ID==30
   set([32, 33, 41, 42, 43, 30, 31])
   i== 8   ID+i==38   not valid
   set([38])
   i==18   ID+i==48   not valid
   set([48, 38])
   i== 7   ID+i==37   not valid
   set([48, 37, 38])
   i==17   ID+i==47   not valid
   set([48, 37, 38, 47])
   i== 6   ID+i==36   not valid
   set([48, 36, 37, 38, 47])
   i==16   ID+i==46   not valid
   set([36, 37, 38, 46, 47, 48])
   i== 5   ID+i==35   not valid
   set([35, 36, 37, 38, 46, 47, 48])
   i==15   ID+i==45   not valid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==31  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==32  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==33  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==34  not vaaaaalid
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==35  already seen, not examined
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==36  already seen, not examined
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==37  already seen, not examined
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==38  already seen, not examined
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==39  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==40  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==41  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==42  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==43  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==44  not vaaaaalid
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==45  already seen, not examined
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==46  already seen, not examined
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==47  already seen, not examined
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==48  already seen, not examined
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==49  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==50  vaaaalid
   set([35, 36, 37, 38, 45, 46, 47, 48])
   i== 8   ID+i==58   not valid
   set([58])
   i==18   ID+i==68   not valid
   set([58, 68])
   i== 7   ID+i==57   not valid
   set([57, 58, 68])
   i==17   ID+i==67   not valid
   set([57, 58, 67, 68])
   i== 6   ID+i==56   not valid
   set([56, 57, 58, 67, 68])
   i==16   ID+i==66   not valid
   set([66, 67, 68, 56, 57, 58])
   i== 5   ID+i==55   not valid
   set([66, 67, 68, 55, 56, 57, 58])
   i==15   ID+i==65   not valid
--------------------------
set([65, 66, 67, 68, 55, 56, 57, 58])
ID==51  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 55, 56, 57, 58])
ID==52  vaaaalid
   set([65, 66, 67, 68, 55, 56, 57, 58])
   i== 8   ID+i==60   valid
--------------------------
ID==60
   set([60])
   i== 8   ID+i==68   not valid
   set([68])
   i==18   ID+i==78   not valid
   set([68, 78])
   i== 7   ID+i==67   not valid
   set([67, 68, 78])
   i==17   ID+i==77   not valid
   set([67, 68, 77, 78])
   i== 6   ID+i==66   not valid
   set([66, 67, 68, 77, 78])
   i==16   ID+i==76   not valid
   set([66, 67, 68, 76, 77, 78])
   i== 5   ID+i==65   not valid
   set([65, 66, 67, 68, 76, 77, 78])
   i==15   ID+i==75   not valid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==61  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==62  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==63  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==64  not vaaaaalid
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==65  already seen, not examined
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==66  already seen, not examined
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==67  already seen, not examined
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==68  already seen, not examined
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==69  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==70  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==71  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==72  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==73  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==74  not vaaaaalid
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==75  already seen, not examined
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==76  already seen, not examined
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==77  already seen, not examined
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==78  already seen, not examined
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==79  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==80  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==81  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==82  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==83  vaaaalid
   set([65, 66, 67, 68, 75, 76, 77, 78])
   i== 8   ID+i==91   not valid
   set([91])
   i==18   ID+i==101   not valid
   set([91, 101])
   i== 7   ID+i==90   not valid
   set([90, 91, 101])
   i==17   ID+i==100   not valid
   set([90, 91, 100, 101])
   i== 6   ID+i==89   not valid
   set([89, 90, 91, 100, 101])
   i==16   ID+i==99   not valid
   set([99, 100, 101, 89, 90, 91])
   i== 5   ID+i==88   not valid
   set([99, 100, 101, 88, 89, 90, 91])
   i==15   ID+i==98   valid
--------------------------
ID==98
   set([98, 99, 100, 101, 88, 89, 90, 91])
   i== 8   ID+i==106   not valid
   set([106])
   i==18   ID+i==116   not valid
   set([106, 116])
   i== 7   ID+i==105   not valid
   set([105, 106, 116])
   i==17   ID+i==115   not valid
   set([105, 106, 115, 116])
   i== 6   ID+i==104   not valid
   set([104, 105, 106, 115, 116])
   i==16   ID+i==114   not valid
   set([104, 105, 106, 114, 115, 116])
   i== 5   ID+i==103   not valid
   set([103, 104, 105, 106, 114, 115, 116])
   i==15   ID+i==113   not valid
--------------------------
set([103, 104, 105, 106, 113, 114, 115, 116])
ID==99  not vaaaaalid
--------------------------
set([103, 104, 105, 106, 113, 114, 115, 116])
ID==100  not vaaaaalid

==========================
ID==101
   ID>lastResult is True : program STOPS

This above code shows the process concerning the recording and emptying the values of ID already seen. It is a code as good as it is conceivable, since the algorithm comprises the periodical emptying of SEEN, because it 's true that emptying is necessary, given the number of IDs to te be tested.

But from the beginning, my opinion was that this process concernin SEEN in this algorithm weighs on the performance, because recording and testing instructions concerning a SEEN are repeatedly executed at every step of the program.

That's why I imagined there should be another algorithm having not this drawback. I finally succeeded to write such an alternative code, and now we have two codes having two different algorithms.

Concerning your question "are you sure it's not necessary to use the SEEN logic in the second?"
I answer 'yes, I think I can be sure'. The intent of runing my code #2 with instructions managing SEEN was to make me sure after verifying what was a mental idea and a conceptual algorithm. If you want to become sure, you must do the same: - study conceptually and precisely the algorithm - do as many essays of execution of the two codes and comparison of their results as you need to be experimentally convinced, varying the values of lastResult, valid_ones and diff For me , this point is closed as long as there won't be a contradictory practical case proving that my conclusion is wrong.

I continue in the other answer, because the number of characters is limited in this one


Stating an identifier as being global is done when one wants to provoke a modification concerning it when it is outside the function from which the modification is performed.

Hence it is an aberration to make lastResult and curRes global:

  • the first, lastResult, because it is a constant in your complete code. The best is to define a parameter lastResult of the function checkNextID() with the value of lastResult as default argument.

  • the second, curRes, because there is no modification concerning this identifier in checkNextID()

Now, defining curRes as global in the function isValid() is also a bad practice, though not an aberration: 1) a new value for curRes is sent from the inside of isValid() to the outside of it; 2) then, the program goes outside the function checkNextID() to search for the value of curRes. That's a weird and useless detour, you could let curRes be a free variable (see doc ) in the function checkNextID() and this one will automatically go outside to resolve this name and to obtain its value.

.

Personally , I prefer to reorganise the general algorithm. In my following code, curRes is defined as a local object, taking directly its value from the return of the function isValid() . That requires to redefine isValid(): in my code isValid() returns the object soup or False

I hope I understood your need. Say me what's wrong in my approach, please.

def checkNextID(ID, lastResult = lastResult, diff = [0,1,5,6,7,8,15,16,17,18]):
    runs = 0
    maxdiff = max(diff)
    diff.extend(x for x in xrange(maxdiff) if x not in diff)
    while True:
        for i in diff:
            if ID+i==lastResult:  break
            runs += 1
            if runs % 10 == 0:  time.sleep(6)
            curRes = isValid(ID+i):
            if cuRes:
                parseHTML(curRes, ID+i)
                ID = ID + i
                break
        else:
            runs += 1
            ID += maxdiff + 1
            if ID==lastResult:  break






def isValid(ID, urlhead = urlPath):
    # this function return either False OR a BeautifulSoup instance
    try:
        page = getPAGE(urlhead + str(ID))
        if page == False:  return False
    except Exception, e:
        print "An error occured in the first Exception block of parseHTML : " + str(e) +' address: ' + address
    else:
        try:
            soup = BeautifulSoup(page)
        except TypeError, e:
            print "An error occured in the second Exception block of parseHTML : " + str(e) +' address: ' + address
            return False
        try:
            companyID = soup.find('span',id='lblCompanyNumber').string
            if (companyID == None): #if lblCompanyNumber is None we can assume that we don't have the content we want, save in the bad log file
                saveToCsv(ID, isEmpty = True)
                return False
            else:
                return soup #we have the data we need, save the soup obj to a global variable
        except Exception, e:
            print "Error while parsing this page, third exception block: " + str(e) + ' id: ' + address
            return False

.

Also, to speed up your program:

  • you should use regex tool (module re) instead of BeautifulSoup that is roughly 10 times slower than the use of a regex

  • you shouldn't define and use all these functions in checkNextID (saveToCSV, parseHTML, isValid) : each call to a function takes an additional amount of time comparatively to a direct code

.

Final Edit

To conclude this long study of your problem, I did a benchmark. Here after follow the codes and the results that show that my intuition is borne out: my code #2 takes at least 20 % less time to run than your code #1 . Your code #1:

from time import clock

lastResult = 200

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    SEEN = set()
    li = []

    while True:
        if ID>lastResult:
            break
        if ID in SEEN:
            ID += 1
        else:
            curRes = isValid(ID)
            if curRes:
                li.append(ID)
                while True:
                    for i in diff:
                        curRes = isValid(ID+i)
                        if i==diff[0]:
                            SEEN = set([ID+i])
                        else:
                            SEEN.add(ID+i)
                        if curRes:
                            li.append(ID+i)
                            ID += i
                            break
                    else:
                        ID += 1
                        break
            else:
                ID += 1

    return li


def isValid(ID, valid_ones = (1,9,17,25,30,50,52,60,83,97,98,114,129,137,154,166,175,180,184,200)):
    return ID in valid_ones

te = clock()
for i in xrange(10000):
    checkNextID(0)
print clock()-te,'seconds'
print checkNextID(0)

My code #2

from time import clock

lastResult = 200

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    maxdiff = max(diff)
    others = [x for x in xrange(1,maxdiff) if x not in diff]
    lastothers = others[-1]

    li = []

    while True:
        if ID>lastResult:
            break
        else:
            curRes = isValid(ID)
            if curRes:
                li.append(ID)
                while True:
                    for i in diff:
                        curRes = isValid(ID+i)
                        if curRes:
                            li.append(ID+i)
                            ID += i
                            break
                    else:
                        for j in others:
                            if ID+j>lastResult:
                                ID += j
                                break
                            curRes = isValid(ID+j)
                            if curRes:
                                li.append(ID+j)
                                ID += j
                                break

                        if j==lastothers:
                            ID += maxdiff + 1
                            break
                        elif ID>lastResult:
                            break

            else:
                ID += 1

    return li




def isValid(ID, valid_ones = (1,9,17,25,30,50,52,60,83,97,98,114,129,137,154,166,175,180,184,200)):
    return ID in valid_ones

te = clock()
for i in xrange(10000):
    checkNextID(0)
print clock()-te,'seconds'
print checkNextID(0)

Results:

your code
0.398804596674 seconds
[1, 9, 17, 25, 30, 50, 52, 60, 83, 98, 114, 129, 137, 154, 166, 184, 200]

my code
0.268061164198 seconds
[1, 9, 17, 25, 30, 50, 52, 60, 83, 98, 114, 129, 137, 154, 166, 184, 200]

0.268061164198 / 0.398804596674 = 67.3 %

I've tried also with lastResult = 100 , I got 72 % .
And with lastResult = 480, I got 80 %.


First of all, you should break up the job into two processes. One to determine valid ids, and the other to retrieve data.

The one that determines valid ids only needs to use http HEAD commands and can work faster than the one that retrieves pages.

For checking pages, after you check the id increments in diff, then add 18 to the id that caused you to start using diff. You can even record the ranges that were only partially checked by using diff, and come back later, at the end of the process, and check all of them as well.

If you can't skip any ids, then keep a cache of the last n ids that were checked where n is equal to len(diff). Use a ring buffer something like this:

nextelem = 0
...
# check before retrieving
if not id in ringbuff:
    #retrieve an id
    ringbuf[nextelem] = id
    nextelem += 1
    if nextelem > len[ringbuff]:
        nextelem = 0

...

On the surface of it, a simple loop like this should check all ids:

for id in xrange(1000000):
    checkpage(id)

This would check every possible page. But you want to read ahead when you get a hit, and also backtrack partially if I understand correctly. In any case, what you are fundamentally doing is changing the range of ids from the simple sequence returned by xrange, so I think you need to write a generator and do this instead:

for id in myrange(1000000):
    checkpage(id)

You might still want to use a ringbuffer, depending on what you do within that range of 18 possible additional hits. If you need to check all of the possibilities in diff and then go back to something less than the maximum element in diff, then the ring buffer would be useful in checkpage.

But the trick is to write myrange().

def myrange(maxnum):
    global hitfound
    global nextnum
    global diff
    curnum = 0
    while curnum < maxnum:
        yield curnum
        if hitfound:
            nextnum = curnum
            hitnum = curnum
            for e in diff:
                yield hitnum + e
            curnum = nextnum - 1
        curnum += 1

The three global variables let you influence the range of ids. If you set hitfound = True inside checkpage() whenever you get a good page, then you influence myrange to start applying the increments in diff. Then, you can set nextnum to influence where it starts incrementing after you start applying the diff increments. For instance you might decide to set it to 1 greater than the first (or last) hit that you find while checking diff increments. Or you could leave it alone, and use the ring buffer to ensure that you don't request any of the diff increment pages again.

I suggest that you extract the id incrementing logic and test it separately like my code above. Tweak the generator myrange() until it produces the right sequence, and then pop it into your web scraping program.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜