开发者

Storing a List into Python Sqlite3

I am trying to scrape form field IDs using Beautiful Soup like this

 for link in BeautifulSoup(content, parseOnlyThese=SoupStrainer('input')):
    if link.has_key('id'):
        print link['id']

Lets us assume that it returns something like

username
email
password
passwordagain
terms
button_register

I would like to write this into Sqlite3 DB.

What I will be doing down the line in my application is... Use these form fields' IDs and try to do a POST may be. The problem is.. there are plenty of sites like this whose form field IDs I have scraped. So the relation is like this...

Doma开发者_StackOverflow社区in1 - First list of Form Fields for this Domain1
Domain2 - Second list of Form Fields for this Domain2
.. and so on

What I am unsure here is... How should I design my column for this kind of purpose? Will it be OK if I just create a table with two columns - say

COL 1 - Domain URL (as TEXT)
COL 2 - List of Form Field IDs (as TEXT)

One thing to be remembered is... Down the line in my application I will need to do something like this...

Pseudocode

If Domain is "http://somedomain.com":
    For ever item in the COL2 (which is a list of form field ids):
         Assign some set of values to each of the form fields & then make a POST request

Can any one guide, please?

EDITed on 22/07/2011 - Is My Below Database Design Correct?

I have decided to have a solution like this. What do you guys think?

I will be having three tables like below

Table 1

Key Column (Auto Generated Integer) - Primary Key
Domain as TEXT

Sample Data would be something like:

1   http://url1.com
2   http://url2.com
3   http://url3.com

Table 2

Domain (Here I will be using the Key Number from Table 1)
RegLink - This will have the registeration link (as TEXT)
Form Fields (as Text)

Sample Data would be something like:

1   http://url1.com/register    field1
1   http://url1.com/register    field2
1   http://url1.com/register    field3
2   http://url2.com/register    field1
2   http://url2.com/register    field2
2   http://url2.com/register    field3
3   http://url3.com/register    field1
3   http://url3.com/register    field2
3   http://url3.com/register    field3

Table 3

Domain (Here I will be using the Key Number from Table 1)
Status (as TEXT)
User (as TEXT)
Pass (as TEXT)

Sample Data would be something like:

1   Pass    user1   pass1
2   Fail    user2   pass2
3   Pass    user3   pass3

Do you think this table design is good? Or are there any improvements that can be made?


There is a normalization problem in your table.

Using 2 tables with

TABLE domains
int id primary key
text name

TABLE field_ids
int id primary key
int domain_id foreign key ref domains
text value

is a better solution.


Proper database design would suggest you have a table of URLs, and a table of fields, each referenced to a URL record. But depending on what you want to do with them, you could pack lists into a single column. See the docs for how to go about that.

Is sqlite a requirement? It might not be the best way to store the data. E.g. if you need random-access lookups by URL, the shelve module might be a better bet. If you just need to record them and iterate over the sites, it might be simpler to store as CSV.


Try this to get the ids:

ids = (link['id'] for link in
        BeautifulSoup(content, parseOnlyThese=SoupStrainer('input')) 
         if link.has_key('id'))

And this should show you how to save them, load them, and do something to each. This uses a single table and just inserts one row for each field for each domain. It's the simplest solution, and perfectly adequate for a relatively small number of rows of data.

from itertools import izip, repeat
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('''create table domains
(domain text, linkid text)''')

domain_to_insert = 'domain_name'
ids = ['id1', 'id2']
c.executemany("""insert into domains
      values (?, ?)""", izip(repeat(domain_to_insert), ids))
conn.commit()

domain_to_select = 'domain_name'
c.execute("""select * from domains where domain=?""", (domain_to_select,))

# this is just an example
def some_function_of_row(row):
    return row[1] + ' value'

fields = dict((row[1], some_function_of_row(row)) for row in c)
print fields
c.close()
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜