
Should I apply database normalization when I have many more insertions than queries?

I wrote a web crawler that inserts pages and links into the database. At the moment, the domain of the crawled URL is an attribute in both the pages table and the links table.

I'm thinking about creating a table for the domains, but I fear that this will slow down insertion.

At the moment, I have 1,200,000 links and 70,000 pages in the database, and this will keep growing.

What is the better solution? Create a domains table? Create an index on the domain attribute (it's a varchar)?

PS: Another program that I am developing will run queries against this database.


If I understood correctly, you have two tables: "links" and "pages". You say nothing about the fields within those tables; more information would be nice.

Anyhow, a fully normalized database tends to erode performance. I would suggest keeping the domain as an attribute in both tables. A little redundancy might improve your performance.

One more piece of advice: instead of having one database, you might want to have two: one for inserts and updates only, and the other for read-only access (selects).

In the first DB, remove all indexes and constraints. This will give you fast insert/update operations.

In the read-only DB, design indexes properly to make the retrieval operations faster.

Of course, you need to synchronize the two databases somehow. This might require some extra coding.
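
As a minimal sketch of this split (the table and column names are my own illustration, not taken from the question): the write-side table carries no secondary indexes or constraints, while the read-side copy gets whatever indexes the queries need.

-- write-only database: bare table, nothing to maintain besides the rows
CREATE TABLE links_write (
    id      BIGINT        NOT NULL,
    page_id BIGINT        NOT NULL,
    url     VARCHAR(2048) NOT NULL,
    domain  VARCHAR(255)  NOT NULL
);

-- read-only database: same columns, plus the indexes the queries need
CREATE TABLE links_read (
    id      BIGINT        NOT NULL PRIMARY KEY,
    page_id BIGINT        NOT NULL,
    url     VARCHAR(2048) NOT NULL,
    domain  VARCHAR(255)  NOT NULL
);
CREATE INDEX idx_links_read_domain ON links_read (domain);
CREATE INDEX idx_links_read_page   ON links_read (page_id);

The synchronization would then be a periodic bulk copy from the first database into the second; that is the extra coding mentioned above.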


You will probably have to do some playing around to see what kind of results you get from the different methods. How many different domains do you have?

Keep in mind that if you create an index on the domain attribute it will actually slow down your inserts. Indexes are good for improving select performance but they slow down update/delete/insert operations because it is an extra thing that needs to get updated.

I would personally put the domains in a separate table if there are relatively few of them.
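
With a separate domains table, the cost the crawler pays at insert time is one lookup against that small table, roughly like this (the names, values, and exact upsert syntax are assumptions and vary by DBMS):

-- resolve the domain to its id; insert it only if it is not there yet
SELECT id FROM domains WHERE name = 'example.com';

-- only when the SELECT returned nothing:
INSERT INTO domains (name) VALUES ('example.com');

-- the big tables then store a small integer instead of a repeated varchar
INSERT INTO links (page_id, domain_id, url)
VALUES (42, 7, 'http://example.com/some/page.html');

Since the domains table stays small, that lookup will usually be served from cache, and the big links and pages tables shrink because each domain string is stored only once.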


Assuming your database design is like so:

Page: 
Id | URL

Link:
Id | Page_Id | URL

If there is a lot of re-use of URLs (like for TVTropes), I would most likely reformat the design to:

Domain:
Id | URL

Page:
Id | URL_Id

Link:
Id | Page_Id | URL_Id

When you go to do your datamining, I'd then recommend an index on URL, in addition to all the usual ones.
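
In SQL terms, that layout might look roughly like the following (types, lengths, and constraint choices are my assumptions; the final index is the one recommended above for the data mining):

CREATE TABLE domain (
    id  INT PRIMARY KEY,
    url VARCHAR(2048) NOT NULL
);

CREATE TABLE page (
    id     BIGINT PRIMARY KEY,
    url_id INT NOT NULL REFERENCES domain (id)
);

CREATE TABLE link (
    id      BIGINT PRIMARY KEY,
    page_id BIGINT NOT NULL REFERENCES page (id),
    url_id  INT    NOT NULL REFERENCES domain (id)
);

-- the index suggested for data mining
CREATE INDEX idx_domain_url ON domain (url);

Depending on the DBMS, a long varchar like this may need a prefix index instead of a full one.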

If space is becoming an issue (more than insert or retrieval times), and there are numerous levels to your URLs (deep folder structures), you could try this -

Domain:
Id | Parent_Id | URL_Part

Page:
Id | URL_Id

Link:
Id | Page_Id | URL_Id

This will of course need a recursive query to assemble the URL, but the datamining prospects for this are immense. Without knowing more about your actual design (and your intended use), there's not a lot more I can really propose, though.
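
For what it's worth, here is a sketch of that recursive query as a recursive CTE in a PostgreSQL-style dialect, using the column names of the second layout above; the page id (123), the '||' concatenation, and the type handling are illustrative (MySQL would use CONCAT):

-- rebuild the full URL of one page by walking the URL parts up to the root
WITH RECURSIVE url_parts (id, parent_id, path) AS (
    SELECT d.id, d.parent_id, CAST(d.url_part AS TEXT)
    FROM   domain d
    WHERE  d.id = (SELECT url_id FROM page WHERE page.id = 123)
    UNION ALL
    SELECT d.id, d.parent_id, d.url_part || '/' || u.path
    FROM   domain d
    JOIN   url_parts u ON u.parent_id = d.id
)
SELECT path
FROM   url_parts
WHERE  parent_id IS NULL;  -- the row that reached the root carries the assembled URL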


I don't see why you wouldn't normalize.
Certainly, this will slightly affect insertion performance, but I would hope that the bottleneck (and/or the throttling) would be at the level of the page downloads. If that were not the case, it would indicate that you're whacking the h' out of the Internet! ;-)
Typical crawlers [outside of those used by big SEs, of course], even when run on multiple threads and even on several machines, only produce, in total and sustained fashion, a few dozen pages per second, which is well below the capability of most DBMS servers, even with a bit of contention.

Also, one would expect the domains table to be relatively small and accessed frequently, mostly for reads, and hence to stay cached.

I would only consider denormalization and other tricks in the case of

  • a much higher sustained insertion rate
  • a much bigger database (say, if it is expected to grow above 100 million rows)