Must I apply database normalization when I have a lot more insertions than queries?
I built a web crawler that inserts pages and links into a database. At the moment, the domain of the crawled URL is an attribute in both the pages table and the links table.
I'm thinking about creating a separate table for the domains, but I fear that this will slow down insertion.
At the moment I have 1,200,000 links downloaded and 70,000 pages in the database, and these numbers will keep growing.
What is the better solution: create the domain table, or create an index on the domain attribute (it's a varchar)?
PS: Another program that I'm developing will run queries against this database.
If I understood correctly, you have two tables: "links" and "pages". You say nothing about the fields within those tables; more information would be nice.
Anyhow, a fully normalized database tends to erode performance. I would suggest keeping the domain as an attribute in both tables. A little redundancy might improve your performance.
One more piece of advice: instead of having one database, you might want to have two: one for inserts and updates only, and the other for read-only access (selects).
In the first DB, remove all indexes and constraints. This will give you fast insert/update operations.
In the read-only DB, design indexes properly to make the retrieval operations faster.
Of course, you need to synchronize the two databases somehow. This might require some extra coding.
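As a rough illustration of that sync step, assuming MySQL with both schemas on the same server and hypothetical names (crawl_db / report_db, a pages table with an auto-increment id, and a @last_synced_id value tracked by the sync job), the copy could be as simple as an INSERT ... SELECT:

    -- Copy rows crawled since the last sync from the write-optimized schema
    -- into the indexed, read-only schema; repeat the same pattern for links.
    INSERT INTO report_db.pages (id, url, domain)
    SELECT id, url, domain
    FROM crawl_db.pages
    WHERE id > @last_synced_id;

Run something like this on a schedule (cron or a MySQL event) and record the last copied id afterwards.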
You will probably have to do some playing around to see what kind of results you get from the different methods. How many different domains do you have?
Keep in mind that if you create an index on the domain attribute, it will actually slow down your inserts. Indexes are good for improving select performance, but they slow down insert/update/delete operations because the index is an extra structure that has to be updated on every write.
I would personally put the domains in a separate table if there is a relatively small number of them.
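For reference, the index-only option mentioned above might look like this, assuming hypothetical names (a links table with a domain varchar column); the same statement on the pages table carries the same insert-time cost:

    -- Speeds up SELECT ... WHERE domain = '...' queries, but every INSERT
    -- now also has to maintain the B-tree on `domain`.
    CREATE INDEX idx_links_domain ON links (domain);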
Assuming your database design is like so:
Page:
Id | URL
Link:
Id | Page_Id | URL
If there is a lot of re-use of URLs (like for TVTropes), I would most likely reformat the design to:
Domain:
Id | URL
Page:
Id | URL_Id
Link:
Id | Page_Id | URL_Id
When you go to do your data mining, I'd then recommend an index on URL, in addition to all the usual ones.
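A rough DDL sketch of that normalized layout, assuming MySQL/InnoDB and hypothetical column sizes (the Domain table holds the de-duplicated URL strings):

    CREATE TABLE Domain (
        Id  INT AUTO_INCREMENT PRIMARY KEY,
        URL VARCHAR(255) NOT NULL
    );

    CREATE TABLE Page (
        Id     INT AUTO_INCREMENT PRIMARY KEY,
        URL_Id INT NOT NULL,
        FOREIGN KEY (URL_Id) REFERENCES Domain (Id)
    );

    CREATE TABLE Link (
        Id      INT AUTO_INCREMENT PRIMARY KEY,
        Page_Id INT NOT NULL,
        URL_Id  INT NOT NULL,
        FOREIGN KEY (Page_Id) REFERENCES Page (Id),
        FOREIGN KEY (URL_Id)  REFERENCES Domain (Id)
    );

    -- The index recommended for the data-mining queries.
    CREATE INDEX idx_domain_url ON Domain (URL);

Note that the foreign keys themselves add a little insert-time work; they can be dropped if that proves too costly.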
If space is becoming an issue (more than insert or retrieval times), and there are numerous levels to your URLs (deep folder structures), you could try this -
Domain:
Id | Parent_Id | URL_Part
Page:
Id | URL_Id
Link:
Id | Page_Id | URL_Id
This will of course need a recursive query to assemble the URL, but the data-mining prospects for this are immense. Without knowing more about your actual design (and your intended use), there's not a lot more I can really propose, though.
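A sketch of that recursive query, assuming a DBMS with recursive CTE support (MySQL 8+ or PostgreSQL) and the Domain (Id, Parent_Id, URL_Part) table above, with a NULL Parent_Id marking the top-level part:

    -- Walk from each top-level part down through its children, concatenating
    -- the parts into a full URL along the way. The CAST widens the column so
    -- MySQL does not truncate the concatenated string.
    WITH RECURSIVE full_url AS (
        SELECT Id, CAST(URL_Part AS CHAR(1000)) AS url
        FROM Domain
        WHERE Parent_Id IS NULL
        UNION ALL
        SELECT d.Id, CONCAT(f.url, '/', d.URL_Part)
        FROM Domain d
        JOIN full_url f ON d.Parent_Id = f.Id
    )
    SELECT Id, url
    FROM full_url;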
I don't see why you wouldn't normalize.
Certainly, this will slightly affect the performance of insertions, but I would hope that the bottleneck (and/or the throttling) would be at the level of the page downloads. If that were not the case, it would indicate that you're whacking the h' out of the Internet! ;-)
Typical crawlers [outside of those used by the big search engines, of course], even when run on multiple threads and even on several machines, only produce, in total and sustained fashion, a few dozen pages per second, which is well below the capability of most DBMS servers, even with a bit of contention.
Also one would expect the domains table to be relatively small and accessed frequently, mostly reading, and hence generally cached.
I would only consider denormalization and other tricks in the case of:
- a much higher sustained insertion rate
- a bigger database (say, if it is expected to grow above 100 million rows)