How to put a file path or a URL into a database?
The naive way would be to put the whole path into the DB as a string and it wo开发者_如何学编程uld work for toy DBs. However, this approach have a couple of flaws. For example, say I have 100K files under /var/www/sites/ then storing /var/www/sites 100K times in the DB is very inefficient. I am sure there is a much better way to do this.
I would like to index only the file paths on a DVD and later search for mp3 files or for directories, etc. The preferred RDBMS is SQLite (perhaps FTS Tables?). My goal is to learn, I know there are a bunch of desktop search engines for this.
The naive way would be to put the whole path into the DB as a string and it would work for toy DBs. However this approach yields a non-normalized DB.
Who told you that ? That is the most ridiculous thing I have heard in a long time. Get rid of them as soon as you can, and do not pay them for such absurd "advice".
Short Answer
That's as absurd as saying, if you store phone numbers or addresses in the database in their raw form, it is naïve and not normalised.
Put your URLs in a single column in the database (high end or low end). It does not break Normalisation rules. (Assuming of course that the database is normalised in other respects.)
Long Answer
Let's look at two counterpoints.
Some people do not understand that Normalisation is a Principle. Sure, in the application of that principle in databases, we have Normal Forms, and you either comply with or break Normal Forms. But that isn't the whole principle. You could just as easily have a shocking database because it is not Normalised, even though it can be in 3NF.
Let's say you have a Customer table that has a set of columns that make up "address". And a Supplier table that also has the same (hopefully exactly the same) columns that make up "address". As long as the Functional Dependencies have been resolved, that is correct, there is nothing is the Normal Forms that will identify that it does not meet 3NF or 5NF. Such a database will be fine. But a good designer (as opposed to a qualified but inexperienced one) will Normalise the "address" columns into a separate Address table, and place an FK to it in the Customer and Supplier tables. That designer gives you a more Normalised database, that is even easier to maintain, but it is still in the same 3NF or 5NF as before.
For the newbie Normalisator, they need to Normalise everything. They forget the purpose of the database, and Normalise to a degree that is beyond its purpose. By the same reasoning of the person who told you that, the "address" columns and contents of those columns are "not Normalised". As long as you have Washington St, Washington Blvd, Washington Lane, holy moley, "that is naïve and the database is not normalised". Absolute nonsense.
For the purpose of most databases, storing the street name and the street type in a single column is quite adequate. And if you had a good designer, sure they would implement a separate Address table. The multiple occurrences of "Washington" in the street names cannot be said to be "duplicates". But if you were the city council or the electricity utility, you would have a different purpose, in which case that would not be good enough, and yes, there you would Normalise the "address" column group to the nth degree, such the "Washington" or "Street" never repeats as a data value. And for that you need a very experienced designer. Only true for a small minority with a different purpose.
Therefore, if your database has the purpose of carefully analysing the entire content of the URLS, and reconstructing a tree or explorer style view, then by all means, build a directory structure in tables, which allows storage of each component of the URL, and a hierarchy, and never duplicate any component. But if your purpose is simply to store URLS like most people store addresses or phone numbers, then just store raw URLs like addresses or phone numbers. You can perform quite reasonable searches and match on component parts of the raw URL, to find MP3 files or whatever.
There is no "best" without a measurement of what the criteria is. There is no one-size-fits-all. The electricity utility database is "too complex" (too Normalised) for most purposes; the usual database is "inadequate" for the electricity utility. If you identify the purpose, the types of searches you require, that identifies the criteria against which "best" or "better" or "fails" can be measured.
Response to Comments
Your Edit has changed the landscape. While the usual level of Normalisation may be the adequate level for most people (it is therefore not "naïve"), you need something more, you are closer to the electricity utility, you need a Normalised Directory structure to store URLS or full Paths, and you need to remove duplication from the data values. Eg. /var, /www, /sites
, etc stored once.
Normalised Directory
No problem. That too, has been done many times. I have posted the exact requirement in another answer.
Rest assured that that exact structure runs in two large Enterprise class servers, and that generic structure runs in virtually every SQL database I have written over more than 25 years. It may look complex, but once you get your head around it, it is simple and flexible. Allows full recursion, etc.
You can ask questions in the comments here.
However this approach yields a non-normalized DB.
So what? 3rd NF isn't holy. Some forms of denormalization result in easier to understand datamodels. As long as the duplication doesn't cause problems in terms of database size or CPU-load while transforming/parsing the unnormalized values, I wouldn't worry about it.
This problem has a name: storing hierarchical data and has well-estabilished solutions
Managing Hierarchical Data in MySQL
One more Nested Intervals vs. Adjacency List comparison
How to store directory / hierarchy / tree structure in the database?
Optimized SQL for tree structures
I think a generally sensible approach is to remove any commonality from the stored paths. For example, if you only cared about items within /media/ then there's obviously not point in storing the /media/ portion of the path, as you can trivially add this when you need to re-constitute the full path.
精彩评论