开发者

relational VS parametrized Data modeling when building semantic web applications?

Here is the summary of my question then i'll describe it in more details :

I read about using the parametrized data modeling method instead of using the standard relational data modeling when building semantic web application,i think we'll lose 90% of normalization if we used this method,If I want to design the database of my semantic web application should i use this way? what is the practical value ?


In More Details :

I've read a lot of articles around this, in this book "Programming the semantic web - Toby Segaran, Colin Evans, and Jamie Taylor" at page 14 they tell us to use parametrized Data modeling to get Semantic Relationships instead of the standard relational database described by this example:

in the standard Relational Database :

Venue : [ ID(PK), Name, Address ]

Restaurant : [ ID(PK), VenueID(FK), CuisineID]

Bar : [ ID(PK), VenueID(FK), DJ?, Specialty ]

Hours : [ VenueID(FK), Day, Open, Close ]

For Semantic Relationships : One table only !!! Fully parameterized venues

Properties : [ VenueID,Field, Value ] Example:

VenueID _ Field____Value

1__Cuisine__Deli

1__Price__ $

1__Name__Deli Llama

1__Address__Peachtree Rd

2__Cuisine__Chinese

2__Price__ $$$

2__Specialty Cocktail __ Scorpion Bowl

2__DJ?__No

2__Name__ Peking Inn

2__Address Lake St

3__Live Music? __ Yes

3__Music Genre__ Jazz

3__Name__ Thai Tanic

3__Address__Branch Dr

Then the authors Says :

Now each datum is described alongside the property that defines 开发者_开发问答it. In doing this, we’ve taken the semantic relationships that previously were inferred from the table and column and made them data in the table. This is the essence of semantic data modeling: flexible schemas where the relationships are described by the data itself.

If I want to design the database of my semantic web application should i use this way? what is the practical value ?


What you lose in immediate clarity, you gain in flexibly. Notice with your more parametrized approach you gain the ability to easily add fields without altering any tables. This allows you give different fields to different venues as it suites your application. By association, this also makes it easy to extend your web application via your creation or future maintainer/modification authors (if you intend to release) down the road.

Just be careful when it comes to performance. Don't adopt a fully parametrized design when it is easier to a standard relational design. Let's say, for a moment, you have a two different users tables, one relational the other parametrized:

Table: users_relational 
+---------+----------+------------------+----------+
| user_id | username | email            | password | 
+---------+----------+------------------+----------+
|       1 | Sam      | sam@example.com  | ******** |
|       2 | John     | john@example.com | ******** |
|       3 | Jane     | jane@example.com | ******** |
+---------+----------+------------------+----------+

Table: users_parametrized
+---------+----------+------------------+
| user_id | field    | value            |
+---------+----------+------------------+
|       1 | username | Sam              |
|       1 | email    | sam@example.com  |
|       1 | password | ********         |
|       2 | username | John             |
|       2 | email    | john@example.com |
|       2 | password | ********         |
|       3 | username | Jane             |
|       3 | email    | jane@example.com |
|       3 | password | ********         |
+---------+----------+------------------+

Now you want to select a single user. With your relational table, you will only select one row, while your parametrized version will select the number of rows that there are fields associated with that user, in this case 3.

The next issue is searchability (at times). Say you have that same users table from the example above, but instead of knowing the user ID, you only know the username. You may be using two queries, one to find the user id and the other to get the data associated with the user.

Your last con stems from selecting only a few rows at a time. Taking the users tables example again, we can limit the number of fields easily with the relational one:

SELECT username, email FROM users_relational WHERE user_id = 2

We should get a single result with two columns. Now, for the parametrized table:

SELECT field, value FROM users_parametrized WHERE user_id = 2 AND field IN('username','email')

It's a little more verbose and will become less readable than the first one, especially if you start taking on more and more fields to select.

Additionally, the parametrized will be slower for a few reasons. It now has to do text comparisons from the varchar in the field column, instead of a single, numerically indexed user_id. With the first query, it knows when to stop looking for the record because you're selecting by a primary key. In the parametrized, you are not selecting by a primary key, so you will take a performance hit because your database must look through all the records.

This leads me into the final real difference (as far as your DBMS sees it). There is no primary key in the parametrized, which (as you saw above) can be a performance issue, especially if you already have a considerable number of records. For something like a users table where you can have thousands of records, your record count would be that number times 3 (as we have three non-user_id fields) in this case alone. That's a lot of data for the database to search through.

There are quite a few things to consider when designing your application. Don't be afraid to mix your database with parametrized and relational style - it just has to make sense practically. In the case you gave, it makes perfect sense to do so; in the case I displayed, it would be pointless.


It is possible to stay fully relational while pursuing the intent of storing data in a parameterized fashion. The following is a greatly oversimplified demonstration, but should suffice to show the main tricks that are needed -- in a nutshell, additional levels of abstraction, some surrogate primary keys, and some tables with composite primary keys. I will leave out exact description of foreign key constraints assuming the reader can grasp the obvious relations between tables below.

Your first table is only to establish the entities you want to store information about, and a key to look up what sorts of information will be stored:

entity_id | entity_type
---------------------------
        1 | lawn mower
        2 | toothbrush
        3 | bicycle
        4 | restaurant
        5 | person


The next table relates entity type to the fields you wish to store for each entity type:

entity_type | attribute
------------------------
lawn mower  | horsepower
lawn mower  | retail price
lawn mower  | gas_or_electric
lawn mower  | ...etc
toothbrush  | bristle stiffness
toothbrush  | weight
toothbrush  | head size
toothbrush  | retail price
toothbrush  | ...etc
person      | name
person      | email
person      | birth date
person      | ...etc


This is expandable to as many fields as you like for each entity type. It's still relational; this table does have a primary key, it's just a composite key composed of both columns.

This example is oversimplified for brevity; in actual practice you have to confront the namespacing issues with attributes and you probably want certain attribute names to be per-entity-type in case the same name means something different on an entirely different kind of entity. Use a surrogate primary key for the attributes in order to solve the namespacing issue, if you don't mind the decrease in readability when looking directly at the tables.

Meanwhile, and opposite of the preceding point, it's useful to make common and unambiguous attributes (such as "weight in grams" or "retail price in USD" available for reuse across multiple entity types. To handle this, add a level of abstraction between attributes and entity types. Make a table of "attribute sets", with each set linked to 1..n attributes. Then each entity type in the table above would be linked not directly to attributes, but to one or more attribute sets.

You'll need to either guarantee that attribute sets do not overlap in what attributes they point to, or create a means of resolving conflicts by hierarchy, composition, set union, or whatever fits your needs.

So at this point a lookup for a particular entity goes as follows. From the entity id we get the entity type. From entity type we get 1..n attribute sets, which yield a resulting attribute set that is held by the entity. Finally there is the big table with the actual data in it as follows:

entity_id | attribute_id | value
---------------------------------------
      923 |      1049272 | green
      923 |      1049273 | 206.55
      924 |      1049274 | 843-219-2862
      924 |      1049275 | Smith
      929 |      1049276 | soft
      929 |      1049277 | ...etc


As with all of these tables, this one has a primary key, in this case composed of the entity_id and attribute_id columns. The values are stored in a plain-text column without units. The units are stored in a separate table linking attributes to units. More tables can be established if you need to get more specific on that; you can set up additional levels of abstraction to establish an "attribute type" system similar to the entity type system described above.

If needed, you can go as far as storing relationships such as "attribute X is numerically convertible to attribute Y by the following formula", for numerical attributes. Or for non-numerical attributes you can establish equivalence tables to manage alternate spellings or formats for the allowed values of an attribute.

As you can imagine, the farther you go with your "attribute types and units" system, and the more you use that additional machinery in computation, the slower this all will be. In the worst case you're looking at many joins. But that problem can be addressed with caching and views, if your situation allows you to make tradeoffs such as slowing write speed to gain a great increase in read speed. Also, many of your queries to the database will be in situations where you already know what entity type you're working with at the moment and what its resulting attributes are and their types; so you only have to grab the literal values out of the entity/attribute/value table, and that is plenty fast.

In conclusion, hopefully I have shown how you can get as parameterized as you wish while remaining fully relational. It just requires more tables for more levels of abstraction than some of the simpler approaches do; yet it avoids the disadvantages of the "one-big-table" style. This style of entity>type>attribute>value storage is powerful, flexible, and can be extended as far as you need.

And thanks to a relational/normalized table setup, you can do all sorts of reorganizing along the way as your entity schema evolves, without losing data. The additional levels of abstraction allow you to re-parent attributes from one attribute set to another, change their names if needed, and change which sets of attributes an entity type makes use of, without losing stored values, as long as you write appropriate migrations. The other day I realized I needed to store a certain product attribute on a per-brand basis instead of per-product, and was able to make the schema change in five minutes with only a couple of updated rows in the database. In many other setups, particularly in a one-big-table setup, it could have been a lot more work, requiring as much as one or more updated rows per entity affected by the change.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜