开发者

string categorization strategies

I'm the one-man dev team on a fledgling military history website. One aspect of the site is a catalog of ~1,200 individual battles, including the nations & formations (regiments, divisions, etc) which took part.

The formation information (as well as the other battle info) was manually imported from a series of books by a 10-man volunteer team. The formations were listed in groups with varying formatting and abbreviation patterns. At the time I set up the data collection forms I couldn't think of a good way to process that data... and elected to store it all as strings in the MySQL database and sort it out later.

Well, "later" - as it tends to happen - has arrived. :-)

Each battle has 2+ records in the database - one for each nation that participated. Each record has a formations text string listing the formations present as the volunteer chose to add them.

Some real examples:

  • 39th Grenadier Rgmt, 26th Volksgrenadier Division
  • 2nd Luftwaffe Field Division, 246th Infantry Division
  • 247th Rifle Division, 255th Tank Brigade
  • 2nd Luftwaffe Field Division, SS Cavalry Division
  • 28th Tank Brigade, 158th Rifle Division, 135th Rifle Division, 81st Tank Brigade, 242nd Tank Brigade
  • 78th Infantry Division
  • 3rd Kure Special Naval Landing Force, Tulagi Seaplane Base personnel
  • 1st Battalion 505th Infantry Regiment

The ultimate goal is for each individual force to have an ID, so that its participation can be traced throughout the battle database. Formation hierarchy, such as the final item above 1st Battalion (of the) 505th Infantry Regiment a开发者_StackOverflowlso needs to be preserved. In that case, 1st Battalion and 505th Infantry Regiment would be split, but 1st Battalion would be flagged as belonging to the 505th.

In database terms, I think I want to pull the formation field out of the current battle info table and create three new tables:

FORMATION
[id] [name]

FORMATION_HIERARCHY
[id] [parent] [child]

FORMATION_BATTLE
[f_id] [battle_id]

It's simple to explain, but complicated to enact.

What I'm looking for from the SO community is just some tips on how best to tackle this problem. Ideally there's some sort of method to solving this that I'm not aware of. However, as a last resort, I could always code a classification framework and call my volunteers back to sort through 2,500+ records...


You've tagged your question as PHP related - but it's not.

You are proposing substituting the real identifiers with surrogate keys (ids) however the real identifiers are intrinsically unique - so you're just making your data structure more complicated than it needs to be. Having said that, the leaf part of the hierarchy may only be unique within the scope of the parent node.

The most important question you need to address is whether the formation tree is always going to be two levels. I suspect that sometimes it may be one and sometimes it may be more than 2. The structure you propose is not going to work very well with variable depth trees.

This may help:

http://articles.sitepoint.com/article/hierarchical-data-database

C.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜