Algorithm problem: select two stories per topic so that the same story is never selected for two different topics

2023-03-13 18:05 问答作者：

In my workplace I have stumbled upon the following problem that I am asked to solve. A solution is preferred although not absolutely required.

There is a database with a set of stories, and each story has a set of topics associated with it. The topics are stored in a separate table of the form (storyid, topicid).

The question is how to select ideally 5 topics (or at least 2, if 5 is impossible) such that each topic has 2 stories (or 1, if 2 is impossible) that are not repeated in any of the other selected topics. The algorithm must also return which exactly are the "proper" stories associated with each topic.

Is this actually an NP-complete problem that has no efficient solution that goes beyond simple enumeration of all possibilities or does it have an efficient solution?

I开发者_如何学编程f it does not have an efficient solution, please try to prove it (although not absolutely necessary).

If it does have an efficient solution, please be kind and post it.

A more general version of the problem is to select for all topics (or at least as many as possible) two stories so that the same story is never selected for two different topics.

Mark the stories with S₁...S_m, and the topics with T₁...T_n. Duplicate each topic, that is, introduce the new stories T'₁...T'_n, where T'_i contains S_j if and only if T_i contains it. The problem can now be rephrased this way: select a different story for all topics (or as many as possible). Once you have your topic-story pairs, you can join the duplicated topics again, and each topic will have two stories.

The problem of finding the largest number of pairs possible while never selecting any element twice is called the maximum bipartite matching problem. (You can think of it as selecting the maximum number of non-connected edges from a bipartite graph.) There is a simple algorithm called augmenting path which solves it in O((m+n)e) steps (e being the number of edges) and some more sophisticated ones (such as the Hopcroft–Karp algorithm) which solve it in around O((m+n)^2.5) steps. The augmenting path algorithm consists of searching for "alternating" paths in the graph where the first edge of the path is not selected, the second is, the third isn't and so on, than inverting the selections on the path. This can be probably adapted to your case without actually doing the splitting and joining of topics.

This is a bit of an overkill because it will return two stories for all the topics, not only five; you can probably do a lot better when you only need to find stories for a limited number of topics. (Except for some edge cases, you can just select the five topics which have the largest number of stories, discard the stories not contained by them, and run the algorithm then.) At any rate it shows that the problem is far from being NP-hard.

Suppose we have a SQL table TopicStory that relates TopicID and StoryID, with that pair of columns forming a compound primary key.

The goal is to find a certain kind of set of TopicID's. At most five in a set are required by the Question posted. A depth-first search for these is outlined below.

With the depth of search bounded at five, the stated problem cannot be worse than polynomial complexity. However a generalized problem that asks for the largest set of topics that could be found with like constraints (each topic chosen has at least two stories not related to any of the other chosen topics) is probably NP-complete.

The use of the word "search" suggests a backtracking algorithm. Below we have effected backtracking through nested loops, where each loop is defined and parameterized by the loops that are outer to it.

Before we give the gritty details, it may be motivational to describe a "brute force" approach, next to which the more complicated approach may be more easily appreciated.

BRUTE_FORCE:

Generate all possible subsets of five topics.
Test each of these sets for feasibility (each topic has
at least two stories unrelated to any of the other topics).

Our sketch of depth-first search assumes topics have a total ordering, e.g. ordered by integer values for TopicID. This allows sets of topics to be generated without repetition (due to permutation of topics).

NESTED_LOOPS:

(Loop_1) Select into List_1 all topics with at least two stories.
         Iterate through List_1, choosing the first topic %T1%.
         PASS control into Loop_2.
         CONTINUE in Loop_1.
         If the end of List_1 is reached, EXIT with failure.

(Loop_2) Select into List_2 all topics > %T1% with at least two
         stories unrelated to %T1%.
         Iterate through List_2, choosing the second topic %T2%.
         If topic %T1% still has at least two stories unrelated
         to %T2%, PASS control into Loop_3.
         CONTINUE in Loop_2.
         If the end of List_2 is reached, go BACK to Loop_1.

(Loop 3) Select into List_3 all topics > %T2% with at least two
         stories unrelated to %T1% or %T2%.
         Iterate through List_3, choosing the third topic %T3%.
         If topic %T1% still has at least two stories unrelated
         to %T2% or %T3%,
         and topic %T2% still has at least two stories unrelated
         to %T1% or %T3%, PASS control into Loop_4.
         CONTINUE in Loop_3.
         If the end of List_3 is reached, go BACK to Loop_2.

(Loop 4) Select into List_4 all topics > %T3% with at least two
         stories unrelated to %T1%, %T2%, or %T3%.
         Iterate through List_4, choosing the fourth topic %T4%.
         If topic %T1% still has at least two stories unrelated
         to %T2%, %T3%, or %T4%,
         and topic %T2% still has at least two stories unrelated
         to %T1%, %T3%, or %T4%,
         and topic %T3% still has at least two stories unrelated
         to %T1%, %T2%, or %T4%, PASS control into Loop_5.
         CONTINUE in Loop_4.
         If the end of List_4 is reached, go BACK to Loop_3.

(Loop 5) Select into List_5 all topics > %T4% with at least two
         stories unrelated to %T1%, %T2%, %T3%, or %T4%.
         Iterate through List_5, choosing the fifh topic %T5%.
         If topic %T1% still has at least two stories unrelated
         to %T2%, %T3%, %T4%, or %T5%,
         and topic %T2% still has at least two stories unrelated
         to %T1%, %T3%, %T4%, or %T5%,
         and topic %T3% still has at least two stories unrelated
         to %T1%, %T2%, %T4%, or %T5%,
         and topic %T4% still has at least two stories unrelated
         to %T1%, %T2%, %T3%, or %T5%, EXIT with success
         returning five topics %T1%, %T2%, %T3%, %T4%, and %T5%.
         CONTINUE in Loop_5.
         If the end of List_5 is reached, go BACK to Loop_4.

The use of "select" at the opening of each nested loop is meant to evoke the possibility of SQL queries to implement much of the logic. For example the outermost loop is basically just getting the result set for this query:

SELECT   TS1.TopicID, Count(*)
 From    TopicStory TS1
Group By TS1.TopicID
Having   Count(*) > 1

The corresponding lists of the inner loops can be constructed similarly by SQL queries depending on parametric values of topics chosen in the outer loops. To illustrate without unnecessary repetition let's jump right to the innermost loop and give an appropriate query for List_5:

SELECT   TS5.TopicID, Count(*)
 From    TopicStory TS5
 Where   TS5.TopicID > %T4%
  and    NOT EXISTS ( SELECT *
                       From    TopicStory TSX
                       Where   TSX.TopicID in (%T1%,%T2%,%T3%,%T4%)
                        and    TSX.StoryID = TS5.StoryID
                    )
Group By TS5.TopicID
Having   Count(*) > 1

This would be followed by checking that %T5% in List_5 produces a count of at least two stories left for topic %T1%:

SELECT Count(*)
 From  TopicStory TZ1
 Where TZ1.TopicID = %T1%
  and  NOT EXISTS ( SELECT *
                     From    TopicStory TX1
                     Where   TX1.StoryID = TZ1.StoryID
                      and    TX1.TopicID in (%T2%,%T3%,%T4%,TS5.TopicID)
                  )

and mutatis mutandi for the other prior topic choices.

Although it might slow performance unacceptably, the additional logic for restricting topics related to %T5% (so that earlier topic choices still retain at least two story choices) could be pushed into one query. It would look like this:

/*
   Given %T1%, %T2%, %T3$, and %T4% from queries above, find all topics %T5% > %T4%
   with at least 2 stories not related to %T1%, %T2%, %T3%, or %T4% and such that
   %T1% still has at least 2 stories not related to %T2%, %T3%, %T4%, or %T5% and
   %T2% still has at least 2 stories not related to %T1%, %T3%, %T4%, or %T5% and
   %T3% still has at least 2 stories not related to %T1%, %T2%, %T4%, or %T5% and
   %T4% still has at least 2 stories not related to %T1%, %T2%, %T3%, or %T5%
*/

SELECT   TS5.TopicID, Count(*)
 From    TopicStory TS5
 Where   TS5.TopicID > %T4%
  and    NOT EXISTS ( SELECT *
                       From    TopicStory TSX
                       Where   TSX.TopicID in (%T1%,%T2%,%T3%,%T4%)
                        and    TSX.StoryID = TS5.StoryID
                    )
  and    ( SELECT Count(*)
            From  TopicStory TZ1
            Where TZ1.TopicID = %T1%
             and  NOT EXISTS ( SELECT *
                                From    TopicStory TX1
                                Where   TX1.StoryID = TZ1.StoryID
                                 and    TX1.TopicID in (%T2%,%T3%,%T4%,TS5.TopicID)
                             )
         ) > 1
  and    ( SELECT Count(*)
            From  TopicStory TZ2
            Where TZ2.TopicID = %T2%
             and  NOT EXISTS ( SELECT *
                                From    TopicStory TX2
                                Where   TX2.StoryID = TZ2.StoryID
                                 and    TX2.TopicID in (%T1%,%T3%,%T4%,TS5.TopicID)
                             )
         ) > 1
  and    ( SELECT Count(*)
            From  TopicStory TZ3
            Where TZ3.TopicID = %T3%
             and  NOT EXISTS ( SELECT *
                                From    TopicStory TX3
                                Where   TX3.StoryID = TZ3.StoryID
                                 and    TX3.TopicID in (%T1%,%T2%,%T4%,TS5.TopicID)
                             )
         ) > 1
  and    ( SELECT Count(*)
            From  TopicStory TZ4
            Where TZ4.TopicID = %T4%
             and  NOT EXISTS ( SELECT *
                                From    TopicStory TX3
                                Where   TX3.StoryID = TZ3.StoryID
                                 and    TX3.TopicID in (%T1%,%T2%,%T3%,TS5.TopicID)
                             )
         ) > 1
Group By TS5.TopicID
Having   Count(*) > 1

The feature set of MySQL is a work in progress, so conceivably an efficient implementation in stored procedures is possible, where cursors can take the role of topic lists. However I'd be confident of good performance if the "cursors" are externally managed lists (e.g. in PHP) and database queries are kept as simple as possible.

Try adapting this to suit your needs:

SELECT topic, story 
FROM story_topic 
WHERE story IN (SELECT story FROM story_topic GROUP BY story HAVING COUNT(*) = 1);

The key here is to know what stories only occur in only one topic. You might want to pre-compute the number of topics to eradicate the subselect.

How about this? (If I understood your question)

(I haven't actually run it - just a thought - so... could be errors, or I could have blatantly missed something. But - for the moment, my tired head thinks it would work :)

$num_topics = 5;
$stories_per = 5;
$stories = array();  //array to store story ids

//select 5 topics
$query = mysql_query("SELECT * FROM topics ORDER BY RAND() LIMIT ".$num_topics);

//run repeat as many times as you want stories
for($i=0; $i<$stories_per; $i++) {

    //repeat through each selected topic
    while($row = mysql_fetch_array($query)) {

        $q_addon = "";
        foreach($stories as $value) {
            $q_addon .= "id <> '".$value."' AND ";
        }

        //find a story not yet chosen for each topic
        $q = mysql_query("SELECT storyid FROM stories_topics WHERE ".$q_addon." topicid='".$row['id']."' LIMIT 1");

        //add that id to your $stories array
        $tmp_id = mysql_result($q,0,'storyid');
        array_push($stories, $tmp_id);

    }
}

if you wona choose similar 5 topics similar to different stories as I understood: so you can make a join for the 2 tables and use top 5 in your query with where condition topic title="topic you want "

if that doesn't help kindly make it clear for me >>>

继续阅读：algorithm combinations php selection

Algorithm problem: select two stories per topic so that the same story is never selected for two different topics

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？