
fuzzy implementation for capturing specific strings

I am going to develop a web crawler in Java to capture hotel room prices from hotel websites.

In this case I want to capture the room price together with the room type and the meal type, so my algorithm needs to be intelligent enough to handle that.

For example:

Room type: Deluxe
Meal type: HalfBoard
price : $20.00

The main problem is that room prices can be presented differently on different hotel sites, so my algorithm should be independent of any particular site.

I plan to use the room types and meal types above as fuzzy sets and compare the words on a web page against those sets using a suitable membership function.

Does anyone have experience with this, or an idea for my problem?
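To make the idea concrete, here is a rough sketch of one possible membership function, assuming a normalized Levenshtein edit distance as the similarity measure. That choice, the class name FuzzyMatch, the method names and the threshold are all assumptions for illustration only; other measures could serve equally well.

```java
import java.util.List;
import java.util.Locale;

class FuzzyMatch {
    // Classic dynamic-programming Levenshtein edit distance (two-row variant).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Lower-cases and strips everything except letters and digits.
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    // Membership in [0, 1]: 1.0 for identical strings, lower as edits accumulate.
    static double membership(String word, String term) {
        String w = normalize(word);
        String t = normalize(term);
        int max = Math.max(w.length(), t.length());
        if (max == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(w, t) / max;
    }

    // Returns the best-matching term, or null if no term reaches the threshold.
    static String bestMatch(String word, List<String> terms, double threshold) {
        String best = null;
        double bestScore = threshold;
        for (String term : terms) {
            double score = membership(word, term);
            if (score >= bestScore) {
                best = term;
                bestScore = score;
            }
        }
        return best;
    }
}
```

With something like this, a misspelled "HalfBoad" on a page would still be assigned to the "Half Board" set, provided the threshold is chosen loosely enough.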


There are two ways to approach this problem:

  1. You can customize your crawler to understand the formats used by different websites; or

  2. You can come up with a general ("fuzzy") solution.

(1) will, by far, be the easiest. Ideally you want to create some tools that make this easier so you can create a filter for any new site in minimal time. IMHO your time will be best spent with this approach.

(2) has lots of problems. First, it will be unreliable: you will come across formats you don't understand or (worse) get wrong. Second, it will require a substantial amount of development to get something working. This is the sort of thing you use when you're dealing with thousands or millions of sites.

With hundreds of sites you will get better and more predictable results with (1).
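As a rough illustration of what such a tool for (1) might look like, here is a minimal sketch in which each site gets a small set of regular expressions, one per field. The class names SiteFilter and SiteFilters, the host name hotel-a.example and the patterns themselves are all hypothetical; real sites will need their own patterns, and most will call for an HTML parser rather than plain regexes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SiteFilter {
    private final Map<String, Pattern> fields = new LinkedHashMap<>();

    // Registers a field; the regex must contain one capturing group holding the value.
    SiteFilter field(String name, String regex) {
        fields.put(name, Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
        return this;
    }

    // Returns the captured value for every field whose pattern matches the content.
    Map<String, String> extract(String content) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> e : fields.entrySet()) {
            Matcher m = e.getValue().matcher(content);
            if (m.find()) {
                result.put(e.getKey(), m.group(1).trim());
            }
        }
        return result;
    }
}

class SiteFilters {
    // Hypothetical registry keyed by host name; adding a site means adding an entry.
    static final Map<String, SiteFilter> BY_HOST = new LinkedHashMap<>();

    static {
        BY_HOST.put("hotel-a.example", new SiteFilter()
            .field("roomType", "Room type:\\s*(\\w+)")
            .field("mealType", "Meal type:\\s*(\\w+)")
            .field("price", "price\\s*:\\s*\\$([0-9]+(?:\\.[0-9]{2})?)"));
    }
}
```

Supporting a new site then comes down to writing a handful of patterns, which is the "minimal time" part.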


As with all problems, good design can let you deliver value and adapt to situations you haven't considered much more quickly than a general solution would.

Start by writing something that parses the data from one provider - the one with the simplest format to handle. Find a way to adapt that handler into your crawler. Be sure to encapsulate construction - you should always do this anyway...

public class RoomTypeExtractor
{
  private RoomTypeExtractor() { }

  public static RoomTypeExtractor GetInstance()
  {
    return new RoomTypeExtractor();
  }

  public String GetRoomType(String content)
  {
    // BEHAVIOR #1
  }
}

The GetInstance() method lets you promote to a Strategy pattern practically for free.

Then add your second provider type. Say, for instance, that you have a slightly more complex data format which is a little more prevalent than the first format. Start by refactoring what was your concrete room type extractor class into an abstraction with a single variation behind it and have the GetInstance() method return an instance of the concrete type:

public abstract class RoomTypeExtractor
{
  public static RoomTypeExtractor GetInstance()
  {
    return SimpleRoomTypeExtractor.GetInstance();
  }

  public abstract String GetRoomType(String content);
}

public final class SimpleRoomTypeExtractor extends RoomTypeExtractor
{
  private SimpleRoomTypeExtractor() { }

  public static SimpleRoomTypeExtractor GetInstance()
  {
    return new SimpleRoomTypeExtractor();
  }

  public String GetRoomType(String content)
  {
    // BEHAVIOR #1
  }
}

Create another variation that implements the Null Object pattern...

public class NullRoomTypeExtractor extends RoomTypeExtractor
{
  private NullRoomTypeExtractor() { }

  public static NullRoomTypeExtractor GetInstance()
  {
    return new NullRoomTypeExtractor();
  }

  public String GetRoomType(String content)
  {
    // whatever "no content" behavior you want... I chose returning null
    return null;
  }
}

Add a base class that makes it easier to work with the Chain of Responsibility pattern that is hiding in this problem:

public abstract class ChainLinkRoomTypeExtractor extends RoomTypeExtractor
{
  private final RoomTypeExtractor next_;

  protected ChainLinkRoomTypeExtractor(RoomTypeExtractor next)
  {
    next_ = next;
  }

  public final String GetRoomType(String content)
  {
    if (CanHandleContent(content))
    {
      return GetRoomTypeFromUnderstoodFormat(content);
    }
    else
    {
      return next_.GetRoomType(content);
    }
  }

  protected abstract boolean CanHandleContent(String content);
  protected abstract String GetRoomTypeFromUnderstoodFormat(String content);
}

Now, refactor the original implementation to have a base class that joins it into a Chain of Responsibility...

public final class SimpleRoomTypeExtractor extends ChainLinkRoomTypeExtractor
{
  private SimpleRoomTypeExtractor(RoomTypeExtractor next)
  {
    super(next);
  }

  public static SimpleRoomTypeExtractor GetInstance(RoomTypeExtractor next)
  {
    return new SimpleRoomTypeExtractor(next);
  }

  protected boolean CanHandleContent(String content)
  {
    // return whether or not content contains the right format
  }

  protected String GetRoomTypeFromUnderstoodFormat(String content)
  {
    // BEHAVIOR #1
  }
}

Be sure to update RoomTypeExtractor.GetInstance():

  public static RoomTypeExtractor GetInstance()
  {
    RoomTypeExtractor extractor = NullRoomTypeExtractor.GetInstance();

    extractor = SimpleRoomTypeExtractor.GetInstance(extractor);

    return extractor;
  }

Once that's done, create a new link for the Chain of Responsibility...

public final class MoreComplexRoomTypeExtractor extends ChainLinkRoomTypeExtractor
{
  private MoreComplexRoomTypeExtractor(RoomTypeExtractor next)
  {
    super(next);
  }

  public static MoreComplexRoomTypeExtractor GetInstance(RoomTypeExtractor next)
  {
    return new MoreComplexRoomTypeExtractor(next);
  }

  protected boolean CanHandleContent(String content)
  {
    // Check for presence of format #2
  }

  protected String GetRoomTypeFromUnderstoodFormat(String content)
  {
    // BEHAVIOR #2
  }
}

Finally, add the new link to the chain. If this is a more common format, you might want to give it higher priority by putting it higher in the chain (the real forces that govern the order of the chain will become apparent when you do this):

  public static RoomTypeExtractor GetInstance()
  {
    RoomTypeExtractor extractor = NullRoomTypeExtractor.GetInstance();

    extractor = SimpleRoomTypeExtractor.GetInstance(extractor);
    extractor = MoreComplexRoomTypeExtractor.GetInstance(extractor);

    return extractor;
  }

As time passes, you may want to add ways to dynamically add new links to the Chain of Responsibility, as pointed out by Cletus, but the fundamental principle here is Emergent Design. Start with high quality. Keep quality high. Drive with tests. Do those three things and you will be able to use the fuzzy logic engine between your ears to overcome almost any problem...
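For what it's worth, here is one possible shape for adding links dynamically. It is only a sketch under assumptions: the class DynamicRoomTypeChain, the Link interface, the regexLink helper and the two formats used in the usage note are all made up for illustration, and it swaps the hard-wired next_ reference for a plain list that is walked in priority order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DynamicRoomTypeChain {
    // One link: knows whether it understands the content and how to read it.
    interface Link {
        boolean canHandle(String content);
        String getRoomType(String content);
    }

    // Hypothetical link driven by a regex with one capturing group.
    static Link regexLink(String regex) {
        final Pattern pattern = Pattern.compile(regex);
        return new Link() {
            @Override
            public boolean canHandle(String content) {
                return pattern.matcher(content).find();
            }

            @Override
            public String getRoomType(String content) {
                Matcher m = pattern.matcher(content);
                return m.find() ? m.group(1) : null;
            }
        };
    }

    private final List<Link> links = new ArrayList<>();

    // Links added later are consulted first, like wrapping the existing chain.
    void addLink(Link link) {
        links.add(0, link);
    }

    // Walks the links in priority order; returning null plays the Null Object role.
    String getRoomType(String content) {
        for (Link link : links) {
            if (link.canHandle(content)) {
                return link.getRoomType(content);
            }
        }
        return null;
    }
}
```

A chain built with addLink(regexLink("Room type:\\s*(\\w+)")) followed by addLink(regexLink("<td class=\"room\">(\\w+)</td>")) would try the table-cell format first and fall back to the labelled format, returning null when neither applies.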

EDIT

Translated to Java. Hope I did that right; I'm a little rusty.
