开发者

Partial matching in Oracle Database

I am having very big table (more than 1 millions rows) , These rows are having product name and price from different sources.

There are many products which are same name but they differ in price.

Here is the problem ,

We have same products many times in row but their name will not be same for example

Row    Product name             price
-----  -----------------------  ---- 
Row 1 : XYZ - size information   $a
Row 2. XYZ -Brand information    $b
Row 3. 开发者_如何学Pythonxyz                       $c

I want to get all products which price differ. If name are same in row then I can easily go for self join as Table1.Product_Name = Table1.Product_name and Table1.Price != Table2.Price

But this will not work in this case :(

Can any one suggest a solution for it ?


You could try to use regexp_replace to go into the right direction:

create table tq84_products (
  name   varchar2(50),
  price  varchar2( 5)
);

Three products:

  • xyz
  • ABCD
  • efghi

of which ABCD has two records with the same price and all others have a different price.

insert into tq84_products values (' XYZ - size information', '$a');
insert into tq84_products values ('XYZ - brand information', '$b');
insert into tq84_products values ('xyz'                    , '$c');

insert into tq84_products values ('Product ABCD'           , '$d');
insert into tq84_products values ('Abcd is the best'       , '$d');

insert into tq84_products values ('efghi is cheap'         , '$f');
insert into tq84_products values ('no, efghi is expensive' , '$g');

The select statement with stop words to remove words that are typically found in product names.

with split_into_words as (
      select 
        name,
        price,
        upper (
        regexp_replace(name,
                             '\W*'  ||
                       '(\w+)?\W?+' ||
                       '(\w+)?\W?+' ||
                       '(\w+)?\W?+' ||
                       '(\w+)?\W?+' ||
                       '(\w+)?\W?+' ||
                       '(\w+)?\W?+' ||
                       '(\w+)?\W?+' ||
                       '(\w+)?\W?+' ||
                       '(\w+)?'     ||
                       '.*',
                       '\' || submatch.counter
                     ) 
        )                          word
         from
           tq84_products,
           (select
              rownum counter
            from 
              dual
            connect by
              level < 10
           ) submatch
  ),
  stop_words as (
    select 'IS'          word from dual union all
    select 'BRAND'       word from dual union all
    select 'INFORMATION' word from dual 
  )
  select
    w1.price,
    w2.price,
    w1.name,  
    w2.name
--  substr(w1.word, 1, 30)               common_word,
--  count(*) over (partition by w1.name) cnt
  from
    split_into_words w1,
    split_into_words w2
  where
    w1.word   = w2.word and
    w1.name  <  w2.name and
    w1.word is not null and
    w2.word is not null and
    w1.word not in (select word from stop_words) and
    w2.word not in (select word from stop_words) and
    w1.price != w2.price;

This then selects

$a    $b     XYZ - size information                            XYZ - brand information
$b    $c    XYZ - brand information                            xyz
$a    $c     XYZ - size information                            xyz
$f    $g    efghi is cheap                                     no, efghi is expensive

So, abcd is not returned while the others are.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜