Partial matching in Oracle Database
I am having very big table (more than 1 millions rows) , These rows are having product name and price from different sources.
There are many products which are same name but they differ in price.
Here is the problem ,
We have same products many times in row but their name will not be same for example
Row Product name price
----- ----------------------- ----
Row 1 : XYZ - size information $a
Row 2. XYZ -Brand information $b
Row 3. 开发者_如何学Pythonxyz $c
I want to get all products which price differ. If name are same in row then I can easily go for self join as Table1.Product_Name = Table1.Product_name and Table1.Price != Table2.Price
But this will not work in this case :(
Can any one suggest a solution for it ?
You could try to use regexp_replace
to go into the right direction:
create table tq84_products (
name varchar2(50),
price varchar2( 5)
);
Three products:
- xyz
- ABCD
- efghi
of which ABCD has two records with the same price and all others have a different price.
insert into tq84_products values (' XYZ - size information', '$a');
insert into tq84_products values ('XYZ - brand information', '$b');
insert into tq84_products values ('xyz' , '$c');
insert into tq84_products values ('Product ABCD' , '$d');
insert into tq84_products values ('Abcd is the best' , '$d');
insert into tq84_products values ('efghi is cheap' , '$f');
insert into tq84_products values ('no, efghi is expensive' , '$g');
The select statement with stop words to remove words that are typically found in product names.
with split_into_words as (
select
name,
price,
upper (
regexp_replace(name,
'\W*' ||
'(\w+)?\W?+' ||
'(\w+)?\W?+' ||
'(\w+)?\W?+' ||
'(\w+)?\W?+' ||
'(\w+)?\W?+' ||
'(\w+)?\W?+' ||
'(\w+)?\W?+' ||
'(\w+)?\W?+' ||
'(\w+)?' ||
'.*',
'\' || submatch.counter
)
) word
from
tq84_products,
(select
rownum counter
from
dual
connect by
level < 10
) submatch
),
stop_words as (
select 'IS' word from dual union all
select 'BRAND' word from dual union all
select 'INFORMATION' word from dual
)
select
w1.price,
w2.price,
w1.name,
w2.name
-- substr(w1.word, 1, 30) common_word,
-- count(*) over (partition by w1.name) cnt
from
split_into_words w1,
split_into_words w2
where
w1.word = w2.word and
w1.name < w2.name and
w1.word is not null and
w2.word is not null and
w1.word not in (select word from stop_words) and
w2.word not in (select word from stop_words) and
w1.price != w2.price;
This then selects
$a $b XYZ - size information XYZ - brand information
$b $c XYZ - brand information xyz
$a $c XYZ - size information xyz
$f $g efghi is cheap no, efghi is expensive
So, abcd is not returned while the others are.
精彩评论