Correlating varchar values
Is there a built-in way in Oracle 11 to check correlation of values in a varchar2 field? For example, given a simple table such as this:
MEAL_NUM INGREDIENT
--------------------
1 BEEF
1 CHEESE
1 PASTA
2 CHEESE
2 PASTA
2 FISH
3 CHEESE
3 CHICKEN
I want to get a numerical indication that based on MEAL_NUM, CHEESE is paired mostly with PASTA and to lessening degrees with BEEF, CHICKEN, and FISH.
My first inclination is to use the CORR functio开发者_如何学Pythonn and transform the strings into a number perhaps by either enumerating them beforehand or grabbing the rownum from a unique select.
Any suggestions how to go about this?
You won't want to use CORR
-- if you create a "food number" and assign Beef = 1, Chicken = 2, and Pasta = 3, then a correlation coefficient will tell you whether increased cheese correlates with increased "food number." But the "food number" being higher or lower doesn't mean anything since you made it up. So, don't use CORR
unless your foods are actually ordered in some way, like numbers are.
The way statisticians talk about this is with levels of measurement. In the language of the linked article, MEAL_NUM
is a nominal measure -- or maybe an ordinal measure if the meals happened in order, but either way, it's a really bad idea to use correlation coefficients on it.
You'll probably instead want to find something like "what percentage of Beef meals also have Cheese?" The following will return, for each ingredient, the number of meals containing it and also the number of meals containing it AND cheese. The trick is that COUNT
only counts non-null values.
SELECT Other.Ingredient,
COUNT(*) AS TotalMeals,
COUNT(Cheese.Ingredient) AS CheesyMeals
FROM table Other
LEFT JOIN table Cheese
ON (Cheese.Ingredient = 'Cheese'
AND Cheese.Meal_Num = Other.Meal_Num)
GROUP BY Other.Ingredient
Warning: returns wrong results if you include an ingredient twice in any one meal.
Edit: It turns out you aren't interested in Cheese specifically. You really want all the pairs of "correlations." So, we can abstract "Cheese" out and call them just the First and Second ingredients. I've added a "PossibleScore" to this one which tries to act like a percentage-of-meals but doesn't give a strong score if there are very few instances of the ingredient.
SELECT First.Ingredient,
Second.Ingredient,
COUNT(*) AS MealsWithFirst,
COUNT(First.Ingredient) AS MealsWithBoth,
COUNT(First.Ingredient) / (COUNT(*) + 3) AS PossibleScore,
FROM table First
LEFT JOIN table Second
ON (First.Meal_Num = Second.Meal_Num)
GROUP BY First.Ingredient, Second.Ingredient
When sorted by score, this should return
PASTA CHEESE 2 2 0.400
CHEESE PASTA 3 2 0.333
BEEF CHEESE 1 1 0.250
BEEF PASTA 1 1 0.250
FISH CHEESE 1 1 0.250
FISH PASTA 1 1 0.250
CHICKEN CHEESE 1 1 0.250
PASTA BEEF 2 1 0.200
PASTA FISH 2 1 0.200
CHEESE BEEF 3 1 0.167
CHEESE FISH 3 1 0.167
CHEESE CHICKEN 3 1 0.167
Do a self join to get all the in ingredient combinations, then corr by the two meal_nums
SELECT t1.INGREDIENT, t2.INGREDIENT, CORR(t1.MEAL_NUM, t2.MEAL_NUM)
FROM TheTable t1, TheTable t2
WHERE t1.INGREDIENT < t2.INGREDIENT
GROUP BY t1.INGREDIENT, t2.INGREDIENT
Should give you something like:
BEEF CHEESE 0.999
BEEF PASTA 0.998
CHEESE PASTA 0.977
UPDATE: as Chris points out, this won't work as is. What I was hoping is that there might be some way to fudge a mapping from the ordinal meal_num to an interval (@Chris, thanks for the link) value. That may not be possible, in which case this answer wouldn't help.
Try DBMS_FREQUENT_ITEMSET:
--Create sample data
create table meals(meal_num number, ingredient varchar2(10));
insert into meals
select 1, 'BEEF' from dual union all
select 1, 'CHEESE' from dual union all
select 1, 'PASTA' from dual union all
select 2, 'CHEESE' from dual union all
select 2, 'PASTA' from dual union all
select 2, 'FISH' from dual union all
select 3, 'CHEESE' from dual union all
select 3, 'CHICKEN' from dual;
commit;
--Create nested table type to hold results
CREATE OR REPLACE TYPE fi_varchar_nt AS TABLE OF VARCHAR2(10);
/
--Find the items most frequently combined with CHEESE.
select bt.setid, nt.column_value, support occurances_of_itemset
,length, total_tranx
from
(
select
cast(itemset as fi_varchar_nt) itemset, rownum setid
,support, length, total_tranx
from table(dbms_frequent_itemset.fi_transactional(
tranx_cursor => cursor(select meal_num, ingredient from meals),
support_threshold => 0,
itemset_length_min => 2,
itemset_length_max => 2,
including_items => cursor(select 'CHEESE' from dual),
excluding_items => null))
) bt,
table(bt.itemset) nt
where column_value <> 'CHEESE'
order by 3 desc;
SETID COLUMN_VAL OCCURANCES_OF_ITEMSET LENGTH TOTAL_TRANX
---------- ---------- --------------------- ---------- -----------
4 PASTA 2 2 3
3 FISH 1 2 3
1 BEEF 1 2 3
2 CHICKEN 1 2 3
what about a query like that?
select t1.INGREDIENT, count(*)a
from table t1,
(select meal_num
from table
where INGREDIENT = 'CHEESE') t2
where t1.INGREDIENT <> 'CHEESE'
and t1.meal_num=t2.mealnum
group by t1.INGREDIENT;
the result should be the number of time each ingredient share a meal_num with CHEESE.
精彩评论