开发者

Understanding SQL joins within WHERE clause

I have a query in SQL that I'm trying to translate into Pig Latin (for use on a Hadoop cluster). Most of the time I have no problem moving the queries over to Pig, but I've encountered something I can't seem to figure out.

Imagine a select statement like this:

SELECT a.f1, b.f2, b.f3 -- just for example
FROM tableA a, tableB b
WHERE( 
(
    a.f1 = b.f2
    AND
    (
        a.f2 = b.f1
        OR
        (
            (a.f2 = 'somestring1' OR a.f2 = 'somestring2')
            AND
            (b.f1 is null OR b.f1 ='somestring3' OR b.f1 = 'somestring4')
        )
    )
OR
(
    a.f3 = b.f4
    AND
    (
        a.f4 = b.f3
        OR
        (
            (a.f4 = 'somestring5' OR a.f4 = 'somestring6')
            AND
            (b.f3 is null OR b.f3 ='somestring7' OR b.f3 = 'somestring8')
        )
    )
)
)
GROUP BY -- some other stuff

Now, I know that a direct translation to Pig might not be possible开发者_JS百科. If that's the case, I'm wondering how this statement gets decomposed into the JOINs (and filters) required to compute it (as it may be easier to see how to construct a Pig query out of that).

I've gone to the Pig mailing list but I haven't found a good solution yet, as Pig doesn't really do "OR" with JOINs. I am aware of how odd this question is.

(If anyone does have an idea of how this be done in Pig, I wouldn't be opposed to looking at it...)

Edit: Does anyone know if this would be easier in Hive, maybe?


UNION is another way of spelling OR ... sort of.

Maybe:

SELECT a.f1, b.f2, b.f3 -- just for example
  FROM tableA a JOIN tableB b
    ON a.f1 = b.f2 AND
       (a.f2 = b.f1 OR
           ((a.f2 = 'somestring1' OR a.f2 = 'somestring2') AND
            (b.f1 is null OR b.f1 ='somestring3' OR b.f1 = 'somestring4')
           )
       )
UNION
SELECT a.f1, b.f2, b.f3 -- just for example
  FROM tableA a JOIN tableB b
    ON a.f3 = b.f4 AND
       (a.f4 = b.f3 OR
           ((a.f4 = 'somestring5' OR a.f4 = 'somestring6') AND
            (b.f3 is null OR b.f3 ='somestring7' OR b.f3 = 'somestring8')
           )
       )
GROUP BY -- some other stuff


I dunno about Pig, but in SQL, your statement is equivalent to pretty what you've written:

select ...
from a
join b on [your enormous condition]
group by ...

It'll be slow, too, because short of having an optimizer (e.g. Postgres) that decomposes the OR clauses and tries each of them with the appropriate index, you end up cross joining the two tables in their entirety.

You could rewrite the statement as select from (select ... union select ...) group by ... instead, if Pig makes more sense of the latter.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜