Loading from mysqldump with PIG

I have a mysqldump of the format:

INSERT INTO `MY_TABLE` VALUES (893024968,'342903068923468','o03gj8ip234qgj9u23q59u','testing123','HTTP','1','4213883b49b74d3eb9bd57b7','blahblash','2011-04-19 00:00:00','448','206',NULL,'GG');

How do I load this data using Pig? I have tried:

A = LOAD 'pig-test/test.log' USING PigStorage(',') AS (ID: chararray, USER_ID: chararray, TOKEN: chararray, NODE: chararray, CHANNEL: chararray, CODE: float, KEY: chararray, AGENT: chararray, TIME: chararray, DURATION: float, RESPONSE: chararray, MESSAGE: chararray, TARGET: chararray);

Using , as a delimiter works fine, but I want the ID to be an int, and I cannot figure out how to chop off the leading "INSERT INTO `MY_TABLE` VALUES (" and the trailing ");" when loading.

Also how should I load datetime information so that I can query it?

Any help you can give would be great.


You could load each record as a line of text and then extract the fields with a regular expression, using either MyRegExLoader or REGEX_EXTRACT_ALL:

A = LOAD 'data' AS (record: CHARARRAY);
B = FOREACH A GENERATE REGEX_EXTRACT_ALL(record, 'INSERT INTO...., \'(\\d+)\', ...');

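It is kind of a hack, but you can also use REPLACE to chop off the extra text. REPLACE treats its second argument as a regular expression, so the parentheses have to be escaped:

B = FOREACH A
    GENERATE
      (INT) REPLACE(ID, 'INSERT INTO `MY_TABLE` VALUES \\(', ''),
      ...
      REPLACE(TARGET, '\\);', '');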

Pig currently has a problem with semicolons inside string literals, so the last REPLACE may not parse as written and you might need to implement your own replacement for it.

There is no native date type in Pig, but you can juggle the date utilities in PiggyBank or build your own UDF to convert the value to a Unix timestamp (a long).
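
As a rough illustration of the conversion itself (outside Pig, for example in a pre-processing step), here is a minimal Python sketch. The function name to_unix_seconds is made up for this example, and it assumes the timestamps in the dump are in UTC, which the question does not state:

```python
from datetime import datetime, timezone


def to_unix_seconds(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a MySQL DATETIME string and return Unix seconds.

    Assumption: the value is in UTC. Adjust the tzinfo if the dump
    was written in a different time zone.
    """
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


# The TIME value from the sample row in the question.
print(to_unix_seconds("2011-04-19 00:00:00"))
```

Once the value is a plain number it can be loaded as a long and compared or filtered like any other numeric field.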

Another option is to write a simple script (in Python, for example) that prepares the data before loading it into Pig.
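
A minimal sketch of such a script might look like the following. It is only an illustration under some assumptions: every INSERT statement holds a single row (as in the sample line), strings are single-quoted with backslash escapes, and no field contains a tab. The names parse_insert and convert are invented for this example. An unquoted NULL is turned into an empty field, which also means a quoted 'NULL' string would be treated the same way.

```python
import csv
import re

# Matches one single-row statement: INSERT INTO `table` VALUES (...);
INSERT_RE = re.compile(r"^INSERT INTO `[^`]+` VALUES \((.*)\);\s*$")


def parse_insert(line):
    """Return the field values of a single-row INSERT line, or None."""
    match = INSERT_RE.match(line)
    if match is None:
        return None
    # Let the csv module handle the single-quoted, comma-separated values.
    reader = csv.reader(
        [match.group(1)],
        quotechar="'",
        escapechar="\\",
        doublequote=False,
    )
    fields = next(reader)
    # Simplification: NULL becomes an empty field.
    return ["" if field == "NULL" else field for field in fields]


def convert(lines):
    """Yield one tab-separated record per INSERT line, skipping other lines."""
    for line in lines:
        fields = parse_insert(line)
        if fields is not None:
            yield "\t".join(fields)
```

To use it, you could feed the dump through convert line by line and write each result to a new file. Since tab is the default delimiter for PigStorage, that file can then be loaded without the INSERT wrapper, and ID can be declared as an int in the AS clause.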
