Using sorted tables in Hive

2023-03-24 21:21 问答作者：

In summary: I feel that my system is ignoring the concept of pre-sorted tables. - I expected to save time on the sorting step because I was using pre-sorted data, but the query plan seem to indicate an intermediate sorting step.

The sordid details follow:

The Setup =======

I have set the following flags:=============

set hive.enforce.bucketing = true;
set mapred.reduce.tasks=8;
set mapred.map.tasks=8;

Here I create a table to hold a temporary copy of data on disk ========

CREATE TABLE trades
      (symbol STRING, exchange STRING, price FLOAT, volume INT, cond
INT, bid FLOAT, ask FLOAT, time STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
 STORED AS TEXTFILE;

Here I copy the data on disk into the table BTW, the data here is clustered by symbol and sorted by time. I can't seem to get Hive to use this concept ... i.e avoid sorting again

LOAD DATA LOCAL INPATH '%(dir)s2010-05-07'
INTO TABLE trades
partition (dt='2010-05-07');

I use the following final table to enforce bucketing =========== and to impose sort order ===========

CREATE TABLE alltrades
      (symbol STRING, exchange STRING, price FLOAT, volume INT, cond
INT, bid FLOAT, ask FLOAT, time STRING)
CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
 STORED AS TEXTFILE;

data is loaded from a hive table ==========

insert overwrite table alltrades
select symbol, exchange, price, volume, cond, bid, ask, time
from trades
distribute by symbol sort by symbol, time;

It is disappointing to see that any query on alltrades that requires sorted symbol,time does the sorting all over again ... is there a way around this? Also, is there a way to make this whole process work in 1 query step instead of 2 ?

Why SORTING seems to not work =======

Note that the table was constructed and populated with the sort by clause. I'm afraid that dropping these would lead a future reducer to behave as if no sorting was required.

Here is the plan for a query that in my opinion should not involve sorting ... but actually does.========

hive> explain select symbol, time, price from alltrades sort by symbol, time;
OK
ABSTRACT SYNTAX TREE:
 (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME alltrades)))
(TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT
(TOK_SELEXPR (TOK_TABLE_OR_COL symbol)) (TOK_SELEXPR (TOK_TABLE_OR_COL
time)) (TOK_SELEXPR (TOK_TABLE_OR_COL price))) (TOK_SORTBY
(TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL symbol))
(TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL time)))))

STAGE DEPENDENCIES:
 Stage-1 is a root stage
 Stage-0 is a root stage

STAGE PLANS:
 Stage: Stage-1
   Map Reduce
     Alias -> Map Operator Tree:
       alltrades
         TableScan
           alias: alltrades
           Select Operator
        开发者_Python百科     expressions:
                   expr: symbol
                   type: string
                   expr: time
                   type: string
                   expr: price
                   type: float
             outputColumnNames: _col0, _col1, _col2
             Reduce Output Operator
               key expressions:
                     expr: _col0
                     type: string
                     expr: _col1
                     type: string
               sort order: ++
               tag: -1
               value expressions:
                     expr: _col0
                     type: string
                     expr: _col1
                     type: string
                     expr: _col2
                     type: float
     Reduce Operator Tree:
       Extract
         File Output Operator
           compressed: false
           GlobalTableId: 0
           table:
               input format: org.apache.hadoop.mapred.TextInputFormat
               output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

 Stage: Stage-0
   Fetch Operator
     limit: -1

Have you checked out the effect of set hive.enforce.bucketing=true? From http://svn.apache.org/repos/asf/hive/branches/branch-0.7/conf/hive-default.xml

<property>
  <name>hive.enforce.sorting</name>
  <value>false</value>
  <description>Whether sorting is enforced. If true, while inserting into the table, sorting is enforced. </description>
</property>

You may also find reading the implementation of org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer#genBucketingSortingDest useful:

http://svn.apache.org/repos/asf/hive/branches/branch-0.7/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java

hive.enforce.bucketing does not do a global sort of the data set. Instead it writes the data sorted within the buckets (in your case 8/partition). Thus it requires a global sort step to satisfy the query you are looking for.

Hope this helps, Nat

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

The CLUSTERED BY and SORTED BY creation commands do not affect how
data is inserted into a table – only how it is read. This means that
users must be careful to insert data correctly by specifying the
number of reducers to be equal to the number of buckets, and using
CLUSTER BY and SORT BY commands in their query.

Also look at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy

继续阅读：hive

Using sorted tables in Hive

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？