Thursday 14 February 2013

Hadoop Hive insert into partition table example script

Here's an example of creating Hadoop hive daily summary partitions and loading data from a Hive transaction table into newly created partitioned summary table.

The Hadoop Hive Manual has the insert syntax covered neatly but sometimes it's good to see an example.

Background

Colleagues wanted us to produce a smaller query set based on a large (billion rows per day) transaction table called big_txns that was partitioned by load date (load_dt).
They wanted the following in the smaller query set (1,000s of records):

  • Transaction Day - the day the transaction occurred (txn_dt was datetime yyyymmddHHMMSS so needed substringing for yyyymmdd)
  • Transaction Type - an interesting attribute
  • Counts - totals x transaction day x txn_type


Hive Create Partitioned External Table


DROP TABLE IF EXISTS txn_summ_x_txn_type;

CREATE EXTERNAL TABLE txn_summ_x_txn_type
    ( 
txn_dt     STRING,
        txn_type   STRING,
        cnt        BIGINT
     )
    COMMENT 'Transaction summary table showing counts x txn_dt x txn_type partitioned by load_dt'
    PARTITIONED BY( load_dt STRING )
    ROW FORMAT DELIMITED FIELDS TERMINATED by '|'
    STORED AS TEXTFILE
    LOCATION '/data/txn/summ/txn_summ_x_txn_type';


Hive Insert into Daily Partitions

Here is a very basic shell script to build this summary table.
[Aside - originally I had the hive statement using self-consuming input a la ...
hive <
...
EOF
but when I ran it in the background it kept stopping for I don't know what reason.
So resorted to hive -e "hive cmds".
Need to go back and look at this]


#!/bin/bash
# Descrption: Insert the txn_type counts x txn_dt partitioned by load_dt
# Usage:      ins_txn_summ_x_txn_type.sh START_DAY=YYYYMMDD END_DAY=YYYYMMDD
# --------------------------------------------------------------------------


dohive() {

load_dt=$1

hive -e "
SET mapred.child.java.opts=-Xmx1024M;

alter table txn_summ_x_txn_type add if not exists partition (load_dt='${load_dt}');

insert overwrite table txn_summ_x_txn_type
partition ( load_dt = '${load_dt}' )
select SUBSTR(t.txn_dt,1,8) as txn_dt,
       t.txn_type, 
       count(*) as cnt
from big_txns t
where t.load_dt='${load_dt}'
group by SUBSTR(t.txn_dt,1,8),
         t.txn_type;
"

}

#START_DAY=20121001
#END_DAY=20121031

# Allow one to change the START_DAY and END_DAY
eval $@

if [ $# -ne 2 ]
then
  echo "ERROR: usage: ins_txn_summ_x_txn_type.sh START_DAY=YYYYMMDD END_DAY=YYYYMMDD (do not cross month boundaries - if non-Linux - see comment below)"
  exit 1
fi

DAY=$START_DAY

while [ $DAY -le $END_DAY ]
do
  echo DAY $DAY
  dohive $DAY 
  # DAY=$(($DAY+1)) # use this and don't span months if non-Linux
  # on linux this DAY increment works a treat - tx Matt Livesey for reminding
  DAY=`date --date=${DAY}' +1 day' +%Y%m%d`
done

No comments: