Saturday 14 April 2012

Hadoop Hive Partitioned External Table - notes to self

Hive Partitioned External Table 

I had the same issue as Chris Zheng (see http://zhengdong.me/2012/02/22/hive-external-table-with-partitions/) regarding not being able to select anything out of my Hive external table partitions.
In fact, I had solved this problem several weeks earlier without realising it, when I was moving data from one directory to another and altering the partition definitions for the move.

My data is being loaded in a simplistic way into the following HDFS directory structure, i.e. each day's data lands in its own load_dt=YYYYMMDD directory:

/data/myfeed/stg/load_dt=YYYYMMDD

E.g. given the following files:

cat 20120301_myfeed.dat

20120301|001500|test|A|35
20120301|003000|test|B|85
20120301|004500|test|A|25
20120301|010000|test|C|35
20120301|011500|test|A|95
20120301|013000|test|D|55

cat 20120302_myfeed.dat

20120302|001500|test|A|35
20120302|003000|test|B|85
20120302|004500|test|A|25
20120302|010000|test|C|35
20120302|011500|test|A|95
20120302|013000|test|D|55

Load them like this:

hadoop fs -put 20120301_myfeed.dat /data/myfeed/stg/load_dt=20120301
hadoop fs -put 20120302_myfeed.dat /data/myfeed/stg/load_dt=20120302
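
If the load_dt directories don't already exist in HDFS, I believe -put will just create a plain file called load_dt=YYYYMMDD rather than a file inside a directory, so it's safer to create them first:

hadoop fs -mkdir /data/myfeed/stg/load_dt=20120301
hadoop fs -mkdir /data/myfeed/stg/load_dt=20120302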


Create an external table (with load_dt partition) as follows:

set myfeed_stg_location=/data/myfeed/stg;
set myfeed_stg_location;

set myfeed_stg=myfeed_stg_ext;
set myfeed_stg;

-- Suppose myfeed stg data had records like this
-- 20120301|001500|test|A|35
CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:myfeed_stg}
    ( event_dt            STRING
    , event_tm            STRING
    , category            STRING
    , code                STRING
    , num_hits            INT
     )
    COMMENT 'Table for myfeed staging data'
    PARTITIONED BY( load_dt STRING )
    ROW FORMAT DELIMITED FIELDS TERMINATED by '|'
    STORED AS TEXTFILE
    LOCATION '${hiveconf:myfeed_stg_location}';
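
If the DDL lives in a script file (the filename below is just for illustration), the same variables can be passed in on the command line via --hiveconf instead of set, e.g.:

hive --hiveconf myfeed_stg_location=/data/myfeed/stg \
     --hiveconf myfeed_stg=myfeed_stg_ext \
     -f create_myfeed_stg_ext.hql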

I found, as Chris Zheng did, that I got nothing when I selected anything from my myfeed_stg_ext table.

Turns out you need to add the partitions explicitly :( like so:

hive> alter table myfeed_stg_ext add partition (load_dt=20120301);
OK
Time taken: 0.751 seconds
hive> alter table myfeed_stg_ext add partition (load_dt=20120302);
OK
Time taken: 0.279 seconds
hive> select * from myfeed_stg_ext;
OK
20120301 001500 test A 35 20120301
20120301 003000 test B 85 20120301
20120301 004500 test A 25 20120301
20120301 010000 test C 35 20120301
20120301 011500 test A 95 20120301
20120301 013000 test D 55 20120301
20120302 001500 test A 35 20120302
20120302 003000 test B 85 20120302
20120302 004500 test A 25 20120302
20120302 010000 test C 35 20120302
20120302 011500 test A 95 20120302
20120302 013000 test D 55 20120302
Time taken: 0.501 seconds
hive> select * from myfeed_stg_ext where load_dt = 20120301;     
OK
20120301 001500 test A 35 20120301
20120301 003000 test B 85 20120301
20120301 004500 test A 25 20120301
20120301 010000 test C 35 20120301
20120301 011500 test A 95 20120301
20120301 013000 test D 55 20120301
Time taken: 0.314 seconds
hive> 
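
A quick way to confirm which partitions the metastore actually knows about is SHOW PARTITIONS:

hive> show partitions myfeed_stg_ext;

which should list load_dt=20120301 and load_dt=20120302 once the ALTERs above have run.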



Here's a simple shell script to move data from an existing directory structure (/data/myfeed/stg/YYYYMMDD) into the partitioned layout (/data/myfeed/stg/load_dt=YYYYMMDD), adding the matching Hive partition as it goes. The day counter is a plain integer increment, so run it within a single month or change it to handle month/year boundaries (see the sketch after the script).



#!/bin/bash

# Move each day's directory into the load_dt=YYYYMMDD layout and
# register the corresponding partition with Hive.
day=20120301
while [ $day -le 20120331 ]
do
  echo "hadoop fs -mv /data/myfeed/stg/${day} /data/myfeed/stg/load_dt=${day}"
  hadoop fs -mv /data/myfeed/stg/${day} /data/myfeed/stg/load_dt=${day}
  if [ $? -ne 0 ]
  then
    echo "ERROR: hadoop mv failed"
    exit 1
  fi
  sleep 1  # don't need these sleeps - used during testing

  # Register the new partition and do a quick sanity-check select
  hive -e "ALTER TABLE myfeed_stg_ext ADD PARTITION (load_dt='${day}'); SELECT * FROM myfeed_stg_ext WHERE load_dt = '${day}' LIMIT 10;"
  sleep 2  # don't need these sleeps - used during testing

  day=$(($day+1))
done
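
To cope with month/year boundaries, the day counter could be advanced with date arithmetic rather than a plain integer increment - a minimal sketch, assuming GNU date is available:

#!/bin/bash

day=20120301
end=20120405
while [ $day -le $end ]
do
  echo "processing load_dt=${day}"
  # ... hadoop fs -mv and hive ALTER TABLE as in the script above ...
  day=$(date -d "${day} +1 day" +%Y%m%d)
done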


TBC ...

Read up on dynamic partitions ... could this be a more elegant approach?
And compression - lzo, others?
(http://www.mrbalky.com/2011/02/24/hive-tables-partitions-and-lzo-compression/)
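
On the dynamic partitions question: a rough sketch of what that might look like, assuming the raw rows were first loaded into an unpartitioned staging table (called myfeed_raw here purely for illustration - it's not part of the setup above), with load_dt derived from event_dt since the two happen to match in the sample data:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE myfeed_stg_ext PARTITION (load_dt)
SELECT event_dt, event_tm, category, code, num_hits, event_dt AS load_dt
FROM myfeed_raw;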