Friday, 1 February 2013

Move data from one Hadoop cluster to another using distcp

Cluster Specs
I have 2 HDFS clusters in the same network domain (1G networking).
The source cluster has 9 data nodes (16 core, 48G RAM, 4x2TB disks) and the target cluster has 18 data nodes (16 core, 48 G RAM, 4x2TB disks).

Task
I needed to transfer a week's worth of data from one to the other - 80G per day. 

Attempt 1 - Slow and Fiddly
Initially I used the following which was slow and fiddly:
  • copyToLocal - HDFS to NFS - 45 mins
  • copyFromLocal - NFS to HDFS - 1 hr 45 mins
Attempt 2 - Fast and Simple
Then I used hadoop distcp
  • hadoop distcp - 4mins
  • Rename path in HDFS cluster #2  (to be in line with naming convention there) - 1 sec
  • Create hive partitions in HDFS cluster 2 - 1 min

Conclusion
Hadoop distcp is pretty impressive running across 2 quiet clusters.

Misc Notes
Read more about hadoop distcp here.

Command used (note - just happened that hdfs nn was listening on different ports in the 2 clusters):

nohup hadoop distcp hdfs://hdfssvr1:8020/data/abc/stg/load_dt=20121127 hdfs://hdfssvr2:54310/user/hive/warehouse/abc > /var/tmp/distcp_20121127.log 2>&1 &

(Note the above command copies the dir load_dt=20121127 [full of subdirs in my case] into /user/hive/warehouse/abc dir in the remote hdfs cluster)

Logging generated by the command:

13/02/01 18:41:07 INFO tools.DistCp: srcPaths=[hdfs://hdfssvr1:8020/data/abc/stg/load_dt=20121127]
13/02/01 18:41:07 INFO tools.DistCp: destPath=hdfs://hdfssvr2:54310/user/hive/warehouse/abc
13/02/01 18:41:09 INFO tools.DistCp: sourcePathsCount=4901
13/02/01 18:41:09 INFO tools.DistCp: filesToCopyCount=4311
13/02/01 18:41:09 INFO tools.DistCp: bytesToCopyCount=80.1g
13/02/01 18:41:10 INFO mapred.JobClient: Running job: job_201302011505_0006
13/02/01 18:41:11 INFO mapred.JobClient:  map 0% reduce 0%
13/02/01 18:41:26 INFO mapred.JobClient:  map 1% reduce 0%
13/02/01 18:41:27 INFO mapred.JobClient:  map 2% reduce 0%
13/02/01 18:41:28 INFO mapred.JobClient:  map 3% reduce 0%
13/02/01 18:41:29 INFO mapred.JobClient:  map 4% reduce 0%
13/02/01 18:41:30 INFO mapred.JobClient:  map 5% reduce 0%
13/02/01 18:44:05 INFO mapred.JobClient:  map 93% reduce 0%
13/02/01 18:44:07 INFO mapred.JobClient:  map 94% reduce 0%
13/02/01 18:44:10 INFO mapred.JobClient:  map 95% reduce 0%
13/02/01 18:44:12 INFO mapred.JobClient:  map 96% reduce 0%
13/02/01 18:44:14 INFO mapred.JobClient:  map 97% reduce 0%
13/02/01 18:44:17 INFO mapred.JobClient:  map 98% reduce 0%
13/02/01 18:44:20 INFO mapred.JobClient:  map 99% reduce 0%
13/02/01 18:44:31 INFO mapred.JobClient:  map 100% reduce 0%
13/02/01 18:44:32 INFO mapred.JobClient: Job complete: job_201302011505_0006
13/02/01 18:44:32 INFO mapred.JobClient: Counters: 20
13/02/01 18:44:32 INFO mapred.JobClient:   Job Counters 
13/02/01 18:44:32 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=12572408
13/02/01 18:44:32 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/02/01 18:44:32 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/02/01 18:44:32 INFO mapred.JobClient:     Launched map tasks=186
13/02/01 18:44:32 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/02/01 18:44:32 INFO mapred.JobClient:   distcp
13/02/01 18:44:32 INFO mapred.JobClient:     Files copied=4311
13/02/01 18:44:32 INFO mapred.JobClient:     Bytes copied=85969881779
13/02/01 18:44:32 INFO mapred.JobClient:     Bytes expected=85969881779
13/02/01 18:44:32 INFO mapred.JobClient:   FileSystemCounters
13/02/01 18:44:32 INFO mapred.JobClient:     HDFS_BYTES_READ=85971609725
13/02/01 18:44:32 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=8955232
13/02/01 18:44:32 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=85969881779
13/02/01 18:44:32 INFO mapred.JobClient:   Map-Reduce Framework
13/02/01 18:44:32 INFO mapred.JobClient:     Map input records=4900
13/02/01 18:44:32 INFO mapred.JobClient:     Physical memory (bytes) snapshot=25844953088
13/02/01 18:44:32 INFO mapred.JobClient:     Spilled Records=0
13/02/01 18:44:32 INFO mapred.JobClient:     CPU time spent (ms)=3410430
13/02/01 18:44:32 INFO mapred.JobClient:     Total committed heap usage (bytes)=35171532800
13/02/01 18:44:32 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=201811800064
13/02/01 18:44:32 INFO mapred.JobClient:     Map input bytes=1473253
13/02/01 18:44:32 INFO mapred.JobClient:     Map output records=0
13/02/01 18:44:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=25854


No comments: