Sunil S. Ranka's Weblog

Superior Data Analytics is the antidote to Business Failure

More Animals in Big Data Zoo – Big Data Landscape for 2016

Posted by sranka on March 26, 2016

Hi All

While surfing net stumbled upon Big Data Landscape for 2016 image and it was very impressive to see many more new Animals in Big Data Zoo.


New Animals

Hope This Helps

Sunil S Ranka

Posted in Big Data, Hadoop, Me | Tagged: , , , | Leave a Comment »

Big Data – Tez, MR, Spark Execution Engine : Performance Comparison

Posted by sranka on February 25, 2016

There is no question that massive data is being generated in greater volumes than ever before. Along with the traditional data set, new data sources as sensors, application logs, IOT devices, and social networks are adding to data growth. Unlike traditional ETL platforms like Informatica, ODI, DataStage that are largely proprietary commercial products, the majority of Big ETL platforms are powered by open source.

With many execution engines, customers are always curious about their usage and performance.

To put it into perspective, In this post I am running set of query against 3 key Query Engines namely Tez, MapReduce, Spark (MapReduce) to compare the query execution timings.

create external table sensordata_csv
ts string,
deviceid int,
sensorid int,
val double
row format delimited
fields terminated by '|'
stored as textfile
location '/user/sranka/MachineData/sensordata'

drop table sensordata_part;

create table sensordata_part
deviceid int,
sensorid int,
val double
partitioned by (ts string)
clustered by (deviceid) sorted by (deviceid) into 10 buckets
stored as orc

"** 1) Baseline: Read a csv without Tez"
" set hive.execution.engine=mr"
" select count(*) from sensordata_csv where ts = '2014-01-01'"
2016-02-25 02:57:27,444 Stage-1 map = 0%,  reduce = 0%
2016-02-25 02:57:35,880 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.84 sec
2016-02-25 02:57:44,420 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.99 sec
MapReduce Total cumulative CPU time: 4 seconds 990 msec
Ended Job = job_1456183816302_0046
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 4.99 sec   HDFS Read: 3499156 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 990 msec
Time taken: 32.524 seconds, Fetched: 1 row(s)

"** 2) Read a csv with Tez"
" set hive.execution.engine=tez"
" select count(*) from sensordata_csv where ts = '2014-01-01'"
Total jobs = 1
Launching Job 1 out of 1

Status: Running (application id: application_1456183816302_0047)

Map 1: -/-    Reducer 2: 0/1
Map 1: 0/1    Reducer 2: 0/1
Map 1: 0/1    Reducer 2: 0/1
Map 1: 0/1    Reducer 2: 0/1
Map 1: 1/1    Reducer 2: 0/1
Map 1: 1/1    Reducer 2: 1/1
Status: Finished successfully
Time taken: 16.905 seconds, Fetched: 1 row(s)

"** 3) Read a partition with Tez"
" select count(*) from sensordata_part where ts = '2014-01-01'"
Total jobs = 1
Launching Job 1 out of 1
Status: Running (application id: application_1456183816302_0047)

Map 1: -/-    Reducer 2: 0/1
Map 1: 0/2    Reducer 2: 0/1
Map 1: 1/2    Reducer 2: 0/1
Map 1: 2/2    Reducer 2: 0/1
Map 1: 2/2    Reducer 2: 1/1
Status: Finished successfully
Time taken: 6.503 seconds, Fetched: 1 row(s)

"** 4) Read a partition with Spark"
" select count(*) from sensordata_part where ts = '2014-01-01'"

Time taken: took 5.8 seconds

"** 5) Read a csv with Spark"
" select count(*) from sensordata_csv where ts = '2014-01-01'"
Time taken: took 4.5 seconds

Query 1select count(*) from sensordata_csv where ts = ‘2014-01-01’

Query 2select count(*) from sensordata_part where ts = ‘2014-01-01’

Below tables shows the execution timings :
Screen Shot 2016-02-24 at 11.07.03 PM

Conclusion Which Engine is right :

Spark being In memory execution engine comes out to be a clear winner, but in certain scenario especially in the current scenario of running query on partition table TEZ execution engines comes closer to spark.

With this you can not conclude that you Spark will solve your — World Hunger Problem — of Big ETL, being continuously growing product Spark has its own challenges when it comes to productization of the Spark workload, same holds True with TEZ. In all MR engine has been around for the most time and its been the core of HDFS framework, for mission critical workloads which are not time bound, MR could be the best choice.

Hope This Helps

Sunil S Ranka

About Spark :

About MapReduce :

About Tez :

Posted in Hadoop | Tagged: , , , , , , , , , , | 1 Comment »

Map Reduce: File compression and Processing cost

Posted by sranka on August 19, 2015

Recently while working with a customer we ran into an interesting situation concerning file compression and processing time. For a system like Hadoop, file compression has been always a good way to save on space, especially when Hadoop replicates the data multiple times.

All Hadoop compression algorithms exhibit a space/time trade-off: faster compression and decompression speeds usually come at the expense of space savings.  For more details about how compression is used, see There are many file compression formats, but below we only mention some of the commonly used compression methods in Hadoop. 

The type of compression plays an important role — the true power of MapReduce is realized when input can be split, and not all compression formats are splittable, resulting in an unexpected number of map tasks.

In the case of splittable formats, the number of mappers will correspond to number of block-sized chunks into which the file has been stored, whereas in case of a non-splittable format a single map task will process all the blocks. Although the table above shows that LZO is not splittable, in fact it is possible to index LZO files so that performance can be greatly improved. At Altiscale, our experience has shown that by indexing LZO format files, it will make LZO-compressed files splittable and you can gain in performance. For more information, see on how to do it.

Format Codec Extension Splittable Hadoop HDInsight
DEFLATE .deflate N Y Y
Gzip .gz N Y Y
Bzip2 .bz2 Y Y Y
LZO com.hadoop.compression.lzo.LzopCodec .lzo N Y N
LZ4 .Lz4 N Y N
Snappy .Snappy N Y N

To measure the performance and processing costs for the compression method, we did a simple wordcount example on Altiscale’s in-house production cluster by following the steps mentioned in the below link.  Although there are other more detailed methodologies to measure performance, we choose this example just to demonstrate the benefit and performance tradeoff of a splittable solution:
The following  table represents the results.

# File Name Compression Option size


Comments # Of


# Of


Processing Time
1 input.txt n/a 5.9 No Compression 24 1 1min 16sec
2 input.txt.gz default 1.46 Normal Gzip compression 1 1 11min 19sec
3 input1.txt.gz -1 1.42 gzip with -1 option, means optimize for speed 1 1 11min 41sec
4 input9.txt.gz -9 1.14 gzip with -9 option, means optimize for space 1 1 11min 21sec

The following shows how you can use the Resource Manager Web UI to find the relevant job statistics that show the processing time as well as the number of mappers and reducers that were used.  



In our test #1 scenario the uncompressed file size was 5.9 GB when stored in HDFS. With a HDFS block size of 256 MB, the file was stored as ~24 blocks, and a MapReduce job using this file as input created 24 input splits, each processed independently as input to a separate map task taking only 1 min 16 sec.

In the rest of the test scenarios, due to gzip, the file could not be split, resulting in a single input split and taking an average time of approximately 11 min. Even the gzip -1 option, meant for optimize speed, or  -9 option, meant for optimize space, did not  help much.

Gzip compression is an important aspect of the Hadoop ecosystem; it helps save space at a trade off of processing time. If the data processing is time sensitive, then a splittable compression format, or even uncompressed files would be recommended.

Posted in Uncategorized | Tagged: , , , , , , | Leave a Comment »

Accessing HDFS files on local File system using mountableHDFS – FUSE

Posted by sranka on April 9, 2015

Hi All

Recently we had one requirement wherein we had to merge the files post Map and Reducer job. Since the file needed to be given to the outbound team outside of Hadoop development team, having these files on local system would have been ideal. The customer IT team worked with cloudera and gave us a mount point using a utility/concept called “mountableHDFS” aka FUSE (Filesystem in Userspace)¬† .

mountableHDFS, helps¬†allowing HDFS to be mounted (on most flavors of Unix) as a standard file system using the mount command. Once mounted, the user can operate on an instance of hdfs using standard Unix utilities such as ‘ls’, ‘cd’, ‘cp’, ‘mkdir’, ‘find’, ‘grep’

For more details on mountableHDFS :

For how to configure on cloudera :


Special thanks to Aditi Hedge for bringing to my attention.

Hope This Helps,

Sunil S Ranka

‚ÄúSuperior BI is the antidote to Business Failure‚ÄĚ

Posted in Uncategorized | Tagged: , , , , | Leave a Comment »

How to read HDFS fsImage file

Posted by sranka on April 2, 2015

During one of the sizing exercise the  ask for server capacity  was more than the actual usage of cluster . Knowing the data and usage, I was not convinced that we should be asking for more memory space. That triggered the thought of

Conceptually FSIMG file is the balancesheet of all the file and their existence and location. If somehow we could read the metadata withing the file and make sence out of it, than it could help us as follow :

  • how to keep the cluster clean.
  • how to manage the space on server by means of knowing file duplication, last access time
  • To Know which are longest running jobs

To more about the files and attributes :

STEP 1: Download the latest fsimage copy.

$ hdfs dfsadmin -fetchImage /tmp

$ ls -ltr /tmp | grep -i fsimage
-rw-r–r– 1 root root 22164 Aug 15¬†17:27 fsimage_0000000000000004389

$ hdfs oiv -i /tmp/fsimage_0000000000000001386 -o /tmp/fsimage.txt

This would¬†launche a HTTP server which exposes read-only WebHDFS API by default at port “5978”.

For more detail on oiv, you can visit :


Hope This Helps

Sunil S Ranka

‚ÄúSuperior BI is the antidote to Business Failure‚ÄĚ

Posted in Uncategorized | Tagged: , , , , , , | Leave a Comment »

Permissions for both HDFS and local fileSystem paths

Posted by sranka on July 18, 2014

Hi All,

Permission issues is one of the key error , while setting up Hadoop Cluster, while debugging some error found below table on¬† . It’s a good scorecard to keep handy.


Permissions for both HDFS and local fileSystem paths

The following table lists various paths on HDFS and local filesystems (on all nodes) and recommended permissions:

Filesystem Path User:Group Permissions
local hdfs:hadoop drwx——
local hdfs:hadoop drwx——
local $HADOOP_LOG_DIR hdfs:hadoop drwxrwxr-x
local $YARN_LOG_DIR yarn:hadoop drwxrwxr-x
local yarn.nodemanager.local-dirs yarn:hadoop drwxr-xr-x
local yarn.nodemanager.log-dirs yarn:hadoop drwxr-xr-x
local container-executor root:hadoop –Sr-s—
local conf/container-executor.cfg root:hadoop r——–
hdfs / hdfs:hadoop drwxr-xr-x
hdfs /tmp hdfs:hadoop drwxrwxrwxt
hdfs /user hdfs:hadoop drwxr-xr-x
hdfs yarn.nodemanager.remote-app-log-dir yarn:hadoop drwxrwxrwxt
hdfs mapreduce.jobhistory.intermediate-done-dir mapred:hadoop drwxrwxrwxt
hdfs mapreduce.jobhistory.done-dir mapred:hadoop drwxr-x—

Hope this helps

Sunil S Ranka

‚ÄúSuperior BI is the antidote to Business Failure‚ÄĚ

This table was taken directly from


Posted in 11g, Big Data | Tagged: , , , , , , , | Leave a Comment »

Need for Defining Reference Architecture For Big Data

Posted by sranka on May 7, 2014

Hi Fellow Big Data Admirers ,

With big data and analytics playing an influential role helping organizations achieve a competitive advantage, IT managers are advised not to deploy big data in silos but instead to take a holistic approach toward it and define a base reference architecture even before contemplating positioning the necessary tools. 

My latest print media article (5th in the series) for CIO magazine (ITNEXT) talks extensively about need of reference architecture in Big Data

Click Here For : Need For Defining Big Data Reference Architecture


Hope you Enjoy Reading this.

Hope this helps

Sunil S Ranka

‚ÄúSuperior BI is the antidote to Business Failure‚ÄĚ

For copy of May 2014 IT Next Magazine please visit ( My Article is on Page 37 )



Posted in Big Data | Tagged: , , , , , , , , , , , , | Leave a Comment »

How to find out a table type in Hive Metastore.

Posted by sranka on April 10, 2014

Hi All

As Hive metastore is getting into the center of nervous system for the different type of  SQL engines like Shark and Impala. It getting equally difficult to distinguish type of table created in Hive metastore. Eg. if we create a impala table using impala shell you will see the same table on hive prompt and vice versa. See the below example


Step 1 : “Create Table” in Impala Shell and “Show Table” On HIVE Shell

[] > create table impala_table ( id bigint);

[] > show tables 'impala_table';

Query: show tables 'impala_table'
Query finished, fetching results ...
| name             |
| impala_table |
Returned 1 row(s) in 0.01s

hive> show tables 'impala_table';
Time taken: 0.073 seconds

Step 2 : “Create Table” in Hive Shell and “Show Table” On Impala Shell

hive> create table hive_table ( id bigint);
Time taken: 0.058 seconds

Step 3 : Invalidate Metadata on Impala Shell ( This may not be needed always )

[] > invalidate metadata;
Query: invalidate metadata
Query finished, fetching results ...

Returned 0 row(s) in 5.11s

Step 4 : “Show Table” On Impala Shell


[] > show tables 'hive_table';
Query: show tables 'hive_table'
Query finished, fetching results ...
| name       |
| hive_table |
Returned 1 row(s) in 0.01s

In short this proves that tables are visible in both shells. Use describe formatted <table name>¬† command to find out the details. Storage Desc Params will show a value “serialization.format” for hive table, where in for Impala table, we will not have any value.


hive> describe formatted hive_table;
# col_name              data_type               comment

id                      bigint                  None

# Detailed Table Information
Database:               default
Owner:                  rsunil
CreateTime:             Thu Apr 10 13:13:09 PDT 2014
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://
Table Type:             MANAGED_TABLE
Table Parameters:
transient_lastDdlTime   1397160789

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:

serialization.format    1

Time taken: 0.115 seconds


hive> describe formatted impala_table;
# col_name data_type comment

id bigint None

# Detailed Table Information
Database: default
Owner: rsunil
CreateTime: Thu Apr 10 13:10:30 PDT 2014
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://
Table Parameters:
transient_lastDdlTime 1397160630

# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
Compressed: No
Num Buckets: 0
Bucket Columns: []
Sort Columns: []
Time taken: 0.185 seconds



For tables created in impala with Parquet format will give below class exception.

hive> describe formatted parquet_ob_mdm_et28;
FAILED: RuntimeException java.lang.ClassNotFoundException: com.cloudera.impala.hive.serde.ParquetInputFormat</pre>

Hope this helps

Sunil S Ranka

‚ÄúSuperior BI is the antidote to Business Failure‚ÄĚ

Posted in Big Data | Tagged: , , , , , , , , | Leave a Comment »

How To Create External Hive Table on HBase

Posted by sranka on March 28, 2014

Hi All,

While building a data flow for replacing one of the EDW’ workflow using Big Data technology stack , came across some interesting findings and issues. ¬†Due to ¬†UPSERT (¬†INSERT new records or UPDATE existing records depending)¬†nature of data we had to use Hbase, but to expose the outbound feed we need to do some calculation on HBase and publish that to Hive as external. Even though conceptually , its easy to create an external hive table on HBase is simple, but I had to go through some hoop.


Table Creation in hbase
hbase(main):002:0> create 'mytable', 'cf'
hbase(main):004:0> put 'mytable', 'first', 'cf:message', 'hello HBase'
hbase(main):005:0> put 'mytable', 'second', 'cf:foo', 0x0
0 row(s) in 0.0130 seconds
hbase(main):006:0> put 'mytable', 'third', 'cf:bar', 3.14159
0 row(s) in 0.0080 second

hbase(main):002:0> describe 'mytable'
'2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false'
, ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}
1 row(s) in 0.9610 seconds

hbase(main):003:0> scan 'mytable'
first column=cf:foo, timestamp=1395167684857, value=0
first column=cf:message, timestamp=1395167407496, value=hello HBase
second column=cf:foo, timestamp=1395167483988, value=0
third column=cf:bar, timestamp=1395167493639, value=3.14159
3 row(s) in 0.0760 seconds

Table Creation in Hive

Hive >  CREATE EXTERNAL TABLE hbase_table_3(key string, value string,value1 string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:foo,cf:message") TBLPROPERTIES ("" = "mytable");

Table Access in Hive

hive> set hbase.client.scanner.caching=50000;
hive> desc hbase_table_3;
key     string  from deserializer
value   string  from deserializer
value1  string  from deserializer
Time taken: 0.428 seconds

Zookeeper ,  Aux Path and hbase.client.scanner.caching 

Zookeeper is an important part of  Hadoop ecosystem, it works as a Resource Management service. You would need to make sure that it has a quorum with odd numbers (1,3,5) of instances. For accessing external table you need to have zookeeper services up and running.  Along with zookeeper you will need to make few changes.

If you have big HBase table, you will need to set higher hbase.client.scanner.caching property before running the Hbase query. In our case we use 50000.

See below changes needed to hive-site.xml and hbase-site.xml.

Changes in hive-site.xml



Changes in hbase-site.xml


 Linux Performance Tuning

Some of the following commands have helped enhancing performance.

echo 1 > /proc/sys/vm/drop_caches
echo 2 > /proc/sys/vm/drop_caches
echo 3 > /proc/sys/vm/drop_caches


Special Thanks to Aditi Hedge, Rathinavel Sivaswamy and Anurag Gupta for their inputs.

Hope this helps

Sunil S Ranka

‚ÄúSuperior BI is the antidote to Business Failure”

Posted in Big Data, HBase | Tagged: , , , , , , , , | Leave a Comment »

Hbase : Co-relation between RegionServer and Region

Posted by sranka on March 21, 2014

Hi All

While looking into HBase performance issue, one of the suggestion was to have more region for a larger table. There was some confusion around, “Region” vs “RegionServer” .¬†While doing some digging, found a simple text written below.

The basic unit of scalability and load balancing in HBase is called a region. Regions are essentially contiguous ranges of rows stored together. They are dynamically split by the system when they become too large. Alternatively, they may also be merged to reduce their number and required storage files.*

The HBase regions are equivalent to range partitions as used in database sharding. They can be spread across many physical servers, thus distributing the load, and therefore providing scalability

Initially there is only one region for a table, and as you start adding data to it, the system¬†is monitoring it to ensure that you do not exceed a configured maximum size. If you¬†exceed the limit, the region is split into two at the middle key‚ÄĒthe row key in the middle¬†of the region‚ÄĒcreating two roughly equal halves.

Each region is served by exactly one region server, and each of these servers can serve many regions at any time. The logical view of a table is actually a set of regions hosted by many region servers.

The default split policy for HBase 0.94 and trunk is IncreasingToUpperBoundRegionSplitPolicy, which does more aggressive splitting based on the number of regions hosted in the same region server. The split policy uses the max store file size based on Min (R^2 * ‚Äúhbase.hregion.memstore.flush.size‚ÄĚ, ‚Äúhbase.hregion.max.filesize‚ÄĚ), where R is the number of regions of the same table hosted on the same regionserver. So for example, with the default memstore flush size of 128MB and the default max store size of 10GB, the first region on the region server will be split just after the first flush at 128MB. As number of regions hosted in the region server increases, it will use increasing split sizes: 512MB, 1152MB, 2GB, 3.2GB, 4.6GB, 6.2GB, etc. After reaching 9 regions, the split size will go beyond the configured ‚Äúhbase.hregion.max.filesize‚ÄĚ, at which point, 10GB split size will be used from then on. For both of these algorithms, regardless of when splitting occurs, the split point used is the rowkey that corresponds to the mid point in the ‚Äúblock index‚ÄĚ for the largest store file in the largest store.

¬† The above text has been taken from Chapter 1 – Introduction, section –¬†Building Blocks of¬†“HBase The Definitive Guide” book and “HortonWorks Blog “.

Hope This Helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure

Posted in Uncategorized | Tagged: , , , , , , , | Leave a Comment »


Get every new post delivered to your Inbox.

Join 50 other followers

%d bloggers like this: