Sunil S. Ranka's Weblog

Superior Data Analytics is the antidote to Business Failure

Posts Tagged ‘Hadoop’

More Animals in Big Data Zoo – Big Data Landscape for 2016

Posted by sranka on March 26, 2016

Hi All

While surfing the net I stumbled upon the Big Data Landscape for 2016 image, and it was impressive to see how many new animals have joined the Big Data zoo.

 

(Image: Big Data Landscape for 2016, highlighting the new additions to the Big Data zoo.)

Hope This Helps

Sunil S Ranka

Posted in Big Data, Hadoop, Me | Tagged: , , , | Leave a Comment »

Big Data – Tez, MR, Spark Execution Engine : Performance Comparison

Posted by sranka on February 25, 2016

There is no question that massive data is being generated in greater volumes than ever before. Along with traditional data sets, new data sources such as sensors, application logs, IoT devices, and social networks are adding to the growth. Unlike traditional ETL platforms such as Informatica, ODI, and DataStage, which are largely proprietary commercial products, the majority of Big Data ETL platforms are powered by open source.

With so many execution engines available, customers are always curious about their usage and performance.

To put this into perspective, in this post I run the same set of queries against three key query engines, namely Tez, MapReduce, and Spark, and compare the query execution timings.

create external table sensordata_csv
(
ts string,
deviceid int,
sensorid int,
val double
)
row format delimited
fields terminated by '|'
stored as textfile
location '/user/sranka/MachineData/sensordata'
;

drop table sensordata_part;

create table sensordata_part
(
deviceid int,
sensorid int,
val double
)
partitioned by (ts string)
clustered by (deviceid) sorted by (deviceid) into 10 buckets
stored as orc
;
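The post does not show how sensordata_part is populated from sensordata_csv. A minimal sketch, assuming dynamic partitioning on ts and the hive CLI (the session settings and invocation below are my assumptions, not from the original post):

hive -e "
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.enforce.bucketing=true;
-- the partition column (ts) must come last in the select list
insert overwrite table sensordata_part partition (ts)
select deviceid, sensorid, val, ts from sensordata_csv;
"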

"**********************************************"
"** 1) Baseline: Read a csv without Tez"
" set hive.execution.engine=mr"
" select count(*) from sensordata_csv where ts = '2014-01-01'"
"**********************************************"
2016-02-25 02:57:27,444 Stage-1 map = 0%,  reduce = 0%
2016-02-25 02:57:35,880 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.84 sec
2016-02-25 02:57:44,420 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.99 sec
MapReduce Total cumulative CPU time: 4 seconds 990 msec
Ended Job = job_1456183816302_0046
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 4.99 sec   HDFS Read: 3499156 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 990 msec
OK
16733
Time taken: 32.524 seconds, Fetched: 1 row(s)

"**********************************************"
"** 2) Read a csv with Tez"
" set hive.execution.engine=tez"
" select count(*) from sensordata_csv where ts = '2014-01-01'"
"**********************************************"
Total jobs = 1
Launching Job 1 out of 1

Status: Running (application id: application_1456183816302_0047)

Map 1: -/-    Reducer 2: 0/1
Map 1: 0/1    Reducer 2: 0/1
Map 1: 0/1    Reducer 2: 0/1
Map 1: 0/1    Reducer 2: 0/1
Map 1: 1/1    Reducer 2: 0/1
Map 1: 1/1    Reducer 2: 1/1
Status: Finished successfully
OK
16733
Time taken: 16.905 seconds, Fetched: 1 row(s)

"**********************************************"
"** 3) Read a partition with Tez"
" select count(*) from sensordata_part where ts = '2014-01-01'"
"**********************************************"
Total jobs = 1
Launching Job 1 out of 1
Status: Running (application id: application_1456183816302_0047)

Map 1: -/-    Reducer 2: 0/1
Map 1: 0/2    Reducer 2: 0/1
Map 1: 1/2    Reducer 2: 0/1
Map 1: 2/2    Reducer 2: 0/1
Map 1: 2/2    Reducer 2: 1/1
Status: Finished successfully
OK
16733
Time taken: 6.503 seconds, Fetched: 1 row(s)

"**********************************************"
"** 4) Read a partition with Spark"
" select count(*) from sensordata_part where ts = '2014-01-01'"
"**********************************************"

Time taken: 5.8 seconds

"**********************************************"
"** 5) Read a csv with Spark"
" select count(*) from sensordata_csv where ts = '2014-01-01'"
"**********************************************"
Time taken: 4.5 seconds
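The exact Spark invocation is not shown above. One way to reproduce these timings, assuming Spark is built with Hive support and shares the same Hive metastore, is the spark-sql command line (a sketch, not the author's original setup):

spark-sql -e "select count(*) from sensordata_csv where ts = '2014-01-01'"
spark-sql -e "select count(*) from sensordata_part where ts = '2014-01-01'"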

Query 1: select count(*) from sensordata_csv where ts = '2014-01-01'

Query 2: select count(*) from sensordata_part where ts = '2014-01-01'

The table below summarizes the execution timings from the runs above:

Query | Engine | Time taken
Query 1 (csv) | MapReduce | 32.5 sec
Query 1 (csv) | Tez | 16.9 sec
Query 1 (csv) | Spark | 4.5 sec
Query 2 (partitioned) | Tez | 6.5 sec
Query 2 (partitioned) | Spark | 5.8 sec

Conclusion: Which Engine Is Right?

Spark, being an in-memory execution engine, comes out as the clear winner, but in certain scenarios, especially the current one of running a query against a partitioned table, the Tez execution engine comes close to Spark.

That said, you cannot conclude that Spark will solve the world-hunger problem of Big ETL. Being a continuously evolving product, Spark has its own challenges when it comes to productionizing Spark workloads, and the same holds true for Tez. MapReduce has been around the longest and has been at the core of the Hadoop framework; for mission-critical workloads that are not time bound, MR could be the best choice.

Hope This Helps

Sunil S Ranka

About Spark : http://spark.apache.org/

About MapReduce : https://en.wikipedia.org/wiki/MapReduce

About Tez : https://tez.apache.org/

Posted in Hadoop | Tagged: , , , , , , , , , , | 1 Comment »

Map Reduce: File compression and Processing cost

Posted by sranka on August 19, 2015

Recently, while working with a customer, we ran into an interesting situation concerning file compression and processing time. For a system like Hadoop, file compression has always been a good way to save space, especially since Hadoop replicates the data multiple times.

All Hadoop compression algorithms exhibit a space/time trade-off: faster compression and decompression speeds usually come at the expense of space savings.  For more details about how compression is used, see https://documentation.altiscale.com/when-and-why-to-use-compression. There are many file compression formats, but below we only mention some of the commonly used compression methods in Hadoop. 

The type of compression plays an important role: the true power of MapReduce is realized only when the input can be split, and not all compression formats are splittable, which can result in an unexpected number of map tasks.

In the case of a splittable format, the number of mappers corresponds to the number of block-sized chunks in which the file is stored, whereas with a non-splittable format a single map task processes all the blocks. Although the table below shows that LZO is not splittable, it is in fact possible to index LZO files so that performance can be greatly improved. At Altiscale, our experience has shown that indexing LZO files makes them splittable and yields a real performance gain. For more information on how to do this, see https://documentation.altiscale.com/compressing-and-indexing-your-data-with-lzo.

Format | Codec | Extension | Splittable | Hadoop | HDInsight
DEFLATE | org.apache.hadoop.io.compress.DefaultCodec | .deflate | N | Y | Y
Gzip | org.apache.hadoop.io.compress.GzipCodec | .gz | N | Y | Y
Bzip2 | org.apache.hadoop.io.compress.BZip2Codec | .bz2 | Y | Y | Y
LZO | com.hadoop.compression.lzo.LzopCodec | .lzo | N | Y | N
LZ4 | org.apache.hadoop.io.compress.Lz4Codec | .lz4 | N | Y | N
Snappy | org.apache.hadoop.io.compress.SnappyCodec | .snappy | N | Y | N
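As noted above, LZO files become splittable once they are indexed. A minimal sketch using the hadoop-lzo distributed indexer follows; the jar location and HDFS path are assumptions for illustration:

# builds a .index file next to the .lzo file so MapReduce can split it
hadoop jar /opt/hadoop/lib/hadoop-lzo.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer \
  /user/sranka/data/big_file.lzo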

To measure the performance and processing cost of each compression option, we ran a simple wordcount example on Altiscale's in-house production cluster, following the steps in the link below. Although there are more detailed methodologies for measuring performance, we chose this example simply to demonstrate the benefit and performance trade-off of a splittable format:

https://documentation.altiscale.com/wordcount-example
The following table shows the results.

# | File Name | Compression Option | Size (GB) | Comments | # of Maps | # of Reducers | Processing Time
1 | input.txt | n/a | 5.9 | No compression | 24 | 1 | 1 min 16 sec
2 | input.txt.gz | default | 1.46 | Normal gzip compression | 1 | 1 | 11 min 19 sec
3 | input1.txt.gz | -1 | 1.42 | gzip with -1 option (optimize for speed) | 1 | 1 | 11 min 41 sec
4 | input9.txt.gz | -9 | 1.14 | gzip with -9 option (optimize for space) | 1 | 1 | 11 min 21 sec
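For reference, the three gzip variants in rows 2 through 4 can be produced with standard gzip options before uploading to HDFS (the file names follow the table; the HDFS path is an assumption):

gzip -c input.txt > input.txt.gz        # default compression level
gzip -1 -c input.txt > input1.txt.gz    # -1 optimizes for speed
gzip -9 -c input.txt > input9.txt.gz    # -9 optimizes for space
hadoop fs -put input*.txt.gz /user/sranka/wordcount/input/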

The following shows how you can use the Resource Manager Web UI to find the relevant job statistics that show the processing time as well as the number of mappers and reducers that were used.  

(Screenshot: Resource Manager Web UI showing the job statistics for the gzip input run.)

Conclusion

In our test #1 scenario, the uncompressed file size was 5.9 GB when stored in HDFS. With an HDFS block size of 256 MB, the file was stored as roughly 24 blocks, and a MapReduce job using this file as input created 24 input splits, each processed independently by a separate map task, taking only 1 min 16 sec.

In the remaining test scenarios, because of gzip, the file could not be split, resulting in a single input split and an average time of approximately 11 minutes. Even the gzip -1 option (optimized for speed) and the -9 option (optimized for space) did not help much.

Gzip compression is an important part of the Hadoop ecosystem; it saves space at the expense of processing time. If the data processing is time sensitive, a splittable compression format, or even uncompressed files, would be recommended.

Posted in Uncategorized | Tagged: , , , , , , | 1 Comment »

Need for Defining Reference Architecture For Big Data

Posted by sranka on May 7, 2014

Hi Fellow Big Data Admirers ,

With big data and analytics playing an influential role in helping organizations achieve a competitive advantage, IT managers are advised not to deploy big data in silos but instead to take a holistic approach and define a base reference architecture even before contemplating which tools to position.

My latest print media article (5th in the series) for CIO magazine (ITNEXT) talks extensively about the need for a reference architecture for Big Data.

Click Here For : Need For Defining Big Data Reference Architecture

 

Hope you Enjoy Reading this.

Hope this helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

For a copy of the May 2014 ITNEXT magazine, please visit http://www.itnext.in/digital_assets/330/IT-NEXT-Vol-05-Issue-04-May-2014.pdf (my article is on page 37).

 

 

Posted in Big Data | Tagged: , , , , , , , , , , , , | Leave a Comment »

Big Data : Right Approach Right Solution

Posted by sranka on February 1, 2014

Hi All,

For the past few months I have been meeting with clients and discussing their potential need for Big Data. The discussion gets to the bottom of one question: do they really need Big Data? The ITNEXT article linked below talks about how, as big data grows ever bigger, IT managers are challenged with identifying the data that truly qualifies as "big" and finding appropriate solutions to process it.

Click Here To Read Full Article  : Right Approach, Right Solution 

Hope This Helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

Posted in Uncategorized | Tagged: , , , , , , , , | Leave a Comment »

How To Find Size Of Table In Hive / HDFS

Posted by sranka on November 19, 2013

Hi All

With data volume being the constant challenge in Big Data, as an administrator you will have to keep a tab on data growth, and at the same time make sure there is no surge in the growth of unwanted objects or folders. Typically you would be worried about data growth at the GB level, hence below is a script which you could use to report your current folder sizes in GB. Anything below 1 GB will be shown as 0. This is a simple script; you can modify it to track MB-level details as well by changing the 1024**3 divisor.

sudo -u hdfs hadoop fs -du /app/hadoop/hive/warehouse/ | awk '/^[0-9]+/ { print int($1/(1024**3)) " [GB]\t" $2 }'
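For example, the MB-level variant mentioned above only changes the divisor from 1024**3 to 1024**2:

sudo -u hdfs hadoop fs -du /app/hadoop/hive/warehouse/ | awk '/^[0-9]+/ { print int($1/(1024**2)) " [MB]\t" $2 }'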

Hope This Helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

Posted in Big Data | Tagged: , , , , , , , , , | Leave a Comment »

Behind The Scene Of MapReduce Job

Posted by sranka on October 28, 2013

Hi All

Recently I have been spending most of my time on Big Data projects using CDH 4.x. Understanding the key components of the Hadoop infrastructure is necessary, but MapReduce (MR) is the most important one for processing and aggregating data. To get the best performance, one needs to know the details of a MapReduce job. After reading several white papers and a few books, in my opinion the paragraphs below summarize MapReduce THE BEST !!!!!

All About Map Reduce 

The execution of a MapReduce job is broken down into map tasks and reduce tasks. Subsequently, map task execution is divided into the phases: Read (reading map inputs), Map (map function processing), Collect (serializing to buffer and partitioning), Spill (sorting, combining, compressing, and writing map outputs to local disk), and Merge (merging sorted spill files). Reduce task execution is divided into the phases: Shuffle (transferring map outputs to reduce tasks, with decompression if needed), Merge (merging sorted map outputs), Reduce (reduce function processing), and Write (writing reduce outputs to the distributed file system). Each phase represents an important part of the job's overall execution in Hadoop.

In the MapReduce model, computation is divided into a map function and a reduce function. The map function takes a key/value pair and produces one or more intermediate key/value pairs. The reduce function then takes these intermediate key/value pairs and merges all values corresponding to a single key. The map function can run independently on each key/value pair, exposing enormous amounts of parallelism. Similarly, the reduce function can run independently on each intermediate key, also exposing significant parallelism. In Hadoop, a centralized JobTracker service is responsible for splitting the input data into pieces for processing by independent map and reduce tasks, scheduling each task on a cluster node for execution, and recovering from failures by re-running tasks. On each node, a TaskTracker service runs MapReduce tasks and periodically contacts the JobTracker to report task completions and request new tasks. By default, when a new task is received, a new JVM instance will be spawned to execute it.
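To make the model concrete, below is the canonical Hadoop Streaming example, in which the map and reduce functions are ordinary Unix commands; the streaming jar location and the HDFS paths are assumptions for illustration:

# each map task runs /bin/cat over its own input split and emits the lines as keys;
# the framework sorts the intermediate output and the reduce task runs /usr/bin/wc over it
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input /user/sranka/wordcount/input \
  -output /user/sranka/wordcount/output \
  -mapper /bin/cat \
  -reducer /usr/bin/wc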

The above text is taken from:

The Hadoop Distributed Filesystem: Balancing Portability and Performance, by Jeffrey Shafer, Scott Rixner, and Alan L. Cox, Rice University

Technical Report : Hadoop Performance Models By Herodotos Herodotou 

 

Hope This helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

Posted in Big Data, sunil s ranka | Tagged: , , , , , , , , | Leave a Comment »

How To Run Hadoop Benchmarking TestDFSIO on Cloudera Clusters

Posted by sranka on October 9, 2013

Hi All

Out of the box, Hadoop provides a benchmarking mechanism for your cluster. Getting it to run on a Cloudera cluster was a bit of a ride, so I thought I would share the steps to reduce the pain and increase the fun.

Before you begin, set HADOOP_HOME. The command below works for RHEL:

export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop/

For CDH, TestDFSIO resides in hadoop-mapreduce-client-jobclient-<version>-cdh<version>-tests.jar under lib/hadoop-mapreduce/ in the Cloudera home directory, in my case:

/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-cdh4.3.0-tests.jar

You will need to run the write and read benchmarks as below:

hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-cdh4.3.0-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000

hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-cdh4.3.0-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1000

Once you run the tests you will see a TestDFSIO_results.log file in the current directory. Its contents will look like the following:

----- TestDFSIO ----- : write
 Date & time: Wed Oct 09 14:56:14 PDT 2013
 Number of files: 10
Total MBytes processed: 10000.0
 Throughput mb/sec: 5.382930941302368
Average IO rate mb/sec: 5.390388488769531
 IO rate std deviation: 0.20763769922620628
 Test exec time sec: 211.457

----- TestDFSIO ----- : read
 Date & time: Wed Oct 09 14:57:47 PDT 2013
 Number of files: 10
Total MBytes processed: 10000.0
 Throughput mb/sec: 48.88230607167124
Average IO rate mb/sec: 49.50707244873047
 IO rate std deviation: 5.8465670196729596
 Test exec time sec: 39.954

Based on the numbers above, the aggregate read and write throughput across the cluster would be:

Total Read Throughput Across the Cluster (Number of files * Throughput mb/sec) = 488.8 MB/sec
Total Write Throughput Across the Cluster (Number of files * Throughput mb/sec) = 53.82 MB/sec
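Once you are done benchmarking, the data that TestDFSIO leaves behind on HDFS (by default under /benchmarks/TestDFSIO) can be removed with the same jar using the -clean switch:

hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-cdh4.3.0-tests.jar TestDFSIO -clean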

Hope This helps

Happy Benchmarking !!!

Sunil S Ranka
“Superior BI is the antidote to Business Failure”

Posted in Uncategorized | Tagged: , , , , , , , , , , , | 1 Comment »

Permission Denied Exception When Running a count(*) Query on Hive/Impala

Posted by sranka on October 8, 2013

Hi All

While running a simple query:

beeline > select count(*) from samvi_test_table;

I got the following error:

Job Submission failed with exception 'org.apache.hadoop.security.AccessControlException(Permission denied: user=root, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x)'
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:224)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:204)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:149)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:4716)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:4698)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:4672)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:3035)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:2999)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:2980)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:648)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:419)

After debugging into the details, I found that the count(*) query launches a MapReduce job, which needs to write to a staging area under the user's HDFS home directory (/user/root here). Since that directory did not exist and the root user did not have write access to /user, I was getting the error. The issue was resolved as below:


sudo -u hdfs hadoop fs -mkdir /user/root
sudo -u hdfs hadoop fs -chown root:root /user/root
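To confirm the fix, check that the new home directory exists and is owned by root before re-running the query:

sudo -u hdfs hadoop fs -ls /user | grep root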

Hope this helps.

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

Posted in Uncategorized | Tagged: , , , , , , , , , , | Leave a Comment »

 