Sunil S. Ranka's Weblog

Superior Data Analytics is the antidote to Business Failure

Archive for the ‘Big Data’ Category

More Animals in Big Data Zoo – Big Data Landscape for 2016

Posted by sranka on March 26, 2016

Hi All

While surfing the net I stumbled upon the Big Data Landscape for 2016 image, and it was very impressive to see so many new animals in the Big Data zoo.

 

New Animals

Hope This Helps

Sunil S Ranka

Posted in Big Data, Hadoop, Me

Permissions for both HDFS and local fileSystem paths

Posted by sranka on July 18, 2014

Hi All,

Permission issues are one of the key errors while setting up a Hadoop cluster. While debugging one such error I found the table below on http://hadoop.apache.org/. It's a good scorecard to keep handy.

 

Permissions for both HDFS and local fileSystem paths

The following table lists various paths on HDFS and local filesystems (on all nodes) and recommended permissions:

Filesystem   Path                                          User:Group      Permissions
local        dfs.namenode.name.dir                         hdfs:hadoop     drwx------
local        dfs.datanode.data.dir                         hdfs:hadoop     drwx------
local        $HADOOP_LOG_DIR                               hdfs:hadoop     drwxrwxr-x
local        $YARN_LOG_DIR                                 yarn:hadoop     drwxrwxr-x
local        yarn.nodemanager.local-dirs                   yarn:hadoop     drwxr-xr-x
local        yarn.nodemanager.log-dirs                     yarn:hadoop     drwxr-xr-x
local        container-executor                            root:hadoop     --Sr-s---
local        conf/container-executor.cfg                   root:hadoop     r--------
hdfs         /                                             hdfs:hadoop     drwxr-xr-x
hdfs         /tmp                                          hdfs:hadoop     drwxrwxrwxt
hdfs         /user                                         hdfs:hadoop     drwxr-xr-x
hdfs         yarn.nodemanager.remote-app-log-dir           yarn:hadoop     drwxrwxrwxt
hdfs         mapreduce.jobhistory.intermediate-done-dir    mapred:hadoop   drwxrwxrwxt
hdfs         mapreduce.jobhistory.done-dir                 mapred:hadoop   drwxr-x---
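
As a quick illustration of applying a couple of these entries from the shell (a minimal sketch; the local directory path below is a placeholder, so substitute the actual value of dfs.namenode.name.dir and friends from your own configuration):

# Local namenode metadata directory (placeholder path): hdfs:hadoop, mode 0700 (drwx------)
sudo chown -R hdfs:hadoop /data/1/dfs/nn
sudo chmod 700 /data/1/dfs/nn

# HDFS /tmp: hdfs:hadoop, mode 1777 (drwxrwxrwxt, sticky bit so users can only remove their own files)
sudo -u hdfs hadoop fs -chown hdfs:hadoop /tmp
sudo -u hdfs hadoop fs -chmod 1777 /tmp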

Hope this helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

This table was taken directly from http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-common/SecureMode.html

 

Posted in 11g, Big Data

Need for Defining Reference Architecture For Big Data

Posted by sranka on May 7, 2014

Hi Fellow Big Data Admirers ,

With big data and analytics playing an influential role in helping organizations achieve a competitive advantage, IT managers are advised not to deploy big data in silos but instead to take a holistic approach and define a base reference architecture even before contemplating which tools to position.

My latest print-media article (the 5th in the series) for CIO magazine (ITNEXT) talks extensively about the need for a reference architecture for Big Data.

Click Here For : Need For Defining Big Data Reference Architecture

 

Hope you enjoy reading it.

Hope this helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

For copy of May 2014 IT Next Magazine please visit http://www.itnext.in/digital_assets/330/IT-NEXT-Vol-05-Issue-04-May-2014.pdf ( My Article is on Page 37 )

 

 

Posted in Big Data

How to find out a table type in Hive Metastore.

Posted by sranka on April 10, 2014

Hi All

As the Hive metastore becomes the central nervous system for different SQL engines such as Shark and Impala, it is getting equally difficult to distinguish the type of table created in the metastore. For example, if you create an Impala table using the impala-shell, you will see the same table at the Hive prompt, and vice versa. See the example below.

 

Step 1 : “Create Table” in Impala Shell and “Show Table” On HIVE Shell


[samvi.saarth.dev.com:21000] > create table impala_table ( id bigint);

[samvi.saarth.dev.com:21000] > show tables 'impala_table';

Query: show tables 'impala_table'
Query finished, fetching results ...
+--------------+
| name         |
+--------------+
| impala_table |
+--------------+
Returned 1 row(s) in 0.01s

hive> show tables 'impala_table';
OK
impala_table
Time taken: 0.073 seconds

Step 2 : “Create Table” in Hive Shell and “Show Table” On Impala Shell

hive> create table hive_table ( id bigint);
OK
Time taken: 0.058 seconds

Step 3 : Invalidate Metadata on Impala Shell ( This may not be needed always )


[samvi.saarth.dev.com:21000] > invalidate metadata;
Query: invalidate metadata
Query finished, fetching results ...

Returned 0 row(s) in 5.11s

Step 4 : “Show Table” On Impala Shell

 

[samvi.saarth.dev.com:21000] > show tables 'hive_table';
Query: show tables 'hive_table'
Query finished, fetching results ...
+------------+
| name       |
+------------+
| hive_table |
+------------+
Returned 1 row(s) in 0.01s

In short, this proves that tables are visible in both shells. Use the describe formatted <table name> command to find out the details: under Storage Desc Params, a Hive-created table will show a "serialization.format" value, whereas an Impala-created table will not.

 

hive> describe formatted hive_table;
OK
# col_name              data_type               comment

id                      bigint                  None

# Detailed Table Information
Database:               default
Owner:                  rsunil
CreateTime:             Thu Apr 10 13:13:09 PDT 2014
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://samvi.saarth.dev.com:8020/app/hadoop/hive/warehouse/hive_table
Table Type:             MANAGED_TABLE
Table Parameters:
transient_lastDdlTime   1397160789

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:

serialization.format    1

Time taken: 0.115 seconds

 




hive> describe formatted impala_table;
OK
# col_name data_type comment

id bigint None

# Detailed Table Information
Database: default
Owner: rsunil
CreateTime: Thu Apr 10 13:10:30 PDT 2014
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://samvi.saarth.dev.com:8020/app/hadoop/hive/warehouse/impala_table
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1397160630

# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: 0
Bucket Columns: []
Sort Columns: []
Time taken: 0.185 seconds

 

 

Tables created in Impala with the Parquet file format will throw the class-not-found exception below when described from Hive:

hive> describe formatted parquet_ob_mdm_et28;
FAILED: RuntimeException java.lang.ClassNotFoundException: com.cloudera.impala.hive.serde.ParquetInputFormat
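
Building on the serialization.format rule above, a small scripted check can tell the two apart without reading the whole output. This is only a sketch, and it assumes a non-Parquet table, since describe formatted fails for Impala Parquet tables as shown above:

# Prints 1 for a Hive-created table (serialization.format present), 0 for an Impala-created one.
hive -e "describe formatted hive_table" 2>/dev/null | grep -c "serialization.format"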

Hope this helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

Posted in Big Data | Tagged: , , , , , , , , | 1 Comment »

How To Create External Hive Table on HBase

Posted by sranka on March 28, 2014

Hi All,

While building a data flow to replace one of the EDW workflows using the Big Data technology stack, I came across some interesting findings and issues. Due to the UPSERT nature of the data (INSERT new records or UPDATE existing ones, depending on the case) we had to use HBase, but to expose the outbound feed we needed to do some calculations on HBase and publish the result to Hive as an external table. Conceptually, creating an external Hive table on HBase is simple, but I had to jump through a few hoops.

 


Table Creation in hbase
hbase(main):002:0> create 'mytable', 'cf'
hbase(main):004:0> put 'mytable', 'first', 'cf:message', 'hello HBase'
hbase(main):005:0> put 'mytable', 'second', 'cf:foo', 0x0
0 row(s) in 0.0130 seconds
hbase(main):006:0> put 'mytable', 'third', 'cf:bar', 3.14159
0 row(s) in 0.0080 second

hbase(main):002:0> describe 'mytable'
DESCRIPTION ENABLED
'mytable', {NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICA true
TION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL =>
'2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false'
, ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}
1 row(s) in 0.9610 seconds

hbase(main):003:0> scan 'mytable'
ROW COLUMN+CELL
first column=cf:foo, timestamp=1395167684857, value=0
first column=cf:message, timestamp=1395167407496, value=hello HBase
second column=cf:foo, timestamp=1395167483988, value=0
third column=cf:bar, timestamp=1395167493639, value=3.14159
3 row(s) in 0.0760 seconds

Table Creation in Hive


Hive >  CREATE EXTERNAL TABLE hbase_table_3(key string, value string,value1 string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:foo,cf:message") TBLPROPERTIES ("hbase.table.name" = "mytable");

Table Access in Hive

hive> set hbase.client.scanner.caching=50000;
hive> desc hbase_table_3;
OK
key     string  from deserializer
value   string  from deserializer
value1  string  from deserializer
Time taken: 0.428 seconds
hive>
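
As a quick sanity check (illustrative only; the rows returned depend on what was put into 'mytable' above), the external table can be queried like any other Hive table:

# key maps to :key, value to cf:foo and value1 to cf:message per the column mapping above.
hive -e "select key, value, value1 from hbase_table_3 limit 10;"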

Zookeeper, Aux Path and hbase.client.scanner.caching

ZooKeeper is an important part of the Hadoop ecosystem; it provides the distributed coordination service that HBase depends on. You need to make sure it runs a quorum with an odd number (1, 3, 5) of instances, and the ZooKeeper services must be up and running before you can access the external table (a quick quorum health check is sketched after the configuration snippets below). Along with ZooKeeper, you will need to make a few configuration changes.

If you have a big HBase table, you will need to set the hbase.client.scanner.caching property to a higher value before running the query. In our case we use 50000.

See below changes needed to hive-site.xml and hbase-site.xml.

Changes in hive-site.xml

<property>
  <name>hive.zookeeper.quorum</name>
   <value>devapphdp08.samvi.com,devapphdp09.samvi.com,devapphdp07.samvi.com</value>
</property>
<property>
 <name>hive.aux.jars.path</name>
 <value>file:///usr/lib/hive/lib/hive-hbase-handler-0.10.0-cdh4.6.0.jar,file:///usr/lib/hive/lib/hbase.jar,file:///usr/lib/zookeeper/zookeeper.jar</value>
</property>

 

Changes in hbase-site.xml

<property>
  <name>hbase.zookeeper.quorum</name>
   <value>devapphdp08.samvi.com,devapphdp09.samvi.com,devapphdp07.samvi.com</value>
</property>
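
Before querying the external table it can also help to confirm that the quorum members listed above are actually reachable. A minimal check using ZooKeeper's 'ruok' four-letter command (2181 is the default client port; adjust if yours differs):

# A healthy ZooKeeper server answers 'imok' to 'ruok'.
for zk in devapphdp07.samvi.com devapphdp08.samvi.com devapphdp09.samvi.com; do
  printf '%s: ' "$zk"
  echo ruok | nc -w 2 "$zk" 2181
  echo
done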

Linux Performance Tuning

The following commands have helped enhance performance by dropping the kernel's filesystem caches (run as root):

echo 1 > /proc/sys/vm/drop_caches   # free the page cache
echo 2 > /proc/sys/vm/drop_caches   # free dentries and inodes
echo 3 > /proc/sys/vm/drop_caches   # free page cache, dentries and inodes

 

Special Thanks to Aditi Hedge, Rathinavel Sivaswamy and Anurag Gupta for their inputs.

Hope this helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

Posted in Big Data, HBase

How To Find Size Of Table In Hive / HDFS

Posted by sranka on November 19, 2013

Hi All

With volume being the constant challenge in Big Data, as an administrator you have to keep a tab on data growth, and at the same time make sure there is no sudden growth of unwanted objects or folders. Typically you care about data growth at the GB level, so below is a script you can use to translate your current folder sizes to GB; anything below a GB is shown as 0. It is a simple script, and you can modify it to track MB-level details as well by changing the 1024-based divisor (see the MB variant after the script).

sudo -u hdfs hadoop fs -du /app/hadoop/hive/warehouse/ | awk '/^[0-9]+/ { print int($1/(1024**3)) " [GB]\t" $2 }'
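
For MB-level detail, the same one-liner works with the divisor changed from 1024**3 to 1024**2 (shown on the same warehouse path):

sudo -u hdfs hadoop fs -du /app/hadoop/hive/warehouse/ | awk '/^[0-9]+/ { print int($1/(1024**2)) " [MB]\t" $2 }'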

Hope This Helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

Posted in Big Data

Behind The Scene Of MapReduce Job

Posted by sranka on October 28, 2013

Hi All

Recently I have been spending most of my time on Big Data projects using CDH 4.x. Understanding the key components of the Hadoop infrastructure is essential, and MapReduce (MR) is the most important of them for processing and aggregating data. To get the best performance, one needs to know the details of a MapReduce job. After reading several white papers and a few books, in my opinion the paragraphs below summarize MapReduce the best.

All About Map Reduce 

The execution of a MapReduce job is broken down into map tasks and reduce tasks. Subsequently, map task execution is divided into the phases: Read (reading map inputs), Map (map function processing), Collect (serializing to buffer and partitioning), Spill (sorting, combining, compressing, and writing map outputs to local disk), and Merge (merging sorted spill files). Reduce task execution is divided into the phases: Shuffle (transferring map outputs to reduce tasks, with decompression if needed), Merge (merging sorted map outputs), Reduce (reduce function processing), and Write (writing reduce outputs to the distributed filesystem). Each phase represents an important part of the job's overall execution in Hadoop.

In the MapReduce model, computation is divided into a map function and a reduce function. The map function takes a key/value pair and produces one or more intermediate key/value pairs. The reduce function then takes these intermediate key/value pairs and merges all values corresponding to a single key. The map function can run independently on each key/value pair, exposing enormous amounts of parallelism. Similarly, the reduce function can run independently on each intermediate key, also exposing significant parallelism. In Hadoop, a centralized JobTracker service is responsible for splitting the input data into pieces for processing by independent map and reduce tasks, scheduling each task on a cluster node for execution, and recovering from failures by re-running tasks. On each node, a TaskTracker service runs MapReduce tasks and periodically contacts the JobTracker to report task completions and request new tasks. By default, when a new task is received, a new JVM instance will be spawned to execute it.

The above text is taken from:

The Hadoop Distributed Filesystem: Balancing Portability and Performance by Jeffrey Shafer, Scott Rixner, and Alan L. Cox :Rice University

Technical Report : Hadoop Performance Models By Herodotos Herodotou 
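
To make the map/reduce split described above concrete, here is a minimal word-count sketch using Hadoop Streaming; the streaming jar location and the HDFS paths are assumptions for a CDH 4.x layout, so adjust them to your cluster:

# Mapper emits one word per line; the shuffle sorts the words; 'uniq -c' counts each run of identical words.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input  /user/sranka/books \
  -output /user/sranka/wordcount_out \
  -mapper 'tr -s " " "\n"' \
  -reducer 'uniq -c'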

 

Hope This helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

Posted in Big Data, sunil s ranka

Hadoop Data Replication Strategy

Posted by sranka on October 17, 2013

Hi All

With replication and fault tolerance being inbuilt features of Hadoop, I was always curious to know how blocks are replicated. I got this information while reading "Hadoop: The Definitive Guide, 3rd Edition", in Chapter 3, "The Hadoop Distributed Filesystem", and thought it would be interesting to share.

  • How does the namenode choose which datanodes to store replicas on?

Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes on the cluster, although the system tries to avoid placing too many replicas on the same rack.

The entire text above has been taken from Chapter 3 of "Hadoop: The Definitive Guide, 3rd Edition".
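
To see this policy in action on your own cluster, you can ask for a given replication factor and then inspect where the replicas actually landed (the file path below is a placeholder):

# Request 3 replicas and wait for re-replication to finish.
hadoop fs -setrep -w 3 /user/sranka/data/sample.txt
# List each block of the file and the datanodes holding its replicas.
hdfs fsck /user/sranka/data/sample.txt -files -blocks -locations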

Hope This helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

Posted in Big Data, Hadoop

 