Sunil S. Ranka's Weblog

Superior BI is the antidote to Business Failure

How to find out a table type in Hive Metastore.

Posted by sranka on April 10, 2014

Hi All

The Hive metastore is becoming the central nervous system for different SQL engines such as Shark and Impala, and it is getting equally difficult to distinguish the type of a table registered in it. For example, if we create a table using the Impala shell, you will see the same table at the Hive prompt, and vice versa. See the example below.

 

Step 1 : “Create Table” in Impala Shell and “Show Table” On HIVE Shell


[samvi.saarth.dev.com:21000] > create table impala_table ( id bigint);

[samvi.saarth.dev.com:21000] > show tables 'impala_table';

Query: show tables 'impala_table'
Query finished, fetching results ...
+--------------+
| name         |
+--------------+
| impala_table |
+--------------+
Returned 1 row(s) in 0.01s

hive> show tables 'impala_table';
OK
impala_table
Time taken: 0.073 seconds

Step 2 : “Create Table” in Hive Shell and “Show Table” On Impala Shell

hive> create table hive_table ( id bigint);
OK
Time taken: 0.058 seconds

Step 3 : “Invalidate Metadata” On Impala Shell (this may not always be needed)


[samvi.saarth.dev.com:21000] > invalidate metadata;
Query: invalidate metadata
Query finished, fetching results ...

Returned 0 row(s) in 5.11s
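As a side note, if only one newly created table needs to be picked up, the statement can also be scoped to a single table in reasonably recent Impala releases. A sketch, to be verified against your Impala version:

impala-shell -i samvi.saarth.dev.com -q "invalidate metadata hive_table;"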

Step 4 : “Show Table” On Impala Shell

 

[samvi.saarth.dev.com:21000] > show tables 'hive_table';
Query: show tables 'hive_table'
Query finished, fetching results ...
+------------+
| name       |
+------------+
| hive_table |
+------------+
Returned 1 row(s) in 0.01s

In short, this proves that the tables are visible in both shells. Use the describe formatted <table name> command to find out the details: under Storage Desc Params, a Hive-created table will show a “serialization.format” value, whereas an Impala-created table will not.

 

hive> describe formatted hive_table;
OK
# col_name              data_type               comment

id                      bigint                  None

# Detailed Table Information
Database:               default
Owner:                  rsunil
CreateTime:             Thu Apr 10 13:13:09 PDT 2014
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://samvi.saarth.dev.com:8020/app/hadoop/hive/warehouse/hive_table
Table Type:             MANAGED_TABLE
Table Parameters:
transient_lastDdlTime   1397160789

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:

serialization.format    1

Time taken: 0.115 seconds

 




hive> describe formatted impala_table;
OK
# col_name              data_type               comment

id                      bigint                  None

# Detailed Table Information
Database:               default
Owner:                  rsunil
CreateTime:             Thu Apr 10 13:10:30 PDT 2014
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://samvi.saarth.dev.com:8020/app/hadoop/hive/warehouse/impala_table
Table Type:             MANAGED_TABLE
Table Parameters:
transient_lastDdlTime   1397160630

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            0
Bucket Columns:         []
Sort Columns:           []
Time taken: 0.185 seconds
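Based on the difference above, a quick command-line check is to grep the describe formatted output for serialization.format (a rough sketch that relies only on the behaviour shown above, so treat it as a heuristic):

hive -e "describe formatted impala_table;" | grep serialization.format || echo "no serialization.format - table was likely created outside Hive"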

 

 

Tables created in Impala with the Parquet format will throw the class-not-found exception below when described from Hive:

hive> describe formatted parquet_ob_mdm_et28;
FAILED: RuntimeException java.lang.ClassNotFoundException: com.cloudera.impala.hive.serde.ParquetInputFormat

Hope this helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

Posted in Big Data

How To Create External Hive Table on HBase

Posted by sranka on March 28, 2014

Hi All,

While building a data flow to replace one of the EDW workflows with the Big Data technology stack, I came across some interesting findings and issues. Due to the UPSERT nature of the data (INSERT new records or UPDATE existing ones), we had to use HBase, but to expose the outbound feed we needed to do some calculation on HBase and publish the result to Hive as an external table. Conceptually, creating an external Hive table on HBase is simple, but I had to jump through some hoops.

 


Table Creation in HBase
hbase(main):002:0> create 'mytable', 'cf'
hbase(main):004:0> put 'mytable', 'first', 'cf:message', 'hello HBase'
hbase(main):005:0> put 'mytable', 'second', 'cf:foo', 0x0
0 row(s) in 0.0130 seconds
hbase(main):006:0> put 'mytable', 'third', 'cf:bar', 3.14159
0 row(s) in 0.0080 second

hbase(main):002:0> describe 'mytable'
DESCRIPTION ENABLED
'mytable', {NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICA true
TION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL =>
'2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false'
, ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}
1 row(s) in 0.9610 seconds

hbase(main):003:0> scan 'mytable'
ROW COLUMN+CELL
first column=cf:foo, timestamp=1395167684857, value=0
first column=cf:message, timestamp=1395167407496, value=hello HBase
second column=cf:foo, timestamp=1395167483988, value=0
third column=cf:bar, timestamp=1395167493639, value=3.14159
3 row(s) in 0.0760 seconds

Table Creation in Hive


hive> CREATE EXTERNAL TABLE hbase_table_3 (key string, value string, value1 string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:foo,cf:message")
    > TBLPROPERTIES ("hbase.table.name" = "mytable");

Table Access in Hive

hive> set hbase.client.scanner.caching=50000;
hive> desc hbase_table_3;
OK
key     string  from deserializer
value   string  from deserializer
value1  string  from deserializer
Time taken: 0.428 seconds
hive>
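As a quick sanity check that the HBase rows actually surface through the mapped table, a sketch assuming the mapping above (key is the HBase row key, value maps to cf:foo and value1 to cf:message):

hive -e "SELECT key, value, value1 FROM hbase_table_3 LIMIT 10;"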

ZooKeeper, Aux Path and hbase.client.scanner.caching

ZooKeeper is an important part of the Hadoop ecosystem; it provides the distributed coordination services that HBase relies on. You need to make sure it has a quorum with an odd number (1, 3, 5) of instances. For accessing the external table, the ZooKeeper services must be up and running. Along with ZooKeeper, you will need to make a few configuration changes.

If you have a big HBase table, you will need to set a higher hbase.client.scanner.caching value before running the query. In our case we used 50000.

See the changes needed to hive-site.xml and hbase-site.xml below.

Changes in hive-site.xml

<property>
  <name>hive.zookeeper.quorum</name>
   <value>devapphdp08.samvi.com,devapphdp09.samvi.com,devapphdp07.samvi.com</value>
</property>
<property>
 <name>hive.aux.jars.path</name>
 <value>file:///usr/lib/hive/lib/hive-hbase-handler-0.10.0-cdh4.6.0.jar,file:///usr/lib/hive/lib/hbase.jar,file:///usr/lib/zookeeper/zookeeper.jar</value>
</property>
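Alternatively, the same jars can be passed on the command line when launching the Hive shell instead of setting hive.aux.jars.path (a sketch; jar names and versions will differ by distribution):

hive --auxpath /usr/lib/hive/lib/hive-hbase-handler-0.10.0-cdh4.6.0.jar,/usr/lib/hive/lib/hbase.jar,/usr/lib/zookeeper/zookeeper.jar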

 

Changes in hbase-site.xml

<property>
  <name>hbase.zookeeper.quorum</name>
   <value>devapphdp08.samvi.com,devapphdp09.samvi.com,devapphdp07.samvi.com</value>
</property>

 Linux Performance Tuning

The following commands, which drop the Linux page, dentry and inode caches, have helped enhance performance.

echo 1 > /proc/sys/vm/drop_caches   # free page cache
echo 2 > /proc/sys/vm/drop_caches   # free dentries and inodes
echo 3 > /proc/sys/vm/drop_caches   # free page cache, dentries and inodes
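One note of caution: drop_caches does not write back dirty pages, so it is generally recommended to flush them first, along the lines of this sketch:

sync && echo 3 > /proc/sys/vm/drop_caches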

 

Special Thanks to Aditi Hedge, Rathinavel Sivaswamy and Anurag Gupta for their inputs.

Hope this helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

Posted in Big Data, HBase

HBase : Correlation Between RegionServer and Region

Posted by sranka on March 21, 2014

Hi All

While looking into an HBase performance issue, one of the suggestions was to have more regions for a larger table. There was some confusion around “Region” vs “RegionServer”. While doing some digging, I found the simple explanation quoted below.

The basic unit of scalability and load balancing in HBase is called a region. Regions are essentially contiguous ranges of rows stored together. They are dynamically split by the system when they become too large. Alternatively, they may also be merged to reduce their number and required storage files.*

The HBase regions are equivalent to range partitions as used in database sharding. They can be spread across many physical servers, thus distributing the load, and therefore providing scalability

Initially there is only one region for a table, and as you start adding data to it, the system is monitoring it to ensure that you do not exceed a configured maximum size. If you exceed the limit, the region is split into two at the middle key—the row key in the middle of the region—creating two roughly equal halves.

Each region is served by exactly one region server, and each of these servers can serve many regions at any time. The logical view of a table is actually a set of regions hosted by many region servers.

The default split policy for HBase 0.94 and trunk is IncreasingToUpperBoundRegionSplitPolicy, which does more aggressive splitting based on the number of regions hosted in the same region server. The split policy uses the max store file size based on Min (R^2 * “hbase.hregion.memstore.flush.size”, “hbase.hregion.max.filesize”), where R is the number of regions of the same table hosted on the same regionserver. So for example, with the default memstore flush size of 128MB and the default max store size of 10GB, the first region on the region server will be split just after the first flush at 128MB. As number of regions hosted in the region server increases, it will use increasing split sizes: 512MB, 1152MB, 2GB, 3.2GB, 4.6GB, 6.2GB, etc. After reaching 9 regions, the split size will go beyond the configured “hbase.hregion.max.filesize”, at which point, 10GB split size will be used from then on. For both of these algorithms, regardless of when splitting occurs, the split point used is the rowkey that corresponds to the mid point in the “block index” for the largest store file in the largest store.

The above text has been taken from Chapter 1, “Introduction” (section “Building Blocks”) of the book “HBase: The Definitive Guide” and from the HortonWorks blog.

Hope This Helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

Posted in Uncategorized

HDFS Free Space Command

Posted by sranka on March 17, 2014

Hi All

With increasing data volume, space in HDFS can be a continuing challenge. While running into a space-related issue, the following commands came in very handy, hence I thought of sharing them with the extended virtual community.

At times it is challenging to know how much space a directory or a file is actually using. Having a command that gives you sizes in a human-readable format is always useful. The command below shows how to get human-readable file sizes on HDFS:

hdfs dfs -du -h /

241.3 G  /app
9.8 G    /benchmarks
309.6 G  /hbase
0        /system
59.6 G   /tmp
20.0 G   /user
[sranka@devHadoopSrvr06 ~]$
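To get a single rolled-up total for a directory rather than a per-child breakdown, the -s flag can be added (a sketch; adjust the path to your environment):

hdfs dfs -du -s -h /app/hadoop/hive/warehouse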

 

hadoop dfsadmin -report

After running the command, you get the result below: it covers all the nodes in the cluster and gives a detailed break-up of space available and space used.


Configured Capacity: 13965170479105 (12.70 TB)
Present Capacity: 4208469598208 (3.83 TB)
DFS Remaining: 2120881930240 (1.93 TB)
DFS Used: 2087587667968 (1.90 TB)
DFS Used%: 49.60%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 5 (5 total, 0 dead)

Live datanodes:
Name: 160.33.148.202:50010 (devHadoopSrvr08.ps.am.mycompany.com)
Hostname: devHadoopSrvr08.ps.am.mycompany.com
Rack: /default
Decommission Status : Normal
Configured Capacity: 2793034095821 (2.54 TB)
DFS Used: 381953257472 (355.72 GB)
Non DFS Used: 1986904386765 (1.81 TB)
DFS Remaining: 424176451584 (395.05 GB)
DFS Used%: 13.68%
DFS Remaining%: 15.19%
Last contact: Mon Mar 17 12:43:05 PDT 2014

Name: 160.33.148.204:50010 (devHadoopSrvr10.ps.am.mycompany.com)
Hostname: devHadoopSrvr10.ps.am.mycompany.com
Rack: /default
Decommission Status : Normal
Configured Capacity: 2793034095821 (2.54 TB)
DFS Used: 402465816576 (374.83 GB)
Non DFS Used: 1966391827661 (1.79 TB)
DFS Remaining: 424176451584 (395.05 GB)
DFS Used%: 14.41%
DFS Remaining%: 15.19%
Last contact: Mon Mar 17 12:43:05 PDT 2014

Name: 160.33.148.203:50010 (devHadoopSrvr09.ps.am.mycompany.com)
Hostname: devHadoopSrvr09.ps.am.mycompany.com
Rack: /default
Decommission Status : Normal
Configured Capacity: 2793034095821 (2.54 TB)
DFS Used: 391020421120 (364.17 GB)
Non DFS Used: 1977837223117 (1.80 TB)
DFS Remaining: 424176451584 (395.05 GB)
DFS Used%: 14.00%
DFS Remaining%: 15.19%
Last contact: Mon Mar 17 12:43:06 PDT 2014

Name: 160.33.148.201:50010 (devHadoopSrvr07.ps.am.mycompany.com)
Hostname: devHadoopSrvr07.ps.am.mycompany.com
Rack: /default
Decommission Status : Normal
Configured Capacity: 2793034095821 (2.54 TB)
DFS Used: 389182472192 (362.45 GB)
Non DFS Used: 1979675172045 (1.80 TB)
DFS Remaining: 424176451584 (395.05 GB)
DFS Used%: 13.93%
DFS Remaining%: 15.19%
Last contact: Mon Mar 17 12:43:04 PDT 2014

Name: 160.33.148.59:50010 (devHadoopSrvr06.ps.am.mycompany.com)
Hostname: devHadoopSrvr06.ps.am.mycompany.com
Rack: /default
Decommission Status : Normal
Configured Capacity: 2793034095821 (2.54 TB)
DFS Used: 522965700608 (487.05 GB)
Non DFS Used: 1845892140237 (1.68 TB)
DFS Remaining: 424176254976 (395.04 GB)
DFS Used%: 18.72%
DFS Remaining%: 15.19%
Last contact: Mon Mar 17 12:43:05 PDT 2014

Hope This Helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

Posted in Uncategorized

Big Data : Right Approach Right Solution

Posted by sranka on February 1, 2014

Hi All,

For the past few months I have been meeting with clients and discussing their potential need for Big Data. The discussion gets to the bottom of one question: do they really need Big Data? The link below to my ITNext article talks about how, as big data gets bigger, IT managers are challenged with the task of identifying data that qualifies as “big” and finding appropriate solutions to process it.

Click Here To Read Full Article  : Right Approach, Right Solution 

Hope This Helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

Posted in Uncategorized

Open Source Big Data Technologies

Posted by sranka on January 29, 2014

Hi All

While doing a comparative analysis for building a Big Data reference architecture, I stumbled on a very impressive open-source Big Data technology mashup, thanks to http://www.bigdata-startups.com/. The most impressive part of this mashup is that it breaks the whole Big Data operational paradigm into multiple stages and lists the available open-source technologies for each stage.

Open Source Big Data Technologies

Hope This Helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

Posted in Uncategorized

How To Run Graph In Endeca Outside of Endeca Server

Posted by sranka on January 15, 2014

Hi All

Recently at one of my clients, we had a situation where Hadoop was taking a lot longer than anticipated to generate a file. The Endeca graph needed the file as an input, but since the file was not generated on time, the graph was picking up a partially created file, causing data issues. After looking into the issue, the best bet was to introduce a task dependency. We looked into running the CloverETL graph from the command line, but due to a product limitation we were not able to do so.

After discussing with Chris Lynskey from Oracle (original Endeca team), I found that the following simpleHttpApi call could work:

http://<server>:<Port>/clover/simpleHttpApi/graph_run?sandbox=<sandbox>&graphID=<graphName>&nodeID=node01&verbose=MESSAGE

Parameter: graphID
Description: Text ID, unique within the specified sandbox; a file path relative to the sandbox root,
e.g. graph%2FLoadViewDefinitions.grf (the “/” needs to be encoded as “%2F”).
Mandatory: Yes

Parameter: sandbox
Description: Sandbox code
Mandatory: Yes

Parameter: nodeID
Description: In cluster mode, the ID of the node that should execute the job. However, it is not final: if the graph is distributed, or the node is disconnected, the graph may be executed on another node.
Mandatory: No

Parameter: verbose
Description: MESSAGE | FULL
Mandatory: No
Default: MESSAGE
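Putting the parameters together, an invocation from the command line might look like the sketch below; the host, port, credentials, sandbox and graph names are all placeholders, and the server will normally ask for Integration Server credentials:

curl -u cloverUser:cloverPassword "http://eid-server:8080/clover/simpleHttpApi/graph_run?sandbox=mySandbox&graphID=graph%2FLoadViewDefinitions.grf&verbose=MESSAGE"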

For more HTTP API details, please refer to the Latitude Data Integrator Server Guide.

Hope this helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

Posted in Uncategorized

How To retrieve/backup Views In Endeca

Posted by sranka on January 3, 2014

Hi All,

For the last few weeks I have been engaged with a customer, helping them with the remediation of an Endeca project. During remediation we faced a typical challenge, where all the graphs and EQLs were erroring out. After doing some research, I found out that it is a known issue. I spent a good amount of time on it, hence thought of sharing this trivial but useful information.

Issue :

Endeca views get lost during development, causing all the dependent graphs and EQLs to error out.

Root Cause :

Unknown (could be a potential product issue).

Solution :

After adding all the view definitions, run the Export View Definition graph provided with the Endeca examples; see the link below for details.

https://wikis.oracle.com/display/endecainformationdiscovery/EID+3.0+Export+View+Configuration

After running the export graph, the view definitions get stored in an XML file in the view-manager directory under config-in; see the picture below:

[Image: view-manager directory under config-in]

Take a backup of the file and store it in a secure location, or put it under version control. In case of a view-definition loss, go to the Integration Server URL and click through the following:

SandBox –> Project Name –>Config-in –> View-manager –> viewDefinition.xml –> fileEditor

[Image: viewDefinition.xml in the Integration Server file editor]

Copy and paste the content and hit the UpdateFile button. Once you have clicked UpdateFile, run the Import View Definition graph; see the link below.

https://wikis.oracle.com/display/endecainformationdiscovery/EID+3.0+Import+View+Configuration

Hope this helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

Posted in Uncategorized

Bye Bye 2013 !!! Year of Big Data

Posted by sranka on January 3, 2014

Hi All,

Wishing all you readers a very happy new year! 2013 is over and the dawn of 2014 has arrived. It feels like just yesterday, and now we are sitting here waiting for the year number to change. By the time I am writing this blog, Australia, Mumbai and Dubai have already seen the dawn of the new year. I am hoping to finish this writing before the dawn reaches New York.

2013: A Year Of Awareness and Adoption

What a year!!!! A year of Big Data adoption, cloud BI, and crunching more and more data. The year started with clients talking more about a cohesive architecture of traditional BI and Big Data. I talked with many clients, helping them understand the need for Big Data and qualifying or disqualifying their Big Data use cases.

From Strategy To Real World Use

Now that business analytics are here and enterprises are grappling with their own Big Data, it’s time to set some technical strategies in motion to harness these assets. Fortunately, solutions for the data center that can deliver both high performance computing (HPC) and Big Data analytics are becoming increasingly scalable and affordable, even for medium-size businesses. With the availability of such solutions, Big Data is getting easier and easier, with more affordable options.

OOW 2013, year of America’s Cup, Cloud and Big Data !!!!

Open World was like a saga: Larry arranged a big red carpet with a wide-screen TV and a live telecast of the America’s Cup. I have never enjoyed a sport in this setting, a sport about which most of the viewers had no clue, yet they were glued to the wide screen, cheering for Larry’s team. Larry did it again; by the time the results were announced, everyone knew about the sport. In my memory, OOW 2013 will be remembered as the America’s Cup venue and as a “NO SHOW” from Larry. For me, listening to Larry live is the highlight of OOW, but luckily I caught him on Sunday. Hats off to Thomas Kurian for jumping in live and showcasing the demo, which had no meaning for most of the audience, but I must say only LARRY ELLISON can do a no-show in front of thousands of people with the strongest media presence. He is still the most loved/hated technology person.

This year Open World was all about Cloud and Big Data; now one could hardly say that Oracle is not a cloud company, and the in-memory database was another talk of the town. Apart from a few features, there was not much in the BI space. The best feature, I felt, was the self-service option in Endeca. As usual, the big party at Treasure Island was fun.

[Photos from OOW 2013]

2014 will be a year of Big Data; it will be a natural progression for BI customers to implement Big Data.

2013 was a great year from a personal as well as a professional point of view. We had our share of joy and happiness; with multiple flights and many segments, I still managed to spend time with the kids. With a few regrets, I am looking forward to a positive 2014.

All in all, 2013 was a satisfactory year, and I am waiting for 2014. Thank you for all your continued support, and I look forward to the same in 2014.

Wishing Every one a Happy and Prosperous New Year, Be Safe!!!!!!

Thanks

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

Posted in Uncategorized

How To Find Size Of Table In Hive / HDFS

Posted by sranka on November 19, 2013

Hi All

With volume being the constant challenge on Big Data, as an administrator you have to keep tabs on data growth and, at the same time, make sure there is no sudden surge in the growth of unwanted objects or folders. Typically you will be worried about data growth at the GB level, hence below is a script you can use to translate your current folder sizes to GB. Anything below a GB is shown as 0. This is a simple script; you can modify it to track MB-level details as well by changing the 1024**3 divisor to 1024**2 (a variant is shown below the script).

sudo -u hdfs hadoop fs -du /app/hadoop/hive/warehouse/ | awk '/^[0-9]+/ { print int($1/(1024**3)) " [GB]\t" $2 }'
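For MB-level tracking, the same one-liner works with the divisor changed to 1024**2 (a sketch):

sudo -u hdfs hadoop fs -du /app/hadoop/hive/warehouse/ | awk '/^[0-9]+/ { print int($1/(1024**2)) " [MB]\t" $2 }'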

Hope This Helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

Posted in Big Data

 