Sunil S. Ranka's Weblog

Superior Data Analytics is the antidote to Business Failure

Machine Learning Algorithm Cheat Sheet

Posted by sranka on April 11, 2017

With so many algorithms around, it is always a struggle to figure out which algorithm is suitable for the problem statement I want to solve. Microsoft has done an amazing job of providing a starting point. Please find attached the Machine Learning Algorithm Cheat Sheet.

[Figure: Machine Learning Algorithm Cheat Sheet]

Hope This Helps

Sunil S Ranka

Posted in Big Data, sunil s ranka | Tagged: , , , , , | Leave a Comment »

Simplified Maze of Machine Learning Algorithms

Posted by sranka on November 12, 2016

Hi All

Finally, I found something that I had been looking for for a long time, and thought it would be very useful: a simplified maze of #MachineLearning #Algorithm. Feeling like Nirvana today. #MachinesWouldRuleWorld #AdvanceAnalytics

[Figure: Simplified maze of machine learning algorithms]

 

Hope this helps

Sunil S Ranka

“Superior Data Analytics is the antidote to Business Failure”

 

Posted in Uncategorized | Tagged: , , , | Leave a Comment »

What is Oracle Business Intelligence Cloud Service ( BICS )

Posted by sranka on August 11, 2016

Recently we have been getting lots of traction on BICS; existing OBIEE customers have been asking about it. In a nutshell:

BI Cloud Service enables organisations of all sizes to quickly and cost-effectively deploy business intelligence with the simplicity of the cloud.

Salient features of BICS:

  • No need of software installation
  • No need of software maintenance
  • No upfront costs, low monthly subscription
  • Customers can get started in hours
  • 100% cloud based
  • Robust reporting with interactive visuals, auto-suggestions, detailed formatting, export, and more
  • Powerful analytics platform with advanced calculations and analytic functions
  • Easy self-serve data loading
  • Rich data integration options
  • Mobile access with no extra programming required
  • Comprehensive sharing framework
  • Role-based fine grain security
  • Simple self-service administration

Key Benefits :

  • Fast access and low cost speed up time to value
  • Quick start means users are productive quickly
  • A single BI platform for all users helps consolidate analytic investments
  • Timely access to data means greater impact
  • Streamlined operations and reduced burden on IT

Summary :

Based on my past experience working with OBIEE on-premise, BICS is a good alternative for any IT organisation. BICS offers all the needed features of an on-premise installation, the flexibility of operation and management, and, most importantly, a low-cost solution. In the next few posts, I will describe the BICS tool and its features in more detail.

Hope this helps

Sunil S Ranka

“Superior Data Analytics is the antidote to Business Failure”

 

Posted in Uncategorized | Tagged: , , , , | Leave a Comment »

Cloud Allergy – Clouds Security and Changing Notion

Posted by sranka on June 30, 2016

In my recent role as CTO/Advisor at www.analytos.com, in most of my conversations with analytics leaders, everyone has been concerned about security. In a recent conversation with an entrepreneur friend, one of his solutions was stalled due to a SQL injection concern on the cloud (a valid concern, but is it really?).

During my recent startup stint, the term "cloud allergy" was coined, and it made sense: allergies do exist, you need to get past them, and you only need to worry about the life-threatening ones.

My Early Internet Days

I remember the year 1996, when I created my first email address, coolguy123@yahoo.com. Twenty years ago we were apprehensive about using our real names as part of an email address; now only hackers and late-night chat rooms create fake IDs. In 2001, when I got my first credit card (with a $500 credit limit), using it for online shopping was taboo; in fact, until mid-2005 I paid my PG&E bill in person at an authorized facility. The fear was of putting personal or financial information on the public internet.

Changing Notion

Come the year 2013 (within 15 years), using a credit card online is the norm, and giving a credit card number to a Comcast agent sitting overseas is trivial and a non-issue. With Facebook, WhatsApp, Snapchat, and many more social apps, we take pride in sharing personal and important moments with our "extended social families" (yes, I just coined a new term). Given Google's search data retention, I tell my customers that Google knows you better than your wife or partner does. Most of us back up our most important documents by emailing them to ourselves.

Most importantly, kp.org (Kaiser Permanente, a leading national HMO) holds all the personal information about your recent visits and vaccinations, and offers secure messaging through its portal.

With mobile banking, taking a photo of a cheque to deposit it is just another norm.

With this changing notion, we will get past the "cloud allergy" behaviour, and some of today's security questions and concerns will become trivial or non-issues.

Giant Cloud Providers and Security Capabilities

Look at the major public clouds: AWS, Google, and Microsoft Azure are able to attract more talented individuals than most small to mid-size companies. With cloud being their core focus, they have hundreds of brilliant minds dedicated to security. A company with a modest budget cannot match the level of expertise prominent cloud providers can invest in security. Just as fast deployment, lower costs, and rapid time to value have become assumed advantages of the cloud, security is reaching the same level of confidence.

Public clouds are at times much safer than internal networks (the Sony and Target hacks are the best examples we can all point to).

Trust in and adoption of cloud computing continues to grow despite persistent cloud-related security and compliance concerns. Such is the overarching takeaway of Intel Security’s recent report, “Blue Skies Ahead? The State of Cloud Adoption.” See more at http://www.baselinemag.com/cloud-computing/slideshows/cloud-deployments-grow-despite-security-concerns.html

Different Cloud Service Models:

Given the evolving nature of the cloud, understanding the relationships and dependencies between the different cloud service models is critical to understanding cloud computing security risks. IaaS is the foundation of all cloud services, with PaaS building upon IaaS, and SaaS, in turn, building upon PaaS.

** Infrastructure as a Service (IaaS) delivers computer infrastructure (typically a platform virtualization environment) as a service, along with raw storage and networking. Rather than purchasing servers, software, data-center space, or network equipment, clients buy those resources as a fully outsourced service.

** Software as a Service (SaaS), sometimes referred to as "on-demand software," is a software delivery model in which software and its associated data are hosted centrally (typically in the cloud) and are accessed by users through a thin client, normally a web browser over the Internet.

** Platform as a Service (PaaS) is the delivery of a computing platform and solution stack as a service. PaaS offerings facilitate deployment of applications without the cost and complexity of buying and managing the underlying hardware and software and provisioning hosting capabilities. This provides all of the facilities required to support the complete life cycle of building and delivering web applications and services entirely from the Internet.

** Definitions are taken from the internet.

** The figure below shows an example of how a cloud service mapping can be compared against a catalogue of compensating controls to determine which controls exist and which do not — as provided by the consumer, the cloud service provider, or a third party. This can, in turn, be compared to a compliance framework or set of requirements such as PCI DSS, as shown.

[Figure: cloud service mapping compared against a catalogue of compensating controls]

** Mapping the Cloud Model to the Security Control & Compliance Model

 

** Text and Figure Taken from CSA (Cloud Security Alliance).

 

Conclusion:

Customers need to be made aware of what they are considering moving to the cloud. Not every dataset moved to the cloud needs the same level of security. For a low-criticality dataset, lower security can be used. A high-value dataset subject to audit and compliance might entail audit and data retention requirements, while a high-value dataset with no regulatory compliance restrictions might need more technical security than data retention. In short, there will always be a place for every type of dataset in the cloud.

 

Posted in Uncategorized | Tagged: , , , , , , , , | Leave a Comment »

More Animals in Big Data Zoo – Big Data Landscape for 2016

Posted by sranka on March 26, 2016

Hi All

While surfing the net I stumbled upon the Big Data Landscape for 2016 image, and it was very impressive to see many more new animals in the Big Data zoo.

 

[Figure: Big Data Landscape 2016, with many new animals in the Big Data zoo]

Hope This Helps

Sunil S Ranka

Posted in Big Data, Hadoop, Me | Tagged: , , , | Leave a Comment »

Big Data – Tez, MR, Spark Execution Engine : Performance Comparison

Posted by sranka on February 25, 2016

There is no question that massive data is being generated in greater volumes than ever before. Alongside traditional data sets, new data sources such as sensors, application logs, IoT devices, and social networks are adding to data growth. Unlike traditional ETL platforms like Informatica, ODI, and DataStage, which are largely proprietary commercial products, the majority of big data ETL platforms are powered by open source.

With many execution engines, customers are always curious about their usage and performance.

To put this into perspective, in this post I run a set of queries against three key query engines, namely Tez, MapReduce, and Spark, to compare query execution timings.

create external table sensordata_csv
(
ts string,
deviceid int,
sensorid int,
val double
)
row format delimited
fields terminated by '|'
stored as textfile
location '/user/sranka/MachineData/sensordata'
;

drop table sensordata_part;

create table sensordata_part
(
deviceid int,
sensorid int,
val double
)
partitioned by (ts string)
clustered by (deviceid) sorted by (deviceid) into 10 buckets
stored as orc
;

"**********************************************"
"** 1) Baseline: Read a csv without Tez"
" set hive.execution.engine=mr"
" select count(*) from sensordata_csv where ts = '2014-01-01'"
"**********************************************"
2016-02-25 02:57:27,444 Stage-1 map = 0%,  reduce = 0%
2016-02-25 02:57:35,880 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.84 sec
2016-02-25 02:57:44,420 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.99 sec
MapReduce Total cumulative CPU time: 4 seconds 990 msec
Ended Job = job_1456183816302_0046
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 4.99 sec   HDFS Read: 3499156 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 990 msec
OK
16733
Time taken: 32.524 seconds, Fetched: 1 row(s)

"**********************************************"
"** 2) Read a csv with Tez"
" set hive.execution.engine=tez"
" select count(*) from sensordata_csv where ts = '2014-01-01'"
"**********************************************"
Total jobs = 1
Launching Job 1 out of 1

Status: Running (application id: application_1456183816302_0047)

Map 1: -/-    Reducer 2: 0/1
Map 1: 0/1    Reducer 2: 0/1
Map 1: 0/1    Reducer 2: 0/1
Map 1: 0/1    Reducer 2: 0/1
Map 1: 1/1    Reducer 2: 0/1
Map 1: 1/1    Reducer 2: 1/1
Status: Finished successfully
OK
16733
Time taken: 16.905 seconds, Fetched: 1 row(s)

"**********************************************"
"** 3) Read a partition with Tez"
" select count(*) from sensordata_part where ts = '2014-01-01'"
"**********************************************"
Total jobs = 1
Launching Job 1 out of 1
Status: Running (application id: application_1456183816302_0047)

Map 1: -/-    Reducer 2: 0/1
Map 1: 0/2    Reducer 2: 0/1
Map 1: 1/2    Reducer 2: 0/1
Map 1: 2/2    Reducer 2: 0/1
Map 1: 2/2    Reducer 2: 1/1
Status: Finished successfully
OK
16733
Time taken: 6.503 seconds, Fetched: 1 row(s)

"**********************************************"
"** 4) Read a partition with Spark"
" select count(*) from sensordata_part where ts = '2014-01-01'"
"**********************************************"

Time taken: took 5.8 seconds

"**********************************************"
"** 5) Read a csv with Spark"
" select count(*) from sensordata_csv where ts = '2014-01-01'"
"**********************************************"
Time taken: took 4.5 seconds

Query 1: select count(*) from sensordata_csv where ts = '2014-01-01'

Query 2: select count(*) from sensordata_part where ts = '2014-01-01'

The table below summarizes the execution timings recorded above:

Query  Engine     Table            Time taken
1      MapReduce  sensordata_csv   32.5 sec
1      Tez        sensordata_csv   16.9 sec
2      Tez        sensordata_part   6.5 sec
2      Spark      sensordata_part   5.8 sec
1      Spark      sensordata_csv    4.5 sec
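Reading the timings off the runs above, the relative speedups are simple arithmetic; a small Python sketch using the measured wall-clock numbers:

```python
# Wall-clock timings measured in the runs above (seconds).
timings = {
    "MR, csv": 32.524,
    "Tez, csv": 16.905,
    "Tez, partitioned": 6.503,
    "Spark, partitioned": 5.8,
    "Spark, csv": 4.5,
}

baseline = timings["MR, csv"]
for engine, secs in timings.items():
    # Speedup relative to the MapReduce baseline on the raw csv.
    print(f"{engine:20s} {secs:7.2f}s  ({baseline / secs:.1f}x vs MR on csv)")
```

Tez roughly halves the csv scan time versus MapReduce, and the partitioned ORC table gives a further ~5x over the MR baseline.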

Conclusion: Which engine is right?

Spark, being an in-memory execution engine, comes out the clear winner, but in certain scenarios, especially the current one of running a query against a partitioned table, the Tez execution engine comes close to Spark.

That said, you cannot conclude that Spark will solve the "world hunger problem" of big ETL. Being a continuously evolving product, Spark has its own challenges when it comes to productionizing workloads, and the same holds true for Tez. MapReduce has been around the longest and has been the core of the Hadoop framework; for mission-critical workloads that are not time-bound, MR could be the best choice.

Hope This Helps

Sunil S Ranka

About Spark : http://spark.apache.org/

About MapReduce : https://en.wikipedia.org/wiki/MapReduce

About Tez : https://tez.apache.org/

Posted in Hadoop | Tagged: , , , , , , , , , , | 1 Comment »

Map Reduce: File compression and Processing cost

Posted by sranka on August 19, 2015

Recently while working with a customer we ran into an interesting situation concerning file compression and processing time. For a system like Hadoop, file compression has been always a good way to save on space, especially when Hadoop replicates the data multiple times.

All Hadoop compression algorithms exhibit a space/time trade-off: faster compression and decompression speeds usually come at the expense of space savings.  For more details about how compression is used, see https://documentation.altiscale.com/when-and-why-to-use-compression. There are many file compression formats, but below we only mention some of the commonly used compression methods in Hadoop. 

The type of compression plays an important role — the true power of MapReduce is realized when input can be split, and not all compression formats are splittable, resulting in an unexpected number of map tasks.

With a splittable format, the number of mappers corresponds to the number of block-sized chunks in which the file is stored, whereas with a non-splittable format a single map task processes all the blocks. Although the table below shows that LZO is not splittable, it is in fact possible to index LZO files so that performance is greatly improved. At Altiscale, our experience has shown that indexing LZO files makes them splittable and yields a performance gain. For more information, see https://documentation.altiscale.com/compressing-and-indexing-your-data-with-lzo on how to do it.
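The space/time trade-off is easy to observe even outside Hadoop. Below is a minimal, standalone Python sketch (not tied to any cluster; the pipe-delimited sample data is synthetic) comparing gzip, which is not splittable, against bzip2, which is: bzip2 typically produces smaller output but takes noticeably longer.

```python
import bz2
import gzip
import time

# Synthetic pipe-delimited sensor data; highly repetitive text, so both
# codecs compress it well.
data = b"2014-01-01|17|42|23.5\n" * 200_000

for name, compress in [("gzip (not splittable)", gzip.compress),
                       ("bzip2 (splittable)", bz2.compress)]:
    start = time.perf_counter()
    size = len(compress(data))
    elapsed = time.perf_counter() - start
    print(f"{name:22s}: {len(data)} -> {size} bytes in {elapsed:.3f}s")
```

Actual sizes and timings will vary with the input, but the shape of the trade-off matches the table below: the better-compressing, splittable codecs tend to cost more CPU.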

Format   Codec                                        Extension  Splittable  Hadoop  HDInsight
DEFLATE  org.apache.hadoop.io.compress.DefaultCodec  .deflate   N           Y       Y
Gzip     org.apache.hadoop.io.compress.GzipCodec     .gz        N           Y       Y
Bzip2    org.apache.hadoop.io.compress.BZip2Codec    .bz2       Y           Y       Y
LZO      com.hadoop.compression.lzo.LzopCodec        .lzo       N           Y       N
LZ4      org.apache.hadoop.io.compress.Lz4Codec      .lz4       N           Y       N
Snappy   org.apache.hadoop.io.compress.SnappyCodec   .snappy    N           Y       N

To measure the performance and processing cost of each compression option, we ran a simple wordcount example on Altiscale's in-house production cluster, following the steps at https://documentation.altiscale.com/wordcount-example. Although there are more detailed methodologies for measuring performance, we chose this example simply to demonstrate the benefit and performance trade-off of a splittable format. The following table shows the results.

#  File Name      Compression Option  Size (GB)  Comments                      # of Maps  # of Reducers  Processing Time
1  input.txt      n/a                 5.9        No compression                24         1              1 min 16 sec
2  input.txt.gz   default             1.46       Normal gzip compression       1          1              11 min 19 sec
3  input1.txt.gz  -1                  1.42       gzip -1, optimized for speed  1          1              11 min 41 sec
4  input9.txt.gz  -9                  1.14       gzip -9, optimized for space  1          1              11 min 21 sec

The following shows how you can use the Resource Manager Web UI to find the relevant job statistics that show the processing time as well as the number of mappers and reducers that were used.  

[Screenshot: ResourceManager Web UI showing job statistics for the gzip-compressed input]

Conclusion

In our test #1 scenario the uncompressed file was 5.9 GB when stored in HDFS. With an HDFS block size of 256 MB, the file was stored as ~24 blocks, and a MapReduce job using this file as input created 24 input splits, each processed independently by a separate map task, taking only 1 min 16 sec in total.

In the remaining test scenarios, because of gzip, the file could not be split, resulting in a single input split and an average processing time of approximately 11 minutes. Even the gzip -1 option, meant to optimize for speed, and the -9 option, meant to optimize for space, did not help much.
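The level options change only gzip's CPU/size trade-off, not its splittability, which is why the map count stayed at 1 in tests 2 through 4. A quick standalone Python check of the level effect (synthetic input; actual sizes will differ from the table above):

```python
import gzip

# Repetitive text, standing in for the wordcount input used above.
data = b"the quick brown fox jumps over the lazy dog\n" * 100_000

for level, label in [(1, "gzip -1 (speed)"), (6, "gzip default"), (9, "gzip -9 (space)")]:
    # compresslevel corresponds to the gzip command's -1 .. -9 options.
    size = len(gzip.compress(data, compresslevel=level))
    print(f"{label:16s}: {size} bytes")
```

Regardless of level, the output is a single gzip stream, so Hadoop still cannot split it.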

Gzip compression is an important part of the Hadoop ecosystem; it saves space at the cost of processing time. If data processing is time-sensitive, a splittable compression format, or even uncompressed files, is recommended.

Posted in Uncategorized | Tagged: , , , , , , | 1 Comment »

Accessing HDFS files on local File system using mountableHDFS – FUSE

Posted by sranka on April 9, 2015

Hi All

Recently we had a requirement to merge files after a map and reduce job. Since the files needed to be handed to an outbound team outside the Hadoop development team, having them on the local file system was ideal. The customer's IT team worked with Cloudera and gave us a mount point using a utility called "mountableHDFS", aka FUSE (Filesystem in Userspace).

mountableHDFS allows HDFS to be mounted (on most flavors of Unix) as a standard file system using the mount command. Once mounted, the user can operate on an instance of HDFS using standard Unix utilities such as ls, cd, cp, mkdir, find, and grep.

For more details on mountableHDFS :

https://wiki.apache.org/hadoop/MountableHDFS

For how to configure on cloudera :

http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_hdfs_mountable.html

 

Special thanks to Aditi Hedge for bringing this to my attention.

Hope This Helps,

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

Posted in Uncategorized | Tagged: , , , , | Leave a Comment »

How to read HDFS fsImage file

Posted by sranka on April 2, 2015

During a sizing exercise, the ask for server capacity was more than the actual usage of the cluster. Knowing the data and its usage, I was not convinced that we should be asking for more storage. That triggered the thought of reading the fsimage file itself.

Conceptually, the fsimage file is the balance sheet of all files: their existence and their location. If we could somehow read the metadata within the file and make sense of it, it could help us as follows:

  • Keep the cluster clean.
  • Manage space on the cluster by knowing file duplication and last access times.
  • Know which are the longest-running jobs.

To learn more about the files and their attributes:

STEP 1: Download the latest fsimage copy.

$ hdfs dfsadmin -fetchImage /tmp

$ ls -ltr /tmp | grep -i fsimage
-rw-r--r-- 1 root root 22164 Aug 15 17:27 fsimage_0000000000000004389

$ hdfs oiv -i /tmp/fsimage_0000000000000004389 -o /tmp/fsimage.txt

Note: with the default Web processor, oiv launches an HTTP server that exposes a read-only WebHDFS API, by default on port 5978. To write a text dump instead, use a file-based processor, e.g. hdfs oiv -p XML -i /tmp/fsimage_0000000000000004389 -o /tmp/fsimage.xml.
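Once you have a text dump, answering the space questions above becomes a small scripting exercise. The sketch below assumes the Delimited processor's tab-separated output (available in newer Hadoop releases; the column order shown in the comment should be verified against your version), and the sample paths and sizes are made up for illustration:

```python
import csv
import io

# Two made-up lines in the shape of `hdfs oiv -p Delimited` output.
# Assumed column order: Path, Replication, ModificationTime, AccessTime,
# PreferredBlockSize, BlocksCount, FileSize, NSQUOTA, DSQUOTA, Permission,
# UserName, GroupName -- verify against your Hadoop version.
sample = (
    "/user/sranka/a.csv\t3\t2015-03-01 10:00\t2015-03-30 09:00\t134217728\t1\t1048576\t0\t0\t-rw-r--r--\thdfs\thadoop\n"
    "/user/sranka/b.orc\t3\t2015-02-11 08:30\t2015-02-11 08:30\t134217728\t8\t943718400\t0\t0\t-rw-r--r--\thdfs\thadoop\n"
)

def largest_files(dump, top=5):
    """Return (path, size-in-bytes) pairs, biggest files first."""
    rows = csv.reader(io.StringIO(dump), delimiter="\t")
    # Directories have FileSize 0 in the dump, so they are skipped.
    sized = [(r[0], int(r[6])) for r in rows if int(r[6]) > 0]
    return sorted(sized, key=lambda pair: pair[1], reverse=True)[:top]

for path, size in largest_files(sample):
    print(f"{size:>12d}  {path}")
```

The same loop can just as easily aggregate by owner or filter on last access time to spot stale data.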

For more detail on oiv, you can visit :

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html

 

Hope This Helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

Posted in Uncategorized | Tagged: , , , , , , | Leave a Comment »

Permissions for both HDFS and local fileSystem paths

Posted by sranka on July 18, 2014

Hi All,

Permission issues are among the key errors encountered while setting up a Hadoop cluster. While debugging one such error I found the table below on http://hadoop.apache.org/. It's a good scorecard to keep handy.

 

Permissions for both HDFS and local fileSystem paths

The following table lists various paths on HDFS and local filesystems (on all nodes) and recommended permissions:

Filesystem  Path                                        User:Group     Permissions
local       dfs.namenode.name.dir                       hdfs:hadoop    drwx------
local       dfs.datanode.data.dir                       hdfs:hadoop    drwx------
local       $HADOOP_LOG_DIR                             hdfs:hadoop    drwxrwxr-x
local       $YARN_LOG_DIR                               yarn:hadoop    drwxrwxr-x
local       yarn.nodemanager.local-dirs                 yarn:hadoop    drwxr-xr-x
local       yarn.nodemanager.log-dirs                   yarn:hadoop    drwxr-xr-x
local       container-executor                          root:hadoop    --Sr-s---
local       conf/container-executor.cfg                 root:hadoop    r--------
hdfs        /                                           hdfs:hadoop    drwxr-xr-x
hdfs        /tmp                                        hdfs:hadoop    drwxrwxrwxt
hdfs        /user                                       hdfs:hadoop    drwxr-xr-x
hdfs        yarn.nodemanager.remote-app-log-dir         yarn:hadoop    drwxrwxrwxt
hdfs        mapreduce.jobhistory.intermediate-done-dir  mapred:hadoop  drwxrwxrwxt
hdfs        mapreduce.jobhistory.done-dir               mapred:hadoop  drwxr-x---
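If you want to check local directories against this scorecard programmatically, Python's stat.filemode renders a path's permission bits in the same notation the table uses. A generic sketch, not Hadoop-specific:

```python
import os
import stat
import tempfile

def mode_string(path):
    """Render permission bits the way `ls -l` and the table above do."""
    return stat.filemode(os.stat(path).st_mode)

# Example: a directory locked down like dfs.namenode.name.dir (drwx------).
d = tempfile.mkdtemp()
os.chmod(d, 0o700)
print(mode_string(d))
```

Comparing mode_string output for each configured directory against the expected column is an easy pre-flight check before starting the daemons.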

Hope this helps

Sunil S Ranka

“Superior BI is the antidote to Business Failure”

This table was taken directly from http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-common/SecureMode.html

 

Posted in 11g, Big Data | Tagged: , , , , , , , | 1 Comment »