Sunday, August 5. 2012
Interesting comparison of Hadoop today with the Linux story from the past. If the analogy holds, Hadoop/MapReduce would become the state of the art around 2020.
tags: bigdata cloud technology linux
While Hadoop is all the rage in the technology media today, it has barely scratched the surface of enterprise adoption
Hadoop seems set to win despite its many shortcomings
still in the transition from zero per cent adoption to one per cent adoption
IBM points to a few specific deficiencies
lack of performance and scalability, inflexible resource management, and a limitation to a single distributed file system
IBM, of course, promises to resolve these issues with its proprietary complements to Hadoop
Hadoop is batch oriented in a world increasingly run in real-time
customers are buying big into Hadoop
it’s still possible that other alternatives, like Percolator, will claim the Hadoop crown
Back in 2000 IBM announced that it was going to invest $1bn in advancing the Linux operating system. This was big news
it came roughly 10 years after Linus Torvalds released the first Linux source code, and it took another 10 years before Linux really came to dominate the industry
The same seems true of Hadoop today
we’re just starting the marathon
Three data-mangling job sites, all only for the US
tags: bigdata career
Bright, one of several new companies
another new job site,
Gild, a third major player
tags: bigdata datascience technology opinion
Thomas H. Davenport, Paul Barth and Randy Bean
how do the potential insights from big data differ from what managers generate from traditional analytics?
1. Paying attention to flows as opposed to stocks
the data is not the “stock” in a data warehouse but a continuous flow
organizations will need to develop continuous processes
data extraction, preparation and analysis took weeks to prepare — and weeks more to execute
conventional, high-certitude approaches to decision-making are often not appropriate
new data is often available that renders the decision obsolete
2. Relying on data scientists and product and process developers as opposed to data analysts
the people who work with big data need substantial and creative IT skills
programming, mathematical and statistical skills, as well as business acumen and the ability to communicate effectively
started an educational offering for data scientists
3. Moving analytics from IT into core business and operational functions
new products designed to deal with big data
Relational databases have also been transformed
Statistical analysis packages
“virtual data marts” allow data scientists to share existing data without replicating it
traditional role of IT, automating business processes, imposes precise requirements
Analytics has been more of an afterthought for monitoring processes
business and IT capabilities used to be stability and scale, the new advantages are based on discovery and agility
discovery and analysis as the first order of business
IT processes and systems need to be designed for insight, not just automation
The title is misleading: it is not about what data science is, but rather a vision of the ideal solution.
tags: datascience technology opinion
the old state and the ideal future state, which he calls “Analyst 1.0” and “Analyst 2.0,”
Analyst 1.0 as the state of maturity achieved by using the last generation of business intelligence tools
Analyst 1.0 has some coding skills, and perhaps writes an SQL query here and there
inflexibility of data warehouses and relational databases
Our current state of affairs, which we’ll call Analyst 1.5, finds us in limbo
two primary limitations: the immense size and variety of the data, and the complexity of the tools needed
to get value from big data, business analysts cannot simply be presented with a programming language
Analyst 1.5 is characterized by a disconnect between data scientists and the tools and systems in the more complex camp of programmers and computer scientists
caused data to be totally fragmented
Analyst 2.0 will have arrived when vendors and IT make analysis easy enough that a typical business user can conduct analysis entirely by themselves
Tools such as self-learning recommendation engines
demands new skills, such as a more precise focus on aberrant or statistically significant data in a stream, as well as better tools
somehow at some point you have to get your analytical inspection down to the equivalent of code level
what we’re trying to model is every person’s brain–at least the part of the brain that decides how to shop, when to shop, and what you want
we need to continue to mine for behavioral data, such as what people looked at before and after they made transactions
among the top pitfalls is the tendency to focus on a very small piece of data without occasionally stepping back
tendency to over-focus on technology
organizations are tempted to put the most technology-savvy person on the job, rather than the most business-savvy
computer scientists are not trained to ask the right business questions
tags: datascience datamining facts
discover hidden patterns in data that the human expert may not see
Part 1 of a four part series on predictive analytics
descriptive analytics lets us know what happened in the past, predictive analytics focuses on what will happen next
business intelligence. It allows us to make decisions based on statistics obtained from historical data
Expert knowledge is based on experience
data-driven knowledge, as its name suggests, is based upon data
supervised because, during training, data is presented to a predictive model with the input data and the desired output or outcome
neural networks, support vector machines, and decision trees
only presented with the input data
also use unsupervised learning
historical data in search of features that you could use
After we build a predictive model, we need to validate it
PMML (Predictive Model Markup Language) exists that allows predictive models to easily move between different systems
IBM SPSS Statistics to build and validate a predictive model
upload it into a scoring engine such as the Zementis ADAPA
a hundred or so records containing data for customers that churned in the past may not be enough. If not enough data is used for training, a model may not be able to learn or, worse, it may overfit.
Data by itself does not translate to predictive value. Good data does.
book by Duda, Hart, and Stork entitled Pattern Classification
recommend products and services
GPS mobile device data to predict traffic
data from sensors is a clear way towards helping to ensure safety
Dr. Alex Guazzelli is the VP of Analytics at Zementis
co-authored the book PMML in Action
tags: datascience datamining facts
we are accumulating data on an exponential scale
90 percent of the data available today was created just in the past two years
neural networks (NNs), clustering, support vector machines (SVMs), and association rules
extract value from historical data obtained from people and sensors
predict the risk of customer churn or defection, in case of people data, or the risk of machinery breakdown, in case of sensor data
compute a score or risk by implementing a regression
Support vector machines (SVMs)
decision trees are the most commonly used predictive modeling technique
association rules can be used to discover that people who buy diapers and milk also buy beer
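A minimal sketch of how such a rule can be measured, assuming the usual support and confidence definitions. The transactions and the function names `support` and `confidence` are made up for illustration and do not come from the bookmarked article.

```python
# Toy shopping baskets, invented for illustration only.
transactions = [
    {"diapers", "milk", "beer"},
    {"diapers", "milk", "beer", "bread"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"beer", "chips"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) from the baskets."""
    base = support(antecedent, baskets)
    if base == 0:
        return 0.0  # the rule never fires, so there is nothing to estimate
    return support(antecedent | consequent, baskets) / base


rule_support = support({"diapers", "milk", "beer"}, transactions)
rule_confidence = confidence({"diapers", "milk"}, {"beer"}, transactions)
print(rule_support, rule_confidence)
```

In this toy data the rule "diapers and milk, then beer" holds in two of the three baskets that contain diapers and milk.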
give a NN too few hidden nodes, it may not learn the mapping function between the input fields and the target. Too many nodes and it will overfit
many models can be combined together in what is called a model ensemble
scores from all models are computed and the final prediction is determined by a voting mechanism or the average
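The two combination rules mentioned here, voting and averaging, can be sketched as follows. The function names and the example outputs are hypothetical; a real ensemble would take its inputs from trained models.

```python
from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Return the class label predicted by most models.

    On a tie, the label that appears first in `labels` wins.
    """
    return Counter(labels).most_common(1)[0][0]


def average_score(scores):
    """Return the mean of the numeric scores produced by the models."""
    return mean(scores)


# Hypothetical outputs of three models for one customer.
print(majority_vote(["churn", "stay", "churn"]))
print(average_score([0.2, 0.4, 0.9]))
```

Voting suits models that output class labels, while averaging suits models that output a numeric score or risk.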
techniques that are not capable of explaining their reasoning
NNs and SVMs fall into this category
If it outputs a high risk of churn for a particular customer, it will not be able to tell us why
technique that clearly pinpoints the reasons for its decisions. Scorecards fit such a criterion very well
Based on regression models, scorecards are a popular technique used by financial institutions to assess risk
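To show why a scorecard can explain itself, here is a small sketch. Everything in it is an assumption for illustration: the fields, the bin boundaries, the points and the base score are invented, and in practice the points would be derived from a fitted regression model.

```python
import math

# Hypothetical scorecard: for each field, (exclusive upper bound, points).
SCORECARD = {
    "age": [(21, 10), (56, 30), (math.inf, 20)],
    "purchases_last_month": [(1, 5), (5, 20), (math.inf, 35)],
}
BASE_POINTS = 100


def points_for(field, value):
    """Look up the points of the first bin whose upper bound exceeds value."""
    for upper, points in SCORECARD[field]:
        if value < upper:
            return points
    raise ValueError(f"no bin for {field}={value!r}")


def score_with_reasons(customer):
    """Return the total score and the fields that cost the most points."""
    total = BASE_POINTS
    shortfalls = []
    for field, bins in SCORECARD.items():
        earned = points_for(field, customer[field])
        best = max(points for _, points in bins)
        total += earned
        shortfalls.append((best - earned, field))
    shortfalls.sort(reverse=True)  # largest loss of points first
    reasons = [field for gap, field in shortfalls if gap > 0]
    return total, reasons


print(score_with_reasons({"age": 19, "purchases_last_month": 0}))
```

Because every field contributes a visible number of points, the fields with the largest shortfall can be reported as reasons for the score.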
decision trees are easy to explain and understand
black-box modeling techniques are hard to explain, the models themselves should not be
representing data pre-processing as well as predictive models is now straightforward with PMML, the Predictive Model Markup Language
represented as a PMML file, a predictive model can be moved right away from the scientist’s desktop, where it was developed, to the operational environment, where it is put to work
predictive solutions are only now experiencing a boom in all industries, due to the advent of: 1) big data derived from people and sensors; 2) cost-efficient processing platforms such as Cloud- and Hadoop-based; and 3) PMML
this data gives us the potential to transform the world into a smarter world
tags: datascience datamining facts
I will use IBM SPSS Statistics to illustrate many of these phases
phases involved in the making of a predictive solution, from data pre-processing all the way to its operational deployment
structured data encompasses fields such as customer age, gender, and number of purchases in the last month
Unstructured data may be represented by a comment the same customer provided as feedback
all ages lower than 21 may be binned together into a student category
all ages higher than 55 may be binned together into a retiree category
Category worker may then be assigned to everyone 21 to 55 years old
if customer A is 25 years old
three distinct fields: student, worker, and retiree, which would be mapped to 0, 1, and 0
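The binning and mapping described in the lines above can be sketched like this. The age boundaries and the result for customer A come from the excerpt; the function names are hypothetical.

```python
CATEGORIES = ("student", "worker", "retiree")


def age_category(age):
    """Bin an age: under 21 is student, over 55 is retiree, else worker."""
    if age < 21:
        return "student"
    if age > 55:
        return "retiree"
    return "worker"


def one_hot(category):
    """Map a category to one 0/1 indicator per entry in CATEGORIES."""
    return [1 if category == name else 0 for name in CATEGORIES]


# Customer A is 25 years old, so the worker field is set.
print(one_hot(age_category(25)))
```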
continuous fields may need to be normalized
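One common way to normalize a continuous field is min-max scaling to the range 0 to 1. The excerpt does not say which method the article uses, so this choice and the function name are assumptions.

```python
def min_max_normalize(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    if high == low:
        # A constant field carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


# Hypothetical number of purchases for three customers.
print(min_max_normalize([10, 20, 30]))
```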
text mining to identify attrition cues in comments
statistical packages allow users to select an option to pre-process the data automatically
1) Balance speed and accuracy; 2) Optimize for speed; 3) Optimize for accuracy; and 4) Customize analysis
neural networks (NNs) and support vector machines
Decision Trees and Scorecards are also capable of explaining the reasoning
I will focus in this article on NNs
split your input into factors and covariates. Factors represent categorical input
Covariates represent continuous variables
one reserved for model training, and the other for testing
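A minimal sketch of such a split, assuming a random shuffle with a fixed seed. The 30 percent test share, the seed and the function name are illustrative choices, not settings taken from the article or from IBM SPSS Statistics.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle the records reproducibly and split them into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(range(10))
print(len(train), len(test))
```

Keeping the test records out of training is what makes the later validation an honest estimate of how the model behaves on unseen data.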
option for Automatic architecture selection
you want the model to have a low rate of false-positives (FP) and false-negatives (FN), which implies a high rate of true-positives (TP) and true-negatives (TN)
cost associated with a FN is usually very different from the cost associated with a FP
ROC (Receiver Operating Characteristic) curve
graphical representation of the true positive rate (sensitivity) versus the false positive rate (one minus specificity) for a binary classifier as its discrimination threshold varies
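The points of a ROC curve can be computed by sweeping the threshold over the model scores, as in this sketch. The scores and labels are invented, the function names are hypothetical, and the code assumes that both classes occur in the labels.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    points = [(0.0, 0.0)]  # a threshold above every score predicts no positives
    for threshold in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


# Hypothetical churn scores and true outcomes (1 = churned).
print(roc_points([0.9, 0.8, 0.4, 0.2], [1, 0, 1, 0]))
```

Lowering the threshold moves along the curve toward the upper right: more true positives are caught at the price of more false positives.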
scores need to be translated into business decisions
benefit from two types of knowledge: expert and data-driven
tags: datascience datamining facts
Predictive Model Markup Language (PMML) has completely changed this scenario
a predictive solution needs to successfully bridge the gap between two very different worlds. I call these planets Predicta and Engira
Planet Predicta is populated by data scientists with expertise in statistics, data mining, and language skills such as Perl and Python. Planet Engira, on the other hand, is populated by IT engineers with expertise in Java™, .NET, C, SQL
phases involved in the making of a predictive solution, from data pre-processing and model building all the way to post-processing of model scores. PMML is able to represent all these phases in a single file
include multiple models or a model ensemble
we can use it to unveil the secrecy and the black box feeling
At Zementis, we have created a PMML-based predictive analytics decision management platform called ADAPA
ADAPA lives on the operational side
Universal PMML Plug-in (UPPI) for in-database scoring and for Hadoop
Posted from Diigo.