Link roundup, week 31/2012

Sunday, August 5. 2012

Link roundup, week 31/2012

Linux lessons for Hadoop doubters • The Register

Interesting comparison of Hadoop today with the Linux story from the past. This could mean Hadoop/MapReduce as state of the art around 2020.

tags: bigdata cloud technology linux
- While Hadoop is all the rage in the technology media today, it has barely scratched the surface of enterprise adoption
- Hadoop seems set to win despite its many shortcomings
- still in the transition from zero per cent adoption to one per cent adoption
- IBM points to a few specific deficiencies
- lack of performance and scalability, inflexible resource management, and a limitation to a single distributed file system
- IBM, of course, promises to resolve these issues with its proprietary complements to Hadoop
- Hadoop is batch oriented in a world increasingly run in real-time
- customers are buying big into Hadoop
- it’s still possible that other alternatives, like Percolator, will claim the Hadoop crown
- Back in 2000 IBM announced that it was going to invest $1bn in advancing the Linux operating system. This was big news
- it came roughly 10 years after Linus Torvalds released the first Linux source code, and it took another 10 years before Linux really came to dominate the industry
- The same seems true of Hadoop today
- we’re just starting the marathon
Can Big Data cut through your growing resume pile? - Fortune Management

Three data-mangling job sites, all only for the US

tags: bigdata career
- Bright, one of several new companies
- another new job site, Path.to
- Gild, a third major player
How ‘Big Data’ Is Different

tags: bigdata datascience technology opinion
- Thomas H. Davenport, Paul Barth and Randy Bean
- how do the potential insights from big data differ from what managers generate from traditional analytics?
- 1. Paying attention to flows as opposed to stocks
- the data is not the “stock” in a data warehouse but a continuous flow
- Streaming analytics
- organizations will need to develop continuous processes
- data extraction, preparation and analysis took weeks to prepare — and weeks more to execute
- conventional, high-certitude approaches to decision-making are often not appropriate
- new data is often available that renders the decision obsolete
- 2. Relying on data scientists and product and process developers as opposed to data analysts
- the people who work with big data need substantial and creative IT skills
- programming, mathematical and statistical skills, as well as business acumen and the ability to communicate effectively
- EMC Corporation
- started an educational offering for data scientists
- 3. Moving analytics from IT into core business and operational functions
- new products designed to deal with big data
- Hadoop
- Relational databases have also been transformed
- Statistical analysis packages
- the cloud
- “virtual data marts” allow data scientists to share existing data without replicating it
- traditional role of IT— automating business processes — imposes precise requirements
- Analytics has been more of an afterthought for monitoring processes
- business and IT capabilities used to be stability and scale, the new advantages are based on discovery and agility
- discovery and analysis as the first order of business
- IT processes and systems need to be designed for insight, not just automation
PayPal’s Mok Oh On What Is A Data Scientist? - Forbes

The title is misleading: It’s not about what DS is. It’s rather a vision of the ideal solution.

tags: datascience technology opinion
- the old state and the ideal future state, which he calls “Analyst 1.0” and “Analyst 2.0,”
- Analyst 1.0 as the state of maturity achieved by using the last generation of business intelligence tools
- Analyst 1.0 has some coding skills, and perhaps writes an SQL query here and there
- inflexibility of data warehouses and relational databases
- Our current state of affairs, which we’ll call Analyst 1.5, finds us in limbo
- two primary limitations: the immense size and variety of the data, and the complexity of the tools needed
- Hadoop
- to get value from big data, business analysts cannot simply be presented with a programming language
- Analyst 1.5 is characterized by a disconnect between data scientists and the tools and systems in the more complex camp of programmers and computer scientists
- caused data to be totally fragmented
- Analyst 2.0 will have arrived when vendors and IT make analysis easy enough that a typical business user can conduct analysis entirely by themselves
- Tools such as self-learning recommendations engines
- demands new skills, such as a more precise focus on aberrant or statistically significant data in a stream, as well as better tools
- somehow at some point you have to get your analytical inspection down to the equivalent of code level
- what we’re trying to model is every person’s brain–at least the part of the brain that decides how to shop, when to shop, and what you want
- we need to continue to mine for behavioral data, such as what people looked at before and after they made transactions
- among the top pitfalls is the tendency to focus on a very small piece of data without occasionally stepping back
- tendency to over-focus on technology
- organizations are tempted to put the most technology-savvy person on the job, rather than the most business-savvy
- computer scientists are not trained to ask the right business questions

Predicting the future, Part 1: What is predictive analytics?

tags: datascience datamining facts
- discover hidden patterns in data that the human expert may not see
- Part 1 of a four part series on predictive analytics
- descriptive analytics lets us know what happened in the past, predictive analytics focuses on what will happen next
- business intelligence. It allows us to make decisions based on statistics obtained from historical data
- Expert knowledge is based on experience
- data-driven knowledge, as its name suggests, is based upon data
- supervised because, during training, data is presented to a predictive model with the input data and the desired output or outcome
- neural networks, support vector machines, and decision trees
- only presented with the input data
- also use unsupervised learning
- Clustering
- historical data in search of features that you could use
- After we build a predictive model, we need to validate it
- PMML (Predictive Model Markup Language) exists that allows predictive models to easily move between different systems
- IBM SPSS Statistics to build and validate a predictive model
- upload it into a scoring engine such as the Zementis ADAPA
- a hundred or so records containing data for customers that churned in the past may not be enough. If not enough data is used for training, a model may not be able learn or worse, it may over fit.
- garbage in, garbage out
- Data by itself does not translate to predictive value. Good data does.
- book by Duda, Hart, and Stock entitled Pattern Classification
- fraud detection
- healthcare
- recommend products and services
- GPS mobile device data to predict traffic
- data from sensors is a clear way towards helping to ensure safety
- Dr. Alex Guazzelli is the VP of Analytics at Zementis
- co-authored the book PMML in Action
Predicting the future, Part 2: Predictive modeling techniques

tags: datascience datamining facts
- we are accumulating data on an exponential scale
- 90 percent of the data available today was created just in the past two years
- neural networks (NNs), clustering, support vector machines (SVMs), and association rules
- extract value from historical data obtained from people and sensors
- predict the risk of customer churn or defection, in case of people data, or the risk of machinery breakdown, in case of sensor data
- compute a score or risk by implementing a regression
- Support vector machines (SVMs)
- decision trees are the most commonly used predictive modeling technique
- association rules can be used to discover that people who buy diapers and milk, also buy beer
- give a NN too few hidden nodes, it may not learn the mapping function between the input fields and the target. Too many nodes and it will over fit
- many models can be combined together in what is called a model ensemble
- scores from all models are computed and the final prediction is determined by a voting mechanism or the average
- Black-box
- techniques that are not capable of explaining their reasoning
- NNs and SVMs fall into this category
- If it outputs a high risk of churn for a particular customer, it will not be able to tell us why
- technique that clearly pinpoints the reasons for its decisions. Scorecards fit such a criteria very well
- Based on regression models, scorecards are a popular technique used by financial institutions to assess risk
- decision trees are easy to explain and understand
- black-box modeling techniques are hard to explain, the models themselves should not be
- representing data pre-processing as well as predictive models is now straightforward with PMML, the Predictive Model Markup Language
- represented as a PMML file, a predictive model can be moved right away from the scientist’s desktop, where it was developed, to the operational environment, where it is put to work
- predictive solutions are only now experiencing a boom in all industries, due to the advent of: 1) big data derived from people and sensors; 2) cost-efficient processing platforms such as Cloud- and Hadoop-based; and 3) PMML
- this data gives us the potential to transform the world into a smarter world
Predicting the future, Part 3: Create a predictive solution

tags: datascience datamining facts
- I will use IBM SPSS Statistics to illustrate many of these phases
- phases involved in the making of a predictive solution, from data pre-processing all the way to its operational deployment
- structured data encompasses fields such as customer age, gender, and number of purchases in the last month
- Unstructured data may be represented by a comment the same customer provided as feedback
- all ages lower than 21 may be binned together into a student category
- all ages higher than 55 may be binned together into a retiree category
- Category worker may then be assigned to everyone 21 to 55 years old
- if customer A is 25 years old
- three distinct fields: student, worker, and retiree, which would be mapped to 0, 1, and 0
- continuous fields may need to be normalized
- text mining to identify attrition cues in comments
- statistical packages allow users to select an option to pre-process the data automatically
- 1) Balance speed and accuracy; 2) Optimize for speed; 3) Optimize for accuracy; and 4) Customize analysis
- neural networks (NNs) and support vector machines
- Decision Trees and Scorecards, are also capable of explaining the reasoning
- I will focus in this article on NNs
- split your input into factors and covariates. Factors represent categorical input
- Covariates represent continuous variables
- one reserved for model training, and the other for testing
- option for Automatic architecture selection
- you want the model to have a low rate of false-positives (FP) and false-negatives (FN), which implies a high rate of true-positives (TP) and true-negatives (TN)
- cost associated with a FN is usually very different from the cost associated with a FP
- ROC (Receiver Operating Characteristic) curve
- graphical representation of the true positive rate (sensitivity) versus the false positive rate (one minus specificity) for a binary classifier as its discrimination threshold varies
- area under the ROC curve
- scores need to be translated into business decisions
- benefit from two types of knowledge: expert and data-driven
Predicting the future, Part 4: Put a predictive solution to work

tags: datascience datamining facts
- Predictive Model Markup Language (PMML) has completely changed this scenario
- a predictive solution needs to successfully bridge the gap between two very different worlds. I call these, planets Predicta and Engira
- Planet Predicta is populated by data scientists with expertise in statistics, data mining, and language skills such as Perl and Python. Planet Engira, on the other hand, is populated by IT engineers with expertise in Java™, .NET, C, SQL
- phases involved in the making of a predictive solution, from data pre-preprocessing and model building all the way to post-processing of model scores. PMML is able to represent all these phases in a single file
- include multiple models or a model ensemble
- we can use it to unveil the secrecy and the black box feeling
- R project
- At Zementis, we have created a PMML-based predictive analytics decision management platform called ADAPA
- ADAPA lives on the operational side
- Universal PMML Plug-in (UPPI) for in-database scoring and for Hadoop

Posted from Diigo. The rest of my favorite links are here.

Posted by Stephan Paukner in Data Science at 12:08 | Comments (0) | Trackbacks (0)

Defined tags for this entry: linkroll

Trackbacks

Trackback specific URI for this entry

No Trackbacks

Comments

Display comments as (Linear | Threaded)

No comments

Add Comment

Name

Homepage

Comment

In reply to

Phone*

What is two plus seven?

Enclosing asterisks marks text as bold (*word*), underscore are made via _word_.

Standard emoticons like :-) and ;-) are converted to images.

To prevent automated Bots from commentspamming, please enter the string you see in the image below in the appropriate input box. Your comment will only be submitted if the strings match. Please ensure that your browser supports and accepts cookies, or your comment cannot be verified correctly.
CAPTCHA