Link roundup, week 29/2012

Friday, July 20. 2012

It took me quite a long time to discover that my favorite knowledge management tool, Diigo, can post one’s bookmarks to a blog. I have often wanted to repost links I stumbled upon, so I will do that occasionally from now on. The links will mostly come from the topic pool of data mining (and related buzzwords), with flavors ranging from theory to applications and from technology to business. (I can’t really do that on social media sites, as it’s almost impossible to consume posts there by topic. So, blogs aren’t really obsolete yet.)

Btw, Diigo is really awesome: you can highlight text on web pages and add annotations, which helps in understanding an article and creates a summary on the fly while you read it. In this sense: if you want to be briefed, read at least the highlights below. (And don’t worry, the next episodes will contain less content; this one reaches back a few weeks.)

Your Laptop Can Now Analyze Big Data - Technology Review

tags: computerscience datascience technology software
- GraphChi, exploits the capacious hard drives
- a Mac Mini running GraphChi can analyze Twitter’s social graph from 2010—which contains 40 million users and 1.2 billion connections—in 59 minutes
- The previous published result on this problem took 400 minutes using a cluster of about 1,000 computers
- graph computation is becoming more and more relevant
- GraphChi is capable of effectively handling many large-scale graph-computing problems without resorting to cloud-based solutions or supercomputers
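
To put the two quoted timings side by side, here is a back-of-the-envelope sketch in Python. It uses only the figures from the highlights above; "about 1,000 computers" is treated as exactly 1,000, and hardware differences between the cluster machines and the Mac Mini are ignored, so the result is a rough order of magnitude and nothing more.

```python
# Figures quoted in the article highlights above.
cluster_minutes = 400      # previously published result
cluster_machines = 1000    # "about 1,000 computers", rounded (assumption)
graphchi_minutes = 59      # single Mac Mini running GraphChi


def machine_minutes(minutes, machines):
    """Total machine time: wall-clock minutes times number of machines."""
    return minutes * machines


cluster_cost = machine_minutes(cluster_minutes, cluster_machines)
graphchi_cost = machine_minutes(graphchi_minutes, 1)

# How much sooner the single machine finishes (wall-clock time).
wall_clock_speedup = cluster_minutes / graphchi_minutes

# How much less total machine time the single machine consumes.
machine_time_ratio = cluster_cost / graphchi_cost

print(f"wall-clock speedup: {wall_clock_speedup:.1f}x")
print(f"machine-time ratio: {machine_time_ratio:.0f}x")
```

Under these assumptions the single machine finishes roughly 7 times sooner and uses several thousand times less machine time.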
Large-scale Incremental Processing Using Distributed Transactions and Notifications

Google’s Percolator paper

tags: google datascience bigdata technology
- Publication Year 2010
- MapReduce and other batch-processing systems cannot process small updates individually
- Percolator, a system for incrementally processing updates to a large data set
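
The difference between batch and incremental processing can be illustrated with a toy word count in Python. This is not Percolator's API or design (the paper describes distributed transactions and notifications on top of Bigtable); `batch_count` and `IncrementalCounter` are hypothetical names chosen for this sketch, which only shows why small updates are cheap when you touch just the changed document.

```python
from collections import Counter


def batch_count(documents):
    """Batch style: rescan every document to rebuild all word counts."""
    counts = Counter()
    for text in documents.values():
        counts.update(text.split())
    return counts


class IncrementalCounter:
    """Incremental style: adjust counts only for the document that changed."""

    def __init__(self):
        self.documents = {}
        self.counts = Counter()

    def _remove_words(self, text):
        # Undo the contribution of one document, dropping words that hit zero.
        for word in text.split():
            self.counts[word] -= 1
            if self.counts[word] == 0:
                del self.counts[word]

    def upsert(self, doc_id, text):
        """Add a new document or replace an existing one."""
        old = self.documents.get(doc_id)
        if old is not None:
            self._remove_words(old)
        self.documents[doc_id] = text
        self.counts.update(text.split())

    def delete(self, doc_id):
        """Remove a document and its contribution to the counts."""
        self._remove_words(self.documents.pop(doc_id))


index = IncrementalCounter()
index.upsert("a", "big data big graphs")
index.upsert("b", "graphs everywhere")
index.upsert("a", "big data")   # modify document "a"
index.delete("b")               # delete document "b"

# The incremental result matches a full batch recomputation.
print(index.counts == batch_count(index.documents))
```

Each `upsert` or `delete` costs time proportional to the size of the changed document, while `batch_count` always costs time proportional to the whole collection.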
Big Data in Deutschland – der Status Quo | silicon.de

tags: bigdata datascience technology opinion
- the topic of big data is still at an early stage
- still in the analysis and planning phase
- availability of new analytics and database technologies
- dynamic growth of company-internal data traffic
- big data often enters the company ‘through the back door’
- data growth of 42 percent by the end of 2014
- a lot of work on the storage infrastructure side
- mid-sized companies (500-999 employees) and large enterprises (1,000 or more employees)
- More than a third expect cost savings. Almost half expect better insights into the information and consumption behavior of their customers
- high expectations placed on service providers and solution vendors
Why the days are numbered for Hadoop as we know it — Cloud Computing News

tags: datascience bigdata cloud technology opinion
- it has become synonymous with big data
- de facto standard
- Is the enterprise buying into a technology whose best day has already passed?
- Hadoop’s inspiration – Google’s MapReduce
- Google File System (GFS) and Google MapReduce (GMR)
- make big data processing approachable to Google’s typical user/developer
- Hadoop Distributed File System and Hadoop MapReduce — was born in the image of GFS and GMR
- Your code is turned into map and reduce jobs, and Hadoop runs those jobs for you
- Google evolved. Can Hadoop catch up?
- GMR no longer holds such prominence in the Google stack
- Here are technologies that I hope will ultimately seed the post-Hadoop era
- it will require new, non-MapReduce-based architectures that leverage the Hadoop core (HDFS and Zookeeper) to truly compete with Google
- Percolator for incremental indexing and analysis of frequently changing datasets
- each time you want to analyze the data (say after adding, modifying or deleting data) you have to stream over the entire dataset
- displacing GMR in favor of an incremental processing engine called Percolator
- dealing only with new, modified, or deleted documents
- Dremel for ad hoc analytics
- SQL-like familiarity
- many interface layers have been built
- purpose-built for organized data processing (jobs). It is baked from the core for workflows, not ad hoc exploration
- BI/analytics queries are fundamentally ad hoc, interactive, low-latency
- Google invented Dremel (now exposed as the BigQuery product)
- I’m not aware of any compelling open source alternatives to Dremel
- Pregel for analyzing graph data
- certain core assumptions of MapReduce are at fundamental odds with analyzing networks of people, telecommunications equipment, documents and other
- petabyte-scale graph processing on distributed commodity machines
- Hadoop, which often causes exponential data amplification in graph processing
- execute graph algorithms such as SSSP or PageRank in dramatically shorter time
- near linear scaling of execution time with graph size
- the only viable option in the open source world is Giraph
- if you’re trying to process dynamic data sets, ad-hoc analytics or graph data structures, Google’s own actions clearly demonstrate better alternatives to the MapReduce paradigm
- Percolator, Dremel and Pregel make an impressive trio and comprise the new canon of big data
- similar impact on IT as Google’s original big three of GFS, GMR, and BigTable
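
To give a feeling for the vertex-centric style that the Pregel highlights refer to, here is a single-process Python sketch of single-source shortest paths (SSSP) computed in supersteps. It is not the Pregel or Giraph API; the function name `pregel_sssp` and the message-passing layout are assumptions made for illustration, and edge weights are assumed to be non-negative.

```python
import math


def pregel_sssp(edges, source):
    """Simulate vertex-centric SSSP.

    edges maps each vertex to a list of (neighbor, weight) pairs.
    In every superstep, a vertex that received messages takes the smallest
    proposed distance; if that improves its current value, it sends updated
    proposals to its neighbors. The run ends when no messages are in flight.
    """
    vertices = set(edges)
    for targets in edges.values():
        for neighbor, _ in targets:
            vertices.add(neighbor)

    distance = {v: math.inf for v in vertices}
    inbox = {source: [0]}  # initial message: the source is at distance 0

    while inbox:
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < distance[vertex]:
                distance[vertex] = best
                for neighbor, weight in edges.get(vertex, []):
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox  # messages are delivered in the next superstep

    return distance


graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 6)],
    "c": [("d", 1)],
    "d": [],
}
print(pregel_sssp(graph, "a"))
```

Only vertices that received a message do any work in a superstep, so the graph is never streamed in full for each iteration, in contrast to the MapReduce behavior the article criticizes.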

Devs Love Bacon: Everything you need to know about Machine Learning in 30 minutes or less | hilarymason.com

tags: machinelearning video opinion
- The talk is geared toward engineers with no prior knowledge of machine learning, and it’s designed to lay out the basic vocabulary
- investigating which techniques they might want to learn more about or implement
Universität Konstanz | Informatik und Informationswissenschaft | Guide to Intelligent Data Analysis

tags: datamining datascience books
Marginally Interesting: Is Machine Learning Losing Impact?

tags: machinelearning datascience research opinion
- machine learning might be in danger of losing its impact because the community as a whole has become quite self-referential
- People are probably solving real-world problems using ML methods, but there is little sharing of these results within the community
- people focus on existing benchmarks
- I think it is wrong to take the main ML conferences and journals as a reference to how much application work is going on
- you have to start to publish in the conferences and journals of the application field, not in pure ML conferences
- If everything were always very application specific, it would be very hard to transfer knowledge between people working on different applications
- it is very hard to publish application related papers at ML conferences
- The hype around Big Data and Data Science is pretty big
- machine learners are one of three groups who can potentially contribute a lot to this field (the others being data mining people, and computational statisticians)
- we’re losing the race to get our share of the cake, mostly to data mining people who have much better expertise on the technological side
- machine learning has been a bit too successful in finding an abstract mathematical language
- many of my colleagues consider databases as just another file format
- as a data scientist, you also need to be able to put your stuff into production, which means dealing with all kinds of enterprise level technology like web services, databases, messaging middleware, and questions of stability and scalability
- people consider this extra work as merely “programming”, and something which is outside of the scope of a machine learner
- know how to implement an algorithm in an enterprise environment as opposed to a ML-friendly matrix based language such as matlab or R
Freshplum Raises $1.4M in Seed Funding for its Analytics Solution

tags: datascience technology companies
- data-science driven analytics solution
- 26th June 2012
- funding from some pretty impressive investors
- in stealth mode up until now, but is already handling over 10M API requests a day
- Freshplum is aiming to use data-driven and mathematical approaches to analytics to help its users direct their sales decisions
- Today, when people sell things online – especially virtual goods – they just guess at pricing
- We’re being extremely selective at the moment
- implementing its solution takes minutes instead of hours or days
- “later this year”
Gartner 2011 Hype Cycle for Emerging Technologies [GIF]

tags: datascience technology graphics
Gartner Adds Big Data, Gamification, and Internet of Things to Its Hype Cycle

tags: datascience technology graphics
- August 11, 2011
- private cloud computing has reached the peak level of hype, and cloud/Web platforms are slipping into the “trough of disillusionment” in the face of Platform as a Service (PaaS)
Computer Science: Where can I learn Math online in the same way that Code Academy has me learning programming? - Quora

tags: mathematics lectures
RDataMining.com: R and Data Mining

tags: r programming datamining
CRAN Task View: Machine Learning & Statistical Learning

tags: r statistics machinelearning programming
- Neural Networks
- Recursive Partitioning: Tree-structured models
- Random Forests
- Regularized and Shrinkage Methods
- Boosting
- Support Vector Machines and Kernel Methods
- Bayesian Methods
- Model selection and validation
- Elements of Statistical Learning
- rattle is a graphical user interface for data mining in R
Udacity - Introduction to Statistics (ST101)

tags: statistics video lectures
- does not require any previous knowledge of statistics
- Visualizing relationships in data
- dealing with noise
- conditional probability; Bayes Rule
- Normal distributions; the central limit theorem
- Sampling distributions; confidence intervals; hypothesis tests
- Least squares; residuals; inference
- Transformation; smoothing; regression
- Statistics vs machine learning
- Final exam
Netverify – Building a strong customer relationship. | Jumio

tags: computervision technology
- Netverify allows customers to verify their ID online in real time
- uses the user’s camera to capture any ID
- sophisticated algorithms for identifying faked identification documents
- Our security and monitoring team are overseeing all activities in our systems. 24 hours all around the clock, 7 days a week, 365 days a year
- Netverify maintains its own black list database and also checks identities with a variety of databases like criminal records, credit scoring agencies, etc.
Verified by Jumio | Jumio

tags: computervision technology
- 3-D secure protocol based authentication services like Verified by Visa, Mastercard Secure Code or SafeKey
- authenticate a credit card transaction by scanning the actual payment card with the camera
- the technology can analyze the material of the card, the embossed data, holograms and additional card specific security features
- patent pending
- 10x faster than a keyed transaction
What’s the Definition of ‘Big Data’? Who Cares? | SmartData Collective

tags: datascience technology
- What moves the ball forward is the business team agreeing that the new data is useful and worth analyzing. What moves the ball forward is when the IT team decides how to best make the data available based on the characteristics of the data. Progress is made with a focus on putting the data to work, not on semantics.
- whether the data was valuable enough to collect or not has nothing to do with what definitional bucket we might place the data source in
- much of what is being associated with “big data” is actually a function of “different data”
- change direction and focus the discussion on what the value of the data might be and how it can be leveraged for analysis
Teradata Aster Standardizes Access to Hadoop with SQL-H | SmartData Collective

tags: datascience software technology
- SQL-H, a new query interface to analyze data from Hadoop
- Most Hadoop access methods require
- Hadoop Distributed File System (HDFS) using technologies such as MapReduce
- reduces staffing and training issues required for learning more Hadoop-specific interfaces
- The need to extract and store data from Hadoop into other database systems and thereby lose the computing power of Hadoop has been the Achilles heel
- Teradata Aster has an advantage over EMC Greenplum, IBM and Oracle
- one-third of organizations plan to use Hadoop
Data Mining with R | SmartData Collective

tags: datamining r programming video tutorial
- webinar
- watch the replay
- practical examples of using R for decision trees, random forests
- support vector machines
- K-means clustering
- handy links to data mining resources near the end
Computer Vision Using Local Binary Patterns: Matti Pietikäinen, Abdenour Hadid, Guoying Zhao, Timo Ahonen

tags: computervision books
- July 8, 2011
Forecasting: principles and practice | An online textbook by Rob J Hyndman and George Athanasopoulos

Seems to be very business-oriented

tags: statistics books
- We plan to finish the remaining chapters by the end of 2012
- online and free-of-charge
- textbooks never make much money anyway — the publishers make all the money
- forecasting in business
- students studying business
- high school mathematics should be sufficient background
- R throughout the book
- continuously updated
machine learning - Why adding an NA indicator column instead of value imputation (for randomForest) - Statistical Analysis

tags: machinelearning tutorial todo next
- sample codes from kaggle for the randomForest benchmark
  - URL??
- rfImpute
- include NA indicator values that become a part of your overall model
- simply choosing the median of a variable to fill in its NAs is probably not a great model
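
The NA indicator idea from the highlights can be sketched in a few lines of Python. This is not the R randomForest API (`rfImpute` is the R function named in the linked question); `add_na_indicator` is a hypothetical helper, missing values are represented as `None`, and the median fill is just one possible choice of placeholder.

```python
from statistics import median


def add_na_indicator(values, fill=None):
    """Return (filled_values, indicator) for one numeric column.

    Missing entries are given as None. They are replaced by `fill`
    (default: the median of the observed values), and the indicator
    column records with 1 where a value was originally missing.
    """
    observed = [v for v in values if v is not None]
    if fill is None:
        fill = median(observed)  # raises StatisticsError if nothing is observed
    filled = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return filled, indicator


age = [22, None, 35, 58, None]
age_filled, age_was_na = add_na_indicator(age)
print(age_filled)   # [22, 35, 35, 58, 35]
print(age_was_na)   # [0, 1, 0, 0, 1]
```

Feeding both columns to a tree-based model lets it split on the indicator, so the fact that a value was missing becomes part of the model instead of being hidden behind the imputed median.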
Chart of distribution relationships

tags: statistics facts

Posted from Diigo. The rest of my favorite links are here.

Posted by Stephan Paukner in Data Science at 12:52 | Comments (0) | Trackbacks (0)

Defined tags for this entry: linkroll
