Friday, July 20, 2012
It took me quite a long time to discover that my favorite knowledge management tool, Diigo, can post one’s bookmarks to a blog. As I have often wanted to repost links I stumbled upon, I will do so occasionally from now on. The links will mostly come from the topic pool of data mining (and related buzzwords), with flavors ranging from theory to applications and from technology to business. (I can’t really do that on social media sites, as it’s almost impossible to consume posts there by topic. So blogs aren’t really obsolete, at least not yet.)
Btw, Diigo is really awesome: you can highlight text on webpages and add annotations, which helps you understand an article and create a summary on the fly while reading it. In this sense: if you want to be briefed, read at least the highlights below. (And don’t worry, the next episodes will contain less content; this one goes back a few weeks.)
tags: computerscience datascience technology software
GraphChi, exploits the capacious hard drives
a Mac Mini running GraphChi can analyze Twitter’s social graph from 2010—which contains 40 million users and 1.2 billion connections—in 59 minutes
The previous published result on this problem took 400 minutes using a cluster of about 1,000 computers
graph computation is becoming more and more relevant
GraphChi is capable of effectively handling many large-scale graph-computing problems without resorting to cloud-based solutions or supercomputers
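To put the two numbers quoted above side by side, here is a quick back-of-the-envelope calculation. It is only a sketch: the source says "about 1,000 computers", which I treat as exactly 1,000, and it ignores any differences in hardware or setup between the two runs.

```python
# Figures quoted in the highlights above.
graphchi_minutes = 59      # single Mac Mini running GraphChi
cluster_minutes = 400      # previously published result
cluster_machines = 1000    # assumption: "about 1,000" treated as exact

# How much sooner the single machine finishes (wall-clock time).
wall_clock_speedup = cluster_minutes / graphchi_minutes

# Total machine time spent: minutes multiplied by number of machines.
machine_minutes_cluster = cluster_minutes * cluster_machines
machine_minutes_ratio = machine_minutes_cluster / graphchi_minutes

print(f"wall-clock speedup: {wall_clock_speedup:.1f}x")
print(f"machine-minutes ratio: {machine_minutes_ratio:.0f}x")
```

So the interesting part is less the wall-clock time than the total machine time: one small box versus a whole cluster.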
Google’s Percolator paper
tags: google datascience bigdata technology
MapReduce and other batch-processing systems cannot process small updates individually
Percolator, a system for incrementally processing updates to a large data set
tags: bigdata datascience technology opinion
the topic of big data is still at an early stage
still in the analysis and planning phase
availability of new analytics and database technologies
dynamic growth in company-internal data traffic
big data often enters the company ‘through the back door’
data growth of 42 percent by the end of 2014
a lot of work on the storage infrastructure side
mid-sized companies (500-999 employees) and large enterprises (1,000 employees or more)
More than a third expect cost savings. Almost half expect better insight into customers’ information and consumption behavior
high expectations placed on service providers and solution vendors
tags: datascience bigdata cloud technology opinion
it has become synonymous with big data
Is the enterprise buying into a technology whose best day has already passed?
Hadoop’s inspiration – Google’s MapReduce
make big data processing approachable to Google’s typical user/developer
Hadoop Distributed File System and Hadoop MapReduce — was born in the image of GFS and GMR
Your code is turned into map and reduce jobs, and Hadoop runs those jobs for you
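For readers who have never seen the map/reduce split, here is a minimal single-process sketch of the idea in Python. It is not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are made up for illustration, and a real framework would distribute these steps across machines.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Combine all values for one key into a single result."""
    return word, sum(counts)


def run_job(documents):
    """Run map, shuffle and reduce over a dict of doc_id -> text."""
    pairs = [
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    ]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


docs = {"d1": "big data is big", "d2": "data about data"}
word_counts = run_job(docs)
print(word_counts)
```

The point is that you only write the two small functions; the grouping and scheduling in between are the framework's job.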
Google evolved. Can Hadoop catch up?
GMR no longer holds such prominence in the Google stack
Here are technologies that I hope will ultimately seed the post-Hadoop era
it will require new, non-MapReduce-based architectures that leverage the Hadoop core (HDFS and Zookeeper) to truly compete with Google
Percolator for incremental indexing and analysis of frequently changing datasets
each time you want to analyze the data (say after adding, modifying or deleting data) you have to stream over the entire dataset
displacing GMR in favor of an incremental processing engine called Percolator
dealing only with new, modified, or deleted documents
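The difference between streaming over the whole dataset and touching only what changed can be sketched with a toy word index. To be clear, this is not how Percolator works internally; the class `IncrementalWordIndex` and the helper `full_recompute` are hypothetical names I made up to contrast the two styles.

```python
from collections import Counter


class IncrementalWordIndex:
    """Keeps global word counts up to date by touching only changed documents."""

    def __init__(self):
        self.docs = {}           # doc_id -> Counter of that document's words
        self.totals = Counter()  # global word counts across all documents

    def delete(self, doc_id):
        """Remove one document's contribution from the totals."""
        old = self.docs.pop(doc_id, None)
        if old is None:
            return
        self.totals.subtract(old)
        for word in old:
            if self.totals[word] <= 0:
                del self.totals[word]  # keep the index free of zero entries

    def upsert(self, doc_id, text):
        """Add a new document or replace an existing one."""
        self.delete(doc_id)
        counts = Counter(text.lower().split())
        self.docs[doc_id] = counts
        self.totals.update(counts)


def full_recompute(documents):
    """Batch style: stream over every document again, whatever changed."""
    totals = Counter()
    for text in documents.values():
        totals.update(text.lower().split())
    return totals


index = IncrementalWordIndex()
index.upsert("d1", "big data is big")
index.upsert("d2", "data about data")
index.upsert("d1", "small data")  # modify: only d1 is reprocessed
index.delete("d2")                # delete: only d2 is reprocessed

# Both approaches agree on the result; they differ in how much work they do.
print(index.totals == full_recompute({"d1": "small data"}))
```

Each update costs work proportional to the changed document, while the batch version costs work proportional to the whole collection.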
Dremel for ad hoc analytics
many interface layers have been built
purpose-built for organized data processing (jobs). It is baked from the core for workflows, not ad hoc exploration
BI/analytics queries are fundamentally ad hoc, interactive, low-latency
I’m not aware of any compelling open source alternatives to Dremel
Pregel for analyzing graph data
certain core assumptions of MapReduce are at fundamental odds with analyzing networks of people, telecommunications equipment, documents and other
petabyte-scale graph processing on distributed commodity machines
Hadoop, which often causes exponential data amplification in graph processing
execute graph algorithms such as SSSP or PageRank in dramatically shorter time
near linear scaling of execution time with graph size
the only viable option in the open source world is Giraph
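To get a feeling for the vertex-centric style that Pregel-like systems use, here is a small single-machine sketch of SSSP (single-source shortest paths) computed in supersteps with message passing. It is an illustration under my own assumptions, not the Pregel or Giraph API; the function name `pregel_sssp` and the toy graph are invented.

```python
import math


def pregel_sssp(edges, source):
    """Vertex-centric SSSP. edges maps vertex -> list of (neighbor, weight)."""
    vertices = set(edges) | {n for outs in edges.values() for n, _ in outs}
    distance = {v: math.inf for v in vertices}

    inbox = {source: [0.0]}  # messages delivered at the start of a superstep
    supersteps = 0
    while inbox:  # halt once no messages are in flight
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < distance[vertex]:
                # The vertex improved, so it notifies its out-neighbors.
                distance[vertex] = best
                for neighbor, weight in edges.get(vertex, []):
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox
        supersteps += 1
    return distance, supersteps


graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2)],
    "c": [("d", 1)],
}
distances, steps = pregel_sssp(graph, "a")
print(distances)
```

Only vertices that received a message do any work in a superstep, which is what makes this style a natural fit for graph algorithms.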
if you’re trying to process dynamic data sets, ad-hoc analytics or graph data structures, Google’s own actions clearly demonstrate better alternatives to the MapReduce paradigm
Percolator, Dremel and Pregel make an impressive trio and comprise the new canon of big data
similar impact on IT as Google’s original big three of GFS, GMR, and BigTable
tags: machinelearning video opinion
The talk is geared toward engineers with no prior knowledge of machine learning, and it’s designed to lay out the basic vocabulary
investigating which techniques they might want to learn more about or implement
tags: datamining datascience books
tags: machinelearning datascience research opinion
machine learning might be in danger of losing its impact because the community as a whole has become quite self-referential
People are probably solving real-world problems using ML methods, but there is little sharing of these results within the community
people focus on existing benchmarks
I think it is wrong to take the main ML conferences and journals as a reference to how much application work is going on
you have to start to publish in the conferences and journals of the application field, not in pure ML conferences
If everything were always very application specific, it would be very hard to transfer knowledge between people working on different applications
it is very hard to publish application related papers at ML conferences
The hype around Big Data and Data Science is pretty big
machine learners are one of three groups who can potentially contribute a lot to this field (the others being data mining people, and computational statisticians)
we’re losing the race to get our share of the cake, mostly to data mining people who have much better expertise on the technological side
machine learning has been a bit too successful in finding an abstract mathematical language
many of my colleagues consider databases as just another file format
as a data scientist, you also need to be able to put your stuff into production, which means dealing with all kinds of enterprise level technology like web services, databases, messaging middleware, and questions of stability and scalability
people consider this extra work as merely “programming”, and something which is outside of the scope of a machine learner
know how to implement an algorithm in an enterprise environment as opposed to an ML-friendly matrix-based language such as MATLAB or R
tags: datascience technology companies
data-science driven analytics solution
funding from some pretty impressive investors
in stealth mode up until now, but is already handling over 10M API requests a day
Freshplum is aiming to use data-driven and mathematical approaches to analytics to help its users direct their sales decisions
Today, when people sell things online – especially virtual goods – they just guess at pricing
We’re being extremely selective at the moment
implementing its solution takes minutes instead of hours or days
tags: datascience technology graphics
tags: datascience technology graphics
private cloud computing has reached the peak level of hype, and cloud/Web platforms are slipping into the “trough of disillusionment” in the face of Platform as a Service (PaaS)
tags: mathematics lectures
tags: r programming datamining
tags: r statistics machinelearning programming
Recursive Partitioning: Tree-structured models
Regularized and Shrinkage Methods
Support Vector Machines and Kernel Methods
Model selection and validation
Elements of Statistical Learning
rattle is a graphical user interface for data mining in R
tags: statistics video lectures
does not require any previous knowledge of statistics
Visualizing relationships in data
conditional probability; Bayes Rule
Normal distributions; the central limit theorem
Sampling distributions; confidence intervals; hypothesis tests
Least squares; residuals; inference
Transformation; smoothing; regression
Statistics vs machine learning
tags: computervision technology
Netverify allows customers to verify their ID online in real time
uses the user’s camera to capture any ID
sophisticated algorithms for identifying faked identification documents
Our security and monitoring team oversees all activities in our systems around the clock: 24 hours a day, 7 days a week, 365 days a year
Netverify maintains its own black list database and also checks identities with a variety of databases like criminal records, credit scoring agencies, etc.
tags: computervision technology
3-D Secure protocol-based authentication services like Verified by Visa, MasterCard SecureCode or SafeKey
authenticate a credit card transaction by scanning the actual payment card with the camera
the technology can analyze the material of the card, the embossed data, holograms and additional card specific security features
10x faster than a keyed transaction
tags: datascience technology
What moves the ball forward is the business team agreeing that the new data is useful and worth analyzing. What moves the ball forward is when the IT team decides how to best make the data available based on the characteristics of the data. Progress is made with a focus on putting the data to work, not on semantics.
whether the data was valuable enough to collect or not has nothing to do with what definitional bucket we might place the data source in
much of what is being associated with “big data” is actually a function of “different data”
change direction and focus the discussion on what the value of the data might be and how it can be leveraged for analysis
tags: datascience software technology
SQL-H, a new query interface to analyze data from Hadoop
Most Hadoop access methods require
Hadoop Distributed File System (HDFS) using technologies such as MapReduce
reduces staffing and training issues required for learning more Hadoop-specific interfaces
The need to extract and store data from Hadoop into other database systems and thereby lose the computing power of Hadoop has been the Achilles heel
Teradata Aster has an advantage over EMC Greenplum, IBM and Oracle
one-third of organizations plan to use Hadoop
tags: datamining r programming video tutorial
practical examples of using R for decision trees, random forests
handy links to data mining resources near the end
tags: computervision books
Seems to be very business-oriented
tags: statistics books
We plan to finish the remaining chapters by the end of 2012
online and free-of-charge
textbooks never make much money anyway — the publishers make all the money
forecasting in business
students studying business
high school mathematics should be sufficient background
tags: machinelearning tutorial todo next
sample code from Kaggle for the randomForest benchmark
include NA indicator values that become a part of your overall model
simply choosing the median of a variable to fill in its NAs is probably not a great model
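The NA indicator idea from the highlights above can be sketched in a few lines. The tutorial itself uses R; this is my own Python rendering, with a made-up helper name `impute_with_indicator` and made-up data, and it assumes missing values are represented as `None`.

```python
from statistics import median


def impute_with_indicator(values):
    """Return (filled_values, was_missing_flags) for one numeric column."""
    observed = [v for v in values if v is not None]
    fill = median(observed)  # plain median fill, as mentioned in the highlights
    filled = [fill if v is None else v for v in values]
    # The indicator column lets a model learn from the missingness itself
    # instead of silently treating imputed values as real observations.
    flags = [1 if v is None else 0 for v in values]
    return filled, flags


ages = [22, None, 35, 58, None, 41]
filled_ages, age_was_na = impute_with_indicator(ages)
print(filled_ages)
print(age_was_na)
```

Both columns would then go into the model, so the median fill is no longer the only information the model has about the missing entries.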
tags: statistics facts
Posted from Diigo. The rest of my favorite links are here.