Sunday, July 29, 2012
tags: machinelearning books
Information Theory, Inference, and Learning Algorithms
640 pages, Published September 2003
PDF (A4), 9M (fourth printing, March 2005)
tags: r statistics programming
formerly named R Cookbook
It is not related to Paul Teetor’s excellent R Cookbook
When are we done defining big data? p.1
tags: datascience bigdata opinion
settled on 3 Vs -- volume, variety, and velocity
if big data is understood solely on the basis of these trends, it isn’t clear that it’s at all hype-worthy
if “big data” simply describes the volume, variety, and velocity of the information that constitutes it, our existing data management practices are still arguably up to the task
big data is hyped on the basis of its real or imagined outputs
a lot more interesting when you bring in ‘V’ for value
When are we done defining data science?
tags: datascience statistics opinion
the skills of a “data scientist” are those of a modern statistician
know how to move data around and manipulate data with some programming language
know how to draw informative pictures of data
Knowledge of stats, error bars, confidence intervals
try to get people from different backgrounds
Great communication skills
a lot of what we teach The Kids now looks a lot more like machine learning than statistics as it was taught circa 1970, or even circa 1980
Everything I know about statistics I’ve learned without formal instruction
is not, in my experience, intrinsically hard for anyone who already has a decent grounding in some other mathematical science
mastering them really does mean trying to do things and failing
potentially hazardous. This is the idea that all that really matters is being “smart”
counter-productive for students to attribute their success or failure in learning about something to an innate talent
Bill Franks is Chief Analytics Officer at Teradata
tags: datascience bigdata books
tags: datascience opinion bigdata
Some folks like to confuse Hadoop with big data
Focus On the Questions To Ask, Not The Answers
The failure of data warehouses to provide real-time data led to the creation of data marts
Data marts failed to provide complete and updated and comprehensive views
existing solutions still don’t solve the problem. Why? The market and business environment have changed
Data moves from structured to unstructured. Sources exponentially proliferate. Data quality is paramount.
Real-time is irrelevant because speed does not trump fidelity. Quantity does not trump quality
Business questions remained unanswered despite the massive number of reports and views and charts
The big shift is about moving from data to decisions
tags: machinelearning video lectures
Draft videos (editing incomplete)
Entropy and Data Compression
Shannon’s Source Coding Theorem
Inference and Information Measures for Noisy Channels
Introduction to Bayesian Inference
Approximating Probability Distributions
Posted from Diigo. The rest of my favorite links are here.
Friday, July 20, 2012
It took me quite a long time to discover that my favorite knowledge management tool, Diigo, can post one’s bookmarks to a blog. Since I have often wanted to repost links I stumbled upon, I will do that occasionally from now on, mainly on topics around data mining (and related buzzwords), with flavors ranging from theory to applications, from technology to business. (I can’t really do that on social media sites, as it’s almost impossible to consume posts there by topic. So blogs aren’t really obsolete, yet.)
Btw, Diigo is really awesome: you can highlight text on webpages and add annotations to help you understand an article, creating a summary on the fly while reading it. In this sense: if you want to be briefed, read at least this. (And don’t worry, the next episodes will contain less content; this one ranges back a few weeks.)
tags: computerscience datascience technology software
GraphChi exploits the capacious hard drives
a Mac Mini running GraphChi can analyze Twitter’s social graph from 2010—which contains 40 million users and 1.2 billion connections—in 59 minutes
The previous published result on this problem took 400 minutes using a cluster of about 1,000 computers
graph computation is becoming more and more relevant
GraphChi is capable of effectively handling many large-scale graph-computing problems without resorting to cloud-based solutions or supercomputers
Google’s Percolator paper
tags: google datascience bigdata technology
MapReduce and other batch-processing systems cannot process small updates individually
Percolator, a system for incrementally processing updates to a large data set
tags: bigdata datascience technology opinion
the topic of big data is still at an early stage
still in the analysis and planning phase
availability of new analytics and database technologies
dynamic growth of company-internal data traffic
big data often enters the company ‘through the back door’
data growth of 42 percent by the end of 2014
a lot of work on the storage infrastructure side
mid-sized companies (500-999 employees) and large enterprises (1,000 employees or more)
More than a third expect cost savings. Almost half hope for better insight into customers’ information and consumption behavior
high expectations placed on service providers and solution vendors
tags: datascience bigdata cloud technology opinion
it has become synonymous with big data
Is the enterprise buying into a technology whose best day has already passed?
Hadoop’s inspiration – Google’s MapReduce
make big data processing approachable to Google’s typical user/developer
Hadoop Distributed File System and Hadoop MapReduce — was born in the image of GFS and GMR
Your code is turned into map and reduce jobs, and Hadoop runs those jobs for you
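To make the map/reduce idea concrete, here is a minimal single-process sketch of a word count job. It is not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, run_job) and the sample documents are made up for illustration, and a real framework would distribute each phase across machines.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    """Collapse all values for one key into a single result."""
    return word, sum(counts)

def run_job(documents):
    """Run map, shuffle, and reduce over the whole dataset in one batch."""
    pairs = [kv for doc_id, text in documents.items()
             for kv in map_phase(doc_id, text)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical input: document id -> text
docs = {"a": "big data big hype", "b": "data science"}
counts = run_job(docs)
print(counts)
```

Note that run_job always reads every document, which is the batch-oriented behavior the rest of the article takes issue with.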
Google evolved. Can Hadoop catch up?
GMR no longer holds such prominence in the Google stack
Here are technologies that I hope will ultimately seed the post-Hadoop era
it will require new, non-MapReduce-based architectures that leverage the Hadoop core (HDFS and Zookeeper) to truly compete with Google
Percolator for incremental indexing and analysis of frequently changing datasets
each time you want to analyze the data (say after adding, modifying or deleting data) you have to stream over the entire dataset
displacing GMR in favor of an incremental processing engine called Percolator
dealing only with new, modified, or deleted documents
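The difference between streaming over the entire dataset and handling only changed documents can be sketched in a few lines. This is only an illustration of incremental processing in general, not of how Percolator itself is built; the class IncrementalWordCount and its methods are hypothetical names.

```python
from collections import Counter

class IncrementalWordCount:
    """Keeps global word counts up to date by touching only changed documents."""

    def __init__(self):
        self.docs = {}           # doc_id -> Counter of that document's words
        self.totals = Counter()  # global word counts across all documents

    def delete(self, doc_id):
        """Remove one document's contribution from the global counts."""
        old = self.docs.pop(doc_id, None)
        if old is None:
            return
        self.totals.subtract(old)
        for word in old:
            if self.totals[word] <= 0:
                del self.totals[word]  # drop words that no longer occur

    def upsert(self, doc_id, text):
        """Add a new document or replace a modified one."""
        self.delete(doc_id)
        words = Counter(text.lower().split())
        self.docs[doc_id] = words
        self.totals.update(words)

index = IncrementalWordCount()
index.upsert("a", "big data big hype")   # new document
index.upsert("b", "data science")        # new document
index.upsert("a", "big data")            # modified document
index.delete("b")                        # deleted document
print(dict(index.totals))
```

Each update costs work proportional to the size of the changed document, not to the size of the whole collection, which is the point the highlights above are making.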
Dremel for ad hoc analytics
many interface layers have been built
purpose-built for organized data processing (jobs). It is baked from the core for workflows, not ad hoc exploration
BI/analytics queries are fundamentally ad hoc, interactive, low-latency
I’m not aware of any compelling open source alternatives to Dremel
Pregel for analyzing graph data
certain core assumptions of MapReduce are at fundamental odds with analyzing networks of people, telecommunications equipment, documents and other
petabyte-scale graph processing on distributed commodity machines
Hadoop, which often causes exponential data amplification in graph processing
execute graph algorithms such as SSSP or PageRank in dramatically shorter time
near linear scaling of execution time with graph size
the only viable option in the open source world is Giraph
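For a feeling of what vertex-centric graph processing looks like, here is a small single-machine sketch of single-source shortest paths (SSSP) in the superstep-and-messages style associated with Pregel. It is a toy under stated assumptions (non-negative edge weights, everything in memory) and not the API of Pregel or Giraph; pregel_sssp and the sample graph are hypothetical.

```python
import math

def pregel_sssp(edges, source):
    """edges maps a vertex to a list of (neighbor, weight) pairs.

    Returns the shortest distances from source and the number of supersteps.
    """
    vertices = set(edges)
    for neighbors in edges.values():
        vertices.update(n for n, _ in neighbors)
    dist = {v: math.inf for v in vertices}

    inbox = {source: [0]}  # messages to be processed in the next superstep
    supersteps = 0
    while inbox:           # stop once no messages are in flight
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < dist[vertex]:
                # The vertex improved, so it tells its neighbors about it.
                dist[vertex] = best
                for neighbor, weight in edges.get(vertex, []):
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox
        supersteps += 1
    return dist, supersteps

# Hypothetical directed graph with weighted edges
graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": [("d", 1)]}
dist, steps = pregel_sssp(graph, "a")
print(dist, steps)
```

Only vertices that receive messages do any work in a superstep, so the computation follows the graph structure instead of repeatedly rewriting the whole dataset.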
if you’re trying to process dynamic data sets, ad-hoc analytics or graph data structures, Google’s own actions clearly demonstrate better alternatives to the MapReduce paradigm
Percolator, Dremel and Pregel make an impressive trio and comprise the new canon of big data
similar impact on IT as Google’s original big three of GFS, GMR, and BigTable
Continue reading "Link roundup, week 29/2012"