It took me quite a long time to discover that my favorite knowledge-management tool, Diigo, offers a feature for posting one's bookmarks to a blog. Since I often feel the urge to repost links I stumble upon, I will do that occasionally from now on, mostly drawing on the topic pool of data mining (and related buzzwords), with flavors ranging from theory to applications and from technology to business. (I can't really do that on social media sites, since it's almost impossible to consume posts there by topic. So blogs aren't really obsolete, at least not yet.)
By the way, Diigo is really awesome: you can highlight text on web pages and add annotations, which helps you understand an article and build a summary on the fly while reading it. In that spirit: if you only want the briefing, read at least the highlights below. (And don't worry, the next episodes will contain less content; this one covers a backlog of a few weeks.)
-
tags: computerscience datascience technology software
-
GraphChi exploits the capacious hard drives
-
a Mac Mini running GraphChi can analyze Twitter’s social graph from 2010—which contains 40 million users and 1.2 billion connections—in 59 minutes
-
The previously published result on this problem took 400 minutes using a cluster of about 1,000 computers
-
graph computation is becoming more and more relevant
-
GraphChi is capable of effectively handling many large-scale graph-computing problems without resorting to cloud-based solutions or supercomputers
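To make the "exploits the capacious hard drives" point a bit more concrete, here is a tiny conceptual sketch in Python (my own toy code, not GraphChi's actual API): the edge list stays on disk and is streamed through in chunks, so memory only ever holds the vertex values plus one chunk of edges.

```python
# Conceptual sketch of out-of-core graph processing in the spirit of GraphChi
# (not its real API): edges live on disk and are streamed chunk by chunk, so
# RAM only ever holds the vertex values plus one chunk of edges.
from collections import defaultdict
from itertools import islice

def stream_edges(path, chunk_size=1_000_000):
    """Yield lists of (src, dst) pairs read lazily from an edge-list file on disk."""
    with open(path) as f:
        while True:
            chunk = [tuple(map(int, line.split())) for line in islice(f, chunk_size)]
            if not chunk:
                return
            yield chunk

def out_degrees(path):
    """One full pass over the on-disk edge list; memory stays proportional to #vertices."""
    degree = defaultdict(int)
    for chunk in stream_edges(path):
        for src, _dst in chunk:
            degree[src] += 1
    return degree

# degrees = out_degrees("twitter_2010_edges.txt")  # hypothetical edge-list file
```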
-
Google’s Percolator paper
tags: google datascience bigdata technology
-
MapReduce and other batch-processing systems cannot process small updates individually
-
Percolator, a system for incrementally processing updates to a large data set
-
tags: bigdata datascience technology opinion
-
the topic of big data is still at an early stage
-
still in the analysis and planning phase
-
availability of new analytics and database technologies
-
dynamic increase in internal company data traffic
-
big data often enters companies "through the back door"
-
data growth of 42 percent by the end of 2014
-
a lot of work on the storage-infrastructure side
-
midsize companies (500-999 employees) and large enterprises (1,000 or more employees)
-
More than a third expect cost savings. Almost half anticipate better insights into their customers' information and consumption behavior
-
high expectations placed on service providers and solution vendors
-
tags: datascience bigdata cloud technology opinion
-
it has become synonymous with big data
-
Is the enterprise buying into a technology whose best day has already passed?
-
Hadoop’s inspiration – Google’s MapReduce
-
make big data processing approachable to Google’s typical user/developer
-
Hadoop, comprising the Hadoop Distributed File System and Hadoop MapReduce, was born in the image of GFS and GMR
-
Your code is turned into map and reduce jobs, and Hadoop runs those jobs for you
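For anyone who has never written a Hadoop job: "your code is turned into map and reduce jobs" boils down to expressing a computation as two functions. Here is a minimal single-machine sketch of the programming model (word count) in plain Python rather than Hadoop's Java API; Hadoop's real contribution is running these two functions distributed, shuffled, and fault-tolerant over HDFS.

```python
# Minimal single-machine sketch of the MapReduce programming model (word count).
# On Hadoop you would write the same two functions as Mapper/Reducer classes and
# the framework would shuffle, distribute, and re-run them on failure.
from collections import defaultdict

def map_fn(_key, line):
    for word in line.split():
        yield word, 1                      # emit (word, 1) for every occurrence

def reduce_fn(word, counts):
    yield word, sum(counts)                # sum all partial counts per word

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)             # the "shuffle" phase: group values by key
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return dict(kv for k, vs in groups.items() for kv in reduce_fn(k, vs))

lines = enumerate(["big data is big", "data about data"])
print(run_mapreduce(lines, map_fn, reduce_fn))   # {'big': 2, 'data': 3, ...}
```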
-
Google evolved. Can Hadoop catch up?
-
GMR no longer holds such prominence in the Google stack
-
Here are technologies that I hope will ultimately seed the post-Hadoop era
-
it will require new, non-MapReduce-based architectures that leverage the Hadoop core (HDFS and Zookeeper) to truly compete with Google
-
Percolator for incremental indexing and analysis of frequently changing datasets
-
each time you want to analyze the data (say after adding, modifying or deleting data) you have to stream over the entire dataset
-
displacing GMR in favor of an incremental processing engine called Percolator
-
dealing only with new, modified, or deleted documents
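A toy illustration of that batch-versus-incremental contrast (just the idea, nothing like Percolator's actual observer/transaction machinery): instead of rebuilding an index over the whole corpus after every change, the incremental version applies only the delta for documents that were added, modified, or deleted.

```python
# Toy contrast between batch recomputation and incremental maintenance of an
# inverted index (illustrating the idea behind Percolator, not its real API).
from collections import defaultdict

def build_index(corpus):
    """Batch style: stream over the entire dataset every time it changes."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def apply_update(index, corpus, doc_id, new_text):
    """Incremental style: touch only the document that changed."""
    old_text = corpus.get(doc_id, "")
    for word in old_text.split():            # remove stale postings
        index[word].discard(doc_id)
    if new_text is None:                      # deletion
        corpus.pop(doc_id, None)
        return
    corpus[doc_id] = new_text
    for word in new_text.split():             # add fresh postings
        index[word].add(doc_id)

corpus = {1: "hadoop batch processing", 2: "incremental web index"}
index = build_index(corpus)
apply_update(index, corpus, 1, "percolator incremental processing")
print(sorted(index["incremental"]))           # [1, 2]
```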
-
Dremel for ad hoc analytics
-
many interface layers have been built
-
purpose-built for organized data processing (jobs). It is baked from the core for workflows, not ad hoc exploration
-
BI/analytics queries are fundamentally ad hoc, interactive, low-latency
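To make "ad hoc, interactive, low-latency" a bit more tangible, here is a tiny sketch over assumed in-memory column arrays (nothing like Dremel's real engine): the query touches only the columns it needs and answers immediately, and the analyst can change the question without setting up a new job.

```python
# Sketch of ad hoc, interactive aggregation over a columnar layout (toy code,
# not Dremel): read just the two columns a query needs and answer in one call.
from collections import defaultdict

# Hypothetical event table stored column-wise.
columns = {
    "country": ["DE", "US", "DE", "FR", "US"],
    "latency_ms": [120, 80, 95, 240, 60],
    "bytes": [10_000, 52_000, 7_500, 90_000, 3_200],
}

def adhoc_avg(group_col, value_col):
    """Ad hoc GROUP BY / AVG over exactly two columns; no job setup, no full-row scan."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, value in zip(columns[group_col], columns[value_col]):
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Interactive use: the analyst changes the question, not the pipeline.
print(adhoc_avg("country", "latency_ms"))
print(adhoc_avg("country", "bytes"))
```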
-
I’m not aware of any compelling open source alternatives to Dremel
-
Pregel for analyzing graph data
-
certain core assumptions of MapReduce are at fundamental odds with analyzing networks of people, telecommunications equipment, documents and other graph data structures
-
petabyte-scale graph processing on distributed commodity machines
-
Hadoop, which often causes exponential data amplification in graph processing
-
execute graph algorithms such as SSSP (single-source shortest paths) or PageRank in dramatically shorter time
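For readers who have never seen the vertex-centric ("think like a vertex") model, here is a minimal single-machine PageRank sketch, using a plain Python dict as the graph rather than Pregel's or Giraph's actual API: in every superstep each vertex combines its incoming messages, updates its value, and sends new messages along its out-edges.

```python
# Minimal single-machine sketch of the Pregel "think like a vertex" model,
# shown on PageRank (not Giraph's or Pregel's real API).
def pregel_pagerank(graph, damping=0.85, supersteps=20):
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}                 # initial vertex values
    messages = {v: [] for v in graph}
    for _ in range(supersteps):
        outbox = {v: [] for v in graph}
        for v, neighbors in graph.items():
            incoming = sum(messages[v])                 # combine messages from last superstep
            rank[v] = (1 - damping) / n + damping * incoming
            if neighbors:                               # send rank share along out-edges
                share = rank[v] / len(neighbors)
                for u in neighbors:
                    outbox[u].append(share)
        messages = outbox                               # barrier: next superstep's inbox
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}       # toy directed graph
print(pregel_pagerank(graph))
```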
-
near linear scaling of execution time with graph size
-
the only viable option in the open source world is Giraph
-
if you’re trying to process dynamic data sets, ad-hoc analytics or graph data structures, Google’s own actions clearly demonstrate better alternatives to the MapReduce paradigm
-
Percolator, Dremel and Pregel make an impressive trio and comprise the new canon of big data
-
similar impact on IT as Google’s original big three of GFS, GMR, and BigTable