Modern Massive Data Sets

I’m excited about the Workshop on Modern Massive Data Sets I’ll be attending next week at Stanford.

The 2008 Workshop on Algorithms for Modern Massive Data Sets (MMDS 2008) will address algorithmic, mathematical, and statistical challenges in modern statistical data analysis. The goals of MMDS 2008 are to explore novel techniques for modeling and analyzing massive, high-dimensional, and nonlinearly-structured scientific and internet data sets, and to bring together computer scientists, statisticians, mathematicians, and data analysis practitioners to promote cross-fertilization of ideas.

These are my notes (disclaimer: I’m no expert, corrections are welcome.)

Most algorithm engineers thus far are happy with algorithms that are linear, i.e., \mathcal{O}(n) in the number of data points in the dataset. With web-scale data, linear is not good enough: sub-linear is required. These streams of data are typically unbounded and do not fit in main memory. They cannot even be stored, as it is infeasible to go back and reload them.

The Frequency Problem is used as a model problem over data streams, for example, computing means/variances, medians, top-k frequent items, distinct elements etc. One approach is to approximate the computation over the stream via random sampling.

There are four topics that keep recurring when looking at proofs of streaming algorithms using randomization:

This is really key. As an example, suppose we wanted to run some complicated algorithm (offline) over the packets flowing through a router. As this would a huge number of packets, we won’t have the luxury of storing all packets, but only a subset. By sampling the packets with probability p, we can estimate the amount of memory required to store the packets as the expectation value of the binomial variable with parameter p and n, where n is the number of packets going through the router.

The use of one of the bounds over the other is a matter of how tight the bound is. The Markov inequality only uses the expectation value of the random variable while the other use higher moments. They do this by using the Markov inequality to a moment generating function exp(tX) of the random variable X.

You’ll want to check the preliminary program for the kind of topics that are going to be covered. While you are at it, check out the workshop webpage from 2006. I’ll go find something to wipe the drool off my chin…

One Response to “Modern Massive Data Sets”

  1. Igor Carron says:

    The papers of MMDS are now here:

    http://www.stanford.edu/group/mmds/slides2008/

    Cheers,

    Igor.