Willsky & team demonstrate significance of tree shape in reading data

August 25, 2010

The MIT News Office reported today on work by EECS professor Alan Willsky, Director of the Laboratory for Information and Decision Systems (LIDS), and a team from the Stochastic Systems Group describing how much data is needed to reliably infer tree-structured patterns, a question significant for a wide variety of data analysis applications in science and engineering.

As described in the News Office article: "In most cases, more data means more reliable inference of patterns. But how much data is enough? EECS graduate student Vincent Tan and his colleagues in the Stochastic Systems Group have taken the first steps toward answering that question.

Tan, Willsky and Animashree Anandkumar, a postdoc in Willsky’s group, envision data sets as what mathematicians call graphs. A graph is anything with nodes and edges: Nodes are generally depicted as circles and edges as lines connecting them. A typical diagram of a communications network, where the nodes represent electronic devices and the edges represent communications links, is a graph.

In the MIT researchers’ work, however, the nodes represent data and the edges correlations between them. For instance, one node might represent asthma, and the others could be a host of environmental, physiological and genetic factors. Some of the factors might be correlated with asthma, others not; other factors might be correlated with each other but not with asthma. Moreover, the edges can have different weights: The strength of the correlations can vary. From this perspective, a computer charged with pattern recognition is given a bunch of nodes and asked to infer the weights of the edges between them."
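The graph view described above can be sketched in a few lines of code. This is purely illustrative: the factor names and correlation weights below are invented for the example, not taken from the researchers' data.

```python
# Nodes are variables; weighted edges are correlations between them.
# All names and weights here are made up for illustration.
nodes = ["asthma", "pollen", "smoking", "gene_X"]

# Edge weights in [-1, 1]: strength of correlation between two variables.
edges = {
    ("asthma", "pollen"): 0.6,   # correlated with asthma
    ("asthma", "smoking"): 0.4,  # correlated with asthma
    ("pollen", "gene_X"): 0.2,   # correlated with each other, not with asthma
}

def neighbors(node):
    """Return (other_node, weight) pairs for every edge touching `node`."""
    return [(b if a == node else a, w)
            for (a, b), w in edges.items()
            if node in (a, b)]

print(neighbors("asthma"))  # the factors directly correlated with asthma
```

A pattern-recognition system in this setting would be handed the nodes and samples of their values, and asked to infer the edge weights.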

In a May 2010 IEEE Transactions on Signal Processing paper, Willsky and Tan demonstrated that data trees with a "star" pattern, in which one central node is connected to all the others, are the hardest to recognize: their shape can't be inferred without lots of data. Trees that form "chains," where each node is linked to at most two others, are the easiest to recognize.
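The two extremal tree shapes can be sketched as edge lists to make the structural contrast concrete; the node count and helper functions here are illustrative, not from the paper.

```python
def star(n):
    """Star tree on nodes 0..n-1: node 0 is a hub connected to all others."""
    return [(0, i) for i in range(1, n)]

def chain(n):
    """Chain tree on nodes 0..n-1: each node links to at most two neighbors."""
    return [(i, i + 1) for i in range(n - 1)]

def max_degree(edge_list, n):
    """Largest number of edges touching any single node."""
    deg = [0] * n
    for a, b in edge_list:
        deg[a] += 1
        deg[b] += 1
    return max(deg)

# Both trees on n nodes have exactly n-1 edges, but very different shapes:
n = 6
print(max_degree(star(n), n))   # the hub has degree n-1
print(max_degree(chain(n), n))  # interior chain nodes have degree 2
```

Per the result described above, the high-degree star shape is the hardest structure to learn from samples, while the low-degree chain is the easiest.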

Read more:

MIT News Office, August 24, 2010 article: "Sizing samples: Many scientific disciplines use computers to infer patterns in data. But how much data is enough to ensure that the inferences are right?"

IEEE Transactions on Signal Processing, Vol. 58, No. 5, May 2010: "Learning Gaussian Tree Models: Analysis of Error Exponents and Extremal Structures"