Diminishing Returns and Machine Learning: Increasing Gains


May 24, 2016 | Research News

By Stefanie Jegelka, Assistant Professor, MIT Computer Science and Artificial Intelligence Laboratory

“Making sense of data” nowadays underlies progress in technology, science, healthcare, and business. This term is multifaceted, as some example scenarios illustrate. In a string of DNA, we may be looking for the regions that predict a certain disease. In a social network, we are interested in a small set of people who are most influential and will spread a piece of information most rapidly. For environmental monitoring or the IoT, we are faced with the question of which measurements to take to obtain the maximum amount of information possible. To understand or present a collection of images, e.g., from a recent vacation, we may select a few presentable ones, or automatically identify the objects displayed in those images.

Answering such questions relies on a spectrum of tools, from mathematical modeling to optimization and algorithms. Having identified the questions to ask, the next step is to formulate a mathematical model that guides computational methods for solving the task — often by optimizing a loss or utility function.

The resulting procedure needs to be both accurate and efficient enough to be deployed on real-world data. Accuracy relies on two sources: the model needs to represent the characteristics of the given problem accurately, and the associated optimization algorithm needs to be reliable, ideally guaranteeing that the obtained answers maintain a certain quality. Combining the joint goals of efficiency, accuracy, and reliability in understanding data is an ongoing challenge, and it fosters the confluence of statistics, mathematics, optimization, and computer science.

A productive example of this interplay of techniques is a set of recent insights into connections between discrete mathematics and machine learning. The introductory example tasks above arise in different areas and are phrased via different models. Yet, stripped down to an abstract level, they have a surprising amount in common: First, all of them ultimately ask us to identify a “good” subset out of a larger set of candidate items. In the social network example, we seek a small, jointly influential set out of all people in the network. In sensing, we seek a small set of maximally informative measurements.

The actual notion of “good” of course varies across examples: in the social network, the measure of goodness is the influence exerted by the selected subset of people; in sensor placement, it is the information gathered via measurements in the selected locations; in other cases, it is defined via expected errors, probabilities, or costs. This measure of goodness is a “set function” that assigns scores to subsets of items.
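To make the notion of a set function concrete, here is a minimal Python sketch. The sensors and the regions they cover are invented for illustration and are not data from any of the examples above; the point is simply that a set function takes a subset of items and returns a score.

```python
# Hypothetical set function: score each subset of sensors by the number of
# distinct regions it covers. Sensors and regions are invented for illustration.
coverage = {
    "s1": {"lobby", "hall"},
    "s2": {"hall", "lab"},
    "s3": {"lab", "roof"},
}

def score(subset):
    """Set function: number of distinct regions covered by the chosen sensors."""
    covered = set()
    for sensor in subset:
        covered |= coverage[sensor]
    return len(covered)

print(score({"s1"}))        # 2 regions
print(score({"s1", "s2"}))  # 3 regions (the shared "hall" is counted only once)
```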

This unifying view of subset selection problems opens the way for common computational techniques to solve them. Yet, as some readers may remember from a basic class in computational complexity, the problem of selecting a “good” subset can be notoriously difficult.

Figure 1: Diminishing returns in sensing. Adding a new sensor (red) to the smaller set A (blue) yields a larger increase in information than adding it to the larger set B.

But the above examples (and many others) share another important property: the combinatorial concept of submodularity or, in simpler language, diminishing returns. This property says that the more we have, the less we gain from an additional item. Figure 1 shows the problem of placing sensors to maximize information[1] in two situations: (a) adding a new sensor to a small set of existing sensors, or (b) adding the same sensor to a larger set of existing sensors that includes the previous ones. The additional information gained from the new sensor in (b) is never larger than in (a). The submodularity of information plays an equally important role when covering news articles, summarizing videos and image collections, or identifying informative queries to pose to a human expert. Similarly, submodularity underlies our recent methods for finding objects in a collection of images with minimal required input[2]. Submodularity is not limited to utilities: it also occurs in everyday cost functions, as the “free refill” in Figure 2 demonstrates. Here, it expresses notions of “economies of scale.”
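Formally, a set function f has diminishing returns if, for any sets A ⊆ B and any new item s outside B, the gain f(A ∪ {s}) − f(A) is at least f(B ∪ {s}) − f(B). The sketch below checks this inequality for one concrete choice of A, B, and s, mirroring the situation in Figure 1. It uses a toy coverage function with invented sensors and regions, not the sensing model of [1].

```python
# Toy illustration of diminishing returns (submodularity).
# Sensors and the regions they cover are invented for this sketch.
coverage = {
    "a": {"r1", "r2"},
    "b": {"r2", "r3"},
    "c": {"r3", "r4"},
    "new": {"r4", "r5"},
}

def info(subset):
    """Toy 'information': number of distinct regions covered by the chosen sensors."""
    covered = set()
    for name in subset:
        covered |= coverage[name]
    return len(covered)

A = {"a"}              # small existing set (set A in Figure 1)
B = {"a", "b", "c"}    # larger set containing A (set B in Figure 1)
s = "new"              # the added sensor (red in Figure 1)

gain_A = info(A | {s}) - info(A)   # gain from adding s to the small set
gain_B = info(B | {s}) - info(B)   # gain from adding s to the larger set
print(gain_A, gain_B)              # 2 1  -> gain_A >= gain_B, as submodularity requires
```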

Figure 2: Submodular costs.

Submodularity has long been known in combinatorial optimization, information theory, game theory, graph theory, and related fields[3]. The benefits of submodularity for machine learning have been emerging only recently. But what is special about this property, other than its perhaps surprising occurrence in a wide range of problems? This property, as simple as it may appear at first, brings along a rich mathematical structure. This structure is the foundation for computational procedures with quality guarantees for the obtained solution, one of the goals mentioned above. Phrasing machine learning problems in the language of submodularity makes it possible to exploit this beneficial structure.

If, for instance, the measure of goodness (e.g., information) that we maximize exhibits diminishing returns, then a simple greedy strategy comes with guarantees on the quality of the solution. A submodular cost or loss function can be minimized by building on results in continuous (convex) optimization, achieving an optimal solution. Connecting these insights to machine learning and its applications has offered a theoretical explanation for the empirical success of several “heuristics”.
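The greedy strategy in question is classical: when maximizing a monotone submodular function under a constraint on the number of selected items, repeatedly adding the item with the largest marginal gain returns a solution whose value is at least a (1 − 1/e) fraction, roughly 63%, of the optimum. The sketch below is a generic version of that greedy loop, not code from the cited papers; the set function f and the toy coverage data are assumptions for illustration, and f is assumed to be monotone and submodular.

```python
# Generic greedy maximization of a monotone submodular set function f,
# selecting at most k items. A standard sketch, not the authors' code.

def greedy_max(f, ground_set, k):
    """Pick k items, each time adding the item with the largest marginal gain."""
    selected = set()
    for _ in range(k):
        best_item, best_gain = None, float("-inf")
        for item in ground_set - selected:
            gain = f(selected | {item}) - f(selected)
            if gain > best_gain:
                best_item, best_gain = item, gain
        if best_item is None:   # nothing left to add
            break
        selected.add(best_item)
    return selected

# Usage with a toy coverage function (invented data):
coverage = {"s1": {"a", "b"}, "s2": {"b", "c"}, "s3": {"c", "d", "e"}, "s4": {"a"}}
f = lambda S: len(set().union(*(coverage[x] for x in S))) if S else 0
print(greedy_max(f, set(coverage), k=2))   # {'s3', 's1'}, covering 5 regions
```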

The properties of submodular functions do more than explain existing heuristics; importantly, they also inspire new models and methods. Figure 3 illustrates another example from our work in computer vision[4].

Figure 3: Example application: segmentation. (a) Input image; (b) segmenting by color; (c) classical model: neighboring pixels tend to have the same label (foreground/background); (d) new model from submodularity: diminishing marginal penalties if the boundary of the object is homogeneous.

At the same time, these developments and the demands of machine learning applications pose new challenges to established theory.

Are there practically efficient, scalable algorithms that retain the desired theoretical guarantees on realistic data sets? After all, diminishing returns often imply nontrivial, nonlinear dependencies between the items of interest, such as the interplay of information measured by different sensors, or correlations between different parts of a visual object. By carefully exploiting mathematical properties, we were able to devise parallel algorithms for submodular problems that indeed scale to large data, while retaining all desirable guarantees[5,6].
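The parallel algorithms themselves are described in [5,6]. As a simpler flavor of how submodular structure enables speedups, the sketch below shows the well-known “lazy greedy” acceleration (a standard trick, not the authors' method): because marginal gains can only shrink as the selected set grows, stale gains stored in a priority queue are valid upper bounds, and most candidates never need to be re-evaluated.

```python
import heapq

# "Lazy greedy": exploits diminishing returns so that only the top candidate's
# marginal gain needs to be recomputed in each step. A standard sketch, not [5,6].

def lazy_greedy_max(f, ground_set, k):
    current_value = f(set())
    selected = set()
    # Max-heap (via negated gains) of stale marginal gains, exact for the empty set.
    heap = [(-(f({item}) - current_value), item) for item in ground_set]
    heapq.heapify(heap)
    while len(selected) < k and heap:
        _, item = heapq.heappop(heap)
        fresh_gain = f(selected | {item}) - current_value   # re-check only the top item
        if not heap or fresh_gain >= -heap[0][0]:           # still beats every upper bound
            selected.add(item)
            current_value += fresh_gain
        else:
            heapq.heappush(heap, (-fresh_gain, item))       # demote with its updated gain
    return selected
```

On large ground sets this typically evaluates f far fewer times than the plain greedy loop, while returning a solution with the same approximation guarantee.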

Moreover, new uses of submodularity, such as the example in Figure 3, give rise to new problem settings that existing methods do not address. In recent work, we developed practically efficient algorithms for solving such problems. Besides computer vision, our algorithms have been used for tasks as diverse as facilitating diffusion in social networks and aerial exploration.

While for problems like the example in Figure 3 no efficient algorithm can always guarantee an optimal solution[4,7], empirical and initial theoretical results reveal a curious discrepancy: often, the empirical results are much better than the theory predicts. Our work helps close this apparent gap by showing how to distinguish an “average” case from the worst case, and by quantifying how often we are guaranteed to obtain much better solutions than in the worst-case scenario[8].

These results indicate that identifying and exploiting submodularity in machine learning opens avenues towards more accurate mathematical models for making sense of data, and towards practically scalable algorithms with quality guarantees for a spectrum of tasks. While these observations point to new directions for theory and applications, they are only one example of exploiting mathematical structure in combinatorial machine learning problems, a direction we continue to pursue. Indeed, the confluence of disciplines and viewpoints often has increasing returns.

References

[1] A. Krause and C. Guestrin. Optimizing Sensing: from Water to the Web. IEEE Computer Magazine, 2009.

[2] H. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui and T. Darrell. On learning to localize objects with minimal supervision. International Conference on Machine Learning (ICML), 2014.

[3] L. Lovász. Submodular Functions and Convexity. In: Mathematical Programming — The State of the Art, pp. 235–257, 1983.

[4] S. Jegelka and J. Bilmes. Submodularity beyond Submodular Energies: Coupling Edges in Graph Cuts. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.

[5] S. Jegelka, F. Bach and S. Sra. Reflection Methods for User-Friendly Submodular Optimization. Conference on Neural Information Processing Systems (NIPS), 2013.

[6] X. Pan, S. Jegelka, J. Gonzalez, J. Bradley and M.I. Jordan. Parallel Double Greedy Submodular Maximization. Conference on Neural Information Processing Systems (NIPS), 2014.

[7] G. Goel, C. Karande, P. Tripathi, and L. Wang. Approximability of combinatorial problems with multi-agent submodular cost functions. IEEE Symposium on Foundations of Computer Science (FOCS), 2009.

[8] R. Iyer, S. Jegelka and J. Bilmes. Curvature and Optimal Algorithms for Learning and Minimizing Submodular Functions. Conference on Neural Information Processing Systems (NIPS), 2013.