New device can control light at unprecedented speeds

In a scene from “Star Wars: Episode IV — A New Hope,” R2-D2 projects a three-dimensional hologram of Princess Leia making a desperate plea for help. That scene, filmed more than 45 years ago, involved a bit of movie magic — even today, we don’t have the technology to create such realistic and dynamic holograms.

Generating a freestanding 3D hologram would require extremely precise and fast control of light beyond the capabilities of existing technologies, which are based on liquid crystals or micromirrors.

An international group of researchers, led by a team at MIT, spent more than four years tackling this problem of high-speed optical beam forming. They have now demonstrated a programmable, wireless device that can control light, such as by focusing a beam in a specific direction or manipulating the light’s intensity, and do it orders of magnitude more quickly than commercial devices.

They also pioneered a fabrication process that ensures the device quality remains near-perfect when it is manufactured at scale. This would make their device more feasible to implement in real-world settings.

Known as a spatial light modulator, the device could be used to create super-fast lidar (light detection and ranging) sensors for self-driving cars, which could image a scene about a million times faster than existing mechanical systems. It could also accelerate brain scanners, which use light to “see” through tissue. By being able to image tissue faster, the scanners could generate higher-resolution images that aren’t affected by noise from dynamic fluctuations in living tissue, like flowing blood.

“We are focusing on controlling light, which has been a recurring research theme since antiquity. Our development is another major step toward the ultimate goal of complete optical control — in both space and time — for the myriad applications that use light,” says lead author Christopher Panuski PhD ’22, who recently completed his doctorate in electrical engineering and computer science.

The paper is a collaboration between researchers at MIT; Flexcompute, Inc.; the University of Strathclyde; the State University of New York Polytechnic Institute; Applied Nanotools, Inc.; the Rochester Institute of Technology; and the U.S. Air Force Research Laboratory. The senior author is Dirk Englund, an associate professor of electrical engineering and computer science at MIT and a researcher in the Research Laboratory of Electronics (RLE) and Microsystems Technology Laboratories (MTL). The research is published today in Nature Photonics.

Manipulating light

A spatial light modulator (SLM) is a device that manipulates light by controlling its emission properties. Similar to an overhead projector or computer screen, an SLM transforms a passing beam of light, focusing it in one direction or refracting it to many locations for image formation.

Inside the SLM, a two-dimensional array of optical modulators controls the light. But light wavelengths are only a few hundred nanometers, so to precisely control light at high speeds the device needs an extremely dense array of nanoscale controllers. The researchers used an array of photonic crystal microcavities to achieve this goal. These photonic crystal resonators allow light to be controllably stored, manipulated, and emitted at the wavelength scale.

When light enters a cavity, it is held for about a nanosecond, bouncing around more than 100,000 times before leaking out into space. While a nanosecond is only one billionth of a second, this is enough time for the device to precisely manipulate the light. By varying the reflectivity of a cavity, the researchers can control how light escapes. Simultaneously controlling the array modulates an entire light field, so the researchers can quickly and precisely steer a beam of light.
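As a rough sanity check, the quoted lifetime and bounce count are mutually consistent. The few-micron round-trip path length assumed below is illustrative only (a plausible scale for a wavelength-scale cavity, not a figure from the paper):

```python
# Back-of-the-envelope check: how many round trips does a photon
# make in ~1 ns inside a wavelength-scale cavity?
c = 3e8                      # speed of light, m/s
round_trip_length = 3e-6     # assumed cavity round-trip path, m (illustrative)
lifetime = 1e-9              # photon storage time, s (~1 nanosecond)

round_trip_time = round_trip_length / c   # ~1e-14 s per bounce
bounces = lifetime / round_trip_time
print(f"{bounces:,.0f} round trips")      # prints: 100,000 round trips
```

With those assumed numbers, the photon indeed circulates on the order of 100,000 times before escaping.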

“One novel aspect of our device is its engineered radiation pattern. We want the reflected light from each cavity to be a focused beam because that improves the beam-steering performance of the final device. Our process essentially makes an ideal optical antenna,” Panuski says.

To achieve this goal, the researchers developed a new algorithm to design photonic crystal devices that form light into a narrow beam as it escapes each cavity, he explains.

Using light to control light

The team used a micro-LED display to control the SLM. The LED pixels line up with the photonic crystals on the silicon chip, so turning on one LED tunes a single microcavity. When a laser hits that activated microcavity, the cavity responds differently to the laser based on the light from the LED.

“This application of high-speed LED-on-CMOS displays as micro-scale optical pump sources is a perfect example of the benefits of integrated photonic technologies and open collaboration. We have been thrilled to work with the team at MIT on this ambitious project,” says Michael Strain, professor at the Institute of Photonics of the University of Strathclyde.  

The use of LEDs to control the device means the array is not only programmable and reconfigurable, but also completely wireless, Panuski says.

“It is an all-optical control process. Without metal wires, we can place devices closer together without worrying about absorption losses,” he adds.

Figuring out how to fabricate such a complex device in a scalable fashion was a years-long process. The researchers wanted to use the same techniques that create integrated circuits for computers, so the device could be mass produced. But microscopic deviations occur in any fabrication process, and with micron-sized cavities on the chip, those tiny deviations could lead to huge fluctuations in performance.

The researchers partnered with the Air Force Research Laboratory to develop a highly precise mass-manufacturing process that stamps billions of cavities onto a 12-inch silicon wafer. Then they incorporated a postprocessing step to ensure the microcavities all operate at the same wavelength.

“Getting a device architecture that would actually be manufacturable was one of the huge challenges at the outset. I think it only became possible because Chris worked closely for years with Mike Fanto and a wonderful team of engineers and scientists at AFRL, AIM Photonics, and with our other collaborators, and because Chris invented a new technique for machine vision-based holographic trimming,” says Englund.

For this “trimming” process, the researchers shine a laser onto the microcavities. The laser heats the silicon to more than 1,000 degrees Celsius, creating silicon dioxide, or glass. The researchers created a system that blasts all the cavities with the same laser at once, adding a layer of glass that perfectly aligns the resonances — that is, the natural frequencies at which the cavities resonate.

“After modifying some properties of the fabrication process, we showed that we were able to make world-class devices in a foundry process that had very good uniformity. That is one of the big aspects of this work — figuring out how to make these manufacturable,” Panuski says.

The device demonstrated near-perfect control — in both space and time — of an optical field with a joint “spatiotemporal bandwidth” 10 times greater than that of existing SLMs. Being able to precisely control a huge bandwidth of light could enable devices that can carry massive amounts of information extremely quickly, such as high-performance communications systems.

Now that they have perfected the fabrication process, the researchers are working to make larger devices for quantum control or ultrafast sensing and imaging.

This research was funded, in part, by the Hertz Foundation, the NDSEG Fellowship Program, the Schmidt Postdoctoral Award, the Israeli Vatat Scholarship, the U.S. Army Research Office, the U.S. Air Force Research Laboratory, the UK’s Engineering and Physical Sciences Research Council, and the Royal Academy of Engineering.

The task of magnetic classification suddenly looks easier

Knowing the magnetic structure of crystalline materials is critical to many applications, including data storage, high-resolution imaging, spintronics, superconductivity, and quantum computing. Information of this sort, however, is difficult to come by. Although magnetic structures can be obtained from neutron diffraction and scattering studies, the number of machines that can support these analyses — and the time available at these facilities — is severely limited.

As a result, the experimentally determined magnetic structures of only about 1,500 materials have been tabulated to date. Researchers have also predicted magnetic structures by numerical means, but these demand lengthy calculations, even on large, state-of-the-art supercomputers. The calculations, moreover, become increasingly expensive, with power demands growing exponentially, as the size of the crystal structures under consideration goes up.

Now, researchers at MIT, Harvard University, and Clemson University — led by Mingda Li, MIT assistant professor of nuclear science and engineering, and Tess Smidt, MIT assistant professor of electrical engineering and computer science — have found a way to streamline this process by employing the tools of machine learning. “This might be a quicker and cheaper approach,” Smidt says.

The team’s results were recently published in the journal iScience. One unusual feature of this paper, apart from its novel findings, is that its first authors are three MIT undergraduates — Helena Merker, Harry Heiberger, and Linh Nguyen — plus one PhD student, Tongtong Liu.

Merker, Heiberger, and Nguyen joined the project as first-years in fall 2020, and they were given a sizable challenge: to design a neural network that can predict the magnetic structure of crystalline materials. They did not start from scratch, however, making use of “equivariant Euclidean neural networks” that were co-invented by Smidt in 2018. The advantage of this kind of network, Smidt explains, “is that we won’t get a different prediction for the magnetic order if a crystal is rotated or translated, which we know should not affect the magnetic properties.” That feature is especially helpful for examining 3D materials.

The elements of structure

The MIT group drew upon a database of nearly 150,000 substances compiled by the Materials Project at the Lawrence Berkeley National Laboratory, which provided information concerning the arrangement of atoms in the crystal lattice. The team used this input to assess two key properties of a given material: magnetic order and magnetic propagation.

Figuring out the magnetic order involves classifying materials into three categories: ferromagnetic, antiferromagnetic, and nonmagnetic. The atoms in a ferromagnetic material act like little magnets with their own north and south poles. Each atom has a magnetic moment, which points from its south to north pole. In a ferromagnetic material, Liu explains, “all the atoms are lined up in the same direction — the direction of the combined magnetic field produced by all of them.” In an antiferromagnetic material, the magnetic moments of the atoms point in a direction opposite to that of their neighbors — canceling each other out in an orderly pattern that yields zero magnetization overall. In a nonmagnetic material, all the atoms could be nonmagnetic, having no magnetic moments whatsoever. Or the material could contain magnetic atoms, but their magnetic moments would point in random directions so that the net result, again, is zero magnetism.

The concept of magnetic propagation relates to the periodicity of a material’s magnetic structure. If you think of a crystal as a 3D arrangement of bricks, a unit cell is the smallest possible building block — the smallest number, and configuration, of atoms that can make up an individual “brick.” If the magnetic moments of every unit cell are aligned, the MIT researchers accorded the material a propagation value of zero. However, if the magnetic moment changes direction, and hence “propagates,” in moving from one cell to the next, the material is given a non-zero propagation value.
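As a toy illustration of these two labels (a hand-written rule for this article, not the team's neural network; moments are given as plain 3-vectors and the tolerance is arbitrary), the order categories and the zero/non-zero propagation value could be sketched as:

```python
def classify_order(moments, tol=1e-6):
    """Toy classifier for magnetic order from per-atom moment vectors.
    Caveat: a nonmagnetic material with randomly oriented moments also
    sums to zero, so distinguishing it from antiferromagnetic order
    needs more than this simple rule."""
    sizes = [sum(c * c for c in m) ** 0.5 for m in moments]
    if all(s < tol for s in sizes):
        return "nonmagnetic"           # no atomic moments at all
    net = [sum(m[i] for m in moments) for i in range(3)]
    if sum(c * c for c in net) ** 0.5 < tol:
        return "antiferromagnetic"     # moments cancel each other out
    return "ferromagnetic"             # moments add up to a net field

def propagation_value_is_zero(cell_a, cell_b, tol=1e-6):
    """Zero propagation: the moment pattern repeats unchanged cell to cell."""
    return all(abs(a[i] - b[i]) < tol
               for a, b in zip(cell_a, cell_b) for i in range(3))
```

For example, `classify_order([(0, 0, 1), (0, 0, -1)])` returns `"antiferromagnetic"`, while two unit cells with identical moment patterns give a propagation value of zero.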

A network solution

So much for the goals. How can machine learning tools help achieve them? The students’ first step was to take a portion of the Materials Project database to train the neural network to find correlations between a material’s crystalline structure and its magnetic structure. The students also learned — through educated guesses and trial-and-error — that they achieved the best results when they included not just information about the atoms’ lattice positions, but also the atomic weight, atomic radius, electronegativity (which reflects an atom’s tendency to attract an electron), and dipole polarizability (which indicates how far the electron is from the atom’s nucleus). During the training process, a large number of so-called “weights” are repeatedly fine-tuned.

“A weight is like the coefficient m in the equation y = mx + b,” Heiberger explains. “Of course, the actual equation, or algorithm, we use is a lot messier, with not just one coefficient but perhaps a hundred; x, in this case, is the input data, and you choose m so that y is predicted most accurately. And sometimes you have to change the equation itself to get a better fit.”
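Heiberger's analogy can be made concrete with a minimal sketch of weight tuning by gradient descent; the data and learning rate below are invented for illustration:

```python
# Fit y = m*x + b by gradient descent on mean squared error.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]      # generated by y = 2x + 1
m, b = 0.0, 0.0                     # the "weights," initially untrained
lr = 0.02                           # learning rate

for _ in range(5000):
    # Gradient of the mean squared error with respect to m and b
    grad_m = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (m * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    m -= lr * grad_m                # nudge each weight downhill
    b -= lr * grad_b

print(round(m, 3), round(b, 3))     # → 2.0 1.0
```

Each pass through the loop is one round of the "fine-tuning" described above; a real network repeats this for thousands or millions of weights at once.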

Next comes the testing phase. “The weights are kept as-is,” Heiberger says, “and you compare the predictions you get to previously established values [also found in the Materials Project database].”

As reported in iScience, the model had an average accuracy of about 78 percent and 74 percent, respectively, for predicting magnetic order and propagation. The accuracy for predicting the order of nonmagnetic materials was 91 percent, even if the material contained magnetic atoms.

Charting the road ahead

The MIT investigators believe this approach could be applied to large molecules whose atomic structures are hard to discern and even to alloys, which lack crystalline structures. “The strategy there is to take as big a unit cell — as big a sample — as possible and try to approximate it as a somewhat disordered crystal,” Smidt says.

The current work, the authors wrote, represents one step toward “solving the grand challenge of full magnetic structure determination.” The “full structure” in this case means determining “the specific magnetic moments of every atom, rather than the overall pattern of the magnetic order,” Smidt explains.

“We have the math in place to take this on,” Smidt adds, “though there are some tricky details to be worked out. It’s a project for the future, but one that appears to be within reach.”

The undergraduates won’t participate in that effort, having already completed their work in this venture. Nevertheless, they all appreciated the research experience. “It was great to pursue a project outside the classroom that gave us the chance to create something exciting that didn’t exist before,” Merker says.

“This research, entirely led by undergraduates, started in 2020 when they were first-years. With Institute support from the ELO [Experiential Learning Opportunities] program and later guidance from PhD student Tongtong Liu, we were able to bring them together even while physically remote from each other. This work demonstrates how we can expand the first-year learning experience to include a real research product,” Li adds. “Being able to support this kind of collaboration and learning experience is what every educator strives for. It is wonderful to see their hard work and commitment result in a contribution to the field.”

“This really was a life-changing experience,” Nguyen agrees. “I thought it would be fun to combine computer science with the material world. That turned out to be a pretty good choice.”

Solving brain dynamics gives rise to flexible machine-learning models

Last year, MIT researchers announced that they had built “liquid” neural networks, inspired by the brains of small species: a class of flexible, robust machine learning models that learn on the job and can adapt to changing conditions, for real-world safety-critical tasks like driving and flying. The flexibility of these “liquid” neural nets promised better decision-making for many tasks involving time-series data, such as brain and heart monitoring, weather forecasting, and stock pricing.

But these models become computationally expensive as their number of neurons and synapses increases, and they require clunky computer programs to solve the underlying, complicated math. Like many physical phenomena, this math becomes harder to solve as the system grows, requiring the computation of many small steps to arrive at a solution.

Now, the same team of scientists has discovered a way to alleviate this bottleneck: by solving the differential equation behind the interaction of two neurons through synapses, they have unlocked a new type of fast and efficient artificial intelligence algorithm. The resulting models have the same characteristics as liquid neural nets — flexible, causal, robust, and explainable — but are orders of magnitude faster, and scalable. Because they are compact and remain adaptable even after training, while many traditional models are fixed, this type of neural net could be used for any task that involves extracting insight from data over time. No closed-form solution had been known since 1907, the year the differential equation of the neuron model was introduced.

The models, dubbed “closed-form continuous-time” (CfC) neural networks, outperformed state-of-the-art counterparts on a slew of tasks, with considerably higher speed and better performance in recognizing human activities from motion sensors, modeling the physical dynamics of a simulated walker robot, and event-based sequential image processing. On a medical prediction task, for example, the new models were 220 times faster on a sampling of 8,000 patients.

A new paper on the work is published today in Nature Machine Intelligence.

“The new machine-learning models we call ‘CfC’s’ replace the differential equation defining the computation of the neuron with a closed form approximation, preserving the beautiful properties of liquid networks without the need for numerical integration,” says MIT EECS Professor Daniela Rus, director of the Computer Science and Artificial Intelligence Laboratory (CSAIL) and senior author on the new paper. “CfC models are causal, compact, explainable, and efficient to train and predict. They open the way to trustworthy machine learning for safety-critical applications.”

Keeping things liquid 

Differential equations enable us to compute the state of the world or a phenomenon as it evolves, but not all the way through time — just step-by-step. To model natural phenomena through time and understand previous and future behavior, such as human activity recognition or a robot’s path, the team reached into a bag of mathematical tricks to find just the ticket: a “closed-form” solution that describes the evolution of the whole system in a single compute step.

With their models, one can compute this equation at any time in the future, and at any time in the past. Not only that, but the speed of computation is much faster because you don’t need to solve the differential equation step-by-step. 
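The difference between stepping and a closed form can be seen in the simplest system that has one: exponential decay, dy/dt = -k*y. This toy equation stands in for the far more complicated neuron model; the numbers below are purely illustrative.

```python
import math

def euler_steps(k, y0, t, steps):
    """Step-by-step numerical integration of dy/dt = -k * y."""
    dt = t / steps
    y = y0
    for _ in range(steps):
        y += dt * (-k * y)          # one small step at a time
    return y

def closed_form(k, y0, t):
    """The closed-form solution y(t) = y0 * exp(-k*t): any time, one shot."""
    return y0 * math.exp(-k * t)

# The stepper needs many iterations to approach what the
# closed form computes directly, at any past or future time.
print(euler_steps(1.0, 1.0, 2.0, 100_000))  # ~0.13533, after 100,000 steps
print(closed_form(1.0, 1.0, 2.0))           # ~0.13533, in one evaluation
```

The payoff is exactly the one described above: the closed form evaluates the state at any time without marching through all the intermediate steps.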

Imagine an end-to-end neural network that receives driving input from a camera mounted on a car. The network is trained to generate outputs, like the car’s steering angle. In 2020, the team showed that a liquid neural network with just 19 nodes (19 neurons plus a small perception module) could drive a car. A differential equation describes each node of that system. Replacing that equation with its closed-form solution inside the network reproduces the same behavior, since the closed form is a good approximation of the system’s actual dynamics. The team can thus solve the problem with an even lower number of neurons, making the network faster and less computationally expensive.

These models can receive inputs as time series (events that happened in time), which could be used for classification, controlling a car, moving a humanoid robot, or forecasting financial and medical events. Across all of these use cases, the models can also increase accuracy, robustness, and performance — and, importantly, computation speed, which sometimes comes as a trade-off.

Solving this equation has far-reaching implications for advancing research in both natural and artificial intelligence systems. “When we have a closed-form description of neurons and synapses’ communication, we can build computational models of brains with billions of cells, a capability that is not possible today due to the high computational complexity of neuroscience models. The closed-form equation could facilitate such grand-level simulations and therefore opens new avenues of research for us to understand intelligence,” says MIT CSAIL Research Affiliate Ramin Hasani, first author on the new paper.

Portable learning

Moreover, there is early evidence that Liquid CfC models can learn a task in one environment from visual inputs and transfer that learned skill to an entirely new environment without additional training. This capability, called out-of-distribution generalization, is one of the most fundamental open challenges of artificial intelligence research.

“Neural network systems based on differential equations are tough to solve and scale to, say, millions and billions of parameters. Getting that description of how neurons interact with each other, not just the threshold, but solving the physical dynamics between cells enables us to build up larger-scale neural networks,” says Hasani. “This framework can help solve more complex machine learning tasks — enabling better representation learning — and should be the basic building blocks of any future embedded intelligence system.”

“Recent neural network architectures, such as neural ODEs and liquid neural networks, have hidden layers composed of specific dynamical systems representing infinite latent states instead of explicit stacks of layers,” says Sildomar Monteiro, AI and Machine Learning Group lead at Aurora Flight Sciences, a Boeing company, who was not involved in this paper. “These implicitly-defined models have shown state-of-the-art performance while requiring far fewer parameters than conventional architectures. However, their practical adoption has been limited due to the high computational cost required for training and inference.” He adds that this paper “shows a significant improvement in the computation efficiency for this class of neural networks … [and] has the potential to enable a broader range of practical applications relevant to safety-critical commercial and defense systems.”

Hasani and Mathias Lechner, a postdoc at MIT CSAIL, wrote the paper supervised by Rus, alongside Alexander Amini, a CSAIL postdoc; Lucas Liebenwein SM ’18, PhD ’21; Aaron Ray, an MIT electrical engineering and computer science PhD student and CSAIL affiliate; Max Tschaikowski, associate professor in computer science at Aalborg University in Denmark; and Gerald Teschl, professor of mathematics at the University of Vienna.

A Satisfying Solve

In a hail of confetti, the members of the MIT programming team stand on a large stage clutching trophies and plaques. Behind them, a huge screen displays the MIT team name.

On November 10, MIT’s team of crack coders made history by winning the globe’s oldest, largest, and most prestigious programming contest: the World Finals of the International Collegiate Programming Contest (ICPC). Held in Dhaka, Bangladesh, the 45th World Finals drew a live audience of over 1,600 viewers to the tense 12-problem competition, which featured 420 contestants representing 140 universities across 45 nations.

The first ICPC World Finals contest was held in 1977, and the second (in 1978) was won by MIT—followed by many, many years of close misses for the team from Cambridge. “We have recently come close with very strong teams, and at times it felt like we might never make it,” said faculty coordinator Martin Rinard, Professor of CS and Engineering within MIT’s Department of Electrical Engineering and Computer Science. “Since I took over the team in 1997, we have won 5 gold medals, 5 silver medals, and 3 bronze medals. We have come in second three times. Overall, it’s a very good record, but it also feels great to finally win!”

140 universities and 45 countries were represented by the teams packed into the colorful ICPC hall. Photo credit: Michael Roytek for ICPC.

That win was the work of many, including admin Mary McDavitt, who dealt with the daunting logistics involved in sending a team of undergraduates halfway around the world, and student coaches Ce Jin and Yinzhan Xu, both PhD students in EECS, who help select the best team to represent MIT. “In addition to the official regional contests held by ICPC, we also organize two selection contests every year especially for MIT students,” says student coach Ce Jin. “MIT students are usually extremely talented, and most of them are already experienced competitive programmers even before joining MIT. For example, the three members on this team all won medals at IOI (International Olympiad in Informatics) during high school!” That team is composed of Xiao Mao ’21 MEng ’22, who has degrees in both computer science and engineering and in mathematics; Jerry Mao, a senior in computer science and engineering; and Mingyang Deng, a junior in computer science and engineering. (Deng also recently competed in and won the 2022 North American Championships of the ICPC, clinching eligibility to attend the 46th annual ICPC World Finals next year.)

In this interview, conducted via email during and immediately after the flight back from Bangladesh, the trio reflected on their historic victory.

First off, congratulations on this incredible win. Tell me a little about how you got in the mental space to compete. What kinds of practices, rituals, and preparation habits do you recommend for this kind of intense, competitive brain work? 

Jerry: The ICPC is certainly intense—and unlike some other programming competitions, in the ICPC there is no such thing as partial credit! As a team, we did many test runs over the months leading to the competition, to iron out those nerves and develop a routine for the real thing.

Xiao: We ran several weekly practice sessions, but they were not optimal, since I already graduated and was in another city. We had to communicate via Zoom and emulate the “one keyboard” environment via communication. However, these difficulties were somewhat of a blessing in disguise, since they forced us to sharpen our communication skills and improve our strategies. 

Contestants toured some of the city of Dhaka, sampling the sights, sounds, and tastes. Photo credit: Randy Piland for ICPC.

Dhaka is a long way from Cambridge! Tell us about your experience of the city.

Jerry Mao: It’s a bustling city: there are people and cars and rickshaws everywhere. We didn’t go too far from where we were staying, because we knew we’d get stuck in the gridlock. ICPC signs were also everywhere around the city, including in the airport, on the roads, and even on the public transport—the world finals were definitely a major event for the city.

Xiao Mao: I did not experience the best traffic situation during our stay, but I still liked the city for many of its offerings and its hospitality! The food was also amazing and so were the people that prepared it. 

Jerry: I certainly enjoyed sampling the tastes, such as a mutton bhuna or a vegetable bhaji.

Mingyang Deng: I didn’t have time to visit many of the sights, but I wandered around the city a bit and had lots of conversations with local teenagers. Dhaka has a vast, visible wealth gap. The young people are aware of this, and hopefully, they can make a better future with their knowledge. 


Many folks may never have seen a programming competition before. Just from a logistical perspective, how do you divide up the work of programming? Is the fastest typist the person who gets the keyboard? Do all three work on separate possible solutions and compare? 

Jerry: All three of us are very experienced competitive programmers, so thankfully typing speed is not something we have to worry about. For most problems, the most challenging part is coming up with the idea of the solution, while programming is just a way to write it down. That’s why our teamwork is built on collaborating to find ideas; there are times when we’d each have partial ideas on a problem, and when we discuss them, we discover that they combine to a full solution.

Xiao: As there was only one keyboard, we had to alternate between coders. When one person was coding, the other two could cross-check each other’s solution. We actually started with some strategy where one person did all the coding and the other did all the thinking, but we quickly abandoned it since we realized we could easily get tired if we kept doing one thing without a break.

Jerry: We each have our own individual strengths, whether that be math, geometry, data structures, or something else. Some of the most challenging problems may pull together a combination of these, and that’s when our teamwork is able to shine the most.

Two team members from Faculty of Computer Science, Belgrade, confer while the third codes. Photo credit: Michael Roytek for ICPC.

You got four first-solves out of twelve problems! Was speed a deliberate part of your strategy? 

Mingyang: We didn’t aim for speed. However, while most teams follow the leaderboard, our team prefers to explore new problems. As a result, we were the first to solve many unexplored problems. 


Jerry: While we weren’t specifically aiming for first-solves, there are 12 problems to work on, but only 5 hours. And on the leaderboard, teams that solve more problems faster are ranked higher, so speed is of utmost importance.

An intense moment for a coder at the ICPC World Finals. Photo credit: Bob Smith for ICPC.

Xiao: We started on two unpopular problems instead of the one most of the teams were solving, and that was what contributed to two of our first-solves. Moreover, we focused more on correctness than speed, since an incorrect solution could waste a lot of time. Our strategy of alternating between coders and cross-checking solutions made sure that there was no “idle time” on the machine (i.e. time when no one was coding) and that we also never had incorrect solutions. Despite the expectations other people have put on us, we came into the competition with a “just for fun” mindset, and were not aiming for anything. Being first was certainly a surprise for us. 

A team from St. Petersburg Campus of Higher School of Economics discusses a possible solution. Photo credit: Randy Piland for ICPC.

Looking at the final scoreboard, it’s evident that Problem D, called “Guardians of the Gallery”, was the most challenging problem. While many teams attempted it, and you gave it a valiant 19 tries, no one solved it correctly. What was it about Problem D that gave everyone such trouble? 

Jerry: Problem D was a deceptively simple but exceptionally tricky geometry problem — and to make it harder, imprecision was everywhere. The concept of the problem was simple: there’s a guard in an art gallery, and an alarm goes off for a treasured sculpture. Art galleries are oddly-shaped, so the sculpture might not be in the guard’s line of sight — can you calculate how quickly they can run somewhere to see it?

What made this problem tricky was that some galleries would have walls with the tiniest sliver of a gap between them, and depending on the shape, the guard would sometimes be able to see through that gap. Figuring out what to do with these tiny slivers is what caused most teams who tried this problem to stumble.

Xiao: The challenging part of it was all the tricky edge cases and precision issues. Think about all the glitches in any physics engine in video games! Although we did fix a lot of bugs, most of the 19 attempts were “Hail Mary” attempts where we simply tried different parameters in hope that one of them would pass.
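The kind of imprecision Xiao describes is easy to reproduce. As an illustration (a generic sketch, not the team’s contest solution), here is the standard cross-product orientation test used in computational geometry, where an epsilon tolerance decides whether three points count as collinear; that epsilon is exactly the sort of parameter teams end up tuning in “Hail Mary” attempts.

```python
def orientation(ax, ay, bx, by, cx, cy, eps=1e-9):
    """Sign of the cross product (B - A) x (C - A).

    Returns +1 (left turn), -1 (right turn), or 0 (collinear
    within the eps tolerance). A tiny sliver of a gap between
    walls comes down to exactly this kind of near-zero value.
    """
    cross = (bx - ax) * (cy - ay) - (by - ay) * (cx - ax)
    if cross > eps:
        return 1
    if cross < -eps:
        return -1
    return 0

# A nearly collinear triple: with exact arithmetic the turn is
# nonzero, but within the tolerance it is reported as collinear.
print(orientation(0.0, 0.0, 1.0, 1.0, 2.0, 2.0 + 1e-12))  # -> 0
```

Whether a guard can “see through” a sliver hinges on whether such a test returns zero, which is why small changes to the tolerance can flip a verdict from wrong to accepted.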

Jerry: I solved problem D this afternoon after getting off the plane back to Boston — unfortunately a bit late, but a satisfying solve nonetheless! While we had a clear path to solving the problem during the contest, we didn’t have enough time to reach the full and complete solution.

The MIT team, wearing ICPC badges, face masks, and matching burgundy t-shirts, pose next to a large trophy cup and a small plush tiger toy.
The MIT team (from left to right, Mingyang Deng, Xiao Mao, and Jerry Mao) pose next to their trophy. Photo credit: Michael Roytek for ICPC.

Individually, did you have a “favorite” problem?

Xiao: Problem I was a particularly fun experience for us. It uses one of the most common data structures, the “segment tree.” Our solution borrowed a technique called “lazy propagation” in a very unconventional way.
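For readers unfamiliar with the structure, here is a minimal textbook sketch of a segment tree with lazy propagation (range add, range sum). This generic version is only for orientation; the team’s contest solution applied the lazy-update idea in a far less conventional way.

```python
class LazySegmentTree:
    """Range-add, range-sum segment tree over n elements, all zero
    initially. Lazy propagation defers pushing pending updates to
    children until a traversal actually needs them, keeping both
    update and query at O(log n)."""

    def __init__(self, n):
        self.n = n
        self.sum = [0] * (4 * n)   # subtree sums
        self.lazy = [0] * (4 * n)  # pending per-element additions

    def _push(self, node, lo, mid, hi):
        # Move this node's pending addition down to its children.
        if self.lazy[node]:
            self.sum[2 * node] += self.lazy[node] * (mid - lo)
            self.lazy[2 * node] += self.lazy[node]
            self.sum[2 * node + 1] += self.lazy[node] * (hi - mid)
            self.lazy[2 * node + 1] += self.lazy[node]
            self.lazy[node] = 0

    def update(self, l, r, val, node=1, lo=0, hi=None):
        # Add val to every element in the half-open range [l, r).
        if hi is None:
            hi = self.n
        if r <= lo or hi <= l:
            return
        if l <= lo and hi <= r:
            self.sum[node] += val * (hi - lo)
            self.lazy[node] += val
            return
        mid = (lo + hi) // 2
        self._push(node, lo, mid, hi)
        self.update(l, r, val, 2 * node, lo, mid)
        self.update(l, r, val, 2 * node + 1, mid, hi)
        self.sum[node] = self.sum[2 * node] + self.sum[2 * node + 1]

    def query(self, l, r, node=1, lo=0, hi=None):
        # Sum of elements in the half-open range [l, r).
        if hi is None:
            hi = self.n
        if r <= lo or hi <= l:
            return 0
        if l <= lo and hi <= r:
            return self.sum[node]
        mid = (lo + hi) // 2
        self._push(node, lo, mid, hi)
        return (self.query(l, r, 2 * node, lo, mid)
                + self.query(l, r, 2 * node + 1, mid, hi))
```

For example, after `t = LazySegmentTree(8)`, `t.update(0, 8, 1)` and `t.update(2, 4, 10)`, a call to `t.query(0, 8)` returns 28.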

Mingyang: I especially liked problem E. It’s a problem related to a magic trick in which a servant helps the magician guess a hidden card. The topic is interesting on its own; moreover, clever mathematical intuition is involved in modeling the trick precisely. I found the modeling part challenging and exciting.

Jerry: My favorite problems are about geometry. Geometry problems are often considered the bane of all programming contests due to the unique obstacles they bring: just like how a picture gets blurry the more you zoom in, this “blurriness” or “imprecision” can make a lot of correct ideas hard to express in code. However, there is a certain beauty to discovering how a computer program, which works with just numbers, can connect with a picture, such as a geometric diagram. In fact, it is in this connection that some of the most elegant results in mathematics arise.

Caption for video: Front row, from left to right: Ce Jin, Jerry Mao, Mingyang Deng, Xiao Mao, and Mr. Zunaid Ahmed Palak MP, the Bangladesh State Minister for the ICT Division (wearing red and green).

In this YouTube clip, shared by Prof. Rinard, you’re being announced as the World Champion Gold Medalists and called up to the stage to receive your trophies. Can you tell us a little about what you were thinking about and feeling at this particular moment? 

Mingyang: It was awesome. I felt unreal when this happened. Many strong teams participated, but our excellent performance placed us at the top. Xiao and Jerry are amazing teammates, and I enjoyed the time spent with them.

Xiao: This competition was my swan song performance concluding my more-than-a-decade-long competitive programming career starting from the 5th grade. On the stage, I was very happy that it ended on a high note, and I was able to avenge my disastrous performance at the International Olympiad in Informatics (IOI) 2017. I was also grateful for all the people who made this possible, especially my two teammates, Mingyang and Jerry.

Jerry: We’ve all been medalists on the world stage before at international contests, but this was an entirely different feeling. The ICPC is the oldest, largest, and most prestigious programming contest in the world. To have the opportunity to compete in the World Finals is already a great honor; to become a medalist is extraordinary; and to be the world champion team, representing MIT and bringing the trophy home, is a dream come true.

The team of three MIT programmers, and their student coach Ce Jin, smile broadly and raise their hands in celebration.
The team raises their hands as they are named champions. Photo credit: Randy Piland for ICPC.

Three from MIT named 2023 Rhodes Scholars

Jack Cook, Matthew Kearney, and Jupneet Singh have been selected for the 2023 cohort of the prestigious Rhodes Scholarship program. They will begin fully funded postgraduate studies at Oxford University in the U.K. next fall. Each year, Rhodes awards 32 scholarships to U.S. citizens plus additional scholarships for citizens from non-U.S. constituencies.

The students were supported by Associate Dean Kim Benard and the Distinguished Fellowships team in Career Advising and Professional Development, and received additional mentorship from the Presidential Committee on Distinguished Fellowships.

“Our students have worked incredibly hard throughout this process,” says Professor Tamar Schapiro, who co-chairs the committee along with Professor Will Broadhead. “They have been challenged to think deeply about what they want to do and about who they want to be. They have learned to communicate their values and goals in powerful ways. And they have developed confidence presenting themselves to others. We are thrilled that so many of them were recognized this year, as finalists and as winners.” 

Jack Cook ’22

Jack Cook is an MEng student from New York City who recently graduated with a major in computer science and a minor in brain and cognitive sciences. At Oxford, he plans to pursue an MSc in the social science of the internet and an MSc in evidence-based social intervention and policy evaluation. In the future, he plans to apply his technical skills toward solving problems involving misinformation.

As an undergraduate at MIT, Cook was lead author on “There’s Always a Bigger Fish,” a research paper from Mengjia Yan’s lab that demonstrates how machine learning can be weaponized to extract sensitive information from applications such as a web browser. His work on this project won him MIT’s 2022 Robert M. Fano UROP Award. For his master’s thesis, in partnership with Lahey Hospital, Cook is building a digital cognitive assessment for diagnosing patients with neurodegenerative diseases.

Cook also leads natural language processing initiatives at The New York Times R&D, where he built a system that answers questions from readers about breaking news in real time. As a high school student, he was on the founding team of Mixer, a startup focusing on low-latency live-streaming that was acquired by Microsoft in 2016.

Cook was also director of HackMIT, MIT’s premier annual 1,000-person hackathon, for two years. For HackMIT’s first virtual event in September 2020, he led the development of a 3D virtual platform on which hackers could “walk around” and interact with each other while participating remotely.

Matthew Kearney

Matt Kearney from Austin, Texas, is a senior majoring in both electrical engineering and computer science and philosophy. At Oxford, he will pursue an MSc in research in statistics. His goal is to redesign AI technologies and practices to both address their harms and reimagine them as tools for solutions to pressing societal issues such as climate change and economic inequality.

At MIT, Kearney has researched theoretical quantum computing with the Quanta Research Group, computer vision for 3D scene understanding with the Computer Science and Artificial Intelligence Laboratory (CSAIL), probabilistic climate downscaling with the Human Systems Lab, and explainability methods for natural language models with CSAIL. He also interned with Argo AI, an autonomous vehicle company, and Google X, the moonshot factory of Google.

Kearney ran on the MIT Cross Country and Track and Field teams and served as a captain for three years. He also co-founded a project in 2020 with the goal of focusing individual efforts on the most effective solutions to climate change. He and his co-founder were awarded the PKG Fellowship and the IDEAS Fellowship to support this work. Additionally, as part of his studies in the humanities, he was selected as an MIT Burchard Scholar.

In his spare time, Kearney loves spontaneously singing, cooking elaborate meals, and absolutely anything in the outdoors.

Jupneet Singh

Jupneet Singh is a senior from Somis, California, majoring in chemistry with a flex in biomedical engineering and minoring in history. As a Rhodes Scholar at Oxford, she intends to study for an MSc in evidence-based social intervention and policy evaluation. Following Rhodes, she plans to attend medical school and then complete residency as an active-duty Air Force Captain.

Singh’s career goals include serving as a trauma surgeon in the Air Force, and then entering the United States Public Health Commissioned Corps to advocate for the representation of minorities and culturally adaptive practices in health care. She currently holds leadership positions in Air Force ROTC, MIT Mock Trial, and Project Sunshine MIT, and is also involved with the PKG Center. She conducts research in the Shalek Lab studying fatty liver disease, and she has also worked in the Nolan Lab on natural products research.  

This past summer, Singh worked in de-addiction centers in India and had an abstract accepted to the American College of Surgeons Southern California Conference. She has worked in California at the Ventura County Family Justice Center and Ventura County Medical Center Trauma Center and published a paper as first author in The American Surgeon. Singh founded a program, Pathways to Promise, to support the health of children in Ventura affected by domestic violence, and has received four fellowships to support it.

In machine learning, synthetic data can offer real performance improvements

Teaching a machine to recognize human actions has many potential applications, such as automatically detecting workers who fall at a construction site or enabling a smart home robot to interpret a user’s gestures.

To do this, researchers train machine-learning models using vast datasets of video clips that show humans performing actions. However, not only is it expensive and laborious to gather and label millions or billions of videos, but the clips often contain sensitive information, like people’s faces or license plate numbers. Using these videos might also violate copyright or data protection laws. And this assumes the video data are publicly available in the first place — many datasets are owned by companies and aren’t free to use.

So, researchers are turning to synthetic datasets. These are made by a computer that uses 3D models of scenes, objects, and humans to quickly produce many varying clips of specific actions — without the potential copyright issues or ethical concerns that come with real data.

But are synthetic data as “good” as real data? How well does a model trained with these data perform when it’s asked to classify real human actions? A team of researchers at MIT, the MIT-IBM Watson AI Lab, and Boston University sought to answer this question. They built a synthetic dataset of 150,000 video clips that captured a wide range of human actions, which they used to train machine-learning models. Then they showed these models six datasets of real-world videos to see how well they could learn to recognize actions in those clips.

The researchers found that the synthetically trained models performed even better than models trained on real data for videos that have fewer background objects.

This work could help researchers use synthetic datasets in such a way that models achieve higher accuracy on real-world tasks. It could also help scientists identify which machine-learning applications could be best-suited for training with synthetic data, in an effort to mitigate some of the ethical, privacy, and copyright concerns of using real datasets.

“The ultimate goal of our research is to replace real data pretraining with synthetic data pretraining. There is a cost in creating an action in synthetic data, but once that is done, then you can generate an unlimited number of images or videos by changing the pose, the lighting, etc. That is the beauty of synthetic data,” says Rogerio Feris, principal scientist and manager at the MIT-IBM Watson AI Lab, and co-author of a paper detailing this research.

The paper is authored by lead author Yo-whan “John” Kim ’22; Aude Oliva, director of strategic industry engagement at the MIT Schwarzman College of Computing, MIT director of the MIT-IBM Watson AI Lab, and a senior research scientist in the Computer Science and Artificial Intelligence Laboratory (CSAIL); and seven others. The research will be presented at the Conference on Neural Information Processing Systems.   

Building a synthetic dataset

The researchers began by compiling a new dataset using three publicly available datasets of synthetic video clips that captured human actions. Their dataset, called Synthetic Action Pre-training and Transfer (SynAPT), contained 150 action categories, with 1,000 video clips per category.

They selected as many action categories as possible, such as people waving or falling on the floor, depending on the availability of clips that contained clean video data.

Once the dataset was prepared, they used it to pretrain three machine-learning models to recognize the actions. Pretraining involves training a model for one task to give it a head-start for learning other tasks. Inspired by the way people learn — we reuse old knowledge when we learn something new — the pretrained model can use the parameters it has already learned to help it learn a new task with a new dataset faster and more effectively.
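As a toy illustration of the pretrain-then-transfer idea (not the paper’s models, which are video action recognizers), the sketch below freezes a stand-in “pretrained” feature extractor and fits only a small new head on a new task; all data and weights here are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pretrained" weights: in the paper's setting these would come from
# training on a large dataset such as SynAPT; here, fixed random
# features merely stand in for them.
W_pretrained = rng.normal(size=(16, 32))

def features(x):
    # Frozen feature extractor reusing the pretrained parameters.
    return np.maximum(x @ W_pretrained, 0.0)  # ReLU features

# "Fine-tuning": only a small linear head is fit on the new task,
# reusing the already-learned representation instead of training
# every parameter from scratch.
X_new = rng.normal(size=(200, 16))
y_new = (X_new[:, 0] > 0).astype(float)  # toy downstream labels

F = features(X_new)
# Closed-form ridge regression for the new head.
head = np.linalg.solve(F.T @ F + 1e-2 * np.eye(32), F.T @ y_new)
preds = (F @ head > 0.5).astype(float)
accuracy = (preds == y_new).mean()
print(f"downstream accuracy: {accuracy:.2f}")
```

The point of the sketch is structural: the downstream task is learned through the fixed representation, which is the mechanism by which pretraining on synthetic clips can transfer to real video.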

They tested the pretrained models using six datasets of real video clips, each capturing classes of actions that were different from those in the training data.

The researchers were surprised to see that all three synthetic models outperformed models trained with real video clips on four of the six datasets. Their accuracy was highest for datasets that contained video clips with “low scene-object bias.”

Low scene-object bias means that the model cannot recognize the action by looking at the background or other objects in the scene — it must focus on the action itself. For example, if the model is tasked with classifying diving poses in video clips of people diving into a swimming pool, it cannot identify a pose by looking at the water or the tiles on the wall. It must focus on the person’s motion and position to classify the action.

“In videos with low scene-object bias, the temporal dynamics of the actions is more important than the appearance of the objects or the background, and that seems to be well-captured with synthetic data,” Feris says.

“High scene-object bias can actually act as an obstacle. The model might misclassify an action by looking at an object, not the action itself. It can confuse the model,” Kim explains.

Boosting performance

Building off these results, the researchers want to include more action classes and additional synthetic video platforms in future work, eventually creating a catalog of models that have been pretrained using synthetic data, says co-author Rameswar Panda, a research staff member at the MIT-IBM Watson AI Lab.

“We want to build models which have very similar performance or even better performance than the existing models in the literature, but without being bound by any of those biases or security concerns,” he adds.

They also want to combine their work with research that seeks to generate more accurate and realistic synthetic videos, which could boost the performance of the models, says SouYoung Jin, a co-author and CSAIL postdoc. She is also interested in exploring how models might learn differently when they are trained with synthetic data.

“We use synthetic datasets to prevent privacy issues or contextual or social bias, but what does the model actually learn? Does it learn something that is unbiased?” she says.

Now that they have demonstrated the potential of synthetic videos, they hope other researchers will build upon their work.

“Despite there being a lower cost to obtaining well-annotated synthetic data, currently we do not have a dataset with the scale to rival the biggest annotated datasets with real videos. By discussing the different costs and concerns with real videos, and showing the efficacy of synthetic data, we hope to motivate efforts in this direction,” adds co-author Samarth Mishra, a graduate student at Boston University (BU).

Additional co-authors include Hilde Kuehne, professor of computer science at Goethe University in Germany and an affiliated professor at the MIT-IBM Watson AI Lab; Leonid Karlinsky, research staff member at the MIT-IBM Watson AI Lab; Venkatesh Saligrama, professor in the Department of Electrical and Computer Engineering at BU; and Kate Saenko, associate professor in the Department of Computer Science at BU and a consulting professor at the MIT-IBM Watson AI Lab.

This research was supported by the Defense Advanced Research Projects Agency LwLL, as well as the MIT-IBM Watson AI Lab and its member companies, Nexplore and Woodside.

Louis Braida, hearing aid innovator and mentor, dies at 79

Lou Braida passed away on Sept. 2, 2022.

Louis Braida, the Henry Ellis Warren (1894) Professor (Emeritus) in the Department of Electrical Engineering and Computer Science (EECS), died on Sept. 2. He was 79. Braida was a principal researcher in the Research Laboratory of Electronics (RLE), and a faculty member in the Harvard-MIT Health Sciences and Technology (HST) program. The Institute for Medical Engineering and Science (IMES) is HST’s home at MIT.

Born in the Bronx to Louis Braida and Elvina Tonelli Braida, Braida received the B.E.E. from The Cooper Union in 1964, and the SM and PhD in electrical engineering from MIT in 1965 and 1969, respectively. During the course of his career at MIT, he was for many years the director of the Speech and Hearing Sciences training program within HST.

Braida was internationally known for his research in the areas of intensity perception, the characterization of hearing impairments, and aids for the deaf. Using modern communication theory and computational techniques, he worked to develop improved hearing aids for people suffering from sensorineural hearing impairments, and cochlear implants for the deaf, addressing many of the field’s knottiest problems in the pursuit of improved performance.

His work strongly enhanced the research community’s analytical understanding of both the benefits and limitations of compression amplification in hearing aids. Additionally, Braida sought to develop tactile aids for people who are profoundly deaf or deaf-blind, serving as a substitute for hearing in the reception of speech and environmental sounds.

“Lou Braida was, in many respects, the father of speech and hearing sciences within HST,” said Collin Stultz, Nina T. and Robert H. Rubin Professor in Medical Engineering and Science, Associate Director of IMES, and Co-Director of HST. “His contributions to the field will endure in perpetuity. He was a scholar, a cherished mentor, and a dedicated educator.” Charlotte Reed, a principal investigator and Senior Research Scientist in RLE and longtime friend and colleague of Braida’s, noted that “Lou applied a rigorous quantitative approach to the study of a wide range of topics in speech and hearing science.  Among his lasting contributions to the field are his comprehensive modeling work on the auditory perception of intensity and loudness and on the multimodal perception of speech.”

Beyond Braida’s contributions to the world of auditory science, he was known throughout EECS for his community-minded and collegial approach to work. Taking time from his intense research schedule, he volunteered to mentor new faculty members and orient them to MIT’s largest department. Elazer R. Edelman, Edward J. Poitras Professor in Medical Engineering and Science, MIT, and the Director of the Institute for Medical Engineering and Science (IMES) was one of the many influenced by Braida: “Lou was the consummate educator and mentor, a citizen of MIT and a dedicated member of HST whose engineering and programmatic innovations made life better for all in our community and the world at large.” And Jae Lim, Professor Post-Tenure of Electrical Engineering, remembered his friend as a kindly influence on all who entered his sphere. “Lou influenced the lives of many students at MIT. He supervised my bachelor’s and master’s theses.  I learnt from him what research is and how exciting and satisfying research can be. As a floor tutor of Burton-Conner House, he helped many students including me not only with academic issues but personal matters. He will be remembered and missed by many whose lives he touched.”

Braida’s remarkable devotion to his community was recognized in 2001, when he was awarded the Thomas A. McMahon Mentoring Award by HST. His friend Charlotte Reed aptly summed up his legacy of care, saying, “Lou will be remembered by his many students and colleagues as an intellectual force who had an enormous impact on our personal and professional growth, and he will be greatly missed.”

They’re going the distance: for MIT’s competitive programmers, North America is just the beginning

Sharing a single computer, Ziqian Zhong, MingYang Deng, and Anton Trygub work on their coding solution.

When you think of computer programmers, you might picture a lone coder, sitting in a cubicle, bathed in flickering light. But you should picture a team: in MIT’s case, a joyous, triumphant team of competitive programmers, bent on solving incredibly thorny problems faster and more accurately than their competition. With a #1 placement in the North American Championships of the International Collegiate Programming Contest (ICPC), MIT’s programming team is now eligible to attend the 46th Annual ICPC World Finals next year. This year, the world finals are being held in Bangladesh; next year’s finals will be held in Egypt.

We sat down with three of the team’s members, Ziqian Zhong (Computer Science and Engineering ’24), MingYang Deng (Computer Science and Engineering ’24), and Anton Trygub (Mathematics and Computer Science and Engineering ’23), to learn what it’s like to compete at the very top tiers of computing.

Many of our readers might have never seen a programming contest before. Tell us a little about the basic rules: what kinds of problems you are faced with, how much time you have to prepare or solve them, and the tools you’re allowed to use!

Ziqian Zhong: We’re typically faced with 10 to 15 problems and five hours. You can code with Java, Kotlin, Python, and C/C++. Most teams, including us, prefer to use C++, since it is very concise and usually runs faster.

MingYang Deng: In ACM-ICPC contests, a team consists of three members but only one computer is available, so on average each member has the computer for only a third of the time. Besides, problems in a programming contest usually require implementing efficient algorithms and data structures, so we typically spend much of our time thinking about solutions before coding.

Do you typically know anything about your competitors, or have friends on opposing teams? Is there trash talking in programming competitions? In other words, what’s the social atmosphere like?

Ziqian Zhong: I personally knew some of my competitors before, and the overall social atmosphere is pretty chill and friendly. We played poker and all kinds of other card games together; I ended up learning two new card games! I guess trash talking is not so popular here.

MingYang Deng: I agree. I made friends with many of my competitors, who are all very nice. People here share the same interests and similar backgrounds. It feels like a community. So there’s no need to be competitive outside the contest.

Anton Trygub: Competitive Programming is different from other competitive activities in the sense that you don’t compete directly against someone, as in football, chess, etc. You just need to do as well as you can yourself, so there is no tension between teams. And on the higher level, competitors know each other, as we participate in the same competitions on a regular basis. We are here to make friends and to help the ICPC community develop!

MIT’s team topped the rankings of the North American Championships, which were held at the University of Central Florida (UCF) in Orlando, FL, from May 26-31.

What makes a problem particularly hard to solve, and which problem was the most difficult in the North American Championships, from your perspective?

Ziqian Zhong: To solve a problem, we think and code. Some problems are hard to code, with lots of messy details and casework that is hard to get right. For other problems, it’s hard to figure out the correct solution. Personally, I dislike the former kind (it’s not a typing contest!), and the latter kind is more popular. Problem H was probably the hardest one, and we were the only team that solved it.

Anton Trygub: A problem may be difficult for different reasons: it might be just heavy implementation with little to no thinking, require some knowledge without which you might just give up, or have a lot of messy details. I prefer problems in which the difficulty lies in the thinking part, in something creative. It’s great to read the problem statement and feel, “How can this even be solvable?”

Tell me about a solution you felt was particularly creative, or that you were really proud of!

MingYang Deng: I think Anton’s solution to Problem H and Ziqian’s solution to Problem D are pretty creative.

Anton Trygub: I don’t think that any of my solutions were particularly creative, but Problem H was quite interesting to solve. The setup is, again, “How can this even be solvable?”, and then you start looking at some cases and make some observations, which suddenly add up to the full solution.

Ziqian Zhong: I don’t remember the details about Problem D but I remember it’s a pretty straightforward problem and requires some counting tricks!

How will you be preparing for the world finals next year — what’s involved in that?

Ziqian Zhong: I guess we’ll just do more training contests together. There is no secret ingredient. Mainly just practice more.

MingYang Deng: We will practice more, during which we will refine our strategies.

Anton Trygub: We will have to listen more to our coach on that one 😛 But yeah, mostly practice.

Over the last few years, teams from China, Russia, and Poland have been particularly dominant in the world finals. Why is that, from your perspective — do different countries have different styles of preparation or competition?

Ziqian Zhong: I think many teams in Russia and Poland practiced really hard and they have good strategies. I heard that the Red Panda Team from Moscow State University likes to start with the hard problems and leave some easy problems to the final hour. This is not optimal penalty-wise (the longer you take to solve every problem in total, the larger the penalty) but it probably utilizes the last hour better (the last hour is usually really stressful and it’s hard to get things right). This is pretty different from what we usually pursue.
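The penalty Ziqian mentions works as follows under standard ICPC scoring: teams are ranked first by problems solved, with ties broken by penalty time, which is the sum over solved problems of the minute of the accepted submission plus 20 minutes for each rejected attempt on that problem. A minimal sketch:

```python
def icpc_penalty(solved_problems):
    """Total ICPC penalty time in minutes.

    solved_problems: list of (accept_minute, wrong_attempts) pairs,
    one per solved problem. Unsolved problems add no penalty.
    """
    return sum(minute + 20 * wrong for minute, wrong in solved_problems)

# The same three problems solved in a different order: deferring the
# easy problems raises the accept minutes and hence the penalty.
easy_first = [(30, 0), (75, 1), (240, 0)]
hard_first = [(120, 0), (200, 1), (280, 0)]
print(icpc_penalty(easy_first))  # 365
print(icpc_penalty(hard_first))  # 620
```

This is why the Red Panda strategy of starting with hard problems costs penalty time even when the same set of problems gets solved.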

Anton Trygub: I don’t really think that we should talk about some kind of a trend here; the participants from those countries just happened to be really strong and went to the world finals several times. I hope we can bring a little bit of North American dominance to the finals!

Tell me a little about what competitions like this one have taught you, as a programmer.

Ziqian Zhong: I think it helps me to code faster and more accurately. In a contest, you don’t have much time to debug once things go south.

Anton Trygub: It helps me to go for the most efficient way to implement something, while keeping it clean, as otherwise I would die debugging.

MingYang Deng: It also improves my collaboration skills. In contests like this, you must communicate with your teammates, express your thoughts clearly, work as a team, and believe in each other.

3 Questions: How AI image generators could help robots 

In this conceptual painting, a computer responds with several different images in response to the prompt "A horse in a yellow flower field".

AI image generators, which create fantastical sights at the intersection of dreams and reality, bubble up on every corner of the web. Their entertainment value is demonstrated by an ever-expanding treasure trove of whimsical and random images serving as indirect portals to the brains of human designers. A simple text prompt yields a nearly instantaneous image, satisfying our primitive brains, which are hardwired for instant gratification. 

Although seemingly nascent, the field of AI-generated art can be traced back as far as the 1960s with early attempts using symbolic rule-based approaches to make technical images. While the progression of models that untangle and parse words has gained increasing sophistication, the explosion of generative art has sparked debate around copyright, disinformation, and biases, all mired in hype and controversy. Yilun Du, a PhD student in the Department of Electrical Engineering and Computer Science and affiliate of MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), recently developed a new method that makes models like DALL-E 2 more creative and have better scene understanding. Here, Du describes how these models work, whether this technical infrastructure can be applied to other domains, and how we draw the line between AI and human creativity. 

Q: AI-generated images use something called “stable diffusion” models to turn words into astounding images in just a few moments. But for every image used, there’s usually a human behind it. So what’s the line between AI and human creativity? How do these models really work? 

A: Imagine all of the images you could get on Google Search and their associated patterns. This is the diet these models are fed on. They’re trained on all of these images and their captions to generate images similar to the billions of images they have seen on the internet.

Let’s say a model has seen a lot of dog photos. It’s trained so that when it gets a similar text input prompt like “dog,” it’s able to generate a photo that looks very similar to the many dog pictures already seen. Now, more methodologically, how this all works dates back to a very old class of models called “energy-based models,” originating in the ’70s or ’80s.

In energy-based models, an energy landscape over images is constructed, which is used to simulate the physical dissipation to generate images. When you drop a dot of ink into water and it dissipates, for example, at the end, you just get this uniform texture. But if you try to reverse this process of dissipation, you gradually get the original ink dot in the water again. Or let’s say you have this very intricate block tower, and if you hit it with a ball, it collapses into a pile of blocks. This pile of blocks is then very disordered, and there’s not really much structure to it. To resuscitate the tower, you can try to reverse this collapse to recover the original structure.

These generative models create images in a very similar manner: you start from random noise and learn to simulate the reverse process, going from noise back to an image, iteratively refining it to make it more and more realistic. 
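That refinement loop has a simple schematic form. In the toy sketch below, a stand-in “denoiser” merely nudges a sample toward a fixed target value; in a real diffusion model, a trained neural network would instead predict what noise to remove at each step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained model: real diffusion models use a neural
# network; here we cheat and pull the sample toward a fixed target
# "image" (a single scalar, 3.0) purely to show the loop structure.
TARGET = 3.0

def denoise_step(x, step_size=0.1):
    return x + step_size * (TARGET - x)

# Start from pure random noise and iteratively refine, mirroring the
# reverse of the ink-dissipation process described above.
x = rng.normal()
for _ in range(100):
    x = denoise_step(x)

print(round(x, 3))  # converges close to the target
```

Each pass removes a little of the remaining “noise,” which is why the generated image becomes progressively more realistic over the course of the loop.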

In terms of what’s the line between AI and human creativity, you can say that these models are really trained on the creativity of people. The internet has all types of paintings and images that people have already created in the past. These models are trained to recapitulate and generate the images that have been on the internet. As a result, these models are more like crystallizations of what people have spent creativity on for hundreds of years. 

At the same time, because these models are trained on what humans have designed, they can generate very similar pieces of art to what humans have done in the past. They can find patterns in art that people have made, but it’s much harder for these models to actually generate creative photos on their own. 

If you try to enter a prompt like “abstract art” or “unique art” or the like, it doesn’t really understand the creativity aspect of human art. The models are, rather, recapitulating what people have done in the past, so to speak, as opposed to generating fundamentally new and creative art.

Since these models are trained on vast swaths of images from the internet, a lot of these images are likely copyrighted. You don’t exactly know what the model is retrieving when it’s generating new images, so there’s a big question of how you can even determine if the model is using copyrighted images. If the model depends, in some sense, on some copyrighted images, are those new images then copyrighted? That’s another question to address.

Yilun Du, an EECS PhD student and MIT CSAIL affiliate, discusses the potential applications of generative art beyond the explosion of images that put the web into creative hysterics.

Q: Do you believe images generated by diffusion models encode some sort of understanding about natural or physical worlds, either dynamically or geometrically? Are there efforts toward “teaching” image generators the basics of the universe that babies learn so early on? 

A: Do they encode some grasp of the natural and physical worlds? I think they definitely do. If you ask a model to generate a stable configuration of blocks, it definitely generates a block configuration that’s stable. If you tell it to generate an unstable configuration of blocks, the result does look very unstable. Or if you say “a tree next to a lake,” it’s roughly able to generate that.

In a sense, it seems like these models have captured a large aspect of common sense. But the issue that keeps us still very far from truly understanding the natural and physical world is that when you try to generate infrequent combinations of words that you or I can very easily imagine in our minds, these models cannot.

For example, if you say, “put a fork on top of a plate,” that happens all the time. If you ask the model to generate this, it easily can. If you say, “put a plate on top of a fork,” again, it’s very easy for us to imagine what this would look like. But if you put this into any of these large models, you’ll never get a plate on top of a fork. You instead get a fork on top of a plate, since the models are learning to recapitulate all the images they’ve been trained on. They can’t really generalize that well to combinations of words they haven’t seen.

A fairly well-known example is an astronaut riding a horse, which the model can do with ease. But if you say a horse riding an astronaut, it still generates a person riding a horse. It seems like these models are capturing a lot of correlations in the datasets they’re trained on, but they’re not actually capturing the underlying causal mechanisms of the world.

Another example that’s commonly used is a very complicated text description: one object to the right of another, a third object in front, and a fourth flying. The model is really only able to satisfy maybe one or two of these constraints. This could be partially because of the training data, as it’s rare to have very complicated captions. But it could also suggest that these models aren’t very structured. You can imagine that with a very complicated natural language prompt, there’s no way the model can accurately represent all the component details.

Q: You recently came up with a new method that uses multiple models to create more complex images with better understanding for generative art. Are there potential applications of this framework outside of image or text domains? 

A: We were really inspired by one of the limitations of these models. When you give these models very complicated scene descriptions, they aren’t actually able to correctly generate images that match them. 

One thought is that a single model has a fixed computational graph, meaning it can only use a fixed amount of computation to generate an image; if you get an extremely complicated prompt, there’s no way to use more computational power to generate that image.

If I gave a human a description of a scene that was, say, 100 lines long versus a scene that’s one line long, a human artist can spend much longer on the former. These models don’t really have the sensibility to do this. We propose, then, that given very complicated prompts, you can actually compose many different independent models together and have each individual model represent a portion of the scene you want to describe.
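The composition idea above can be sketched with toy code. This is a minimal illustration, not the actual method: each "model" here is a hypothetical hand-written function that pulls a sample toward the region satisfying its own part of the prompt, playing the role a separately trained diffusion model would play, and composition is simply summing their refinement directions so every step moves toward satisfying both concepts at once.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for two independent models, one per concept. Each returns
# a refinement direction over a 4-pixel "image": concept A constrains the
# left half, concept B the right half (values and halves are arbitrary).
def score_a(x):                  # e.g. covers "a tree on the left"
    g = np.zeros_like(x)
    g[:2] = 1.0 - x[:2]          # pull left pixels toward 1.0
    return g

def score_b(x):                  # e.g. covers "a lake on the right"
    g = np.zeros_like(x)
    g[2:] = -1.0 - x[2:]         # pull right pixels toward -1.0
    return g

# Composition: sum the per-model directions at every refinement step,
# so the final sample satisfies both parts of the prompt together.
x = rng.normal(size=4)
for _ in range(200):
    x = x + 0.05 * (score_a(x) + score_b(x))

print(np.round(x, 2))  # left half near 1.0, right half near -1.0
```

The design point is that neither toy model knows about the other's constraint; only the summed update ties them together, which is why adding more models lets more computation and more structure go into a more complicated prompt.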

We find that this enables our model to generate more complicated scenes, or ones that more accurately capture the different aspects of the scene together. In addition, this approach can be applied generally across a variety of domains. While image generation is currently the most successful application, generative models have actually been seeing all types of applications in a variety of domains. You can use them to generate diverse robot behaviors, synthesize 3D shapes, enable better scene understanding, or design new materials. You could potentially compose multiple desired factors to generate the exact material you need for a particular application.

One thing we’ve been very interested in is robotics. In the same way that you can generate different images, you can also generate different robot trajectories (the path and schedule), and by composing different models together, you can generate trajectories with different combinations of skills. If you have natural language specifications of jumping versus avoiding an obstacle, you could compose these models together and then generate robot trajectories that can both jump and avoid an obstacle.

In a similar manner, if we want to design proteins, we can specify different functions or aspects — in an analogous manner to how we use language to specify the content of the images — with language-like descriptions, such as the type or functionality of the protein. We could then compose these together to generate new proteins that can potentially satisfy all of these given functions. 

We’ve also explored using diffusion models on 3D shape generation, where you can use this approach to generate and design 3D assets. Normally, 3D asset design is a very complicated and laborious process. By composing different models together, it becomes much easier to generate shapes such as, “I want a 3D shape with four legs, with this style and height,” potentially automating portions of 3D asset design. 

Expanding the MIT-IBM Watson AI Lab’s network of neurons

On October 6, nearly 50 undergraduate and graduate students and postdocs, primarily from MIT, attended the MIT-IBM Watson AI Lab’s networking event. The goal was to connect young researchers with domain experts across the Lab for applied research through the MIT 6-A program offered through the EECS Alliance and the Lab’s summer 2023 internship; the event also helped to give the students a feel for what the Lab has to offer.

The event kicked off with an introduction from David Cox, IBM director of the Lab and director of exploratory AI research at IBM, who provided insights into the Lab’s structure and how its work fits into the larger picture of global machine learning innovation. The Lab, Cox says, is part of IBM Research, one of the oldest and largest research organizations in the world; IBM’s storied history includes its researchers receiving several Nobel Prizes and Turing Awards, more than most countries. “It’s a testament to this long commitment to charting the future of computing,” says Cox.

While IBM Research has many different areas of work, such as cloud, AI, and security, the Lab sits within a part called exploratory science. “So, our job is to be very academic in our goals to chart frontiers, invent new methods, and look at new technologies with fresh eyes, and we focus on artificial intelligence.” In this field, MIT and IBM can trace their roots back to a 1956 workshop, when the term “artificial intelligence” was first coined; the two officially joined forces to found the Lab in 2017.

Today, in the evolution of machine learning, the Lab’s work sits at the tail end of narrow AI (emerging and specific, limited-use technologies) and before general AI (revolutionary, expected around 2050 and beyond), in a class Cox described as broad AI, which is disruptive and pervasive. Broad AI is characterized by systems that have multi-task, multimodal, multi-domain uses and are easier to apply broadly to different problems. Working in this unique area, the Lab is considering whether the systems we’re building are explainable, secure, ethical, and unbiased, able to learn from small data, and supported by efficient computing infrastructure.

The roughly 70 projects undertaken in the Lab at any one time are jointly conceived and executed, and the research output routinely appears in top AI conferences and journals. Providing a 30,000-foot view, Cox provided a proverbial tour of the research portfolio, consisting of foundation models (like self-supervised learning), synthetic data (generating new data that also helps with security), multimodal research, fluid intelligence (reason and logic), accelerated discovery (chemistry, materials science, climate change), AI for business and decision-making (forecasting supply chains, causal discovery, healthcare, and logistics), efficient AI (hardware/software co-design, cheaper to run, lower energy requirements), and trusted AI and robustness (safe and fair systems, unbiased models, robust to adversarial attacks, human-AI interactions, and explainable systems).

One of the special aspects of the Lab is that it brings together an academic institution and industry, which has the advantage of making real impacts for business. A benefit of this unique structure is that, Cox says, “… we built this member program, where external companies come and co-invest with us…to use cutting edge AI to solve problems that they are facing in their businesses” — an opportunity 6-A program students would be able to leverage for their career growth.

As an integral part of the Lab and the future of the research community, event attendees, the students and early-career researchers, were invited to engage with the Lab’s researchers and explore current applied projects through demonstrations, including privacy-preserving synthetic data generation; “strolling cities” that generate realistic city images; a molecular grammar for chemical generation and discovery; the Giant Language model Test Room (GLTR), which can detect computer-generated text; models of individual fairness; and an adversarial T-shirt that can make someone invisible to a person-detection computer vision model. These showcased many pillars of the Lab’s research and the solutions that have come out of them.

Students expressed interests and backgrounds spanning mathematics, computer science, robotics, cybersecurity, neuroscience, hardware, natural language processing, quantum computing, economics, healthcare, computer vision, algorithms, and software engineering, to name a few. Drawn in, they inquired about overcoming roadblocks in projects, how research questions and problems evolved, how certain results were achieved and the thinking behind it, and how their background would fit into a particular team and area of inquiry.

“It was great to see such a high level of engagement from such young researchers,” says Aude Oliva, MIT director of the MIT-IBM Watson AI Lab and director of strategic industry engagement in the MIT Stephen A. Schwarzman College of Computing. “As the pace of computing and machine learning innovation skyrockets, students and early-career researchers are and will continue to be vital contributors of novel ideas and creative solutions to new research questions and industry problems in AI and computing.”