Study shows vision-language models can’t handle queries with negation words

Imagine a radiologist examining a chest X-ray from a new patient. She notices the patient has swelling in the tissue but does not have an enlarged heart. Looking to speed up diagnosis, she might use a vision-language machine-learning model to search for reports from similar patients.

But if the model mistakenly identifies reports with both conditions, the most likely diagnosis could be quite different: If a patient has tissue swelling and an enlarged heart, the condition is very likely to be cardiac related, but with no enlarged heart there could be several underlying causes.

In a new study, MIT researchers have found that vision-language models are extremely likely to make such a mistake in real-world situations because they don’t understand negation — words like “no” and “doesn’t” that specify what is false or absent. 

“Those negation words can have a very significant impact, and if we are just using these models blindly, we may run into catastrophic consequences,” says Kumail Alhamoud, an MIT graduate student and lead author of this study.

The researchers tested the ability of vision-language models to identify negation in image captions. The models often performed as well as a random guess. Building on those findings, the team created a dataset of images with corresponding captions that include negation words describing missing objects.

They show that retraining a vision-language model with this dataset leads to performance improvements when a model is asked to retrieve images that do not contain certain objects. It also boosts accuracy on multiple choice question answering with negated captions.

But the researchers caution that more work is needed to address the root causes of this problem. They hope their research alerts potential users to a previously unnoticed shortcoming that could have serious implications in high-stakes settings where these models are currently being used, from determining which patients receive certain treatments to identifying product defects in manufacturing plants.

“This is a technical paper, but there are bigger issues to consider. If something as fundamental as negation is broken, we shouldn’t be using large vision/language models in many of the ways we are using them now — without intensive evaluation,” says senior author Marzyeh Ghassemi, an associate professor in the Department of Electrical Engineering and Computer Science (EECS) and a member of the Institute of Medical Engineering Sciences and the Laboratory for Information and Decision Systems.

Ghassemi and Alhamoud are joined on the paper by Shaden Alshammari, an MIT graduate student; Yonglong Tian of OpenAI; Guohao Li, a former postdoc at Oxford University; Philip H.S. Torr, a professor at Oxford; and Yoon Kim, an assistant professor of EECS and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT. The research will be presented at the Conference on Computer Vision and Pattern Recognition.

Neglecting negation

Vision-language models (VLMs) are trained using huge collections of images and corresponding captions, which they learn to encode as sets of numbers, called vector representations. The models use these vectors to distinguish between different images.

A VLM utilizes two separate encoders, one for text and one for images, and the encoders learn to output similar vectors for an image and its corresponding text caption.

“The captions express what is in the images — they are a positive label. And that is actually the whole problem. No one looks at an image of a dog jumping over a fence and captions it by saying ‘a dog jumping over a fence, with no helicopters,’” Ghassemi says.

Because the image-caption datasets don’t contain examples of negation, VLMs never learn to identify it.

To dig deeper into this problem, the researchers designed two benchmark tasks that test the ability of VLMs to understand negation.

For the first, they used a large language model (LLM) to re-caption images in an existing dataset by asking the LLM to think about related objects not in an image and write them into the caption. Then they tested models by prompting them with negation words to retrieve images that contain certain objects, but not others.

For the second task, they designed multiple choice questions that ask a VLM to select the most appropriate caption from a list of closely related options. These captions differ only by adding a reference to an object that doesn’t appear in the image or negating an object that does appear in the image.

The models often failed at both tasks, with image retrieval performance dropping by nearly 25 percent with negated captions. When it came to answering multiple choice questions, the best models only achieved about 39 percent accuracy, with several models performing at or even below random chance.

One reason for this failure is a shortcut the researchers call affirmation bias — VLMs ignore negation words and focus on objects in the images instead.

“This does not just happen for words like ‘no’ and ‘not.’ Regardless of how you express negation or exclusion, the models will simply ignore it,” Alhamoud says.

This was consistent across every VLM they tested.
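
As a rough illustration of affirmation bias, consider the toy Python sketch below. It is not a real VLM: the bag-of-words "encoder", the list of ignored negation words, and the two example "images" are all hypothetical stand-ins chosen to show how a model that drops negation words ends up ranking the wrong image first.

```python
import math
from collections import Counter

# Words the toy encoder silently drops (an assumption made for illustration).
NEGATION_WORDS = {"no", "not", "without"}

def encode(text):
    """Toy bag-of-words 'encoder' that ignores negation words entirely."""
    words = text.lower().replace(",", " ").split()
    return Counter(w for w in words if w not in NEGATION_WORDS)

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Each hypothetical "image" is represented by the objects it contains.
images = {
    "dog_only": encode("dog fence"),
    "dog_and_helicopter": encode("dog fence helicopter"),
}

# The query asks for a dog and a fence but explicitly excludes helicopters.
query = encode("dog fence no helicopter")
scores = {name: cosine(query, vec) for name, vec in images.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Because the word "no" is discarded, the query vector still contains "helicopter", so the image that includes the excluded object receives the higher similarity score.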

“A solvable problem”

Since VLMs aren’t typically trained on image captions with negation, the researchers developed datasets with negation words as a first step toward solving the problem.

Using a dataset with 10 million image-text caption pairs, they prompted an LLM to propose related captions that specify what is excluded from the images, yielding new captions with negation words.

They had to be especially careful that these synthetic captions still read naturally; otherwise, a VLM trained on them could fail in the real world when faced with more complex captions written by humans.
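
The recaptioning idea can be sketched in a few lines of Python. The researchers used an LLM to write the new captions; the sketch below swaps in a fixed template instead, so it only shows the shape of the data augmentation. The function name `add_negation` and the example objects are hypothetical.

```python
def add_negation(caption, absent_objects):
    """Append a clause naming objects that do not appear in the image.

    A template-based stand-in for the LLM recaptioning step; real
    captions would need far more varied and natural phrasing.
    """
    if not absent_objects:
        return caption
    if len(absent_objects) == 1:
        listed = absent_objects[0]
    else:
        listed = ", ".join(absent_objects[:-1]) + " or " + absent_objects[-1]
    return f"{caption.rstrip('.')}, with no {listed}."

augmented = add_negation("A dog jumping over a fence.", ["helicopters"])
print(augmented)
```

A template like this produces exactly the kind of stilted, repetitive captions the researchers wanted to avoid, which is one reason an LLM is a better fit for the job.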

They found that finetuning VLMs with their dataset led to performance gains across the board. It improved models’ image retrieval abilities by about 10 percent, while also boosting performance in the multiple-choice question answering task by about 30 percent.

“But our solution is not perfect. We are just recaptioning datasets, a form of data augmentation. We haven’t even touched how these models work, but we hope this is a signal that this is a solvable problem and others can take our solution and improve it,” Alhamoud says.

At the same time, he hopes their work encourages more users to think about the problem they want to use a VLM to solve and design some examples to test it before deployment.

In the future, the researchers could expand upon this work by teaching VLMs to process text and images separately, which may improve their ability to understand negation. In addition, they could develop additional datasets that include image-caption pairs for specific applications, such as health care.

MIT engineers advance toward a fault-tolerant quantum computer

In the future, quantum computers could rapidly simulate new materials or help scientists develop faster machine-learning models, opening the door to many new possibilities.

But these applications will only be possible if quantum computers can perform operations extremely quickly, so scientists can make measurements and perform corrections before compounding error rates reduce their accuracy and reliability.

The efficiency of this measurement process, known as readout, relies on the strength of the coupling between photons, which are particles of light that carry quantum information, and artificial atoms, units of matter that are often used to store information in a quantum computer.

Now, MIT researchers have demonstrated what they believe is the strongest nonlinear light-matter coupling ever achieved in a quantum system. Their experiment is a step toward realizing quantum operations and readout that could be performed in a few nanoseconds.

The researchers used a novel superconducting circuit architecture to show nonlinear light-matter coupling that is about an order of magnitude stronger than prior demonstrations, which could enable a quantum processor to run about 10 times faster.

There is still much work to be done before the architecture could be used in a real quantum computer, but demonstrating the fundamental physics behind the process is a major step in the right direction, says Yufeng “Bright” Ye SM ’20, PhD ’24, lead author of a paper on this research.

“This would really eliminate one of the bottlenecks in quantum computing. Usually, you have to measure the results of your computations in between rounds of error correction. This could accelerate how quickly we can reach the fault-tolerant quantum computing stage and be able to get real-world applications and value out of our quantum computers,” says Ye.

He is joined on the paper by senior author Kevin O’Brien, an associate professor and principal investigator in the Research Laboratory of Electronics (RLE) at MIT who leads the Quantum Coherent Electronics Group in the Department of Electrical Engineering and Computer Science (EECS). Additional MIT co-authors, with affiliations in RLE and/or MIT Lincoln Laboratory, include Jeremy B. Kline, Alec Yen, Gregory Cunningham, Max Tan, Alicia Zang, Michael Gingras, Bethany M. Niedzielski, Hannah Stickler, Kyle Serniak, and Mollie E. Schwartz. The research appears today in Nature Communications.

A new coupler

This physical demonstration builds on years of theoretical research in the O’Brien group.

After Ye joined the lab as a PhD student in 2019, he began developing a specialized photon detector to enhance quantum information processing.

Through that work, he invented a new type of quantum coupler, which is a device that facilitates interactions between qubits. Qubits are the building blocks of a quantum computer. This so-called quarton coupler had so many potential applications in quantum operations and readout that it quickly became a focus of the lab.

This quarton coupler is a special type of superconducting circuit that has the potential to generate extremely strong nonlinear coupling, which is essential for running most quantum algorithms. As the researchers feed more current into the coupler, it creates an even stronger nonlinear interaction. In this sense, nonlinearity means a system behaves in a way that is greater than the sum of its parts, exhibiting more complex properties.

“Most of the useful interactions in quantum computing come from nonlinear coupling of light and matter. If you can get a more versatile range of different types of coupling, and increase the coupling strength, then you can essentially increase the processing speed of the quantum computer,” Ye explains.

For quantum readout, researchers shine microwave light onto a qubit and then, depending on whether that qubit is in state 0 or 1, there is a frequency shift on its associated readout resonator. They measure this shift to determine the qubit’s state.

Nonlinear light-matter coupling between the qubit and resonator enables this measurement process.

The MIT researchers designed an architecture with a quarton coupler connected to two superconducting qubits on a chip. They turn one qubit into a resonator and use the other qubit as an artificial atom which stores quantum information. This information is transferred in the form of microwave light particles called photons.

“The interaction between these superconducting artificial atoms and the microwave light that routes the signal is basically how an entire superconducting quantum computer is built,” Ye explains.

Enabling faster readout

The quarton coupler creates nonlinear light-matter coupling between the qubit and resonator that’s about an order of magnitude stronger than researchers had achieved before. This could enable a quantum system with lightning-fast readout.

“This work is not the end of the story. This is the fundamental physics demonstration, but there is work going on in the group now to realize really fast readout,” O’Brien says.

That would involve adding additional electronic components, such as filters, to produce a readout circuit that could be incorporated into a larger quantum system.

The researchers also demonstrated extremely strong matter-matter coupling, another type of qubit interaction that is important for quantum operations. This is another area they plan to explore with future work.

Fast operations and readout are especially important for quantum computers because qubits have finite lifespans, a concept known as coherence time.

Stronger nonlinear coupling enables a quantum processor to run faster and with lower error, so the qubits can perform more operations in the same amount of time. This means the qubits can run more rounds of error correction during their lifespans.

“The more runs of error correction you can get in, the lower the error will be in the results,” Ye says.
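
The arithmetic behind that point is simple, as the short Python sketch below shows. The coherence time and per-round duration are made-up round numbers, not figures from the paper; only the roughly tenfold speedup comes from the researchers' estimate.

```python
def rounds_within_coherence(coherence_time_ns, round_time_ns):
    """Number of complete error-correction rounds that fit in one coherence time."""
    return coherence_time_ns // round_time_ns

# Hypothetical values, chosen only to make the arithmetic easy to follow.
coherence_time_ns = 100_000   # 100 microseconds
baseline_round_ns = 1_000     # 1 microsecond per round
speedup = 10                  # "about 10 times faster," per the article

baseline_rounds = rounds_within_coherence(coherence_time_ns, baseline_round_ns)
faster_rounds = rounds_within_coherence(coherence_time_ns, baseline_round_ns // speedup)
print(baseline_rounds, faster_rounds)
```

Under these assumptions, a tenfold speedup means ten times as many rounds of error correction before the qubits lose their quantum information.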

In the long run, this work could help scientists build a fault-tolerant quantum computer, which is essential for practical, large-scale quantum computation.

This research was supported, in part, by the Army Research Office, the AWS Center for Quantum Computing, and the MIT Center for Quantum Engineering.

Making AI models more trustworthy for high-stakes settings

The ambiguity in medical imaging can present major challenges for clinicians who are trying to identify disease. For instance, in a chest X-ray, pleural effusion, an abnormal buildup of fluid in the lungs, can look very much like pulmonary infiltrates, which are accumulations of pus or blood.

An artificial intelligence model could assist the clinician in X-ray analysis by helping to identify subtle details and boosting the efficiency of the diagnosis process. But because so many possible conditions could be present in one image, the clinician would likely want to consider a set of possibilities, rather than only having one AI prediction to evaluate.

One promising way to produce a set of possibilities, called conformal classification, is convenient because it can be readily implemented on top of an existing machine-learning model. However, it can produce sets that are impractically large. 

MIT researchers have now developed a simple and effective improvement that can reduce the size of prediction sets by up to 30 percent while also making predictions more reliable.

Having a smaller prediction set may help a clinician zero in on the right diagnosis more efficiently, which could improve and streamline treatment for patients. This method could be useful across a range of classification tasks — say, for identifying the species of an animal in an image from a wildlife park — as it provides a smaller but more accurate set of options.

“With fewer classes to consider, the sets of predictions are naturally more informative in that you are choosing between fewer options. In a sense, you are not really sacrificing anything in terms of accuracy for something that is more informative,” says Divya Shanmugam PhD ’24, a postdoc at Cornell Tech who conducted this research while she was an MIT graduate student.

Shanmugam is joined on the paper by Helen Lu ’24; Swami Sankaranarayanan, a former MIT postdoc who is now a research scientist at Lilia Biosciences; and senior author John Guttag, the Dugald C. Jackson Professor of Computer Science and Electrical Engineering at MIT and a member of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). The research will be presented at the Conference on Computer Vision and Pattern Recognition in June.

Prediction guarantees

AI assistants deployed for high-stakes tasks, like classifying diseases in medical images, are typically designed to produce a probability score along with each prediction so a user can gauge the model’s confidence. For instance, a model might predict that there is a 20 percent chance an image corresponds to a particular diagnosis, like pleurisy.

But it is difficult to trust a model’s predicted confidence because much prior research has shown that these probabilities can be inaccurate. With conformal classification, the model’s prediction is replaced by a set of the most probable diagnoses along with a guarantee that the correct diagnosis is somewhere in the set.

But the inherent uncertainty in AI predictions often causes the model to output sets that are far too large to be useful.

For instance, if a model is classifying an animal in an image as one of 10,000 potential species, it might output a set of 200 predictions so it can offer a strong guarantee.

“That is quite a few classes for someone to sift through to figure out what the right class is,” Shanmugam says.
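
For readers who want to see the mechanics, here is a minimal Python sketch of one common form of conformal classification, often called split conformal prediction. It is a generic textbook version under stated assumptions, not the researchers' code: the scoring rule (one minus the probability of the true class), the function names, and the tiny calibration set are all illustrative.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Pick a score threshold from held-out labeled data.

    The score of each calibration example is 1 minus the probability the
    model gave to the true class. Taking the ceil((n + 1) * (1 - alpha))-th
    smallest score gives roughly (1 - alpha) coverage on new examples.
    """
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank
    if rank > n:
        return float("inf")  # too little calibration data: include every class
    return scores[rank - 1]

def prediction_set(probs, threshold):
    """Return every class whose score falls at or below the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]

# Hypothetical calibration data: predicted probabilities and true labels.
cal_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.6, 0.3, 0.1]]
cal_labels = [0, 1, 2, 1]

threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
pred_set = prediction_set([0.5, 0.35, 0.15], threshold)
print(threshold, pred_set)
```

The less confident the model is on the calibration data, the larger the threshold becomes, and the more classes end up in each prediction set.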

The technique can also be unreliable because tiny changes to inputs, like slightly rotating an image, can yield entirely different sets of predictions.

To make conformal classification more useful, the researchers applied a technique developed to improve the accuracy of computer vision models called test-time augmentation (TTA).

TTA creates multiple augmentations of a single image in a dataset, perhaps by cropping the image, flipping it, zooming in, etc. Then it applies a computer vision model to each version of the same image and aggregates its predictions.

“In this way, you get multiple predictions from a single example. Aggregating predictions in this way improves predictions in terms of accuracy and robustness,” Shanmugam explains.
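
A bare-bones version of TTA fits in a few lines of Python. In the sketch below, the tiny image, the two augmentations, and the `toy_model` function are hypothetical, and the predictions are combined with a plain average, whereas the researchers learn how to aggregate them.

```python
def identity(image):
    """Leave the image unchanged."""
    return image

def hflip(image):
    """Flip the image left to right."""
    return [row[::-1] for row in image]

def toy_model(image):
    """Hypothetical two-class 'model': the class-0 probability is the
    share of total brightness found in the left half of the image."""
    half = len(image[0]) // 2
    left = sum(sum(row[:half]) for row in image)
    total = sum(sum(row) for row in image)
    p0 = left / total
    return [p0, 1.0 - p0]

def tta_predict(model, image, augmentations):
    """Run the model on each augmented copy and average the class probabilities."""
    preds = [model(aug(image)) for aug in augmentations]
    n_classes = len(preds[0])
    return [sum(p[k] for p in preds) / len(preds) for k in range(n_classes)]

image = [[3, 1], [3, 1]]  # a 2x2 grayscale "image" that is brighter on the left
single = toy_model(image)
averaged = tta_predict(toy_model, image, [identity, hflip])
print(single, averaged)
```

The single prediction depends heavily on which way the image happens to face, while the averaged prediction does not, which is the kind of robustness TTA is meant to provide.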

Maximizing accuracy

To apply TTA, the researchers hold out some labeled image data used for the conformal classification process. They learn to aggregate the augmentations on these held-out data, automatically augmenting the images in a way that maximizes the accuracy of the underlying model’s predictions.

Then they run conformal classification on the model’s new, TTA-transformed predictions. The conformal classifier outputs a smaller set of probable predictions for the same confidence guarantee.

“Combining test-time augmentation with conformal prediction is simple to implement, effective in practice, and requires no model retraining,” Shanmugam says.

Compared to prior work in conformal prediction across several standard image classification benchmarks, their TTA-augmented method reduced prediction set sizes by 10 to 30 percent across experiments.

Importantly, the technique achieves this reduction in prediction set size while maintaining the probability guarantee.

The researchers also found that, even though they are sacrificing some labeled data that would normally be used for the conformal classification procedure, TTA boosts accuracy enough to outweigh the cost of losing those data.

“It raises interesting questions about how we used labeled data after model training. The allocation of labeled data between different post-training steps is an important direction for future work,” Shanmugam says.

In the future, the researchers want to validate the effectiveness of such an approach in the context of models that classify text instead of images. To further improve the work, the researchers are also considering ways to reduce the amount of computation required for TTA.

This research is funded, in part, by the Wistron Corporation.

System lets robots identify an object’s properties through handling

A human clearing junk out of an attic can often guess the contents of a box simply by picking it up and giving it a shake, without the need to see what’s inside. Researchers from MIT, Amazon Robotics, and the University of British Columbia have taught robots to do something similar.

They developed a technique that enables robots to use only internal sensors to learn about an object’s weight, softness, or contents by picking it up and gently shaking it. With their method, which does not require external measurement tools or cameras, the robot can accurately guess parameters like an object’s mass in a matter of seconds.

This low-cost technique could be especially useful in applications where cameras might be less effective, such as sorting objects in a dark basement or clearing rubble inside a building that partially collapsed after an earthquake.

Key to their approach is a simulation process that incorporates models of the robot and the object to rapidly identify characteristics of that object as the robot interacts with it. 

The researchers’ technique is as good at guessing an object’s mass as some more complex and expensive methods that incorporate computer vision. In addition, their data-efficient approach is robust enough to handle many types of unseen scenarios.

“This idea is general, and I believe we are just scratching the surface of what a robot can learn in this way. My dream would be to have robots go out into the world, touch things and move things in their environments, and figure out the properties of everything they interact with on their own,” says Peter Yichen Chen, an MIT postdoc and lead author of a paper on this technique.

His coauthors include fellow MIT postdoc Chao Liu; Pingchuan Ma PhD ’25; Jack Eastman MEng ’24; Dylan Randle and Yuri Ivanov of Amazon Robotics; MIT professors of electrical engineering and computer science Daniela Rus, who leads MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL); and Wojciech Matusik, who leads the Computational Design and Fabrication Group within CSAIL. The research will be presented at the International Conference on Robotics and Automation.

Sensing signals

The researchers’ method leverages proprioception, which is a human or robot’s ability to sense its movement or position in space.

For instance, a human who lifts a dumbbell at the gym can sense the weight of that dumbbell in their wrist and bicep, even though they are holding the dumbbell in their hand. In the same way, a robot can “feel” the heaviness of an object through the multiple joints in its arm.

“A human doesn’t have super-accurate measurements of the joint angles in our fingers or the precise amount of torque we are applying to an object, but a robot does. We take advantage of these abilities,” Liu says.

As the robot lifts an object, the researchers’ system gathers signals from the robot’s joint encoders, which are sensors that detect the rotational position and speed of its joints during movement. 

Most robots have joint encoders within the motors that drive their moveable parts, Liu adds. This makes their technique more cost-effective than some approaches because it doesn’t need extra components like tactile sensors or vision-tracking systems.

To estimate an object’s properties during robot-object interactions, their system relies on two models: one that simulates the robot and its motion and one that simulates the dynamics of the object.

“Having an accurate digital twin of the real-world is really important for the success of our method,” Chen adds.

Their algorithm “watches” the robot and object move during a physical interaction and uses joint encoder data to work backward and identify the properties of the object.

For instance, a heavier object will move slower than a light one if the robot applies the same amount of force.

Differentiable simulations

They utilize a technique called differentiable simulation, which allows the algorithm to predict how small changes in an object’s properties, like mass or softness, impact the robot’s ending joint position. The researchers built their simulations using NVIDIA’s Warp library, an open-source developer tool that supports differentiable simulations.

Once the differentiable simulation matches up with the robot’s real movements, the system has identified the correct property. The algorithm can do this in a matter of seconds and only needs to see one real-world trajectory of the robot in motion to perform the calculations.
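
The idea can be shown with a much simpler system than a robot arm. The Python sketch below is an illustration under stated assumptions, not the researchers' implementation and not built on Warp: it simulates a point mass pushed by a constant force, tracks how the final position changes with the mass, and uses that derivative to adjust a mass estimate until the simulation matches one "observed" trajectory. All names and numbers are hypothetical.

```python
def simulate(mass, force=10.0, dt=0.01, steps=100):
    """Push a point mass with a constant force.

    Returns the final position and its derivative with respect to mass,
    which is carried through every time step alongside the state.
    """
    x = v = 0.0
    dx_dm = dv_dm = 0.0
    accel = force / mass
    daccel_dm = -force / mass**2
    for _ in range(steps):
        v += accel * dt
        dv_dm += daccel_dm * dt
        x += v * dt
        dx_dm += dv_dm * dt
    return x, dx_dm

def identify_mass(x_observed, mass_guess=1.0, lr=0.05, iters=500):
    """Gradient descent on the squared gap between simulated and observed position."""
    mass = mass_guess
    for _ in range(iters):
        x_sim, dx_dm = simulate(mass)
        grad = 2.0 * (x_sim - x_observed) * dx_dm
        mass -= lr * grad
    return mass

# Stand-in for a real measurement: the final position of a 2-kilogram object.
x_observed, _ = simulate(2.0)
estimated_mass = identify_mass(x_observed)
print(round(estimated_mass, 4))
```

A heavier object ends up closer to its starting point for the same push, so the sign of the derivative tells the algorithm whether to raise or lower its estimate.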

“Technically, as long as you know the model of the object and how the robot can apply force to that object, you should be able to figure out the parameter you want to identify,” Liu says.

The researchers used their method to learn the mass and softness of an object, but their technique could also determine properties like moment of inertia or the viscosity of a fluid inside a container.

Plus, because their algorithm does not need an extensive dataset for training like some methods that rely on computer vision or external sensors, it would not be as susceptible to failure when faced with unseen environments or new objects.

In the future, the researchers want to try combining their method with computer vision to create a multimodal sensing technique that is even more powerful.

“This work is not trying to replace computer vision. Both methods have their pros and cons. But here we have shown that without a camera we can already figure out some of these properties,” Chen says.

They also want to explore applications with more complicated robotic systems, like soft robots, and more complex objects, including sloshing liquids or granular media like sand.

In the long run, they hope to apply this technique to improve robot learning, enabling future robots to quickly develop new manipulation skills and adapt to changes in their environments.

“Determining the physical properties of objects from data has long been a challenge in robotics, particularly when only limited or noisy measurements are available. This work is significant because it shows that robots can accurately infer properties like mass and softness using only their internal joint sensors, without relying on external cameras or specialized measurement tools,” says Miles Macklin, senior director of simulation technology at NVIDIA, who was not involved with this research.

This work is funded, in part, by Amazon and the GIST-CSAIL Research Program.

Hybrid AI model crafts smooth, high-quality videos in seconds

What would a behind-the-scenes look at a video generated by an artificial intelligence model be like? You might think the process is similar to stop-motion animation, where many images are created and stitched together, but that’s not quite the case for “diffusion models” like OpenAI’s SORA and Google’s VEO 2.

Instead of producing a video frame-by-frame (or “autoregressively”), these systems process the entire sequence at once. The resulting clip is often photorealistic, but the process is slow and doesn’t allow for on-the-fly changes. 

Scientists from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Adobe Research have now developed a hybrid approach, called “CausVid,” to create videos in seconds. Much like a quick-witted student learning from a well-versed teacher, a full-sequence diffusion model trains an autoregressive system to swiftly predict the next frame while ensuring high quality and consistency. CausVid’s student model can then generate clips from a simple text prompt, turning a photo into a moving scene, extending a video, or altering its creations with new inputs mid-generation.

This dynamic tool enables fast, interactive content creation, cutting a 50-step process into just a few actions. It can craft many imaginative and artistic scenes, such as a paper airplane morphing into a swan, woolly mammoths venturing through snow, or a child jumping in a puddle. Users can also make an initial prompt, like “generate a man crossing the street,” and then make follow-up inputs to add new elements to the scene, like “he writes in his notebook when he gets to the opposite sidewalk.”

The CSAIL researchers say that the model could be used for different video editing tasks, like helping viewers understand a livestream in a different language by generating a video that syncs with an audio translation. It could also help render new content in a video game or quickly produce training simulations to teach robots new tasks.

Tianwei Yin SM ’25, PhD ’25, a recently graduated student in electrical engineering and computer science and CSAIL affiliate, attributes the model’s strength to its mixed approach.

“CausVid combines a pre-trained diffusion-based model with autoregressive architecture that’s typically found in text generation models,” says Yin, co-lead author of a new paper about the tool. “This AI-powered teacher model can envision future steps to train a frame-by-frame system to avoid making rendering errors.”

Yin’s co-lead author, Qiang Zhang, is a research scientist at xAI and a former CSAIL visiting researcher. They worked on the project with Adobe Research scientists Richard Zhang, Eli Shechtman, and Xun Huang, and two CSAIL principal investigators: MIT professors Bill Freeman and Frédo Durand.

Caus(Vid) and effect

Many autoregressive models can create a video that’s initially smooth, but the quality tends to drop off later in the sequence. A clip of a person running might seem lifelike at first, but their legs begin to flail in unnatural directions, indicating frame-to-frame inconsistencies (also called “error accumulation”).

Error-prone video generation was common in prior causal approaches, which learned to predict frames one by one on their own. CausVid instead uses a high-powered diffusion model to teach a simpler system its general video expertise, enabling it to create smooth visuals, but much faster.

CausVid displayed its video-making aptitude when researchers tested its ability to make high-resolution, 10-second-long videos. It outperformed baselines like “OpenSORA” and “MovieGen,” working up to 100 times faster than its competition while producing the most stable, high-quality clips.

Then, Yin and his colleagues tested CausVid’s ability to put out stable 30-second videos, where it also topped comparable models on quality and consistency. These results indicate that CausVid may eventually produce stable, hours-long videos, or even videos of indefinite duration.

A subsequent study revealed that users preferred the videos generated by CausVid’s student model over its diffusion-based teacher.

“The speed of the autoregressive model really makes a difference,” says Yin. “Its videos look just as good as the teacher’s ones, but with less time to produce, the trade-off is that its visuals are less diverse.”

CausVid also excelled when tested on over 900 prompts using a text-to-video dataset, receiving the top overall score of 84.27. It boasted the best metrics in categories like imaging quality and realistic human actions, eclipsing state-of-the-art video generation models like “Vchitect” and “Gen-3.”

While an efficient step forward in AI video generation, CausVid may soon be able to design visuals even faster — perhaps instantly — with a smaller causal architecture. Yin says that if the model is trained on domain-specific datasets, it will likely create higher-quality clips for robotics and gaming.

Experts say that this hybrid system is a promising upgrade from diffusion models, which are currently bogged down by processing speeds. “[Diffusion models] are way slower than LLMs [large language models] or generative image models,” says Carnegie Mellon University Assistant Professor Jun-Yan Zhu, who was not involved in the paper. “This new work changes that, making video generation much more efficient. That means better streaming speed, more interactive applications, and lower carbon footprints.”

The team’s work was supported, in part, by the Amazon Science Hub, the Gwangju Institute of Science and Technology, Adobe, Google, the U.S. Air Force Research Laboratory, and the U.S. Air Force Artificial Intelligence Accelerator. CausVid will be presented at the Conference on Computer Vision and Pattern Recognition in June.

Novel AI model inspired by neural dynamics from the brain

Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a novel artificial intelligence model inspired by neural oscillations in the brain, with the goal of significantly advancing how machine learning algorithms handle long sequences of data.

AI often struggles with analyzing complex information that unfolds over long periods of time, such as climate trends, biological signals, or financial data. One new type of AI model, called a “state-space model,” has been designed specifically to understand these sequential patterns more effectively. However, existing state-space models often face challenges: they can become unstable or require a significant amount of computational resources when processing long data sequences.

To address these issues, CSAIL researchers T. Konstantin Rusch and Daniela Rus have developed what they call “linear oscillatory state-space models” (LinOSS), which leverage principles of forced harmonic oscillators — a concept deeply rooted in physics and observed in biological neural networks. This approach provides stable, expressive, and computationally efficient predictions without overly restrictive conditions on the model parameters.

“Our goal was to capture the stability and efficiency seen in biological neural systems and translate these principles into a machine learning framework,” explains Rusch. “With LinOSS, we can now reliably learn long-range interactions, even in sequences spanning hundreds of thousands of data points or more.”

The LinOSS model is unique in ensuring stable prediction by requiring far less restrictive design choices than previous methods. Moreover, the researchers rigorously proved the model’s universal approximation capability, meaning it can approximate any continuous, causal function relating input and output sequences.
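
To make the oscillator idea concrete, here is a minimal sketch of a single forced harmonic oscillator, y'' = -a·y + u(t), stepped forward with an implicit (backward) Euler update. It is an illustration under stated assumptions, not the researchers' implementation: the function name, the choice of discretization, and the parameter values are all hypothetical. The point it illustrates is that, for any non-negative `a`, this update keeps the state bounded without further restrictions on the parameter.

```python
def simulate_oscillator(a, dt, inputs, y0=0.0, z0=0.0):
    """Simulate y'' = -a*y + u(t) with an implicit (backward) Euler step.

    a      : non-negative stiffness (squared natural frequency); hypothetical name
    dt     : step size
    inputs : sequence of forcing values u_1, u_2, ...
    Returns the list of positions y_1, y_2, ...
    """
    y, z = y0, z0  # z is the velocity y'
    positions = []
    for u in inputs:
        # Backward Euler for the pair (z, y), solved in closed form:
        #   z_next = z + dt * (u - a * y_next),  y_next = y + dt * z_next
        z = (z + dt * (u - a * y)) / (1.0 + dt * dt * a)
        y = y + dt * z
        positions.append(y)
    return positions


# Unforced run: the oscillation never grows beyond its starting amplitude.
free = simulate_oscillator(a=4.0, dt=0.1, inputs=[0.0] * 1000, y0=1.0)

# Constant forcing: the position settles near u / a.
forced = simulate_oscillator(a=4.0, dt=0.1, inputs=[1.0] * 1000)
```

With zero forcing, the quantity z² + a·y² shrinks by a factor of 1 / (1 + dt²·a) at every step, which is why the trajectory stays bounded for any non-negative `a`. Other discretizations trade this damping for energy preservation.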

Empirical testing demonstrated that LinOSS consistently outperformed existing state-of-the-art models across various demanding sequence classification and forecasting tasks. Notably, LinOSS outperformed the widely used Mamba model by nearly two times in tasks involving sequences of extreme length.

Recognized for its significance, the research was selected for an oral presentation at ICLR 2025 — an honor awarded to only the top 1 percent of submissions. The MIT researchers anticipate that the LinOSS model could significantly impact any fields that would benefit from accurate and efficient long-horizon forecasting and classification, including health-care analytics, climate science, autonomous driving, and financial forecasting.

“This work exemplifies how mathematical rigor can lead to performance breakthroughs and broad applications,” Rus says. “With LinOSS, we’re providing the scientific community with a powerful tool for understanding and predicting complex systems, bridging the gap between biological inspiration and computational innovation.”

The team expects that a new paradigm like LinOSS will interest machine learning practitioners, who can build upon it. Looking ahead, the researchers plan to apply their model to an even wider range of data modalities. Moreover, they suggest that LinOSS could provide valuable insights into neuroscience, potentially deepening our understanding of the brain itself.

Their work was supported by the Swiss National Science Foundation, the Schmidt AI2050 program, and the U.S. Department of the Air Force Artificial Intelligence Accelerator.

Student Spotlight: Aria Eppinger

This interview is part of a series of short interviews from the Department of EECS, called Student Spotlights. Each Spotlight features a student answering their choice of questions about themselves and life at MIT. Today’s interviewee, Aria Eppinger, graduated with her undergraduate degree in 6-7 Computer Science and Molecular Biology in spring of 2024. This spring, she will complete her MEng in 6-7. Her thesis, supervised by Ford Professor of Engineering Doug Lauffenburger in the Department of Biological Engineering, investigates the biological underpinnings of adverse pregnancy outcomes, including preterm birth and pre-eclampsia, by applying polytope-fitting algorithms.

Tell me about one teacher from your past—here at MIT, at your high school, or even earlier, who had an influence on the person you’ve become.

There are many teachers who had a large impact on my trajectory. I would first like to thank my elementary and middle school teachers for imbuing in me a love of learning. I would also like to thank my high school teachers for not only teaching me the foundations of writing strong arguments, programming, and designing experiments, but also instilling in me the importance of being a balanced person. It can be tempting to be ruled by studies or work, especially when learning and working are so fun. My high school teachers encouraged me to pursue my hobbies, make memories with friends, and spend time with family. As life continues to be hectic, I’m so grateful for this lesson (even if I’m still working on mastering it).

Tell me about one conversation that changed the trajectory of your life.

A number of years ago, I had the opportunity to chat with Warren Buffett. I was nervous at first but soon put at ease by his descriptions of his favorite foods – hamburgers, French fries, and ice cream – and his hitchhiking stories. His kindness impressed and inspired me, which is something I carry with me and aim to emulate all these years later.

Do you have any pets? Tell us about them—and if you have pictures, please share!

I have one dog who lives at home with my parents. Dodger, named after “Artful Dodger” in Oliver Twist, is as mischievous as beagles tend to be. We adopted him from a rescue shelter when I was in elementary school.

Dodger (left) and the late Patch (right) shared a doghouse built as a group project by Aria, her brother, father, and grandfather. Photo credit: Francesmary Modugno

Are you a re-reader or a re-watcher—and if so, what are your comfort books, shows, or movies?

I don’t re-read many books or re-watch many movies, but I never tire of Jane Austen’s Pride and Prejudice. I bought myself an ornately bound copy when I was interning in NYC last summer. Austen’s other novels, especially Sense and Sensibility, Persuasion, and Emma, are also favorites, and I’ve seen a fair number of their movie and mini-series adaptations. My favorite adaptation is the 1995 BBC production of Pride and Prejudice because of the cohesion with the original book and the casting of the leads, as well as the touches and plot derivations added by the producer and director to bring the work to modern audiences. The adaptation is quite long, but I have fond memories of re-watching it with some fellow Austenites at MIT.

Speaking of swimming scenes, Eppinger just finished her final season as a member of the MIT Varsity Swimming and Diving Team, where she competed in distance freestyle and breaststroke events. Photo credit: Sydney Chun

If you had to teach a really in-depth class about one niche topic, what would you pick?

There are two types of people in the world – those who eat to live and those who live to eat. As one of the latter, I would have to teach some sort of in-depth class on food. Perhaps, I would teach the science behind baking chocolate cake or churning the perfect ice cream. Or maybe I would teach the biochemistry of digesting. In any case, I would have to have lots of hands-on demos and reserve plenty for taste-testing!

What was the last thing you changed your mind about?

Brisket! I never was a big fan of brisket until I went to a Texas BBQ restaurant near campus, The Smoke Shop BBQ. Growing up, I had never had true BBQ, so I was quite skeptical. However, I enjoyed not only the brisket but also the other dishes. The brussels sprouts with caramelized onions is probably my favorite dish, but it feels like a crime to say that about a BBQ place!

What are you looking forward to about life after graduation? What do you think you’ll miss about MIT?

I’m looking forward to new adventures after graduation, including working in NYC and traveling to new places. I cross-registered to take Intensive Italian at Harvard this semester and am planning a trip to Italy to practice my Italian, see the historic sites, visit the Vatican, and taste the food. Non vedo l’ora di viaggiare all’Italia!

Eppinger has relished her time at MIT. “College is a special time to live with friends in close proximity and to stay up late working on psets in the Baker lounges.” Photo credit: Karla Ravin

While I’m excited for what lies ahead, I will miss MIT. What a joy it is to spend most of the day learning information from a fire hose, taking a class on a foreign topic because the course catalog description looked fun, talking to people whose viewpoint is very similar or very different from my own, and making friends that will last a lifetime.

Merging design and computer science in creative ways

The speed with which new technologies hit the market is nothing compared to the speed with which talented researchers find creative ways to use them, train them, even turn them into things we can’t live without. One such researcher is MIT MAD Fellow Alexander Htet Kyaw, a graduate student pursuing dual master’s degrees in architectural studies in computation and in electrical engineering and computer science.

Kyaw takes technologies like artificial intelligence, augmented reality, and robotics, and combines them with gesture, speech, and object recognition to create human-AI workflows that have the potential to interact with our built environment, change how we shop, design complex structures, and make physical things.

One of his latest innovations is Curator AI, for which he and his MIT graduate student partners took first prize — $26,000 in OpenAI products and cash — at the MIT AI Conference’s AI Build: Generative Voice AI Solutions, a weeklong hackathon at MIT with final presentations held last fall in New York City. Working with Kyaw were Richa Gupta (architecture) and Bradley Bunch, Nidhish Sagar, and Michael Won — all from the MIT Department of Electrical Engineering and Computer Science (EECS).

Curator AI is designed to streamline online furniture shopping by providing context-aware product recommendations using AI and AR. The platform uses AR to take the dimensions of a room with locations of windows, doors, and existing furniture. Users can then speak to the software to describe what new furnishings they want, and the system will use a vision-language AI model to search for and display various options that match both the user’s prompts and the room’s visual characteristics.

“Shoppers can choose from the suggested options, visualize products in AR, and use natural language to ask for modifications to the search, making the furniture selection process more intuitive, efficient, and personalized,” Kyaw says. “The problem we’re trying to solve is that most people don’t know where to start when furnishing a room, so we developed Curator AI to provide smart, contextual recommendations based on what your room looks like.” Although Curator AI was developed for furniture shopping, it could be expanded for use in other markets.

Another example of Kyaw’s work is Estimate, a product that he and three other graduate students created during the MIT Sloan Product Tech Conference’s hackathon in March 2024. The focus of that competition was to help small businesses; Kyaw and team decided to base their work on a painting company in Cambridge that employs 10 people. Estimate uses AR and an object-recognition AI technology to take the exact measurements of a room and generate a detailed cost estimate for a renovation and/or paint job. It also leverages generative AI to display images of the room or rooms as they might look after painting or renovating, and generates an invoice once the project is complete.

The team won that hackathon and $5,000 in cash. Kyaw’s teammates were Guillaume Allegre, May Khine, and Anna Mathy, all of whom graduated from MIT in 2024 with master’s degrees in business analytics.

In April, Kyaw will give a TEDx talk at his alma mater, Cornell University, in which he’ll describe Curator AI, Estimate, and other projects that use AI, AR, and robotics to design and build things.

One of these projects is Unlog, for which Kyaw connected AR with gesture recognition to build software that takes input from the touch of a fingertip on the surface of a material, or even in the air, to map the dimensions of building components. That’s how Unlog, a towering art sculpture made from ash logs that stands on the Cornell campus, came about.

Gesture Recognition for Feedback-Based Mixed Reality and Robotic Fabrication of the Unlog Tower. Video: Alexander Htet Kyaw

Unlog represents the possibility that structures can be built directly from a whole log, rather than having the log travel to a lumber mill to be turned into planks or two-by-fours, then shipped to a wholesaler or retailer. It’s a good representation of Kyaw’s desire to use building materials in a more sustainable way. A paper on this work, “Gestural Recognition for Feedback-Based Mixed Reality Fabrication a Case Study of the UnLog Tower,” was published by Kyaw, Leslie Lok, Lawson Spencer, and Sasa Zivkovic in the Proceedings of the 5th International Conference on Computational Design and Robotic Fabrication, January 2024.

Another system Kyaw developed integrates physics simulation, gesture recognition, and AR to design active bending structures built with bamboo poles. Gesture recognition allows users to manipulate digital bamboo modules in AR, and the physics simulation is integrated to visualize how the bamboo bends and where to attach the bamboo poles in ways that create a stable structure. This work appeared in the Proceedings of the 41st Education and Research in Computer Aided Architectural Design in Europe, August 2023, as “Active Bending in Physics-Based Mixed Reality: The Design and Fabrication of a Reconfigurable Modular Bamboo System.”

Kyaw pitched a similar idea using bamboo modules to create deployable structures last year to MITdesignX, an MIT MAD program that selects promising startups and provides coaching and funding to launch them. Kyaw has since founded BendShelters to build prefabricated, modular bamboo shelters and community spaces for refugees and displaced persons in Myanmar, his home country.

“Where I grew up, in Myanmar, I’ve seen a lot of day-to-day effects of climate change and extreme poverty,” Kyaw says. “There’s a huge refugee crisis in the country, and I want to think about how I can contribute back to my community.”

His work with BendShelters has been recognized by MIT Sandbox, PKG Social Innovation Challenge, and the Amazon Robotics’ Prize for Social Good.

At MIT, Kyaw is collaborating with Professor Neil Gershenfeld, director of the Center for Bits and Atoms, and PhD student Miana Smith to use speech recognition, 3D generative AI, and robotic arms to create a workflow that can build objects in an accessible, on-demand, and sustainable way. Kyaw holds bachelor’s degrees in architecture and computer science from Cornell. Last year, he was awarded an SJA Fellowship from the Steve Jobs Archive, which provides funding for projects at the intersection of technology and the arts. 

“I enjoy exploring different kinds of technologies to design and make things,” Kyaw says. “Being part of MAD has made me think about how all my work connects, and helped clarify my intentions. My research vision is to design and develop systems and products that enable natural interactions between humans, machines, and the world around us.” 

3D modeling you can feel

Essential for many industries ranging from Hollywood computer-generated imagery to product design, 3D modeling tools often use text or image prompts to dictate different aspects of visual appearance, like color and form. As much as this makes sense as a first point of contact, these systems are still limited in their realism due to their neglect of something central to the human experience: touch.

Fundamental to the uniqueness of physical objects are their tactile properties, such as roughness, bumpiness, or the feel of materials like wood or stone. Existing modeling methods often require advanced computer-aided design expertise and rarely support tactile feedback that can be crucial for how we perceive and interact with the physical world.

With that in mind, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have created a new system for stylizing 3D models using image prompts, effectively replicating both visual appearance and tactile properties.

The CSAIL team’s “TactStyle” tool allows creators to stylize 3D models based on images while also incorporating the expected tactile properties of the textures. TactStyle separates visual and geometric stylization, enabling the replication of both visual and tactile properties from a single image input.

EECS PhD student Faraz Faruqi, lead author of a new paper on the project, says that TactStyle could have far-reaching applications, extending from home decor and personal accessories to tactile learning tools. TactStyle enables users to download a base design — such as a headphone stand from Thingiverse — and customize it with the styles and textures they desire. In education, learners can explore diverse textures from around the world without leaving the classroom, while in product design, rapid prototyping becomes easier as designers quickly print multiple iterations to refine tactile qualities.

“You could imagine using this sort of system for common objects, such as phone stands and earbud cases, to enable more complex textures and enhance tactile feedback in a variety of ways,” says Faruqi, who co-wrote the paper alongside MIT Associate Professor Stefanie Mueller, leader of the Human-Computer Interaction (HCI) Engineering Group at CSAIL. “You can create tactile educational tools to demonstrate a range of different concepts in fields such as biology, geometry, and topography.”

Traditional methods for replicating textures involve using specialized tactile sensors — such as GelSight, developed at MIT — that physically touch an object to capture its surface microgeometry as a “heightfield.” But this requires having a physical object or its recorded surface for replication. TactStyle allows users to replicate the surface microgeometry by leveraging generative AI to generate a heightfield directly from an image of the texture.

On top of that, for platforms like the 3D printing repository Thingiverse, it’s difficult to take individual designs and customize them. Indeed, if a user lacks sufficient technical background, changing a design manually runs the risk of actually “breaking” it so that it can’t be printed anymore. All of these factors spurred Faruqi to wonder about building a tool that enables customization of downloadable models on a high level, but that also preserves functionality.

In experiments, TactStyle showed significant improvements over traditional stylization methods by generating accurate correlations between a texture’s visual image and its heightfield. This enables the replication of tactile properties directly from an image. One psychophysical experiment showed that users perceive TactStyle’s generated textures as similar to both the expected tactile properties from visual input and the tactile features of the original texture, leading to a unified tactile and visual experience.

TactStyle leverages a preexisting method, called “Style2Fab,” to modify the model’s color channels to match the input image’s visual style. Users first provide an image of the desired texture, and then a fine-tuned variational autoencoder is used to translate the input image into a corresponding heightfield. This heightfield is then applied to modify the model’s geometry to create the tactile properties.
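
As a rough illustration of that last step, the sketch below shows one common way a heightfield can modify geometry: each vertex is pushed outward along its surface normal by the height sampled at its texture coordinate. This is an assumption made for illustration, not a description of TactStyle's actual pipeline; the function names, the nearest-neighbor sampling, and the `scale` parameter are all hypothetical.

```python
def sample_height(heightfield, u, v):
    """Nearest-neighbor lookup of a height at texture coordinates (u, v) in [0, 1]."""
    rows, cols = len(heightfield), len(heightfield[0])
    r = min(int(v * rows), rows - 1)
    c = min(int(u * cols), cols - 1)
    return heightfield[r][c]


def displace_vertices(vertices, normals, uvs, heightfield, scale=1.0):
    """Move each vertex along its unit normal by scale * height at its (u, v)."""
    displaced = []
    for (x, y, z), (nx, ny, nz), (u, v) in zip(vertices, normals, uvs):
        h = scale * sample_height(heightfield, u, v)
        displaced.append((x + h * nx, y + h * ny, z + h * nz))
    return displaced


# Two vertices on a flat, upward-facing surface and a tiny 1x2 heightfield:
# the left half of the texture is flat (0.0), the right half is raised (1.0).
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
normals = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
uvs = [(0.25, 0.5), (0.75, 0.5)]
bumpy = displace_vertices(vertices, normals, uvs, [[0.0, 1.0]], scale=0.5)
```

In this toy case only the second vertex moves, rising by half a unit. A real mesh would need far denser geometry than two vertices for a texture to be felt on a printed object.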

The color and geometry stylization modules work in tandem, stylizing both the visual and tactile properties of the 3D model from a single image input. Faruqi says that the core innovation lies in the geometry stylization module, which uses a fine-tuned diffusion model to generate heightfields from texture images — something previous stylization frameworks do not accurately replicate.

Looking ahead, Faruqi says the team aims to extend TactStyle to generate novel 3D models using generative AI with embedded textures. This requires exploring exactly the sort of pipeline needed to replicate both the form and function of the 3D models being fabricated. They also plan to investigate “visuo-haptic mismatches” to create novel experiences with materials that defy conventional expectations, like something that appears to be made of marble but feels like it’s made of wood.

Faruqi and Mueller co-authored the new paper alongside PhD students Maxine Perroni-Scharf and Yunyi Zhu, visiting undergraduate student Jaskaran Singh Walia, visiting master’s student Shuyue Feng, and assistant professor Donald Degraen of the Human Interface Technology (HIT) Lab NZ in New Zealand.

“Periodic table of machine learning” could fuel AI discovery

MIT researchers have created a periodic table that shows how more than 20 classical machine-learning algorithms are connected. The new framework sheds light on how scientists could fuse strategies from different methods to improve existing AI models or come up with new ones.

For instance, the researchers used their framework to combine elements of two different algorithms to create a new image-classification algorithm that performed 8 percent better than current state-of-the-art approaches.

The periodic table stems from one key idea: All these algorithms learn a specific kind of relationship between data points. While each algorithm may accomplish that in a slightly different way, the core mathematics behind each approach is the same.

Building on these insights, the researchers identified a unifying equation that underlies many classical AI algorithms. They used that equation to reframe popular methods and arrange them into a table, categorizing each based on the approximate relationships it learns.

Just like the periodic table of chemical elements, which initially contained blank squares that were later filled in by scientists, the periodic table of machine learning also has empty spaces. These spaces predict where algorithms should exist but have not yet been discovered.

The table gives researchers a toolkit to design new algorithms without the need to rediscover ideas from prior approaches, says Shaden Alshammari, an MIT graduate student and lead author of a paper on this new framework.

“It’s not just a metaphor,” adds Alshammari. “We’re starting to see machine learning as a system with structure that is a space we can explore rather than just guess our way through.”

She is joined on the paper by John Hershey, a researcher at Google AI Perception; Axel Feldmann, an MIT graduate student; William Freeman, the Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Mark Hamilton, an MIT graduate student and senior engineering manager at Microsoft. The research will be presented at the International Conference on Learning Representations.

An accidental equation

The researchers didn’t set out to create a periodic table of machine learning.

After joining the Freeman Lab, Alshammari began studying clustering, a machine-learning technique that classifies images by learning to organize similar images into nearby clusters.

She realized the clustering algorithm she was studying was similar to another classical machine-learning algorithm, called contrastive learning, and began digging deeper into the mathematics. Alshammari found that these two disparate algorithms could be reframed using the same underlying equation.

“We almost got to this unifying equation by accident. Once Shaden discovered that it connects two methods, we just started dreaming up new methods to bring into this framework. Almost every single one we tried could be added in,” Hamilton says.

The framework they created, information contrastive learning (I-Con), shows how a variety of algorithms can be viewed through the lens of this unifying equation. It includes everything from classification algorithms that can detect spam to the deep learning algorithms that power LLMs.

The equation describes how such algorithms find connections between real data points and then approximate those connections internally.

Each algorithm aims to minimize the amount of deviation between the connections it learns to approximate and the real connections in its training data.
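
One way to picture this is the minimal sketch below, which treats each point's "connections" as a probability distribution over the other points and measures deviation with a Kullback-Leibler divergence averaged over all points. The article does not spell out the equation, so the choice of divergence, the distance-based softmax used for the learned connections, and every function name here are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities (numerically stable)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def learned_connections(embeddings):
    """For each point i, a distribution over other points j that favors close neighbors."""
    rows = []
    for i, ei in enumerate(embeddings):
        scores = []
        for j, ej in enumerate(embeddings):
            if i == j:
                scores.append(float("-inf"))  # a point is never its own neighbor
            else:
                scores.append(-sum((a - b) ** 2 for a, b in zip(ei, ej)))
        rows.append(softmax(scores))
    return rows


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), skipping zero-probability targets and clamping tiny q values."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def connection_loss(real_rows, learned_rows):
    """Average deviation between real and learned connections over all points."""
    pairs = list(zip(real_rows, learned_rows))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


# "Real" connections: points 0 and 1 belong together, as do points 2 and 3.
real = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]

good = [(0.0, 0.0), (0.1, 0.0), (2.0, 0.0), (2.1, 0.0)]  # partners sit together
bad = [(0.0, 0.0), (2.0, 0.0), (0.1, 0.0), (2.1, 0.0)]   # partners sit apart

loss_good = connection_loss(real, learned_connections(good))
loss_bad = connection_loss(real, learned_connections(bad))
```

Here `loss_good` comes out smaller than `loss_bad`, because the first layout places connected points next to each other. Under this reading, swapping in a different definition of the real or the learned connections is what distinguishes one algorithm from another.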

They decided to organize I-Con into a periodic table to categorize algorithms based on how points are connected in real datasets and the primary ways algorithms can approximate those connections.

“The work went gradually, but once we had identified the general structure of this equation, it was easier to add more methods to our framework,” Alshammari says.

A tool for discovery

As they arranged the table, the researchers began to see gaps where algorithms could exist but had not yet been invented.

The researchers filled in one gap by borrowing ideas from a machine-learning technique called contrastive learning and applying them to image clustering. This resulted in a new algorithm that could classify unlabeled images 8 percent better than another state-of-the-art approach.

They also used I-Con to show how a data debiasing technique developed for contrastive learning could be used to boost the accuracy of clustering algorithms.

In addition, the flexible periodic table allows researchers to add new rows and columns to represent additional types of data point connections.

Ultimately, having I-Con as a guide could help machine learning scientists think outside the box, encouraging them to combine ideas in ways they wouldn’t necessarily have thought of otherwise, says Hamilton.

“We’ve shown that just one very elegant equation, rooted in the science of information, gives you rich algorithms spanning 100 years of research in machine learning. This opens up many new avenues for discovery,” he adds.

“Perhaps the most challenging aspect of being a machine-learning researcher these days is the seemingly unlimited number of papers that appear each year. In this context, papers that unify and connect existing algorithms are of great importance, yet they are extremely rare. I-Con provides an excellent example of such a unifying approach and will hopefully inspire others to apply a similar approach to other domains of machine learning,” says Yair Weiss, a professor in the School of Computer Science and Engineering at the Hebrew University of Jerusalem, who was not involved in this research.

This research was funded, in part, by the Air Force Artificial Intelligence Accelerator, the National Science Foundation AI Institute for Artificial Intelligence and Fundamental Interactions, and Quanta Computer.