Tiny machine learning design alleviates a bottleneck in memory usage on internet-of-things devices

By Lauren Hinkel

December 16, 2021 | MIT-IBM Watson AI Lab

An MIT team's tinyML vision system outperforms other models in many image classification and detection tasks. Photo courtesy of the researchers.

Machine learning gives researchers powerful tools to identify and predict patterns and behaviors, as well as to learn, optimize, and perform tasks. Applications range from vision systems on autonomous vehicles and social robots to smart thermostats and to wearable and mobile devices, such as smartwatches and apps that can monitor health changes. While these algorithms and their architectures are becoming more powerful and efficient, they typically require tremendous amounts of memory, computation, and data to train and make inferences.

At the same time, researchers are working to reduce the size and complexity of the devices that these algorithms can run on, all the way down to the microcontroller unit (MCU) found in billions of internet-of-things (IoT) devices. An MCU is a memory-limited minicomputer housed in a compact integrated circuit that lacks an operating system and runs simple commands. These relatively cheap edge devices require little power, computing, and bandwidth, and offer many opportunities to inject AI technology to expand their utility, increase privacy, and democratize their use, a field called TinyML.

Now, an MIT team working in TinyML in the MIT-IBM Watson AI Lab and the research group of Song Han, assistant professor in the Department of Electrical Engineering and Computer Science (EECS), has designed a technique that shrinks the amount of memory needed even further, while improving performance on image recognition in live videos.

“Our new technique can do a lot more and paves the way for tiny machine learning on edge devices,” says Han, who designs TinyML software and hardware.

To increase TinyML efficiency, Han and his colleagues from EECS and the MIT-IBM Watson AI Lab analyzed how memory is used on microcontrollers running various convolutional neural networks (CNNs). CNNs are biologically inspired models, patterned after neurons in the brain, and are often applied to evaluate and identify visual features within imagery, like a person walking through a video frame. In their study, they discovered an imbalance in memory utilization, causing front-loading on the computer chip and creating a bottleneck. By developing a new inference technique and neural architecture, the team alleviated the problem and reduced peak memory usage by a factor of four to eight. Further, the team deployed the technique on their own tinyML vision system, equipped with a camera and capable of human and object detection, creating its next generation, dubbed MCUNetV2. When compared to other machine learning methods running on microcontrollers, MCUNetV2 outperformed them with high accuracy on detection, opening the doors to additional vision applications not previously possible.

The results will be presented in a paper at the conference on Neural Information Processing Systems (NeurIPS) this week. The team includes Han, lead author and graduate student Ji Lin, postdoc Wei-Ming Chen, graduate student Han Cai, and MIT-IBM Watson AI Lab Research Scientist Chuang Gan.

A design for memory efficiency and redistribution

TinyML offers numerous advantages over deep machine learning that happens on larger devices, like remote servers and smartphones. These, Han notes, include privacy, since the data are not transmitted to the cloud for computing but processed on the local device; robustness, as the computing is quick and the latency is low; and low cost, because IoT devices cost roughly $1 to $2. Further, some larger, more traditional AI models can emit as much carbon as five cars in their lifetimes, require many GPUs, and cost billions of dollars to train. “So, we believe such TinyML techniques can enable us to go off-grid to save the carbon emissions and make the AI greener, smarter, faster, and also more accessible to everyone — to democratize AI,” says Han.

However, small MCU memory and digital storage limit AI applications, so efficiency is a central challenge. MCUs contain only 256 kilobytes of memory and 1 megabyte of storage. In comparison, mobile AI on smartphones and cloud computing may have 256 gigabytes and terabytes of storage, respectively, as well as 16,000 and 100,000 times more memory. Because memory is such a precious resource, the team wanted to optimize its use, so they profiled the MCU memory usage of CNN designs, a task that had been overlooked until now, Lin and Chen say.
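
To make the scale of that gap concrete, the multipliers quoted above can be turned into absolute figures. This is only a back-of-the-envelope sketch: it assumes decimal units (1 kilobyte = 1,000 bytes), which the article does not specify, and the variable names are illustrative.

```python
KB = 1_000  # assumption: decimal kilobyte; the article does not say KB vs. KiB

mcu_memory_bytes = 256 * KB

# Multipliers quoted in the article for smartphone and cloud memory.
phone_memory_bytes = mcu_memory_bytes * 16_000
cloud_memory_bytes = mcu_memory_bytes * 100_000


def to_gb(n_bytes):
    """Convert bytes to decimal gigabytes."""
    return n_bytes / 1_000_000_000


print(f"smartphone: about {to_gb(phone_memory_bytes):.1f} GB of memory")
print(f"cloud:      about {to_gb(cloud_memory_bytes):.1f} GB of memory")
```

Under that assumption, the quoted ratios correspond to roughly 4 GB of memory on a phone and about 25.6 GB on a cloud machine, against 0.000256 GB on the microcontroller.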

Their findings revealed that memory usage peaked in the first five convolutional blocks out of about 17. Each block contains many connected convolutional layers, which help to filter for the presence of specific features within an input image or video, creating a feature map as the output. During the initial memory-intensive stage, most of the blocks operated beyond the 256KB memory constraint, offering plenty of room for improvement. To reduce the peak memory, the researchers developed a patch-based inference schedule, which operates on only a small fraction, roughly 25 percent, of the layer’s feature map at one time, before moving on to the next quarter, until the whole layer is done. This method used four to eight times less memory than the previous layer-by-layer computational method, without adding any latency.
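
The idea of patch-based inference can be sketched in a few lines of plain Python. This is a toy illustration, not the MCUNetV2 implementation: a 3x3 box filter stands in for a learned convolutional layer, the 20x20 input and the two-layer depth are arbitrary, and the function names are hypothetical. It checks that computing the output one quarter at a time gives the same result as the layer-by-layer schedule, while the largest intermediate feature map held in memory is much smaller.

```python
def conv3x3(x):
    """Valid 3x3 box filter over a 2D list; a stand-in for one CNN layer."""
    h, w = len(x), len(x[0])
    return [
        [sum(x[i + di][j + dj] for di in range(3) for dj in range(3))
         for j in range(w - 2)]
        for i in range(h - 2)
    ]


def two_layers(x):
    """Layer-by-layer schedule: the full intermediate map is materialized."""
    return conv3x3(conv3x3(x))


def patch_based(x, splits=2):
    """Compute the same output one patch at a time (splits x splits patches)."""
    halo = 4  # two valid 3x3 layers consume 2 + 2 rows and columns in total
    out_h, out_w = len(x) - halo, len(x[0]) - halo
    ph, pw = out_h // splits, out_w // splits  # assumes even divisibility
    out = [[0] * out_w for _ in range(out_h)]
    peak_mid = 0
    for pi in range(splits):
        for pj in range(splits):
            r0, c0 = pi * ph, pj * pw
            # Input region this output patch depends on, including the halo.
            tile = [row[c0:c0 + pw + halo] for row in x[r0:r0 + ph + halo]]
            mid = conv3x3(tile)  # small intermediate map for this patch only
            peak_mid = max(peak_mid, len(mid) * len(mid[0]))
            res = conv3x3(mid)
            for i in range(ph):
                out[r0 + i][c0:c0 + pw] = res[i]
    return out, peak_mid


x = [[(7 * i + 3 * j) % 11 for j in range(20)] for i in range(20)]
full_mid = len(conv3x3(x)) * len(conv3x3(x)[0])
out, peak_mid = patch_based(x)
print(out == two_layers(x), full_mid, peak_mid)
```

In this toy setup the intermediate map shrinks from 324 elements to 100 per patch. The saving is a little less than a clean quarter because each patch must also read a border (halo) of neighboring pixels.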

“As an illustration, say we have a pizza. We can divide it into four chunks and only eat one chunk at a time, so you save about three-quarters. This is the patch-based inference method,” says Han. “However, this was not a free lunch.” Like photoreceptors in the human eye, a network’s neurons can only take in and examine part of an image at a time; this receptive field is a patch of the total image or field of view. As these receptive fields (or pizza slices, in this analogy) grow, they increasingly overlap, which amounts to redundant computation that the researchers found to be about 10 percent. The researchers proposed to also redistribute the neural network across the blocks, in parallel with the patch-based inference method, without losing any of the accuracy in the vision system. However, the question remained about which blocks needed the patch-based inference method and which could use the original layer-by-layer one, together with the redistribution decisions; hand-tuning all of these knobs was labor-intensive, and better left to AI.
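
The cost of that overlap can be estimated with a simple count. The sketch below assumes a stack of identical 3x3 layers without padding or stride, which is far simpler than a real network, and the function names are hypothetical. It compares how many output values are computed when every patch recomputes its own border against how many are computed once on the full feature map. The numbers it produces are illustrative and are not the roughly 10 percent the researchers measured.

```python
def layer_output_sizes(in_size, n_layers, k=3):
    """Output side lengths of n stacked valid k x k convolutions."""
    sizes = []
    s = in_size
    for _ in range(n_layers):
        s -= k - 1
        sizes.append(s)
    return sizes


def overhead(in_size, n_layers, splits, k=3):
    """Ratio of outputs computed patch by patch to outputs computed once."""
    full = layer_output_sizes(in_size, n_layers, k)
    patch_out = full[-1] // splits  # assumes the final map divides evenly
    halo = n_layers * (k - 1)       # extra border each patch must read
    patch = layer_output_sizes(patch_out + halo, n_layers, k)
    full_work = sum(s * s for s in full)
    patch_work = splits * splits * sum(s * s for s in patch)
    return patch_work / full_work


# Keep the final map at 64x64 and vary how many layers run patch by patch.
for n_layers in (1, 2, 4):
    ratio = overhead(64 + 2 * n_layers, n_layers, splits=2)
    print(n_layers, f"{ratio:.3f}")
```

With a single layer there is nothing to recompute, and the ratio grows as more layers are included in the patch-based stage, because the receptive field, and therefore the shared border between patches, widens with depth.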

“We want to automate this process by doing a joint automated search for optimization, including both the neural network architecture, like the number of layers, number of channels, the kernel size, and also the inference schedule including number of patches, number of layers for patch-based inference, and other optimization knobs,” says Lin, “so that non-machine learning experts can have a push-button solution to improve the computation efficiency but also improve the engineering productivity, to be able to deploy this neural network on microcontrollers.”
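
The shape of such a joint search can be sketched as follows. Everything here is an assumption made for illustration: the exhaustive grid search is swapped in for whatever search procedure the team actually uses, the memory model is a toy (identical 3x3 layers, no stride, only the largest live feature map counted), channel count is used as a crude stand-in for model capacity, and the names `peak_elements` and `search` are hypothetical.

```python
from itertools import product
from math import ceil

K = 3          # kernel size of every toy layer (fixed here for brevity)
N_LAYERS = 4   # toy network depth
IN_SIZE = 72   # toy input side length


def peak_elements(channels, n_patch_layers, splits):
    """Largest single feature map (in elements) held at once, toy model."""
    stage_out = IN_SIZE - n_patch_layers * (K - 1)
    patch_out = ceil(stage_out / splits)
    peak = 0
    side = IN_SIZE
    for layer in range(N_LAYERS):
        side -= K - 1
        if layer < n_patch_layers:
            # Inside the patch stage only one patch (plus halo) is live.
            halo = (n_patch_layers - 1 - layer) * (K - 1)
            live_side = patch_out + halo
        else:
            live_side = side
        peak = max(peak, channels * live_side * live_side)
    return peak


def search(budget):
    """Jointly pick an architecture knob and schedule knobs within a budget."""
    best = None
    grid = product([8, 16, 32], [0, 1, 2, 3, 4], [1, 2, 4])
    for channels, n_patch_layers, splits in grid:
        mem = peak_elements(channels, n_patch_layers, splits)
        if mem > budget:
            continue  # does not fit on the device
        # Prefer more channels, then fewer patch-based layers and fewer
        # patches, since both of those add redundant computation.
        key = (channels, -n_patch_layers, -splits)
        if best is None or key > best[0]:
            best = (key, {"channels": channels,
                          "n_patch_layers": n_patch_layers,
                          "splits": splits,
                          "peak": mem})
    return None if best is None else best[1]


print(search(50_000))
```

The point of the sketch is the structure rather than the numbers: architecture and schedule choices are enumerated together, candidates that exceed the memory budget are discarded, and the remaining ones are ranked, so a wider network can be selected only because a patch-based schedule makes it fit.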

A new horizon for tiny vision systems

The co-design of the network architecture with the neural network search optimization and inference scheduling provided significant gains and was adopted into MCUNetV2; it outperformed other vision systems in peak memory usage and in image classification and object detection. The MCUNetV2 device includes a small screen and a camera, and is about the size of an earbud case. Compared to the first version, the new version needed one-quarter of the memory for the same accuracy, says Chen. When placed head-to-head against other tinyML solutions, MCUNetV2 was able to detect the presence of objects in image frames, like human faces, with an improvement of nearly 17 percent. Further, it set a record for accuracy, at nearly 72 percent, for a thousand-class image classification on the ImageNet dataset, using 465KB of memory. The researchers also tested on what’s known as visual wake words, which measures how well their MCU vision model could identify the presence of a person within an image, and even with the limited memory of only 30KB, it achieved greater than 90 percent accuracy, beating the previous state-of-the-art method. This means the method is accurate enough to be deployed in, say, smart-home applications.

With its high accuracy and low energy utilization and cost, MCUNetV2’s performance unlocks new IoT applications. Due to their limited memory, Han says, vision systems on IoT devices were previously thought to be only good for basic image classification tasks, but their work has helped to expand the opportunities for TinyML use. Further, the research team envisions its use in numerous fields, from monitoring sleep and joint movement in the health-care industry to sports coaching and movements like a golf swing to plant identification in agriculture, as well as in smarter manufacturing, from identifying nuts and bolts to detecting malfunctioning machines.

“We really push forward for these larger-scale, real-world applications,” says Han. “Without GPUs or any specialized hardware, our technique is so tiny it can run on these small cheap IoT devices and perform real-world applications like these visual wake words, face mask detection, and person detection. This opens the door for a brand-new way of doing tiny AI and mobile vision.”

This research was sponsored by the MIT-IBM Watson AI Lab, Samsung, Woodside Energy, and the National Science Foundation.

Media Inquiries

Journalists seeking information about EECS, or interviews with EECS faculty members, should email eecs-communications@mit.edu.

Please note: The EECS Communications Office only handles media inquiries related to MIT’s Department of Electrical Engineering & Computer Science. Please visit other school, department, laboratory, or center websites to locate their dedicated media-relations teams.
