This AI Model Called SeaFormer Brings Vision Transformers to Mobile Devices


The introduction of the vision transformer and its massive success in the object detection task has attracted a lot of attention toward transformers in the computer vision domain. These approaches have shown their strength in global context modeling, though their computational complexity has slowed their adoption in practical applications.
Despite their complexity, we have seen numerous applications of vision transformers since their release in 2021. They have been applied to videos for compression and classification. On the other hand, several studies have focused on improving vision transformers by integrating existing structures, such as convolutions or feature pyramids.
However, the most interesting application for us is image segmentation. Vision transformers can successfully model the global context needed for this task. These approaches work fine when we have powerful computers, but they cannot be executed on mobile devices due to hardware limitations.
Some researchers have tried to tackle the extensive memory and computational requirements of vision transformers by introducing lightweight alternatives to existing components. Although these changes improved the efficiency of vision transformers, they were still too demanding to run on mobile devices.
So, we have a new technology that can outperform all previous models on image segmentation tasks, but we cannot utilize it on mobile devices due to hardware limitations. Is there a way to solve this and bring that power to mobile devices? The answer is yes, and this is what SeaFormer is for.
SeaFormer (squeeze-enhanced Axial Transformer) is a mobile-friendly image segmentation model built using transformers. It reduces the computational complexity of axial attention to achieve superior efficiency on mobile devices.
The core building block is what the authors call squeeze-enhanced axial (SEA) attention. This block acts like a data compressor to reduce the input size. Instead of passing the entire set of input image patches, the SEA attention module first pools the input feature maps into a compact format and then computes self-attention. Moreover, to minimize the information loss caused by pooling, the query, keys, and values are added back to the result. Once they are added back, a depth-wise convolution layer is used to enhance local details.
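To make the squeeze-and-enhance idea more concrete, here is a minimal PyTorch sketch of an SEA-style attention block. This is not the authors' implementation: the class name, single-head attention, channel layout, and the grouped convolution used as the detail-enhancement branch are all simplifying assumptions for illustration; the real SeaFormer block differs in its multi-head layout, positional handling, and normalization.

```python
import torch
import torch.nn as nn


class SEAAttentionSketch(nn.Module):
    """Simplified sketch of a squeeze-enhanced axial (SEA) attention block.

    Hypothetical names and shapes; single-head attention for brevity.
    """

    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=False)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)
        # Grouped 3x3 convolution over the concatenated q, k, v acts as the
        # depth-wise "detail enhancement" branch that restores local detail.
        self.detail = nn.Sequential(
            nn.Conv2d(dim * 3, dim, kernel_size=3, padding=1, groups=dim, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        self.scale = dim ** -0.5

    def forward(self, x):
        b, c, h, w = x.shape
        qkv = self.to_qkv(x)                         # (B, 3C, H, W)
        q, k, v = qkv.chunk(3, dim=1)

        def squeezed_attention(pool_dim):
            # "Squeeze": average-pool one spatial axis so attention is
            # computed over H (or W) tokens instead of H * W tokens.
            qp = q.mean(dim=pool_dim)                # (B, C, L)
            kp = k.mean(dim=pool_dim)
            vp = v.mean(dim=pool_dim)
            attn = (qp.transpose(1, 2) @ kp) * self.scale   # (B, L, L)
            attn = attn.softmax(dim=-1)
            return vp @ attn.transpose(1, 2)         # (B, C, L)

        # Attend along rows and columns, then broadcast back to (B, C, H, W).
        row = squeezed_attention(pool_dim=3).unsqueeze(3).expand(b, c, h, w)
        col = squeezed_attention(pool_dim=2).unsqueeze(2).expand(b, c, h, w)

        # "Enhance": add the convolutional detail branch to compensate for
        # the information lost by pooling.
        return self.proj(row + col + self.detail(qkv))


# Toy usage: a 64-channel feature map at 32x32 resolution.
block = SEAAttentionSketch(dim=64)
out = block(torch.randn(1, 64, 32, 32))              # (1, 64, 32, 32)
```

The key point the sketch captures is that pooling along each axis shrinks the attention matrix from H×W tokens to H (or W) tokens, which is where the computational savings come from.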
This attention module significantly reduces the computational overhead compared to traditional vision transformers. However, the model still needs to be improved; thus, the modifications continue.
To further improve efficiency, a generic attention block is implemented, characterized by the formulation of squeeze attention and detail enhancement. Moreover, a lightweight segmentation head is used at the end. Combining all these changes results in a model capable of conducting high-resolution image segmentation on mobile devices.
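As a rough illustration of what a lightweight segmentation head can look like, the sketch below fuses a high-resolution spatial branch with the low-resolution transformer (context) branch and predicts per-pixel classes with a single 1x1 convolution. The class name, fusion scheme, and channel dimensions are assumptions made for illustration, not SeaFormer's actual head design.

```python
import torch
import torch.nn as nn


class LightSegHeadSketch(nn.Module):
    """Hypothetical lightweight segmentation head: fuse a high-resolution
    spatial branch with a low-resolution context branch, then classify."""

    def __init__(self, spatial_dim, context_dim, num_classes):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(spatial_dim + context_dim, spatial_dim, kernel_size=1, bias=False),
            nn.BatchNorm2d(spatial_dim),
            nn.ReLU(inplace=True),
        )
        self.classify = nn.Conv2d(spatial_dim, num_classes, kernel_size=1)

    def forward(self, spatial_feat, context_feat):
        # Upsample the low-resolution context features to the spatial branch size.
        context_feat = nn.functional.interpolate(
            context_feat, size=spatial_feat.shape[2:], mode="bilinear",
            align_corners=False)
        fused = self.fuse(torch.cat([spatial_feat, context_feat], dim=1))
        return self.classify(fused)


# Toy usage with made-up feature maps and 150 classes (e.g., ADE20K-sized).
head = LightSegHeadSketch(spatial_dim=64, context_dim=128, num_classes=150)
spatial = torch.randn(1, 64, 128, 128)   # high-resolution spatial branch
context = torch.randn(1, 128, 32, 32)    # low-resolution transformer branch
logits = head(spatial, context)          # (1, 150, 128, 128)
```

Keeping the head to a few 1x1 convolutions and a bilinear upsample is what keeps the decoding cost negligible compared with the backbone.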
SeaFormer outperforms other state-of-the-art efficient image segmentation transformers on a variety of datasets. It can be applied to other tasks as well, and to demonstrate that, the authors evaluated SeaFormer on the image classification task on the ImageNet dataset. The results were successful: SeaFormer outperforms other mobile-friendly transformers while running faster than them.
Check out the Paper and Github. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 14k ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.