Research database

CESOIA - Low-Complexity Large Language Models for On-Device Visual Computing Applications

Duration:
01/04/2025 - 30/03/2028
Principal investigator(s):
Project type:
Non-EU international research
Funding body:
OTHER INTERNATIONAL (King Abdullah University of Science and Technology (KAUST))
PoliTo role:
Sole Contractor

Abstract

Since the 2022 release of OpenAI's ChatGPT interface [1], built on the GPT-3.5 Large Language Model (LLM), we have witnessed the disruptive and transformative effects of LLMs. These models are capable of describing a wide variety of topics, responding at various levels of abstraction, and communicating effectively in multiple languages, and they have proven capable of providing users with accurate and contextually appropriate responses. LLMs have quickly found applications in tasks such as spelling and grammar correction [2], generating text on specified topics [3], automated chatbot services, and even generating source code from loosely defined software specifications [4].

Research on language models, and on their multimodal variants integrating language with vision or other modalities, has recently experienced incredible growth. In computer vision, for instance, language models are combined with visual signals to achieve tasks such as verbal scene description and even open-world scene graph generation [5]. These technologies enable detailed interpretation of everyday objects, inference of relationships among them, and estimation of physical properties such as size, weight, distance, and speed. In user interaction and visualization research, LLMs serve as verbal interfaces to control software functionality or adjust visualization parameters [6], [7]. Through prompt engineering or fine-tuning, loosely defined text can be translated into specific commands that execute the desired actions within a system, supported by language model APIs.

The capabilities of language models continue to improve dramatically from one version to the next, yet this comes at the cost of significant growth in the size of the most advanced models. For example, models released or operated by leading companies such as OpenAI, Meta, Microsoft, and Google now reach the scale of trillions of trainable parameters [1]. Due to their size and complexity, training and inference for these models are confined to dedicated data centers equipped with sufficient computational and memory resources.

Another disruptive application of LLMs involves their use in wearable technology to provide assistive capabilities for users in various environments. For instance, Meta AI is being integrated into Meta's virtual reality hardware, where images captured by the device are streamed over the network to Meta's servers for model inference, enabling image interpretation or prompt-response tasks. The overarching vision is to develop assistive hardware that is as lightweight as sunglasses yet capable of aiding users in a wide range of tasks by understanding the scene they are viewing. While such technological advances have the potential to empower users in unprecedented ways, the heavy reliance on network connectivity and cloud-based data processing entails several drawbacks. First, users must maintain extremely high network connectivity to ensure that remote services can perform assistive tasks without latency that would degrade the user experience. Second, data privacy poses a significant challenge: many environments, such as medical settings and various industries, operate under strict data privacy regulations that prohibit the transmission of sensitive information to remote servers, making the use of these technologies prohibitive in such contexts. Lastly, the reliance on subscription models for these assistive services introduces ongoing costs for users, who must periodically purchase licenses, which can accumulate into significant long-term expenses.

Therefore, a dedicated research effort is focused on developing autonomous AI assistive technologies that can perform inference directly on a device. To this end, several (relatively) small language models have been released by Meta, Mistral AI, Microsoft, and Google [8]. These models are designed to operate on mobile devices or augmented reality hardware and typically contain billions of trainable parameters, with a current state-of-the-art version consisting of 1.5 billion trainable parameters [8]. While their inference speed is sufficient for practical use on high-end mobile devices, the quality of their outputs is not yet comparable to that of the trillion-parameter language models operating in data centers. The primary reason for the reduced inference quality of mobile language models is their significantly smaller model capacity. Nevertheless, these models retain general-purpose utility, with knowledge spanning a wide range of societal sectors.

An interesting thought arises from reflecting on these recent developments in AI: is it necessary for language models used on mobile devices to be uniformly scaled-down versions of their trillion-parameter counterparts? One can certainly argue that such an approach is required if the goal is to maintain general-purpose functionality. However, many applications that need autonomous, on-device AI assistance require only a specific skill set. For instance, a device dedicated to aiding simple medical procedures does not need knowledge about Van Gogh's paintings or Shakespeare's tragedy Julius Caesar; its utility is limited to a specific domain, and so can be its knowledge representation. This research project seeks to progress exactly along this direction by answering the following key question: can we develop technology that specializes mobile language (or multimodal) models, enabling them to (practically) match the performance of their general-purpose large-model counterparts within the limited scope of their intended use?
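The abstract does not prescribe a method for such specialization, but a natural baseline to keep in mind is knowledge distillation: training a small on-device student model to match a large general-purpose teacher on in-domain data only. The PyTorch sketch below is a minimal illustration under that assumption; the toy models, the temperature value, and the domain_batches placeholder are all hypothetical stand-ins for real language models and domain corpora, not artifacts of this project.

# Minimal sketch: specializing a small "student" LM by distilling a large
# "teacher" LM on domain-specific text only. All models and data below are
# hypothetical placeholders; the project does not prescribe this method.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM_T, DIM_S, T = 1000, 256, 64, 2.0  # toy sizes; T = distillation temperature

# Stand-ins for a large general-purpose model and a small on-device model.
teacher = nn.Sequential(nn.Embedding(VOCAB, DIM_T), nn.Linear(DIM_T, VOCAB)).eval()
student = nn.Sequential(nn.Embedding(VOCAB, DIM_S), nn.Linear(DIM_S, VOCAB))

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

def distill_step(tokens: torch.Tensor) -> float:
    """One optimization step: match the student's next-token distribution
    to the teacher's on in-domain tokens (KL divergence at temperature T)."""
    with torch.no_grad():
        t_logits = teacher(tokens)                   # (batch, seq, vocab)
    s_logits = student(tokens)
    loss = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.log_softmax(t_logits / T, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * (T * T)                                      # standard temperature scaling
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Hypothetical in-domain batches (e.g., token IDs from medical-procedure text).
domain_batches = [torch.randint(0, VOCAB, (8, 32)) for _ in range(3)]
for batch in domain_batches:
    print(f"distillation loss: {distill_step(batch):.4f}")

Restricting the distillation corpus to the target domain is precisely what would let the student spend its limited capacity on the required skill set rather than on general world knowledge.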

Structures

Keywords

ERC sectors

PE6_12 - Scientific computing, simulation and modelling tools

Sustainable Development Goals

Goal 9. Build resilient infrastructure, promote inclusive and sustainable industrialization and foster innovation

Budget

Total cost: € 88,062.00
Total contribution: € 88,062.00
PoliTo total cost: € 88,062.00
PoliTo contribution: € 88,062.00