Research database

NGIAtlantic.eu - Distributed Learning for Resilient Virtual Network Management at Scale

Duration:
6 months (2022)
Principal investigator(s):
Project type:
EU-funded research - H2020 - Industrial Leadership – LEIT - ICT
Funding body:
European Commission
Project identification number:
PoliTo role:
Affiliated entity

Abstract

Next-generation networks are envisioned as the answer for network operators and service providers seeking to replace existing infrastructures and to introduce a new platform able to support new telecommunication businesses and services. They are considered key enablers for delivering new services that are available in any place, at any time, on any device. New requirements such as high reliability, zero packet loss, and real-time interaction, posed by data-intensive applications (e.g., augmented/virtual reality, Industry 4.0, or healthcare), exacerbate the need for more performant, scalable, resilient, and self-adapting networks. To support such applications, there is a need to rethink the design of both networks and applications, creating more intelligent and autonomous networks. To provide such (artificial) intelligence, there is increasing interest in equipping networks with autonomous run-time decision-making capabilities that incorporate distributed machine learning (ML) algorithms, to foster automation in network configuration, network management, and network resiliency. While AI/ML technologies continue to evolve at a rapid pace, moving from a paradigm of supervised learning towards distributed self-learning requires solving several challenges in the design and deployment of wide-scale networks. Two of those challenges are particularly relevant to the scalability requirements of the Next Generation Internet, and we plan to tackle them in this project: (1) scalability and sustainability of AI/ML models for network management, and (2) robustness of learning solutions in practical deployments. Recent studies have partially addressed some of these challenges by employing model-free approaches to efficiently manage network resources. In particular, Reinforcement Learning (RL) is well suited to this task given its ability to fit the network dynamics without any prior knowledge [1].
As an example, in [2] we proposed a network management scheme using Multi-Agent Reinforcement Learning (MARL) to auto-scale and dynamically accommodate traffic demands while reacting to network failures. Similarly, in [3] an RL-based auto-scaling method was proposed to determine the optimal number of Virtual Network Function instances. With such auto-scaling solutions, networks can deactivate idle resources that would otherwise incur unnecessary (energy) costs, and provide redundant facilities to face workload peaks or unexpected failures. Despite their innovative approach, these RL models were not designed with scalability in mind. While Deep Reinforcement Learning (DRL) is a first and partial attempt to tame very large (network) datasets, the learning model needs to be redesigned to leverage the computational capabilities of multiple nodes. To this end, building on a five-year collaboration on these themes (e.g., [2]), in this EU-US project we aim to investigate and experiment with algorithmic and system solutions that efficiently distribute multi-agent network models for self-scaling and resilient next-generation networks. We plan to use existing NSF-funded large-scale virtual network testbeds and, to solve such network management problems, to propose efficient methods for splitting, i.e., decomposing, the decision logic. We will benefit from the hardware available on the US-based testbeds via a training process performed on both CPUs and GPUs. In particular, we will train our ML models on Chameleon Cloud, GENI, and possibly the FABRIC testbed, which is still under construction and for which we have applied to be beta testers.

Structures

Partners

  • POLITECNICO DI TORINO - CENTRAL ADMINISTRATION
  • WATERFORD INSTITUTE OF TECHNOLOGY - IRELAND - Coordinator

Keywords

ERC sectors

PE6_2 - Computer systems, parallel/distributed systems, sensor networks, embedded systems, cyber-physical systems
PE6_7 - Artificial intelligence, intelligent systems, multi agent systems

Sustainable Development Goals

Goal 9. Build resilient infrastructure, promote inclusive and sustainable industrialization and foster innovation

Budget

Total cost: € 3,499,937.50
Total contribution: € 3,499,937.50
PoliTo total cost: € 50,000.00
PoliTo contribution: € 50,000.00

Communication activities