The debate surrounding artificial intelligence (AI) has, alongside climate change, become a central priority in today’s public discourse. The rise of generative AI has fueled significant hype over the past two years, particularly following the popularization of chatbots such as ChatGPT. The capabilities of these systems have opened the door to a futuristic panorama in which nearly any task could be addressed through them, but they have also prompted equally intense debate and criticism about the risks these tools may pose.
The “magic” of generative AI lies in its ability to produce new patterns and responses without direct human input. In other words, the understanding a generative AI system acquires during training enables it to generate new data—such as text, images, and audio—similar to the data used in its initial training phase, without human intervention to classify and standardize that data. To achieve this, generative AI employs machine learning (ML) and deep learning (DL), notably through large language models (LLMs), to process vast amounts of labeled and unlabeled data, so that outputs emerge from what the algorithm itself has learned.
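To make that learning-then-generating loop concrete, here is a minimal sketch of text generation with a pretrained model using the Hugging Face transformers library; the gpt2 checkpoint and the prompt are illustrative assumptions, not anything mandated by the discussion above.

```python
# Minimal text-generation sketch with Hugging Face transformers.
# Assumptions: `transformers` (plus a backend such as PyTorch) is installed,
# and the small `gpt2` checkpoint stands in for any causal LLM.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The model writes a continuation it was never explicitly given: it samples
# from patterns fitted during pretraining, with no human labeling of the output.
result = generator("Artificial intelligence regulation is", max_new_tokens=40)
print(result[0]["generated_text"])
```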
This shift from “traditional” AI to generative AI has highlighted the importance of data quality in the training, validation, and testing phases, moving beyond the old logic of “garbage in, garbage out” toward careful scrutiny of outputs as well, given the inventiveness of generative AI models.
Under the Artificial Intelligence Act (AI Act), training data refers to data used to train an AI system by fitting its learnable parameters, while validation data is used to evaluate the trained AI system and to tune its non-learnable parameters and its learning process; testing data, in turn, provides an independent evaluation of the system before it is placed on the market.
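As a rough illustration of that distinction, the sketch below fits a model’s learnable parameters (its weights) on a training split and uses a separate validation split only to choose a non-learnable parameter. The scikit-learn dataset, model, and hyperparameter grid are assumptions made for the example, not anything the AI Act prescribes.

```python
# Training data fits learnable parameters; validation data tunes a
# non-learnable one (the regularization strength C). Illustrative only.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

best_C, best_score = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0):       # candidate non-learnable parameters
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)        # training data adjusts the weights
    score = model.score(X_val, y_val)  # validation data steers the choice of C
    if score > best_score:
        best_C, best_score = C, score

print(f"selected C={best_C} (validation accuracy {best_score:.2f})")
```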
According to Philipp Hacker, the main risks in machine learning are (i) risks to data quality, (ii) risks of discrimination, and (iii) risks to innovation. Hacker emphasizes that training and validation data play a fundamental role in ensuring accurate machine learning and, specifically, in mitigating these central risks, especially in “autonomous” inventive activity.
From a regulatory perspective, the recently approved AI Act seeks to strengthen the governance of data used to train models, with a view to ensuring quality outcomes across training, validation, and testing. Article 10 lays down a set of rules for high-risk AI systems that rely on data-based model training:
Training, validation and testing data sets shall be subject to data governance and management practices appropriate for the intended purpose of the AI system. Those practices shall concern in particular:

(a) relevant design choices;
(b) data collection processes and the origin of data, and in the case of personal data, the original purpose of the data collection;
(c) relevant data preparation processing operations, such as annotation, labelling, cleaning, updating, enrichment and aggregation;
(d) the formulation of assumptions, notably with respect to the information that the data are supposed to measure and represent;
(e) an assessment of the availability, quantity and suitability of the data sets that are needed;
(f) examination in view of possible biases that are likely to affect the health and safety of persons, negatively impact fundamental rights or lead to discrimination prohibited under Union law, especially where data outputs influence inputs for future operations;
(g) the identification of relevant data gaps or shortcomings that prevent compliance with this Regulation, and how those gaps and shortcomings can be addressed.
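As one hedged example of what point (f) could look like in practice, the sketch below compares outcome rates across a protected attribute in a toy dataset; the column names and figures are invented for illustration, and a real bias examination under Article 10 would be far more involved.

```python
# Toy bias examination in the spirit of point (f): compare outcome
# rates across a protected attribute. Data and columns are invented.
import pandas as pd

df = pd.DataFrame({
    "gender":   ["F", "F", "M", "M", "M", "F", "M", "F"],
    "approved": [0,    1,   1,   1,   1,   0,   1,   0],
})

# A large gap between group approval rates is a signal to investigate,
# not by itself a finding of discrimination prohibited under Union law.
rates = df.groupby("gender")["approved"].mean()
print(rates)
print(f"disparity (max - min): {rates.max() - rates.min():.2f}")
```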
An important step has been taken toward the effective regulation of generative AI, but it is crucial to remain aware that it may not be sufficient. Technology is advancing at an incredible pace, far faster than the legal responses that can be provided!
#ai #aigen #regulation #law #dataprotection #aiact #fdul #lpl