Artificial Intelligence (AI) Training Data and The Copyright Dilemma: Insights for African Developers
Natasha Karanja and Chebet Koros
February 12, 2025
Artificial Intelligence
Introduction
The recent surge in the complexity of AI models can be attributed to two key factors: rapid advancements in AI algorithms1 as well as the accessibility of extensive training datasets.2 Though AI models vary in type and application, a common underpinning is the necessity of data as an essential “input”.3 The value chain for the majority of AI models is data centric, as their algorithms require ‘training data sets’ that assist the models to produce relevant applications that provide “tangible real-world use.”4
While data can be sourced from publicly accessible and legitimate repositories,5 controversy arises when such data is scraped or crawled. This leads to questions about whether the use of such data infringes copyright. Data, per se, may not be copyrighted, but the works contained within the data may be protected by copyright.6 This blog assesses the intersection of AI development and copyright law, specifically the use of datasets in AI development, through the lens of copyright infringement, and sets out the key considerations African developers should apply when developing African AI.7
AI Learning Processes and the Copyright Dilemma
The process of AI training through publicly accessible data involves web crawling and web scraping.8 Web crawling, in the context of AI training, refers to the automated process of systematically browsing the internet to collect and index data from various web pages.9 This process is facilitated by programs known as web crawlers or spiders, which navigate through the vast landscape of online content to extract valuable data.10 On the other hand, and unlike traditional web crawlers, AI-powered scrapers go a step further by extracting and storing specific content from websites.11 This content, which includes text, images, videos, and structured data, serves as the foundation for building intelligent systems.12
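The difference between the two steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: the page content, the class names `LinkCollector` and `ParagraphScraper`, and the helper functions are all hypothetical, and the HTML is an in-memory string rather than a live website. The crawler-style step only discovers links, while the scraper-style step extracts and keeps the page's text, which is where copyright in the underlying works becomes relevant.

```python
from html.parser import HTMLParser

# Hypothetical page content, used here instead of fetching a live website.
SAMPLE_PAGE = """
<html><body>
  <h1>Rainfall report</h1>
  <p>Rainfall rose in the north.</p>
  <a href="/reports/2023">2023 reports</a>
  <a href="https://example.org/about">About</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Crawler-style step: discover links that could be visited next."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class ParagraphScraper(HTMLParser):
    """Scraper-style step: extract and keep specific content (paragraph text)."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Only text that sits inside a <p> element is stored.
        if self.in_paragraph:
            self.paragraphs[-1] += data


def crawl_links(html):
    """Return the link targets found on a page, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def scrape_paragraphs(html):
    """Return the paragraph text found on a page, in document order."""
    parser = ParagraphScraper()
    parser.feed(html)
    return [text.strip() for text in parser.paragraphs]


if __name__ == "__main__":
    print(crawl_links(SAMPLE_PAGE))
    print(scrape_paragraphs(SAMPLE_PAGE))
```

The list returned by `scrape_paragraphs` is a stored copy of the page's expressive content, so it is this step, rather than link discovery, that a developer would typically want to review against the licensing terms of the source.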
This reliance on publicly available data presents a significant challenge. The effectiveness of AI depends on the availability of vast, high-quality datasets for training,13 often leading developers to scrape and crawl web data. However, the legality of these practices becomes murky when copyrighted content is involved. This issue is particularly pressing in Africa, where the digital landscape is still under development.14 The continent faces a double-edged sword: a limited data ecosystem coupled with evolving intellectual property structures.15 This means there is less readily available data to begin with, increasing the likelihood of potentially utilising copyrighted material for AI training.16 Simultaneously, the African legal frameworks governing data usage and copyright are often nascent, leaving developers with little guidance on how to navigate this complex terrain.17
The following sections delve into the copyright infringement issues, exceptions like fair dealing or fair use, and considerations for African developers seeking to navigate this evolving legal landscape responsibly and ethically.
Coding Between the Lines: Copyright Exceptions, AI Development, and the African Context
We consider the following case study that explores the use of AI within predictive disease analysis and how copyright considerations would apply to the training data utilised for the AI training models.
Case study: Predicting Malaria Prevalence with Machine Learning Models using Satellite-based Climate Information, Senegal18
This case study assessed the use of machine learning algorithms to analyze environmental factors contributing to malaria outbreaks. The model uses climate data to predict and help manage malaria prevalence: it combines data sets covering climate, historical malaria incidence and demographic information to identify predictive trends in potential malaria outbreaks, allowing for optimized resource allocation and timely interventions.19 This proactive approach not only saves lives but also alleviates the economic burden of malaria on affected communities.20 Focusing on the data sets, the developers gathered their data from publicly available repositories such as the Senegal National Malaria Control Program (NMCP-PLNP) and the Climate Hazards Center InfraRed Precipitation with Station Data (CHIRPS).21 Combining data from these repositories gives the AI model a comprehensive view of how environmental elements correlate with the prevalence of malaria cases.22
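To make the idea of combining repositories concrete, the sketch below joins a rainfall table and a malaria case table on a shared (district, month) key and computes a Pearson correlation. It is a deliberately simple stand-in for the machine learning models used in the case study, not a reproduction of them. All values are synthetic and made up for illustration, and the names `rainfall_mm`, `malaria_cases`, `join_on_key` and `pearson` are assumptions introduced here.

```python
from math import sqrt

# Synthetic, made-up values for illustration only.
# These are NOT real CHIRPS or NMCP figures.
rainfall_mm = {
    ("District A", "2020-08"): 210.0,
    ("District A", "2020-09"): 180.0,
    ("District B", "2020-08"): 95.0,
    ("District B", "2020-09"): 60.0,
}
malaria_cases = {
    ("District A", "2020-08"): 340,
    ("District A", "2020-09"): 310,
    ("District B", "2020-08"): 120,
    ("District B", "2020-10"): 75,  # no matching rainfall record
}


def join_on_key(left, right):
    """Pair up values that share the same (district, month) key."""
    shared_keys = sorted(left.keys() & right.keys())
    return [(left[key], right[key]) for key in shared_keys]


def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two pairs")
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    pairs = join_on_key(rainfall_mm, malaria_cases)
    print(f"{len(pairs)} matched records, r = {pearson(pairs):.3f}")
```

Records that appear in only one source are dropped by the join, which is one reason the provenance and coverage of each repository matters as much as its licence.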
The NMCP-PLNP and CHIRPS data sets are generally available for public use. The NMCP-PLNP includes data from nationally representative surveys, such as the Demographic and Health Survey (DHS) and the Malaria Indicator Survey (MIS), that are conducted to monitor malaria trends and program impacts.23 The data gathered is then organised to inform malaria trends and is easily accessible on platforms such as the District Health Information System 2 (DHIS2), an open-source platform for health data management.24 CHIRPS is a quasi-global rainfall dataset that spans over 35 years, from 1981 to the present.25 It combines satellite imagery with in-situ data to produce high-resolution rainfall estimates suitable for trend analysis.26 The data set is publicly available and is released under a Creative Commons licence, allowing for use without restrictions.27
However, issues may arise when the material within these repositories falls under copyright protection. For example, detailed reports and case studies published by the NMCP might have copyright protection, while the underlying data could be available for public use depending on the licensing terms specified in those documents. In such instances, the dataset itself may not be copyrighted, but any associated documentation or publication may be.
The distinction between original works and derivative works becomes crucial in this context, as the latter may require permission from the original copyright holder. It is worth noting that copyright exceptions may apply, particularly those related to research and public health. Kenya’s Copyright Act provides an exception for the use of copyrighted works by way of fair dealing for scientific research. Other jurisdictions recognize the fair use doctrine, which may apply if the use of copyrighted materials is deemed transformative and does not negatively impact the market for the original work.28 A key example is Authors Guild Inc v. Google Inc,29 which illustrates the complexities of copying in the digital age. The court recognized that Google’s scanning of books for its search engine constituted a transformative use that qualified as fair use; however, this reasoning does not directly translate to AI training data.30
This case study highlights the intricate balance between copyright protection and AI innovation. As AI-driven models increasingly rely on diverse datasets, navigating copyright exceptions and licensing frameworks will be crucial in fostering responsible and legally sound AI development, particularly in the African context.
Key Considerations for Developers
As developers create and deploy artificial intelligence technologies, particularly in sensitive sectors like healthcare, they must account for several critical factors to ensure ethical, legal, and effective outcomes. Key considerations include:
Open Licensing
Open licensing presents a compelling alternative to traditional copyright restrictions, particularly in the context of African datasets. Instead of prohibiting certain uses of copyrighted material, open licenses, such as the Nwulite Obodo Open Data License (NOODL), grant specific permissions upfront.31 This fosters sharing and collaboration while still protecting the rights of creators.32 NOODL, for instance, employs a tripartite agreement structure, involving the licensor and licensees from both developing and developed nations.33 This framework introduces a broader perspective on compensation, moving beyond traditional royalties to encompass reciprocal data sharing, capacity building, and community development projects.34 By adhering to the terms of an open license like NOODL, users can avoid copyright infringement even when engaging in activities like reproduction, adaptation, or distribution of the licensed material.35 The result is a more nuanced and inclusive model of copyright management, one that promotes equitable access, fair compensation, and community participation in the use of African datasets.36
Fostering Transparency and Accountability
Transparency in AI algorithms and decision-making processes is paramount. Developers should clearly explain how their AI models function and the data they utilize.37 This transparency allows for scrutiny and identification of potential copyright infringements. For instance, if an AI model is trained on copyrighted data without proper licensing, transparency measures can help reveal this and initiate appropriate action. Further, knowing the data used in training can help assess potential biases and ethical implications.38 Additionally, clear lines of responsibility for AI systems’ actions are essential. When an AI infringes on copyright, it should be clear who is accountable: the developer, the user, or both.39
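One practical way to support this kind of transparency is to keep a simple manifest of every dataset used in training, together with its source and licence, and to flag entries that have not been cleared. The sketch below is one possible shape for such a record, not an established standard: the `DatasetRecord` fields, the dataset names, the example URLs, and the `CLEARED_LICENCES` allow-list are all hypothetical, and a real allow-list would need to come from legal review rather than from code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """One entry in a training-data manifest."""
    name: str
    source: str
    licence: str  # licence label as recorded by the developer


# Hypothetical allow-list of licence labels already reviewed for this project.
CLEARED_LICENCES = {"CC0-1.0", "CC-BY-4.0", "NOODL"}

# Hypothetical manifest entries, for illustration only.
MANIFEST = [
    DatasetRecord("rainfall-estimates", "https://example.org/rainfall", "CC0-1.0"),
    DatasetRecord("community-language-corpus", "https://example.org/corpus", "NOODL"),
    DatasetRecord("scraped-news-articles", "https://example.org/news", "unknown"),
]


def needs_review(records):
    """Return the names of datasets whose licence has not been cleared."""
    return [record.name for record in records if record.licence not in CLEARED_LICENCES]


if __name__ == "__main__":
    for name in needs_review(MANIFEST):
        print(f"Licence review needed before training: {name}")
```

Keeping the manifest alongside the model makes it easier to answer later questions about what the model was trained on and who was responsible for clearing each source.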
Familiarize Yourself with Local Laws
Developers must understand the copyright laws relevant to their region. This includes the limitations and exceptions to copyright, licensing agreements, and the implications of using copyrighted material in AI training. For example, a bill introduced in the US Congress aims to force AI companies to disclose the use of copyrighted material in their training data,40 highlighting the growing legal scrutiny in this area. Within the African context, what often exists is a fair dealing provision that provides for exceptions to copyright infringement within a closed list of uses – “scientific research, education, private use, criticism or review of current events” – subject to attribution to the author of the copyrighted work.41 However, the Kenyan Copyright Act of 2001 does not explicitly address AI within its fair dealing provisions. There is therefore a pressing need for legal reforms that clarify the scope of fair dealing in the context of AI, ensuring that creators can use existing works to foster new innovations while respecting the rights of original authors. In South Africa, the Copyright Amendment Bill proposes to introduce fair use and to clarify the conditions under which it can be applied, particularly in educational and research contexts.42 Understanding these local laws is essential for developers to mitigate legal risks and ensure that their practices align with the evolving legal landscape.
The legal landscape surrounding AI and copyright is constantly evolving, necessitating that developers stay informed about changes in legislation and adapt their practices accordingly. This evolving regulatory environment underscores the importance of legal awareness for developers, as failure to comply with new laws can result in significant legal repercussions.
Conclusion
As AI development continues to advance, the use of copyrighted material in training data presents significant legal and ethical challenges. Developers must navigate the complexities of copyright law, including issues related to copying, consent, and exceptions like fair dealing or fair use. By prioritizing licensing, transparency, and accountability, developers can mitigate the risk of infringement and contribute to a more sustainable and ethical AI ecosystem. As Africa’s digital and legal landscapes evolve, developers have a unique opportunity to lead responsibly and innovatively.
1 “AI models” refers to the various types of model available, including machine learning, neural networks, deep learning, natural language processing and generative AI.
2 Guadamuz A, A Scanner Darkly: Copyright Liability and Exceptions in Artificial Intelligence Inputs and Outputs [2024] GRUR International, Journal of European and International IP Law, 2.
3 ibid.
4 Wills K, AI around the World: Intellectual Property Law Considerations and beyond [2022] 102 J Pat & Trademark Off Soc’y 186, 186.
5 Guadamuz (n2).
6 Tyagi K, Copyright, text & data mining and the innovation dimension of generative AI [2024] Journal of Intellectual Property Law & Practice, Vol 19:7, 563.
7 Opderbeck W D, Copyright in AI training data: A human-centered approach [2024] Oklahoma Law Review, Vol 76:4, 951; copyright infringement in relation to AI training data includes “copying, consent and transitory reproduction”.
8 ibid.
9 Tyagi (n6).
10 ibid.
11 ibid.
12 ibid.
13 Cuntz A, Carsten F & Stamm H, Artificial Intelligence and Intellectual Property: An Economic Perspective [2024] WIPO Economic Research Working Paper No. 77/2024, 21.
14 Ajadi S, Can AI help tackle the most pressing challenges in developing countries? GSM Association <https://www.gsma.com/mobilefordevelopment/region/africa/can-ai-help-tackle-the-most-pressing-challenges-in-developing-countries/> last accessed 20th September 2024.
15 Okaibedi D, Kutoma W & Akintoye S, Responsible AI in Africa: Challenges and Opportunities; Ade-Ibijola A & Okonkwo C, Artificial Intelligence in Africa: Emerging Challenges (Palgrave Macmillan, 2022) 106.
16 Cuntz (n13); “Many modern AI tools draw on a multitude of data source”
17 Oriakhogba O D, The Right to Research in Africa: Making African Copyright Whole [2022] PIJIP Research Paper 78, 5 <https://digitalcommons.wcl.american.edu/cgi/viewcontent.cgi?article=1080&context=research> last accessed 19th September 2024; “existing copyright regimes in Africa, as currently formulated, are not fit for purpose and cannot secure the public interest in the sense that they are incapable of … promotion of access to information … especially in this new era of AI”
18 Ileperuma K, Jampani M, Sellahewa U, Panjwani S & Amarnath G, Predicting malaria prevalence with machine learning models using satellite-based climate information: technical report [2023] Colombo, Sri Lanka: International Water Management Institute (IWMI) CGIAR Initiative on Climate Resilience.
19 ibid.
20 ibid.
21 ibid.
22 ibid.
23 ibid.
24 The Demographic and Health Survey <https://dhsprogram.com/methodology/survey/survey-display-587.cfm>
26 ibid.
27 ibid.
28 Band J & Gerafi J, The Fair Use / Fair Dealing Handbook [2023] policy bandwidth, 1: “More than 40 countries with over one-third of the world’s population have fair use or fair dealing provisions in their copyright laws.”
29 No. 13-4829-CV (2d Cir. Oct 16, 2015)
30 ibid.
31 Nwulite Obodo Open Data License <https://datasciencelawlab.africa/nwulite-obodo-open-data-license/>
32 ibid.
33 ibid.
34 ibid.
35 ibid.
36 ibid.
37 Birhane A, Steed R, Ojewale V, Vecchione B & Raji I D, AI auditing: The Broken Bus on the Road to AI Accountability, 2nd IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) 2024.
38 ibid.
39 ibid.
40 The proposed Artificial Intelligence Bill introduced in the US Congress, would force AI companies to reveal use of copyrighted art <https://www.theguardian.com/technology/2024/apr/09/artificial-intelligence-bill-copyright-art>
41 Kenyan Copyright Act Cap 130 <https://kenyalaw.org/kl/fileadmin/pdfdownloads/Acts/CopyrightAct_No12of2001.pdf>
42 South African Copyright Amendment Bill 2022 <https://www.parliament.gov.za/storage/app/media/uploaded-files/Copyright%20Amendment%20Bill%20Draft.pdf>