Mozilla has released a new suite of open-source toolkits designed to help developers create ethical, transparent datasets for training artificial intelligence systems—without relying on copyrighted material.
Many of today’s large language models (LLMs) are trained on massive datasets scraped from the internet, often including copyrighted content used without consent. This practice raises serious legal and ethical concerns. Mozilla’s new release directly addresses this issue by offering tools that support responsible data collection and AI development.
The toolkits, developed in partnership with EleutherAI over the past year, are hosted on Mozilla.ai Blueprints—a platform focused on open-source prototyping for AI applications. They provide developers with ready-to-use workflows, source code, and examples tailored for ethical dataset construction.

“Just like the early days of open-source software, today’s open data ecosystem relies on community values and shared responsibility,” said Ayah Bdeir, Senior Advisor for AI Strategy at the Mozilla Foundation.
“These toolkits are designed to make dataset creation more accessible and ethical, while laying the groundwork for responsible AI infrastructure. Our collaboration with EleutherAI reflects our joint commitment to this mission.”
Toolkit 1: Self-Hosted Audio Transcription with Whisper
The first toolkit enables developers to transcribe audio locally using a self-hosted version of the open-source Whisper model, accessed via a server called Speaches. This setup replicates the functionality of commercial transcription APIs while keeping all data under the user’s control.
This is particularly important for developers working with sensitive or private audio data that must not be shared with third-party cloud providers. The toolkit offers a streamlined deployment process using either Docker or a command-line setup, making it accessible even to developers with limited infrastructure experience.
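As a rough illustration of what "replicating a commercial transcription API" can look like in practice, the sketch below posts an audio file to a locally running server using only the Python standard library. It assumes the server accepts OpenAI-style requests at /v1/audio/transcriptions and returns JSON with a "text" field; the address, port, and model name are placeholders, not values taken from the toolkit, so check them against your own deployment.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path

# Assumed values: adjust both to match your own deployment.
BASE_URL = "http://localhost:8000"         # hypothetical local server address
MODEL_ID = "Systran/faster-whisper-small"  # hypothetical model identifier


def build_multipart(fields: dict[str, str], filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode form fields plus one file as a multipart/form-data body.

    Returns the encoded body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    file_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: {file_type}\r\n\r\n'.encode()
        + data
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(audio_path: str, base_url: str = BASE_URL, model: str = MODEL_ID) -> str:
    """Send one audio file to the local server and return the transcript text."""
    path = Path(audio_path)
    body, content_type = build_multipart({"model": model}, path.name, path.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",  # assumed OpenAI-style route
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["text"]
```

With a server running, transcribe("interview.wav") would return the transcript as a string. Because the request only ever goes to the local address, the audio never leaves the machine.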
Toolkit 2: Converting Complex Documents to Markdown with Docling
The second toolkit tackles another critical task in AI development: converting diverse document types into clean, standardized text formats.
The command-line tool, Docling, transforms PDFs, Word files, HTML documents, and more into structured Markdown. It also includes OCR capabilities for scanned documents and images with embedded text, along with basic image handling. This makes it especially useful for building open-text corpora to train LLMs or develop retrieval-augmented generation (RAG) systems.
Designed for scale, Docling supports batch processing for handling large volumes of documents efficiently—essential for teams working on substantial data pipelines.
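The batch workflow described above can be sketched in a few lines of Python. This is a minimal example rather than the toolkit's own code: it uses Docling's Python API (DocumentConverter and export_to_markdown, available after installing the docling package) instead of the command-line tool, and the helper names, the suffix list, and the folder layout are all illustrative assumptions.

```python
from pathlib import Path
from typing import Callable

# Illustrative subset of file types; not Docling's full list of supported formats.
SUPPORTED = {".pdf", ".docx", ".html"}


def make_docling_converter() -> Callable[[Path], str]:
    """Return a function that converts one file to Markdown using Docling.

    The import is done lazily so the rest of this module works without Docling.
    """
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()  # created once and reused for every file

    def convert(path: Path) -> str:
        return converter.convert(str(path)).document.export_to_markdown()

    return convert


def convert_folder(src: Path, dst: Path, convert: Callable[[Path], str]) -> list[Path]:
    """Convert every supported file under src, mirroring the tree under dst.

    Each output keeps the original file name and adds ".md", so report.pdf and
    report.docx in the same folder do not overwrite each other.
    """
    written = []
    for path in sorted(src.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED:
            continue
        relative = path.relative_to(src)
        out = dst / relative.parent / (relative.name + ".md")
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(convert(path), encoding="utf-8")
        written.append(out)
    return written
```

A typical call would be convert_folder(Path("raw_docs"), Path("corpus_md"), make_docling_converter()), where both folder names are hypothetical. Passing the converter in as a function keeps the directory walk separate from Docling itself, which makes it easy to swap in another converter or to test the walk on its own.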
Research-Driven Tools for Ethical AI

The launch is part of a broader Mozilla–EleutherAI collaboration focused on developing best practices for open LLM datasets. This effort included convening over 30 researchers and practitioners from open-source AI startups, nonprofit labs, and civil society groups to define clear standards for ethical dataset creation.
Their findings were published in the research paper “Towards Best Practices for Open Datasets for LLM Training”, and the toolkits provide a direct pathway for developers to put these practices into action.
In today’s AI landscape, concerns over legal liability often discourage transparency about training data sources, limiting innovation and accountability. Mozilla and EleutherAI argue that the path forward lies in creating publicly accessible, responsibly curated datasets built on open licenses and clear usage rights.
“Openness and transparency are the future of AI,” said Stella Biderman, Executive Director at EleutherAI.
“By equipping developers with practical, usable tools, we’re laying the groundwork for AI systems that are more interpretable, secure, and trustworthy.”
Empowering Developers to Build AI Responsibly
The new Mozilla–EleutherAI toolkits mark a significant step toward democratizing ethical AI development. They lower the barrier to entry for building datasets that respect legal boundaries and societal values, helping developers and organizations train models on a more responsible foundation.
As the industry grapples with the consequences of opaque training data, Mozilla’s open-source approach offers a compelling alternative—one rooted in transparency, collaboration, and long-term trust.