WMDQS 2025 : 1st Workshop on Multilingual Data Quality Signals at COLM 2025

When	Oct 10, 2025 - Oct 10, 2025
Where	Montréal, Canada
Submission Deadline	Jun 23, 2025
Notification Due	Jul 24, 2025

Categories NLP computational linguistics artificial intelligence

Call For Papers

Dear colleagues,

We are pleased to announce the first call for papers of the
*1st Workshop on Multilingual Data Quality Signals at COLM 2025*

Important information:
🗓️ CfP Deadline: June 23, Workshop: October 10
📍 Montréal, Canada
🌐 https://wmdqs.org/

Scope

Recent research has shown that large language models (LLMs) not only need large quantities of data, but also need data of sufficient quality. Ensuring data quality is even more important in a multilingual setting, where the amount of acceptable training data in many languages is limited. Indeed, for many languages even the fundamental step of language identification remains a challenge, leading to unreliable language labels and thus noisy datasets for underserved languages.

In response to these challenges, we will be holding the first Workshop on Multilingual Data Quality Signals (WMDQS) in tandem with COLM. We invite the submission of long and short research papers related to data quality in multilingual data.

Even though most previous work on data quality has been targeted at LLM development, we believe that research in this area can also benefit other research communities in areas such as web search, web archiving, corpus linguistics, digital humanities, political sciences and beyond. We therefore encourage submissions from a wide range of disciplines.

WMDQS will also include a shared task on language identification for web text. We invite participants to submit novel systems which address current problems with language identification for web text. We will provide a training set of annotated documents sourced from Common Crawl to aid development.

Topics

We welcome submissions of (1) original research papers, (2) review/opinion papers, (3) online systems on the topics listed below, and (4) extended abstracts. We especially welcome work-in-progress projects and all novel ideas covering research in multilinguality, underserved/low-resource languages, under-represented linguistic communities and all types of work covering data quality signals. Suggested areas include:

- Data pipelines for data annotation and data filtering
- Undesirable content detection in a multilingual setting
- Multilingual or language independent content ranking
- Human annotation platforms and systems
- Multilingual tokenization mechanisms
- Small language models and embeddings
- Linguistic studies in underserved languages
- Corpus creation and curation methods, especially for underserved languages
- Machine translation
- Digital humanities
- Historical and constructed languages

Shared task

The lack of training data, especially high-quality data, is the root cause of poor language model performance for many languages. One obstacle to improving the quantity and quality of available text data is language identification (LangID or LID), which remains far from solved for many languages. Several of the commonly used LangID models were introduced in 2017 (e.g. fastText and CLD3). The aim of this shared task is to encourage innovation in open-source language identification and improve accuracy on a broad range of languages.

All accepted authors will be invited to contribute a larger paper, which will be submitted to a high-impact NLP venue.

Important dates for the Workshop:
Workshop paper submission deadline: June 23, 2025
Workshop paper acceptance notification: July 24, 2025
Workshop: October 10, 2025

Important dates for the Shared Task:
1st Deadline to contribute annotations: July 7, 2025
1st Annotations released (train split): July 14, 2025
Abstract Deadline: July 21, 2025
Decision Notification: July 24, 2025
Camera Ready Deadline: September 21, 2025

(All deadlines are 23:59 AoE.)

Organizers:
For any questions, please email wmdqs-pcs@googlegroups.com

Program Chairs:
Pedro Ortiz Suarez (Common Crawl Foundation)
Sarah Luger (MLCommons)
Laurie Burchell (Common Crawl Foundation)
Kenton Murray (Johns Hopkins University)
Catherine Arnett (EleutherAI)

Organizing Committee:
Thom Vaughan (Common Crawl Foundation)
Sara Hincapié (Factored)
Rafael Mosquera (MLCommons)
