WMDQS 2025 : 1st Workshop on Multilingual Data Quality Signals at COLM 2025

Link: https://wmdqs.org/
 
When Oct 10, 2025 - Oct 10, 2025
Where Montréal, Canada
Submission Deadline Jun 23, 2025
Notification Due Jul 24, 2025
Categories    NLP   computational linguistics   artificial intelligence
 

Call For Papers


Dear colleagues,


We are pleased to announce the first call for papers of the
*1st Workshop on Multilingual Data Quality Signals at COLM 2025*


Important information:
🗓️ CfP Deadline: June 23, 2025; Workshop: October 10, 2025
📍 Montréal, Canada
🌐 https://wmdqs.org/


Scope

Recent research has shown that large language models (LLMs) not only need large quantities of data, but also need data of sufficient quality. Ensuring data quality is even more important in a multilingual setting, where the amount of acceptable training data in many languages is limited. Indeed, for many languages even the fundamental step of language identification remains a challenge, leading to unreliable language labels and thus noisy datasets for underserved languages.

In response to these challenges, we will be holding the first Workshop on Multilingual Data Quality Signals (WMDQS) in tandem with COLM. We invite the submission of long and short research papers related to data quality in multilingual data.

Even though most previous work on data quality has been targeted at LLM development, we believe that research in this area can also benefit other research communities in areas such as web search, web archiving, corpus linguistics, digital humanities, political sciences and beyond. We therefore encourage submissions from a wide range of disciplines.

WMDQS will also include a shared task on language identification for web text. We invite participants to submit novel systems which address current problems with language identification for web text. We will provide a training set of annotated documents sourced from Common Crawl to aid development.


Topics

We welcome submissions of (1) original research papers, (2) review/opinion papers, (3) online systems on the topics listed below, and (4) extended abstracts. We especially welcome work-in-progress projects and all novel ideas covering research in multilinguality, underserved/low-resource languages, under-represented linguistic communities and all types of work covering data quality signals. Suggested areas include:

- Data pipelines for data annotation and data filtering
- Undesirable content detection in a multilingual setting
- Multilingual or language independent content ranking
- Human annotation platforms and systems
- Multilingual tokenization mechanisms
- Small language models and embeddings
- Linguistic studies in underserved languages
- Corpus creation and curation methods, especially for underserved languages
- Machine translation
- Digital humanities
- Historical and constructed languages

Shared task

The lack of training data, especially high-quality data, is the root cause of poor language model performance for many languages. One obstacle to improving the quantity and quality of available text data is language identification (LangID or LID). LangID remains far from solved for many languages, and several of the commonly used LangID models were introduced in 2017 (e.g. fastText and CLD3). The aim of this shared task is to encourage innovation in open-source language identification and improve accuracy on a broad range of languages.

All accepted authors will be invited to contribute a larger paper, which will be submitted to a high-impact NLP venue.

Important dates for the Workshop:
Workshop paper submission deadline: June 23, 2025
Workshop paper acceptance notification: July 24, 2025
Workshop: October 10, 2025

Important dates for the Shared Task:
1st Deadline to contribute annotations: July 7, 2025
1st Annotations released (train split): July 14, 2025
Abstract Deadline: July 21, 2025
Decision Notification: July 24, 2025
Camera Ready Deadline: September 21, 2025

(All deadlines are 23:59 AoE.)


Organizers:
For any questions, please email wmdqs-pcs@googlegroups.com

Program Chairs:
Pedro Ortiz Suarez (Common Crawl Foundation)
Sarah Luger (MLCommons)
Laurie Burchell (Common Crawl Foundation)
Kenton Murray (Johns Hopkins University)
Catherine Arnett (EleutherAI)

Organizing Committee:
Thom Vaughan (Common Crawl Foundation)
Sara Hincapié (Factored)
Rafael Mosquera (MLCommons)

