CMLC 2026 : 12th Workshop on the Challenges in the Management of Large Corpora

Link: https://corpora.ids-mannheim.de/cmlc-2026.html

When	May 11, 2026 - May 16, 2026
Where	Palma de Mallorca, Spain
Submission Deadline	Feb 16, 2026
Notification Due	Mar 12, 2026
Final Version Due	Mar 30, 2026

Categories: NLP, artificial intelligence, computational linguistics

Call For Papers

12th Workshop on the Challenges in the Management of Large Corpora (CMLC)

1st Call for Papers

The next meeting of CMLC will be held as part of the LREC-2026 conference in Palma, Mallorca.

See https://corpora.ids-mannheim.de/cmlc-2026.html for up-to-date information.

Workshop description

As in the previous CMLC meetings, we wish to explore common areas of interest across a range of issues in language resource management, corpus linguistics, natural language processing, natural language generation, and data science.

Large textual datasets require careful design, collection, cleaning, encoding, annotation, storage, retrieval, and curation to be of use for a wide range of research questions and to users across a number of disciplines. A growing number of national and other very large corpora are being made available, many historical archives are being digitised, numerous publishing houses are opening their textual assets for text mining, and many billions of words can be quickly sourced from the web and online social media.

A mixed blessing of the times is that many such texts, in mono- and multilingual arrangements, can now be created automatically by exploiting Large Language Models at various scales. On the one hand, this makes it possible to inflate the amount of data where data would normally be scarce: in under-resourced languages or language varieties, in specific genres, or for intricate and rarely attested constructions. On the other hand, such procedures immediately raise concerns regarding the authenticity and quality of the data, casting doubt on the possibility of adequately (truthfully, verifiably, reproducibly) addressing the kind of research questions that provoked the rapid but tainted increase in available data volumes in the first place. Similar doubts may be directed at the mass creation of secondary and tertiary data ordinarily crucial for linguistic research: apart from potential legal constraints on the use of the initial human-created data, new questions arise as to the legal status of the derived data, the ways to create, for example, provenance metadata for the derived resources, and the level of trust in mass-produced grammatical (and other) annotation layers.

These new questions, along with more traditional ones, underlie the list of topics that the management of large corpora (for any currently suitable definition of “large”) raises or at least touches on.

Topics of interest

This year's event adds new items to the standard range of CMLC themes and addresses some of the LREC-2026 focus topics:

Interoperability and accessibility

• How to make corpora as accessible as possible

• Interoperable APIs for query and analysis software

• Provision of multiple levels of access for different tasks

Machine/Deep Learning

• Data preparation for machine learning input

• Creation, curation, maintenance and dissemination of language models based on machine learning (e.g. word embeddings and entire deep learning networks)

• Legal issues concerning language model distribution

Linguistic content challenges

• Dealing with the variety of language: multilinguality, minority and/or underrepresented languages, historical texts, noisy OCR texts, user-generated content, etc.

• Diversity and inclusion in language resources

• Integration of human computation (crowdsourcing) and automatic annotation

• Quality management of annotations

• Ensuring linguistic integrity of data through deduplication, correction of typos and errors, removal of incomplete or malformed sentences, and filtering harmful, offensive and toxic content, etc.

• Integrating different linguistic data types (text, audio, video, facsimiles, experimental data, neuroimaging data, …)

Technical challenges

• Storage and retrieval solutions for large text corpora: primary data (potentially including facsimiles, etc.), metadata, and annotation data

• Corpus versioning and release management

• Scalable and efficient NLP tooling for annotating and analysing large datasets: distributed and GPGPU computing; using big data analysis frameworks for language processing

• Dealing with streaming data (e.g. social media) and rapidly changing corpora

• Environmental impact of big language data computing

• Engineering and management of research software

Exploitation challenges

• Legal and privacy issues

• Query languages, data models, and standardisation

• Licensing models of open and closed data, coping with intellectual property restrictions

• Innovative approaches for aggregation and visualisation of text analytics

• Repurposing or extending application areas of existing corpora and tools

In the tradition of CMLC, we invite reports on national corpus initiatives; submitters of these reports should be prepared to present a poster.

Important dates

• Deadline for paper submission: 16.02.2026

• Notification of acceptance: 12.03.2026

• Deadline for the submission of camera-ready papers: 30.03.2026

• Meeting: details TBA

Paper submission

We invite anonymised extended abstracts for oral presentations on the topics listed above, submitted as PDFs prepared according to the LREC-2026 templates.

CMLC has always reserved a track for national corpus project reports, and to this end, we invite poster proposals of 500-750 words. National project reports need not be anonymised.

Submissions are accepted solely through the START system (URL TBA).

A volume of proceedings will be published online by ELRA.

LRE 2026 Map and the "Share your LRs!" initiative

When submitting a paper from the START page, authors will be asked to provide essential information about resources (in a broad sense, i.e. also technologies, standards, evaluation kits, etc.) that have been used for the work described in the paper or are a new result of their research. Moreover, ELRA encourages all LREC authors to share the described LRs (data, tools, services, etc.) to enable their reuse and the replicability of experiments (including evaluation ones).

Programme Committee

Names will be added as Programme Committee members confirm their participation.

Organising Committee

• Piotr Bański (IDS Mannheim)

• Dawn Knight (Cardiff University)

• Marc Kupietz (IDS Mannheim)

• Andreas Witt (IDS Mannheim)

• Alina Wróblewska (ICS PAS, Warsaw)

Homepage

The CMLC series homepage is located at http://corpora.ids-mannheim.de/cmlc.html