
[IJCAI-2020] FinSBD-2 Shared Task 2020 : Sentence Boundary Detection in PDF Noisy Text in the Financial Domain


Link: https://sites.google.com/nlg.csie.ntu.edu.tw/finnlp2020/shared-task-finsbd-2
 
When Mar 13, 2020 - May 8, 2020
Where Yokohama, Japan
Submission Deadline May 15, 2020
Categories    segmentation/tokenization   NLP   text preprocessing   machine learning
 

Call For Papers

Greetings,

We would like to invite you to submit to FinSBD-2, the 2nd shared task
on Sentence Boundary Detection in PDF Noisy Text in the Financial Domain, in
conjunction with IJCAI-PRICAI 2020, July 11-13th, 2020, Yokohama, Japan!

Call for Participation: FinSBD-2
https://sites.google.com/nlg.csie.ntu.edu.tw/finnlp2020/shared-task-finsbd-2 [2]

Register here: https://forms.gle/NixDGuVjrdFMjYhR9 [4]

Collocated with FIN-NLP 2020 workshop: http://finnlp.nlpfin.com [1]

Submission deadline: May 8, 2020

Workshop date: IJCAI-PRICAI 2020 @ July 11-13th, 2020, Yokohama, Japan

Motivation
========

Sentences

Sentences are basic units of the written language. Detecting the beginning and
end of sentences, or sentence boundary detection (SBD), is the foundational
first step in many Natural Language Processing (NLP) applications such as POS
tagging; syntactic, semantic, and discourse parsing; information extraction; or
machine translation.

Despite its important role in NLP, sentence boundary detection has so far not
received enough attention, especially for noisy text extracted from
machine-readable files (generally in PDF format) such as financial documents.
These documents also contain many visual demarcations, such as bullets and
numbering, that indicate a hierarchy of sections. They include many sentence
fragments and titles, not just complete sentences. Financial prospectuses more
often than not contain punctuation errors, and lists are often used to present
dense information in a more easily read format.

Lists

This year, we have included the task of extracting lists due to their unique
structure and common occurrence in financial documents.

A list can be similar to a sentence that enumerates several items of the same
category. For example, the “Simple List” in Figure 1 [6] can easily be read as
one normal sentence. The list in Figure 2 [6], however, cannot be read as one
sentence, although it is one unit, because it includes multiple sentences and
has a visible hierarchy of information. It is therefore important to make
the distinction between sentences and lists and, for these lists, to create a
hierarchy that organizes the items. Mastering this distinction and item hierarchy
can pave the way for more accurate information extraction.

Task Description
=============

Last year we organized the first edition of FinSBD focusing on extracting
well-segmented sentences from financial prospectuses in PDF format by detecting
their beginning and ending boundaries in two languages: English and French. In
addition to an improved version of the previously proposed task, this year we
are extending this task to include the detection of lists and list items, as
well as their hierarchy.

FinSBD-2 is split into two sub-tasks:
- Extracting sentence boundaries, including list and list item boundaries.
- Organizing the list items hierarchically.

For each given PDF, a JSON will be provided containing:
- text extracted (key "text")
- sentence boundaries (key "sentence")
- list boundaries (key "list")
- list item boundaries (key "item")
- list item boundaries of level 1 (key "item1")
- list item boundaries of level 2 (key "item2")
- list item boundaries of level 3 (key "item3")
- list item boundaries of level 4 (key "item4")

Every boundary listed under "item" also appears under one of the level keys
("item1" to "item4"). The level of an item represents its depth within the list.

Boundaries are represented by indexes of starting and ending characters that the
system has to predict.

We also included the PDF coordinates of each boundary as metadata (which can
be used for visualization on the PDF if needed).

Example
=======
{
"text": "Ce document fournit des informations aux investisseurs ...",
"sentence": [{"start": 17, "end": 53, "coordinates":...}, ...],
"list": [{"start": 1080, "end": 1267, "coordinates":...}, ...],
"item": [...],
"item1": [...],
"item2": [...],
"item3": [...],
"item4": [...]
}
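
As a minimal sketch of how the character indexes can be used, the snippet below
loads a JSON object with the keys described above and recovers the text of each
boundary. The toy document is invented for illustration and is not taken from
the dataset. The helper name `extract_spans` and its `end_inclusive` parameter
are hypothetical. The call does not state whether "end" is inclusive or
exclusive, so the default shown here (exclusive) is an assumption to check
against the annotation guidelines.

```python
import json


def extract_spans(doc, key, end_inclusive=False):
    """Return the text covered by each boundary stored under `key`.

    `end_inclusive` is an assumption switch: set it to True if the
    "end" index points at the last character of the span.
    """
    text = doc["text"]
    offset = 1 if end_inclusive else 0
    return [text[b["start"]:b["end"] + offset] for b in doc[key]]


# Toy document, invented for illustration (not real task data).
doc = json.loads("""
{
  "text": "Hello world. Bye now.",
  "sentence": [{"start": 0, "end": 12}, {"start": 13, "end": 21}]
}
""")

sentences = extract_spans(doc, "sentence")
print(sentences)
```

The same helper can be called with "list" or "item" as the key, since all
boundary types share the "start"/"end" layout.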

Sub-task 1 consists in predicting boundaries of sentences, lists and list items.

Sub-task 2 consists in predicting boundaries of item1, item2, item3 and item4.
We can also see sub-task 2 as refining item boundaries into 4 classes of
boundaries (item = item1 + item2 + item3 + item4).
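
The relation item = item1 + item2 + item3 + item4 can be sketched as follows.
The function names and the toy offsets are hypothetical. The code only assumes
that each level key holds a list of objects with "start" and "end", as in the
example above.

```python
def item_levels(doc, max_level=4):
    """Map each (start, end) item boundary to its depth (1..max_level)."""
    levels = {}
    for depth in range(1, max_level + 1):
        for b in doc.get(f"item{depth}", []):
            levels[(b["start"], b["end"])] = depth
    return levels


def levels_cover_items(doc):
    """Check that "item" is exactly the union of item1..item4."""
    items = {(b["start"], b["end"]) for b in doc.get("item", [])}
    return items == set(item_levels(doc))


# Toy boundaries, invented for illustration (not real task data).
doc = {
    "item": [
        {"start": 10, "end": 40},
        {"start": 41, "end": 60},
        {"start": 61, "end": 80},
        {"start": 81, "end": 100},
    ],
    "item1": [{"start": 10, "end": 40}, {"start": 81, "end": 100}],
    "item2": [{"start": 41, "end": 60}, {"start": 61, "end": 80}],
    "item3": [],
    "item4": [],
}

levels = item_levels(doc)
print(levels)
print(levels_cover_items(doc))
```

Seen this way, sub-task 2 amounts to assigning one of four depth labels to each
boundary already found under "item".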

Last year, participants were only given indexes of tokens. This year, we are
providing indexes of characters as well as coordinates of boundaries to allow
different kinds of character or word tokenization and/or possible usage of
spatial and visual cues. Therefore, we hope to encourage novel approaches based
on multimodality, especially since lists are often spatially structured to
convey information visually.

Improved annotation guidelines will also be provided to explain how the new and
richer dataset was created. Participants can choose to work on both languages,
or submit systems for one language only. They can participate in one or both
sub-tasks.

This task is open to everyone. The only exceptions are the co-chairs of the
organizing team, who cannot submit a system, and who will serve as an authority
to resolve any disputes concerning ethical issues or completeness of system
descriptions.

Evaluation
========
For each sub-task, the evaluation metrics will be computed based on boundaries,
which are pairs of character indexes ("start" and "end"). The F-score will be
the official metric and an evaluation script will be provided to all the teams.
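
Until the official script is released, a rough local check can be sketched as
below. This is not the official evaluation script. It assumes that a predicted
boundary counts as correct only when both "start" and "end" match a gold
boundary exactly, and that the F-score is the balanced F1. The function name
`boundary_f1` and the toy boundaries are hypothetical.

```python
def boundary_f1(gold, pred):
    """F1 over exact (start, end) matches; an assumed scoring rule."""
    gold_set = {(b["start"], b["end"]) for b in gold}
    pred_set = {(b["start"], b["end"]) for b in pred}
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy boundaries, invented for illustration (not real task data).
gold = [{"start": 0, "end": 12}, {"start": 13, "end": 21}]
pred = [{"start": 0, "end": 12}, {"start": 13, "end": 20}]

score = boundary_f1(gold, pred)
print(score)  # 0.5: one of two predictions matches one of two gold boundaries
```

Under this assumed rule, a boundary that is off by a single character earns no
credit, so the handling of the "end" index matters when producing predictions.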

Prize
====
A USD 1,000 prize will be awarded to the best-performing teams.

Important dates
============
First announcement of the shared task and beginning of registration: 13 March
Release of training data and scoring script: before 30 March
Test set made available: 1 May
Registration deadline: 8 May
Systems' outputs collected: 8 May
Shared task system paper submissions due: 15 May
Notification of acceptance: 31 May
Camera-ready version of shared task system papers due: 15 June
FinNLP 2020 Workshop: 11-13 July

Contact
======
For any questions on the shared task, please contact us at fin.sbd.task@gmail.com [5]

Shared Task Organizing committee
===========================
Abderrahim AIT-AZZI, Fortia Financial Solutions
Willy AU, Fortia Financial Solutions
Bianca CHONG, Fortia Financial Solutions
Dialekti VALSAMOU-STANISLAWSKI, Fortia Financial Solutions

Sincerely,

The FinSBD Organizers


[1] FinNLP: http://finnlp.nlpfin.com
[2] FinSBD-2: https://sites.google.com/nlg.csie.ntu.edu.tw/finnlp2020/shared-task-finsbd-2
[3] IJCAI-20: https://ijcai20.org/
[4] Registration form: https://forms.gle/NixDGuVjrdFMjYhR9
[5] mailto: fin.sbd.task@gmail.com
[6] Figures: https://sites.google.com/nlg.csie.ntu.edu.tw/finnlp2020/shared-task-finsbd-2#h.p_C3-XwKh04H-F
