
SPMRL-SANCL 2014 : Special track on the Syntactic Analysis of Non-Canonical Language


When: Aug 24, 2014
Where: Dublin
Submission Deadline: May 2, 2014
Notification Due: Jun 6, 2014
Final Version Due: Jun 27, 2014
Categories: syntax, statistical parsing, non-canonical language, syntactic annotation

Call For Papers

Special track on the Syntactic Analysis of Non-Canonical Language


The SANCL special track will be part of the Joint Workshop on
Statistical Parsing of Morphologically Rich Languages and Syntactic
Analysis of Non-Canonical Languages - SPMRL-SANCL 2014

Co-located with COLING 2014, August 24 in Dublin, Ireland

Important dates (updated!)

Submission deadline: June 06, 2014
Author notification: July 01, 2014
Camera-ready deadline: July 13, 2014
Workshop: Aug 24, 2014

Main workshop:

SANCL Special Track:

SANCL Poster submissions
In addition to regular paper submissions, we solicit poster submissions
addressing the syntactic analysis of frequent phenomena of non-canonical
languages which are difficult to annotate and parse using conventional
annotation schemes. Cases in point are the representation of verbless
utterances in a dependency scheme, the pros and cons of different
representations of disfluencies for statistical parsing, and the analysis
of complex hashtags which incorporate and merge different syntactic
arguments into one token.

Poster submissions should focus on one or more of the topics listed
below. They should either be submitted as a short paper (up to 7
single-column pages + references, to be included in the proceedings
and presented as a poster at the workshop) or be submitted as an
abstract (max. 500 words excluding examples/references, to be presented
as a poster at the workshop). Abstract submissions should sketch an
analysis for a given problem while short paper submissions should also
present at least preliminary experimental results showing the
feasibility of the approach.

Topics for poster submissions:

Unit of analysis
For canonical, written text the relevant unit for syntactic analysis
is defined by the sentence boundaries. In CMC (computer-mediated
communication), on the other hand, sentence boundaries are not always
marked in a systematic way, and for spoken language, we cannot fall back
on sentence boundaries at all. Decisions concerning the relevant unit of
analysis will influence corpus-linguistic research (e.g. measures like
sentence length, syntactic complexity) as well as parsing results. On
the token level, it is also not clear what should be used as the unit of
analysis. In spoken language as well as in conceptually spoken registers
like CMC, multiple tokens are often merged into one new token (2,4-6), or
long compound words are split into separate units (5). It is not yet clear
whether it is preferable to address these issues during preprocessing,
e.g. by tokenizing and normalising the text, or whether this would result
in a "lossy translation", as argued by Owoputi et al. 2013, which should
be avoided.

(1) @Hii_ImFruiity nuin much at all juss chillin waddup w yu ?
-- Owoputi et al. 2013: OCT27 data set
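
To make the tokenization question concrete, the following minimal Python
sketch keeps the surface tokens of example (1) alongside a normalised layer
instead of overwriting them. The lexicon NORMALISATION and the Token class
are hypothetical and introduced only for illustration; the normalised forms
are guesses, not gold annotations from the Owoputi et al. 2013 data.

```python
from dataclasses import dataclass

# Hypothetical normalisation lexicon; the entries are illustrative guesses.
NORMALISATION = {
    "nuin": ["nothing"],
    "juss": ["just"],
    "chillin": ["chilling"],
    "waddup": ["what", "is", "up"],  # one surface token, several units
    "w": ["with"],
    "yu": ["you"],
}


@dataclass
class Token:
    surface: str      # form exactly as it appears in the tweet
    normalised: list  # one or more normalised units for this token


def tokenize(text):
    """Split on whitespace and attach a normalised layer to each token."""
    return [Token(tok, NORMALISATION.get(tok.lower(), [tok]))
            for tok in text.split()]


tweet = "@Hii_ImFruiity nuin much at all juss chillin waddup w yu ?"
tokens = tokenize(tweet)

# Two candidate units of analysis for the same utterance.
surface_units = [t.surface for t in tokens]
normalised_units = [u for t in tokens for u in t.normalised]
```

Keeping both layers means the original tweet can still be reconstructed
from the surface forms, while a parser could be run on either layer.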

We ask for contributions on the optimal unit of analysis for non-canonical
languages which do not come already separated into sentence-like units
(e.g. spoken language, tweets, historical data), and for contributions
on best practices for tokenizing spoken language and CMC.

Elliptical structures and missing elements
Non-canonical languages often include sentences where syntactic arguments
are not expressed at the surface level. This raises the question of how
we can provide a meaningful analysis for these structures, especially
in a dependency grammar framework. One way to deal with the problem is
to insert missing predicates as dummy verbs into the tree to be able
to provide a dependency analysis for these structures (e.g. Seeker &
Kuhn 2012; Dipper, Lüdeling & Reznicek 2013, see NoSta-D annotation
guidelines). The question remains whether this approach is feasible
for automatic processing, especially for the highly underspecified and
ambiguous input often provided by non-canonical languages, or whether a
constituency-based
analysis offers more elegant means to analyse elliptical structures.

We ask for contributions discussing the optimal representation for
elliptical structures.

(2) Doesn't change the result though. -- From DCU's Football Treebank

Hashtags & friends
Newly emerging text types from social media have triggered new,
creative means of communication which help users to overcome the
limitations of expressing themselves in a written medium. Twitter hashtags
are one case in point, not only allowing the users to add a semantic tag
to their tweet, but also to add comments, context information, irony
and sarcasm, to express personal feelings, or to evaluate. Formally,
they are not bound to one particular part-of-speech but can include
whole phrases or sentences, which implies that the common practice of
tagging them with the label HASHTAG does not do them justice. This is
even more so the case for hashtags encoding one or more arguments of the
predicate, as in (3). Hashtags provide a rich source of information
which has already been exploited in sentiment analysis and opinion
mining (e.g. Mohammad et al. 2013, Kunneman et al. 2013; also see for
an overview of the different functions of hashtags). We are interested in
approaches towards a syntactic analysis of hashtags (and related phenomena
such as complex inflective constructions in German CMC (Schlobinski
2001)) which allow us to make better use of the information encoded in
hashtags. What are the new challenges for analysing these phenomena? What
can be learned from research on similar phenomena, e.g. on multiword
expressions (MWEs)?

(3) #itsnothebeer I don't like but the taste -- From Twitter
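
As a rough illustration of why a single HASHTAG token hides structure, the
sketch below enumerates every way of splitting the hashtag from example (3)
into words from a small vocabulary. The vocabulary VOCAB and the function
segmentations are hypothetical and chosen only for this example; no claim is
made about which reading the author of the tweet intended.

```python
def segmentations(s, vocab):
    """Return every way to split s into a sequence of words from vocab."""
    if not s:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(s) + 1):
        prefix = s[:end]
        if prefix in vocab:
            for rest in segmentations(s[end:], vocab):
                results.append([prefix] + rest)
    return results


# Hypothetical toy vocabulary, for illustration only.
VOCAB = {"its", "no", "not", "the", "he", "beer"}

hashtag = "#itsnothebeer"
candidates = segmentations(hashtag.lstrip("#"), VOCAB)
```

With this toy vocabulary the hashtag has two candidate segmentations
("its no the beer" and "its not he beer"), so even the token-internal
structure is ambiguous before any syntactic analysis is attempted.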

Disfluencies
Disfluencies (e.g. fillers, repairs) are a common phenomenon in spoken
language and also occur in written, but conceptually spoken language
such as CMC.

(4) He uh graduated from medical school this year and uh, I mean he's
in uh, ... Soho in New York.
-- SBC046, Du Bois et al. 2000: Santa Barbara corpus of spoken
American English

There are different ways of representing disfluencies. In the Switchboard
corpus, fillers are included in the tree, and for repairs, both the
repair and the reparandum are attached to the same node. In the German
Verbmobil treebank, fillers have been removed and so-called speech
errors and repetitions are not integrated in the tree but instead are
attached to the root node. The different representations are expected
to have an impact on statistical parsing as well as on the usefulness
of the resources for linguistic research.

We ask for contributions discussing the best way of representing
disfluencies in the syntax tree.
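
The contrast between keeping and removing fillers can be pictured with a
small Python sketch on the first clause of example (4). This is a strong
simplification: it only shows which token sequence a parser would see under
each choice, not the actual tree structures of Switchboard or Verbmobil.
The filler inventory FILLERS, the labels "FILLER" and "WORD", and both
function names are assumptions made for illustration.

```python
FILLERS = {"uh", "um"}  # assumed filler inventory, illustration only


def keep_fillers(tokens):
    """Leave fillers in the sequence and mark them (fillers stay in the tree)."""
    return [(tok, "FILLER" if tok.lower() in FILLERS else "WORD")
            for tok in tokens]


def drop_fillers(tokens):
    """Remove fillers before parsing (fillers are absent from the tree)."""
    return [tok for tok in tokens if tok.lower() not in FILLERS]


utterance = "He uh graduated from medical school this year".split()

with_fillers = keep_fillers(utterance)
without_fillers = drop_fillers(utterance)
```

The first variant preserves the disfluency for later linguistic research,
while the second hands the parser input that looks closer to written text.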

Code mixing
In informal spoken language as well as in CMC, a considerable amount
of the data includes code mixing. This poses a huge challenge for
automatic processing, and even more so as there is no agreed-upon
theoretical distinction between loanwords and foreign words. Should we
annotate foreign language material using the same annotation scheme as for
the target language, especially in cases where the grammatical differences
between the languages involved do not easily allow us to do so?

(5) es tut mir so leid vallah ich wollte kommen ama unuttum
it does me so harm my God I wanted come but forget-pst-1-sg
"I am so sorry, really, I wanted to come but I forgot"
-- From Twitter

We ask for contributions discussing best practices for the syntactic
analysis of code mixing.

For more examples and information, please visit:

SANCL Special Track Organizers

Özlem Cetinoglu (IMS, Germany)
Ines Rehbein (Potsdam University, Germany)
Djamé Seddah (Université Paris Sorbonne & Inria's Alpage project)
Joel Tetreault (Yahoo! Labs, US)
