Document detail
ID

oai:arXiv.org:2409.14657

Topic
Computer Science - Computation and...
Author
Sarveswaran, Kengatharaiyer
Category

Computer Science

Year

2024

Listing date

9/25/2024

Keywords
annotation linguistic tamil

Abstract

Treebanks are important linguistic resources: structured corpora with rich linguistic annotations.

These resources are used in Natural Language Processing (NLP) applications, support linguistic analyses, and are essential for training and evaluating various computational models.

This paper discusses the creation of Tamil treebanks using three distinct approaches: manual annotation, computational grammars, and machine learning techniques.

Manual annotation, though time-consuming and requiring linguistic expertise, ensures high-quality and rich syntactic and semantic information.

Computational deep grammars, such as Lexical Functional Grammar (LFG), offer deep linguistic analyses but necessitate significant knowledge of the formalism.

Machine learning approaches, utilising off-the-shelf frameworks and tools like Stanza, UDPipe, and UUParser, facilitate the automated annotation of large datasets but depend on the availability of quality annotated data, cross-linguistic training resources, and computational power.

The paper discusses the challenges encountered in building Tamil treebanks, including issues with Internet data, the need for comprehensive linguistic analysis, and the difficulty of finding skilled annotators.

Despite these challenges, the development of Tamil treebanks is essential for advancing linguistic research and improving NLP tools for Tamil.

Comment: 10 pages

Sarveswaran, Kengatharaiyer, 2024, Building Tamil Treebanks
