Outline

Objective

The main objective of the TagShare project is to develop a set of linguistic resources and software component tools to support the computational processing of Portuguese.

These resources and tools are geared to shallow morpho-syntactic processing: They are aimed at automatically associating lexemes with basic linguistic information that can be uncovered by means of computationally efficient procedures that look into the word structure and/or into a very limited amount of context (e.g. POS tagging, inflectional analysis, multi-word lexeme recognition, named entity recognition, etc.).

Approach

Automatic POS tagging consists in determining the part of speech of a word, a task that goes beyond mere table lookup, since the same word form may belong to different morpho-syntactic categories in different contexts. The task is further complicated by the fact that new expressions keep entering the lexicon. Several approaches to building taggers have been developed, including methods based on hidden Markov models, neural networks, decision trees, transformation-based rules, maximum entropy, etc. Most of these approaches have matured to the point where they offer quite reasonable levels of accuracy.
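To illustrate why context matters, the following is a minimal sketch of hidden Markov model tagging with the Viterbi algorithm. It is not the TagShare tagger: the tag names, the two-word sentences and all probabilities are hand-set assumptions chosen only to show the Portuguese word form "canto" receiving a different tag after a determiner ("o canto", the corner) than after a pronoun ("eu canto", I sing).

```python
import math

# Toy parameters, hand-set for illustration only (not estimated from any corpus).
TAGS = ["DET", "PRON", "N", "V"]
START = {"DET": 0.5, "PRON": 0.5}                 # P(tag | sentence start)
TRANS = {                                          # P(tag | previous tag)
    ("DET", "N"): 0.9, ("DET", "V"): 0.1,
    ("PRON", "V"): 0.9, ("PRON", "N"): 0.1,
}
EMIT = {                                           # P(word | tag)
    ("DET", "o"): 1.0,
    ("PRON", "eu"): 1.0,
    ("N", "canto"): 0.5,
    ("V", "canto"): 0.5,
}


def logp(table, key):
    """Log probability of key, or -inf when the event is not in the table."""
    p = table.get(key, 0.0)
    return math.log(p) if p > 0 else float("-inf")


def viterbi(words):
    """Return the most probable tag sequence for words under the toy model."""
    if not words:
        return []
    # best[i][t] = (log score of the best path ending in tag t at position i,
    #               previous tag on that path)
    best = [{t: (logp(START, t) + logp(EMIT, (t, words[0])), None) for t in TAGS}]
    for i in range(1, len(words)):
        column = {}
        for t in TAGS:
            emit = logp(EMIT, (t, words[i]))
            column[t] = max(
                (best[i - 1][p][0] + logp(TRANS, (p, t)) + emit, p) for p in TAGS
            )
        best.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))


print(viterbi(["o", "canto"]))   # ['DET', 'N']
print(viterbi(["eu", "canto"]))  # ['PRON', 'V']
```

A plain lexicon lookup would have to return both N and V for "canto"; here the transition probabilities from the preceding tag settle the choice.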

Alongside the use of data-intensive approaches, advances have been made in handling as much morpho-syntactic structure as possible with efficient finite-state methods. These methods make it possible to go beyond the information encoded in mere categorial tags. Morphological analyzers, named entity recognizers, NP chunkers, etc. are some of the applications that have been developed in this vein.
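As a rough illustration of the finite-state idea, the sketch below chunks noun phrases by matching a regular expression over a sequence of POS tags. The tag names, the one-letter encoding and the pattern (optional determiner, adjectives, one or more nouns, trailing adjectives) are assumptions made for this example; they are not the TagShare tagset or the method used by any of the project tools.

```python
import re

# Each tag is encoded as one character so that character offsets in the
# encoded string coincide with token positions. Unknown tags become "x".
TAG_CODE = {"DET": "D", "ADJ": "A", "N": "N"}

# Hypothetical NP pattern: optional determiner, any adjectives, at least one
# noun, then any post-nominal adjectives (common in Portuguese).
NP_PATTERN = re.compile(r"D?A*N+A*")


def chunk_nps(tagged):
    """Return the noun phrases found in a list of (word, tag) pairs."""
    codes = "".join(TAG_CODE.get(tag, "x") for _, tag in tagged)
    return [
        [word for word, _ in tagged[m.start():m.end()]]
        for m in NP_PATTERN.finditer(codes)
    ]


sentence = [
    ("o", "DET"), ("gato", "N"), ("preto", "ADJ"),
    ("viu", "V"),
    ("uma", "DET"), ("casa", "N"),
]
print(chunk_nps(sentence))  # [['o', 'gato', 'preto'], ['uma', 'casa']]
```

Because the pattern is a regular expression, matching runs in a single left-to-right pass over the tag sequence, which is the efficiency argument behind finite-state processing.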

Motivation

Although the above-mentioned methods are language-independent, very few tools of this kind have been reported for modern Portuguese. The bad news is that access to even these tools is restricted, either for commercial reasons or by ad hoc authorization for use. The good news is that the conditions exist to overcome this situation rapidly. The pursuit of the project goals capitalizes on existing language resources for Portuguese whose development had previously been advanced at the participants' centers. It also capitalizes on toolkits, development environments, etc. that were matured mostly for English but are language-independent in their principles and freely available for use in academic research and in the development of tools for other languages, including Portuguese.
 
 
 
 

Participants

Research centers

The TagShare project is an enterprise of two academic research centers affiliated with the University of Lisbon. The participating centers have been conducting research on artificial intelligence, cognitive science and natural language science and technology:

FCUL - The Faculty of Sciences of the University of Lisbon, Department of Informatics.

CLUL - The Center of Linguistics of the University of Lisbon, Corpus Linguistics Group.

Team

Florbela Barreto, CLUL
António Horta Branco (coord.), FCUL
Eduardo Ferreira, FCUL
José Bettencourt Gonçalves, CLUL
Marco Gonzalez, FCUL

João Silva, FCUL
Fernanda Bacelar do Nascimento, CLUL
Pedro Martins, FCUL
Amália Mendes, CLUL
Filipe Nunes, FCUL

 
 
 
 

Funding

The project is funded by FCT, the Foundation for Science and Technology of MCT, the Portuguese Ministry of Science and Technology, under contract POSI/PLP/47058/2002. The project is planned to span 24 months, starting in March 2004.
 
 
 
 

Results

Publications  

    • Branco, António, Lino Rodrigues, João Silva and Sara Silveira, 2008, "LXService: Web Services of Language Technology for Portuguese", Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC2008).
    • Barreto, Florbela, António Branco, Eduardo Ferreira, Amália Mendes, Maria Fernanda Nascimento, Filipe Nunes and João Silva, 2006, "Open Resources and Tools for the Shallow Processing of Portuguese: The TagShare Project", Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC2006), pp.1438-1443, Genoa, Italy.
    • Barreto, Florbela, António Branco, Eduardo Ferreira, Amália Mendes, Maria Fernanda Nascimento, Filipe Nunes and João Ricardo Silva, 2006, "Linguistic Resources and Software for Shallow Processing", In Actas do XXI Encontro Anual da Associação Portuguesa de Linguística.
    • Branco, António, Francisco Costa and Filipe Nunes, forth., "Processamento da Ambiguidade Flexional Verbal: Para uma caracterização do espaço do problema", In Actas do XXII Encontro Anual da Associação Portuguesa de Linguística, Universidade de Coimbra, Faculdade de Letras.
    • Branco, António, Amália Mendes and Ricardo Ribeiro (eds.), 2004, Language Technology for Portuguese: Shallow Processing Tools and Resources. Lisbon, Edições Colibri, 144pp.
    • Branco, António, Amália Mendes and Ricardo Ribeiro (eds.), 2003, Tagging and Shallow Processing of Portuguese: Workshop Notes of TASHA'2003. Lisbon, University of Lisbon, Faculty of Sciences, Department of Informatics, Technical Report TR-2003-28.
    • Branco, António, Filipe Nunes and João Silva, 2006, "Verb Analysis in an Inflective Language: Simpler is better", University of Lisbon, TagShare project, Internal report.
    • Branco, António and João Silva, forth., "Nominal Lemmatization with a Minimal Lexicon", In Actas do XXII Encontro Anual da Associação Portuguesa de Linguística, Universidade de Coimbra, Faculdade de Letras.
    • Branco, António and João Silva, 2006, "Dedicated Nominal Featurization of Portuguese". Lecture Notes in Artificial Intelligence 3960, Berlin, Springer, ISSN 0302-9743, pp.244-247.
    • Branco, António and João Silva, 2006, "LX-Suite: Shallow Processing Tools for Portuguese", Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL2006), Trento, Italy, pp.179-182.
    • Branco, António and João Silva, 2005, "Accurate Annotation: an Efficiency Metric". In Nicolas Nicolov, Kalina Bontcheva, Galia Angelova and Ruslan Mitkov (eds.), Recent Advances in Natural Language Processing III, Amsterdam, John Benjamins, pp.173-182.
    • Branco, António and João Silva, 2004, "Swift Development of State of the Art Taggers for Portuguese". In António Branco, Amália Mendes and Ricardo Ribeiro (eds.), Language Technology for Portuguese: Shallow Processing Tools and Resources. Lisbon, Edições Colibri, pp. 29-46.
    • Branco, António and João Silva, 2004, "Evaluating Solutions for the Rapid Development of State-of-the-Art POS Taggers for Portuguese". In Maria Teresa Lino, Maria Francisca Xavier, Fátima Ferreira, Rute Costa and Raquel Silva (eds.), Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC2004), Paris, ELRA, pp.507-510.
    • Branco, António and João Silva, 2003, "Contractions: breaking the tokenization-tagging circularity", Lecture Notes in Artificial Intelligence 2721, Berlin, Springer, ISSN 0302-9743, pp.167-170.
    • Branco, António and João Silva, 2003, "Tokenization of Portuguese: resolving the hard cases", Technical Report TR-2003-4, Department of Informatics, University of Lisbon.
    • Branco, António and João Silva, 2003, "Portuguese-specific Issues in the Rapid Development of State of the Art Taggers", In António Branco, Amália Mendes and Ricardo Ribeiro (eds.), Tagging and Shallow Processing of Portuguese: Workshop Notes of TASHA'2003. Lisbon, University of Lisbon, Faculty of Sciences, Department of Informatics, Technical Report TR-2003-28, pp.7-10.
    • Martins, Pedro, 2006, LX-Inflector: Implementation Report and User Manual, University of Lisbon, TagShare project, Internal report.
    • Mendes, Amália, Raquel Amaro, M. Fernanda Bacelar do Nascimento, 2004, "Morphological Tagging of a Spoken Portuguese Corpus Using Available Resources". In António Branco, Amália Mendes and Ricardo Ribeiro (eds.), Language Technology for Portuguese: Shallow Processing Tools and Resources. Lisbon, Edições Colibri, pp. 47-62.
    • Mendes, Amália, Raquel Amaro, M. Fernanda Bacelar do Nascimento, 2003, "Reusing Available Resources for Tagging a Spoken Portuguese Corpus", In António Branco, Amália Mendes and Ricardo Ribeiro (eds.), Tagging and Shallow Processing of Portuguese: Workshop Notes of TASHA'2003. Lisbon, University of Lisbon, Faculty of Sciences, Department of Informatics, Technical Report TR-2003-28, pp.25-28.
    • Nunes, Filipe, 2006, LX-Lemmatizer: Implementation Report and User Manual, University of Lisbon, TagShare project, Internal report.
    • TagShare, 2006, Manual de Etiquetação e Convenções, University of Lisbon, Project TagShare, Internal report.

 

Software and services

  • LX-chunker: sentence chunker
  • LX-tokenizer: tokenizer
  • LX-Tagger: full coverage, disambiguating POS tagger
    For a demo, check LX-Suite
  • Nominal featurizer
  • Nominal lemmatizer
  • Nominal conjugator
    For a demo, check LX-Inflector
  • LX-Lemmatizer: verbal lemmatizer and featurizer
  • LX-Conjugator: verbal conjugator
  • LX-NER: named entity recognizer
  • LX-Corpus: corpus exploration online service

 

Language Engineering Resources

  • LX-Corpus: 1 million token corpus, annotated up to named entity recognition (NER) using the IOB scheme
  • Portuguese word lists of closed categorial classes
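For readers unfamiliar with IOB annotation, the sketch below shows how entity spans can be read off a token sequence labeled with B- (begin), I- (inside) and O (outside) tags. The label names ORG and LOC and the example sentence are assumptions made for illustration; they are not taken from the LX-Corpus annotation conventions.

```python
def iob_to_spans(tokens, labels):
    """Collect (entity_type, text) pairs from parallel token and IOB label lists."""
    spans = []
    current = None  # (entity_type, [tokens]) of the entity being built, if any
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- label always opens a new entity, closing any open one.
            if current:
                spans.append(current)
            current = (label[2:], [token])
        elif label.startswith("I-") and current and current[0] == label[2:]:
            # An I- label of the same type continues the open entity.
            current[1].append(token)
        else:
            # "O", or an I- label that does not continue the open entity,
            # closes the open entity; the token itself is left unlabeled.
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]


tokens = ["A", "Universidade", "de", "Lisboa", "fica", "em", "Portugal"]
labels = ["O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "B-LOC"]
print(iob_to_spans(tokens, labels))
# [('ORG', 'Universidade de Lisboa'), ('LOC', 'Portugal')]
```

The B- prefix is what lets two adjacent entities of the same type be told apart, which a plain inside/outside labeling could not do.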

 
 
 
 
Cooperation

Cooperation with related projects

Members of the TagShare project are participating in the following related projects:

  • C-ORAL-ROM - Integrated Reference Corpora for Spoken Romance Languages, CLUL.
  • CRPC - Corpus de Referência do Português Contemporâneo, CLUL.
  • ENABLER - European National Activities for Basic Language Resources, CLUL.
  • GRAMAXING - Computational Grammar for Deep Linguistic Processing of Portuguese, FCUL.
  • LTRC - Language Typology Resource Center, FCUL.
  • RLD - Recursos Linguísticos Disponíveis: Corpora e Léxicos, CLUL.


 
 
 
 

Meetings

As a preparatory action for the project activities, a workshop was held on the issues within the scope of the TagShare goals. This was TASHA'2003, the Workshop on Tagging and Shallow Processing Tools and Resources for Portuguese, held at the University of Lisbon on October 3, 2003, as an event associated with the XIX Annual Meeting of the Portuguese Linguistics Society.