accurat-toolkit/ILSP_FMC

ILSP_FMC

ILSP Focused Monolingual Crawler (FMC)

ILSP Focused Crawler (ILSP-FC) is a research prototype for acquiring domain-specific monolingual and bilingual corpora. The user provides a list of seed URLs pointing to relevant web pages and, for focused crawling, a list of terms that describe a topic.

ILSP-FC integrates modules for text normalization, language identification, document clean-up, text classification, bilingual document alignment (i.e. identification of pairs of documents that are translations of each other) and sentence alignment.

If the user does not provide a list of terms, the software can be used as a general crawler.
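As a rough illustration of how a list of topic terms can steer a crawl, the sketch below scores a page by counting term occurrences and keeps every page when no terms are given. This is not ILSP-FC's actual classifier: the function names `relevance_score` and `should_keep`, the simple counting scheme, and the `min_score` threshold are all assumptions made for the example.

```python
import re


def relevance_score(text, topic_terms):
    """Count case-insensitive whole-word occurrences of each topic term.

    Hypothetical scoring scheme for illustration only.
    """
    lowered = text.lower()
    score = 0
    for term in topic_terms:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        score += len(re.findall(pattern, lowered))
    return score


def should_keep(text, topic_terms, min_score=2):
    """Decide whether a fetched page is kept (min_score is an assumed threshold)."""
    # With no topic terms, behave as a general crawler: keep everything.
    if not topic_terms:
        return True
    return relevance_score(text, topic_terms) >= min_score
```

With these assumptions, `should_keep(page_text, [])` accepts any page, while a non-empty term list filters out pages that mention the topic fewer than `min_score` times.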

ILSP-FC is being developed by researchers of the ILSP/Athena RIC and is currently being used in the European Language Resource Coordination (ELRC) Data effort. ELRC Data implements the acquisition of language resources and language processing services, as well as their provision to the language resource repository of the Connecting Europe Facility (CEF) eTranslation platform, which helps European and national public administrations exchange information across language barriers in the EU.

An initial version of the crawler was produced during PANACEA, an EU FP7 project for the acquisition and production of Language Resources. It was then extended during two further projects: QTLaunchPad, a European Commission-funded collaborative research initiative dedicated to overcoming quality barriers in machine and human translation and in language technologies, and the FP7-PEOPLE Abu-MaTran project for enhancing industry-academia cooperation in the adoption of machine translation technologies.

Current versions of the tool can be found here: http://nlp.ilsp.gr/redmine/projects/ilsp-fc/
