Overview
INTEX is a linguistic
development environment based on the technology of Finite State Automata and
Transducers (FSTs), enhanced transducers and, more generally, Recursive Transition
Networks (RTNs). INTEX is the only available integrated, user-friendly platform
that allows linguists to describe a natural language from its alphabet up to
the syntactic level, that comes with built-in, large-coverage
dictionaries and grammars, and that can parse texts of several million words in real
time. The following are some INTEX functionalities:
(a) INTEX includes tools to
format texts and prepare them for linguistic analysis. For instance, INTEX
includes grammars used to recognize sentences with very high precision (with
an accuracy above 99% on French journalistic texts), to tag unambiguous
compounds and frozen expressions (so that irrelevant ambiguities are not
considered), and to resolve contractions and elisions (e.g., don't
= do not, cannot = can not):
The Finite State Transducer that identifies sentences in English
journalistic texts
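The contraction step described in (a) can be illustrated with a minimal sketch. It is not INTEX code: INTEX does this with grammars, whereas the sketch uses an ordinary Python regular expression, and the CONTRACTIONS table and the expand_contractions function are hypothetical names holding only the two examples given above.

```python
import re

# Hypothetical contraction table holding only the two examples from the text;
# a real resource would be far larger.
CONTRACTIONS = {
    "don't": "do not",
    "cannot": "can not",
}

# One alternation over all known contractions, matched on word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text: str) -> str:
    """Replace each known contraction with its expanded form.

    The lookup lower-cases the match, so the original capitalization
    is not preserved in this sketch.
    """
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)
```

For instance, expand_contractions("I don't know") would return "I do not know".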
(b) INTEX includes several
handcrafted, large-coverage built-in dictionaries; it also allows users to create
and maintain their own dictionaries. Users describe the inflectional morphology
of a language; INTEX applies this description to DELAS-type dictionaries in
order to inflect them automatically; the resulting DELAF-type dictionaries and
graphs can then be applied to texts in linear time:
Dictionaries and graphs are applied to texts to identify simple words,
compounds and frozen expressions
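The inflection step in (b) can be sketched as follows. This is a simplified illustration and not the INTEX implementation: the entry format ("lemma,code" in, "form,lemma.category:features" out) is only loosely modelled on DELAS and DELAF, and the inflection codes N1 and N2, the suffix rules and the feature letters s and p are assumptions made for the example.

```python
# Hypothetical inflection classes: code -> list of (suffix, features).
INFLECTION_RULES = {
    "N1": [("", "s"), ("s", "p")],   # table -> table (singular), tables (plural)
    "N2": [("", "s"), ("es", "p")],  # box -> box (singular), boxes (plural)
}

def inflect(delas_lines):
    """Expand simplified 'lemma,code' entries into inflected-form entries."""
    delaf = []
    for line in delas_lines:
        lemma, code = line.split(",")
        # The category is the code without its trailing class number.
        category = code.rstrip("0123456789")
        for suffix, features in INFLECTION_RULES[code]:
            delaf.append(f"{lemma + suffix},{lemma}.{category}:{features}")
    return delaf
```

Under these assumptions, inflect(["table,N1"]) yields the two entries "table,table.N:s" and "tables,table.N:p".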
(c) INTEX applies a set of
selected lexical resources (in the form of dictionaries or morphological
grammars) to texts. Lexical entries are simple words (sequences of letters,
e.g. table), morphemes (affixes of simple words, e.g. -ation),
compounds (sequences of simple words, e.g. washing machine) or frozen
expressions (discontinuous compounds, e.g. to take ... into account):
All the words that have been identified during the consultation of the
selected dictionaries and graphs
(d) Looking up words in
dictionaries generally produces several solutions; the result of the
lookup represents ambiguities between simple words, and between simple words and
compounds. This result, which can be pruned with
disambiguation grammars, is the input of the INTEX syntactic parser:
Text is represented by a Finite State Transducer
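The idea in (d), that lookup yields competing analyses which are kept side by side, can be sketched as a small lattice: each edge spans one or more tokens and carries one analysis, so a compound competes with the simple words it covers. The LEXICON entries, the tag V:G and the build_lattice function are hypothetical; only V:W (infinitive) is taken from this page.

```python
# Toy lexicon (hypothetical): a tuple of tokens maps to its possible tags.
LEXICON = {
    ("the",): ["DET"],
    ("washing",): ["V:G", "N:s"],
    ("machine",): ["N:s", "V:W"],
    ("washing", "machine"): ["N:s"],  # compound entry
}

def build_lattice(tokens, max_len=2):
    """Return edges (start, end, form, tag) for every analysis found."""
    edges = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            key = tuple(tokens[start:start + length])
            if len(key) < length:
                continue  # the span would run past the end of the text
            for tag in LEXICON.get(key, []):
                edges.append((start, start + length, " ".join(key), tag))
    return edges
```

On the tokens "the washing machine", the lattice contains both the compound edge from position 1 to 3 and the competing simple-word edges, which is the kind of ambiguity a disambiguation grammar would later prune.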
(e) Users apply local
grammars to remove word ambiguities in texts; INTEX includes several tools to
edit, maintain and debug these local grammars:
Apply local grammars to texts
In "Linear Tag" mode, disambiguated forms are replaced with
the corresponding lexical entry
(f) The INTEX syntactic parser uses
Recursive Transition Networks to build the trees that may represent
the structure of each sentence of a text. RTNs provide total control over the
structure of each tree, which can be independent of the structure of the
grammar:
Derivation tree produced when applying Recursive Transition Networks to
a text
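How a network of sub-networks can produce a derivation tree, as described in (f), can be sketched with a toy recursive recognizer. This is an assumption-laden illustration and not the INTEX parser: each network is reduced to a list of linear paths, the GRAMMAR contents and tags are hypothetical, and left-recursive networks are not supported.

```python
# Hypothetical toy networks: name -> alternative paths. A symbol that names
# another network is a recursive call; any other symbol is a terminal tag.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DET", "N"], ["N"]],
    "VP": [["V", "NP"], ["V"]],
}

def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way `symbol` matches at `pos`.

    `tokens` is a list of (form, tag) pairs.
    """
    if symbol not in GRAMMAR:
        # Terminal: compare the symbol with the tag of the current token.
        if pos < len(tokens) and tokens[pos][1] == symbol:
            yield (symbol, tokens[pos][0]), pos + 1
        return
    for path in GRAMMAR[symbol]:
        # Each state is (children built so far, current position).
        states = [([], pos)]
        for sub in path:
            states = [
                (children + [tree], nxt)
                for children, p in states
                for tree, nxt in parse(sub, tokens, p)
            ]
        for children, end in states:
            yield (symbol, children), end
```

A caller would keep only the trees whose end position equals the number of tokens, that is, the analyses that cover the whole sentence.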
(g) INTEX indexes in texts
all occurrences of a given word (grouping its inflected forms), of a list of
words (listed in a dictionary), of a given category (e.g. all feminine plural
adjectives) or, more generally, of any syntactic pattern given in the form of a
regular expression or a Finite State Automaton; the resulting index can be used
to extract corpora from the text, to build concordances, or can be analyzed with
INTEX statistical tools. For example, in the following screen shot, the user
has indexed the regular expression:
(<be> (<ADV>+<E>) going to + will) <V:W>
to get all the expressions in the future tense (<be> matches any conjugated
form of the verb to be, <ADV> matches any simple or compound
adverb, <E> stands for the empty string, <V:W> matches any verb in
the infinitive):
Index a regular expression
Build complex concordances
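The future-tense pattern indexed in (g) can be hand-translated into a small matcher over tagged tokens to make its two branches explicit. This is a sketch under assumptions, not INTEX's matching engine: tokens are taken to be already disambiguated (form, lemma, tag) triples, the functions match_future and index_future are hypothetical, and every tag other than ADV and V:W is invented for the example.

```python
def match_future(tokens, i):
    """Return the end index of a match starting at i, or None.

    Hand-written equivalent of: (<be> (<ADV>+<E>) going to + will) <V:W>
    """
    def is_infinitive(j):
        return j < len(tokens) and tokens[j][2] == "V:W"

    # Branch "will <V:W>".
    if tokens[i][0] == "will" and is_infinitive(i + 1):
        return i + 2
    # Branch "<be> (<ADV>+<E>) going to <V:W>".
    if tokens[i][1] == "be":
        j = i + 1
        if j < len(tokens) and tokens[j][2] == "ADV":
            j += 1  # optional adverb; <E> is the empty alternative
        if [t[0] for t in tokens[j:j + 2]] == ["going", "to"] and is_infinitive(j + 2):
            return j + 3
    return None

def index_future(tokens):
    """Collect the matched sequences, as an index of occurrences would."""
    hits = []
    for i in range(len(tokens)):
        end = match_future(tokens, i)
        if end is not None:
            hits.append(" ".join(t[0] for t in tokens[i:end]))
    return hits
```

Matching on the lemma rather than the surface form is what lets the single symbol <be> cover is, are, was and the other conjugated forms.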
(h) INTEX can also apply enhanced
transducers (transducers with variables, à la sed) to
texts to perform search-and-replace or search-and-insert operations.
Applying enhanced transducers in cascades or in loops allows users to perform
powerful operations on texts:
Moving a sequence of adverbs to the right of the past participle
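The reordering named in the caption above can be sketched on tagged tokens. This is only an illustration of the effect, not of the INTEX transducer itself: the adverb run and the participle play the role of the two variables that are captured and then emitted in swapped order, the move_adverbs function is hypothetical, and the tag V:K for the past participle is an assumption.

```python
def move_adverbs(tokens):
    """Move each run of adverbs to the right of a following past participle.

    `tokens` is a list of (form, tag) pairs; "V:K" is assumed to mark a
    past participle and "ADV" an adverb.
    """
    out, i = [], 0
    while i < len(tokens):
        # Find the end of the adverb run starting at i (possibly empty).
        j = i
        while j < len(tokens) and tokens[j][1] == "ADV":
            j += 1
        if j > i and j < len(tokens) and tokens[j][1] == "V:K":
            out.append(tokens[j])     # emit the participle first
            out.extend(tokens[i:j])   # then the captured adverbs
            i = j + 1
        elif j > i:
            out.extend(tokens[i:j])   # adverbs not followed by a participle
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Applied to "has very often failed", the sketch produces "has failed very often"; running several such rewrites one after another corresponds to the cascades mentioned in (h).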
(i) Several tools are
included to help edit, maintain and debug grammars, which are represented
graphically. When possible, sets of graphs can be compiled into minimal
deterministic Finite State Automata; they then become instantly reusable in other
graphs:
Generate a language represented by a library of Finite-State graphs