Linguistic Development Environment



NooJ

The latest version of INTEX, NooJ, has been under development since 2002.
Click here for more information on NooJ.


Events

The 9th INTEX/NooJ Workshop, Belgrade, Serbia: June 1-3, 2006

The 8th INTEX/NooJ Workshop, Besançon, France: May 30-June 2, 2005

The 7th INTEX/NooJ Workshop, Tours, France: June 7-9, 2004

The 6th INTEX Workshop, Sofia, Bulgaria: May 28-30, 2003

The 5th INTEX Workshop, Marseille, France: May 2-3, 2002, dedicated to the memory of Maurice Gross

INTEX Session at ACH, New York City, USA: June 14-17, 2001

4th INTEX Workshop: Bordeaux, France: June 11-12, 2001

3rd INTEX Workshop: Liège, Belgium: June 13-14, 2000


Links & related projects

If you have developed other INTEX-related projects and wish to be listed here, please send an email to max.silberztein@univ-fcomte.fr.

An INTEX site at the Maison des Sciences de l’Homme Ledoux, Université de Franche-Comté: http://intex.univ-fcomte.fr

An INTEX site at NYU, http://www.nyu.edu/pages/linguistics/intex

An INTEX friend site at the University of Belgrade: http://nlp.matf.bg.ac.yu/intex

An INTEX friend site at the Bulgarian Association for Computational Linguistics, Sofia: http://www.bacl.org/intex_for_bulgarian.html

The PROLEX project, from the Université de Tours: www.li.univ-tours.fr/Fichiers/Fichiers_HTML/Themes/BdTln_Projet_Prolex.htm

Andrew Gordon at the University of Southern California's Institute for Creative Technology has been leading an effort to author INTEX graphs for the recognition of expressions of commonsense psychology in English text: http://www.ict.usc.edu/~gordon/cprp.html

Analysis of schizophrenic discourses: Reboul A., Sabatier P., Noël-Jorand M-C. Le discours des schizophrènes : une étude de cas. Revue française de Psychiatrie et de Psychologie Médicale. 2001, 49, pp 6-11.

-- The Bulgarian Association for Computational Linguistics has developed a Bulgarian module for INTEX, see http://www.bacl.org/intex_for_bulgarian.html

-- Tita Kiriacopoulou’s team at the University of Thessaloniki has developed a Greek module for INTEX, see its description.

-- Annibale Elia’s team at the University of Salerno has developed an Italian module for INTEX; see http://www.linguistics.unisa.it/ for an INTEX demo version of their linguistic data.

-- Elisabete Ranchhod’s team at the LabEL/CAUTL laboratory, Univ. of Lisbon has built a Portuguese module for INTEX 4.33, using the latest morphological module, see http://label.ist.utl.pt/en/recursos-intex.html

 


Presentation

INTEX is a linguistic development environment that includes large-coverage dictionaries and grammars, and parses texts of several million words in real time. INTEX includes tools to create and maintain large-coverage lexical resources, as well as morphological and syntactic grammars. Dictionaries and grammars are applied to texts in order to locate morphological, lexical and syntactic patterns, remove ambiguities, and tag simple and compound words. Several research centers use INTEX to rapidly build extractors that identify semantic units in large texts, such as proper names of persons and locations, technical expressions in finance, etc. INTEX can build lemmatized concordances and indices of large texts with respect to all types of finite-state patterns. INTEX is used in over 80 laboratories as an information retrieval system, to parse literary texts, to quantify language variations, to teach second languages, and as a terminological extractor; several universities also use it to teach computational linguistics to graduate students.



Texts and Reference

REFERENCE:

The University of Belgrade is compiling a bibliography of over 200 references to work conducted on or with INTEX, as well as work of direct interest to INTEX users.

Silberztein, 1993. Dictionnaires électroniques et analyse automatique de textes : le système INTEX. 240 p., Masson Ed.: Paris.

 

This book describes the DELA system (and in particular the construction of the DELAC-DELACF system of dictionaries), the use of the Finite-State technology in the various processes that make up the lexical analysis of natural languages, and the first implementation of INTEX.

The French manual for v 4.12 (PDF document).

A tutorial for v4.12 (PDF).

The latest, updated English manual for v 4.33 (PDF).

Silberztein, 1999a. Text Indexing with INTEX, in Computers and the Humanities #33:3, Kluwer Academic Publishers. Silberztein, 1999b. INTEX: a Finite State Transducer toolbox, in Theoretical Computer Science #231:1, Elsevier Science.

The first DELA system of large-coverage morphological electronic dictionaries was described in:

Courtois, Silberztein Eds, 1990. Dictionnaires électroniques du français. Langue française. Larousse: Paris.

 

This is the special issue dedicated to the French DELA system of electronic dictionaries; in particular, Blandine Courtois describes the DELAS dictionary, and Max Silberztein describes the DELAC dictionary. This model has been successfully adapted to other languages, and today DELA dictionaries are available for a dozen languages.

 

Leclère, Christian. 1998. "Travaux récents en Lexique-grammaire". In "Le Lexique-grammaire", Béatrice Lamiroy (ed.), Travaux de Linguistique n° 37, Louvain-la-Neuve : Duculot, pp. 155-186.

This bibliography includes many references on the linguistic data included in INTEX:

-- Many researchers have participated in the construction of the French and English DELAS and DELAC dictionaries, at the LLI laboratory, at the LADL laboratory, and at the GRELIS laboratory. DELA-type dictionaries are also available for a dozen languages.

-- The Lexicon-Grammar series of syntactic dictionaries was designed and built by Maurice Gross and his team at the LADL laboratory. Lexicon-grammars are also available for a dozen languages.

PROCEEDINGS OF INTEX WORKSHOPS:

 

Dister Ed., 2000. Actes des Troisièmes Journées INTEX. In Informatique et Statistique dans les Sciences Humaines. Université de Liège, n° 36.

Fairon Ed., 1999. Analyse lexicale et syntaxique: le système INTEX, Actes des Premières et Secondes Journées INTEX. Lingvisticae Investigationes vol. XXII: 1998-1999.

Muller, Royauté, Silberztein Eds, 2004. INTEX pour la Linguistique et le Traitement Automatique des Langues. Cahiers de la MSH Ledoux. Presses Universitaires de Franche-Comté, 2004 (21 articles, 372 pages).

 

Unitex & INTEX

 

The Unitex software was developed in the Linguistics group (Prof. Eric Laporte) of the Institut Gaspard Monge, Université de Marne-La-Vallée, without the consent, or even the knowledge, of INTEX’s author.

 

Unitex, its interface, its methodology, its standalone programs, its file formats, some of its linguistic data, as well as its documentation, are copies of INTEX’s.

 

For months, neither the Unitex documentation nor the various Web sites associated with Unitex made any mention of INTEX or its author.

 

See the statement of the Dean of Arts and Humanities of the Université de Franche-Comté concerning this sad state of affairs.

 

See the analysis report of the “similarities” between Unitex and INTEX.

 

Please do not encourage this way of conducting public research.

 


DOWNLOAD INTEX

Latest version is 4.33

 

This version contains the French and English modules that were built at the LADL (Université Paris 7, CNRS), the Spanish module that was built at the Autonomous University of Barcelona (Prof. Xavier Blanco), and the Portuguese module built at the LabEL/CAUTL laboratory, Univ. of Lisbon (Prof. Elisabete Ranchhod).

Distribution Policy

 

This software can be freely downloaded and used by individuals (researchers and students) affiliated with a University, for their individual needs, and non-commercial purposes only.

 

Private and public organizations, laboratories and departments who wish to use INTEX in Research & Development or Education projects, should contact the LASELDI laboratory.

 

None of the programs and linguistic resources included in the INTEX package should be copied, redistributed, incorporated into other software, or published without their author’s consent and proper citation (shouldn’t this be obvious?).

 

1. If you agree with these terms, download the installation file Intex.zip (zip file, 26 MB).

 

2. Launch the SETUP.EXE installation program to install INTEX on your system (e.g. in C:\Program files).

 

3. Launch INTEX; it will display a Machine Identification Number and ask for an Installation Key. Two possibilities:

 

(3a) to get a personal (DEMO) version of INTEX, type in the following information:

 

License number: 1

Contact: PERSO

Institution: DEMO

Installation Key: pxN9pINF8

 

(3b) to get a licensed version of INTEX: you need to:

 

  • Send an email to max.silberztein@univ-fcomte.fr with the Subject: Get INTEX. Your email address should have a university suffix, such as ".edu", “univ-xxx.fr” or ".ac.uk"
  • In the body of your email, enter the name of the person responsible for your work (typically the head of your laboratory, or your PhD adviser), your faculty or university, and your Machine Identification Number (displayed above the Installation Key in the Registration window), as follows:

 

Contact: John Smith

Institution: University of North Texas

Machine ID: 12345

 

  • You should receive an automatic answer by email that will look like this:

 

License number: 123

Contact: John Smith

Institution: University of North Texas

Machine ID: 12345

=> Installation key: ab1234cde

 

4. Launch INTEX again; enter the above information to register, and enjoy! Check out the INTEX documentation and reference above.

5. Remember that you can get extra language modules from their authors; check the links above for additional modules.

 


Resources for INTEX

 

The INTEX package contains the French and English modules, which contain the latest version of the DELAF and DELACF dictionaries, offered to the INTEX community by Blandine Courtois, author and co-author of these dictionaries, while working at the LADL (Université Paris 7-CNRS). For information about the DELA dictionaries, see the Reference. Other DELAF and DELACF dictionaries compatible with the versions 4.3x of INTEX are available, see below.

 

Before you download anything: Limitations Of Use

 

Following are specialized linguistic resources and tools offered by their author to the INTEX community. None of the linguistic resources below can be redistributed, incorporated into other software, or published without their author’s consent and proper citation (shouldn’t this be obvious?).

 

-- Annibale Elia’s team at the University of Salerno has developed an Italian module for INTEX; see http://www.linguistics.unisa.it/ for an INTEX demo version of their linguistic data.

-- Xavier Blanco at the Autonomous University of Barcelona has built a new Spanish module for INTEX, using the 4.33 morphological parser, see Spanish_demo.zip

 

-- Dictionary of compound nominal determiners (French)

 

-- Latest version of Prolintex: dictionary of French proper names for INTEX (French)

 

-- A text fully tagged with INTEX (the first tagged text that takes compounds into account!): Du côté de chez Swann (French)

 

-- An example of a nice student project for INTEX: Various local grammars for French expressions (French)

 

-- Lemmatization of compound tenses (English)

 

-- Grammar for Time and Date (English)

 

-- Statistic Module (Tool)

 

-- A tool to help users remove remaining ambiguities in partially tagged texts: diatag (Tool)

 

 

Local Grammar for the English lemmatization of compound tenses

Designed by Maurice Gross, LADL, Université Paris 7.

Here is a rather large set of local grammars for recognition of auxiliaries, modals, aspectuals, etc. (more info.)

 

The Portrait of a Lady (by Henry James, 1881)

CHAPTER 1

Under certain circumstances there are few hours in life more agreeable than the hour dedicated to the ceremony known as afternoon tea. There are circumstances in which, whether you partake of the tea or not--some people of course never do--the situation is in itself delightful. Those that I have in mind in beginning to unfold this simple history offered an admirable setting to an innocent pastime. The implements of the little feast had been disposed upon the lawn of an old English country-house, in what I should call the perfect middle of a splendid summer afternoon. Part of the afternoon had waned, but much of it was left, and what was left was of the finest and rarest quality. Real dusk would not arrive for many hours; but the flood of summer light had begun to ebb, the air had grown mellow, the shadows were long upon the smooth, dense turf. They lengthened slowly, however, and the scene expressed that sense of leisure still to come which is perhaps the chief source of one's enjoyment of such a scene at such an hour. From five o'clock to eight is on certain occasions a little eternity; but on such an occasion as this the interval could be only an eternity of pleasure. The persons concerned in it were taking their pleasure quietly, and they were not of the sex which is supposed to furnish the regular votaries of the ceremony I have mentioned.

DOWNLOAD

Local Grammar for English expressions of time and date

Designed by Maurice Gross, LADL, Université Paris 7

DOWNLOAD

Dictionary of French compound nominal determiners

Contact : Pierre-André Buvet, Université de Franche-Comté, France (pab@lli.univ-paris13.fr)

This dictionary includes over 3,000 compound nominal determiners classified in 15 classes (cf. the Info.txt file in the archive). Further information is available in Les déterminants nominaux quantifieurs, 1993 (PhD thesis, LLI-Paris XIII). Ref.: P.-A. Buvet, 1994, "Déterminants : les noms", Lingvisticae Investigationes XVIII:1, Amsterdam: John Benjamins B.V.

abondance de ce,abondance de.NDET+Dnom14 

années-lumières de son,année-lumière de.NDET+Dnom2

billion d',billion de.NDET+Dnom1

billion de ces,billion de.NDET+Dnom1

pouce métrique du,pouce métrique de.NDET+Dnom2

[...]
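The sample entries above appear to follow the layout form,lemma.CODE+feature. As a minimal sketch of how such lines could be read, the Python below splits one entry into its parts. The names DelaEntry and parse_entry are illustrative and are not part of INTEX, and the sketch assumes that neither the form nor the lemma contains a protected comma or period.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DelaEntry:
    form: str            # inflected form as it appears in texts
    lemma: str           # canonical entry
    category: str        # main code, e.g. "NDET"
    features: List[str]  # "+"-separated codes, e.g. ["Dnom14"]


def parse_entry(line: str) -> DelaEntry:
    """Split one 'form,lemma.CODE+feat' line (hypothetical helper)."""
    line = line.strip()
    form, rest = line.split(",", 1)     # the first comma ends the form
    lemma, codes = rest.rsplit(".", 1)  # the last period starts the codes
    category, *features = codes.split("+")
    return DelaEntry(form, lemma, category, features)


entry = parse_entry("années-lumières de son,année-lumière de.NDET+Dnom2")
print(entry.form, "->", entry.lemma, entry.category, entry.features)
```

Splitting on the first comma and the last period keeps lemmas that contain spaces or apostrophes intact; entries with protected punctuation would need a more careful tokenizer.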

DOWNLOAD

Prolintex: dictionary of French proper names for INTEX

The Computer Science Laboratory of the Université de Tours leads a project on the automatic processing of proper names, the Prolex project. The Prolintex dictionaries were built within this project to provide the INTEX community with dictionaries of proper names. A first version, whose codes were somewhat too complex, was presented at the INTEX Workshop in 1999 and put online on the INTEX site. A second version (with simplified codes) was announced at the INTEX Workshop in 2001. It is now available at: http://tln.li.univ-tours.fr/Tln_Prolintex.html

To improve the recognition of proper names, we suggest using, for the INTEX preprocessing step, a graph that is more complete than the standard graph; it is available at the same address. This graph was presented in a paper at the INTEX Workshop in 2000. Please send us your comments and suggestions for improvement.

Tools for the statistical analysis of matching sequences in INTEX

Contact : Jean Senellart, Systran (senellart@systran.fr)

The statistical module is now a standalone program that can be launched more than once, in parallel to an INTEX session.

One advantage is that it is now possible to compare different queries in the same text, one query in different texts, different queries in different texts, even in different languages.

For instance, compare how the English regular expression "will + shall + <be> going to <V>" is used in "The Portrait of a Lady", as opposed to the French regular expression "<V:F> + <aller> (<ADV>+<E>) <V:W>" in "La femme de trente ans".

ATTENTION : this program is compatible only with INTEX Versions 4.23e and above.

DOWNLOAD 

DiaTag

Contact : Jean-Marie Viprey, LASELDI, Université de Franche-Comté (jean-marie.viprey@univ-fcomte.fr)

DiaTag allows users to manually tag the compound and simple words that were left ambiguous after the INTEX Disambiguation process. DiaTag uses the snt file of the text, as well as the two dictionary files DLC and DLF (vocabulary of the text). For each ambiguous simple or compound word, DiaTag displays all possible lexical solutions, as well as a concordance sorted on the left or right context, and allows users to choose the right lexical entry for each occurrence of the word in the text. Results are saved incrementally, so that users can work in several sessions and correct previous choices. DiaTag is particularly well adapted to the tagging of texts of up to 2 megabytes.

ATTENTION : this program is compatible only with INTEX Versions 4.3x

DOWNLOAD

Du côté de chez Swann, Marcel Proust

Contact : Katia Zellagui, LASELDI, Université de Franche-Comté (katia.zellagui@univ-fcomte.fr)

This text has been fully tagged with INTEX:

-- I have first removed from the DLF and DLC vocabulary files all the lexical entries that never occur in this text;

-- Then I have built two sets of local grammars: one set of general disambiguation rules (i.e. that can be used with other texts), and one set of “good” rules that work perfectly for this particular text, and would be good enough for most French texts;

-- I have removed the remaining ambiguities on compounds (compound vs sequence of simple words), and on simple words, with DiaTag.

DOWNLOAD

Various local grammars for French expressions

Contact : Solène Boichat, Université de Franche-Comté (solene.boichat@wanadoo.fr)

This .zip file contains various local grammars for French expressions of time, date, length, height, width and altitude.

DOWNLOAD


History

Latest version 4.33:

 

-- A few bugs have been fixed, related to: synchronization of the concordance and the text, the EDIT DLM button, morphological ambiguities associated with more than one lexical constraint, unsolved variable $XL in the output tag of a morphological when lemma was implicit, Alphabet file with an invalid format, recognition of the characters “/” and “>” in texts, processing of accented letters during dictionary format check.

 

-- The 4.3x morphological module has been further enhanced; it is now fully compatible with the DELAS-DELAF module, so that certain phenomena can be described equally well with either inflectional FSTs or morphological FSTs. Highly inflected languages such as Hungarian, Korean or Russian can now be processed with INTEX without having to construct ridiculously large DELAFs. Even for English and Romance languages, being able to formalize prefixation and suffixation without introducing redundancies in a DELAF-type dictionary is more natural. The new module is more tightly integrated with the transformational analyzer (see its documentation, chap. 12).

 

-- The disambiguation program (interg.exe) has been modified so that irrelevant ambiguities between compound words or frozen expressions and sequences of simple words do not interfere with local grammars. More precisely, the disambiguation process no longer accidentally destroys lexical hypotheses that do not follow an explicit path of the local grammar.

 

-- The lexicon-grammar compiler has been seriously optimized (1,000+ times faster on a Pentium 4, 512 MB RAM, to compile the C1d table)

 

-- Towards NOOJ: the lexicon-grammar compiler’s outputs have been renamed as “.cfg” files (context-free grammar), and can now be reused by the INTEX syntactic parsers (including the “Locate Pattern” window). This allows users to describe free syntactic structures in lexicon-grammar tables, and apply them to parse texts.

 

 

Version 4.32:

 

-- Enhanced transducers are now associated with the inflectional and derivational modules, so that INTEX can now perform automatic transformational analysis and generation. For instance:

 

From the parsed text: (N0 John) (V eats) (N1 an apple), the rule: $N1 is $V_K by $N0 produces the result: an apple is eaten by John

From the parsed text: (N0 cette affaire) (ETRE est) (ABLE risible), the rule On peut $Able_V de $N0 produces the text: On peut rire de cette affaire

 

Morphological operations can be cascaded, so for instance: émission_V_N0_p = émettre_N0_p = émetteur_p = émetteurs

 

-- A user-defined concordance program, that allows users to sort concordances according to any word or token inside matching sequences

-- Towards NOOJ: the new Text-FST now represents the result of the tokenizer & morphological parser
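The passive example given for version 4.32 can be mimicked by plain variable substitution. The Python below is only a sketch of the idea, not INTEX's implementation: the MORPHOLOGY table is a hypothetical stand-in for the inflectional module, the operator K is assumed to mean "past participle", and the rule is written with $N1 as the subject, which is the ordering that yields the stated result.

```python
import re

# Bindings taken from the parsed text (N0 John) (V eats) (N1 an apple).
bindings = {"N0": "John", "V": "eats", "N1": "an apple"}

# Hypothetical stand-in for the inflectional module: maps (form, operator)
# to a derived form. "K" is assumed here to denote the past participle.
MORPHOLOGY = {("eats", "K"): "eaten"}

VARIABLE = re.compile(r"\$([A-Za-z]+[0-9]*)(?:_([A-Za-z]+))?")


def apply_rule(rule, bindings):
    """Replace $NAME and $NAME_OP variables in a rule (illustrative only)."""
    def expand(match):
        name, operator = match.group(1), match.group(2)
        value = bindings[name]
        if operator is None:
            return value                      # plain variable: copy its text
        return MORPHOLOGY[(value, operator)]  # variable with an operator
    return VARIABLE.sub(expand, rule)


print(apply_rule("$N1 is $V_K by $N0", bindings))  # an apple is eaten by John
```

A real system would compute the derived form from the dictionaries rather than from a hand-written table; the table only keeps the sketch self-contained.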

 

Version 4.31:

 

-- variables in enhanced transducers are now named and can be embedded, e.g. “$(NP … $(DET … $) … $)”

-- a Finite-State tokenizer capable of analyzing complex and ambiguous compounds in Germanic languages and tokenizing Asian languages

-- a morphological parser fully compatible with the DELA system capable of handling derivational morphology

-- the inflectional module can process accents independently from letters

-- a much better installation key and encryption system that offer authors of binary dictionaries (DELAFs or DELACFs) a better security against reverse engineering

 

Version 4.30:

 

-- functionally identical to 4.24, but the architecture of the system and several file formats have been modified

-- latest versions of the English and French DELA dictionaries (thanks to Blandine Courtois)

 

List of enhancements and most important bug fixes in version 4.24:

 

-- A hierarchical view that displays the organization of a grammar

-- Added the option to automatically remove lexical entries associated with Xxx codes (e.g. .XERR and .XIN) from the text FST

-- Added a special delimiter character to the alphabet that can be used to tokenize Asian languages (kosawat@univ-mlv.fr)

-- Added @i variables to the FST outputs built with the lexicon-grammar compiler to lemmatize frozen (and not so frozen) expressions (simona.vietri@tiscalinet.it)

-- Eliminated a bug that occurs when running grammars that contain a subgraph in which the terminal node is unreachable (marchand@tedm.ucl.ac.be)

 

List of enhancements and most important bug fixes in version 4.23e:

 

-- the text font specified in the Alphabet file is now used when displaying the MFT text

-- fixed the “1 character-length” bug in the concordance when entering words in quotes

-- added a first version of a grammar for French determiners (PLEASE HELP ME TO IMPROVE IT!)

-- a new, right to left sort command for dictionaries and lists of tokens

-- the statistical module is now an independent tool (several instances can run at the same time: compare how different graphs behave in different texts, even in different languages!)

-- added a new licensing system that can produce one key for multiple installations

-- a memory leak during the construction of large text FSTs has been fixed

-- a I/O synchronization problem when user scans the text-FST very rapidly has been fixed

 

List of enhancements and most important bug fixes in version 4.23d:

 

-- tokens are up to 512 characters long; this is enough to tag expressions such as {je vous prie d’accepter l’expression de mes sentiments les plus chaleureux,Sincerely.EXP}

-- the syntactic parser is more robust, and can deal with larger texts and grammars

-- the syntactic tree has a better look

 

List of enhancements and most important bug fixes in version 4.23c:

 

-- try the brand new debugger engine & interface! (rebuilt from scratch)

-- a new, more stable & faster RTF driver for those nasty large concordances

-- right click in graph windows to get to a contextual menu

-- the missing NEWLINE/CARRIAGE RETURN bug has been fixed

-- Compounds in dictionaries and grammars can include digits, ‘,’ and ‘.’ (make sure you protect the ‘,’ and ‘.’ with a backslash ‘\’).

-- color incompatibilities when the background of a graph is not white have been solved

-- No more confusions between similar syntactic/semantic codes, e.g. ‘N+NA’ and ‘N+NAN’

-- Syntactic/semantic codes are now handled properly in disambiguation rules

-- Synchronization problems between Concordance, Text and Text FST could occur when the text includes Carriage Return / New line sequences
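The note above about protecting ‘,’ and ‘.’ with a backslash can be automated when compounds are generated by a script. The following Python is a small sketch under that assumption; protect and unprotect are hypothetical helper names, the sample strings are made up, and the sketch does not handle any characters other than the two named above.

```python
import re


def protect(compound):
    """Put a backslash before every ',' and '.' in a compound."""
    return re.sub(r"([,.])", r"\\\1", compound)


def unprotect(text):
    """Inverse of protect(): drop the backslash before ',' and '.'."""
    return re.sub(r"\\([,.])", r"\1", text)


print(protect("U.S.A."))   # U\.S\.A\.
print(protect("3,5 mm"))   # 3\,5 mm
```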

 

List of enhancements and most important bug fixes in version 4.23b:

 

SERIOUS OPTIMIZATION:

 

-- RECON, RECOR, RECORIND and RECONIND (Text > Locate pattern),

-- DICOE (Text > Apply Lexical Resources with some frozen expressions),

-- FST2TXT (Text > Preprocess text),

-- ETIQG and VERIFG (Text > Disambiguation)

-- GR2FST (FSGraph > Tools > Compile)

 

should now run at least 30 times faster than in 4.22;

 

Bug fixes & Enhancements:

 

-- The ‘Locate pattern’ programs would crash in certain configurations if the Alphabet file contained some accented uppercase letters

-- Syntactic or semantic features incorrectly matched against certain complex lexical entries

-- gr2fst crashed when one of the embedded graphs was recognizing <E> and only <E>

-- the '$(' and '$)' were not properly processed in certain circumstances (bug introduced in 4.23)

-- Applying an FST in Merge mode did not always produce a correct result when matching sequences were over 2,048 bytes long

 

List of enhancements and bug fixes in version 4.23:

 

-- programs that apply grammars are no longer limited to handling only 64 graphs

-- some strings in double quotes were not retrieved by reconind & recorind (index search)

-- table2fst did not add blanks between words in different columns

-- some unambiguous compounds with apostrophes were blocking dicoc

-- indexer no longer crashes if it finds tokens of over 512 characters

-- the graph editor and the inflection program now manage graphs stored in c:\ directory

-- fst2txt has been optimized; preprocessing runs up to 4 times faster

-- recor and recorind run up to 3 times faster

-- it is now possible to tag texts by applying only dictionaries for compounds

-- simple word tags do not prevent frozen expressions from being recognized anymore

-- a Windows 95/98/ME bug was preventing INTEX from seeing its installation on some large disks; the new test avoids the problem.

 

List of enhancements and bug fixes in version 4.22:

 

-- a new debugger is now fully integrated in the INTEX environment and uses all lexical resources associated with the currently loaded text.

-- the bug related to the beginning and end of sentence match in the disambiguation process has been fixed.

-- the protection system in the FSGraph editor that would prevent users from entering labels ending with spaces or “+” characters has been more a pain than something useful; it has been removed.

-- some problems related to the use of embedded graphs stored in other directories have been fixed.

-- matching any string protected by double quotes is now compatible with the standard INTEX policy on spaces. Thus, for instance, <MOT>":" matches “table:” as well as “table    :”. At the same time, <MOT>" :" matches only “table :” (neither “table:” nor “table    :”).

-- One can compile a non deterministic fst into a C++ transition table.

-- the verbose mode of the indexer program has been removed.
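The behaviour described above for quoted strings and spaces can be approximated with ordinary regular expressions. The Python below is one possible reading, not the actual INTEX matcher: it assumes that <MOT> stands for a run of word characters, that optional white space is allowed before a quoted string, and that a quoted string beginning with a space is matched literally instead. The function name to_regex is hypothetical.

```python
import re

# A pattern element is either the symbol <MOT> or a double-quoted string.
ELEMENT = re.compile(r'<MOT>|"([^"]*)"')


def to_regex(pattern):
    """Translate a tiny <MOT>/quoted-string pattern into a Python regex."""
    parts = []
    for match in ELEMENT.finditer(pattern):
        if match.group(0) == "<MOT>":
            piece, literal_space = r"\w+", False
        else:
            quoted = match.group(1)
            piece = re.escape(quoted)
            literal_space = quoted.startswith(" ")
        # Allow any amount of white space between elements, unless the
        # quoted string spells out its own leading space.
        if parts and not literal_space:
            parts.append(r"\s*")
        parts.append(piece)
    return re.compile("".join(parts))


loose = to_regex('<MOT>":"')
strict = to_regex('<MOT>" :"')
for text in ["table:", "table :", "table    :"]:
    print(repr(text), bool(loose.fullmatch(text)), bool(strict.fullmatch(text)))
```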

 

List of the major bugs that have been corrected in version 4.21:

 

-- some combinations of inflectional codes did not correctly match a disambiguation rule (visible for some English conjugated verbs),

-- some short lexical symbols did not match lexical entries that have unambiguous inflectional codes (bug introduced on January 4, when correcting the previous bug...),

-- gr2fst crashed if the initial node of an embedded graph was referring to a non-existing graph file, or if an embedded graph recognized the empty string and nothing else,

-- for some recursive grammars, the fst compiled with gr2fst would not always produce the same matches as recorind (whose CF parser uses the text index), even for small depths of recursion,

-- genere crashed if the generated output was longer than 4,096 characters (typically when a fst output is in a loop).

-- recorind (apply a GRF file, and check 'Use Text Index') is up to 3 times faster;

-- one can specify lemmas *and* categories in symbols or tags, e.g. <can.V> or {can.V:C};

-- the built-in symbol <MIX> matches forms in which an uppercase letter follows a lowercase letter (e.g. 'McCarthy');

-- the built-in symbols <U> (Uppercase letter) and <W> (loWercase letter) can be used in the morphological module;

-- Empty deterministic finite state transducers were represented by one state (initial & terminal), no transition. The new representation for empty transducers is: two states, one transition (1)-<E>->(2), (1) is initial, (1) & (2) are terminal. This representation is consistent with all INTEX programs (including flexion).

-- an environment variable INTEXVRB ('verbose'). If set to 'YES', programs called in a shell will display intermediary results; if undefined or set to 'NO', they will not display intermediary results;

-- the command line for indexer.exe was too long. The new usage is:

            indexer.exe {cdls} Text ResultsDirectory

ResultsDirectory is where the five resulting files idx, ida, frq, fr0 and stt are stored.

ResultsDirectory must exist and the user must have read/write/execute permissions for it;

-- a concordance with both left and right context lengths set to 0 now produces the list of all matching sequences.
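The built-in symbols <MIX>, <U> and <W> mentioned in the list above are simple to state as predicates. The Python below is only an illustration of the definitions given there, not INTEX code; the function names are hypothetical, and "follows" is assumed to mean "immediately follows".

```python
def is_mix(form):
    """<MIX>: some uppercase letter directly follows a lowercase letter."""
    return any(a.islower() and b.isupper() for a, b in zip(form, form[1:]))


def is_u(char):
    """<U>: a single uppercase letter."""
    return len(char) == 1 and char.isupper()


def is_w(char):
    """<W>: a single lowercase letter."""
    return len(char) == 1 and char.islower()


for form in ["McCarthy", "Table", "TABLE"]:
    print(form, is_mix(form))
```

Only 'McCarthy' satisfies is_mix here: 'Table' has no uppercase letter after a lowercase one, and 'TABLE' has no lowercase letter at all.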

 

Old milestones...

The "LADL, Université Paris 7" versions (v1.0 to v3.5, 1993-1997): The first integrated version, running under the NextStep Operating System (a UNIX-like OS similar to Mac OS X), was released in 1993. It was written in Objective C (a mix of Smalltalk and C) and Display PostScript. It was then adapted to the OpenStep OS and could run on NeXT boxes as well as SUN workstations and PCs:

INTEX 1.x : a graph editor plus a set of UNIX commands to process finite-state automata and apply them to texts.

INTEX 2.x : an integrated GUI for text analysis and concordances

INTEX 3.x : from NextStep to OpenStep for PCs, NeXT, SUN and HP workstations.

The "GRELIS" versions (v4.x, since 1997): While at the Université de Franche-Comté, in the GRELIS laboratory, I first tried to adapt the OpenStep/PostScript code to OS/2, then to DOS/Windows 95, without any success. I finally decided to rewrite the whole system from scratch (v4.x) in C (for the linguistic engine) and in C++ (for the GUI) using the Windows API and the Borland C++ Builder development environment. I took advantage of the opportunity to write a new linguistic engine based completely on finite-state transducers, represent both DELAF and DELAC-type dictionaries as FSTs, add a morphological module, and integrate the lexical parser with a new syntactic parser thanks to the Text-MFT representation.

INTEX 4.0x : INTEX for DOS/Windows 95; an inflectional module

INTEX 4.1x : A brand new 32-bit engine: all linguistic data is represented by finite-state transducers

INTEX 4.2x, 4.3x : see above

 


Mailing List

Most users subscribe to the mailing list info-intex, hosted by the New York University (NYU) Information Technology Services. info-intex users post on a regular basis:

-- discussions related to linguistic representations,

-- discussions related to the programming interface,

-- various announcements of interest for INTEX users,

-- description of technical problems and upgrades,

-- examples of fun uses of INTEX, tips about INTEX, etc.

To contribute to the list by email, send a message to:  info-intex@forums.nyu.edu
To access the list via an Internet browser, go to: http://forums.nyu.edu and enter the keyword: info-intex
To access the list via a News browser, go to: news://forums.nyu.edu/info-intex
For any questions related to your subscription email to: owner-info-intex@forums.nyu.edu


The Author: Max Silberztein

I constructed the first package of Finite State tools for Natural Language Processing, as well as the French DELAC-DELACF dictionaries for compound words, for my PhD research from 1986 to 1989 at the LADL (University of Paris 7-CNRS), under the supervision of Prof. Maurice Gross. The thesis was later published as:

Max Silberztein, 1993. Dictionnaires électroniques et analyse automatique de textes : le système INTEX. Masson Ed.: Paris.

A few “first-time” specifics about my PhD thesis:

-- a morphological parser programmed by a mere lookup procedure of DELAF-type dictionaries automatically expanded from DELAS-type dictionaries, as opposed to parsers (often programmed in PROLOG) that generated word lemmas by splitting complex utterances.

-- the construction of the DELAC electronic dictionary for compounds, formalized from several lists of “frozen nouns”, “idioms”, “co-locations” and “complex terms” listed by several teams of linguists, under the direction of Maurice Gross (Université Paris 7) and Gaston Gross (Université Paris 13). The resulting DELAC-DELACF-type dictionaries are used by INTEX to recognize compounds automatically in texts, exactly like the morphological parser recognizes simple words.

-- the thesis dealt with the full complexity of lexical parsers of natural languages: recognition, representation and processing of affixes and morphemes, simple words, compound words and frozen expressions

-- a representation of texts that formalizes all types of lexical ambiguities: between simple words, simple and complex words, complex words, etc., as an Acyclic Finite-State Automaton, and then later, in the Université de Franche-Comté (4.x) versions, as a Finite-State Transducer.

-- the first disambiguation module that uses local grammars represented by Finite-State Transducers to represent left and right contexts of ambiguous words. This optional module uses an original “intersection” algorithm to lighten the Finite-State Transducer of the text. The resulting rules can take compounds and frozen expressions into account, and are not limited in length, as opposed to most current disambiguation programs that consider contexts in fixed-size windows (often two or three tokens).
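The lookup-based approach listed above, in which simple words and compounds are both recognized by dictionary lookup and all competing analyses are kept, can be sketched in a few lines. The Python below is a toy illustration and not the INTEX engine: the lexicon entries, the codes, and the names LEXICON, MAX_COMPOUND_LEN and lexical_hypotheses are all invented for the example.

```python
from typing import Dict, List, Tuple

# Toy lexicon: form -> list of (lemma, code). Entries and codes are invented.
LEXICON: Dict[str, List[Tuple[str, str]]] = {
    "pommes": [("pomme", "N")],
    "de": [("de", "PREP")],
    "terre": [("terre", "N")],
    "pommes de terre": [("pomme de terre", "N")],
}
MAX_COMPOUND_LEN = 3  # longest entry of the toy lexicon, counted in tokens


def lexical_hypotheses(tokens: List[str]) -> List[Tuple[int, int, str, str]]:
    """Return every (start, end, lemma, code) found by plain lookup."""
    edges = []
    for start in range(len(tokens)):
        last = min(start + MAX_COMPOUND_LEN, len(tokens))
        for end in range(start + 1, last + 1):
            form = " ".join(tokens[start:end])
            for lemma, code in LEXICON.get(form, []):
                edges.append((start, end, lemma, code))
    return edges


for edge in lexical_hypotheses(["pommes", "de", "terre"]):
    print(edge)
```

Each tuple can be read as a transition between two token positions, so the compound reading and the sequence of simple words coexist as alternative paths through an acyclic graph, which is the kind of ambiguity a later disambiguation step would prune.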

I wish to express many thanks to my colleagues and students, as well as to all the INTEX users who have contributed (and continue to do so) to help enhance INTEX with their patience, criticisms, creative ideas and ambitious expectations.


Comments, Questions on this WEB site, or on INTEX?