LEMMATIZATION OF COMPOUND FORMS OF VERBS

Maurice Gross
LADL
University Paris 7

In general, lemmatization is performed on verbs conjugated by means of suffixes, that is, on verbs that are morphologically simple. In English, we then have paradigms such as:

to work: work, works, worked, working
to eat: eat, eats, ate, eaten, eating

But there is no reason why is working or has eaten should not be included in the above paradigms and lemmatized in the same way; after all, they are full-fledged conjugated forms. However, from the point of view of parsing, there is a difficulty in recognizing compound tenses, because inserts may occur between the parts of a compound form:

Jo is today working on an essay
Bob has not much eaten

Hence, inserts have to be recognized in order to bring together the parts of the compound verb.
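The idea of skipping inserts to reunite the parts of a compound form can be sketched as follows. This is only a minimal illustration under strong assumptions: the lexicon, the tag names (AUX, ADV, VING, VEN) and the function names are hypothetical, inserts are limited to single-word adverbs, and nothing here reproduces the actual graphs or dictionaries.

```python
# Toy lexicon (an assumption for illustration, not the DELAF dictionary):
# surface form -> (lemma, tag).
LEXICON = {
    "is": ("be", "AUX"), "has": ("have", "AUX"),
    "working": ("work", "VING"), "eaten": ("eat", "VEN"),
    "today": ("today", "ADV"), "not": ("not", "ADV"), "much": ("much", "ADV"),
}
# Auxiliary lemma -> tag required on the main verb (hypothetical tag names).
COMPOUND = {"be": "VING", "have": "VEN"}


def lookup(token):
    """Return (lemma, tag) for a token, or (None, None) if it is unknown."""
    return LEXICON.get(token.lower(), (None, None))


def find_compound_verbs(tokens):
    """Return (auxiliary lemma, inserts, main verb lemma) for each compound form."""
    results = []
    i = 0
    while i < len(tokens):
        aux_lemma, tag = lookup(tokens[i])
        if tag == "AUX":
            j = i + 1
            inserts = []
            # Skip over a run of inserts (here: single-word adverbs only).
            while j < len(tokens) and lookup(tokens[j])[1] == "ADV":
                inserts.append(tokens[j].lower())
                j += 1
            if j < len(tokens):
                verb_lemma, verb_tag = lookup(tokens[j])
                if COMPOUND.get(aux_lemma) == verb_tag:
                    results.append((aux_lemma, inserts, verb_lemma))
                    i = j + 1
                    continue
        i += 1
    return results


print(find_compound_verbs("Jo is today working on an essay".split()))
print(find_compound_verbs("Bob has not much eaten".split()))
```

On the two example sentences, the sketch brings is and working together across today, and has and eaten together across not much, so that the forms can be attached to the paradigms of to work and to eat.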

1. Inserts

Inserts are of various types, ranging from simple adverbs to complex combinations of adverbial phrases; some of these phrases can even be sentential, in which case their length is unbounded and their analysis requires the full power of a sentence parser. Nonetheless, it is possible to construct detailed grammars for many adverbial phrases.

The negation not has a special status as an insert: it occurs between auxiliaries and main verbs. But it interferes in other ways with the auxiliary system. Firstly, it is integrated into the auxiliary form cannot and into many contracted forms (isn't, shouldn't, etc.). Secondly, it is introduced in most verbs by means of the auxiliary do. Do has no compound forms itself and is thus limited to the forms do, does and did. Thirdly, with some auxiliary verbs (e.g. to be, to have) and with to need, not is introduced without do; it must then have a special treatment linked to the occurrence of these auxiliaries.

As a consequence, we lemmatize negative verbs, such as do not V, as Vs in the negative form, hence we treat negative verbs as compound.
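The treatment of do not V as a negative form of V can be illustrated by a small sketch. The table of negative forms below is a hypothetical and deliberately incomplete sample, and the function names are ours; the sketch only handles not placed immediately after the auxiliary.

```python
# Hypothetical, incomplete table: negative form -> auxiliary it contains.
NEGATIVE_FORMS = {
    "cannot": "can", "isn't": "is", "shouldn't": "should",
    "don't": "do", "doesn't": "does", "didn't": "did",
}
# Do has no compound forms, so these three forms are the whole list.
DO_FORMS = {"do", "does", "did"}


def expand_negation(tokens):
    """Split forms such as cannot or isn't into auxiliary + not."""
    out = []
    for token in tokens:
        low = token.lower()
        if low in NEGATIVE_FORMS:
            out.extend([NEGATIVE_FORMS[low], "not"])
        else:
            out.append(low)
    return out


def lemmatize_do_not(tokens, verb_lemmas):
    """Return (lemma, 'negative') for every sequence do/does/did not V."""
    toks = expand_negation(tokens)
    found = []
    for i in range(len(toks) - 2):
        if toks[i] in DO_FORMS and toks[i + 1] == "not" and toks[i + 2] in verb_lemmas:
            found.append((verb_lemmas[toks[i + 2]], "negative"))
    return found


print(expand_negation("Bob cannot sleep".split()))
print(lemmatize_do_not("Bob doesn't eat".split(), {"eat": "eat"}))
```

Expanding the contracted forms first means that a single pattern (do not V) covers both the full and the contracted spellings.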

Resources

To parse adverbials, the following resources are available:

- in the dictionary of simple words (i.e. DELAF), adverbs are marked by the symbol ADV;
- in a lexicon-grammar, frozen adverbs have been represented (M. Gross 1991); they are used by the parsing procedure with the same symbol ADV;
- various inserts, such as time adverbials and some sentential inserts, have been described in terms of local grammars. Again, the parser treats them like the other ADV forms.

When all these occurrences of adverbials are parsed, practically all compound forms of verbs found with inserts in corpora can be recognized.

2. Auxiliary verbs

The notion of auxiliary verb is hard to define. On an intuitive basis, one could consider auxiliary verbs as verbs that add some meaning to the meaning of a main verb. With such a definition, the following examples where the main verb is to sleep contain auxiliary verbs:

Bob (is + ought + begins + wants) to sleep
Bob (is + went on + thought of) sleeping

Moreover, there is no reason to limit the analysis to morphologically simple verbs: other forms consisting of nouns built with verbs such as to be, to have and frozen forms are auxiliaries from the same semantic point of view:

Bob (is unable to + found a way + has a right) to sleep
Bob (is on the verge of + has trouble + came close to) sleeping

From a syntactic point of view, our examples present sharp differences, hence we classified them according to grammatical categories that are more or less recognized in the domain. Although categories of auxiliary verbs are described in all kinds of grammars, constructing a full list for them is not an easy task, in the absence of coherent definitions. When operational definitions are given, that is, syntactic definitions, reaching a consensus about a list of lexical items is an exercise that has never been attempted. One can safely predict that due to the variety of interests competing on the market of linguistic theories, no agreement is possible today. We nonetheless propose a concrete classification of these verbs (i.e. lists), largely based on the various descriptions available in current grammars.

We have subdivided auxiliary verbs into five categories:

- tense auxiliaries,
- passive auxiliaries,
- aspectual verbs (noted VAsp),
- modality verbs (noted VMod),
- verbs with sentential complements (noted VS).

The first three categories are reasonably complete; the last one provides only examples of verbs, and the lists should be substantially extended. The INTEX system can use these categories to tag verbs.

2.1. Tense auxiliaries are rather sharply delimited:

- past tense auxiliaries: be and have combined with past participles; some verbs take the auxiliary to be, most verbs take to have, some take both: (is + has)(arrived + gone); the progressive forms be and have been combined with present participles;
- modals: can, could, may, might, will, would, shall, should, ought to, used to, be to, have to. These verbs have restricted conjugations:
  - some have no infinitive form (can, will, ought to, used to),
  - some cannot be conjugated or are highly defective:
 
Bob (is + was) to sell his car
* Bob will be to sell his car

Simple tenses apply more or less regularly to auxiliary verbs. Tensed auxiliary verbs are listed, including forms contracted with subject pronouns (graph Tense V). Some contractions are ambiguous, for example I'd = I had or I would, but the right context disambiguates them. Contractions with nominal subjects are also observed: My cousin's gone, The best part's gone. They can be locally mistaken for possessive cases, and eliminating the ambiguity requires a deeper analysis of sentences. Roughly the same forms with negations are described in a separate graph (graph Tense Not V).
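The use of the right context to expand a contraction such as I'd can be sketched as follows. This is an illustration only: the tag table and the tag names (V for base form, VEN for past participle) are hypothetical, and the sketch looks at the immediately following word, ignoring possible inserts.

```python
# Hypothetical toy tags for the word that follows 'd:
# V = base form, VEN = past participle.
FORM_TAGS = {
    "gone": {"VEN"}, "eaten": {"VEN"},
    "go": {"V"}, "eat": {"V"},
    "come": {"V", "VEN"},  # homographic base form and past participle
}


def expand_d(next_word):
    """Return the possible expansions of 'd given the word to its right."""
    tags = FORM_TAGS.get(next_word.lower(), set())
    readings = []
    if "VEN" in tags:
        readings.append("had")    # had + past participle
    if "V" in tags:
        readings.append("would")  # would + base form
    return readings


print(expand_d("gone"))
print(expand_d("go"))
print(expand_d("come"))
```

When the following word is both a base form and a past participle, as with come, both readings are returned and the ambiguity remains for a later stage of analysis.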

2.2. Passive auxiliaries: be, become, get, grow, remain, stay, combined with past participles of transitive verbs are treated separately. These extensions of the auxiliary be are not accepted by all verbs; many constraints will have to be refined accordingly. For example, the above auxiliaries have aspectual meanings. Sentences that have an auxiliary of the form be Adjective, be N, be Prep N or have N are not fundamentally different from passive sentences; they can be treated together. Verbs like to appear, to look, to seem have forms similar to the variants of be, although they will be analyzed as reductions of sentences containing to be (She seemed satisfied = She seemed to be satisfied). Examples of these more general verbs have been included in various graphs. Passive forms are described in several distinct graphs (BeTVed).

2.3. Aspectual verbs have been distinguished because they modify the time complements of the main verb. In the sentences that follow, acceptabilities are reversed when an aspectual verb such as to begin is introduced:

Bob read both books in six hours
? Bob read both books at 6 p.m.
? Bob began to read both books in six hours
Bob began to read both books at 6 p.m.

The compatibility between the verb and the two adverbials of date and duration can be described in a natural way, if one considers to begin to read as a simple verbal unit.

2.4. Modality verbs

Z.S. Harris (1964) has syntactically defined a set of verbs or operators, noted U, that covers the set of aspectual and modality verbs. Sentences containing U verbs are of the form N0 U (Prep) V0 W and cannot be derived from more complex sentences, as is the case for to want, for example:

Bob wants him to sell his car
= Bob wants to sell his car when Bob = him

Although the last sentence has the form of a sentence with auxiliary verb U, to want is not a U but a more complex operator (i.e. VS). The list of modality verbs, noted VMod, is sometimes difficult to separate from the list of the verbs with full sentential complements, noted VS. Examples of Us are:

Bob (attempted + failed + tried) to sell his car

We call auxiliaries the whole set of Us, including complex verbal phrases such as be able to, be in the position of, have the ability to.

2.5. Verbs with full sentential complements

We have already given some examples of the more complex operators, in the active form: to want, to dream, etc., in the passive form: to be authorized, to be forced. We have only considered their constructions where they appear as similar to aspectual or modality verbs, constructions that are obtained through a variety of syntactic transformations. Hence, to want is auxiliary-like in Bob wanted to leave, whereas to recommend is not in Bob recommended to leave, since Bob is not the subject of to leave.

Let us sum up the various cases of auxiliary verbs that have been considered:

Aspectual and modality verbs: they combine with verbs in the infinitive or in the gerund form. The different formal types are:

- simple verbs: to help, to need,
- verbs with a preposition: insist on Ving, tend to V,
- adjectives constructed with be: be able to V, be aware of Ving,
- prepositional phrases constructed with be:
 
be in charge of Ving,
be on the verge of Ving,
be on Poss0 way to V,

Poss0 is a possessive adjective that must refer to N0, the subject.

- noun phrases constructed with to have (graph Have N),

Bob has the ability to remove the cars

- verbs with sentential complements have forms that behave exactly like aspectual or modality verbs:

Bob (accepted + wanted) to leave
Bob (insisted on + thought of) leaving

- frozen sentences with other verbs than to have or to be have similar constructions and the same semantic functions:

N0 turn out to V0 W
N0 get the green light to V0 W
= N0 was given the green light to V0 W

- verbs with sentential complements have forms that behave exactly like adjectival modals. For example, the verbs (graph VS):

Bob ordered Joe to leave early
The impact sent the car spinning

have passive forms:

Joe was ordered to leave early
The car was sent spinning

which are similar to:

I was able to sell his car

where the be Adjective form is a modality similar to can in:

I could sell his car

We have listed a certain number of these forms and we consider that they can be lemmatized in the same way as the proper auxiliaries (VPassiveBePrepVing, VPassiveBeToV).

Our reason to include verbs with full sentential complements in the lemmatization process is to show how such lexical items can be parsed and interpreted within the local grammars of auxiliaries.

Aspectual and modality verbs can combine in different orders:

Bob began insisting on leaving early
Bob insists on beginning to work early

As a first approximation, we consider that a given verb V can accept up to three types of auxiliary verbs: tense auxiliaries (Aux) and aspectual and modality verbs (VAsp, VMod). Tense auxiliaries only modify verbs, whereas aspectual and modality verbs may have nominal complements equivalent to verbal ones (begin the reading vs begin to read); hence, aspectual and modality verbs may behave like ordinary verbs and have to be lemmatized accordingly.

There are complex restrictions on the combinations of VAsps and VMods; they can be studied at a later stage. We did not take this factor into account, and let all combinations apply more or less recursively. The situation allowed is even more complex, since we introduce a sample of generalized auxiliary verbs, those with sentential complements (VS), whose constructions (N0 VS V0 W and N0 VS (Prep) Ving0 W) are identical to those of the VAsps and VMods.
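The unrestricted, recursive combination of auxiliaries can be illustrated by a sketch that peels auxiliaries off one by one until a main verb is reached. The two small tables and the function name are hypothetical; they cover only the two example sentences, and no restriction on the order of VAsps and VMods is checked.

```python
# Hypothetical toy tables covering only the examples of this section.
VERB_LEMMAS = {
    "began": "begin", "beginning": "begin",
    "insists": "insist", "insisting": "insist",
    "leaving": "leave", "work": "work",
}
# Auxiliary lemma -> linking words it accepts ("" means no linking word).
AUX_LINKS = {"begin": {"", "to"}, "insist": {"on"}}


def parse_verb_chain(tokens):
    """Return (list of auxiliaries, main verb lemma), or None if no chain is found."""
    chain = []
    i = 0
    while i < len(tokens):
        lemma = VERB_LEMMAS.get(tokens[i].lower())
        if lemma is None:
            return None
        if lemma not in AUX_LINKS:
            return chain, lemma  # main verb reached
        nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        if nxt != "" and nxt in AUX_LINKS[lemma]:
            chain.append(f"{lemma} {nxt}")  # auxiliary with its preposition or to
            i += 2
        elif "" in AUX_LINKS[lemma]:
            chain.append(lemma)  # auxiliary directly followed by a verb
            i += 1
        else:
            return None
    return None


print(parse_verb_chain("began insisting on leaving early".split()))
print(parse_verb_chain("insists on beginning to work early".split()))
```

Both orders of combination are accepted by the same loop, and the main verb (leave, work) is the lemma under which the whole sequence would be filed.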

3. Ambiguities

3.1. Systematic ambiguities

1) A form such as:

plan to leave at night

is recognized with the generalized modal verb to plan. However, plan is also a noun that governs the same infinitive verb phrases, as in:

He presented a (plan to eliminate)V pesticides
The states use the Civil Justice Reform (Act to require)V

where parentheses indicate 'wrong' analyses. This ambiguity can only be resolved by exploring the left context. We did so by introducing a graph of noun phrases (NHave) that describes forms such as:

a (E + (E + very) clear) plan to leave at night

and that recognizes some prenominal modifiers and determiners. We have also introduced partial descriptions of adverbial phrases such as in order to V and in view of Ving since, by the same token, their left context Prep Det resolves similar ambiguities. We could also have represented left contexts of verbs, which, in general, are subject noun phrases, often subject pronouns. Adjoining pronouns to the graphs of conjugated auxiliaries is a natural extension, since we already had to introduce many of them that are contracted with the verb (e.g. I'm, we're). But in this presentation, our goal is restricted, and we only suggest various generalizations which can be implemented at a further stage.
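The role of the left context can be illustrated with a regular expression that mirrors the pattern a (E + (E + very) clear) plan to V. This is a sketch under stated assumptions: it is not the NHave graph, it covers only the determiner a and the modifiers shown in the pattern, and the function name is ours.

```python
import re

# Mirrors  a (E + (E + very) clear) plan to V, where E is the empty word.
NOUN_CONTEXT = re.compile(r"\ba (?:(?:very )?clear )?plan to \w+", re.IGNORECASE)
# The bare sequence, with no check of the left context.
PLAN_TO = re.compile(r"\bplan to \w+", re.IGNORECASE)


def plan_reading(sentence):
    """Return 'noun', 'verb' or None for the form plan to V in a sentence."""
    if NOUN_CONTEXT.search(sentence):
        return "noun"  # a determiner to the left excludes the verbal reading
    if PLAN_TO.search(sentence):
        return "verb"
    return None


print(plan_reading("He presented a plan to eliminate pesticides"))
print(plan_reading("They plan to leave at night"))
```

The nominal pattern is tried first, so that the sequence is analyzed as a generalized modal verb only when no nominal left context has been recognized.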

2) Some of the modal and aspectual verbs have nominal complements parallel to their verbal ones, that is, they enter both structures:

N0 V Prep V0 W (=: Bob began to read the text)
N0 V Prep N (=: Bob began the reading of the text)
(Prep is zero here)

This situation may generate false analyses such as: (help)Aux(low)V, where the noun or adjective low is mistaken for a verb in the sentence:

They help low income families

The following examples are of a similar type:

These (are daunting)V goals (model: They are flying planes)
It is (cost)V effective (model: is gone)
a lot of (work)Aux (left)ADV to (do)V
doesn't (clean)ADV (up)V his room
(helping)Aux (even)V young children

3) Some ambiguities may result from a particular choice of grammatical codes in the dictionary. For example, verb particles, such as about, up, down, around, off, are coded Adverb in some dictionaries. As a consequence, our local grammars will analyse a sequence such as:

All this is about giving Bob a chance

with about as an Adverb; hence, is giving will be parsed as the progressive form of to give. By using the code Prt for Particle, this difficulty is avoided.
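The effect of the choice of codes can be shown with a sketch in which the same pattern, be ADV* Ving, is applied under two codings of about. The tag names BE and VING, the two small dictionaries and the function name are hypothetical; only ADV and Prt come from the discussion above.

```python
def is_progressive(tokens, tags):
    """True if the tokens match  be ADV* Ving  under the given word -> tag coding."""
    if not tokens or tags.get(tokens[0]) != "BE":
        return False
    i = 1
    # Any run of words coded ADV is skipped as an insert.
    while i < len(tokens) and tags.get(tokens[i]) == "ADV":
        i += 1
    return i < len(tokens) and tags.get(tokens[i]) == "VING"


BASE = {"is": "BE", "giving": "VING"}
adv_coding = {**BASE, "about": "ADV"}  # about coded as an adverb
prt_coding = {**BASE, "about": "Prt"}  # about coded as a particle

print(is_progressive(["is", "about", "giving"], adv_coding))
print(is_progressive(["is", "about", "giving"], prt_coding))
```

With the ADV coding, about is skipped as an insert and the false progressive is giving is recognized; with the Prt coding, the pattern no longer applies.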

3.2. Accidental ambiguities

1) Homographs. In the sentences:

Why do people behave so?
(When + Where) do (right + wrong) people behave so?
They do go to school

the nouns or adjectives people, right, wrong, school are also verbs; when preceded by do, the programme recognized verbal complexes. The same ambiguity occurs with other inversions of the subject, as in:

Only in good times do people want to read

Other types of homography involve words which are, on the one hand, participles or adjectives and on the other, verbal forms:

(is encouraging) news
(are dead) ends
(Head Start) funding (Head Start is a proper name).

2) My friends (at first)ADV (glance)V seemed lost

My friends (at first glance)ADV (seemed)V

In this example, the longest match principle resolves the ambiguity (i.e. the adverb at first is a prefix of the longer adverb at first glance).
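The longest match principle can be sketched as follows, assuming a hypothetical list of frozen adverbs stored as tuples of words; the function name is ours.

```python
# Hypothetical sample of frozen adverbs, stored as tuples of words.
ADVERBS = [("at", "first"), ("at", "first", "glance")]


def longest_adverb(tokens, start):
    """Return the length of the longest adverb beginning at position start (0 if none)."""
    best = 0
    for adverb in ADVERBS:
        n = len(adverb)
        if tuple(tokens[start:start + n]) == adverb and n > best:
            best = n
    return best


tokens = "my friends at first glance seemed lost".split()
print(longest_adverb(tokens, 2))
```

Since the three-word adverb is preferred at position 2, glance is consumed as part of the adverb and can no longer be proposed as a verb.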

3) The utterance come to power can be analyzed either as a phrasal verb where power is a noun or with the auxiliary come to and the verb to power.

4) The auxiliary be to in the sentence Our son is to arrive at noon is ambiguous, in the sense that its sentence structure can be confused with that of sentences of the type:

Our goal is to arrive at noon

Resolving this ambiguity requires a detailed analysis of the sentence, and in particular, a list of the nouns such as goal which allow this construction. We can make human nouns a special case, since they cannot be found in this subject position. We could introduce human subject pronouns for the auxiliary verb to be to. Such pronouns are easy to recognize and are numerous in various texts. Then, when a human pronoun is found, the ambiguity (i.e. the wrong analysis) would disappear. The general case with full human noun phrases is harder to solve.
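The suggestion of attaching human subject pronouns to the auxiliary be to can be sketched as follows. The pronoun list, the list of forms of be and the function name are assumptions made for the illustration; full noun phrases in subject position are simply left undecided.

```python
# Hypothetical lists; only pronouns that refer to humans are included.
HUMAN_PRONOUNS = {"i", "you", "he", "she", "we"}
BE_FORMS = {"am", "is", "are", "was", "were"}


def be_to_reading(tokens):
    """Classify X be to V: 'auxiliary' if X is a human pronoun, else 'undecided'."""
    for i in range(1, len(tokens) - 1):
        if tokens[i].lower() in BE_FORMS and tokens[i + 1].lower() == "to":
            subject = tokens[i - 1].lower()
            return "auxiliary" if subject in HUMAN_PRONOUNS else "undecided"
    return None  # no sequence be to was found


print(be_to_reading("He is to arrive at noon".split()))
print(be_to_reading("Our goal is to arrive at noon".split()))
```

The sketch decides only the easy case: with a pronoun such as he, the reading with the noun of the goal type is excluded, whereas Our son and Our goal both remain undecided until nouns are classified.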

 

 

References

Gross, Maurice 1975. Méthodes en syntaxe, Paris: Hermann, 412 p.

Gross, Maurice 1990. Grammaire transformationnelle du français. 3-Syntaxe de l'adverbe, Paris: ASSTRIL, 670 p.

Harris, Zellig 1964. The Elementary Transformations, Philadelphia: University of Pennsylvania, TDAP N°54. Reprinted in Papers in Structural and Transformational Linguistics, 1970, Dordrecht: Reidel, pp.482-532.

Rosenbaum, Peter S. 1967. The Grammar of English Predicate Constructions, Cambridge, Mass.: The MIT Press.