[alicebot-developer] The wildcard-filling problem

Gary Poster alicebot-developer@list.alicebot.org
Tue, 4 Oct 2005 21:22:10 -0400


On Oct 4, 2005, at 6:30 PM, Helio Perroni Filho wrote:

> --- Gary Poster <gary@modernsongs.com> wrote:
>
>
>> I have some unreleased Python code that tackled this
>> problem, among many others. (...)
>>
>> I tokenized -- so I stopped caring about whitespace,
>> which I think you are saying you care about also --
>> into objects, without the mapping you are talking
>> about.
>>
>
> That's interesting, but how do you deal with the case
> where an element in the original sentence is split in
> two -- for example, changing "he's" to "he is"?

Heh, looks like my memory was a bit off.  Here's an excerpt from the  
pertinent part of the docs:

----8<----

Sentence
========
A tokenized "sentence" is a tuple subclass whose items are themselves
tuples.  Each item tuple is called a "token" in the AIMLes docs and is
composed of four elements:

- the tokenized value;
- the tokenized type;
- the source that generated this value, if it is the first token
   generated from the source; and
- the total number of tokens generated from this particular source.

Sentences also have two special attributes: memory and nextSentences.
Memory is a dictionary of predicate memory values that should be applied
if this sentence is used.  nextSentences is a read-only (and lazily
calculated) attribute that returns an iterable of the next sentence
possibilities after this one, or None.

----8<----
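
As a rough illustration of that description (this is not the actual
AIMLes code), a sentence along these lines could be sketched as below.
The names Token, successors, and the field names value/type/source/count
are assumptions made for the example; the excerpt only describes the
four elements and the two attributes.

```python
from collections import namedtuple

# Hypothetical field names: the docs describe four elements per token
# but do not name them.
Token = namedtuple("Token", "value type source count")


class Sentence(tuple):
    """A tuple of four-element tokens with memory and nextSentences."""

    def __new__(cls, tokens, memory=None, successors=None):
        self = super().__new__(cls, (Token(*t) for t in tokens))
        # Predicate memory values to apply if this sentence is used.
        self.memory = dict(memory or {})
        # Hypothetical zero-argument callable producing the next sentences.
        self._successors = successors
        self._next = None
        self._computed = False
        return self

    @property
    def nextSentences(self):
        """Read-only, computed on first access; None if no successors."""
        if not self._computed:
            self._next = self._successors() if self._successors else None
            self._computed = True
        return self._next


sentence = Sentence(
    [("what ", None, "Whaddya ", 3),
     ("do ", None, None, 3),
     ("you", None, None, 3)],
    memory={"topic": "pictures"},
)
```

Because the class is still a tuple, indexing, slicing, and iteration
work unchanged, and tuple(sentence) gives back plain token tuples.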

Thus, given this input:

   >>> source = ("howdy pardner! Whaddya think of this "
   ... "picture <img src='http://example.com/my_teddy_bear.png' />?")


The tests come up with two competing (scored) substitutions (one that
tries a typed 'image' match, generated from an XML parse, and one that
elides the image); an expansion of "Whaddya" to "what do you"; and a
compression of 'howdy pardner' to 'hi'.  Note that whitespace in the
source value (the third element of each tuple) is in fact not a
separate object, and that the source does honor *normalized*
whitespace--both contrary to what I said before.  Also note that case
and normalized whitespace are intentionally maintained in the tokenized
value (the first element of each tuple), although some of the
substitutions were not as careful as they should have been.  Those are
normalized only if necessary later on.  My general approach was to
allow accurate matches if they were available, at the potential cost of
non-trivial additional work for the system.

My doctests come out with this output, then:

   >>> import pprint
   >>> pp = pprint.PrettyPrinter(width=65).pprint
   >>> pp([tuple(sentence) for sentence in sentences])
   [((u'hi', None, u'howdy pardner', 1),
     (u'! ', None, u'! ', 1),
     (u'what ', None, u'Whaddya ', 3),
     (u'do ', None, None, 3),
     (u'you', None, None, 3),
     (u'think ', None, u'think ', 1),
     (u'of ', None, u'of ', 1),
     (u'this ', None, u'this ', 1),
     (u'picture ', None, u'picture ', 1),
     (u'http://example.com/my_teddy_bear.png', u'image', u'', 1),
     (u'?', None, u'?', 1)),
    ((u'hi', None, u'howdy pardner', 1),
     (u'! ', None, u'! ', 1),
     (u'what ', None, u'Whaddya ', 3),
     (u'do ', None, None, 3),
     (u'you', None, None, 3),
     (u'think ', None, u'think ', 1),
     (u'of ', None, u'of ', 1),
     (u'this ', None, u'this ', 1),
     (u'picture ', None, u'picture ', 1),
     (u'?', None, u'?', 1))]
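
To connect this back to the "he's" -> "he is" question: one way the
source and count elements could be used to recover the original text
behind a wildcard match is sketched below.  This is only a sketch under
stated assumptions, not the AIMLes implementation: source_groups and
source_for_span are hypothetical helpers, and it assumes tokens from
one source are contiguous, that the first token of each group carries
the source, and that the count is at least 1.

```python
def source_groups(tokens):
    """Yield (start, stop, source) for each run of tokens from one source."""
    i = 0
    while i < len(tokens):
        _value, _type, source, count = tokens[i]
        yield i, i + count, source
        i += count


def source_for_span(tokens, start, stop):
    """Return the original text behind tokens[start:stop].

    Any source group that overlaps the span is included whole, since a
    source such as "Whaddya " cannot be split to match its parts.
    """
    return "".join(
        source
        for lo, hi, source in source_groups(tokens)
        if lo < stop and hi > start
    )


# The first few tokens of the output above.
tokens = [
    ("hi", None, "howdy pardner", 1),
    ("! ", None, "! ", 1),
    ("what ", None, "Whaddya ", 3),
    ("do ", None, None, 3),
    ("you", None, None, 3),
    ("think ", None, "think ", 1),
]

# A wildcard that matched tokens 3..5 ("do ", "you", "think ") maps
# back to the whole "Whaddya " group plus "think ".
print(source_for_span(tokens, 3, 6))
```

Pulling in the whole group is a design choice for the sketch; a matcher
could instead refuse a wildcard boundary that falls inside a group.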

I'm not showing other aspects of the sentence, but that's a start.   
There's a *lot* more to it.  I was a tad ambitious.  ;-)  All that's  
left to give it a whirl is to hook up some valid, well-formed AIML to  
it.  The GPL license of Dr. Wallace's files meant I could never  
really use my code for work, so I put it aside for now.  Maybe I'll  
pick it up later.

Gary