[alicebot-developer] The wildcard-filling problem
Gary Poster
alicebot-developer@list.alicebot.org
Tue, 4 Oct 2005 21:22:10 -0400
On Oct 4, 2005, at 6:30 PM, Helio Perroni Filho wrote:
> --- Gary Poster <gary@modernsongs.com> escreveu:
>
>
>> I have some unreleased Python code that tackled this
>> problem, among many others. (...)
>>
>> I tokenized -- so I stopped caring about whitespace,
>> which I think you are saying you care about also --
>> into objects, without the mapping you are talking
>> about.
>>
>
> That's interesting, but how do you deal with the case
> where an element in the original sentence is split in
> two -- for example, changing "he's" to "he is"?
Heh, looks like my memory was a bit off. Here's an excerpt from the
pertinent part of the docs:
----8<----
Sentence
========
A tokenized "sentence" is a tuple subclass of tuples. Each composite
tuple is called a "token" in the AIMLes docs and is comprised of four
elements:
- the tokenized value;
- the tokenized type;
- the source that generated this value, if it is the first token
  generated from the source; and
- the total number of tokens generated from this particular source.
Sentences also have two special attributes: memory and nextSentences.
Memory is a dictionary of predicate memory values that should be
applied if this sentence is used. nextSentences is a read-only (and
lazily calculated) attribute that returns an iterable of the next
sentence possibilities after this one, or None.
----8<----
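To make the shape described in the excerpt concrete, here is a minimal sketch. It is not the AIMLes code, which is unreleased. The names Token, value, type, source and count are assumptions chosen for illustration, the lazily calculated nextSentences attribute is left out, and the sketch targets Python 3 rather than the Python 2 of the doctests. The example sentence shows how the "he's" to "he is" split from the question could be recorded.

```python
from collections import namedtuple

# Hypothetical field names: the docs excerpt only says a token has
# four elements (value, type, source, number of tokens from source).
Token = namedtuple("Token", "value type source count")


class Sentence(tuple):
    """A tuple of tokens that also carries predicate memory values."""

    def __new__(cls, tokens, memory=None):
        # Build the immutable tuple first, then attach the extra attribute.
        self = super().__new__(cls, (Token(*t) for t in tokens))
        self.memory = dict(memory or {})
        return self


# "he's" expands into two tokens. Only the first token keeps the
# source text, and both record that the source produced two tokens.
sentence = Sentence([
    ("he ", None, "he's ", 2),
    ("is ", None, None, 2),
    ("here", None, "here", 1),
])
print(sentence[0].source, sentence[1].source)
```

Keeping the source only on the first token of a group means the original text is stored once, and the count says how many following tokens belong to it.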
Thus, given this input:
>>> source = ("howdy pardner! Whaddya think of this "
... "picture <img src='http://example.com/my_teddy_bear.png' />?")
The tests come up with two competing (scored) substitutions (one that
tries a typed 'image' match, generated from an XML parse, and one
that elides it); an expansion of "whaddya" to "what do you"; and a
compression of 'howdy pardner' to 'hi'. Note that white space in the
source value (the third element of each tuple) is in fact not a
separate object, and the source value does honor *normalized*
whitespace--both contrary to what I said before. Also note that case
and normalized white space are intentionally maintained in the
tokenized value (the first element of each tuple), although some of
the substitutions were not as careful as they should have been. Case
and white space are normalized only later on, and only if necessary.
My general approach was that I wanted to allow accurate matches if
they were available, at the potential cost of non-trivial additional
work for the system.
My doctests come out with this output, then:
>>> pp = pprint.PrettyPrinter(width=65).pprint
>>> pp([tuple(sentence) for sentence in sentences])
[((u'hi', None, u'howdy pardner', 1),
  (u'! ', None, u'! ', 1),
  (u'what ', None, u'Whaddya ', 3),
  (u'do ', None, None, 3),
  (u'you', None, None, 3),
  (u'think ', None, u'think ', 1),
  (u'of ', None, u'of ', 1),
  (u'this ', None, u'this ', 1),
  (u'picture ', None, u'picture ', 1),
  (u'http://example.com/my_teddy_bear.png', u'image', u'', 1),
  (u'?', None, u'?', 1)),
 ((u'hi', None, u'howdy pardner', 1),
  (u'! ', None, u'! ', 1),
  (u'what ', None, u'Whaddya ', 3),
  (u'do ', None, None, 3),
  (u'you', None, None, 3),
  (u'think ', None, u'think ', 1),
  (u'of ', None, u'of ', 1),
  (u'this ', None, u'this ', 1),
  (u'picture ', None, u'picture ', 1),
  (u'?', None, u'?', 1))]
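One way the third and fourth elements could be used for wildcard filling is sketched below. This is an illustration under stated assumptions, not the AIMLes implementation: the helper name source_text is hypothetical, and it assumes every source group starts with a token that carries the source string and the group size, as in the output above. Given the range of token positions a wildcard matched, it returns the original text behind them.

```python
def source_text(tokens, start, stop):
    """Return the original text behind tokens[start:stop].

    Assumes each group of tokens begins with one that holds the
    source string and the number of tokens made from that source.
    """
    parts = []
    i = 0
    while i < len(tokens):
        value, type_, source, count = tokens[i]
        # Keep this group's source if the group overlaps the range.
        if i < stop and i + count > start:
            parts.append(source)
        i += count  # jump to the first token of the next group
    return "".join(parts)


# The first six tokens of the output above, as plain strings.
tokens = (
    ("hi", None, "howdy pardner", 1),
    ("! ", None, "! ", 1),
    ("what ", None, "Whaddya ", 3),
    ("do ", None, None, 3),
    ("you", None, None, 3),
    ("think ", None, "think ", 1),
)

# A wildcard that matched only "do you" still maps back to "Whaddya ".
print(repr(source_text(tokens, 3, 5)))
```

Because the range is widened to whole source groups, a match that lands in the middle of an expansion returns the complete original word rather than a fragment.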
I'm not showing other aspects of the sentence, but that's a start.
There's a *lot* more to it. I was a tad ambitious. ;-) All that's
left to give it a whirl is to hook up some valid, well-formed AIML to
it. The GPL license of Dr. Wallace's files meant I could never
really use my code for work, so I put it aside for now. Maybe I'll
pick it up later.
Gary