[alicebot-developer] Consecutive stars and performance -- ouch!

Dr. Richard S. Wallace alicebot-developer@list.alicebot.org
Sun, 3 Mar 2002 19:04:15 -0800


There is an improvement to the Graphmaster algorithm we can make, to avoid
this particular performance problem.  I noticed it first with the 24 star
category too.

We can store the "height" of the tree at each node.  A leaf node n has
height(n)=0.
An internal node has height(n) = min({height(m) : m in children(n)}.

In other words, store at each node the minimum number of words needed to
reach
a leaf node from here.  If the length remaining of your input is L, and you
reach a node n
where L < height(n), you can halt the matching along this path.

For example, the 24-star category has a sequence of nodes n1...n24.
Suppose the input consists of the word zzzyzzz repeated 23 times.

height(n1) may be very small, because this is the root node.  Here we are
competing
with all the other patterns.  Because a few other catgegory paths begin with
* *, the node n2
may also have a small height value.

But a few levels down the tree, we will come to a node n_i such that
height(n_i) = 24-i.
height(n23) = 1, height(n24) = 0.

By the time we reach the node n_i, the ancestor wildcards have already
aborbed at least i of the zzzyzzz's.

The remaining number of zzzyzzz's is 23-i < 24-i, so there is no need to try
other combinations along this path.

All of this could be added to matching with one or two lines of code.

if (remaining_words < node.height) return false;

A little more code is needed to assign the height values when building the
graph.

Rich



----- Original Message -----
From: "Paul Rydell" <paul@rydell.com>
To: <alicebot-developer@list.alicebot.org>
Sent: Sunday, March 03, 2002 2:24 PM
Subject: [alicebot-developer] Consecutive stars and performance -- ouch!


> I was playing around with Program E trying to discover bugs by loading
> different AIML sets and found some pretty interesting performance issues
> regarding consecutive wildcards.
>
> In Dr. Wallace's ALICE AIML set there is a pattern with 24 *'s in a row.
> Program E was matching the pattern correctly and quickly when an input was
> 24 words or longer. But when the input was 15-23 words it was very slow --
> it was taking at least a few minutes to find the correct match, the single
> star pattern.
>
> I tried it out in Program D and found the same performance issues. If my
> input was 19 words (using the word "zzzyzzz") it was taking 27 seconds. 20
> words took 54 seconds. 21 words took 111 seconds. 22 words took 241
seconds.
> 23 words took 469 seconds. It gets even more painful if the previous input
> is long. I did some quick calculations and think that if you had a
> conversation with Program D (loaded with the ALICE AIML set) and said:
>
> "zzzyzzz zzzyzzz zzzyzzz zzzyzzz zzzyzzz zzzyzzz zzzyzzz zzzyzzz zzzyzzz
> zzzyzzz zzzyzzz zzzyzzz zzzyzzz zzzyzzz zzzyzzz zzzyzzz zzzyzzz zzzyzzz
> zzzyzzz zzzyzzz zzzyzzz zzzyzzz zzzyzzz"
>
> and then said:
> "zzzyzzz zzzyzzz zzzyzzz zzzyzzz zzzyzzz zzzyzzz zzzyzzz zzzyzzz zzzyzzz
> zzzyzzz zzzyzzz zzzyzzz zzzyzzz zzzyzzz zzzyzzz zzzyzzz zzzyzzz zzzyzzz
> zzzyzzz zzzyzzz zzzyzzz zzzyzzz zzzyzzz"
>
> It would take approximately 40-60 years to find the match! (* : * : *)
>
> So where am I going with this? Some optimizations are probably needed for
> multiple consecutive wildcards. But maybe that 24 star pattern should be
> removed from the AIML set for now so people don't unwittingly become
victims
> of a very easy to launch DOS attack on their bots.
>
> --Paul
>
> ps- Thanks everyone for the notes of congratulations on the birth of our
> son!
>
>
>
> _______________________________________________
> alicebot-developer mailing list
> alicebot-developer@list.alicebot.org
> http://list.alicebot.org/mailman/listinfo/alicebot-developer