[alicebot-developer] A.L.I.C.E Web page Spider

Helio Perroni Filho xperroni at yahoo.com
Fri Jun 2 05:12:12 PDT 2006


--- Ty Ademosu <tyademosu at hotmail.com> wrote:

> I need a spider created that will crawl a webpage
> and grammatically parse the page and create AIML
> (Artificial Intelligence Markup Language) data. This
> data will be saved into an AIML file and used to
> teach a chatterbot the contents of the web page.

You must understand that this is not a trivial
request: you could start a whole project -- perhaps
several -- to fulfill it. Then again, there are some
paths I can envision for providing something of that
sort in a relatively short time.

First, the spider. I don't see much need for a
full-fledged web robot, unless you intend the search
to occur during a conversation. Even if this is the
case, you could use some predefined resource, such as
Wikipedia (http://www.wikipedia.org), and
programmatically drive it to retrieve pages (in Java,
I'd use the java.net.URL class to do this). Otherwise,
you can just manually download the pages you want to
parse.
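
As a rough illustration of the programmatic route, here
is a minimal Python sketch that uses the standard urllib
module in place of java.net.URL. The article address
format, the User-Agent string and the names page_url and
fetch are assumptions made for this example only.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumption: articles live under this address on the English Wikipedia.
WIKI_BASE = "https://en.wikipedia.org/wiki/"


def page_url(title):
    """Build the address of the article with the given title."""
    return WIKI_BASE + quote(title.replace(" ", "_"))


def fetch(url, timeout=10):
    """Download a page and return its contents as text."""
    # Hypothetical User-Agent; many sites reject requests without one.
    request = Request(url, headers={"User-Agent": "aiml-prebot-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch(page_url("Alan Turing")) would return the
raw HTML, which still has to be stripped of markup
before it is fed to the bot.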

Second, the grammatical parser. I found it surprisingly
difficult to Google out a Free / Open Source parser,
but you can look into SourceForge
(http://sourceforge.net) for some promising projects.
Anyway, why not use AIML itself for this? Most
Alicebots already know how to break inputs into
sentence lists; then, with the right set of AIML
files, the bot could create the request / response
pairs. For example, to retrieve responses for queries
of the form "What is a ...?", you could use this
category:

<category>
  <pattern>* IS A *</pattern>
  <template>
    (request: What is a <star/>?)
    (response: A <star/> is a <star index="2"/>)
  </template>
</category>
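
To make the effect of this category concrete, here is a
small Python sketch that imitates it outside of any
Alicebot. It is only an approximation: the sentence
splitter and the regular expression are stand-ins I am
assuming for the interpreter's own normalization and
wildcard matching, and the names split_sentences and
pre_bot are made up for the example.

```python
import re

# Rough stand-in for the AIML pattern "* IS A *":
# group 1 plays the role of <star/>, group 2 of <star index="2"/>.
IS_A = re.compile(r"^(.+?) is a (.+)$", re.IGNORECASE)


def split_sentences(text):
    """Break a text into sentences at ., ! and ? (a crude approximation)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def pre_bot(text):
    """Return the request / response lines the category would emit."""
    lines = []
    for sentence in split_sentences(text):
        match = IS_A.match(sentence)
        if match:
            star, star2 = match.group(1), match.group(2)
            lines.append("(request: What is a %s?)" % star)
            lines.append("(response: A %s is a %s)" % (star, star2))
    return lines
```

Feeding it "Cat is a mammal. It sleeps a lot." yields one
request / response pair for the first sentence and
nothing for the second, which matches no pattern.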

Third, the formatter. Save to a file the outputs
produced by feeding the web pages to the "pre-bot"
outlined above; you can later pass them to a simple
regular-expression parser that transforms them into
AIML files. I could do this manually inside jEdit
(http://www.jedit.org/), but most programming
languages provide enough RegExp support to automate
the task.
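
A minimal sketch of such a formatter in Python follows.
It assumes the saved log holds "(request: ...)" and
"(response: ...)" entries as produced by the category
above; the function names, the AIML version attribute
and the pattern normalization (upper case, punctuation
removed) are my assumptions, so adjust them to whatever
your interpreter expects.

```python
import re
from xml.sax.saxutils import escape

# One request entry followed by one response entry,
# either on the same line or on consecutive lines.
PAIR = re.compile(
    r"^\(request: (.*)\)\s+\(response: (.*)\)\s*$", re.MULTILINE
)


def to_pattern(request):
    """Normalize a request into an upper-case pattern without punctuation."""
    return " ".join(re.sub(r"[^\w\s]", " ", request).upper().split())


def to_aiml(log):
    """Turn the pre-bot log into the text of an AIML file."""
    categories = []
    for request, response in PAIR.findall(log):
        categories.append(
            "<category>\n"
            "  <pattern>%s</pattern>\n"
            "  <template>%s</template>\n"
            "</category>" % (escape(to_pattern(request)), escape(response.strip()))
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<aiml version="1.0.1">\n%s\n</aiml>\n' % "\n".join(categories)
    )
```

Writing the result of to_aiml(open("log.txt").read()) to
a file (log.txt being a hypothetical name) gives
something the target bot can load directly.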

These are my two cents on the subject. I believe that
this would be a practical approach to the problem,
which would allow you to get started at once: with an
AIML parser and a RegExp-enabled text editor, you
could already hack together a test implementation.

-- 
Ja mata ne.
Helio Perroni Filho



