[alicebot-developer] HTML-stripping program

Leonard H. Chalk lchalk at proai.com
Mon Jul 17 20:05:25 PDT 2006


Check out OpenNLP. I use its sentence splitter in my bot, and it is the
best I have seen.  You can find more information at
http://opennlp.sourceforge.net/.  If you want to start with HTML, then
use an XSLT transform, something like:
<?xml version="1.0" encoding="UTF-8" ?>

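<xsl:stylesheet version="1.0" 
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- Emit plain text rather than XML, so no XML declaration is written -->
<xsl:output method="text"/>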
<xsl:template match="/">
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match="text()">
  <xsl:value-of select="."/>
</xsl:template>

</xsl:stylesheet>

(Sorry, I just freehanded the XSLT and didn't test it, but it should
be close.)  XSLT is the way to go to strip HTML out.  I recommend using
OpenNLP to get the sentences after that.
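
For anyone who wants to try the transform from code, here is a minimal
sketch in Java using the standard javax.xml.transform API.  The class
name StripHtml and the method strip() are made up for illustration, and
the stylesheet is embedded as a string with the match attribute quoted
and an xsl:output method="text" line added.  It assumes the input is
well-formed XHTML; an XSLT processor will reject ordinary tag-soup
HTML, so that would have to be tidied first.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, not part of OpenNLP or any Alicebot code.
public class StripHtml {

    // The stylesheet from the message, with the match attribute quoted
    // and text output selected so that only character data is emitted.
    private static final String XSLT =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
      + "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"/\"><xsl:apply-templates/></xsl:template>"
      + "<xsl:template match=\"text()\"><xsl:value-of select=\".\"/></xsl:template>"
      + "</xsl:stylesheet>";

    // Returns the text content of a well-formed XHTML document.
    // Assumption: malformed input is reported as IllegalArgumentException.
    public static String strip(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xhtml)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Input could not be transformed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Hello <b>world</b>.</p></body></html>";
        System.out.println(strip(page));
    }
}
```

The string that comes back from strip() is what you would then hand to
the sentence splitter.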


