Solr - Lucene custom analyzer for indexing and query
I am working with Lucene 4.7 and trying to migrate one of the analyzers we use in our Solr configuration.
<analyzer>
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
However, I cannot figure out how to reproduce the HTMLStripCharFilterFactory and WordDelimiterFilterFactory configuration above in plain Lucene. Also, my query analyzer in Solr is as follows; how can I achieve the same in Lucene?
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
The analysis package documentation explains how to use a CharFilter. You wrap the reader in an overridden initReader method.
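I'm assuming your problem with WordDelimiterFilter is that you don't know how to set the configuration options you are using? You construct an int to pass to the constructor by combining the appropriate constants with a binary OR (|). Note that it has to be OR, not AND (&): each constant is a separate bit flag, so AND-ing them together yields 0. Such as:
int config = WordDelimiterFilter.GENERATE_NUMBER_PARTS | WordDelimiterFilter.GENERATE_WORD_PARTS; //etc.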
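To make the OR-versus-AND point concrete, here is a small plain-Java sketch that does not depend on Lucene. The constant names mirror those on WordDelimiterFilter, but the class `FlagDemo`, its helper methods, and the numeric values are stand-ins assumed for illustration; check the real constants in your Lucene version rather than relying on these numbers.

```java
// Hypothetical stand-ins for WordDelimiterFilter's bit-flag constants.
// The values are assumed for this demo; each flag occupies its own bit.
final class FlagDemo {
    static final int GENERATE_WORD_PARTS   = 1;
    static final int GENERATE_NUMBER_PARTS = 1 << 1;
    static final int CATENATE_WORDS        = 1 << 2;
    static final int CATENATE_NUMBERS      = 1 << 3;
    static final int CATENATE_ALL          = 1 << 4;
    static final int PRESERVE_ORIGINAL     = 1 << 5;

    // Bitwise OR switches on every listed option.
    static int combineWithOr() {
        return GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS
             | CATENATE_WORDS | CATENATE_NUMBERS | PRESERVE_ORIGINAL;
    }

    // Bitwise AND of distinct single-bit flags has no bits in common, so it is 0.
    static int combineWithAnd() {
        return GENERATE_WORD_PARTS & GENERATE_NUMBER_PARTS
             & CATENATE_WORDS & CATENATE_NUMBERS & PRESERVE_ORIGINAL;
    }

    // AND is the right operator for testing whether one flag is set.
    static boolean has(int config, int flag) {
        return (config & flag) != 0;
    }

    public static void main(String[] args) {
        System.out.println(combineWithOr());                    // 47
        System.out.println(combineWithAnd());                   // 0
        System.out.println(has(combineWithOr(), CATENATE_ALL)); // false
    }
}
```

The options set to "0" in the Solr config (catenateAll, splitOnCaseChange, splitOnNumerics) are simply left out of the OR expression.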
So, in the end, you might end up with something like:
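//StopwordAnalyzerBase grants you some convenient ways to handle stop word sets.
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;

public class MyAnalyzer extends StopwordAnalyzerBase {

    // Static, because it is referenced inside the super(...) call below.
    private static final Version version = Version.LUCENE_47;
    private final int wordDelimiterConfig;

    public MyAnalyzer() throws IOException {
        super(version, loadStopwordSet(new FileReader("stopwords.txt"), version));
        //You might load this config up front, along with your stop words.
        //The flags are combined with bitwise OR (|), not AND (&).
        wordDelimiterConfig =
            WordDelimiterFilter.GENERATE_WORD_PARTS
            | WordDelimiterFilter.GENERATE_NUMBER_PARTS
            | WordDelimiterFilter.CATENATE_WORDS
            | WordDelimiterFilter.CATENATE_NUMBERS
            | WordDelimiterFilter.PRESERVE_ORIGINAL;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(version, reader);
        //null means no protected words.
        TokenStream filter = new WordDelimiterFilter(source, wordDelimiterConfig, null);
        //LowerCaseFilter, not LowerCaseFilterFactory (the factory is the Solr-side class).
        filter = new LowerCaseFilter(version, filter);
        filter = new StopFilter(version, filter, stopwords);
        filter = new PorterStemFilter(filter);
        return new TokenStreamComponents(source, filter);
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        //Strips HTML markup before the tokenizer sees the text.
        return new HTMLStripCharFilter(reader);
    }
}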
Note: I've moved the StopFilter after the LowerCaseFilter. This makes it case insensitive, as long as your stop word definitions are all in lowercase. I don't know if that is problematic due to the WordDelimiterFilter. If so, there is a loadStopwordSet method that supports case insensitivity, but I, frankly, don't know how to use it.
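To show why the filter order matters for case insensitivity, here is a rough plain-Java sketch with no Lucene involved. The class `StopOrderDemo` and both method names are made up for illustration; they only imitate a whitespace tokenizer followed by lowercasing and stop word removal in the two possible orders, and they say nothing about how WordDelimiterFilter interacts with the stop words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy imitation of a tokenizer + filter chain, for illustration only.
final class StopOrderDemo {

    // Lowercase first, then drop stop words: matching is effectively
    // case insensitive, provided the stop set itself is all lowercase.
    static List<String> lowercaseThenStop(String text, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            String lower = token.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(lower)) out.add(lower);
        }
        return out;
    }

    // Drop stop words first (case-sensitive lookup), then lowercase:
    // "The" and "AND" slip past a lowercase stop set.
    static List<String> stopThenLowercase(String text, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            if (!stopWords.contains(token)) out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }
}
```

With the stop set {"the", "and"} and the input "The Quick fox AND the dog", the first method yields [quick, fox, dog], while the second yields [the, quick, fox, and, dog].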