solr - Lucene Custom Analyzer for indexing and query


I am working on Lucene 4.7 and trying to migrate one of the analyzers we use in our Solr configuration.

    <analyzer>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1"
              generateNumberParts="1"
              catenateWords="1"
              catenateNumbers="1"
              catenateAll="0"
              splitOnCaseChange="0"
              splitOnNumerics="0"
              preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>

However, I cannot figure out how to use HTMLStripCharFilterFactory and WordDelimiterFilterFactory with the configuration above. Also, my query analyzer in Solr is as follows. How can I achieve the same in Lucene?

    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory"
              ignoreCase="true"
              words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>

The analysis package documentation explains how to use a CharFilter: wrap the Reader in an overridden initReader method.

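I'm assuming the problem with WordDelimiterFilter is that you don't know how to set the configuration options you are using? You construct an int to pass to the constructor by combining the appropriate constants with a bitwise OR (|). Note that a bitwise AND (&) would not work here: each constant is a distinct bit, so ANDing them together yields 0. Such as:

    int config = WordDelimiterFilter.GENERATE_NUMBER_PARTS | WordDelimiterFilter.GENERATE_WORD_PARTS; // etc.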
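To make the flag arithmetic concrete, here is a minimal, library-free sketch. The class name FlagDemo, the helper withFlag, and the constant values are stand-ins chosen for illustration (each a distinct power of two, which is the shape bit flags need); in real code use the constants defined on WordDelimiterFilter rather than these. It shows that OR accumulates options while AND of distinct single-bit flags collapses to 0, and how a Solr-style "1"/"0" attribute could be translated into a flag.

```java
class FlagDemo {
    // Stand-in values for illustration only: distinct powers of two.
    // Use the real WordDelimiterFilter constants in actual code.
    static final int GENERATE_WORD_PARTS = 1;
    static final int GENERATE_NUMBER_PARTS = 2;
    static final int CATENATE_WORDS = 4;
    static final int CATENATE_NUMBERS = 8;
    static final int PRESERVE_ORIGINAL = 32;

    // Bitwise OR keeps every bit that is set in any operand.
    static int combinedWithOr() {
        return GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS
                | CATENATE_WORDS | CATENATE_NUMBERS | PRESERVE_ORIGINAL;
    }

    // Bitwise AND keeps only bits set in every operand; distinct flags share none.
    static int combinedWithAnd() {
        return GENERATE_WORD_PARTS & GENERATE_NUMBER_PARTS
                & CATENATE_WORDS & CATENATE_NUMBERS & PRESERVE_ORIGINAL;
    }

    // Hypothetical helper: sets the flag when a Solr-style attribute value is "1".
    static int withFlag(int flags, String solrValue, int flag) {
        return "1".equals(solrValue) ? (flags | flag) : flags;
    }

    public static void main(String[] args) {
        System.out.println("OR:  " + combinedWithOr());
        System.out.println("AND: " + combinedWithAnd());

        // catenateWords="1" sets the flag, catenateAll="0" leaves the flags untouched.
        int flags = withFlag(0, "1", CATENATE_WORDS);
        System.out.println("is CATENATE_WORDS set? " + ((flags & CATENATE_WORDS) != 0));
    }
}
```

Options set to "0" in the Solr configuration (catenateAll, splitOnCaseChange, splitOnNumerics) are simply left out of the OR expression.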

So, in the end you might end up with something like:

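    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.StopFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
    import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
    import org.apache.lucene.util.Version;

    // StopwordAnalyzerBase grants convenient ways to handle stop word sets.
    public class MyAnalyzer extends StopwordAnalyzerBase {

        // Static, so that it can be referenced inside the super(...) call.
        private static final Version VERSION = Version.LUCENE_47;
        private final int wordDelimiterConfig;

        public MyAnalyzer() throws IOException {
            super(VERSION, loadStopwordSet(new FileReader("stopwords.txt"), VERSION));
            // Might load this config up front, along with the stop words.
            // The flags are combined with bitwise OR.
            wordDelimiterConfig =
                WordDelimiterFilter.GENERATE_WORD_PARTS |
                WordDelimiterFilter.GENERATE_NUMBER_PARTS |
                WordDelimiterFilter.CATENATE_WORDS |
                WordDelimiterFilter.CATENATE_NUMBERS |
                WordDelimiterFilter.PRESERVE_ORIGINAL;
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = new WhitespaceTokenizer(VERSION, reader);
            // No protected words, hence the null third argument.
            TokenStream filter = new WordDelimiterFilter(source, wordDelimiterConfig, null);
            filter = new LowerCaseFilter(VERSION, filter);
            filter = new StopFilter(VERSION, filter, stopwords);
            filter = new PorterStemFilter(filter);
            return new TokenStreamComponents(source, filter);
        }

        @Override
        protected Reader initReader(String fieldName, Reader reader) {
            // Strips HTML markup before the tokenizer sees the text.
            return new HTMLStripCharFilter(reader);
        }
    }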

Note: I've moved the StopFilter after the LowerCaseFilter. This makes stop word matching case insensitive, as long as your stop word definitions are all in lowercase. I don't know whether that is problematic because of the WordDelimiterFilter. If so, there is a loadStopwordSet method that supports case insensitivity, but I, frankly, don't know how to use it.
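For the query-side analyzer from the question, a second class along the same lines could look like the sketch below. It assumes Lucene 4.7 with the analyzers-common module on the classpath, a stopwords.txt file in the working directory, and the same lowercase-before-stop ordering as above. The class name MyQueryAnalyzer is made up for illustration.

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;

// Hypothetical query-time counterpart: no HTML stripping, no word delimiter filter.
public class MyQueryAnalyzer extends StopwordAnalyzerBase {

    private static final Version VERSION = Version.LUCENE_47;

    public MyQueryAnalyzer() throws IOException {
        // Assumes stopwords.txt holds lowercase entries, one per line.
        super(VERSION, loadStopwordSet(new FileReader("stopwords.txt"), VERSION));
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(VERSION, reader);
        // Lowercase first so the lowercase stop word set matches any casing.
        TokenStream filter = new LowerCaseFilter(VERSION, source);
        filter = new StopFilter(VERSION, filter, stopwords);
        filter = new PorterStemFilter(filter);
        return new TokenStreamComponents(source, filter);
    }
}
```

Lucene has no equivalent of Solr's type="query" attribute. Instead, you would pass the index-time analyzer to IndexWriterConfig and the query-time analyzer to whatever parses your queries, for example the classic QueryParser.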

