Saturday, September 1, 2018

Solr Custom Similarity - Using a field from the indexed document


We are currently on a very old version of Lucene (4.x) and are migrating to SolrCloud 7.4.0. We have a custom Similarity class that influences the score using an indexed field ("RANK") in our documents.

Here is what the classes look like:

CustomSimilarity.java

public class CustomSimilarity extends Similarity {
    private final Similarity sim;
    private final double coefficiency;
    private String popularityRank;
    static InfoStream infoStream;

    public CustomSimilarity() {
        this.sim = new CustomPayloadSimilarity();
        this.coefficiency = 0.1;
        this.popularityRank = "RANK";
        infoStream = new LoggingInfoStream();
    }

    @Override
    public long computeNorm(FieldInvertState state) {
        return sim.computeNorm(state);
    }

    @Override
    public SimWeight computeWeight(float queryBoost, CollectionStatistics collectionStats, TermStatistics... termStats) {
        final Explanation idf = termStats.length == 1
            ? ((PclnPayloadSimilarity) sim).idfExplain(collectionStats, termStats[0])
            : ((PclnPayloadSimilarity) sim).idfExplain(collectionStats, termStats);
        float[] normTable = new float[256];
        for (int i = 1; i < 256; ++i) {
            int length = SmallFloat.byte4ToInt((byte) i);
            float norm = ((PclnPayloadSimilarity) sim).lengthNorm(length);
            normTable[i] = norm;
        }
        normTable[0] = 1f / normTable[255];
        return new IDFStats(collectionStats.field(), queryBoost, idf, normTable);
    }

    public float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public float scorePayload(int doc, int start, int end, BytesRef payload) {
        return 1;
    }

    @Override
    public SimScorer simScorer(SimWeight weight, LeafReaderContext context) throws IOException {
        final IDFStats idfstats = (IDFStats) weight;
        final NumericDocValues rank1Value = context.reader().getNumericDocValues(popularityRank);
        infoStream.message("PCLNSimilarity", "NumericDocValues-1 >> rank1Value = " + rank1Value);
        System.out.println("NumericDocValues-1 >> rank1Value = " + rank1Value);

        return new SimScorer() {

            @Override
            public Explanation explain(int doc, Explanation freq) throws IOException {
                return super.explain(doc, freq);
            }

            @Override
            public float score(int doc, float freq) throws IOException {
                // float weightValue = idfstats.queryWeight;
                // // logger.trace("weight " + weightValue + "freq " + freq);
                //
                // float score = 0.0f;
                // if (rank1Value != null) {
                // score = (float) rank1Value.longValue() + score;
                // }
                //
                // if (coefficiency > 0) {
                // score = score + (float) coefficiency * weightValue;
                // }
                // return score;
                return (float) rank1Value.longValue();
            }

            @Override
            public float computeSlopFactor(int distance) {
                return sloppyFreq(distance);
            }

            @Override
            public float computePayloadFactor(int doc, int start, int end, BytesRef payload) {
                return scorePayload(doc, start, end, payload);
            }
        };
    }

    static class IDFStats extends SimWeight {
        private final String field;
        /** The idf and its explanation */
        private final Explanation idf;
        private final float boost;
        private final float queryWeight;
        final float[] normTable;

        public IDFStats(String field, float boost, Explanation idf, float[] normTable) {
            // TODO: Validate?
            this.field = field;
            this.idf = idf;
            this.boost = boost;
            this.queryWeight = boost * idf.getValue();
            this.normTable = normTable;
        }
    }
}

CustomPayloadSimilarity.java

public class CustomPayloadSimilarity extends ClassicSimilarity {

    @Override
    public float tf(float freq) {
        return 1;
    }

    @Override
    public float scorePayload(int doc, int start, int end, BytesRef payload) {
        if (payload != null) {
            return PayloadHelper.decodeFloat(payload.bytes, payload.offset);
        } else {
            return 1.0F;
        }
    }

    @Override
    public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
        final long df = termStats.docFreq();
        final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
        final float idf = idf(df, docCount);
        return Explanation.match(idf, "idf(docFreq=" + df + ", docCount=" + docCount + ")");
    }
}

As you may notice, to retain rough parity between the older and newer TF-IDF implementations, we are still using the older algorithm and haven't switched to BM25Similarity.

With the above code, I am unable to retrieve the value of the RANK field from the document. The following line does return a value, although I have not been able to log it to the solr.log file: final NumericDocValues rank1Value = context.reader().getNumericDocValues(popularityRank);

but the statement return (float) rank1Value.longValue() throws the following exception:

java.lang.IndexOutOfBoundsException
    at java.nio.Buffer.checkIndex(Buffer.java:546)
    at java.nio.DirectByteBuffer.getInt(DirectByteBuffer.java:685)
    at org.apache.lucene.store.ByteBufferGuard.getInt(ByteBufferGuard.java:128)
    at org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.readInt(ByteBufferIndexInput.java:415)
    at org.apache.lucene.util.packed.DirectReader$DirectPackedReader28.get(DirectReader.java:248)
    at org.apache.lucene.codecs.lucene70.Lucene70DocValuesProducer$4.longValue(Lucene70DocValuesProducer.java:490)
    at com.priceline.rc.solr.similarity.CustomSimilarity$1.score(CustomSimilarity.java:117)
    at org.apache.lucene.search.TermScorer.score(TermScorer.java:65)
    at org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1.collect(TopScoreDocCollector.java:64)
    at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:263)
    at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:214)
    at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:662)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:463)
    at org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:217)
    at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1622)
    at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1439)
    at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:586)
    at org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1435)
    at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:375)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:298)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2539)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:709)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:515)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:531)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:760)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:678)
    at java.lang.Thread.run(Thread.java:745)

Any advice?

1 Answer


You are trying to get a value from NumericDocValues without first positioning it on the current document with advanceExact(). Remember that a single NumericDocValues instance covers all the documents in the segment, so you still need to tell it which document you are referring to before requesting a value. In your score function, call rank1Value.advanceExact(doc) before calling rank1Value.longValue().

It should be like this:

if (rank1Value.advanceExact(doc)) {
    return (float) rank1Value.longValue();
} else {
    return 0; // or whatever value you want as default
}
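
To make the "position first, then read" contract easier to see in isolation, here is a minimal sketch that runs without Lucene on the classpath. SparseRankValues and RankScoring are hypothetical stand-ins written for this illustration; they are not Lucene classes and only mimic the advanceExact()/longValue() pair. Unlike the real iterator, the stand-in does not enforce forward-only movement.

```java
import java.util.Arrays;

/**
 * Hypothetical stand-in for an iterator-style doc-values reader.
 * NOT a Lucene class; it only mimics the advanceExact()/longValue() contract.
 */
final class SparseRankValues {
    private final int[] docIds;   // sorted IDs of the documents that have a value
    private final long[] values;  // values[i] belongs to docIds[i]
    private int pos = -1;         // -1 means "not positioned on any document"

    SparseRankValues(int[] sortedDocIds, long[] values) {
        if (sortedDocIds.length != values.length) {
            throw new IllegalArgumentException("docIds and values must have the same length");
        }
        this.docIds = sortedDocIds.clone();
        this.values = values.clone();
    }

    /** Positions the reader on target and reports whether target has a value. */
    boolean advanceExact(int target) {
        int idx = Arrays.binarySearch(docIds, target);
        pos = idx >= 0 ? idx : -1;
        return idx >= 0;
    }

    /** Returns the value of the current document; only valid after advanceExact() returned true. */
    long longValue() {
        if (pos < 0) {
            throw new IllegalStateException("call advanceExact(doc) first and check its result");
        }
        return values[pos];
    }
}

/** Hypothetical helper showing the scoring pattern suggested in the answer. */
final class RankScoring {
    private RankScoring() {}

    static float score(SparseRankValues rank, int doc, float defaultScore) {
        // Position on the document, and read the value only if one exists.
        return rank.advanceExact(doc) ? (float) rank.longValue() : defaultScore;
    }
}
```

With document IDs {0, 2, 5} holding ranks {10, 30, 70}, scoring document 2 yields 30.0, and scoring document 3 (which has no rank) falls back to the default instead of reading from an unpositioned reader.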