Wednesday, March 1, 2017

Mapping an international character to multiple options


What I want to achieve is the ability for people to search for individuals without being language-aware, while not penalizing those who are. What I mean is:

Given that I build an index containing:

  1. Jorgensen
  2. Jörgensen
  3. Jørgensen

I want to allow the following conversions:

  1. ö to o
  2. ö to oe
  3. ø to o
  4. ø to oe

so if someone searches, the results should be (I include only IDs here, but they would be full records in reality):

  • Jorgensen returns 1, 2, 3
  • Jörgensen returns 1, 2
  • Jørgensen returns 1, 3
  • Joergensen returns 2, 3

Starting with that, I tried to create an index analyzer and char filter like this:

{ "settings": {     "analysis": {       "analyzer": {         "my_analyzer": {           "tokenizer": "keyword",           "char_filter": [             "my_char_filter"           ]         }       },       "char_filter": {         "my_char_filter": {           "type": "mapping",           "mappings": [             "ö => o",             "ö => oe"           ]         }       }     }   } } 

But that is invalid, because it tries to map the same character to two different values.

What am I missing? Do I need multiple analyzers? Any direction would be appreciated.
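One way to pursue the "multiple analyzers" idea is to split the two replacements across two analyzers and expose them as multi-fields, then query both sub-fields at once. The sketch below is not from the original post: the index, field, and analyzer names are made up, it assumes Elasticsearch 5.x with the official Python client, and, because both ö and ø fold to the same forms, it matches more loosely than the table above (Jörgensen would also find Jørgensen).

from elasticsearch import Elasticsearch

es = Elasticsearch()

index_body = {
    "settings": {
        "analysis": {
            "char_filter": {
                "fold_single": {"type": "mapping", "mappings": ["ö => o", "ø => o"]},
                "fold_double": {"type": "mapping", "mappings": ["ö => oe", "ø => oe"]}
            },
            "analyzer": {
                "name_single": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "char_filter": ["fold_single"],
                    "filter": ["lowercase"]
                },
                "name_double": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "char_filter": ["fold_double"],
                    "filter": ["lowercase"]
                }
            }
        }
    },
    "mappings": {
        "person": {
            "properties": {
                "name": {
                    "type": "text",
                    "analyzer": "name_single",
                    # extra sub-field analyzed with the oe-style replacements
                    "fields": {"oe": {"type": "text", "analyzer": "name_double"}}
                }
            }
        }
    }
}

es.indices.create(index="people", body=index_body)

# Query both representations of the field at once.
query = {"query": {"multi_match": {"query": u"Jörgensen",
                                   "fields": ["name", "name.oe"]}}}
res = es.search(index="people", body=query)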

2 Answers

Answer 1

Since a custom mapping isn't enough in your case, as the comments above show, let's play with your data and character normalization.
In your case, normalization with unidecode or unicodedata isn't enough because of the ø and oe conversions. Example:

import unicodedata

def strip_accents(s):
    # Drop combining accent marks after NFD decomposition.
    # Note that ø is a standalone letter, not letter + accent, so it survives.
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

body_matches = [
    u'Jorgensen',
    u'Jörgensen',
    u'Jørgensen',
    u'Joergensen',
]
for b in body_matches:
    print(b, strip_accents(b))

>>>> Jorgensen Jorgensen
>>>> Jörgensen Jorgensen
>>>> Jørgensen Jørgensen
>>>> Joergensen Joergensen

So, we need a custom translation. For now I've only included the characters you showed, but feel free to complete the list.

accented_letters = {
    u'ö' : [u'o', u'oe'],
    u'ø' : [u'o', u'oe'],
}

Then, we can normalize words and store them in a special property, body_normalized for instance, and index them as a field of your Elasticsearch records.
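The normalize_word helper used in the search snippet below isn't shown here; a minimal sketch of what it could look like, assuming it expands every accented letter via accented_letters, strips any leftover accents with strip_accents, and returns all spelling variants joined by spaces so that a match query treats them as alternatives:

def normalize_word(word):
    # Build every spelling variant by substituting each accented letter
    # with each of its replacements, then strip any remaining accents.
    variants = [u'']
    for ch in word:
        replacements = accented_letters.get(ch, [ch])
        variants = [v + r for v in variants for r in replacements]
    return u' '.join(sorted({strip_accents(v) for v in variants}))

print(normalize_word(u'Jörgensen'))

>>>> Joergensen Jorgensen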
Once they are inserted, you could perform two types of searches:

  1. exact search: user input isn't normalized and the Elasticsearch query searches against the body field, which isn't normalized either.
  2. similar search: user input is normalized and we search against the body_normalized field.
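Before either search can run, each record needs both fields indexed. A minimal indexing sketch, assuming an elasticsearch-py client named es, the normalize_word helper sketched above, and the same placeholder index/type names used below:

from elasticsearch import Elasticsearch

es = Elasticsearch()

for body in [u'Jorgensen', u'Jörgensen', u'Jørgensen', u'Joergensen']:
    es.index(index='your_index', doc_type='your_type', body={
        'body': body,                             # raw form, for exact search
        'body_normalized': normalize_word(body),  # folded form, for similar search
    })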

Let's see an example:

# Assumes an elasticsearch-py client instance `es` and the
# normalize_word() helper built on accented_letters above.
body_matches = [
    u'Jorgensen',
    u'Jörgensen',
    u'Jørgensen',
    u'Joergensen',
]

print("------EXACT MATCH------")
for body_match in body_matches:
    elasticsearch_query = {
        "query": {
            "match": {
                "body": body_match
            }
        }
    }
    es_kwargs = {
        "doc_type": "your_type",
        "index": 'your_index',
        "body": elasticsearch_query
    }
    res = es.search(**es_kwargs)
    print(body_match, " MATCHING BODIES=", res['hits']['total'])
    for r in res['hits']['hits']:
        print("-", r['_source'].get('body', ''))

print("\n------SIMILAR MATCHES------")
for body_match in body_matches:
    body_match = normalize_word(body_match)
    elasticsearch_query = {
        "query": {
            "match": {
                "body_normalized": body_match
            }
        }
    }
    es_kwargs = {
        "doc_type": "your_type",
        "index": 'your_index',
        "body": elasticsearch_query
    }
    res = es.search(**es_kwargs)
    print(body_match, " MATCHING NORMALIZED BODIES=", res['hits']['total'])
    for r in res['hits']['hits']:
        print("-", r['_source'].get('body', ''))

You can see a running example in this notebook.

Answer 2

I would try http://rurl.us/Dw4Sv; I think it should help you out.
