What I want to achieve is the ability for people to search for individuals without being language-aware, while not penalizing those who are. What I mean is:
Given I build an index containing (IDs 1, 2, 3 in order):
- Jorgensen
- Jörgensen
- Jørgensen
I want to allow the following conversions:
- ö to o
- ö to oe
- ø to o
- ø to oe
so if someone searches, the results should be: QUERY | RESULT (I include only IDs, but it would be full records in reality)
- Jorgensen returns 1, 2, 3
- Jörgensen returns 1, 2
- Jørgensen returns 1, 3
- Joergensen returns 2, 3
Starting with that, I tried to create an index analyzer and char filter like this:
{ "settings": { "analysis": { "analyzer": { "my_analyzer": { "tokenizer": "keyword", "char_filter": [ "my_char_filter" ] } }, "char_filter": { "my_char_filter": { "type": "mapping", "mappings": [ "ö => o", "ö => oe" ] } } } } }
But that is invalid, because it tries to map the same source character twice.
What am I missing? Do I need multiple analyzers? Any direction would be appreciated.
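For reference, a minimal sketch of the "multiple analyzers" direction (the analyzer, field, and index names below are hypothetical, and this is only one possible setup, not a confirmed fix): each char_filter keeps a single, non-conflicting mapping, and the two resulting analyzers are applied to sub-fields, here via the Python client.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Hypothetical settings: two analyzers, each mapping the accented
# characters to exactly one replacement, so neither char_filter has
# conflicting mappings for the same source character.
index_body = {
    "settings": {
        "analysis": {
            "char_filter": {
                "fold_single": {
                    "type": "mapping",
                    "mappings": ["ö => o", "ø => o"]
                },
                "fold_digraph": {
                    "type": "mapping",
                    "mappings": ["ö => oe", "ø => oe"]
                }
            },
            "analyzer": {
                "single_analyzer": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "char_filter": ["fold_single"]
                },
                "digraph_analyzer": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "char_filter": ["fold_digraph"]
                }
            }
        }
    },
    "mappings": {
        "your_type": {
            "properties": {
                "name": {
                    "type": "string",
                    "fields": {
                        # Sub-fields analyzed with each folding variant.
                        "single": {"type": "string", "analyzer": "single_analyzer"},
                        "digraph": {"type": "string", "analyzer": "digraph_analyzer"}
                    }
                }
            }
        }
    }
}

es.indices.create(index='your_index', body=index_body)
```

A search could then target both sub-fields at once, for example with a multi_match query over name.single and name.digraph.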
2 Answers
Answer 1
Since a custom mapping isn't enough in your case, as the comments above show, let's play with your data and character normalization.
In your case, normalization using unidecode isn't enough because of the ø and oe conversions. Example:
```python
import unicodedata

def strip_accents(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

body_matches = [
    u'Jorgensen',
    u'Jörgensen',
    u'Jørgensen',
    u'Joergensen',
]
for b in body_matches:
    print b, strip_accents(b)
```

Output:

```
Jorgensen Jorgensen
Jörgensen Jorgensen
Jørgensen Jørgensen
Joergensen Joergensen
```
So, we need a custom translation. For now I've only included the characters you showed, but feel free to complete the list.
```python
accented_letters = {
    u'ö': [u'o', u'oe'],
    u'ø': [u'o', u'oe'],
}
```
Then we can normalize the words and store them in a special property, body_normalized for instance, indexed as a field of your Elasticsearch records.
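The search example below calls a normalize_word helper that isn't shown in this answer; a minimal sketch of what it could look like, assuming it expands each accented letter into all of its plain variants from the accented_letters dict above, is:

```python
# Hypothetical normalize_word, using the accented_letters dict defined above:
# every accented character is expanded into each of its plain variants, and
# all resulting spellings are joined into one string so a match query can
# hit any of them.
def normalize_word(word):
    variants = [u'']
    for ch in word:
        replacements = accented_letters.get(ch, [ch])
        variants = [v + r for v in variants for r in replacements]
    return u' '.join(variants)

print normalize_word(u'Jörgensen')
# Jorgensen Joergensen
```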
Once they are inserted, you could perform two types of search:
- Exact search: the user input isn't normalized, and the Elasticsearch query searches against the body field, which isn't normalized either.
- Similar search: the user input is normalized, and we search against the body_normalized field.
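Before the searches, a sketch of what the indexing step could look like (assumed here, not shown in the original answer; the index, type, and field names mirror the search example below, and normalize_word is the helper sketched above):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Store both the raw name and its normalized variants, so that each of the
# two query types has a field to match against.
names = [u'Jorgensen', u'Jörgensen', u'Jørgensen']

for doc_id, name in enumerate(names, start=1):
    doc = {
        'body': name,                             # used by the exact search
        'body_normalized': normalize_word(name),  # used by the similar search
    }
    es.index(index='your_index', doc_type='your_type', id=doc_id, body=doc)
```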
Let's see an example:
```python
body_matches = [
    u'Jorgensen',
    u'Jörgensen',
    u'Jørgensen',
    u'Joergensen',
]

print "------EXACT MATCH------"
for body_match in body_matches:
    elasticsearch_query = {
        "query": {
            "match": {
                "body": body_match
            }
        }
    }
    es_kwargs = {
        "doc_type": "your_type",
        "index": 'your_index',
        "body": elasticsearch_query
    }
    res = es.search(**es_kwargs)
    print body_match, " MATCHING BODIES=", res['hits']['total']
    for r in res['hits']['hits']:
        print "-", r['_source'].get('body', '')

print "\n------SIMILAR MATCHES------"
for body_match in body_matches:
    body_match = normalize_word(body_match)
    elasticsearch_query = {
        "query": {
            "match": {
                "body_normalized": body_match
            }
        }
    }
    es_kwargs = {
        "doc_type": "your_type",
        "index": 'your_index',
        "body": elasticsearch_query
    }
    res = es.search(**es_kwargs)
    print body_match, " MATCHING NORMALIZED BODIES=", res['hits']['total']
    for r in res['hits']['hits']:
        print "-", r['_source'].get('body', '')
```
You can see a running example in this notebook.
Answer 2
I would try this: http://rurl.us/Dw4Sv. I think it should help you out.