Wednesday, February 15, 2017

Exclude HTML tags and some UNICODE Characters while searching in SQLite

Leave a Comment

UPDATE # 4

I've successfully run the firstchar example, but now the problem is with using regex. Even after including header file, it is not recognizing regex operator. Any clue how can this be resolved?

UPDATE # 2

I've compiled sqlite3 library in my project. I am now looking for anyone to help me with writing a function for my regex, attach it to database and call it from query.

UPDATE # 3

I've written some code from this example. Here it is

extern "C" void Java_com_kfmwa916_testapp_DatabaseHandler_createFunction() {     sqlite3 *db;     //Open database     sqlite3_open("MyDBName.db", &db);      //Attach function to database     sqlite3_create_function(db, "firstchar", 1, SQLITE_UTF8, NULL, &firstchar, NULL, NULL); } 

And firstchar function is,

static void firstchar(sqlite3_context *context, int argc, sqlite3_value **argv) {     if (argc == 1) {         char *text = (char *) sqlite3_value_text(argv[0]);         if (text && text[0]) {             char result[2];             result[0] = text[0]; result[1] = '\0';             sqlite3_result_text(context, result, -1, SQLITE_TRANSIENT);             return;         }     }     sqlite3_result_null(context); } 

And the used it in my query like

SELECT firstchar(text) FROM dummy 

But it is giving error

no such function firstchar() 

Any help is highly appreciated.

Original Question

I am working with Arabic Language saved in UNICODE Format in SQLite. I want to implement a search. But there's a problem.

Let's say the text is

<html>    <head>      <style>        @font-face {          font-family: "Al_Mushaf";          src: url('fonts/al_mushaf.ttf');        }        @font-face {          font-family: "Jameel Noori Nastaleeq";          src: url('fonts/jameel_noori.ttf');        }      </style>    </head>        <body>      <h3 style='font-family:"Al_Mushaf"'>              صحابہ کرام کا انبیائے کرام کی سنّت پر عمل              میٹھے میٹھے اسلامی بھائیو!صدائے مدینہ لگانا انبیائے کِرام عَلَیْہِمُ السَّلَام کی اس قَدْر پیاری سنّت ہے کہ صحابۂ کِرام عَلَیْہِمُ الرِّضْوَان نے بھی اسے خُوب اپنایا اور وہ بھی حضرت سَیِّدُنا داؤد عَلَیْہِ السَّلَام کی طرح اپنے گھر والوں کو جگایا کرتے جیسا کہ حضرت سَیِّدُنا عبد اللہ بن عُمَر رَضِیَاللّٰہُ تَعَالٰی عَنْہُما فرماتے ہیں کہ میرے والِدِ مُحْتَرَم اَمِیرُ الْمُوْمِنِین حضرت سَیِّدُنا عُمَر فَارُوقِ اَعْظَم رَضِیَاللّٰہُ تَعَالٰی عَنْہ رات میں جس قَدْر ربّ تعالیٰ چاہتا،نَماز پڑھتے رہتے،یہاں تک کہ جب رات کا آخری وَقْت ہوتا تو اپنے گھر والوں کو بھی نَماز کے لیے جگا دیتے اور ان سے فرماتے: اَلصَّلٰوة یعنی نماز۔ پھر یہ آیت مُبارَکہ تِلاوَت فرماتے:              وَاۡمُرْ اَہۡلَکَ بِالصَّلٰوۃِ وَ اصْطَبِرْ عَلَیۡہَا ؕ لَا نَسْـَٔلُکَ رِزْقًا ؕ نَحْنُ نَرْزُقُکَ ؕ وَالْعٰقِبَۃُ لِلتَّقْوٰی (پ۱۶،طٰهٰ:۱۳۲)      </h3>    </body>      </html>

And it is stored in SQLite Database. Now I want to search html, it will return the result and if I search مبارکہ it won't return a result because in actual text, it is مُبارَکہ (with these extra UNICODE).

I want to ignore all HTML tags and these extra UNICODE Characters while searching so that html shouldn't return a result while مبارکہ should return a result.

What I found so far;

  1. Make extra column and put stripped text into it and then search (I can't do it because there are thousands of books and they will take a lot of memory)

  2. UDF Like SQL (I couldn't find any suitable example/tutorial to implement it)

  3. Using REGEXP (I couldn't figure it out yet how to do this, I just know that I've to implement it myself)

  4. SQL query using LIKE and GLOB operators and wildcard characters.

I'm stuck for two days and couldn't find a working solution. Option#4 is desirable but any working solution will do the charm.

Meanwhile, I've to keep application memory efficient and optimized searching.

Any help is highly appreciated.

UPDATE

I've made regex to ignore html tags and text between style tag and used it in query with REGEXP.

Now there are two problems,

  1. I want to ignore these extra characters too. I know their UNICODEs, just need to know how to append it in the regex. This is my regex;

    (?![^<]*>)(?!<style[^>]*?>)(TEXT)(?![^<]*?<\/style>)

  2. I've used it in query like

    SELECT text FROM dummy WHERE text REGEXP <myregex>

    It's not giving an error but not returning the desired result too.

1 Answers

Answers 1

Answer to Original Question

NOTE: As I have recently learned it, I maybe wrong at many places, kindly correct my mistakes

There are two solutions

  1. Use REGEXP Operator with SQLite Query
  2. Implement your own User Defined Function using NDK

The problem with first one is that it returns either true or false but I need data. And the problem with both methods is that you have to use C/C++ Library in your Android Project. So I decided to create my own user defined function.

You can find many tutorials on how to use NDK in your project, but won't find any complete example of using 3rd Party libraries in your project.

After a lot of searching/studying, I combined things from many different places and was able to complete my task. Below are some steps on how you can do it. I also intend to write a complete step-by-step tutorial.

Getting things ready

  1. First you need libraries which you want to use in your project. In my case, I need sqlite3 amalgamated library, which can be downloaded from here. Extract them in cpp folder of your project.
  2. You might have familiar with CMakeLists.txt file by now when you included NDK in your project. It's time to add these libraries in CMakeLists.txt file. For that, go to your Project pane, you'll see External Build Files there and inside it you'll see the desired file. Open it and edit it as follows,

# Sets the minimum version of CMake required to build the native  # library. You should either keep the default value or only pass a  # value of 3.4.0 or lower.    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11")  cmake_minimum_required(VERSION 3.4.1)    # Creates and names a library, sets it as either STATIC  # or SHARED, and provides the relative paths to its source code.  # You can define multiple libraries, and CMake builds it for you.  # Gradle automatically packages shared libraries with your APK.    add_library( # Sets the name of the library.               native-lib                 # Sets the library as a shared library.               SHARED                 # Provides a relative path to your source file(s).               # Associated headers in the same location as their source               # file are automatically included.               src/main/cpp/native-lib.cpp )    include_directories(${CMAKE_SOURCE_DIR}/src)  add_library(sqlite3 STATIC src/main/cpp/sqlite3.c src/main/cpp/sqlite3.h src/main/cpp/sqlite3ext.h)  add_executable(sqlite src/main/cpp/sqlite3.c src/main/cpp/sqlite3.h src/main/cpp/sqlite3ext.h)  set_target_properties(sqlite PROPERTIES OUTPUT_NAME sqlite3)    # Searches for a specified prebuilt library and stores the path as a  # variable. Because system libraries are included in the search path by  # default, you only need to specify the name of the public NDK library  # you want to add. CMake verifies that the library exists before  # completing its build.      find_library( # Sets the name of the path variable.                log-lib                  # Specifies the name of the NDK library that                # you want CMake to locate.                log )    # Specifies libraries CMake should link to your target library. You  # can link multiple libraries, such as libraries you define in the  # build script, prebuilt third-party libraries, or system libraries.    target_link_libraries( # Specifies the target library.                         native-lib                           # Links the target library to the log library                         # included in the NDK.                         sqlite3                         log )

You have to first add libraries using add_library and then link it to your class which you've made, it is by default named native-lib.cpp.

  1. Build your project and you are ready to go.

Implementing Function

Now here comes the main part. Open native-lib.cpp and include required files and headers. What you have to do;

  1. Make a function which you will be calling from YourActivity.java. You'll know the pattern once you see your file. In my case it is Java_com_kfmwa916_testapp_SearchResult_createFunction(JNIEnv * env, jobject object, jstring search) where

Java is a keyword

com_kfmwa916_testapp is your package

SearchResult is your Java class

createFunction is the name of the function.

  1. Create your function. In my case, I have to apply certain regex in search. Here is mine,

    static void strip_text(sqlite3_context *context, int argc, sqlite3_value **argv) { if(argc == 1) {     __android_log_print(ANDROID_LOG_VERBOSE, "TAG", "inside strip_text");     char *result = (char *) sqlite3_value_text(argv[0]);     std::string text(result);     std::regex regex_head("YOUR REGEX");      if (!text.empty()) {         text = std::regex_replace(text, regex_head, "");          sqlite3_result_text(context, text.c_str(), -1, SQLITE_TRANSIENT);          __android_log_print(ANDROID_LOG_VERBOSE, "STRIPPED TEXT", "%s", text.c_str());         return;         }      }     sqlite3_result_null(context);  } 
  2. Create sqlite3 instance, open database, attach this function to database and use it in your query. Here is a code snippet

    extern "C" void Java_com_kfmwa916_testapp_SearchResult_createFunction(JNIEnv * env, jobject object, jstring search) {     const char * search_term = env->GetStringUTFChars(search, 0);     env->ReleaseStringUTFChars(search, search_term);      std::string q(search_term);      std::string query = "SELECT text FROM dummy WHERE LIKE('%" + q + "%', strip_text(text))=1";      __android_log_print(ANDROID_LOG_VERBOSE, "TAG", "%s", query.c_str());      //GetJStringContent(env, search, search_term);     sqlite3 *db;     //Open database     __android_log_print(ANDROID_LOG_VERBOSE, "TAG", "Opening database");     int rc = sqlite3_open("/data/data/com.kfmwa916.testapp/databases/MyDBName.db", &db);     //It'll be good to check 'rc' for error(s).      //Attach function to database     __android_log_print(ANDROID_LOG_VERBOSE, "TAG", "Attaching function");     rc = sqlite3_create_function(db, "strip_text", 1, SQLITE_ANY, NULL, &strip_text, NULL, NULL);      __android_log_print(ANDROID_LOG_VERBOSE, "TAG", "Executing query");     rc = sqlite3_exec(db, query.c_str(), callback, NULL, NULL);  } 
  3. Implement callback function to process result. It should look like

    static int callback(void *NotUsed, int argc, char **argv, char **azColName)                      {     __android_log_print(ANDROID_LOG_VERBOSE, "TAG", "FOUND");     int i;     for (i = 0; i < argc; ++i) {         __android_log_print(ANDROID_LOG_VERBOSE, "TAG", "%s = %sn", azColName[i], argv[i] ? argv[i] : "NULL");     }     return 0; } 
  4. Finally come to your Java class, in my case, it is SearchResult Load library and define function.

    static {     System.loadLibrary("native-lib"); }  public native void createFunction(String search); 

and call it where you want it. Let's say onClickEvent of a button like createFunction(searchterm)

Post is open for correction and modification.

If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment