Showing posts with label unicode.

Sunday, May 6, 2018

Is It Safe To Use Unicode Literals in HTML?


I am making an application, and I want to add a "HOME" button.

After much struggling with various icon libraries, I stumbled upon this site,

http://graphemica.com/%F0%9F%8F%A0, with this

🏠

A Unicode symbol, which is more akin to a letter than an image.

I pasted it into my HTML, and it just worked™.

All this seems a little too easy, though. Are Unicode symbols widely supported? Is there some kind of problem with them that leads people to use icon libraries instead?
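For reference, the symbol can also be written as a numeric character reference, which sidesteps any file-encoding concerns. Here is a small, purely illustrative Python sketch deriving that reference from the codepoint on the linked graphemica page:

    # The "house" emoji is U+1F3E0.
    symbol = "\U0001F3E0"                  # the literal glyph
    print(hex(ord(symbol)))                # 0x1f3e0
    # Equivalent HTML numeric character reference, safe in any encoding:
    print("&#x{:X};".format(ord(symbol)))  # &#x1F3E0;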

2 Answers

Answer 1

It depends on what you mean by "safe".

Users must have a font that contains the glyph, so you should include an appropriate web font, and in several formats: there is not yet a single format recognized by all of the most used browsers.

Additionally, fonts with multiple colours are not fully supported on all systems, so you should consider what you expect users to do with the symbol (click, select, copy, etc.).

Additionally, every font has its own design, so the same glyph can look different between fonts (and therefore between browsers and operating systems). We do not yet have a "Helvetica 'Home'" or a "Times New Roman 'Home'".

All these points could be addressed by using a web font with monochrome glyphs (though it could be huge if it includes all Unicode code points, plus the usual combinations).

Various recent browsers have reportedly crashed on pages with many different glyphs, but usually this should not be a problem.

I also recommend adding ARIA attributes so that your page can be used by screen readers (and braille displays) as well.

Note: on the plus side, the few people who use text browsers can actually read the HOME symbol (not the case with an image), if anybody still cares about that use case.

Answer 2

Here are a few precautions to take when doing that. I did some research and found the following to be most helpful for your question. Credit goes to Mr. GOY:

Displaying unicode symbols in HTML


Thursday, May 3, 2018

Regex for accepting only Persian characters


I'm working on a form where one of its custom validators should only accept Persian characters... I used the following code:

    var myregex = new Regex(@"^[\u0600-\u06FF]+$");
    if (myregex.IsMatch(mytextBox.Text))
    {
        args.IsValid = true;
    }
    else
    {
        args.IsValid = false;
    }

but it seems it only works for checking Arabic characters and doesn't cover all Persian characters (it lacks these four: گ, چ, پ, ژ)... Is there a way to solve this problem?

8 Answers

Answer 1

TL;DR

\u0600-\u06FF includes:

  • گ with codepoint 06AF
  • چ with codepoint 0686
  • پ with codepoint 067E
  • ژ with codepoint 0698

as well. You don't need to worry about گ چ پ ژ or duplicate codepoints (as in the accepted answer!). But... all answers that say to use \u0600-\u06FF or [آ-ی] are simply WRONG.

That is, \u0600-\u06FF contains 209 more characters than you need! And it includes numbers too!


The character sets you MUST use for Farsi are as follows:

  • Use ^[آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی]+$ for letters, or, with codepoints, depending on your regex flavor's syntax:

    ^[\u0622\u0627\u0628\u067E\u062A-\u062C\u0686\u062D-\u0632\u0698\u0633-\u063A\u0641\u0642\u06A9\u06AF\u0644-\u0648\u06CC]+$ 
  • Use ^[۰۱۲۳۴۵۶۷۸۹]+$ for numbers, or, with codepoints:

    ^[\u06F0-\u06F9]+$ 
  • Use [ ‬ٌ ‬ًّ ‬َ ‬ِ ‬ُ ‬ْ ‬] for vowels, or, with codepoints:

    [\u202C\u064B\u064C\u064E-\u0652] 

or a combination of these together. You may also want to add other Arabic letters such as Hamza ء to your character set.
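As an illustration, here is a small Python sketch (Python is just for demonstration; the character classes are the ones listed above) showing the letter and digit patterns in action:

    import re

    # Farsi letters only (codepoints from the list above)
    FARSI_LETTERS = re.compile(u'^[\u0622\u0627\u0628\u067E\u062A-\u062C\u0686'
                               u'\u062D-\u0632\u0698\u0633-\u063A\u0641\u0642'
                               u'\u06A9\u06AF\u0644-\u0648\u06CC]+$')
    # Farsi (Extended Arabic-Indic) digits only
    FARSI_DIGITS = re.compile(u'^[\u06F0-\u06F9]+$')

    print(bool(FARSI_LETTERS.match(u'سلام')))    # True: Farsi letters only
    print(bool(FARSI_LETTERS.match(u'salam')))   # False: Latin letters
    print(bool(FARSI_DIGITS.match(u'۱۳')))       # True: Farsi digits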

Whole story

This answer exists to fix a common misconception. Codepoints 0600 through 06FF do not represent the Persian / Farsi alphabet (and neither does [آ-ی]):

[\u0600-\u0605 ؐ-ؚ\u061Cـ ۖ-\u06DD ۟-ۤ ۧ ۨ ۪-ۭ ً-ٕ ٟ ٖ-ٞ ٰ ، ؍ ٫ ٬ ؛ ؞ ؟ ۔ ٭ ٪ ؉ ؊ ؈ ؎ ؏ ۞ ۩ ؆ ؇ ؋ ٠۰ ١۱ ٢۲ ٣۳ ٤۴ ٥۵ ٦۶ ٧۷ ٨۸ ٩۹ ءٴ۽ آ أ ٲ ٱ ؤ إ ٳ ئ ا ٵ ٮ ب ٻ پ ڀ ة-ث ٹ ٺ ټ ٽ ٿ ج ڃ ڄ چ ڿ ڇ ح خ ځ ڂ څ د ذ ڈ-ڐ ۮ ر ز ڑ-ڙ ۯ س ش ښ-ڜ ۺ ص ض ڝ ڞ ۻ ط ظ ڟ ع غ ڠ ۼ ف ڡ-ڦ ٯ ق ڧ ڨ ك ک-ڴ ػ ؼ ل ڵ-ڸ م۾ ن ں-ڽ ڹ ه ھ ہ-ۃ ۿ ەۀ وۥ ٶ ۄ-ۇ ٷ ۈ-ۋ ۏ ى يۦ ٸ ی-ێ ې ۑ ؽ-ؿ ؠ ے ۓ \u061D] 

255 characters fall under the Arabic block (0600–06FF). The Farsi alphabet has 32 letters; adding the Farsi representation of the digits brings it to 42. If we add the vowels (originally Arabic vowels, rarely used in Farsi) without Tanvin (ً, ٍِ ‬, ٌ ‬) and Tashdid (ّ ‬), which are both subsets of Arabic diacritics, not Farsi, we end up with 46 characters. This means \u0600-\u06FF contains 209 more characters than you need!

۷ with codepoint 06F7 is the Farsi representation of the number 7, while ٧ with codepoint 0667 is the Arabic representation of the same number. Likewise, ۶ is the Farsi representation of the number 6 and ٦ is the Arabic one. All of them reside in codepoints 0600 through 06FF.

The shapes of the Persian digits four (۴), five (۵), and six (۶) differ from the shapes used in Arabic, and the remaining digits have different codepoints as well.
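A throwaway Python check (my own illustration, not from the original answer) makes the distinction visible:

    # Farsi (Extended Arabic-Indic) digits vs. Arabic-Indic digits:
    for ch in u'۶۷':                 # Farsi six and seven
        print(ch, hex(ord(ch)))      # 0x6f6, 0x6f7
    for ch in u'٦٧':                 # Arabic six and seven
        print(ch, hex(ord(ch)))      # 0x666, 0x667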

There is also a fair number of other characters in this range that don't exist in Farsi / Persian, and nobody wants them when validating a first name or surname.

[آ-ی] includes 117 characters too, which is far more than anyone needs for validation. You can see them all using Unicode CLDR.

Answer 2

What you currently have in your regex is the standard Arabic symbol range. For the additional characters, you need to add them to the regex separately. Here are their codes:

    ژ \u0698
    پ \u067E
    چ \u0686
    گ \u06AF

So all in all you should have

^[\u0600-\u06FF\u0698\u067E\u0686\u06AF]+$ 
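If it helps, here is a quick Python check of that combined pattern (purely illustrative; note that the four extra codepoints are, as the previous answer points out, already inside \u0600-\u06FF, so listing them again is harmless but redundant):

    import re

    pattern = re.compile(u'^[\u0600-\u06FF\u0698\u067E\u0686\u06AF]+$')
    print(bool(pattern.match(u'گچپژ')))   # True: the four extra letters match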

Answer 3

In addition to the accepted answer (https://stackoverflow.com/a/22565376/790811), we should consider the zero-width non-joiner (نیم فاصله in Persian) character too. Unfortunately, we have 2 symbols for it. One is standard and the other is not standard but widely used:

  1. \u200C : http://en.wikipedia.org/wiki/Zero-width_non-joiner
  2. \u200F : Right-to-left mark (http://unicode-table.com/en/#200F)

So the final regex can be:

^[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F]+$ 

If you want to allow the space character too, you can use this:

^[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F ]+$ 

You can test it in JavaScript like this:

/^[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F ]+$/.test('ای‌پسر تو چه می‌دانی؟') 

Answer 4

Attention: persianRex is written in JavaScript; however, you can use the source code and copy-paste the characters.

Detecting Persian characters is a tricky task due to the variety of keyboard layouts and operating systems. I faced the same challenge some time ago and decided to write an open-source library to fix this issue.

You can fix your issue like this: persianRex.text.test(yourInput); // returns true or false

Here is the full documentation: http://imanmh.github.io/persianRex/

Answer 5

Farsi, Dari and Tajik are out of my bailiwick, but a little rummaging through the Unicode code charts tells me that Arabic covers 5 Unicode code blocks:

  • Arabic (U+0600-U+06FF)
  • Arabic Supplement (U+0750-U+077F)
  • Arabic Extended-A (U+08A0-U+08FF)
  • Arabic Presentation Forms-A (U+FB50-U+FDFF)
  • Arabic Presentation Forms-B (U+FE70-U+FEFF)

You can get at them (at least some of them) in regular expressions using named blocks instead of explicit code point ranges: \p{IsArabicPresentationForms-A} will give you the 4th Unicode block in the preceding list.

You might also read Persian Computing in Unicode: http://behdad.org/download/Publications/persiancomputing/a007.pdf

Answer 6

I can't read Farsi, but see if one of the Arabic Unicode supplements has the letters you are looking for.

http://www.unicode.org/charts/

Answer 7

The named blocks, e.g. \p{Arabic}, cover the entire Arabic script, not just the Persian characters.

The presentation forms (U+FB50-U+FDFF) should not be used in text and should be converted to the standard range (U+0600-U+06FF).

In order to only cover Persian we need the following:

  • The subset of Farsi characters in the standard Arabic range, i.e. (U+0621-U+0624, U+0626-U+063A, U+0641-U+0642, U+0644-U+0648)
  • The standard Arabic diacritics (U+064B-U+0652)
  • The 2 additional diacritics (U+0654, U+0670)
  • The 4 extra Farsi characters "گ چ پ ژ" (U+067E, U+0686, U+0698, U+06AF)
  • U+06A9: Persian Kaf (formally: "Arabic Letter Keheh"; different notation from Arabic Kaf)
  • U+06CC: Farsi Yeh (a different notation from the Arabic Yeh)
  • U+200C: Zero-Width-Non-Joiner

So, the resulting regexp would be:

^[\u0621-\u0624\u0626-\u063A\u0641-\u0642\u0644-\u0648\u064B-\u0652\u067E\u0686\u0698\u06AF\u06CC\u06A9\u0654\u0670\u200C]+$ 
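As a quick sanity check, here is that character class in Python (an illustrative sketch; the test word is my own example):

    import re

    PERSIAN = re.compile(u'^[\u0621-\u0624\u0626-\u063A\u0641-\u0642'
                         u'\u0644-\u0648\u064B-\u0652\u067E\u0686\u0698'
                         u'\u06AF\u06CC\u06A9\u0654\u0670\u200C]+$')
    # U+200C (ZWNJ) is included, so compound words validate:
    print(bool(PERSIAN.match(u'می\u200Cدانم')))   # True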

See also the exemplar characters for Persian listed here:

http://unicode.org/cldr/trac/browser/trunk/common/main/fa.xml

Answer 8

I'm not sure regex is the way to do this; the problem is not specific to Persian or Arabic, and applies equally to Chinese or Russian text. Perhaps you could check whether the characters exist in your code page; if they are not in the code page, then I doubt the user can insert them using an input device...

    var encoding = Encoding.GetEncoding(1256);
    var expect = "گ چ پ ژ";
    var actual = encoding.GetBytes("گ چ پ ژ");
    Assert.AreEqual(encoding.GetString(actual), expect);

The test performs a round trip, checking that the input string survives conversion to bytes and back. The link shows the supported code pages.
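For comparison, here is the same round-trip idea sketched in Python (cp1256 is the Windows-1256 Arabic code page; this is just an illustration of the test above):

    # Round-trip through Windows-1256: if a character survives
    # encode/decode, it exists in that code page.
    expect = u'گ چ پ ژ'
    actual = expect.encode('cp1256')
    assert actual.decode('cp1256') == expect   # passes: all four letters exist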

Happy coding

Walter


Tuesday, October 17, 2017

Why does this code, written backwards, print “Hello World!”


Here is some code that I found on the Internet:

class M‮{public static void main(String[]a‭){System.out.print(new char[] {'H','e','l','l','o',' ','W','o','r','l','d','!'});}}     

This code prints Hello World! onto the screen; you can see it run here. I can clearly see public static void main written, but it is backwards. How does this code work? How does this even compile?

Edit: I tried this code in IntelliJ, and it works fine. However, for some reason it doesn't work in Notepad++ or cmd. I still haven't found a solution for that, so if anyone does, comment down below.

4 Answers

Answers 1

There are invisible characters here that alter how the code is displayed. In IntelliJ these can be found by copy-pasting the code into an empty string (""), which replaces them with Unicode escapes, removing their effects and revealing the order the compiler sees.

Here is the output of that copy-paste:

"class M\u202E{public static void main(String[]a\u202D){System.out.print(new char[]\n"+         "{'H','e','l','l','o',' ','W','o','r','l','d','!'});}}   " 

The source code characters are stored in this order, and the compiler treats them as being in this order, but they're displayed differently.

Note the \u202E character, which is a right-to-left override, starting a block where all characters are forced to be displayed right-to-left, and the \u202D, which is a left-to-right override, starting a nested block where all characters are forced into left-to-right order, overriding the first override.

Ergo, when it displays the original code, class M is displayed normally, but the \u202E reverses the display order of everything from there to the \u202D, which reverses everything again. (Formally, everything from the \u202D to the line terminator gets reversed twice, once due to the \u202D and once with the rest of the text reversed due to the \u202E, which is why this text shows up in the middle of the line instead of the end.) The next line's directionality is handled independently of the first's due to the line terminator, so {'H','e','l','l','o',' ','W','o','r','l','d','!'});}} is displayed normally.

For the full (extremely complex, dozens of pages long) Unicode bidirectional algorithm, see Unicode Standard Annex #9.
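You can reproduce the effect outside Java. Here is a small Python sketch (my own illustration) showing that the order in memory stays unambiguous even when the display order is reversed:

    # The string stores "abc", then RLO (U+202E), then "def".
    # Many terminals and editors *display* it as "abc fed", but the
    # codepoints in memory are in exactly this order:
    s = u'abc\u202Edef'
    print([hex(ord(c)) for c in s])
    # ['0x61', '0x62', '0x63', '0x202e', '0x64', '0x65', '0x66']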

Answer 2

It looks different because of the Unicode Bidirectional Algorithm. There are two invisible characters, RLO and LRO, that the Unicode Bidirectional Algorithm uses to change the visual appearance of the characters nested between them.

The result is that visually the characters appear in reverse order, but the actual characters in memory are not reversed. You can analyse the results here. The Java compiler does not strip RLO and LRO; they are invisible format characters that it accepts as part of identifiers, which is why the code compiles.

Note 1: This algorithm is used by text editors and browsers to visually display both LTR characters (e.g. English) and RTL characters (e.g. Arabic, Hebrew) together at the same time - hence "bi"-directional. You can read more about the Bidirectional Algorithm at Unicode's website.
Note 2: The exact behaviour of LRO and RLO is defined in Section 2.2 of the Algorithm.

Answer 3

The character U+202E mirrors the displayed code from right to left. It is very cleverly hidden, starting in the M:

"class M\u202E{..." 

How did I find the magic behind this?

Well, at first when I saw the question I thought, "it's some kind of joke, to waste somebody's time", but then I opened my IDE (IntelliJ), created a class, and pasted the code... and it compiled!!! So I took a better look and saw that the public static void was backwards, so I went there with the cursor and erased a few chars... And what happened? The chars started erasing backwards. So I thought, hmm... strange... I have to execute it. So I proceeded to run the program, but first I needed to save it... and that was when I found it! I couldn't save the file because my IDE said that there was a different encoding for some char, and it pointed me to where it was. So I started searching Google for special chars that could do the job, and that's it :)

A little about the Unicode Bidirectional Algorithm, and the U+202E character involved, briefly explained:

The Unicode Standard prescribes a memory representation order known as logical order. When text is presented in horizontal lines, most scripts display characters from left to right. However, there are several scripts (such as Arabic or Hebrew) where the natural ordering of horizontal text in display is from right to left. If all of the text has a uniform horizontal direction, then the ordering of the display text is unambiguous.

However, because these right-to-left scripts use digits that are written from left to right, the text is actually bi-directional: a mixture of right-to-left and left-to-right text. In addition to digits, embedded words from English and other scripts are also written from left to right, also producing bidirectional text. Without a clear specification, ambiguities can arise in determining the ordering of the displayed characters when the horizontal direction of the text is not uniform.

This annex describes the algorithm used to determine the directionality for bidirectional Unicode text. The algorithm extends the implicit model currently employed by a number of existing implementations and adds explicit formatting characters for special circumstances. In most cases, there is no need to include additional information with the text to obtain correct display ordering.

However, in the case of bidirectional text, there are circumstances where an implicit bidirectional ordering is not sufficient to produce comprehensible text. To deal with these cases, a minimal set of directional formatting characters is defined to control the ordering of characters when rendered. This allows exact control of the display ordering for legible interchange and ensures that plain text used for simple items like filenames or labels can always be correctly ordered for display.

Why create an algorithm like this?

The bidi algorithm can render a sequence of Arabic or Hebrew characters one after the other, from right to left.

P.S.: I know it's not the best answer, but it was fun to crack the problem first :P

Answer 4

Chapter 3 of the language specification provides an explanation by describing in detail how the lexical translation is done for a Java program. What matters most for the question:

Programs are written in Unicode (§3.1), but lexical translations are provided (§3.2) so that Unicode escapes (§3.3) can be used to include any Unicode character using only ASCII characters.

So a program is written in Unicode characters, and the author can escape them using \uxxxx in case the file encoding does not support the Unicode character, in which case it is translated to the appropriate character. One of the Unicode characters present in this case is \u202E. It is not visually shown in the snippet, but if you try switching the encoding of the browser, the hidden characters may appear.

Therefore, the lexical translation results in the class declaration:

class M\u202E{ 

which means that the class identifier is M\u202E. The specification considers this a valid identifier:

Identifier:
    IdentifierChars but not a Keyword or BooleanLiteral or NullLiteral

IdentifierChars:
    JavaLetter {JavaLetterOrDigit}

A "Java letter-or-digit" is a character for which the method Character.isJavaIdentifierPart(int) returns true.


Friday, February 3, 2017

UTF-8 / Unicode Text Encoding with RPostgreSQL


I'm running R on a Windows machine which is directly linked to a PostgreSQL database. I'm not using RODBC. My database is encoded in UTF-8 as confirmed by the following R command:

dbGetQuery(con, "SHOW CLIENT_ENCODING")
#   client_encoding
# 1            UTF8

However, when some text is read into R, it displays as strange text in R.

For example, the following text is shown in my PostgreSQL database: "Stéphane"

After exporting to R it's shown as: "Stéphane" (the é is encoded as é)

When importing to R I use the dbConnect command to establish a connection and the dbGetQuery command to query data using SQL. I do not specify any text encoding anywhere when connecting to the database or when running a query.

I've searched online and can't find a direct resolution to my issue. I found this link, but their issue is with RODBC, which I'm not using.

This link is helpful in identifying the symbols, but I don't just want to do a find & replace in R... way too much data.

I did try running the commands below, and I arrived at a warning.

Sys.setlocale("LC_ALL", "en_US.UTF-8")
# [1] ""
# Warning message:
# In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
#   OS reports request to set locale to "en_US.UTF-8" cannot be honored
Sys.setenv(LANG="en_US.UTF-8")
Sys.setenv(LC_CTYPE="UTF-8")

The warning occurs on the Sys.setlocale("LC_ALL", "en_US.UTF-8") command. My intuition is that this is a Windows-specific issue that doesn't occur on Mac/Linux/Unix.

EDIT 2014-01-29:
The following execution will fix any Unicode/UTF-8 problems in Windows. It must be executed before querying the database.

postgresqlpqExec(con, "SET client_encoding = 'windows-1252'") 

2 Answers

Answers 1

After exporting to R it's shown as: "Stéphane" (the é is encoded as é)

Your R environment is using a 1-byte non-composed encoding like latin-1 or windows-1252. Witness this test in Python, demonstrating that the utf-8 bytes for é, decoded as if they were latin-1, produce the text you see:

>>> print u"é".encode("utf-8").decode("latin-1")
é

Either SET client_encoding = 'windows-1252' or fix the encoding your R environment uses. If it's running in a cmd.exe console you'll need to mess with the chcp console command; otherwise it's specific to whatever your R runtime is.
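If you already have mangled strings, the same trick can sometimes repair them after the fact. A hedged Python sketch (this works only when the bytes survived the wrong decoding intact):

    # Reverse the mis-decoding: re-encode as latin-1, then decode as UTF-8.
    broken = u'Stéphane'
    fixed = broken.encode('latin-1').decode('utf-8')
    print(fixed)   # Stéphane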

Answer 2

As Craig Ringer said, setting client_encoding to windows-1252 is probably not the best thing to do. Indeed, if the data you're retrieving contains a single exotic character, you're in trouble:

Error in postgresqlExecStatement(conn, statement, ...) :
  RS-DBI driver: (could not Retrieve the result : ERROR: character 0xcca7 of encoding "UTF8" has no equivalent in "WIN1252")

On the other hand, getting your R environment to use Unicode may be impossible (I have the same problem as you with Sys.setlocale... the same as in this question too).

A (dirty) workaround is to manually declare UTF-8 encoding on all your data, using a function like this one:

set_utf8 = function(x){
  # Declare UTF-8 encoding on all character strings:
  for (i in 1:ncol(x)){
    if (is.character(x[, i])) Encoding(x[, i]) <- 'UTF-8'
  }
  # Same on column names:
  for (name in colnames(x)){
    Encoding(name) <- 'UTF-8'
  }
  return(x)
}

And you have to use this function in all your queries:

set_utf8(dbGetQuery(con, "SELECT myvar FROM mytable")) 

Monday, March 14, 2016

Can N function cause problems with existing queries?


We use Oracle 10g and Oracle 11g.

We also have a layer to automatically compose queries, from pseudo-SQL code written in .net (something like SqlAlchemy for Python).

Our layer currently wraps any string in single quotes ' and, if it contains non-ANSI characters, automatically composes a UNISTR call with the special characters written as Unicode code units (like \00E0).

Now we created a method for doing multiple inserts with the following construct:
INSERT INTO ... (...) SELECT ... FROM DUAL UNION ALL SELECT ... FROM DUAL ...

This algorithm could compose queries where the same string field is sometimes passed as 'my simple string' and sometimes wrapped as UNISTR('my string with special chars like \00E0').

The described condition causes an ORA-12704: character set mismatch.

One solution is to use the INSERT ALL construct, but it is very slow compared to the one used now.

Another solution is to instruct our layer to put N in front of any string (except for the ones already wrapped with UNISTR). This is simple.

I just want to know if this could cause any side-effect on existing queries.

Note: all our fields on DB are either NCHAR or NVARCHAR2.


Oracle ref: http://docs.oracle.com/cd/B19306_01/server.102/b14225/ch7progrunicode.htm

3 Answers

Answer 1

Basically, what you are asking is: is there a difference between how a string is stored with or without the N function?

You can check for yourself; consider:

SQL> create table test (val nvarchar2(20));

Table TEST created.

SQL> insert into test select n'test' from dual;

1 row inserted.

SQL> insert into test select 'test' from dual;

1 row inserted.

SQL> select dump(val) from test;

DUMP(VAL)
--------------------------------------------------------------------------------
Typ=1 Len=8: 0,116,0,101,0,115,0,116
Typ=1 Len=8: 0,116,0,101,0,115,0,116

As you can see, the dumps are identical, so there is no side effect.

The reason this works so beautifully is the elegance of Unicode.

If you are interested, here is a nice video explaining it:

https://www.youtube.com/watch?v=MijmeoH9LT4

Answer 2

I assume that you get the error "ORA-12704: character set mismatch" because your data inside quotes is considered CHAR while your fields are NCHAR, so they are collated using different character sets: one uses NLS_CHARACTERSET, the other NLS_NCHAR_CHARACTERSET.

When you use the UNISTR function, it converts data from CHAR to NCHAR (and also converts encoded values into characters), as the Oracle docs say:

"UNISTR takes as its argument a text literal or an expression that resolves to character data and returns it in the national character set."

But when you convert values explicitly using N or TO_NCHAR, you only get the value in NLS_NCHAR_CHARACTERSET, without decoding. So if you have some values encoded like "\00E0", they will not be decoded and will be taken as-is.

So if you have an insert like this:

   insert into ...
   select N'my string with special chars like \00E0',
          UNISTR('my string with special chars like \00E0')
   from dual ....

your data in the first inserted field will be 'my string with special chars like \00E0', not 'my string with special chars like à'. This is the only side effect I'm aware of. Other queries should already use the NLS_NCHAR_CHARACTERSET encoding, so explicit conversion shouldn't cause any problem.

And by the way, why not just insert all values as N'my string with special chars like à'? Just encode them into UTF-16 first (I suppose you use UTF-16 for nchars) if you use a different encoding in the 'upper level' software.

Answer 3

  • Use of the N function: you already have answers above.

If you have any chance to change the charset of the database, that would really make your life easier. I have worked on huge production systems and found the trend that, because storage space is cheap, simply everyone moves to AL32UTF8 and the hassle of internationalization slowly becomes a painful memory of the past.

I found the easiest thing is to use AL32UTF8 as the charset of the database instance and simply use varchar2 everywhere. We read and write standard Java Unicode strings via JDBC as bind variables without any harm or fiddling.

Your idea to construct a huge text of SQL inserts may not scale well for multiple reasons:

  • there is a fixed maximum length for an SQL statement - so it won't work with 10000 inserts
  • it is advised to use bind variables (and then you don't have the n'xxx' vs UNISTR mess either)
  • creating a new SQL statement dynamically is very resource-unfriendly. It does not allow Oracle to cache any execution plan, and forces Oracle to hard-parse your long statement on each call.

What you're trying to achieve is a mass insert. Use the JDBC batch mode of the Oracle driver to perform that at light speed; see e.g.: http://viralpatel.net/blogs/batch-insert-in-java-jdbc/

Note that insert speed is also affected by triggers (which have to be executed) and foreign key constraints (which have to be validated). So if you're about to insert more than a few thousand rows, consider disabling the triggers and foreign key constraints and re-enabling them after the insert. (You'll lose the trigger calls, but the constraint validation after the insert can still make an impact.)

Also consider the rollback segment size. If you're inserting a million records, that will need a huge rollback segment, which will likely cause serious swapping on the storage media. It is a good rule of thumb to commit after every 1000 records.

(Oracle uses versioning instead of shared locks, so a table with uncommitted changes is consistently available for reading. The 1000-record commit rate means roughly one commit per second - slow enough to benefit from write buffers, but quick enough not to interfere with other humans wanting to update the same table.)
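To make the batching advice concrete, here is a hedged Python sketch using the cx_Oracle driver (the table and column names are placeholders, and the same pattern applies to JDBC batch mode):

    import cx_Oracle  # assumption: the cx_Oracle driver is installed

    def batch_insert(conn, rows, batch_size=1000):
        # Insert rows in batches with bind variables, committing once
        # per batch rather than per row, per the rule of thumb above.
        cur = conn.cursor()
        for i in range(0, len(rows), batch_size):
            cur.executemany(
                "INSERT INTO mytable (val) VALUES (:1)",  # hypothetical table
                rows[i:i + batch_size],
            )
            conn.commit()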
