I'm working on a form in which one of the custom validators should accept only Persian characters. I used the following code:
var myregex = new Regex(@"^[\u0600-\u06FF]+$");
args.IsValid = myregex.IsMatch(mytextBox.Text);
but it seems it only works for Arabic characters and doesn't cover all Persian characters (it lacks these four: گ, چ, پ, ژ). Is there a way to solve this problem?
8 Answers
Answers 1
TL;DR
\u0600-\u06FF already includes گ (codepoint 06AF), چ (codepoint 0686), پ (codepoint 067E) and ژ (codepoint 0698). You don't need to worry about گ چ پ ژ or about listing their codepoints a second time (as the accepted answer does!). But all answers that say to use \u0600-\u06FF or [آ-ی] are simply WRONG: \u0600-\u06FF contains 209 more characters than you need, and it includes numbers too!
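The claim is easy to check. Here is a minimal JavaScript sketch (the variable names are made up for illustration) showing that the whole-block range already matches the four Persian-specific letters, and that it also accepts digits and punctuation:

```javascript
// The range used in the question: the whole Arabic block.
const arabicBlock = /^[\u0600-\u06FF]+$/;

// The four letters the question believes are missing: گ چ پ ژ
const persianOnly = ['\u06AF', '\u0686', '\u067E', '\u0698'];
const allFourMatch = persianOnly.every((ch) => arabicBlock.test(ch));

// Characters a name validator probably should not accept.
const acceptsArabicDigit = arabicBlock.test('\u0667');  // ٧ Arabic-Indic digit seven
const acceptsPersianDigit = arabicBlock.test('\u06F7'); // ۷ Persian digit seven
const acceptsQuestionMark = arabicBlock.test('\u061F'); // ؟ Arabic question mark

console.log({ allFourMatch, acceptsArabicDigit, acceptsPersianDigit, acceptsQuestionMark });
```

All four flags come out true, which is exactly the problem: the range is not missing anything, it is far too permissive.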
The character sets Farsi actually needs are as follows:
- Letters: use ^[آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی]+$ or, with codepoints in your flavor's syntax, ^[\u0622\u0627\u0628\u067E\u062A-\u062C\u0686\u062D-\u0632\u0698\u0633-\u063A\u0641\u0642\u06A9\u06AF\u0644-\u0648\u06CC]+$
- Digits: use ^[۰۱۲۳۴۵۶۷۸۹]+$ or, with codepoints in your flavor's syntax, ^[\u06F0-\u06F9]+$
- Vowels: use [ ٌ ًّ َ ِ ُ ْ ] or, with codepoints in your flavor's syntax, [\u202C\u064B\u064C\u064E-\u0652]
You can also combine these sets. You may additionally want to add other Arabic letters, such as Hamza (ء), to your character set.
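As a rough illustration of combining the sets, here is a JavaScript sketch that builds validators from the letter and digit classes above. The constant names and the sample words are assumptions chosen for the example, not part of any library:

```javascript
// Character classes copied from the answer, kept as regex source text.
const LETTERS = String.raw`\u0622\u0627\u0628\u067E\u062A-\u062C\u0686\u062D-\u0632\u0698\u0633-\u063A\u0641\u0642\u06A9\u06AF\u0644-\u0648\u06CC`;
const DIGITS = String.raw`\u06F0-\u06F9`;

// Hypothetical validators: letters only, and letters plus Persian digits.
const lettersOnly = new RegExp(`^[${LETTERS}]+$`);
const lettersAndDigits = new RegExp(`^[${LETTERS}${DIGITS}]+$`);

const persianKetab = '\u06A9\u062A\u0627\u0628'; // کتاب written with Persian Kaf (U+06A9)
const arabicKetab = '\u0643\u062A\u0627\u0628';  // same word typed with Arabic Kaf (U+0643)
const persianDigits = '\u06F1\u06F2\u06F3';      // ۱۲۳
const arabicDigits = '\u0661\u0662\u0663';       // ١٢٣

console.log(lettersOnly.test(persianKetab));        // true
console.log(lettersOnly.test(arabicKetab));         // false
console.log(lettersOnly.test(persianDigits));       // false
console.log(lettersAndDigits.test(persianDigits));  // true
console.log(lettersAndDigits.test(arabicDigits));   // false
```

Note that neither pattern allows spaces or the zero-width non-joiner, so multi-word input needs those added to the class explicitly.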
Whole story
This answer exists to fix a common misconception: codepoints 0600 through 06FF do not denote the Persian / Farsi alphabet (and neither does [آ-ی]):
[\u0600-\u0605 ؐ-ؚ\u061Cـ ۖ-\u06DD ۟-ۤ ۧ ۨ ۪-ۭ ً-ٕ ٟ ٖ-ٞ ٰ ، ؍ ٫ ٬ ؛ ؞ ؟ ۔ ٭ ٪ ؉ ؊ ؈ ؎ ؏ ۞ ۩ ؆ ؇ ؋ ٠۰ ١۱ ٢۲ ٣۳ ٤۴ ٥۵ ٦۶ ٧۷ ٨۸ ٩۹ ءٴ۽ آ أ ٲ ٱ ؤ إ ٳ ئ ا ٵ ٮ ب ٻ پ ڀ ة-ث ٹ ٺ ټ ٽ ٿ ج ڃ ڄ چ ڿ ڇ ح خ ځ ڂ څ د ذ ڈ-ڐ ۮ ر ز ڑ-ڙ ۯ س ش ښ-ڜ ۺ ص ض ڝ ڞ ۻ ط ظ ڟ ع غ ڠ ۼ ف ڡ-ڦ ٯ ق ڧ ڨ ك ک-ڴ ػ ؼ ل ڵ-ڸ م۾ ن ں-ڽ ڹ ه ھ ہ-ۃ ۿ ەۀ وۥ ٶ ۄ-ۇ ٷ ۈ-ۋ ۏ ى يۦ ٸ ی-ێ ې ۑ ؽ-ؿ ؠ ے ۓ \u061D]
255 characters fall under the Arabic block (0600–06FF). The Farsi alphabet has 32 letters; together with the Farsi forms of the ten digits, that makes 42. If we add the vowels (originally Arabic vowels, rarely used in Farsi), leaving out Tanvin (ً , ٍِ , ٌ) and Tashdid (ّ), which both belong to the Arabic diacritics rather than to Farsi, we end up with 46 characters. This means \u0600-\u06FF contains 209 more characters than you need!
۷ (codepoint 06F7) is the Farsi representation of the digit 7, and ٧ (codepoint 0667) is the Arabic representation of the same digit. ۶ is the Farsi representation of the digit 6, and ٦ is the Arabic representation of the same digit. All of them reside in codepoints 0600 through 06FF.
The shapes of the Persian digits four (۴), five (۵), and six (۶) are different from the shapes used in Arabic, and the other digits have different codepoints as well.
You can also see a number of other characters that don't exist in Farsi / Persian at all, and nobody wants to accept them when validating a first name or surname.
[آ-ی] includes 117 characters too, which is far more than anyone needs for validation. You can see them all using Unicode CLDR.
Answers 2
What you currently have in your regex is the standard Arabic symbols range. For additional characters, you need to add them to the regex separately. Here are their codes:
ژ \u0698
پ \u067E
چ \u0686
گ \u06AF
So all in all you should have
^[\u0600-\u06FF\u0698\u067E\u0686\u06AF]+$
Answers 3
In addition to the accepted answer (https://stackoverflow.com/a/22565376/790811), we should also consider the zero-width non-joiner (نیم فاصله in Persian). Unfortunately, two symbols are used for it: one is standard, and the other is non-standard but widely used:
- \u200C : http://en.wikipedia.org/wiki/Zero-width_non-joiner
- \u200F : Right-to-left mark (http://unicode-table.com/en/#200F)
So the final regex can be:
^[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F]+$
If you want to allow spaces as well, you can use this:
^[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F ]+$
You can test it in JavaScript like this:
/^[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F ]+$/.test('ایپسر تو چه میدانی؟')
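To see why the zero-width non-joiner matters, here is a minimal sketch comparing a class with and without \u200C. The pattern names are illustrative, and the sample word is written with escapes so the invisible character cannot get lost in copy and paste:

```javascript
// Same broad range, with and without the zero-width non-joiner (U+200C).
const noZwnj = /^[\u0600-\u06FF ]+$/;
const allowZwnj = /^[\u0600-\u06FF\u200C ]+$/;

// The word "mi-dani" typed correctly, with a ZWNJ after the prefix.
const word = '\u0645\u06CC\u200C\u062F\u0627\u0646\u06CC';

console.log(noZwnj.test(word));    // false: U+200C is outside the Arabic block
console.log(allowZwnj.test(word)); // true
```

Correctly typed Persian text uses this character often, so a validator without it rejects well-formed input.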
Answers 4
Note: persianRex is written in JavaScript; however, you can use the source code and copy and paste the characters.
Detecting Persian characters is a tricky task due to the variety of keyboard layouts and operating systems. I faced the same challenge some time ago and decided to write an open source library to fix this issue.
You can fix your issue like this: persianRex.text.test(yourInput); // returns true or false
Here is the full documentation: http://imanmh.github.io/persianRex/
Answers 5
Farsi, Dari and Tajik are out of my bailiwick, but a little rummaging through the Unicode code charts tells me that Arabic covers 5 Unicode code blocks:
- Arabic: http://www.unicode.org/charts/PDF/U0600.pdf
- Arabic Supplement: http://www.unicode.org/charts/PDF/U0750.pdf
- Arabic Extended-A: http://www.unicode.org/charts/PDF/U08A0.pdf
- Arabic Presentation Forms-A: http://www.unicode.org/charts/PDF/UFB50.pdf
- Arabic Presentation Forms-B: http://www.unicode.org/charts/PDF/UFE70.pdf
You can get at them (at least some of them) in regular expressions using named blocks instead of explicit code point ranges: \p{IsArabicPresentationForms-A} will give you the 4th Unicode block in the preceding list.
You might also read Persian Computing in Unicode: http://behdad.org/download/Publications/persiancomputing/a007.pdf
Answers 6
I can't read Farsi, but see whether one of the Arabic Unicode supplements has the letters you are looking for.
Answers 7
The named blocks, e.g. \p{Arabic}, cover the entire Arabic script, not just the Persian characters.
The presentation forms (U+FB50-U+FDFF) should not be used in text and should be converted to the standard range (U+0600-U+06FF).
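A JavaScript sketch of the same point: JavaScript has no named-block syntax like .NET's \p{IsArabic}, so this uses the closest equivalent, the Script property escape, which requires the `u` flag. The variable name is illustrative:

```javascript
// Matches any run of characters whose Unicode script is Arabic.
const anyArabicScript = /^\p{Script=Arabic}+$/u;

console.log(anyArabicScript.test('\u067E')); // true: پ, a Persian letter
console.log(anyArabicScript.test('\u0643')); // true: ك, Arabic Kaf, not used in Persian
console.log(anyArabicScript.test('\uFB8A')); // true: presentation form of ژ
console.log(anyArabicScript.test('a'));      // false: Latin letter
```

The second and third results show why a script-wide match is too broad for Persian-only validation.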
In order to only cover Persian we need the following:
- The subset of Farsi characters within the standard Arabic range, i.e. (U+0621-U+0624, U+0626-U+063A, U+0641-U+0642, U+0644-U+0648)
- The standard Arabic diacritics (U+064B-U+0652)
- The 2 additional diacritics (U+0654, U+0670)
- The 4 extra Farsi characters "گ چ پ ژ" (U+067E, U+0686, U+0698, U+06AF)
- U+06A9: Persian Kaf (formally: "Arabic Letter Keheh"; different notation from Arabic Kaf)
- U+06CC: Farsi Yeh (a different notation from the Arabic Yeh)
- U+200C: Zero-Width-Non-Joiner
So, the resulting regexp would be:
^[\u0621-\u0624\u0626-\u063A\u0641-\u0642\u0644-\u0648\u064B-\u0652\u067E\u0686\u0698\u06AF\u06CC\u06A9\u0654\u0670\u200C]+$
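The class can also be assembled piece by piece from the list above, which makes each part easier to audit. A JavaScript sketch, with `parts` and `persianText` as illustrative names:

```javascript
// One entry per bullet in the list above, kept as regex source text.
const parts = [
  String.raw`\u0621-\u0624\u0626-\u063A\u0641-\u0642\u0644-\u0648`, // Farsi subset of the Arabic letters
  String.raw`\u064B-\u0652`,            // standard Arabic diacritics
  String.raw`\u0654\u0670`,             // the 2 additional diacritics
  String.raw`\u067E\u0686\u0698\u06AF`, // پ چ ژ گ
  String.raw`\u06A9`,                   // Persian Kaf
  String.raw`\u06CC`,                   // Farsi Yeh
  String.raw`\u200C`,                   // zero-width non-joiner
];
const persianText = new RegExp(`^[${parts.join('')}]+$`);

const persianWord = '\u0645\u06CC\u200C\u062F\u0627\u0646\u06CC'; // "mi-dani" with ZWNJ
console.log(persianText.test(persianWord)); // true
console.log(persianText.test('\u064A'));    // false: Arabic Yeh
console.log(persianText.test('\u0643'));    // false: Arabic Kaf
```

This class contains neither spaces nor digits, so add a space or \u06F0-\u06F9 to `parts` if the input may contain them.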
See also the exemplar characters for Persian listed here:
http://unicode.org/cldr/trac/browser/trunk/common/main/fa.xml
Answers 8
I'm not sure regex is the way to do this. The problem is not specific to Persian or Arabic text; it applies to Chinese and Russian text too. So perhaps you could check whether the character exists in your code page: if it is not in the code page, I doubt the user can insert it using an input device.
// Round trip through code page 1256: string -> bytes -> string.
var encoding = Encoding.GetEncoding(1256);
var expect = "گ چ پ ژ";
var actual = encoding.GetBytes(expect);
Assert.AreEqual(expect, encoding.GetString(actual));
The test performs a round trip: the input should survive the conversion from string to bytes and back. The link shows the supported code pages.
Happy coding
Walter