Is it possible to convert an uploaded file to UTF-8 on my end?
I have access to the file after submission through $_FILES['file']['tmp_name'].
Note: The user can upload a CSV file in any charset; I usually encounter an unknown 8-bit charset.
I tried:
$row = array();
$datas = file($_FILES['file']['tmp_name']);
foreach ($datas as $data) {
    $data = mb_convert_encoding($data, 'UTF-8');
    $row[] = explode(',', $data);
}
But the problem is that this code removes special characters such as the single quote.
My first question: does htmlspecialchars remove the value inside the array?
I mention it as additional information. Thanks to anyone who can help!
4 Answers
Answers 1
Try this out.
The example I used comes from something I was doing in a test environment, so you might need to change the code slightly.
I had a text file with the following data in:
test café áÁÁÁááá žžœš¥± ÆÆÖÖÖasØØ ß
Then I had a form which took a file input in and performed the following code:
<?php
// Regroup the upload array so each uploaded file gets its own entry:
// $files["file"][0]["name"], $files["file"][0]["tmp_name"], ...
function neatify_files(&$files) {
    $tmp = array();
    foreach ($files as $field => $info) {
        for ($j = 0; $j < count($info["name"]); $j++) {
            foreach (array("name", "type", "tmp_name", "error", "size") as $key) {
                $tmp[$field][$j][$key] = $info[$key][$j];
            }
        }
    }
    return $files = $tmp;
}

if (isset($_POST["submit"])) {
    neatify_files($_FILES);
    $file = $_FILES["file"][0];
    $handle = fopen($file["tmp_name"], "r");
    // Compare against false so that a line containing only "0" does not end the loop.
    while (($line = fgets($handle)) !== false) {
        // With only UTF-8 in the list and strict mode on, this returns
        // "UTF-8" for a valid UTF-8 line and false for anything else.
        $enc = mb_detect_encoding($line, "UTF-8", true);
        if ($enc === false) {
            // Assumption: non-UTF-8 input is Windows-1252.
            // Replace this with the real source charset if you know it.
            $line = iconv("Windows-1252", "UTF-8", $line);
        }
        // Escape the line before printing it as HTML.
        echo "<p>" . htmlspecialchars($line) . "</p>";
    }
    fclose($handle);
}
?>
<form action="<?= $_SERVER["PHP_SELF"]; ?>" method="POST" enctype="multipart/form-data">
    <input type="file" name="file[]" />
    <input type="submit" name="submit" value="Submit" />
</form>
The function neatify_files is something I wrote to make the $_FILES array more logical in its layout.
The form is a standard form that simply POSTs the data to the server.
Note: Using $_SERVER["PHP_SELF"] is a security risk, see here for more.
When the data is posted I store the file in a variable. Obviously, if you are using the multiple attribute your code won't look quite like this.
$handle is a read-only handle to the uploaded file, hence the "r" argument; the lines are then read one at a time with fgets.
$enc uses the mb_detect_encoding function to detect the encoding (duh).
At first I was having trouble obtaining the correct encoding. Setting encoding_list to only UTF-8 and strict to true solved that. With those arguments the function returns "UTF-8" for a valid UTF-8 line and false for anything else.
If the encoding is UTF-8 I simply print the line; if it isn't, I convert it to UTF-8 using the iconv function.
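The check-then-convert step described above can be isolated into a small helper. This is only a sketch: the name to_utf8_line and the Windows-1252 fallback are assumptions (pick whatever charset your uploads really use), and it uses mb_convert_encoding rather than iconv so that it always returns a string.

```php
<?php
// Hypothetical helper: return $line as UTF-8.
// Assumption: anything that is not valid UTF-8 is encoded in $fallback.
function to_utf8_line(string $line, string $fallback = 'Windows-1252'): string
{
    // Strict detection against UTF-8 only: "UTF-8" if valid, false otherwise.
    if (mb_detect_encoding($line, 'UTF-8', true) !== false) {
        return $line; // already UTF-8, leave untouched
    }
    // Not valid UTF-8: reinterpret the bytes using the fallback charset.
    return mb_convert_encoding($line, 'UTF-8', $fallback);
}

// "\xE9" is "é" in Windows-1252; the result is the two-byte UTF-8 form.
echo to_utf8_line("caf\xE9"), "\n";
```

If the fallback charset is wrong the output will still be valid UTF-8, just with the wrong characters, so it is worth checking a few known words from a real upload.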
Answers 2
Before you can convert it to UTF-8, you need to know what character set it is in. If you can't figure that out, you can't convert it to UTF-8 in any sane way. However, an insane way to convert it, if the encoding cannot be determined, is to simply strip any bytes that don't happen to be valid in UTF-8. You might be able to use that as a fallback.
Warning: untested code (I'm suddenly in a hurry), but it may look something like this:
foreach ($datas as $data) {
    $encoding = guess_encoding($data);
    if (empty($encoding)) {
        // Encoding cannot be determined. As a fallback, strip any bytes
        // that are not valid (with "ASCII" as the source, that means
        // every byte above 0x7F). Obviously this isn't a reliable
        // conversion scheme, and it could probably be improved.
        // Depending on the iconv implementation, this call may return
        // false on invalid input instead of skipping it.
        $data = iconv("ASCII", "UTF-8//TRANSLIT//IGNORE", $data);
    } else {
        $data = mb_convert_encoding($data, 'UTF-8', $encoding);
    }
    $row[] = explode(',', $data);
}

// Let mbstring pick from every encoding it supports, except
// pseudo-encodings that are not real character sets.
function guess_encoding(string $str): string
{
    $blacklist = array(
        'pass', 'auto', 'wchar', 'byte2be', 'byte2le', 'byte4be', 'byte4le',
        'BASE64', 'UUENCODE', 'HTML-ENTITIES', '7bit', '8bit'
    );
    $encodings = array_flip(mb_list_encodings());
    foreach ($blacklist as $tmp) {
        unset($encodings[$tmp]);
    }
    $encodings = array_keys($encodings);
    $detected = mb_detect_encoding($str, $encodings, true);
    // mb_detect_encoding returns false on failure; the cast turns that into "".
    return (string) $detected;
}
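For the "strip whatever isn't valid UTF-8" fallback specifically, mbstring can do the stripping itself. The sketch below is one possible way, not the code from this answer: strip_invalid_utf8 is a hypothetical name, and it relies on mb_substitute_character('none') making mb_convert_encoding drop invalid sequences instead of replacing them with "?".

```php
<?php
// Hypothetical fallback: drop every byte sequence that is not valid UTF-8.
function strip_invalid_utf8(string $str): string
{
    $previous = mb_substitute_character();  // remember the current setting
    mb_substitute_character('none');        // drop invalid sequences, do not write "?"
    // Converting UTF-8 to UTF-8 validates the input and applies the setting above.
    $clean = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);     // restore the setting for other code
    return $clean;
}

// "\xFF" can never appear in UTF-8, so it is removed from the result.
echo strip_invalid_utf8("abc\xFFdef"), "\n";
```

Like the iconv version, this loses data: every accented character from an 8-bit charset disappears, so it only makes sense as a last resort.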
Answers 3
You can convert the file text to binary data with the following functions:
// Convert a string of "0"/"1" characters (8 per byte) back into text.
function bin2text($bin_str)
{
    $text_str = '';
    $chars = str_split(str_replace("\n", '', $bin_str), 8);
    foreach ($chars as $char) {
        $text_str .= chr(bindec($char));
    }
    return $text_str;
}

// Convert text into a string of "0"/"1" characters, 8 per byte.
function text2bin($txt_str)
{
    $len = strlen($txt_str);
    $bin = '';
    for ($i = 0; $i < $len; $i++) {
        // Left-pad each byte's binary form with zeros to 8 digits.
        $bin .= str_pad(decbin(ord($txt_str[$i])), 8, '0', STR_PAD_LEFT);
    }
    return $bin;
}
After converting the data to binary, you simply pass the text to the PHP function mb_convert_encoding($fileText, "UTF-8");
Answers 4
Let's try this:
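function encode_utf8($data)
{
    if ($data === null || $data === '') {
        return $data;
    }
    if (!mb_check_encoding($data, 'UTF-8')) {
        // No source encoding is given, so mbstring assumes its
        // internal/default encoding as the source.
        return mb_convert_encoding($data, 'UTF-8');
    } else {
        return $data;
    }
}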
Usage:
$content = file_get_contents($_FILES['file']['tmp_name']);
$content = encode_utf8($content);
$rows = explode("\n", $content);
foreach ($rows as $row) {
    print_r($row);
}
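Since the upload is a CSV file, the rows still need to be split into fields after conversion. A plain explode on "," breaks on quoted fields that contain commas, so here is a sketch using str_getcsv instead. The helper name csv_rows is an assumption, and it assumes the content is already UTF-8 and that no quoted field contains a line break.

```php
<?php
// Hypothetical helper: split already-converted UTF-8 text into CSV rows.
function csv_rows(string $content): array
{
    $rows = array();
    // Accept Windows (\r\n), Unix (\n) and old Mac (\r) line endings.
    foreach (preg_split('/\r\n|\n|\r/', $content) as $line) {
        if ($line === '') {
            continue; // skip blank lines, e.g. after a trailing newline
        }
        // Delimiter, enclosure and escape character are passed explicitly.
        $rows[] = str_getcsv($line, ',', '"', '\\');
    }
    return $rows;
}

// The quoted comma stays inside its field and the single quote is kept.
print_r(csv_rows("name,note\n\"O'Brien\",\"a, b\"\n"));
```

If fields can contain line breaks, reading the file with fgetcsv on a file handle is the safer route.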