Author Topic: No support for unicode surrogates | emoji  (Read 1030 times)

Anonan

  • Jr. Member
  • **
  • Posts: 29
Re: No support for unicode surrogates | emoji
« Reply #30 on: January 10, 2019, 06:07:31 AM »
A Unicode surrogate pair is an ordinary Unicode character, except that its code point lies in the range U+010000 to U+10FFFF, which requires two 16-bit code units in UTF-16.
https://en.wikipedia.org/wiki/UTF-16#U+010000_to_U+10FFFF

Such a character can also be represented in UTF-8.
It is the same character, only with a different byte representation in UTF-8 and in UTF-16.

https://unicodebook.readthedocs.io/definitions.html#character-string
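
To make the point above concrete, here is a minimal Python sketch of how one code point above U+FFFF is split into two UTF-16 code units, and how the same character looks in UTF-8. The helper name to_surrogate_pair and the sample character U+1F600 are my own choices for illustration; neither comes from ExifTool.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into its UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low (trail) surrogate
    return high, low


emoji = "\U0001F600"  # sample character, chosen only as an example
high, low = to_surrogate_pair(ord(emoji))

print(f"U+{ord(emoji):04X} -> {high:04X} {low:04X}")
print("UTF-16BE bytes:", emoji.encode("utf-16-be").hex())
print("UTF-8 bytes:   ", emoji.encode("utf-8").hex())
```

The two code units D83D DE00 and the four UTF-8 bytes F0 9F 98 80 both describe the single character U+1F600.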

Phil Harvey

  • ExifTool Author
  • Administrator
  • ExifTool Freak
  • *****
  • Posts: 14759
    • ExifTool Home Page
Re: No support for unicode surrogates | emoji
« Reply #31 on: January 10, 2019, 12:38:08 PM »
Bug:
If a folder contains a file whose name includes a surrogate pair, all other files with non-ASCII names will be encoded with ANSI encoding (other data is encoded with UTF-8).

Try specifying -charset filename=YOUR_SYSTEM_CHARACTER_SET.

- Phil

Anonan

Re: No support for unicode surrogates | emoji
« Reply #32 on: January 10, 2019, 03:55:07 PM »
Bug:
If a folder contains a file whose name includes a surrogate pair, all other files with non-ASCII names will be encoded with ANSI encoding (other data is encoded with UTF-8).

Try specifying -charset filename=YOUR_SYSTEM_CHARACTER_SET.

- Phil

It doesn't work.
Windows uses UTF-16 for file names, but the program says Invalid Charset utf16 (the same for UTF-16, UTF16 and utf-16).
With any other charset that the program accepts (utf8, cp1251) I get Error: File not found

ёшэшщ is mojibake. It should be "синий".



The stderr text is encoded with ANSI (in my case ANSI is cp1251).
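
The mojibake above can be reproduced with a short Python sketch. It assumes the console decoded the cp1251 bytes as cp866 (the usual OEM code page on a Russian Windows system). The post only states that stderr is cp1251, so the cp866 part is an assumption.

```python
original = "синий"

# Bytes as a program writing ANSI (cp1251) text would emit them.
raw = original.encode("cp1251")

# Assumption: the console interprets those bytes with the OEM code page cp866.
garbled = raw.decode("cp866")
print(garbled)

# The damage is reversible as long as the bytes themselves are intact.
restored = garbled.encode("cp866").decode("cp1251")
print(restored)
```

If that assumption holds, the bytes themselves are correct and only the code page used to display them is wrong.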
« Last Edit: January 10, 2019, 04:07:39 PM by Anonan »

Anonan

Re: No support for unicode surrogates | emoji
« Reply #33 on: January 10, 2019, 04:28:20 PM »
I have updated the bug description.

Bug:
If a folder contains a file whose name includes a surrogate pair, then the output lines that contain a file name, for all files in this folder, will be encoded with ANSI* encoding (other data is encoded with UTF-8 by default).
And if you use -json, the characters in this ANSI string that are outside the ASCII charset are simply replaced by question marks (??.??.?).

*ANSI is cp1251 in my case.
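
A small Python sketch of why this mix is a problem for anything that reads the output as UTF-8. The file name is hypothetical, and mask_non_ascii only imitates the observed "?" replacement. It is not how ExifTool is implemented.

```python
name = "синий.jpg"  # hypothetical file name, for illustration only

ansi_bytes = name.encode("cp1251")   # what the file-name lines appear to contain
utf8_bytes = name.encode("utf-8")    # what the rest of the output uses


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def mask_non_ascii(data: bytes) -> str:
    """Replace every byte outside ASCII with '?', imitating the -json symptom."""
    return bytes(b if b < 0x80 else ord("?") for b in data).decode("ascii")


print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(ansi_bytes))
print(mask_non_ascii(ansi_bytes))
```

The cp1251 bytes are not valid UTF-8, so a reader that expects UTF-8 cannot recover the name from them, and after masking only the ASCII part (.jpg) survives.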


Fix: this applies only to lines that contain a file name.




The folder structure:
« Last Edit: January 10, 2019, 05:20:24 PM by Anonan »

Anonan

Re: No support for unicode surrogates | emoji
« Reply #34 on: January 10, 2019, 04:52:49 PM »
Surrogate pairs within a metadata tag are processed correctly: I get a valid UTF-8 character in result.txt.

I can copy and paste these 6 bytes, and the character is displayed correctly.

But with -json this data is lost.



One more example:
« Last Edit: January 10, 2019, 05:14:03 PM by Anonan »

Anonan

Re: No support for unicode surrogates | emoji
« Reply #35 on: January 16, 2019, 07:26:02 PM »
A PowerShell script to find all files whose names contain a surrogate pair:

Get-ChildItem -Recurse -Force | Where-Object -FilterScript {$_.name -match "[\uD800-\uDBFF][\uDC00-\uDFFF]"}
or, with the built-in aliases:
ls -r -fo | where {$_.name -match "[\uD800-\uDBFF][\uDC00-\uDFFF]"}
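
For anyone without PowerShell, roughly the same search can be sketched in Python. Python strings are sequences of code points rather than UTF-16 code units, so instead of matching a high surrogate followed by a low surrogate, the check looks for any code point above U+FFFF. The function names are my own.

```python
import os
from typing import Iterator


def has_supplementary_char(name: str) -> bool:
    """True if the name has a character that needs a surrogate pair in UTF-16."""
    return any(ord(ch) > 0xFFFF for ch in name)


def find_names_with_surrogate_pairs(root: str) -> Iterator[str]:
    """Yield paths of files and folders under root whose names match."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if has_supplementary_char(name):
                yield os.path.join(dirpath, name)
```

To list the matches, loop over find_names_with_surrogate_pairs(".") and print ascii(path) for each one. Printing the escaped form avoids encoding errors when the console or a redirected file uses an ANSI code page.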




This is the output as shown in Notepad++ and in Windows Notepad.
I can change the encoding to UTF-8 in Windows Notepad. After that, Notepad++ displays the \u{XXXXX} characters correctly.

(It seemed weird to me that Notepad++ supports only UCS-2 and not UTF-16.)


This is the same file, converted to UTF-8 and opened in Notepad++.
« Last Edit: January 16, 2019, 08:21:00 PM by Anonan »