Author Topic: No support for unicode surrogates | emoji  (Read 798 times)

Anonan

  • Jr. Member
  • **
  • Posts: 22
No support for unicode surrogates | emoji
« on: January 01, 2019, 01:58:36 PM »
The program throws the exception "No support for unicode surrogates at script/exiftool line 3553." when you run it on files whose names contain emoji.

Examples of file names: (see the attachment).
This forum does not support emoji either, so I can't post example file names here.


I don't like emoji either, and I don't use them, but other people do, so support for them is needed.

Phil Harvey

  • ExifTool Author
  • Administrator
  • ExifTool Freak
  • *****
  • Posts: 14425
    • ExifTool Home Page
Re: No support for unicode surrogates | emoji
« Reply #1 on: January 01, 2019, 02:22:19 PM »
Windows special characters are really a pain.  (I'm assuming you are on Windows.)

What version of ExifTool are you using?

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

Anonan

  • Jr. Member
  • **
  • Posts: 22
Re: No support for unicode surrogates | emoji
« Reply #2 on: January 01, 2019, 02:31:57 PM »
11.2.2.0 and 11.2.3.0 (I have just tested the latter; the result is the same). Yes, I use Windows 10.

I have also tried both cmd.exe and Git Bash.

Anonan

  • Jr. Member
  • **
  • Posts: 22
Re: No support for unicode surrogates | emoji
« Reply #3 on: January 01, 2019, 03:03:00 PM »
It also does not support symbols like https://en.wiktionary.org/wiki/º (not to be confused with https://en.wikipedia.org/wiki/Degree_symbol; ExifTool handles ° normally).
Example of a file name: "360º Test.mp4"
In this case the program just writes "No matching files".
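
A possible explanation for the difference between º and °, sketched in Python. It assumes the system ANSI code page is cp1251 (common on Russian Windows; this is an assumption, and the helper name fits_code_page is made up for illustration). The degree sign exists in cp1251, but the masculine ordinal indicator does not, so a name containing º cannot survive a round trip through that code page:

```python
# Assumption: the ANSI code page is cp1251 (common on Russian Windows).
def fits_code_page(text, code_page="cp1251"):
    """Return True if every character of `text` can be encoded in `code_page`."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


degree = "\u00b0"    # DEGREE SIGN
ordinal = "\u00ba"   # MASCULINE ORDINAL INDICATOR

print(fits_code_page("360" + degree + " Test.mp4"))            # True
print(fits_code_page("360" + ordinal + " Test.mp4"))           # False
print(fits_code_page("360" + ordinal + " Test.mp4", "utf-8"))  # True
```

UTF-8 (code page 65001) can represent both characters, so switching the code page is worth trying for this particular name.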

StarGeek

  • Global Moderator
  • ExifTool Freak
  • *****
  • Posts: 2368
Re: No support for unicode surrogates | emoji
« Reply #4 on: January 01, 2019, 03:17:28 PM »
It also does not support symbols like https://en.wiktionary.org/wiki/º (not to be confused with https://en.wikipedia.org/wiki/Degree_symbol; ExifTool handles ° normally).
Example of a file name: "360º Test.mp4"
In this case the program just writes "No matching files".

This would seem to be a FAQ #18 answer, as when I change the code page to 65001, it works fine.

Code:
C:\>exiftool -g1 -a -s -PNG:all "Y:\!temp\bb\360º Test.png"
---- PNG ----
ImageWidth                      : 336
ImageHeight                     : 509
BitDepth                        : 8
ColorType                       : Grayscale with Alpha
Compression                     : Deflate/Inflate
Filter                          : Adaptive
Interlace                       : Noninterlaced
Gamma                           : 2.2
WhitePointX                     : 0.3127
WhitePointY                     : 0.329
RedX                            : 0.64
RedY                            : 0.33
GreenX                          : 0.3
GreenY                          : 0.6
BlueX                           : 0.15
BlueY                           : 0.06
BackgroundColor                 : 255
Label                           : FinalDesignArt
ModifyDate                      : 2018:11:15 11:02:46
Troubleshooting hints:
* When posting, include your OS, Exiftool version, and type of file you're processing (MP4, JPG, etc).
* Double all percent signs (%) in a Windows batch file.
* If your GPS coords are negative, make sure and set the GpsLatitudeRef and GpsLongitudeRef tags correctly.

Phil Harvey

  • ExifTool Author
  • Administrator
  • ExifTool Freak
  • *****
  • Posts: 14425
    • ExifTool Home Page
Re: No support for unicode surrogates | emoji
« Reply #5 on: January 01, 2019, 05:15:31 PM »
I can't figure out that line number.  Line 3553 of exiftool version 11.22 doesn't do anything that could possibly generate a warning like that. :/

I guess I'll have to try this myself when I can.

What was the exact command you used?  (Maybe do a screen grab of the command and the warning you get.)

- Phil

Anonan

  • Jr. Member
  • **
  • Posts: 22
Re: No support for unicode surrogates | emoji
« Reply #6 on: January 02, 2019, 06:13:46 AM »
Strangely, today I get the exception on line 3547. (The result is the same for both 11.2.2 and 11.2.3; Win 10, RUS; "chcp 65001" has no effect on the results.)

I run "exiftool.exe *" in a folder that contains one or more files with emoji in their names.
File names: https://pastebin.com/gtNj96mg (I cannot post them here; otherwise I get the forum error "The message body was left empty.")
The only output is:
"No support for unicode surrogates at script/exiftool line 3547."
Nothing else is printed to the console.



> Maybe do a screen grab of the command and the warning you get.
Ok, I will do this later.

Phil Harvey

  • ExifTool Author
  • Administrator
  • ExifTool Freak
  • *****
  • Posts: 14425
    • ExifTool Home Page
Re: No support for unicode surrogates | emoji
« Reply #7 on: January 02, 2019, 07:15:27 AM »
OK.  Line 3547 would be an error in the Win32::FindFile package.  There isn't much I can do about this.

Try not using wildcards when you specify file names on the command line.

- Phil

Anonan

  • Jr. Member
  • **
  • Posts: 22
Re: No support for unicode surrogates | emoji
« Reply #8 on: January 02, 2019, 07:39:44 AM »
Oh, wait. The error on line 3553 occurs when I just use "exiftool.exe FILENAME".
Wildcards work fine when there are no files with such names.

Look at the attachment.
(Mirror: https://i.imgur.com/opg7Rj9.png)

CMD displays emoji incorrectly, but handles them correctly.
I can even copy these ⍰⍰ and paste them into a text editor that can display unicode surrogates, and see the correct "icon".

I can also concatenate all files into one with "copy /b *.txt concated.txt", and this command works fine even if the file names contain unicode surrogates (CMD just displays them as ⍰⍰).
« Last Edit: January 02, 2019, 08:20:36 AM by Anonan »

Phil Harvey

  • ExifTool Author
  • Administrator
  • ExifTool Freak
  • *****
  • Posts: 14425
    • ExifTool Home Page
Re: No support for unicode surrogates | emoji
« Reply #9 on: January 02, 2019, 08:47:20 AM »
OK.  The underlying problem is that Win32::FindFile does not support these surrogate codes.  The reason I'm using Win32::FindFile in the first place is because of the lack of built-in support in ActivePerl for Windows Unicode file names.  The situation is unfortunate, but one possible work-around could be to create a hard link with a plain ASCII name to the file with the surrogate characters, then run exiftool on the hard link.
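
A minimal sketch of that hard-link workaround, assuming Python 3.7+ is available on the Windows machine and that the link is created on the same NTFS volume as the original file. The function name and the "link_<n>" naming scheme are hypothetical, chosen only for illustration:

```python
import os
import tempfile


def make_ascii_hard_link(path, link_dir=None):
    """Create a hard link with an ASCII-only name pointing at `path`.

    Returns the path of the new link. The "link_<n><ext>" naming scheme is
    hypothetical; any unused ASCII name would do.
    """
    path = os.path.abspath(path)
    link_dir = link_dir or os.path.dirname(path)
    ext = os.path.splitext(path)[1]
    if not ext.isascii():
        ext = ""  # drop the extension rather than carry non-ASCII over
    n = 0
    while True:
        candidate = os.path.join(link_dir, "link_%d%s" % (n, ext))
        if not os.path.exists(candidate):
            break
        n += 1
    os.link(path, candidate)  # hard link: same file data, second name
    return candidate


if __name__ == "__main__":
    # Demo in a temporary folder with an apple emoji in the file name.
    folder = tempfile.mkdtemp()
    original = os.path.join(folder, "apple_\U0001F34E_apple.jpg")
    with open(original, "wb") as f:
        f.write(b"not a real image")
    print(os.path.basename(make_ascii_hard_link(original)))
```

The link can then be passed to exiftool in place of the original name, and deleted afterwards without affecting the original file.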

- Phil

Anonan

  • Jr. Member
  • **
  • Posts: 22
Re: No support for unicode surrogates | emoji
« Reply #10 on: January 02, 2019, 09:24:52 AM »
Can the program just skip the files with unicode surrogates in their names instead of stopping?
And at the end, write out the names of the skipped files so that I can process them manually.

I need to get meta info from a lot of files, and only a few of them contain unicode surrogates in their names, but the program does not work at all in this case.

Phil Harvey

  • ExifTool Author
  • Administrator
  • ExifTool Freak
  • *****
  • Posts: 14425
    • ExifTool Home Page
Re: No support for unicode surrogates | emoji
« Reply #11 on: January 02, 2019, 09:30:14 AM »
I'll see what I can do.

- Phil

Phil Harvey

  • ExifTool Author
  • Administrator
  • ExifTool Freak
  • *****
  • Posts: 14425
    • ExifTool Home Page
Re: No support for unicode surrogates | emoji
« Reply #12 on: January 02, 2019, 10:39:07 AM »
I've managed to reproduce this.  (The hardest part was figuring out how to create a file with a surrogate character in its name.  I finally did it by creating the file on a Mac then sending it to the Windows machine.)

I will patch ExifTool 11.24 to catch this error from Win32::FindFile and issue a warning or error instead.

Thanks for this report.

- Phil

Anonan

  • Jr. Member
  • **
  • Posts: 22
Re: No support for unicode surrogates | emoji
« Reply #13 on: January 02, 2019, 11:11:56 AM »
> And at the end write the names of the files that were skipped to be processed manually by me.
It would probably be better to also show them at the start (in the "err" stream), so that I can stop the program, fix the names and restart it, rather than run it twice.
A run can take several minutes when you have several gigabytes of data.


> The hardest part was figuring out how to create a file with a surrogate character in its name.
For example, right-click on a text input in Chrome/Opera and choose the first option in the context menu.




Phil Harvey

  • ExifTool Author
  • Administrator
  • ExifTool Freak
  • *****
  • Posts: 14425
    • ExifTool Home Page
Re: No support for unicode surrogates | emoji
« Reply #14 on: January 02, 2019, 12:27:44 PM »
It would probably be better to also show them at the start (in the "err" stream), so that I can stop the program, fix the names and restart it.

This is problematic.  For one, there will likely be a problem interpreting the file name(s) in the ExifTool stderr messages due to character set problems.  I'll be outputting these messages in UTF-8.  The other thing is that it would be very hard for me to find these files beforehand.  So you will unfortunately be stuck trying to process them in a second pass.

- Phil

Anonan

  • Jr. Member
  • **
  • Posts: 22
Re: No support for unicode surrogates | emoji
« Reply #15 on: January 02, 2019, 01:04:58 PM »
At present I get an error if there is even one file with such a name (I use a command like "exiftool * > out.txt"), and no useful work is done.
It looks like the program gets the names of all files first and parses them, and only then works with the parsed file names (and the real files) if there was no exception. Is that right?

If so, when an exception is thrown it would be enough to write the file name to stderr and continue parsing the remaining file names.
(The useful output comes after all file names are parsed, so all "broken" file names would already be listed in stderr before the first metadata is written to stdout.)

That's my guess.

Phil Harvey

  • ExifTool Author
  • Administrator
  • ExifTool Freak
  • *****
  • Posts: 14425
    • ExifTool Home Page
Re: No support for unicode surrogates | emoji
« Reply #16 on: January 02, 2019, 01:41:13 PM »
It looks like the program gets the names of all files first and parses them.

I think this is true for each FILE argument on the command line.  But if you specify multiple FILE arguments then the files are processed before considering the next argument.

- Phil

Anonan

  • Jr. Member
  • **
  • Posts: 22
Re: No support for unicode surrogates | emoji
« Reply #17 on: January 02, 2019, 04:50:40 PM »
I have tested the program some more.
I used this simple command:
exiftool -r *

And I have the following folder structure:

 [folder1 (run the program here)]
     pic1.jpg
     pic2.jpg
     pic3.jpg
     [sub_folder2]
         pic3.jpg
         pic4.jpg


At the moment the program works like this:
It parses all names (folders as well as files) in folder1.
If the names are "clean" (they contain no unicode surrogate pair), it processes the files in this folder (outputs their meta info).
If the name of any file or subfolder contains a unicode surrogate pair, it throws the exception and none of the other files are processed.
Then it goes into the subfolders and repeats this algorithm.


So, without changing the program logic, ExifTool cannot provide a full list of the files whose names contain a unicode surrogate pair before it has processed some of the files, if the source folder has a subfolder.


« Last Edit: January 02, 2019, 05:04:14 PM by Anonan »

Anonan

  • Jr. Member
  • **
  • Posts: 22
Re: No support for unicode surrogates | emoji
« Reply #18 on: January 03, 2019, 12:21:48 PM »
So, without changing the program logic, ExifTool cannot provide a full list of the files whose names contain a unicode surrogate pair before it has processed some of the files, if the source folder has a subfolder.
But I think it would not be hard to do with a new option (something like -fulltreebypassfirst, though that is not a good name) that tells the program to walk the full file tree first
(in this context, that would give the full list of files whose names contain a unicode surrogate pair before any file is processed to extract meta info), and then work in the normal way.

Phil Harvey

  • ExifTool Author
  • Administrator
  • ExifTool Freak
  • *****
  • Posts: 14425
    • ExifTool Home Page
Re: No support for unicode surrogates | emoji
« Reply #19 on: January 03, 2019, 01:14:59 PM »
This is actually similar to what the -progress option does.  But I have to find some time to look into this in more detail.

- Phil

Anonan

  • Jr. Member
  • **
  • Posts: 22
Re: No support for unicode surrogates | emoji
« Reply #20 on: January 08, 2019, 11:55:11 AM »
I have updated to version 11.24.
And now the program behaves very strangely when a file name contains a surrogate pair.

I have created some folders for testing; check the attachment.


I used CMD and Git Bash on Windows 10 Rus.
My console output (from the .bat and .sh files) is also in the corresponding folder.

Phil Harvey

  • ExifTool Author
  • Administrator
  • ExifTool Freak
  • *****
  • Posts: 14425
    • ExifTool Home Page
Re: No support for unicode surrogates | emoji
« Reply #21 on: January 08, 2019, 12:06:19 PM »
What are you calling strange?  The file that ExifTool reads as APPLE_~1.JPG in example2?  This is the behaviour of the standard library, which I am using if Win32::FindFile fails.  Apparently the standard library falls back to using Windows short filenames for the files with Unicode characters, but I think this depends on your system settings.  I agree that this is strange.  From this post by StarGeek it seems you can see these 8.3 filenames with dir /x.

- Phil

Anonan

  • Jr. Member
  • **
  • Posts: 22
Re: No support for unicode surrogates | emoji
« Reply #22 on: January 08, 2019, 01:13:14 PM »
For example:

I use exiftool.exe -FileName * -r

1.
> ex1cmd.png and ex1cmd_cp866
https://imgur.com/a/LH4dG0p

After processing the file with a surr name (a name that contains a surrogate pair), the program can't work with a file in the same folder whose name contains Cyrillic characters.
It writes "Invalid filename encoding" and "Error opening directory".

2.
> ex2cmd.png and ex2cmd_cp866
> ex2bash.png
https://imgur.com/a/NTETU5x

The program does not work at all if a file with a surr name exists in the root folder and I use CMD. It handles "*" incorrectly in this case.
But if I use Bash, this file is skipped with an Error, and the program continues to work.


3.
> ex2bash.png
https://i.imgur.com/Qwt9r48.png

The Warning contains only the folder name, without the file name. (The Error looks normal; it shows the file name, like "apple_??_apple.jpg".)


4.
> ex0bash.png, ex1bash.png, ex2bash.png
https://imgur.com/a/eMze8B1

The charset of Cyrillic file names can differ within the program's output when I use Git Bash.
Some file names are output in a charset that displays them correctly, others in a charset that displays them incorrectly.


5. The ex*bash_ls.png and ex*cmd_dir.png pictures demonstrate that both consoles can display the file names correctly by themselves.

Phil Harvey

  • ExifTool Author
  • Administrator
  • ExifTool Freak
  • *****
  • Posts: 14425
    • ExifTool Home Page
Re: No support for unicode surrogates | emoji
« Reply #23 on: January 08, 2019, 01:42:45 PM »
OK.  So apparently the 11.24 patch wasn't much help.  Was the previous behaviour better?

The globbing of filenames with wildcards in Windows is a problem that I may not be able to solve.  I have always recommended avoiding the use of wildcards in file names.  Does the situation improve if you do this instead?:

exiftool.exe -filename -ext "*" -r .

- Phil

Anonan

  • Jr. Member
  • **
  • Posts: 22
Re: No support for unicode surrogates | emoji
« Reply #24 on: January 09, 2019, 08:35:50 AM »
Well, the dot at the end of the command does the trick.
exiftool.exe -FileName -r -ext * .
I didn't even notice it at first.
Now CMD and Git Bash behave almost the same when a file with a surr name exists in the root folder, except that Git Bash has the problem with Cyrillic names (see my previous post).
« Last Edit: January 09, 2019, 09:02:34 AM by Anonan »

Phil Harvey

  • ExifTool Author
  • Administrator
  • ExifTool Freak
  • *****
  • Posts: 14425
    • ExifTool Home Page
Re: No support for unicode surrogates | emoji
« Reply #25 on: January 09, 2019, 08:51:58 AM »
I went to add a note about this to the common mistakes documentation, and discovered it was already there (common mistake 2f).  I had forgotten about this, but the problem was worse with surrogates in the name.

- Phil

Anonan

  • Jr. Member
  • **
  • Posts: 22
Re: No support for unicode surrogates | emoji
« Reply #26 on: January 09, 2019, 01:00:11 PM »
So.
I have the following structure (where % is the apple emoji):

root_apple_%_apple.jpg
root_green.jpg
root_синий.jpg
folder1
    folder1_green.jpg
    folder1_синий.jpg
folder2
    sub_apple_%_apple.jpg
    sub_blue.jpg
    sub_синий.jpg
    sub_%.jpg


I run
on CMD: exiftool.exe -FileName -s -r -progress -j -ext * . > o.json
on Bash: exiftool.exe -FileName -s -r -progress -j * > o2.json


CMD shows me:
Warning: [Win32::FindFile] No support for unicode surrogates - .
Warning: [Win32::FindFile] No support for unicode surrogates - ./folder2

Bash shows me:
Error: [Win32::FindFile] No support for unicode surrogates - root_apple_??_apple.jpg
Warning: [Win32::FindFile] No support for unicode surrogates - folder2



CMD trimmed output:

  "SourceFile": "./folder1/folder1_green.jpg",
  "SourceFile": "./folder1/folder1_синий.jpg",
  "SourceFile": "./folder2/SUB_AP~1.JPG",    // !1 - sub_apple_??_apple.jpg
  "SourceFile": "./folder2/sub_blue.jpg",
  "SourceFile": "./folder2/sub_??^??.jpg",   // !2 - sub_синий.jpg (I have switched one "?" with "^")
  "SourceFile": "./folder2/SUB_~2.JPG",      // !1 - sub_??.jpg
  "SourceFile": "./ROOT_A~1.JPG",            // !1 - root_apple_??_apple.jpg
  "SourceFile": "./root_green.jpg",
  "SourceFile": "./root_??^??.jpg",          // !2 - root_синий.jpg | In the console I see "root_ёшэшщ.jpg" for cp866 or "root_.jpg" for cp65001


Bash trimmed output:

  "SourceFile": "folder1/folder1_green.jpg",
  "SourceFile": "folder1/folder1_синий.jpg",
  "SourceFile": "folder2/SUB_AP~1.JPG",   // !1
  "SourceFile": "folder2/sub_blue.jpg",
  "SourceFile": "folder2/sub_??^??.jpg",  // !2
  "SourceFile": "folder2/SUB_~2.JPG",     // !1
                                          // !3
  "SourceFile": "root_green.jpg",
  "SourceFile": "root_??^??.jpg"          // !2


If I run Bash with exiftool.exe -FileName -s -r -progress -j -ext "*" . > o2.json the result is similar to the CMD result.


If you want to test this yourself, download the attachment.

I will add my comments on this later.


Anonan

  • Jr. Member
  • **
  • Posts: 22
Re: No support for unicode surrogates | emoji
« Reply #27 on: January 09, 2019, 03:50:24 PM »
An additional bug.

This concerns Git Bash on Windows (UTF-16):

This command works fine:
exiftool.exe -FileName -r -ext "*" .

But this command does not:
exiftool.exe -FileName -r  *
Non-ASCII file names in the root folder are printed with an incorrect encoding (it's ANSI, but it should be UTF-8 like the other data).



And if I use -json, all data* in ANSI encoding** is lost; it is just replaced by five "?" (??_?? (I can't post this because of automatic smiley insertion)).



*In my case, Cyrillic characters.
**Except characters that are in the ASCII charset, e.g. Latin characters.
« Last Edit: January 09, 2019, 08:47:39 PM by Anonan »

Anonan

  • Jr. Member
  • **
  • Posts: 22
Re: No support for unicode surrogates | emoji
« Reply #28 on: January 09, 2019, 08:13:48 PM »
Now let us return to my last big post.

All ??.?.?? (five ? in a row) are data lost when an ANSI-encoded string is written to the JSON file.

Bug:
If a folder contains a file whose name contains a surrogate pair, all other files with non-ASCII names are encoded with ANSI encoding (other data is encoded with UTF-8). This leads to mojibake.
https://unicodebook.readthedocs.io/definitions.html#mojibake

And if you use -json, the characters in this ANSI string that are not in the ASCII charset are just replaced by ??.??.?.
« Last Edit: January 09, 2019, 08:33:38 PM by Anonan »

Anonan

  • Jr. Member
  • **
  • Posts: 22
Re: No support for unicode surrogates | emoji
« Reply #29 on: January 09, 2019, 08:31:35 PM »
The error message "Error: [Win32::FindFile] No support for unicode surrogates - root_apple_??_apple.jpg" contains a readable file name; I think the warning should show the same information.
At present a warning shows only the folder that contains one or more files with surr names. That's not good. What if I have only one such file among a thousand? It will be hard to find. [1]

Is there a way to avoid converting a file name from sub_apple_??_apple.jpg to SUB_AP~1.JPG?
If the program listed all these files [see 1], it would be possible to fix their names manually (e.g. to remove the surr pair).
« Last Edit: January 09, 2019, 08:52:36 PM by Anonan »

Anonan

  • Jr. Member
  • **
  • Posts: 22
Re: No support for unicode surrogates | emoji
« Reply #30 on: January 10, 2019, 06:07:31 AM »
A character encoded as a unicode surrogate pair is an ordinary Unicode character, except that its code point is in the range U+010000 to U+10FFFF, which requires two 16-bit code units in UTF-16.
https://en.wikipedia.org/wiki/UTF-16#U+010000_to_U+10FFFF

Such a character can also be represented in UTF-8.
It is the same character, but with a different byte representation in UTF-8 and UTF-16.

https://unicodebook.readthedocs.io/definitions.html#character-string
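
To make this concrete, here is a small Python sketch of the standard UTF-16 arithmetic for code points above U+FFFF. The function name to_surrogate_pair is made up for illustration, and U+1F34E (red apple) is used only as an example character:

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


apple = "\U0001F34E"  # example character: red apple emoji
high, low = to_surrogate_pair(ord(apple))

print("UTF-16 code units: %04X %04X" % (high, low))  # D83C DF4E
print("UTF-16LE bytes:", apple.encode("utf-16-le").hex())  # 3cd84edf
print("UTF-8 bytes:   ", apple.encode("utf-8").hex())      # f09f8d8e
```

The same character takes two 16-bit code units (4 bytes) in UTF-16 and 4 bytes in UTF-8, but the byte values are completely different.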

Phil Harvey

  • ExifTool Author
  • Administrator
  • ExifTool Freak
  • *****
  • Posts: 14425
    • ExifTool Home Page
Re: No support for unicode surrogates | emoji
« Reply #31 on: January 10, 2019, 12:38:08 PM »
Bug:
If a folder contains a file whose name contains a surrogate pair, all other files with non-ASCII names are encoded with ANSI encoding (other data is encoded with UTF-8).

Try specifying -charset filename=YOUR_SYSTEM_CHARACTER_SET.

- Phil

Anonan

  • Jr. Member
  • **
  • Posts: 22
Re: No support for unicode surrogates | emoji
« Reply #32 on: January 10, 2019, 03:55:07 PM »
Bug:
If a folder contains a file whose name contains a surrogate pair, all other files with non-ASCII names are encoded with ANSI encoding (other data is encoded with UTF-8).

Try specifying -charset filename=YOUR_SYSTEM_CHARACTER_SET.

- Phil

It doesn't work.
Windows uses UTF-16 for file names. The program says Invalid Charset utf16 (or UTF-16, UTF16, utf-16).
With any other charset that is valid for the program (utf8, cp1251) I get Error: File not found

ёшэшщ is mojibake. It should be "синий".



The stderr text is encoded with ANSI (in my case ANSI is cp1251).
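
The mojibake above can be reproduced with a short Python sketch. It assumes the name was written as cp1251 bytes and then shown by a console using cp866, which matches the code pages mentioned in this thread. How ExifTool itself ends up with "?" characters in -json output is not known here, so the ASCII replacement step is only an illustration of the same visible effect:

```python
name = "синий"  # the Cyrillic part of the file name from the thread

# Bytes as a program using the ANSI code page (cp1251) would write them.
ansi_bytes = name.encode("cp1251")

# A cp866 console decoding those bytes shows the mojibake.
as_seen_in_cp866 = ansi_bytes.decode("cp866")
print(as_seen_in_cp866)  # ёшэшщ

# Reducing the name to ASCII replaces each Cyrillic letter with "?".
ascii_only = name.encode("ascii", errors="replace").decode("ascii")
print(ascii_only)  # ?????

# The cp1251 bytes are not valid UTF-8, so a UTF-8 reader cannot decode them.
try:
    ansi_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)  # False
```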
« Last Edit: January 10, 2019, 04:07:39 PM by Anonan »

Anonan

  • Jr. Member
  • **
  • Posts: 22
Re: No support for unicode surrogates | emoji
« Reply #33 on: January 10, 2019, 04:28:20 PM »
I have updated the bug description.

Bug:
If a folder contains a file whose name contains a surrogate pair, the output lines that contain a file name are encoded with ANSI* encoding for every file with a non-ASCII name in this folder (other data is encoded with UTF-8 by default).
And if you use -json, the characters in this ANSI string that are not in the ASCII charset are just replaced by ??.??.?.

*ANSI is cp1251 in my case.


Correction: only the lines that contain a file name are affected.




The folder structure:
« Last Edit: January 10, 2019, 05:20:24 PM by Anonan »

Anonan

  • Jr. Member
  • **
  • Posts: 22
Re: No support for unicode surrogates | emoji
« Reply #34 on: January 10, 2019, 04:52:49 PM »
A surrogate pair within a meta tag is processed well; I get a valid UTF-8 character in result.txt.

I can copy and paste these 6 bytes, and the character is displayed correctly.

But with -json this data is lost.



One more example:
« Last Edit: January 10, 2019, 05:14:03 PM by Anonan »

Anonan

  • Jr. Member
  • **
  • Posts: 22
Re: No support for unicode surrogates | emoji
« Reply #35 on: January 16, 2019, 07:26:02 PM »
A PowerShell script to find all files whose names contain a surrogate pair:

Get-ChildItem -Recurse -Force | Where-Object -FilterScript {$_.name -match "[\uD800-\uDBFF][\uDC00-\uDFFF]"}
or
ls -r -fo | where {$_.name -match "[\uD800-\uDBFF][\uDC00-\uDFFF]"}




This is the output of the PowerShell command in Notepad++ and in Windows Notepad.
I can change the encoding to UTF-8 with Windows Notepad, after which Notepad++ displays the \u{XXXXX} characters correctly.

(It was surprising to me that Notepad++ does not support UTF-16, only UCS-2.)


This is the same file, but in UTF-8, opened in Notepad++.
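
For anyone without PowerShell, the same search can be sketched in Python 3. This is only an illustration, and the function names are made up. Python strings hold whole code points, so a name that needs a surrogate pair in UTF-16 is one that contains a character above U+FFFF:

```python
import os


def needs_surrogate_pair(name):
    """True if `name` has a character above U+FFFF (a surrogate pair in UTF-16)."""
    return any(ord(ch) > 0xFFFF for ch in name)


def find_surrogate_names(root="."):
    """Walk `root` recursively and return paths of files/folders with such names."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if needs_surrogate_pair(name):
                hits.append(os.path.join(dirpath, name))
    return hits


if __name__ == "__main__":
    for path in find_surrogate_names("."):
        # ascii() escapes non-ASCII characters, so printing cannot fail
        # whatever code page the console uses.
        print(ascii(path))
```

Running this before exiftool gives the list of names to fix (or to hard-link) up front, without waiting for a first pass to finish.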
« Last Edit: January 16, 2019, 08:21:00 PM by Anonan »