Wednesday, May 7, 2008

Ouch, UTF-8 not supported?

Funny story: I was digging into FishEye's bug database and noticed a bug related to Russian localization. I thought it would be easy for me to reproduce (as I use a Polish locale), so I decided to take care of it.

The problem was that ZIP archives created by FishEye don't handle Russian characters well. It was very easy for me to reproduce. Then I started digging deeper, because I wanted to fix it.

And here's the funny thing: I thought that in 2008, after the Web 2.0 explosion, and with UTF-8 being the de facto standard for character encoding (OK, that's my assumption) now that many sites support multiple languages at the same time, archives would also support UTF-8 out of the box. Well, I was wrong :-)

I read the ZIP spec and found out that UTF-8 support was added pretty recently: according to Wikipedia, around September 2007. Ouch!
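To make the spec change concrete: the UTF-8 support in the ZIP spec is signalled by bit 11 of each entry's general purpose bit flag. Here is a minimal sketch in Python (chosen only for brevity, it is not what FishEye uses) that writes an archive in memory and checks that bit. The helper name `utf8_flag_set` and the sample file names are made up for illustration.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # bit 11 of the general purpose bit flag


def utf8_flag_set(name: str) -> bool:
    """Write a one-entry ZIP in memory and report whether the entry
    is marked as having a UTF-8 encoded file name."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(name, b"demo")
    buf.seek(0)
    with zipfile.ZipFile(buf) as zf:
        info = zf.infolist()[0]
        return bool(info.flag_bits & UTF8_FLAG)


if __name__ == "__main__":
    for sample in ("report.txt", "отчёт.txt"):
        print(sample, utf8_flag_set(sample))
```

Python's `zipfile` sets the flag only when the name cannot be stored as plain ASCII. An unpacker that does not look at this bit has to guess the encoding.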

Today I found out that 7-Zip, one of the popular archivers, is going to support UTF-8 in its upcoming release! Wow! I'm amazed. But it seems that they piss on the ZIP spec and decided to use their own approach to detecting UTF-8.

While trying to handle this issue for our customer, I tested a few other archivers: Windows Explorer, WinRAR, WinZip. None of them was able to understand my ZIP archives.

I thought, let me switch to an alternative to ZIP: tar. Ouch again! According to the spec, tar is supposed to store filenames in ASCII. It seems that the Linux implementation does not follow the spec word for word, so you actually can put UTF-8 into it. But Ant's tar implementation, which we use in FishEye, actually truncates characters to 8 bits. So it produces even prettier trash ;-)
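Here is a rough sketch of what "truncating characters to 8 bits" does to a name. It assumes the truncation amounts to keeping the low byte of each 16-bit character, as a plain cast from a Java char to a byte would. That is my reading of the symptom, not Ant's actual code, and `truncate_to_8_bits` is a hypothetical helper.

```python
def truncate_to_8_bits(name: str) -> bytes:
    """Keep only the low 8 bits of every UTF-16 code unit.

    Assumption: this mimics casting each 16-bit character to a byte.
    """
    units = name.encode("utf-16-be")  # two bytes per code unit: high, low
    return units[1::2]                # keep the low byte of each unit


if __name__ == "__main__":
    for sample in ("readme.txt", "Привет.txt"):
        print(sample, "->", truncate_to_8_bits(sample))
```

ASCII names survive because their high byte is zero anyway. Cyrillic letters all sit in the U+04xx range, so the part that distinguishes them from ASCII and control characters is simply thrown away.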

RAR is supposed to handle UTF-8 (at least the spec says it does), and 7-Zip's own format, 7z, also supports UTF-8. But there's a problem: RAR is a closed format that you have to pay for, and I don't think there's any Java API for it; 7z also has no support in the Java world.

So did the world stop at ZIP and tar? It seems it did. I can't believe it! Isn't there any real alternative that you could use? I'd love to see a format that has APIs for C, C++, Java, Python, Perl, Ruby, and so on, with a nice GUI on Windows and Mac. Any volunteers? :-)

Or is it just a sign of the times, that these kinds of applications are not widely used anymore?

9 comments:

Marcin Gorycki said...

What about making Ant use GNU tar instead of its own?

Douglas Butler said...

Have you checked out 7-Zip's LZMA SDK? (http://www.7-zip.org/sdk.html)

Pawel Niewiadomski said...

Yes, I did. The SDK is for compressing files with the LZMA algorithm. It doesn't support the 7z archive format.

Pawel Niewiadomski said...

Second funny thing: today I checked what happens if I archive a file with Polish characters in its name under Windows using 7-Zip. It was amazing: the filenames were actually encoded using CP852, an encoding from DOS times; when Windows came out, it used CP1250 for Polish characters.

I say "wow"! :-)
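To see why the choice between CP852 and CP1250 matters, here is a small sketch that prints the bytes of the same Polish name under both code pages and under UTF-8. The sample name and the helper `encoded_forms` are made up for illustration.

```python
def encoded_forms(text: str) -> dict:
    """Encode the same text with the DOS, Windows and UTF-8 encodings."""
    return {codec: text.encode(codec) for codec in ("cp852", "cp1250", "utf-8")}


if __name__ == "__main__":
    name = "żółw.txt"  # sample Polish file name
    for codec, raw in encoded_forms(name).items():
        print(f"{codec:>7}: {raw.hex(' ')}")
```

The ASCII part is identical in all three, but every Polish letter gets a different byte value, so a tool that assumes the wrong code page shows the wrong letters.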

bbain said...

ZIP files are like tar files: in the old name fields you store names in whatever encoding you want. This allows UTF-8 to be stored if you want. I am pretty sure that Java always stores UTF-8 when ZIP files are created with the java.util.zip classes (check the source). The problem is getting a decoder that understands this.

If you are on Linux with a UTF-8 enabled locale, then the zip and unzip commands will store file names as UTF-8. In this case, it is up to the user to have compatible environments.

Pawel Niewiadomski said...

@bbain - you're right - you can put anything inside a ZIP, but the problem is that you want to take it out the same way you put it in ;-)

Actually the java.util.zip stuff uses String.getBytes(), which is locale-dependent, so it will not always return UTF-8. Ant's ZIP tools allow you to set the encoding manually (so getBytes(encoding) is called), and this is what I want. But the problem is that I haven't found any Windows app that would unpack this ZIP and show me Polish characters instead of garbage.
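The garbage can be reproduced without any archiver: write the name with one encoding and read it back with another. In this sketch the reading side uses CP437 purely as an assumption about what a legacy tool might fall back to; `as_seen_by_legacy_tool` and the sample names are hypothetical.

```python
def as_seen_by_legacy_tool(name: str,
                           written_as: str = "utf-8",
                           read_as: str = "cp437") -> str:
    """Return the name as displayed by a tool that decodes the stored
    bytes with a different encoding than the one used to write them."""
    return name.encode(written_as).decode(read_as)


if __name__ == "__main__":
    for sample in ("plain.txt", "zażółć.txt"):
        print(sample, "->", as_seen_by_legacy_tool(sample))
```

Nothing is lost in the archive itself: the bytes are still valid UTF-8, and decoding them with the right encoding gives the original name back. The unpacker just has no reliable way to know which encoding that is.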

Didn't I mention I didn't have problems on Linux and Mac most of the time?

I don't agree that it's up to the user to have the correct environment; it's up to developers to make this as little hassle as it can be. You don't want to force users to search for how to unpack files created by your product. You want it to work out of the box so customers are happy, and so are you (as you don't have to point them to some silly manual on "unpacking archives").

bbain said...

I completely agree with you. The archive should automatically handle these filename translations for you.

I just wanted to point out that ZIP supports UTF-8 as well as it supports ASCII: it blindly accepts it.

I still believe that Java actually uses UTF-8 to encode file names. For instance, ZipOutputStream.writeLOC, ZipOutputStream.writeCEN and ZipInputStream.readLOC all use UTF-8 for the filename. The ZipFile class is a little bit of a mystery as it uses native methods to read the filenames from the zip file. Where do you see String.getBytes in java.util.zip?

I believe that if you use JAR to produce the ZIP (on any platform) and then use JAR to expand it on Windows, it should work.

Pawel Niewiadomski said...

You're right. I made a mistake when writing my comment: Ant's ZIP uses getBytes() by default (if no encoding is set). java.util.zip.ZipOutputStream outputs UTF-8.

Анатоль said...

Yeah... Since 2008 some things have changed for the better.
Apache Commons Compress works pretty well now: files with Unicode names compressed with it can be decompressed successfully with WinZip or 7-Zip.
