Diary of a B+ Grade Polymath ([personal profile] tcpip) wrote 2003-07-25 04:31 pm

Project Imperialism?

Readers will know that I'm a big fan of Project Gutenberg. I completely agree with the sentiment of founder Michael Hart that "it would be a really good idea if lots of famous and important texts were freely available to everyone in the world".

But in the course of my research (in particular, into Internet and language content), I'm seriously beginning to have second thoughts about the Project.

Here's the current distribution of books according to language:

Bulgarian 6
Chinese 64 (whatever that's supposed to mean)
Dutch 8
Flemish 5
French 102
German 183
Greek 1
Italian 13
Japanese 2
Latin 15(!)
Portuguese 3
Spanish 15
Swedish 1
Welsh 4

and ...

English about 9500

To make matters less impressive, they have this rule of ASCII first (not a universal standard) and then no standards at all. For example, Pencho Slaveikov's Epicheski pesni (Epical Songs) uses the Cyrillic Windows-1251 character set. Legge's Confucian Analects requires the Big5 character set. The one Greek text (a translation of Sangharakshita, Vision and by Spiros Doikas) doesn't even mention what character encoding it uses!
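
To see why an undeclared encoding matters, here is a minimal Python sketch. It takes one Cyrillic word, encodes it as Windows-1251, and then decodes the same bytes under several charsets a reader might plausibly guess. The sample word and the list of guesses are my own choices for illustration, not anything taken from the Gutenberg files.

```python
# Hypothetical sample: one word, encoded as Windows-1251 (cp1251).
original = "песни"
raw = original.encode("cp1251")

# Charsets a reader might try when the file does not declare one.
guesses = ["cp1251", "iso8859_5", "koi8_r", "latin_1"]

# Decode the identical bytes under each guess.
decoded = {name: raw.decode(name, errors="replace") for name in guesses}

for name, text in decoded.items():
    verdict = "ok" if text == original else "wrong"
    print(f"{name:10} {text!r:12} {verdict}")
```

Only the right guess gives back the original word. The others decode without any error at all, which is the nasty part: nothing tells the reader the text is wrong.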

Have these people heard of Unicode?

[identity profile] damned-colonial.livejournal.com 2003-07-25 01:24 pm (UTC)(link)
I dunno... there are lots of languages that are representable in ASCII, and yet they are not represented in PG... so it's not just the encoding issue that's going on here.

[identity profile] tcpip.livejournal.com 2003-07-26 03:37 am (UTC)(link)

Plain vanilla ASCII only includes the English character set. Various (and often incompatible) extensions were then added to it (e.g., IBM PC Extended ASCII, Microsoft Windows ASCII, Digital VT220 terminal ASCII, etc.).

Finally, the ISO developed ISO-8859 8-bit ASCII, which eventually grew into a family (e.g., 8859-1 (Latin-1), 8859-2 (East European Latin), 8859-5 (Cyrillic)). M$ continued on its merry way with its own version of ASCII (code page 1252, which some jokingly say shows that M$ is still in the thirteenth century).

The People's Republic of China and the Province of Taiwan have completely different encodings (GB vs Big5), which means that if you send an email from Beijing to Taiwan it comes out as garbage.

The damn thing is a mess. They should just insist on Unicode.
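
The GB versus Big5 garbage is easy to reproduce. A minimal Python sketch, assuming the standard library's gb2312 and big5 codecs stand in for whatever each mail client assumes (the sample text is just an illustration):

```python
text = "北京"  # sample text, chosen for illustration

sent = text.encode("gb2312")                      # sender assumes GB
received = sent.decode("big5", errors="replace")  # receiver assumes Big5

# Same bytes, different characters: the message no longer matches.
print(received == text)

# With one shared encoding such as UTF-8 the round trip is lossless.
roundtrip = text.encode("utf-8").decode("utf-8")
print(roundtrip == text)
```

The first comparison prints False and the second prints True, which is the whole argument for a single agreed encoding.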


[identity profile] rustythoughts.livejournal.com 2003-07-28 04:01 am (UTC)(link)
Don't forget the various encodings used in Japan, dominated by JIS encoded as Shift JIS and EUC.

But even Unicode has its problems. It started as a 16-bit standard, but has been extended via some clever but not obvious tricks into a 32-bit standard.
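
I'm assuming the "tricks" here means UTF-16 surrogate pairs, where a character beyond the original 16-bit range is written as two reserved 16-bit units. A minimal Python sketch of the arithmetic (the function name is made up for this example). Note that the scheme reaches only up to U+10FFFF, so it is closer to 21 bits than a full 32.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

cp = 0x20000  # a CJK ideograph outside the original 16-bit range
high, low = to_surrogates(cp)
print(hex(high), hex(low))

# Cross-check against Python's own UTF-16 encoder.
encoded = chr(cp).encode("utf-16-be")
print(encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big"))
```

Software that still treats every 16-bit unit as one character will see two meaningless halves here, which is why the trick is clever but not obvious.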

Or this: Unicode is a superset of JIS, except there is no standard mapping between them. Individual vendors have some, but there is nothing complete in the public domain...

The problem of encodings is quite far reaching. Consider that fonts are collections of glyphs, and that something in the font must map from code x to glyph n; if you encode the document differently then you need new fonts (or separate mappings). Then let some company decide to invent a general mapping scheme but use it in a specific way, and suddenly you have a new quasi-standard encoding, but with non-standard uses abounding and no way to tell them apart. I'm looking at you, Adobe...

There are also very few comprehensive Unicode fonts out there. I think Microsoft has one that has all the basic Unicode characters and a fair proportion of the surrogates. It's huge, and they advise against using it except for testing purposes. So font formats evolve to allow multiple fonts to be treated as collections, except then you've got no way of ensuring that the same fonts are being used for the same glyph subsets.

And let's not talk about how printers deal with these issues, because that's the work I'm paid for and I'm avoiding it.

[identity profile] tcpip.livejournal.com 2003-07-28 04:08 am (UTC)(link)

Interesting comments and right on the ball on almost all issues. This one, "There are also very few comprehensive Unicode fonts out there", is interesting, but not really relevant for the purpose of this discussion.

I'm not even going to begin discussing fonts in my thesis. It's driving me crazy enough as it is.