Project Imperialism?
Jul. 25th, 2003 04:31 pm
Readers will know that I'm a big fan of Project Gutenberg. I completely agree with the sentiment of founder Michael Hart that "it would be a really good idea if lots of famous and important texts were freely available to everyone in the world".
But in the course of my research (in particular, Internet and language content), I'm seriously beginning to have second thoughts about the Project.
Here's the current distribution of books according to language:
Bulgarian 6, Chinese 64 (whatever that's supposed to mean), Dutch 8, Flemish 5, French 102, German 183, Greek 1, Italian 13, Japanese 2, Latin 15(!), Portuguese 3, Spanish 15, Swedish 1, Welsh 4
and ...
English about 9500
To make matters less impressive, they have this rule of ASCII first (not a universal standard) and then no standards at all. For example, the Epicheski pesni (Epical Songs) by Pencho Slaveikov uses the Cyrillic Windows-1251 character set. Legge's Confucian Analects requires the Big5 character set. The one Greek text (a translation of Sangharakshita, Vision and by Spiros Doikas) doesn't even mention what character encoding it uses!
Have these people heard of Unicode?
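To make the encoding complaint concrete, here is a minimal Python sketch of what normalising such texts to Unicode could look like. The sample strings are illustrative stand-ins (a Bulgarian word and the Chinese title of the Analects), not excerpts from the actual PG files, and `to_utf8` is a hypothetical helper name. Note that it only works when the source encoding is already known, which is exactly what the Greek text fails to declare.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")

# Illustrative stand-ins, not excerpts from the actual Project Gutenberg files.
bulgarian = "песни".encode("cp1251")  # Windows-1251 (Cyrillic)
chinese = "論語".encode("big5")        # Big5 (Traditional Chinese)

for raw, encoding in [(bulgarian, "cp1251"), (chinese, "big5")]:
    utf8 = to_utf8(raw, encoding)
    # Both texts end up in one encoding that any Unicode-aware reader can open.
    print(encoding, len(raw), "bytes ->", len(utf8), "bytes as UTF-8:", utf8.decode("utf-8"))
```

The byte counts grow because UTF-8 spends more bytes per non-Latin character than the legacy code pages do; that is the price of a single encoding covering every script.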
no subject
Date: 2003-07-25 11:00 am (UTC)
no subject
Date: 2003-07-26 02:13 am (UTC)
Point. Fixed.
no subject
Date: 2003-07-25 11:11 am (UTC)
I have to admit, ASCII has served me well as a standard over the years, and I have but vague notions of what Unicode does, but perhaps Gutenberg (and indeed the net) is at the point where cultural diversity is important.
After all, was it not last year that they finally got around to allowing non-English URLs?
no subject
Date: 2003-07-26 02:49 am (UTC)
Unicode, unbelievably, actually maps the character sets of every language in the world and quite a few historical ones as well. It is also designed to be platform independent. I have contacted PG over this, so we'll see how it goes.
IIRC, Verisign, the mob in charge of the .com TLDs, started issuing non-English URLs without checking with ICANN, so there's a huge bun-fight over that one. I'll get back to you once I've done some research in that area.
domain names.
Date: 2003-07-28 10:57 am (UTC)
Try telling that to customers who want them. GAH!!!!!!!!!
Benjamin
Re: domain names.
Date: 2003-07-29 01:58 am (UTC)
That's a damn good point. It'll even make up an extra two or three sentences in my thesis ;-)
Do you know if anyone is working on this? Getting BIND and Unicode to talk to one another?
Re: domain names.
Date: 2003-07-29 07:03 am (UTC)
no subject
Date: 2003-07-25 01:24 pm (UTC)
no subject
Date: 2003-07-26 03:37 am (UTC)
Plain vanilla ASCII only includes the English character set. Various (and often incompatible) extensions were then added to it (e.g., IBM PC Extended ASCII, Microsoft Windows ASCII, Digital VT220 terminal ASCII etc etc).
Finally, the ISO developed ISO-8859 8-bit ASCII, which eventually developed into a family (e.g., 8859-1 (Latin-1), 8859-2 (East European Latin), 8859-5 (Cyrillic)). M$ continued on its merry way with its own version of ASCII (code page 1252, which some jokingly say shows that M$ is still in the thirteenth century).
The People's Republic of China and the Province of Taiwan have completely different encodings (GB vs Big5), which means that if you send an email from Beijing to Taiwan it comes out as garbage.
The damn thing is a mess. They should just insist on Unicode.
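The GB vs Big5 garbage problem can be reproduced in a few lines. This is only a sketch: the sample string is an illustrative choice, and Python's `gb2312` and `big5` codecs stand in for whatever the mail programs at either end actually use.

```python
text = "中文"  # illustrative sample string

gb_bytes = text.encode("gb2312")   # encoding common in mainland China
big5_bytes = text.encode("big5")   # encoding common in Taiwan

# Same characters, different bytes on the wire.
print(gb_bytes.hex(), big5_bytes.hex())

# A reader that assumes Big5 turns the GB2312 bytes into the wrong characters.
# errors="replace" keeps the sketch from raising on byte pairs Big5 lacks.
garbled = gb_bytes.decode("big5", errors="replace")
print(garbled == text)

# With Unicode (here UTF-8) at both ends, the text survives the trip.
round_trip = text.encode("utf-8").decode("utf-8")
print(round_trip == text)
```

Nothing in the bytes themselves says which encoding produced them, so the receiver has to be told or has to guess.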
no subject
Date: 2003-07-28 04:01 am (UTC)
But even Unicode has its problems. It started as a 16-bit standard, but has been extended via some clever but not obvious tricks into a 32-bit standard.
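The "clever but not obvious trick" is presumably the UTF-16 surrogate pair, where two reserved 16-bit values together carry one code point above U+FFFF. A rough sketch follows; `to_surrogate_pair` is a hypothetical helper name and the G clef is just a convenient example character. Strictly speaking, the code space this opens up stops at U+10FFFF rather than filling all 32 bits.

```python
from typing import Tuple

def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 high/low surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000      # 20 bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

clef = 0x1D11E  # MUSICAL SYMBOL G CLEF, outside the original 16-bit range
high, low = to_surrogate_pair(clef)
print(hex(high), hex(low))  # 0xd834 0xdd1e

# Cross-check against Python's own UTF-16 encoder.
expected = chr(clef).encode("utf-16-be")
assert expected == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```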
Or this: Unicode is a superset of JIS, except there is no standard mapping between them. Individual vendors have some, but nothing complete in the public domain...
The problem of encodings is quite far reaching. Consider that fonts are collections of glyphs, and that something in the font must map from code x to glyph n; if you encode the document differently then you need new fonts (or separate mappings). Then let some company decide to invent a general mapping scheme but use it in a specific way, and suddenly you have a new quasi-standard encoding, but with non-standard uses abounding and no way to tell them apart. I'm looking at you, Adobe...
There are also very few comprehensive Unicode fonts out there. I think Microsoft has one that has all the basic Unicode characters and a fair proportion of the surrogates. It's huge, and they advise against using it except for testing purposes. So font formats evolve to allow multiple fonts to be treated as collections except then you've got no way of ensuring that the same fonts are being used for the same glyph subsets.
And let's not talk about how printers deal with these issues, because that's the work I'm paid for and I'm avoiding it.
no subject
Date: 2003-07-28 04:08 am (UTC)
Interesting comments and right on the ball on almost all issues. This one: "There are also very few comprehensive Unicode fonts out there." is interesting, but not really relevant for the purpose of this discussion.
I'm not even going to begin discussing fonts in my thesis. It's driving me crazy enough as it is.
no subject
Date: 2003-07-25 01:49 pm (UTC)
In these situations I like to apply a variant of Occam's Razor - never ascribe to malice what can be adequately explained by ignorance.
no subject
Date: 2003-07-31 02:26 am (UTC)
You know, I hadn't responded to this comment because I agreed with everything it said ;-)
I (temporarily) joined the PG volunteers list to flag this with them. A couple of the volunteers reacted a bit tetchily - you know "we're volunteers, how dare you criticise us!", and Michael Hart himself made a couple of comments - mainly along the lines of "we don't have any standards anymore because I don't like to force people to do anything".
Whilst I agree with him from an individual perspective, PG is an organization. And for the sake of efficiency and effectiveness, organizations can set standards.
never ascribe to malice what can be adequately explained by ignorance.
I do this as often as possible. But there are those who simply choose to remain ignorant and therefore malicious...
no subject
Date: 2003-07-26 07:43 am (UTC)
They are another group that should use a spell check before publishing on the net :)
I do however love the fact that they exist in the first place
no subject
Date: 2003-07-26 08:53 am (UTC)
There are some instructions in their FAQ regarding formatting.
You're quite right though, I'm damn glad that they exist. Although their rate of production hasn't been the greatest - depending on volunteer labour and all.
It would be nice if one, just one, government among the OECD nations said "Hey, what a good idea! An international public library".
I tried to spin the idea to the Victorian government a couple of years ago. There was some interest in local history stuff, but that's about it.
Maybe a project I can embark upon in the near future... Dammit, if only some nice government would give me a couple of million dollars. I'm sure I could do something really useful with it.
no subject
Date: 2003-07-27 06:00 am (UTC)
I'm sure you will be able to convince the Govt to do something about it at some stage in the future...
no subject
Date: 2003-07-31 02:31 am (UTC)
Certainly. They recommend OCR and then proofreading the book, but because that requires "shredding" an out-of-copyright book, which is often pretty old, it isn't always a popular idea...
I'm thinking of typing up Cliff Morris' Tetun-English dictionary. It was a bit of a home publication job and now Cliff is no longer in this world...
no subject
Date: 2003-07-31 02:37 am (UTC)
As for the Tetun English dictionary - it sounds like a very good idea.
no subject
Date: 2003-07-31 02:48 am (UTC)
Optical Character Recognition.
I'm also thinking of donating the User's Manual to their collection, after all the material is generic enough to last some time. And it will be in Bahasa, Portuguese and English.
no subject
Date: 2003-07-31 02:50 am (UTC)