Diary of a B+ Grade Polymath (tcpip) wrote 2003-07-25 04:31 pm

Project Imperialism?

Readers will know that I'm a big fan of Project Gutenberg. I completely agree with the sentiment of founder Michael Hart that "it would be a really good idea if lots of famous and important texts were freely available to everyone in the world".

But in the course of my research (in particular on Internet and language content), I'm seriously beginning to have second thoughts about the Project.

Here's the current distribution of books according to language:

Bulgarian 6, Chinese 64 (whatever that's supposed to mean), Dutch 8, Flemish 5, French 102, German 183, Greek 1, Italian 13, Japanese 2, Latin 15(!), Portuguese 3, Spanish 15, Swedish 1, Welsh 4

and ...

English about 9500

To make matters less impressive, they have a rule of ASCII first (which is not a universal standard) and, beyond that, no standards at all. For example, Pencho Slaveikov's Epicheski pesni (Epical Songs) uses the Cyrillic Windows-1251 character set. Legge's Confucian Analects requires the Big5 character set. The one Greek text (a translation of Sangharakshita, Vision and by Spiros Doikas) doesn't even mention what character encoding it uses!
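To show why an unlabelled encoding is such a nuisance, here is a minimal Python sketch. It is purely illustrative: the sample strings are stand-ins rather than excerpts from the PG files, the helper names `to_utf8` and `plausible_encodings` are hypothetical, and the candidate list is just my guess at what one might try. The first helper transcodes text whose legacy encoding is known; the second shows that several encodings can decode the same bytes without complaint, so the right one cannot be reliably inferred from the bytes alone.

```python
# Illustrative sketch only: sample strings and helper names are assumptions,
# not anything taken from Project Gutenberg's files or tooling.

CANDIDATES = ["utf-8", "cp1251", "big5", "iso8859_7"]


def to_utf8(raw: bytes, declared: str) -> bytes:
    """Transcode bytes from a declared legacy encoding to UTF-8."""
    return raw.decode(declared).encode("utf-8")


def plausible_encodings(raw: bytes, candidates=CANDIDATES) -> list:
    """Return the candidate encodings that decode the bytes without error.

    Decoding without error does NOT mean the result is readable text;
    it only means the byte sequence is valid in that encoding.
    """
    ok = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


if __name__ == "__main__":
    # Known encodings: conversion is mechanical.
    cyrillic = "песни".encode("cp1251")
    chinese = "論語".encode("big5")
    print(to_utf8(cyrillic, "cp1251").decode("utf-8"))
    print(to_utf8(chinese, "big5").decode("utf-8"))

    # Unknown encoding: more than one candidate may "work".
    greek = "λόγος".encode("iso8859_7")
    print(plausible_encodings(greek))
```

With a declared encoding the conversion is a one-liner. Without one, a reader is left guessing between candidates that all decode cleanly but mostly produce gibberish, which is exactly the problem with the Greek text.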

Have these people heard of Unicode?

darkstardeity.livejournal.com 2003-07-25 01:49 pm (UTC)
I don't know that we need to necessarily look for conspiracy theories here. As I understand it PG relies entirely on volunteers to scan in the books, correct the inevitable errors in the OCR, and proof-read the thing at the end of the process. It might not be that lack of diversity is being imposed on the project by any deliberate actions of the PG coordinators, but it may simply be that far fewer people are volunteering to convert and proof-read books from other language groups. The "ASCII by preference" thing no doubt discourages people whose language uses a character set outside of this standard, but again, I don't think that that is a deliberate ploy on the part of the PG people - PG started well before Unicode was an accepted or even well-known standard, and they may still have their doubts about it, or, as you have suggested, may not even be aware of it. The last time I looked into it, they were even baulking at the idea of using standard HTML as their preferred format because of issues of incompatibility between browsers that had already surfaced.

In these situations I like to apply a variant of Occam's Razor - never ascribe to malice what can be adequately explained by ignorance.

tcpip.livejournal.com 2003-07-31 02:26 am (UTC)

You know, I hadn't responded to this comment because I agreed with everything it said ;-)

I (temporarily) joined the PG volunteers list to flag this with them. A couple of the volunteers reacted a bit tetchily - you know "we're volunteers, how dare you criticise us!", and Michael Hart himself made a couple of comments - mainly along the lines of "we don't have any standards anymore because I don't like to force people to do anything".

Whilst I agree with him from an individual perspective, PG is an organization, and for the sake of efficiency and effectiveness organizations can set standards.

"never ascribe to malice what can be adequately explained by ignorance."

I do this as often as possible. But there are those who simply choose to remain ignorant and therefore malicious...