Diary of a B+ Grade Polymath (tcpip) wrote 2003-07-25 04:31 pm

Project Imperialism?

Readers will know that I'm a big fan of Project Gutenberg. I completely agree with the sentiment of founder Michael Hart that "it would be a really good idea if lots of famous and important texts were freely available to everyone in the world".

But in the course of my research (in particular on Internet and language content), I'm seriously beginning to have second thoughts about the Project.

Here's the current distribution of books according to language:

Bulgarian: 6
Chinese: 64 (whatever that's supposed to mean)
Dutch: 8
Flemish: 5
French: 102
German: 183
Greek: 1
Italian: 13
Japanese: 2
Latin: 15(!)
Portuguese: 3
Spanish: 15
Swedish: 1
Welsh: 4

and ...

English about 9500

To make matters less impressive, they have this rule of ASCII first (which is not a universal standard) and then no standards at all for anything ASCII can't represent. For example, Pencho Slaveikov's Epicheski pesni (Epical Songs) uses the Cyrillic Windows-1251 character set. Legge's Confucian Analects requires the Big5 character set. The 1 Greek text (a translation of Sangharakshita, Vision and by Spiros Doikas) doesn't even mention what character encoding it uses!

Have these people heard of Unicode?
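To make the complaint concrete, here is a minimal Python sketch of what a reader has to do with these files today: know the legacy encoding in advance, decode with it, and only then get text that can be stored as UTF-8. The helper name `to_utf8` and the sample strings are my own illustrations, not anything Project Gutenberg ships; `cp1251` and `big5` are the standard Python codec names for Windows-1251 and Big5.

```python
# Hypothetical helper: re-encode text from a known legacy encoding as UTF-8.
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    # Decode the legacy bytes to Unicode text, then encode as UTF-8.
    return raw.decode(source_encoding).encode("utf-8")


# Sample data standing in for the files (illustrative strings only).
cyrillic_raw = "Епически песни".encode("cp1251")  # Windows-1251 bytes
chinese_raw = "論語".encode("big5")                # Big5 bytes

# With the right encoding name, both end up in one common encoding.
print(to_utf8(cyrillic_raw, "cp1251").decode("utf-8"))
print(to_utf8(chinese_raw, "big5").decode("utf-8"))

# With the wrong guess, the bytes still decode but the text is garbage,
# which is the problem with a file that never states its encoding.
print(cyrillic_raw.decode("latin-1"))
```

The point of the last line is that nothing fails loudly: a single-byte encoding like Windows-1251 will happily decode as some other single-byte encoding, so an undeclared encoding leaves the reader guessing.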

caseopaya.livejournal.com 2003-07-27 06:00 am (UTC)
Maybe I should see if they need any volunteers to transcribe the books then, that's at least something I CAN do :)

I'm sure you will be able to convince the Govt to do something about it at some stage in the future...

tcpip.livejournal.com 2003-07-31 02:31 am (UTC)

Certainly. They recommend OCR and then proofreading the book, but because that requires "shredding" an out-of-copyright book, which is often pretty old, it isn't always a popular idea...

I'm thinking of typing up Cliff Morris' Tetun-English dictionary. It was a bit of a home publication job and now Cliff is no longer in this world...

caseopaya.livejournal.com 2003-07-31 02:37 am (UTC)
I'm going to have to ask what OCR means? Remind me to travel to their web site and see what they need.

As for the Tetun English dictionary - it sounds like a very good idea.

tcpip.livejournal.com 2003-07-31 02:48 am (UTC)

Optical Character Recognition.

I'm also thinking of donating the User's Manual to their collection; after all, the material is generic enough to last some time. And it will be in Bahasa, Portuguese and English.

caseopaya.livejournal.com 2003-07-31 02:50 am (UTC)
Even better, that way you will be helping a number of countries even if you can't actually get there. As long as they learn about it, that is.