tcpip: (Default)
[personal profile] tcpip
Readers will know that I'm a big fan of Project Gutenberg. I completely agree with the sentiment of founder Michael Hart that "it would be a really good idea if lots of famous and important texts were freely available to everyone in the world".

But in the course of my research (in particular, on Internet and language content), I'm seriously beginning to have second thoughts about the Project.

Here's the current distribution of books according to language:

Bulgarian 6
Chinese 64 (whatever that's supposed to mean)
Dutch 8
Flemish 5
French 102
German 183
Greek 1
Italian 13
Japanese 2
Latin 15(!)
Portuguese 3
Spanish 15
Swedish 1
Welsh 4

and ...

English about 9500

To make matters less impressive, they have this rule of ASCII first (not a universal standard) and then no standards at all. For example, Epicheski pesni (Epical Songs) by Pencho Slaveikov uses the Cyrillic Windows-1251 character set. Legge's Confucian Analects requires the Big5 character set. The 1 Greek text (a translation of Sangharakshita, Vision and by Spiros Doikas) doesn't even mention what character encoding it uses!

Have these people heard of Unicode?
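
To make the encoding complaint concrete, here is a minimal Python sketch. It assumes a Cyrillic title stored as Windows-1251 bytes; the Cyrillic spelling used is my own transliteration for illustration, not text taken from the Project Gutenberg file, and `fits_in` is a hypothetical helper name. It shows that such text cannot be stored as plain ASCII, and that once the encoding is declared, converting it to Unicode (UTF-8 here) is mechanical.

```python
# Hypothetical sample: the Cyrillic spelling of the title is an assumed
# transliteration, not text copied from the Project Gutenberg file.
title = "Епически песни"


def fits_in(text: str, codec: str) -> bool:
    """Return True if every character of text can be encoded with codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Bytes as a Windows-1251 file would store them: one byte per character.
legacy_bytes = title.encode("cp1251")

# With the encoding known, conversion to UTF-8 is a decode followed by an encode.
utf8_bytes = legacy_bytes.decode("cp1251").encode("utf-8")

if __name__ == "__main__":
    print("fits in ASCII:", fits_in(title, "ascii"))
    print("bytes in cp1251:", len(legacy_bytes), "bytes in UTF-8:", len(utf8_bytes))
    print("round trip intact:", utf8_bytes.decode("utf-8") == title)
```

The catch is the first argument to `decode`: if a file never says which encoding it uses, as with the Greek text, that step becomes guesswork.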

Date: 2003-07-25 11:00 am (UTC)
From: [identity profile] mr-eleganza.livejournal.com
Relative links are not your friend :)

Date: 2003-07-26 02:13 am (UTC)
From: [identity profile] tcpip.livejournal.com

Point. Fixed.

Date: 2003-07-25 11:11 am (UTC)
From: [identity profile] greylock.livejournal.com
Have you contacted the PG people?
I have to admit, ASCII has served me well as a standard over the years, and I have but vague notions of what Unicode does, but perhaps Gutenberg (and indeed the net) are at the point where cultural diversity is important.

After all, was it not last year that they finally got around to allowing non-English URLs?

Date: 2003-07-26 02:49 am (UTC)
From: [identity profile] tcpip.livejournal.com

Unicode, unbelievably, actually maps the character sets of every language in the world and quite a few historical ones as well. It is designed to be platform independent as well. I have contacted PG over this, so we'll see how it goes.

IIRC, Verisign, the mob in charge of the .com TLDs, started issuing non-English URLs without checking with ICANN, so there's a huge bun-fight over that one. I'll get back to you once I've done some research in that area.

domain names.

Date: 2003-07-28 10:57 am (UTC)
From: [identity profile] cvisors.livejournal.com
It's not just that; all gTLD registrars do ML domain names. But BIND and most other name servers don't support them, so they don't work....

Try telling that to customers who want them. GAH!!!!!!!!!


Benjamin

Re: domain names.

Date: 2003-07-29 01:58 am (UTC)
From: [identity profile] tcpip.livejournal.com

That's a damn good point. It'll even make up an extra two or three sentences in my thesis ;-)

Do you know if anyone is working on this? Getting BIND and Unicode to talk to one another?

Re: domain names.

Date: 2003-07-29 07:03 am (UTC)
From: [identity profile] cvisors.livejournal.com
Don't know. It's just not supported, and as it isn't supported at the root, they just don't work.
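
For reference, the IDNA approach (RFC 3490) sidesteps the name-server problem by having the application convert a Unicode name into an ASCII-compatible form before the query is sent, so the DNS itself only ever sees ASCII. A minimal sketch using Python's built-in `idna` codec; the domain `bücher.example` is a made-up name under the reserved `.example` TLD, not one taken from this thread.

```python
# Made-up internationalised name under the reserved .example TLD.
unicode_name = "bücher.example"

# The application converts each label to its ASCII-compatible ("xn--") form.
ascii_name = unicode_name.encode("idna")

# A name server only ever has to resolve the ASCII form; the application
# can convert it back for display.
round_trip = ascii_name.decode("idna")

if __name__ == "__main__":
    print(ascii_name)
    print(round_trip)
```

Whether a given registrar, resolver or browser of the time actually performed this conversion is a separate question that the sketch does not answer.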

Date: 2003-07-25 01:24 pm (UTC)
From: [identity profile] damned-colonial.livejournal.com
I dunno... there are lots of languages that are representable in ASCII, and yet they are not represented in PG... so it's not just the encoding issue that's going on here.

Date: 2003-07-26 03:37 am (UTC)
From: [identity profile] tcpip.livejournal.com

Plain vanilla ASCII only includes the English character set. Various (and often incompatible) extensions were then added to it (e.g., IBM PC Extended ASCII, Microsoft Windows ASCII, Digital VT220 terminal ASCII, etc.).

Finally, the ISO developed ISO-8859, an 8-bit extension of ASCII, which eventually grew into a family (e.g., 8859-1 (Latin-1), 8859-2 (East European Latin), 8859-5 (Cyrillic)). M$ continued on its merry way with its own version of ASCII (code page 1252, which some jokingly say shows that M$ is still in the thirteenth century).

The People's Republic of China and the Province of Taiwan have completely different encodings (GB vs Big5), which means if you send an email from Beijing to Taiwan it comes out as garbage.

The damn thing is a mess. They should just insist on Unicode.
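
A small Python sketch of the mess described above, using codecs from the standard library. The byte value `0xE9` and the sample string are arbitrary choices for illustration. The first part reads one byte under several 8-bit "extended ASCII" code pages; the second writes Chinese text as GB2312 and reads it back as Big5, then does the same round trip through UTF-8.

```python
# One byte above 127, read under several 8-bit code pages.
raw = bytes([0xE9])
single_byte_views = {
    codec: raw.decode(codec)
    for codec in ("latin-1", "cp1252", "cp437", "iso8859-5", "cp1251")
}

# Chinese text written as GB2312 but read as Big5 (the Beijing-to-Taiwan case).
sent = "中文"
wire = sent.encode("gb2312")
# errors="replace" keeps the decode from raising on byte pairs Big5 lacks.
misread = wire.decode("big5", errors="replace")

# The same text sent as UTF-8 and read as UTF-8 survives intact.
via_unicode = sent.encode("utf-8").decode("utf-8")

if __name__ == "__main__":
    for codec, char in single_byte_views.items():
        print(f"{codec:10} 0xE9 -> {char!r}")
    print("sent:", sent, "misread as Big5:", misread)
    print("via UTF-8:", via_unicode)
```

The bytes never change in either case; only the reader's assumption about the encoding does, which is why an undeclared encoding is as bad as a wrong one.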


Date: 2003-07-28 04:01 am (UTC)
From: [identity profile] rustythoughts.livejournal.com
Don't forget the various encodings used in Japan dominated by JIS encoded as ShiftJIS and EUC.

But even Unicode has its problems. It started as a 16 bit standard, but has been extended via some clever but not obvious tricks into a 32 bit standard.
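
The "tricks" are presumably UTF-16 surrogate pairs (an assumption; the comment does not name them). A minimal sketch of the arithmetic follows; `to_surrogates` is a hypothetical helper name, and the sample character U+1D11E (musical G clef) is an arbitrary choice from outside the original 16-bit range.

```python
from typing import Tuple


def to_surrogates(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 high/low surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary range")
    offset = code_point - 0x10000        # 20 bits remain after removing the base
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


if __name__ == "__main__":
    high, low = to_surrogates(0x1D11E)
    print(f"U+1D11E -> {high:04X} {low:04X}")
    # Cross-check against Python's own UTF-16 encoder.
    print("\U0001D11E".encode("utf-16-be").hex().upper())
```

Software that assumes one 16-bit unit per character will treat each half of such a pair as a separate character, which is where the "not obvious" part tends to cause trouble.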

Or this: Unicode is a superset of JIS, except there is no standard mapping between them. Individual vendors have some, but there is nothing complete in the public domain...

The problem of encodings is quite far reaching. Consider that fonts are collections of glyphs, and that something in the font must map from code x to glyph n; if you encode the document differently then you need new fonts (or separate mappings). Then let some company decide to invent a general mapping scheme but use it in a specific way, and suddenly you have a new quasi-standard encoding, but with non-standard uses abounding and no way to tell them apart. I'm looking at you, Adobe...

There are also very few comprehensive Unicode fonts out there. I think Microsoft has one that has all the basic Unicode characters and a fair proportion of the surrogates. It's huge, and they advise against using it except for testing purposes. So font formats evolve to allow multiple fonts to be treated as collections except then you've got no way of ensuring that the same fonts are being used for the same glyph subsets.

And let's not talk about how printers deal with these issues, because that's the work I'm paid for and I'm avoiding it.

Date: 2003-07-28 04:08 am (UTC)
From: [identity profile] tcpip.livejournal.com

Interesting comments and right on the ball on almost all issues. This one, "There are also very few comprehensive Unicode fonts out there", is interesting, but not really relevant for the purpose of this discussion.

I'm not even going to begin discussing fonts in my thesis. It's driving me crazy enough as it is.

Date: 2003-07-25 01:49 pm (UTC)
From: [identity profile] darkstardeity.livejournal.com
I don't know that we need to necessarily look for conspiracy theories here. As I understand it PG relies entirely on volunteers to scan in the books, correct the inevitable errors in the OCR, and proof-read the thing at the end of the process. It might not be that lack of diversity is being imposed on the project by any deliberate actions of the PG coordinators, but it may simply be that far fewer people are volunteering to convert and proof-read books from other language groups. The "ASCII by preference" thing no doubt discourages people whose language uses a character set outside of this standard, but again, I don't think that that is a deliberate ploy on the part of the PG people - PG started well before Unicode was an accepted or even well-known standard, and they may still have their doubts about it, or, as you have suggested, may not even be aware of it. The last time I looked into it, they were even baulking at the idea of using standard HTML as their preferred format because of issues of incompatibility between browsers that had already surfaced.

In these situations I like to apply a variant of Occam's Razor - never ascribe to malice what can be adequately explained by ignorance.

Date: 2003-07-31 02:26 am (UTC)
From: [identity profile] tcpip.livejournal.com

You know, I hadn't responded to this comment because I agreed with everything it said ;-)

I (temporarily) joined the PG volunteers list to flag this with them. A couple of the volunteers reacted a bit tetchily - you know "we're volunteers, how dare you criticise us!", and Michael Hart himself made a couple of comments - mainly along the lines of "we don't have any standards anymore because I don't like to force people to do anything".

Whilst I agree with him from an individual perspective, PG is an organization. And for the sake of efficiency and effectiveness, organizations can set standards.

never ascribe to malice what can be adequately explained by ignorance.

I do this as often as possible. But there are those who simply choose to remain ignorant and therefore malicious...

Date: 2003-07-26 07:43 am (UTC)
From: [identity profile] caseopaya.livejournal.com
I definitely have to agree that when you print off copies of the books on Project Gutenberg they are not well formatted.

They are another group that should use a spell check before publishing them on the net :)

I do however love the fact that they exist in the first place

Date: 2003-07-26 08:53 am (UTC)
From: [identity profile] tcpip.livejournal.com

There are some instructions in their FAQ regarding formatting.

You're quite right though, I'm damn glad that they exist. Although their rate of production hasn't been the greatest - depending on volunteer labour and all.

It would be nice if one, just one, government among the OECD nations said "Hey, what a good idea! An international public library".

I tried to spin the idea to the Victorian government a couple of years ago. There was some interest in local history stuff, but that's about it.

Maybe a project I can embark upon in the near future... Dammit, if only some nice government would give me a couple of million dollars. I'm sure I could do something really useful with it.

Date: 2003-07-27 06:00 am (UTC)
From: [identity profile] caseopaya.livejournal.com
Maybe I should see if they need any volunteers to transcribe the books then, that's at least something I CAN do :)

I'm sure you will be able to convince the Govt to do something about it at some stage in the future...

Date: 2003-07-31 02:31 am (UTC)
From: [identity profile] tcpip.livejournal.com

Certainly. They recommend OCR and then proofreading the book, but because that requires "shredding" an out-of-copyright book, which is often pretty old, it isn't always a popular idea...

I'm thinking of typing up Cliff Morris' Tetun-English dictionary. It was a bit of a home publication job and now Cliff is no longer in this world...

Date: 2003-07-31 02:37 am (UTC)
From: [identity profile] caseopaya.livejournal.com
I'm going to have to ask what OCR means? Remind me to travel to their web site and see what they need.

As for the Tetun English dictionary - it sounds like a very good idea.

Date: 2003-07-31 02:48 am (UTC)
From: [identity profile] tcpip.livejournal.com

Optical Character Recognition.

I'm also thinking of donating the User's Manual to their collection, after all the material is generic enough to last some time. And it will be in Bahasa, Portuguese and English.

Date: 2003-07-31 02:50 am (UTC)
From: [identity profile] caseopaya.livejournal.com
Even better, that way you will be helping a number of countries even if you can't actually get there. As long as they learn about it that is.
