...making Linux just a little more fun!
NASA is the National Aeronautics and Space Administration for the United States government space program. The shuttle liftoff picture and the Discovery landing picture are from NASA's archives.
xplanet
follows the tradition of xearth but using real imagery. I've
prepared this picture using the following options:
xplanet -output lg_cover117.jpg -geometry 800x560
--background /home/heather/xplanet_bg_lg117.jpg -body
earth -longitude -20 -north orbit -config
overlay_clouds.29july05 -center +330+280 -num_times 1
I fetched the current cloudcover imagery per the instructions in
/usr/share/xplanet/images/README. I decided the night image was a
little too dark and made a softer one called night_mode which still shows some of
the landscape. Most people don't know that you can give it any background
you like instead of having it speckle the black background. Many people know
that you don't have to let it pick the origin point. I started with
-origin=moon which looks great but the moon wasn't visible from
Florida right then, so I had to move our point of view.
The starfield is actually from an ultraviolet study of the Milky Way; read a little more about our galaxy at www.astro.virginia.edu/~mwk7v/sim/mw.shtml.
I created a markerfile with only one entry in it, the latitude and longitude for Kennedy Space Center, where the current shuttle took off and will land. There are a number of groups with custom markerfiles for various purposes. If you're putting together your own, you might take a look at Dave Pietromonaco's xearth markers page since he has a Gazetteer of coordinates for many locations.
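For anyone assembling a markerfile by script, here is a minimal sketch. It assumes the usual xplanet marker line layout of latitude, longitude, then a quoted label. The function names and the rounded Kennedy Space Center coordinates are illustrative only and are not taken from the file used for the cover.

```python
def marker_line(latitude, longitude, label):
    """Format one marker entry: latitude, longitude, quoted label."""
    return f'{latitude:.2f} {longitude:.2f} "{label}"'


def write_markerfile(path, entries):
    """Write one marker line per (latitude, longitude, label) tuple."""
    with open(path, "w", encoding="utf-8") as handle:
        for latitude, longitude, label in entries:
            handle.write(marker_line(latitude, longitude, label) + "\n")


# Approximate coordinates; north and east are positive (an assumption
# for illustration, rounded to two decimals).
KSC = (28.57, -80.65, "Kennedy Space Center")
```

Calling `write_markerfile("ksc_markers", [KSC])` would produce a one-entry file like the one described above; the filename is hypothetical.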
May the shuttles enjoy many more successful flights.
Heather is Linux Gazette's Technical Editor and The Answer Gang's Editor Gal.
Heather is a hardware agnostic, but has spent more hours as a tech in
Windows related tech support than most people have spent with their computers.
(Got the pin, got the Jacket, got about a zillion T-shirts.) When she
discovered Linux in 1993, it wasn't long before the home systems ran Linux
regardless of what was in use at work.
By 1995 she was training others in using Linux - and in charge of all the
"strange systems" at a (then) 90 million dollar company. Moving onwards, it's
safe to say, Linux has been an excellent companion and breadwinner... She
took over the HTML editing for "The Answer Guy" in issue 28, and has been
slowly improving the preprocessing scripts she uses ever since.
Here's an autobiographical filksong she wrote called
The Programmer's Daughter.
Heather got started in computing before she quite got started learning
English. By 8 she was a happy programmer, by 15 the system administrator
for the home... Dad had finally broken down and gotten one of those personal
computers, only to find it needed regular care and feeding like any other
pet. Except it wasn't a Pet: it was one of those brands we find most
everywhere today...
Digital camera audio files
I've recently gotten a digital camera (yes, I know I'm sort of late coming into the digital revolution). It is an HP R607, and it lets you add audio tags to still images.
[Lew] Kewl.
I would have thought that the audio would be a wav or mp3 file with the same name as the image, but life is never easy. I can only assume that the audio is embedded into the jpg. I've checked, and the only files on the camera (other than some short XML files) are the jpgs.
[Lew] Yah. The JFIF format (that's the file format that "JPEG" pictures are stored in) supports a bunch of metadata. Many (I'm tempted to say most) cameras store a 'thumbnail' photo in the jpg along with the full photo. They also store camera information (make, model) and photo metadata (date/time of photo, focal length of lens, exposure time, photographer's comments, picture orientation, and a whole lot more). It wouldn't surprise me if the HP camera also stored an audio clip as metadata in the picture jpg.
[Ben] When you say that you've checked, do you mean that you used something like "camedia", or did you actually mount it as a storage device and look at the files on it? The former may only show the JPG files, while the latter should show everything. My Olympus D-40, for example, produces discrete files for audio, stills, and movies.
If you're actually looking at the files on the device, then I'd have to agree with the previous post - it's stored as EXIF data.
Yes. It is definitely stored as EXIF data. Ran a jpg file into emacs and had a look. There's a nice RIFF/WAV header block right in the file. Of course, the picture files without audio don't have the header.
More reading leads me to think that some cameras use 2 files and others
embed the audio into the picture. Mine is the latter.
So, is there a way to play the audio in the pictures on Linux?
[Lew] I'm not sure, but it's likely that the audio is stored in one of the EXIF (JFIF metadata) tags. There are tools available that can extract EXIF tag data, ranging from the digikam/gphoto2/libgphoto tools to standalone tools like jhead. Perhaps one of these tools can extract out the audio, and you can play it from there.
Yes, it helps. Problem is to find a tool to do the extraction. digikam, etc (based on gphoto2) do NOT seem to support audio play/extraction.
[Ben] 'gphoto2' supports a '--get-audio-data' option. There are probably a number of other programs; googling for "exif audio extract linux" comes up with 40,500 hits.
I think from reading and a bit of testing that --get-audio-data just copies .wav files if they are on the camera. I could be wrong, but I could not get this program to extract data.
I did the google as well.
I found 2 candidate programs:
dphotox - this appears to be a great program, but I can't access the
download site ftp://ftp.mostang.com/pub/dphotox I sent
David.Mosberger @acm.org a note, but no reply as of yet. This is being
distributed as a binary only ... something to do with non-disclosure
according to the web page (which is accessible:
http://www.mostang.com/dphotox )
EXIFutilsLinux2.6.2.tgz is another package which works. Installed and tried it. It is shareware and they want a bit of money to unlock all the features.
Amazing, the files I did extract sounded not bad at all.
I can't imagine that getting the audio out is much more than trivial. I just don't have time right now, but I did compare the extracted file to the data in the camera file and it is identical. Just a matter of figuring an offset and the size.
[Jimmy] I shouldn't imagine it's even that difficult: the audio is stored as an EXIF tag, so you really just need to extract the contents of a specific tag. (Hint: with Perl, Image::EXIF and Data::Dumper are your friends).
If you want to send along a sample file, I'd be happy to give it a stab/eat my words.
[Heather] dd has a very nice 'skip' option as well as 'count'. If your block size ('bs') is set to 1 and you are otherwise able to calculate how long to make the cut, you should be able to do something like substring extraction, on a file basis.
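The "offset and size" idea from this thread can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the audio sits in the jpg as one contiguous RIFF/WAVE block (as the emacs inspection above suggests) and that the RIFF size field is trustworthy. The function name is hypothetical, and nothing here has been tested against an actual HP R607 file.

```python
import struct


def extract_embedded_wav(data):
    """Return the first embedded RIFF/WAVE block found in data, or None.

    A RIFF block starts with b"RIFF", a 4-byte little-endian size, and a
    4-byte form type; the size counts everything after the first 8 bytes.
    """
    start = data.find(b"RIFF")
    while start != -1:
        header = data[start:start + 12]
        if len(header) == 12 and header[8:12] == b"WAVE":
            (size,) = struct.unpack("<I", header[4:8])
            # Offset is `start`; total length is the 8-byte preamble plus size.
            return data[start:start + 8 + size]
        # b"RIFF" may occur by chance in image data, so keep looking.
        start = data.find(b"RIFF", start + 1)
    return None
```

Reading the picture with `open(name, "rb").read()`, passing the bytes to this function, and writing the result to a `.wav` file would be the equivalent of the dd cut described above, with `start` playing the role of 'skip' and `8 + size` the role of 'count'.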
If our gentle readers have more ideas, or someone would like to do an article on really getting the most out of your camera under Linux, it'd be just the kind of thing to make Linux just a little more fun!
Problems with tcsh scripting
I have a number of issues with tcsh (not my choice..) shell scripting I need help with.
Basically I'm writing a shell script that automates a long setup procedure. This top-level script is in bash, however the bulk of it is comprised of commands that are fed to a program which is, in essence, a tcsh shell. I've achieved this by using the << redirector. I need help on two points:
1) Is there any way of suspending input redirection, taking input from
the keyboard, and then resuming input from the tcsh script?
[Ben] There is, but it is Fraught With Large Problems. I'm not that familiar with TCSH, but I've just had to do that in Perl - essentially, I was writing a Perl version of 'more' and had to answer the question of 'how do you take user input from STDIN when the stream being paged is STDIN?' It requires duplicating the STDIN filehandle... but before you start even trying to do it, let me point you to "CSH Programming Considered Harmful" - Tom Christiansen's famous essay on Why You Shouldn't Do That. If it's not your choice, then tell the people whose choice it is that They Shouldn't Do That. The task is complex enough, and has enough touchy problems of its own, that introducing CSH into the equation is like the Roadrunner handing an anvil to the Coyote when he's already standing on shaky ground over a cliff.
To give you some useful direction, however - the answer lies in using 'stty'.
2) There comes a point towards the end of the script when two shell scripts are run simultaneously. These shell scripts open up individual xterm windows to run inside. I'm wondering, is there any way of having the tcsh script monitor stdout of one xterm, and upon the output of a certain piece of text, echoing a command into the stdin of the other xterm?
[Ben] Why not 'tee' the output of the script - one 'branch' into the xterm and the other into a pipemill (or whatever you want to run 'grep' in)?
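The shape of that 'tee' plus watcher arrangement can be sketched as follows. Python is swapped in here purely for illustration (the original question is about tcsh and xterms), and the function name, the trigger text, and the command are all hypothetical. The sketch runs the two programs as plain child processes instead of inside xterm windows.

```python
import subprocess
import sys


def relay_on_trigger(producer_cmd, consumer_cmd, trigger, command):
    """Echo the producer's output; on the first line containing `trigger`,
    write `command` to the consumer's stdin.

    Returns (producer_output, consumer_output) as strings.
    """
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE, text=True)
    consumer = subprocess.Popen(consumer_cmd, stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE, text=True)
    seen = []
    sent = False
    for line in producer.stdout:
        sys.stdout.write(line)      # one branch of the "tee": show the line
        seen.append(line)           # other branch: inspect it
        if not sent and trigger in line:
            consumer.stdin.write(command + "\n")
            consumer.stdin.flush()
            sent = True
    producer.wait()
    # communicate() closes the consumer's stdin and collects its output.
    consumer_output, _ = consumer.communicate()
    return "".join(seen), consumer_output
```

The design choice is that only one process reads the producer's output, so nothing is lost between the display branch and the matching branch.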
Any insight or knowledge on the matter would be very much appreciated. I hope I have provided sufficient details.
[Ben] You have - from my perspective, anyway.
If the gentle readers have more to say, please let Aengus know, and cc The Answer Gang so we can comment on the results or see his answer in a later issue. Articles on working with shells beyond bash are always welcome. -- Heather
Correction to WSGI article
There's a mistake in my "WSGI Explorations in Python" article (http://linuxgazette.net/115/orr.html). It says:
"But both sides of WSGI must be in the same process, for the simple reason that the spec requires an open file object in the dictionary, and you can't pickle a file object and transmit it to another process."
Actually, the spec requires a file-like object, so an emulation like StringIO is allowed. StringIO is pickleable:
>>> import pickle
>>> from StringIO import StringIO
>>> sio = StringIO("abc")
>>> p = pickle.dumps(sio)
>>> p
"(iStringIO\nStringIO\np0\n(dp1\nS'softspace'\np2\nI0\nsS'buflist'..."
>>> sio.read()
'abc'
>>> sio.read()
''    # End of file.
>>> sio2 = pickle.loads(p)
>>> sio2.read()
'abc'
cStringIO, however, is not pickleable.
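The session above is Python 2. For readers on Python 3, where the class lives in the `io` module, a comparable check might look like the sketch below. The helper names are made up for illustration, and the comparison with a real file object is an addition that the original correction does not make.

```python
import io
import os
import pickle


def roundtrip(file_like):
    """Pickle and unpickle an object, as if sending it to another process."""
    return pickle.loads(pickle.dumps(file_like))


def can_pickle(obj):
    """Return True if pickle.dumps accepts the object."""
    try:
        pickle.dumps(obj)
    except (TypeError, pickle.PicklingError):
        return False
    return True


copy = roundtrip(io.StringIO("abc"))
print(copy.read())

# A genuinely open file is refused, which is the case the article had in mind.
with open(os.devnull) as real_file:
    print(can_pickle(real_file))
```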
Debian kernels without devfs
This is regarding Hugo Mills's query on how to build a Debian initrd without devfs.
<flamebait>
Now why would you be wanting to build a Debian kernel without devfs?
Surely you haven't bought all that stuff by Greg K.-H. about devfs
being bad design?
</flamebait>
[Rick] Surely, it would be rude to speak ill of the dead. As we say in Norwegian, "Aluv ha sholem."
I assume that you are planning to use Debian's kernel-package (make-kpkg) utility to build the kernel. This you can do without worrying about anything. Just build a kernel without devfs and other options as you want them.
The initrd that is installed along with the kernel is built (providing you specified that you wanted it to be built) at the time when you install the kernel-image-x.x.x-y package.
This initrd is built by a set of tools called (what else) initrd-tools, the principal among them being "mkinitrd". Now "mkinitrd" takes a conf file, /etc/mkinitrd/mkinitrd.conf, so you can make some changes there.
I haven't tried this, but you would need to create a script in /etc/mkinitrd/scripts that would set up the necessary device files in $INITRDDIR/dev.
More importantly, the script /usr/share/initrd-tools/init is the "init" that is put on the initrd image. You would need to replace this with your own version as the default one makes use of devfs.
If you are keen on sorting out all these issues you should probably contact the maintainer of initrd-tools as Debian's initrd will have to give up "devfs" at some stage since Linux 2.8 won't have "devfs".
Re: [LG 116] mailbag #1
Just a thought, but how about if he compiles a plain-jane Pentium I kernel? It could be that the more recent kernels (and GCC?) might be putting in CPU instructions that didn't get called before, or they are called more often now and are causing other errors. My experience is if the system is locked hard like he implies, then it's probably a hung CPU not answering interrupts (as he alluded to).
[Heather] Except for the part about how it answers them just fine under 2.4.x kernels, this seems plausible. It'll be tried.
He might also try limiting his use to the 768MB of RAM that his MB officially supports, either by using the kernel command line "mem=768M" or by putting in only 768MB of RAM.
As a side note, he might want to run "MEMTest86+" (http://www.memtest.org/) and see if that sees any RAM errors.
There just might be a reason the MB manufacturer didn't recommend >768MB RAM.
[Heather] The full gig worked under 2.4.x kernels; however, some time has passed, so this is a very worthy suggestion. A current 2.4.x kernel will also be tried; if the hardware is failing in this sense, that might be affected too. Note to readers: the good folk at MEMTEST do occasionally update their test suite. Looks like the last update was in 2004 sometime - but some of the rescue disks might have an older edition in them.
TAG "playmidy plays silently" LG#116
Midi plays, but it is not easy. I am running Slackware 9.1 with a 2.4.28 kernel on a K6III 400.
I am using sfxload from the awesfx-0.5.0d package to load soundfonts on to a SBLive soundcard.
If you load the soundfonts and play a midi, you may have to download the soundfonts from the SBLive card prior to playing a wav file. The soundfonts do not stay loaded very long, so you have to check the available SBLive memory prior to playing a midi.
Perhaps this will help.
Latest Gazette Issue
A tickler about last month's issue taking so long to publish... -- Heather
Any news on when the latest Gazette is out? Or have I missed the announcement... Being subscribed on both of my accounts to the announcement list...
[Ben] I figured we'd get mail from our readers about now. The new issue should be out tonight, Martin; the problem was that - this being summer - many of our authors are on vacation, and we were a bit thin on articles. Several of the folks in The Answer Gang had, very capably, scrambled their jets and kicked in a bunch of material, and I've been working to get it all organized. Blame the delay on me.
For all our readers, if you've ever wanted to try your hand at writing - we're always looking for new authors. In the worst case, if your submission gets rejected - and this is part of my commitment to LG - you'll get a note explaining exactly why it was rejected, along with suggestions on how to improve your writing. That is, at the very least, you'll get to learn something useful - and you may well end up getting published, which is not a bad thing to have on your resume. So don't be shy, folks - read our Authors' FAQ and send'em in!
It is our preference to ship an issue somewhere around the first of the month. Our lives, including work in the weekdays, have led the last several issues to come out somewhere around the weekend that's nearest; with the articles that came in late, we ran a bit overtime even by that standard. Sorry. But I also encourage people with good articles that they feel need some work to contact us and get into the sequence. You needn't always publish in the same month - we won't write your article for you, but our Articles@ staff may be able to point out some directions for improvement, and we get a new author out of the deal, too. So everyone wins.
-- Heather
London explosions
I know that Thomas, at least, is away from the City, but - just for my own peace of mind - are all the Answer Gangsters from England OK? That would be Mike Martin and Neil Youngman that I recall; if anyone else can think of others, I'd appreciate it if you could ping them and CC the list.
[Thomas] Aww, thanks, Ben. Actually, I was only 33 miles from London at the time. I was in Stevenage, a town around London visiting my Grandparents. Would you believe they have broadband? Heh. More modern than my parents.
But I'm well, and accounted for.
And I'm very glad to hear it!
[Thomas] I just hope Neil is, although as I know Neil is around the Pin-Green area (where my grandparents are), I would surmise he too is just as well.
Yes indeed; he's emailed me (and as I recall, cc'd the Gang on it.) I
haven't heard back from Mike yet, though - it's a bit worrying. Since
you live in the same small island, could you perhaps walk over and knock
on his door?
[Neil] Ben your concern is appreciated. I was working from home that day, fortunately miles away from all the incidents. My wife was working in London as usual, and the first bomb was on her route in. Luckily she was already at work when it went off. Although she had a very unpleasant journey home, we are lucky it was no worse than an inconvenience for us. With 49 confirmed dead and more than 25 missing, our thoughts are with their families.
[Thomas] Small? Hehehe, Ben, sail your boat round these waters, I'll show you around this place.
[Jimmy] There was a lot of off-topic chat in here too: enjoy the Launderette if you're interested.
The Answer Gang
By Jim Dennis, Jason Creighton, Chris G, Karl-Heinz, and... (meet the Gang) ... the Editors of Linux Gazette... and You!
We have guidelines for asking and answering questions. Linux questions only, please.
We make no guarantees about answers, but you can be anonymous on request.
See also: The Answer Gang's Knowledge Base and the LG Search Engine
Greetings from Heather Stern
Hello everyone -- welcome, once again, to the world of The Answer Gang.
There's a world of people out here doing good things. As my cover art raises a cheery note to those who are part of the space programs (not just for the USA program, though certainly since that's where I live that's the pics I'm looking at) - there's those of us who not only hope for a brighter future but make it so, by our heartfelt efforts and getting our hands deep in code and craft.
As good fortune would have it, I'll get to meet a larger batch of them at this year's LinuxPicnic than last. And I can reasonably hope I'll see a decent batch of you folks at Linux World Expo in my area too.
Why is this important, you might ask. It's the Internet Age; people live on their cellphones, podcast LUG radio reports at each other, spend more time in IRC than visiting their aunts and uncles, mail order things via PayPal or other money-kindred and about a billion online stores. Who notices that real world thingy? The wattage in the Blue Room is up way too high, too. But that's just it -- it may be a big blue room... but it's the same world we all live in... and we've all got a much finer chance of doing our best if we learn to share the bright marble Nature has granted us.
Hooray for open source. Have a great Summer, folks. See you next month.
Network File Systems.
From mso@oz.net
Answered By: Rick Moen, Lew Pitcher, Jimmy O'Regan, Bruce Ferrell, Neil Youngman
What other choice is there besides Samba? Am I wrong for dismissing Samba due to its Microsoft taint?
(I do have to use Samba at work. So far it's been fine except I had to use the "cifs" filesystem instead of "smbfs". Apparently our server pretends to speak the older smbfs but actually doesn't.)
[Rick] Well, if you need help disposing of those troublesome spare CPU cycles, there's always SFS (http://www.fs.net).
Personally, my preferred solution is called SMTP, aka "Please drop the mail off right here where I am, thanks" -- for values of "where I am" equating to "the machine I ssh to, where mutt is left running permanently under GNU screen".
Tridge's reported solution is to use rsync (what else?) to mirror his mbox between his SMTP host and whatever machine he's sitting at.
OK. I meant for the general problem of mounting remote filesystems, not
the specific problem of remote mailboxes.
[Rick] I have nothing against you changing the focus of the discussion in that fashion; I just note that you've done so. Enjoy.
For that there's only NFS and Samba? (rsync, scp, and ftp don't count.)
It looks like SFS on Linux is built on top of NFS, so I'm not sure it counts as a "third" one. http://www.fs.net/sfswww/linux
The reason for my question is, there doesn't seem to be a "good" solution for sharing filesystems on Linux. For years I keep hearing:
NFS: Unreliable! Doesn't play well with file locking!
Samba: Evil! Microsoft! Proprietary protocol! Embrace and extend!
So what's the organization that wants a central fileserver to do?
[Rick] Take your pick:
- AFS, or
- It depends.
Did Microsoft in fact create something better than NFS (better = more reliable
and better designed), or is it just different?
[Rick] They're different. It would take a long time to go through the differences, and I'll leave that to some other poster.
[Lew] I won't comment on "better than NFS or just different", but I will take exception to the implication that Microsoft created the protocol.
[Rick] I also recommend hearing Jeremy Allison give his standard lecture, if people want to hear the full details of just how bad CIFS/SMB really is.
[Lew] I will give Microsoft credit for extending an already existing protocol, but the basics come from IBM's NETBIOS.
[Jimmy] Well, there are actually three different fileserving systems in the heap that is MS fileserving: NetBIOS, CIFS, and DCE DFS. CIFS may or may not depend on one or both of the others.
MS's file serving is much, much better for file locking (but you only get the benefits of that from software that uses MS's locking API). It's also better to use Samba in an environment where there are several Unix variants, and you care about ACLs -- the Samba team 'embraced and extended'[1] CIFS to add marshalling for the various ACL types, which NFS doesn't do. (Well, NFS4 might do, I don't know).
[1] They just like saying that, as far as I can make out. AFAICT, they just added an extension that plays well with others, and looks like any other unknown DCOM interface to clients that don't look for it -- it doesn't get drunk at the party and throw up on the other guests.
[Bruce] Umm Jimmy, aren't NetBIOS/NetBEUI simply transport protocols? I think it might be more appropriate to say SMB and DCE DFS. And DCE DFS is actually built on top of SMB, but I could be wrong there. I just set these things up. I'm too busy to look at the messages anymore. And I think you neglected NCPFS... Not that anyone does much with Novell protocols anymore.
[Jimmy] ...cue brain dump.
This may not be entirely accurate, because: it's 6am, and though I was working nights last night, I didn't sleep much during the day in the hopes of readjusting to normal hours, and this is stuff I mostly learned back in the days of NT4, when I was in college, and went a bit further than strictly necessary in my studies for the MCSE exams I was never able to afford to take, though fortified with some investigations last year when Mike was asking about smbfs vs cifs. The article I wrote about outliners came the month after that, and AFAIR I included an example outliner file that contained some specifics, such as RFC numbers etc.
- NetBIOS is an API (originally, IIRC, a set of DOS interrupts)
- SMB is the part most people think of when they think of NetBIOS, but it actually uses a transport protocol such as NBT (NetBIOS over TCP/IP), NetBEUI, or NetBIOS over SPX.
- SMB, NetBIOS, and possibly NBT were created by IBM
- Microsoft extended these, originally while working with IBM on OS/2 for LanManager (y'know, the easily crackable password hashes that made NT's network authentication so horribly insecure)
- CIFS is a version of SMB that doesn't require NBT -- that's why smbfs and cifs are separate kernel modules -- that also includes extensions for NTFS's features such as multiple file streams[1]
- DCE DFS is something like NFS, over DCE RPC instead of Sun RPC
- Microsoft uses an extended version of DCE RPC, Object RPC, as the basis for DCOM. COM interfaces are implemented locally using C++ vtables, which are marshalled over ORPC, which in turn is passed over NT's (network) Named Pipes: (extended) DCE RPC over SMB/CIFS.
I didn't mention ncpfs because I was answering the second question (NFS vs. Samba), not the first (any network file system).
[1] This is one of the places where NT shows its VMS heritage, as all file streams are preceded by ::$, such as ::$DATA, where the data is contained. (If that seems familiar, it may be because of the IIS flaw where it would send you the unprocessed source of ASP files if you appended ::$DATA to the file name).
And now IBM has dumped OS/2 and encouraged users to migrate to Linux.
http://www-306.ibm.com/software/os/warp/migration.html
http://slashdot.org/article.pl?sid=05/07/15/0245221&tid=136&tid=190
(Scratches head.) There are still OS/2 users out there? And they are more technical than Windows or Mac users who are scared of Linux?
[Neil] Heck, there are still VMS users out there, there are probably still RSX users out there. Generally these are people who have specific requirements, e.g. real time, stability, security, that have had the sense not to jump on the Windows bandwagon, because it doesn't meet their requirements.
So yes, I reckon OS/2 users are generally more technical than "Mac and Windows users who are scared of linux". They may migrate to Linux, but they will do it when they are convinced it meets their requirements better than the alternatives and when it suits them, not a minute before.
[Lew] You bet. OS/2 is/was heavily used in the banking industry as 'Teller' terminal systems and as operator control systems/interface systems to IBM mainframes.
My employer (a Canadian bank) has a multi-year project currently running to migrate our approx 15,000 OS/2 branch workstations and branch servers to another OS. Linux was considered, but in the end, my employer went with WinXP.
Our OS/2 users are no more technical than the cashier in your local grocery store. Our OS/2 applications are quite sophisticated.
Making SSH a supported protocol
From Mark Jacobs
Answered By: Ben Okopnik, Jimmy O'Regan
Gang,
I manage a web server that is used by an internal help desk. Currently this help desk uses telnet to access AIX servers on our corporate WAN. I have multiple pages that serve URLs to the AIX machines, e.g. telnet://hostname. We are in the process of changing all of these servers to use SSH and need to know how to make ssh://hostname a registered protocol so that I can convert my links and have them work. I am unable to find any information on where/how you set up a protocol and associate it with an application. Is this a system or browser issue? Any information you might have or be able to point me to would be a big help.
[Ben] In the future, please send your questions in plain text; that's the accepted format for The Answer Gang. The instructions for setting your mail client to do this, as well as much other relevant information, can be found in the "Asking Questions of The Answer Gang" FAQ at http://linuxgazette.net/tag/ask-the-gang.html
Regarding your question, there's no "registration" that you can do to make SSH magically happen from the server side: URLs are parsed on the client end, by the specific browser that's being used.
Note that some browsers - e.g., Konqueror - do parse 'ssh://' URIs; they fire up a console with a login prompt (which is, of course, the correct response - SSH is a secure SHELL protocol.) Konqueror also supports the 'fish://' protocol - an SSH-based connection that allows file viewing and could be a bit closer to what you want... or maybe not.
The problem is that most other browsers do not support these schemes - and many cannot even be adapted to do so. There's a huge number of browsers operating on a number of OSes, and unless your company has some sort of a draconian software policy, you have no way to restrict them or control which ones people use.
The obvious solution here, in my opinion, is to run a web server, and place your documents on it. Telnet should go away - sending passwords across the network in plain text and IP-based authentication are not sensible things to do in today's world. Running a web server, particularly a simple, read-only one like "thttpd", is a trivial task requiring either no or only a few seconds of configuration, and the replacement of telnet by SSH and HTTP should significantly decrease your vulnerability profile.
[Jimmy] For Mozilla, you can add protocol support using Javascript: the URN support XPI (http://piro.sakura.ne.jp/xul/_urnsupport.html.en) is a good example. (The URL specific code can be found here: http://piro.sakura.ne.jp/xul/codes/urnsupport/content/urnsupport/URNRedirectService.js)
For Konqueror, you add protocol support by writing a KIOSlave. There's a tutorial here: http://www.heise.de/ct/english/01/05/242
For Dillo, you write a DPI: http://www.dillo.org/dpi1.html
If for whatever reason you need to run Internet Explorer using Wine, you can add protocol support by following the example of this mail (http://www.winehq.org/hypermail/wine-patches/2005/06/0776.html - a patch to add support for MS's res: protocol to Wine), and this mail (http://www.winehq.org/hypermail/wine-patches/2005/07/0049.html - registers the protocols). This is Linux/Wine specific though.
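Whichever browser mechanism ends up being used, the handler on the client side eventually receives the URL as a string and has to turn it into an ssh invocation. Here is a minimal sketch of that last step, assuming the handler is an external script that the browser has been configured to call. The function name is hypothetical, and wiring it into any particular browser is not shown.

```python
from urllib.parse import urlsplit


def build_ssh_command(url):
    """Translate an ssh:// URL into an argument list for the ssh client."""
    parts = urlsplit(url)
    if parts.scheme != "ssh" or not parts.hostname:
        raise ValueError("not a usable ssh:// URL: %r" % url)
    destination = parts.hostname
    if parts.username:
        destination = parts.username + "@" + destination
    # Refuse anything the ssh client could mistake for an option.
    if destination.startswith("-"):
        raise ValueError("suspicious destination: %r" % destination)
    command = ["ssh"]
    if parts.port is not None:
        command += ["-p", str(parts.port)]
    command.append(destination)
    return command
```

A wrapper script would typically open a terminal and run the resulting list there; passing it as a list, not as one shell string, avoids quoting problems with odd hostnames.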
Urk... Mutt just did something *ST00PID*
From Benjamin A. Okopnik
Wow. I've got to say that I'm just stunned by the moronic thing that Mutt just did. It's probably the stupidest thing I've ever seen from any Linux app - it rivals 0utlook and IE for complete slack-jawed idiocy.
Once in a while, I get false positives in my spambox. Today, I got one from somebody posting to TAG (I don't recall the name - somebody who had sent it to the wrong address and then bounced the DSN + original mail to TAG), so I saved it to my main mailbox by hitting 'v' (view), selecting the "pre-Spamassassin message", hitting 's' (save), and choosing "/var/mail/ben". When I opened the message, I decided to repeat the operation (i.e., get rid of the "wrapper" message) - so I again hit 'v', selected the original message, and hit 's'. Mutt then popped up a message that said something like "file exists - are you sure?" - and since I had done the same operation dozens of times before, I hit 'y' for 'yes'... at which point, my mailbox got wiped. Zeroed. Nothing left of the 20 or so messages I was going to answer, not even the message that I had theoretically saved. (Mike, your Python article was part of that - so if you could resend, I'd appreciate it.)
I'm in a bit of shock here, and rather pissed off. In all the years I've used Mutt, I never realized that this essentially random bomb was hidden in it - and triggered off by a message that seemed to make sense in the context.
Dammit. Double dammit, since I use my mbox as a sort of a backup "to do" list - I leave emails that call for some kind of action in it until I've completed that action. Grrrrr.
[Kapil] Commiserations for your loss.
Thanks. As best as I can recall, there was nothing really critical or
earth-shakingly important in there, but important enough for the loss to
create a high annoyance factor. Semi-amusingly, two of the messages got
"saved" by my despamming mechanism: they had been sent to me by a reader
who dressed them in spam-like clothing (all-HTML content, funky mail
hosts, etc.), and when I forwarded them to TAG - they were in regard to
Heather's query in the Mailbag - they got spam-slammed again. So,
between a little info message that Kat sent me, Mike's article, and the
two not-spams, I've got four messages back.
[Kapil] You could try ext2 recover mechanisms. They might work.
I hadn't thought of that at the time, and given that the same file had
new mail in it just a few seconds later, and a very large email in it
yesterday evening, I'm pretty sure that there's nothing left.
[Kapil] At one time I had a similar ToDo list at the top of my mbox and my mailer (vm/emacs) of the time did something similar to what mutt just did to you. I was able to recover my ToDo list (though not the more recent stuff in my mbox).
Partly as a result of the above catastrophe, I moved away from vm/emacs but (I think) more importantly, moved away from the mbox format. I am currently an advocate of the MH or maildir formats for personal folders. One mail---one file. Almost no screw-ups by a mail user agent can screw up all my mail again.
[Heather] Unless it screws up the directory. Also mdir index mechanisms can get mangled; though it seems to take more work, it's much wonkier when it does.
I've thought about that in the past. My hindbrain had made some
disquieting noises about not being able to search the archive quite as
effectively - which does not appear to stand up to rational analysis
when considered soberly - and so I'd left it alone.
Hmm, perhaps this is becoming a Gang-relevant question. Folks, what do you think of the pros and the cons of MH vs. Mbox? The net provides much in the way of "yea" and "nay" answers, with only esoterica for support (speed of opening 2,000 messages - wow, very important criterion to me...), and I'd like to hear if anyone has had other positive or negative experiences with either format.
[Rick] (Note: I've studied the pros and cons of Maildir a bit; MH much less so. To a first approximation, I'll assume they're similar.)
1. People with their mailboxes on Nightmare File System need to migrate to Maildir or MH format with all reasonable speed, because of the greatly increased chance of lossage.
Despite Sun's love affair with NFS and their subsequent attempts to
smear it on sandwiches, mix it into house paint as a mold retardant, and
use it for greasing subway trains, I avoid it like the plague.
[Rick] 2. Otherwise, the advantages of Maildir/MH format strike me as somewhat but certainly not overwhelmingly compelling. I vaguely recall that the mutt MUA has an (optional) indexing feature that reduces the performance hit of Maildir. People who've migrated to that seem happy with it.
I still keep absolutely everything in mbox files, anyway, because I'm lazy and set in my ways, because I've not yet been bitten by a glitch the way you were, and because something about huge trees of little files just doesn't seem right.
We seem to share a similar set of prejudices, Rick. "Lazy, set in my
ways, trees of little files vaguely wrong" - yep, that's me to a tee. So
far, I haven't heard any compelling arguments for switching - I was
hoping that somebody had one...
[Kapil] Having already called myself an advocate for MH/maildir, let me point out one disadvantage of MH/maildir on a multi-user system where you do not have control over disk quotas. Both MH and maildir could cause you to run over your file-count (not space) quota. This could also be a problem over NFS (too many NFS file handles ... ).
Y'know, I really enjoy this kind of thing. When I ask these kinds of
questions, people's answers tend to trigger off the "oh, yeah... I
remember reading/hearing/seeing that!" 8 times out of 10. It calls up a
strong echo of Brunner's "Shockwave Rider": "We should not be crippled
by the knowledge that no one of us can know what all of us together
know."
Thanks for the reminder, Kapil!
[Kapil] I have not noticed speed issues. I know "mutt" is a four-letter word in Ben's book right now, but it can employ "header caching" so that MH/maildir folders can be scanned quickly. This also mitigates the NFS problem somewhat. Other mailers may do the same.
Mutt did this one stupid thing in all the time that I've used it - under
a defined set of circumstances. I now know enough to avoid those exact
circumstances; the only thing that's left is a question of "do I trust
Mutt not to present me with other equally boneheaded non-choices?"
Well... there are no guarantees, but I believe that Mutt was written
with the best of intentions (as well as being subject to the Open Source
debugging mechanism), so, yeah, I'm willing to trust it. Conditionally.
[Kapil] Between MH/maildir, the former should be avoided if there is some possibility that the folder could be accessed by two programs at the same time.
To Ben's question I would add the following compatibility-related question: "Do most mail-related utilities handle MH/maildir nowadays?"
[Nod] I'm very much a CLI user by preference. Looking at the list of
Maildir clients, it seems that most of them are GUIs - which militates
against my adopting it. Although there certainly are CLI clients, they -
other than Mutt - are not very common. Given that I log into a variety
of systems to read my mail, often over low-bandwidth links, this
definitely reduces my options.
[Rick] (Again, I don't know much about MH-format support.)
I have information on various Linux MDAs (mail delivery agents) and LDAs (local delivery agents), here, including which mail-store formats they support, where known:
"MDAs" on http://linuxmafia.com/kb/Mail
I have similar information on 123 MUAs (mail user agents = mail clients) known to be available for Linux, here:
"MUAs" on http://linuxmafia.com/kb/Mail
[Sluggo] Actually, speed of opening mailboxes is important to me. I switched from mbox to MH format years ago because it's "safer" and more "ideologically correct" (notwithstanding Rick's comment to the contrary; I just prefer not stuffing multiple things into one file with a program-specific "separator".) But then I switched back so I could get the "You have new mail" messages from zsh.
I think that this has been one of my dimly-sensed, not-fully-formed
objections to Maildir/MH - a feeling that it's not quite as well
supported/debugged as mbox. I don't have much of a problem with the idea
of a defined separator in the file; the only time I've seen it screw up
was when my mail server went bonkers and delivered me a box of messages
where the 1st message was headless (the clue came when I looked at one
of the old emails in the box - the original content had been extended
by something totally unrelated!)
[Sluggo] That seems to work only with file mailboxes, not directory mailboxes. I was concerned about losing mail, then thought, "How often has mbox ever trashed my messages anyway?" I can't think of a single instance where new messages stomped on existing messages. Mutt does sometimes display an mbox message as two messages, split arbitrarily in the middle, with empty headers for the second (and thus a date like Jan 1980), but that was never critical. NB. I don't use NFS, especially not for mail spool directories.
It sounds to me like an upstream mail server that fails to use the "From
hack". The last time I recall that happening was in an email from a
listserv server - about ten years ago.
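For readers who haven't met it, the "From hack" mentioned above can be sketched in a few lines of Perl. This is only an illustration of the simplest form of the convention: `from_quote` is a made-up name, not something taken from Mutt or any delivery agent, and real mail software differs in the details (some variants also re-quote lines that already start with ">From ").

```perl
use strict;
use warnings;

# In an mbox file a line starting with "From " marks the start of a new
# message, so body lines that begin that way must be disguised on delivery.
# Hypothetical helper: prefix such lines with ">".
sub from_quote {
    my ($body) = @_;
    $body =~ s/^From />From /mg;    # /m lets ^ match at every line start
    return $body;
}

my $body = "Hello,\nFrom now on, use Maildir.\nBye\n";
print from_quote($body);
```

A server that skips this step produces exactly the symptom described: the reader sees the unquoted line as a separator and splits one message into two, the second with no headers.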
[Sluggo] But with my current computer I keep my mail on my ISP's mailserver and use IMAP. I'm less concerned about ISP snooping or the mailserver going down than about my own computer going down after it has downloaded mail, or my Internet connection randomly freezing for several hours when I was away from home and had to look up a phone number in an email. But I found mutt's IMAP user interface sucks ass: you can't set an IMAP address as your primary inbox, apparently, meaning you have to type this verbose syntax to access it each time you start mutt. I looked at other mail clients. Kmail kept showing a long-outdated configuration I couldn't override so that alarmed me. I settled on Thunderbird, although it sometimes gets its index pointers out of alignment and won't let me access a new message, so I have to restart it. Someday I'll look over the other clients available. I definitely want a client that can read mail in place in standard formats, rather than one that wants to slurp it up into its own format (and deny access to other mail programs). I think both Thunderbird and Kmail want to slurp up local mailboxes.
[Kapil] This has to be an old version of "mutt". The newer one seems to allow
set spoolfile = imaps://luser@ghost/INBOX
Of course, "mutt" is not particularly good with IMAP. So my current config is based on "offlineimap" which copies all the mail from the server into a local maildir. It also syncs whatever changes I make to the folder back onto the server; in this it improves on "fetchmail" which is one-way.
Of course, this setup means one uses more bandwidth than is strictly necessary just to read those mails that one is interested in. I have so far not suffered any glitches in this setup (but I'm waiting ...).
[Heather] My positive experiences with mdir are about arrival speed on a heavily loaded server. Also, among the IMAP implementations, Courier was about 10x or 11x as fast as wu-imap, which was fragile, and Cyrus was only about 3x. Courier backends with maildirs, though I'm not sure that's where all of its benefit comes from.
The notes and specs rant about mdir being safer but a lock is a lock. They're probably right since I have seen mbox files get bad hiccups from intermingling messages when locks fail. On the other hand individual mails in mdir use up whatever storage unit is going around. Maybe on reiser (where tails that are tiny are often crammed into one storage unit) this is less of a pain.
[Kapil]
> I know that this is a bit like asking you to lock the stable door
> after the horse has bolted it, so commiserations once again.
^^
[ puzzled ] If the horse has already bolted it - very smart horse, that
- why would I bother locking it? Or is this a multiple-level security
implementation?
[Kapil] Thanks to one V. Balaji (Tall-balaji) for this excellent improvement of a classical proverb. I have used this version ever since I learnt it from him.
[Heather] Not only that but after pulling this prank, the horse ran away really fast
Transcoding UTF to ISO8859-1
From Riza Aziz
Answered By: Jimmy O'Regan, Ben Okopnik
Dear Answer Gang,
I am having some problems with reading converted UTF8-encoded web pages on my Palm PDA. I can't figure out how to transcode UTF, which the Palm doesn't understand, into ISO8859-1, which the Palm displays properly.
Some background: I have a script that downloads the latest news from a few sites. It's an ugly RSS look-a-like for sites that don't support RSS. I then use "htmlconv", a Perl script from the txt2pdbdoc package, to convert the downloaded pages into a text-only format which I can then upload to the Palm. The script also converts character entity references (egrave, quot etc.) into ISO8859-1 characters. On the Palm, I use Cspotrun to read the PDB files.
I downloaded the txt2pdbdoc package a long time ago and it worked fine with Redhat 6. When I upgraded to Redhat 9, Perl's UTF handling broke the script because it assumed I wanted the converted web page in UTF. This poses a problem because the Palm doesn't understand UTF characters; accented letters and certain punctuation marks become strange symbols. By adding
use encoding 'latin1'; use open ':encoding(iso-8859-1)';
everything worked again.
Now, I've come across a website (http://www.zmag.org/recent_featured__links.cfm) that uses UTF directly. Instead of using "egrave" for an accented E, it uses the UTF character directly. The converter script doesn't know what to do with the character and I get all sorts of strange symbols when viewing the file on my Palm.
Is there any way to convert the UTF characters directly into ISO8859-1? And how do I get rid of any characters that don't map directly, so strange symbols don't show up on my Palm? I've messed around with the encoding pragmas but I can't get anything to work.
Thanks!
[Jimmy] Righto. I have a silly perl script that prints out the Polish alphabet (so I don't have to trawl through the iso-8859-2 man page for the long names of the odd characters) that looks like this:
See attached alfabet.pl.txt
To get iso8859-1 output, I could replace the last two lines with:
use Encode;
$alfa = encode ("iso-8859-1", $Alfabet);
print "$alfa\n";
print lc "$alfa\n";
(or perl alfabet.pl|recode 'utf-8..iso-8859-1')
To get rid of the extra characters, you'd probably be better off converting to ASCII rather than ISO-8859-1 -- Perl will print a question mark instead. (recode will too, if you use the -f option, to force an irreversible change. Otherwise, it'll stop as soon as it finds a character that it can't convert).
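A minimal sketch of the question-mark behaviour, assuming the Encode module that ships with Perl 5.8 and later. The sample string is invented for the illustration; the point is only that `encode()`, called without a CHECK argument, substitutes "?" for anything the target encoding cannot represent.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Invented sample: e-acute exists in ISO-8859-1, the curly quotes do not.
my $text = "caf\x{e9} \x{201c}quoted\x{201d}";

# With no CHECK argument, unmappable characters become "?" and the
# source string is left untouched, so it can be encoded twice.
my $latin1 = encode('iso-8859-1', $text);   # e-acute kept, quotes become ?
my $ascii  = encode('ascii',      $text);   # e-acute becomes ? as well

print unpack('H*', $latin1), "\n";          # hex dump of the Latin-1 bytes
print "$ascii\n";
```

Whether a row of question marks is acceptable on the Palm is a separate matter; mapping the offending characters to sensible ASCII stand-ins before encoding avoids them altogether.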
I looked everywhere in my system and I can't find recode.
Does it belong to any particular package?
[Jimmy] It's normally in its own package. http://packages.debian.org/stable/text/recode
I did try
substituting all the UTF characters with their common ASCII
equivalents e.g. open & closing quotes with ". I created a
hash as above and used s/// but nothing happened.
One strange thing: the single closing quote character under UTF is \x{2019}, which I tried substituting for. However, running hexdump on the file shows the character is actually E28099... what gives? What can I do to get a straight ASCII dump of the file?
[Jimmy] From http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
...............
The following byte sequences are used to represent a character. The sequence to be used depends on the Unicode number of the character:

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The xxx bit positions are filled with the bits of the character code number in binary representation. The rightmost x bit is the least-significant bit. Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.

Examples: The Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as
11000010 10101001 = 0xC2 0xA9
and character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:
11100010 10001001 10100000 = 0xE2 0x89 0xA0
...............
If you want to look at the raw text, just use a text editor that isn't unicode aware.
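The same thing can be checked from Perl itself. The short sketch below, which assumes only the core Encode module, encodes the code point U+2019 as UTF-8 and prints the resulting bytes in hex; they are the three bytes hexdump reported.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+2019 (right single quotation mark) is a code point, not a byte sequence.
my $char  = "\x{2019}";

# Encoding it as UTF-8 yields the bytes that actually sit in the file.
my $bytes = encode('UTF-8', $char);
my $hex   = uc unpack('H*', $bytes);

print "U+2019 as UTF-8: $hex\n";    # prints: U+2019 as UTF-8: E28099
```

Working it by hand against the three-byte row of the table: 0x2019 is 0010 0000 0001 1001, which fills 1110xxxx 10xxxxxx 10xxxxxx as 11100010 10000000 10011001, i.e. E2 80 99.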
[Jimmy] One thing to be aware of when dealing with files created on Windows (as the page you pointed to was) is that Windows usually uses UTF-16LE rather than UTF-8.
Yeah, created with MS Frontpage. Blech
All this time I
thought that ISO8859-1 was the standard encoding for web
pages, with other encodings used for Chinese, Japanese
script etc. Is mixing and matching allowed?
[Jimmy] I don't know, to be honest. I'd assume that if you want to do that, you'd be strongly encouraged to use Unicode. (Umm... actually, these days, you're encouraged to use XHTML rather than HTML. Since XHTML is based on XML, it's UTF-8 unless stated otherwise).
Thanks for the link to an excellent website on most things
Unicode. It cleared my misconception that Unicode sequences
correspond to actual byte sequences, when they don't e.g.
\x{2019} is actually E28099, not 2019.
[Jimmy] Erm... be careful with your phrasing there. The part that's written to disk is an encoded version of the Unicode sequence. UTF-7 is a good example of how it works: IIRC, it's UTF-16 encoded with a modified Base64.
I think I have the
problem mostly solved. I added the following pragmas:
use encoding 'utf8';
open( OUTPUT, '>:encoding(iso-8859-1)', "$txt_file" )
So, the script processes everything in Unicode but spits out the results in ISO-8859-1.
The hard bit for me is the substitution. The following snippet is supposed to do the substitution, but it doesn't work:
%utf_entity = (
"\x{2019}", '"',
"\x{201c}", '"',
"\x{201d}", '"',
);
s/(\X+);/exists $utf_entity{$1} ? $utf_entity{$1} : $1
/eg;
Instead, I get an error for each non-matching Unicode character:
"\x{2019}" does not map to iso-8859-1 at
/home/riza/bin/htmlconv-utf line 302
However, using s/\x{2019}/"'"/eg, s/\x{201c}/"'"/eg and so on for every non-matching character works. It's a really clunky way of doing things but the resulting file displays perfectly on the Palm. How do I match hex sequences for non-matching Unicode characters in a regex, without wiping out all other characters?
[Jimmy] OK, so you have something like this:
See attached utf-1.pl.txt
With some functions from the Encode module, you get the right output:
See attached utf-2.pl.txt
The regex still isn't working though. Let's break it down:
s/(\X+);/
I'll assume the semi-colon was a typo. I think the pattern should really be (\X) though; you're using it to match individual characters against a hash, so you don't want to get more than one character. If you want to see what you're matching, you could use something like this:
s/(\X)(?{print "Matched: $^N\n"})/
and shouldn't it be $utf_entity{"$1"} instead of $utf_entity{$1} ?
But do you really need to do that stuff with the hash when you could use tr instead?
tr [\x{2019}\x{201c}\x{201d}] ["];
Jimmy O'Regan wrote:
> and shouldn't it be $utf_entity{"$1"} instead of $utf_entity{$1} ?
OK, no. Here's a working version:
See attached utf-3.pl.txt
I think the problem was with the encoding of the file
handle and a bit with the regex itself. Below are snippets
of my version of the converter script. However, I'm
following the original author's method of slurping up all
the input and putting it into $_, whereas your script loops
through the input. Which method is better?
[Jimmy] How long is a piece of string?
(No, really - if you know your file will fit into memory, your way is better; otherwise my way is better, I think.)
[Ben] That being the big caveat - although HTML files are not likely to be so huge that they'd cause an OOM on a modern machine.
open( INPUT, '<:encoding(utf8)', "$html_file" ) or
die "$me: can not open $html_file for input\n";
$_ = join( '', <INPUT> ); # slurp up all of HTML
[Ben] This is not a good idea. You're reading in <INPUT> as a list (which takes ~5x the memory for the amount of data), then "join"ing the list - seems rather wasteful, particularly since you don't need to do any of the above (entities are not going to be broken across lines.) For future reference, try this:
open Fh, "foo" or die "foo: $!\n";
{
local $/; # Undef the EOL character
$in = <Fh>; # Slurp the content in scalar context
}
close Fh;
close( INPUT );
if ( $txt_file ) {
open( OUTPUT, '>:encoding(iso-8859-1)', "$txt_file" ) or
die "$me: can not open $txt_file for output\n";
select OUTPUT;
}
### various HTML-stripping bits here
%utf_entity = (
"\x{2019}", "'",
"\x{201c}", '"',
"\x{201d}", '"',
"\x{2026}", "...",
"\x{fffd}", "",
);
s/(\X)/ exists $utf_entity{$1} ? $utf_entity{$1} : $1 /eg;
print "$_\n";
The above regex works. I found that I didn't have to use $_ = encode_utf8($_) to get it running, as long as the non-matching UTF characters were stripped out before output. If a character was left in, its Unicode sequence in plain text was shown in the output file e.g. \x{fffd}.
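For anyone who wants to try the idea outside the full converter, here is a self-contained sketch along the same lines. It is not the attached script: `to_latin1` and the sample string are made up for the illustration, and instead of testing every character with (\X) it builds one character class from the hash keys so that only listed characters are touched.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Replacement table: typographic characters with no ISO-8859-1 equivalent.
my %utf_entity = (
    "\x{2019}" => "'",
    "\x{201c}" => '"',
    "\x{201d}" => '"',
    "\x{2026}" => '...',
    "\x{fffd}" => '',
);

# Build a character class such as [\x{2019}\x{201c}...] from the hash keys.
my $class = join '', map { sprintf '\x{%04x}', ord } keys %utf_entity;
my $re    = qr/([$class])/;

# Hypothetical helper: substitute first, then encode to Latin-1 bytes.
sub to_latin1 {
    my ($text) = @_;
    $text =~ s/$re/$utf_entity{$1}/g;
    return encode('iso-8859-1', $text);
}

my $sample = "\x{201c}It\x{2019}s caf\x{e9} time\x{2026}\x{201d}";
print to_latin1($sample), "\n";
```

Adding a character to the table is then enough to have it handled; anything still unmappable after the substitution comes out as "?" from `encode()` rather than as a literal \x{...} sequence.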
[Jimmy] It dawned on me this morning that I only needed to use the open() stuff or the Encode stuff, but I'll just stare at the ground, shuffle my feet, and mutter something about being doubly sure.
That stuff was a hangover from pasting a line in the wrong place, before I noticed you were trying to match too much.
[Ben] As to the script itself, well -
See attached utf2iso-8859-1.pl.txt
Use redirection to write your output to a file - or pipe it into something else for further processing.
I think your way (of looping through the file) is actually
better for a large range of file sizes.
I usually "cat" a bunch of HTML files together before
converting them with the script. Using the slurp method, a
cobbled-together file of 1 MB or more pretty much kills the
computer. It just sits there, not processing but taking
up a lot of memory. After 10 minutes I have to kill the
process.
OTOH if you loop over the file, that would allow it to better allocate memory, I guess?
[Ben] Depends on your meaning of "loop over" - if you load the entire file into memory and loop over it (as both slurping and the 'for' loop will), then no. This is one of the things I constantly emphasize to my Perl students: you can easily bring down your machine by slurping files - do not do it unless you're very confident that the maximum file size will be no more than a tiny fraction of the memory.
# Wrong ways for arbitrary file sizes, OK for small files:
### Slurp into array
@file = <Foo>;
### Load <Foo> into memory as an array
for $line ( <Foo> ){ do_stuff( $line ); }
### Load Foo into memory as a string
{ local $/; $file = <Foo>; }
# Right ways when in doubt:
### Read the filehandle one line at a time
while ( $line = <Foo> ){ do_stuff( $line ); }
### Read a paragraph at a time, in case there are continuation lines
### (e.g., mail headers)
{ local $/ = "\n\n"; while ( $line = <Foo> ){ do_stuff( $line ); } }
Many thanks for the help! I'm attaching the whole script,
in case someone might have use for it.
See attached htmlconv-utf.pl.txt
[Ben] Both Jimmy and I mentioned the reasons why slurping can be dangerous, but there are times when you can't avoid it - although constructs like this one tend to handle most of the "line-continuation" scenarios:
while ( $line = <Fh> ){
while ( test_for_incomplete_line( $line ) ){
$line .= <Fh>;
}
# Process $line further
...
}
In other words, if there's some metric you can use for distinguishing an incomplete line, then you don't need to slurp. Conversely, if you're looking at formatted text, you can also avoid slurping by processing a paragraph at a time:
$File = "/foo/bar/gribble.qux";
open File or die "$File: $!\n";
{
local $/ = "\n\n"; # Define EOL
while ( <File> ){
process_content();
}
}
Asking which method is "better" makes no sense until you consider the data that you're processing. In your case, since entities don't break across lines, slurping is unnecessary - so processing it a line at a time is quite sensible.
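Applied to the conversion at hand, a line-at-a-time version might look like the sketch below. It is an illustration under stated assumptions, not the attached script: `convert_lines` is a made-up name, the quote mapping is a small sample, and in-memory filehandles stand in for the real input and output files so the example runs on its own.

```perl
use strict;
use warnings;

# Copy $in to $out one line at a time, replacing curly quotes with plain
# ones. Memory use is bounded by the longest line, not by the file size.
sub convert_lines {
    my ($in, $out) = @_;
    while ( my $line = <$in> ) {
        $line =~ tr/\x{2018}\x{2019}\x{201c}\x{201d}/''""/;
        print {$out} $line;
    }
    return;
}

# Demo on in-memory "files"; the I/O layers do the transcoding.
my $utf8_bytes = "\xE2\x80\x9CIt\xE2\x80\x99s\xE2\x80\x9D\ncaf\xC3\xA9\n";
my $latin1     = '';
open my $in,  '<:encoding(UTF-8)',      \$utf8_bytes or die "in: $!\n";
open my $out, '>:encoding(iso-8859-1)', \$latin1     or die "out: $!\n";
convert_lines( $in, $out );
close $in;
close $out;

print length($latin1), " bytes written\n";
```

For real use, the two open() calls would name the HTML and text files instead of scalar references; the loop itself does not change.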
Thanks! I guess the script's original author wrote it only
to convert small, single HTML files instead of huge HTML
lumps of multiple files.
[Ben] That would be my guess. Either that, or he didn't even consider the issue. Either way, handling the problem in the script would have been trivial:
# After processing all the command-line options, loop over files
for ( @ARGV ){
if ( -s > $MAX_SIZE ){
warn "File $_ rejected: TOO LARGE!\n";
next;
}
process_files( $_ );
}
Just curious, what kind of data would have entities spread
across multiple lines ie. binary data? Even plain text
would be terminated with CR or CR/LF, correct?
[Ben] Well, what makes them entities is that they're atomic - i.e., irreducible units. That means that they _can't_ be broken, across lines or in any other way - otherwise they become, erm. non-entities.
[Jimmy] Erm... not quite. Let me veer off-topic a little...
In SGML and XML you can define your own entities, which can contain pretty much anything you want -- a multi-line disclaimer, for instance. Since the trend in browsers has moved towards generic XML browsers that render using CSS or XSL stylesheets (but with a fallback mode to handle the mangled mess that is HTML), defining your own entities is possible, though not advisable.
You can define binary data as an entity, but anything other than plain text will at best be ignored, at worst cause an error. If you want to output a CR, for example, you would have to use XSL-FO (though there is a way to preserve whitespace from XSL, including CR or CR/LF, you just can't do it as flexibly as plain text).
(Defining your own tags is OK, though - browsers ignore tags they don't understand. Most HTML browsers can handle XHTML[1] because of this).
[1] That is, as long as the XHTML namespace isn't prefixed, and some browsers have trouble with <br/>, <hr/>, etc. (Though they can manage <br />, etc)
[Ben] Jimmy, correct me if I'm wrong but - we were speaking of HTML character entities, right? Otherwise, the methane-based entities from the Dravidian cluster are going to complain about discrimination. If we're going to bring in every other kind of entity and ignore them, it'll look like solid grounds for a lawsuit.
To restate my point with a bit more precision, though: HTML character entities cannot be broken - otherwise, they'll be, well, - broken.
[Jimmy] Sorry. Thought the original question was something completely different.
Thanks for all the help in solving this UTF problem and
giving me some insights into the murky world of Unicode, at
the same time cleaning up my atrocious Perl. You're all a
real blessing for the Linux community. Keep up the
excellent work!
[Ben] You're certainly welcome; glad we could help!
Submitters, send your News Bytes items in
PLAIN TEXT
format. Other formats may be rejected without reading. You have been
warned! A one- or two-paragraph summary plus URL gets you a better
announcement than an entire press release. Submit items to
bytes@linuxgazette.net
Patents
The European Union directive on the patentability of computer-implemented inventions has been rejected by the European Parliament by a large margin; the final tally was 648 votes to 14, with 18 abstentions. This high turnout came following intense lobbying on all sides in the run-up to the vote. As reported by The Register, the directive seemed to hemorrhage support as the vote approached. The pro-patent camp became afraid that the anti-software-patent amendments might be reintroduced and given a second stamp of democratic approval (the Commission could still shelve the whole thing, but that could be politically difficult). Meanwhile, the anti-patent activists have been keen to kill this directive, which they see as having been severely tainted by the involvement of big (huge!) business pro-patent interests.
In the aftermath of this decision, both sides have tried to claim success. The Commission, which had been pushing hard for software-patentability, portrayed the vote as offering support for the current status quo, where software patents are being tacitly allowed by the EPO. However, a UK court decision to reject a software patent on the basis of Article 52 of the EPC (European Patent Convention) points to the possibility of better enforcement of the current regulations on software-patentability.
Cisco
The recent behaviour of Cisco regarding the publication of a flaw in its products has highlighted the ways in which legal proceedings can be used to the detriment of individuals and indeed the security of a community. This story centres on the decision of Michael Lynn, an employee of Internet Security Systems, to publicly announce a flaw in Cisco's IOS (Internetwork Operating System) software. Lynn came to his decision to go public after Cisco had been notified of the vulnerability but had failed to remedy the fundamental problem. As Lynn has noted, the source-code to Cisco's IOS has been stolen twice, so he felt there was a significant chance that outside parties would soon be able to develop a practical exploit unless measures were taken to force Cisco to patch the flaw.
When Cisco became aware of Lynn's decision to speak at the Black Hat Conference, pressure was put on ISS, Lynn's employers, to prevent him from going through with his presentation. Lynn was also personally threatened with legal action. Following this pressure, Lynn resigned from his position at ISS, but went ahead with his presentation.
The basis for Cisco's legal attack on Lynn was that he had illegally obtained his information, as to do his research he had violated the Cisco license agreement with regard to reverse engineering. Although in the immediate aftermath of Lynn's presentation he was still being threatened with legal action, a settlement has since been reached. The terms of this include preventing Lynn from further using the Cisco code in his possession for reverse engineering or security research, and he is also forbidden from presenting his research on this flaw again. In the meantime, Michael Lynn is looking for a new job.
Bruce Schneier has posted (and updated) a very good summary and analysis of this case on his blog.
Preliminary work is underway to launch an EFF-like organisation for Britain
Joel Spolsky has reviewed Eric Raymond's book, The Art of Unix Programming. Incidentally, the entire book is available online.
Five addictive open-source games
Linux & Scaling: the Essentials
OpenOffice.org, FOSS, and the preservation of Gaelic
MythTV: Easy personal video recording with Linux
Norwegian government backs open source
Another country pushes towards Linux. The Norwegian Minister for Modernisation, Morten Andreas Meyer, is asking governmental institutions to prepare, before the end of 2006, plans for the use of open source. In particular, it is hoped to avoid the use of proprietary formats for communication with citizens. (courtesy Howard Dyckoff).
Linux vs Windows-Mobile
It has been reported that embedded Linux powered 14 percent of smart phones shipped worldwide in Q1. Meanwhile, Windows Mobile shipments made up just 4.5 percent of the market (courtesy Howard Dyckoff).
Critical MySQL Flaw Found
Asterisk@Home
Asterisk@Home is a GNU/Linux distribution aimed at lowering the level of technical skills required for home users to be able to make use of Asterisk, the open source PBX (Private Branch Exchange) telephony software. NewsForge has a detailed article on this distribution.
Debian
The Debian project has moved to reassure users by confirming that both the new current release, Debian GNU/Linux 3.1 (alias sarge), and the former release (3.0, alias woody) enjoy the benefits of a working and effective security infrastructure. This reassurance followed a brief period after the release of Sarge, during which issues with the security infrastructure prevented the issuing of updates to vulnerable packages.
From Debian Weekly News: following the recent release of a new Debian GNU/Linux stable version, readers may be interested to peruse an online screenshot tour.
Progeny and a handful of other Debian GNU/Linux distributors are planning to form a shared Debian GNU/Linux distribution for enterprise applications. Ian Murdock (the "Ian" in debIAN, and Progeny head honcho) has commented on this development, and it was also discussed on the LQ Radio Show.
The Debian project has announced that this year's Debian Conference was a great success with more than 300 people attending and over 20 sponsors. One highlight was the presentation about the large-scale deployment of 80,000 Debian workstations in Extremadura, Spain. The presentations were captured by the video team and are available online.
Foresight
Foresight Linux is a GNU/Linux distribution showcasing some of the newest developments in Gnome (e.g. beagle, f-spot, howl, and hal). Mad Penguin has taken a look at this distribution.
FreeSBIE
Though it is not of course based on Linux, many GNU/Linux enthusiasts will doubtless be interested to learn of the existence of FreeSBIE, a FreeBSD based liveCD. This software has been featured on NewsForge.
Knoppix
The Knoppix bootable GNU/Linux liveCD is now also available as a version 4.0 DVD including a huge selection of software. Kyle Rankin has reviewed this Knoppix version for O'Reilly's linuxdevcenter.com.
Elive
Coinciding with the release of version 0.1 of the Debian based Enlightenment liveCD project, NewsForge has plugged a screenshot tour of the distribution.
Puppy
Puppy Linux has been profiled in NewsForge's My Workstation OS series.
C/C++ interpreter Ch 5.0 for Linux PPC Released
SoftIntegration, Inc. has announced the availability of Ch 5.0 and Embedded Ch 5.0 for Linux on PowerPC Architecture. Supported platforms include iSeries, pSeries, OpenPower, JS20 Power based Blades and zSeries from IBM as well as computers from Apple Computer. Ch is an embeddable C/C++ interpreter for cross-platform scripting, 2D/3D plotting, numerical computing, shell programming and embedded scripting. The release of Ch and its toolkits for Linux PPC continues SoftIntegration's involvement in cross-platform scripting, numerical computing and embedded scripting. Ch Control System Toolkit, Ch Mechanism Toolkit, Ch CGI Toolkit and C++ Graphical Library are available in Linux PPC as well.
Apache HTTP Server 2.1.6-alpha Released
The Apache Software Foundation and The Apache HTTP Server Project have announced the release of version 2.1.6-alpha of the Apache HTTP Server ("Apache"). The 2.1.6-alpha release addresses a security vulnerability present in all previous 2.x versions (but not present in Apache 1.3.x). Apache HTTP Server 2.1.6-alpha is available for download.
Sun and Open Source
Sun has announced that it will open source the next release of its Java Application Server. It also plans to release its Instant Messaging code as open source. This will take place under the CDDL license, also used for Sun's OpenSolaris project. (Courtesy of Howard Dyckoff)
Mick is LG's News Bytes Editor.
Before this, Michael worked as a lecturer in the Department of
Mechanical Engineering, University College Dublin; the same
institution that awarded him his PhD. The topic of this PhD research
was the use of Lamb waves in nondestructive testing. GNU/Linux has
been very useful in his past work, and Michael has a strong interest
in applying free software solutions to other problems in engineering.
Originally hailing from Ireland, Michael is currently living in Baden,
Switzerland. There he works with ABB Corporate Research as a
Marie-Curie fellow, developing software for the simulation and design
of electrical power-systems equipment.
By Anonymous
This article is a follow-up to Maxin B. John's article, which introduced us to the Festival text-to-speech synthesizer and some possible applications. Here, we will push it a bit further and see how we can convert ebooks from the most common formats like HTML, CHM, PS and PDF into audiobooks ready to send to your portable player.
With the high availability of cheap and small portable MP3 players these days, it has become very convenient to listen to books and articles just anywhere when you would not necessarily have the time to read them. Audiobooks usually require very small bit-rates, and hence very small sizes - and as a consequence they are the most suitable content for the cheap/small capacity MP3 players (128 MB or less).
There are lots of websites out there catering to audiobook needs with a wide range of choices. However, it might happen that you really want to read that article or book that you found on the web as a PDF or as HTML, and there is probably no audio version of it available (yet). I will provide you with some scripts that will enable you to convert all your favorite texts into compressed audio files ready to upload and enjoy on your portable player. Here we go!
Most of these tools are packaged in the main Linux distributions. Once you have all of the above installed, we can start the fun. We will begin with one of the most common formats for ebooks: Adobe PDF.
#!/bin/sh -
# requires ghostscript (ps2ascii), festival (text2wave) and lame
chunks=200
if [ "$#" -eq 0 ]; then
    echo "Usage: $0 [-a author] [-t title] [-l lines] <ps or pdf file>"
    exit 1
fi
while getopts "a:t:l:" option
do
    case "$option" in
        a)author="$OPTARG";;
        t)title="$OPTARG";;
        l)chunks="$OPTARG";;
    esac
done
shift $((OPTIND-1))
# convert to plain text, then split it into files of $chunks lines each
ps2ascii "$@" | split -l "$chunks" - tmpsplit
count=1
for i in tmpsplit*
do
    # one MP3 per chunk, tagged with author, title and track number
    text2wave "$i" | lame --ta "${author:-psmp3}" --tt "$count ${title:-psmp3}" \
        --tl "${title:-psmp3}" --tn "$count" --tg Speech --preset mw-us \
        - "abook${count}.mp3"
    count=$((count + 1))
done
rm tmpsplit*
First 'ps2ascii' converts the PDF file or Postscript file to simple
text. That text is then split into chunks of $chunks lines; you
might have to tweak that value, since splitting the book into more than 255
files might cause trouble in some players (the ID3v1 track number tag can
only go up to 255.) After that, each chunk is processed by text2wave and
the resulting audio stream is sent directly to 'lame' through a pipe. The
encoding is performed with the mw-us preset, which is mono ABR
40 kbps average at 16 kHz. That should be enough, since Festival outputs a
voice sampled at 16 kHz by default. You can leave it as it is, unless you
are using a voice synthesizer with a different sampling rate. Refer to
lame --preset help for optimum settings for different sampling
rates.
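As a rough aid for picking that value, the sketch below computes the smallest chunk size that keeps the number of output files at or under a limit. The function name lines_per_chunk and the 60000-line example are assumptions made for illustration; they are not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: smallest chunk size (in lines) that keeps the
# number of output files at or below a limit (255 by default).
lines_per_chunk() {
    total=$1
    max_files=${2:-255}
    # Integer ceiling division: (total + max_files - 1) / max_files
    per=$(( (total + max_files - 1) / max_files ))
    # Never go below one line per chunk (covers an empty input)
    if [ "$per" -lt 1 ]; then
        per=1
    fi
    echo "$per"
}

# Example: a book that comes out as 60000 lines of text
lines_per_chunk 60000    # prints 236
```

The result could then be handed to the script through its -l option, for instance after counting the lines with ps2ascii my.pdf | wc -l.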
When you input the artist or title, do not forget to quote the string if it includes spaces; for example:
ps2mp3 -a "This is the author" -t "This is the title" my.pdf
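To see why the quotes matter, here is a small stand-in for the option parsing, separated from the audio pipeline. The function name show_tags and the sample author and title are made up for this illustration; the getopts loop and the ${author:-psmp3} defaults follow the script above.

```shell
#!/bin/sh
# Hypothetical stand-in for the option parsing used in ps2mp3.
show_tags() {
    author=
    title=
    OPTIND=1    # reset so the function can be called more than once
    while getopts "a:t:" option
    do
        case "$option" in
            a) author="$OPTARG";;
            t) title="$OPTARG";;
        esac
    done
    # Same fallback as the script: use "psmp3" when a value is missing
    echo "author=[${author:-psmp3}] title=[${title:-psmp3}]"
}

# Quoted: the whole name reaches -a, and -t is parsed as well
show_tags -a "Jules Verne" -t "Around the World" book.pdf

# Unquoted: -a only receives "Jules"; getopts stops at "Verne",
# so -t is never seen and the title falls back to its default
show_tags -a Jules Verne -t "Around the World" book.pdf
```

In the unquoted call, the leftover words would also end up among the file arguments after the shift.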
Next, we are going to see how to convert to an audio file from the most common format: HTML.
#!/bin/sh -
#requires lynx, festival and lame
if [ "$#" -eq 0 ]; then
    echo "Usage: $0 [-a author] [-t title] <html file1> <html file2> ..."
    exit 1
fi
while getopts "a:t:" option
do
    case "$option" in
        a)author="$OPTARG";;
        t)title="$OPTARG";;
    esac
done
shift $((OPTIND-1))
count=1
for htmlfile in "$@"
do
    section=`expr match "${htmlfile##*/}" '\(.*\)\.htm'`
    lynx -dump -nolist "$htmlfile" | text2wave - | lame --ta "${author:-html2mp3}" \
        --tt "$count. ${section:-html2mp3}" --tl "${title:-html2mp3}" \
        --tn "$count" --tg Speech --preset mw-us - "${section}.mp3"
    #rm /tmp/est_*
    count=`expr $count + 1`
done
The first part of the script, up to the shift command, extracts the optional parameters from the command line. The for loop then runs over the list of HTML files, which are the remaining arguments given at the command line. Inside the loop, "${htmlfile##*/}" strips out everything up to and including the last "/" character - useful if we are dealing with URLs or a directory path - so only the filename remains. Then the '\(.*\)\.htm' regular expression takes care of the extension of the file so the variable section holds only the stem of the file. It will be used to tag and name the resulting MP3 files.
The lynx pipeline is really the heart of the script: first, 'lynx' takes an HTML
file as input and dumps its text to stdout. That output is piped to
'text2wave' and converted into a WAV-encoded stream, which is then piped to
'lame' to be encoded with the mw-us preset and id3-tagged with
the artist/title/speech genre.
Note that the script can also take URLs as arguments, since they are directly sent to lynx.
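The filename handling can be tried on its own, without lynx or Festival. In the sketch below, stem_of is a hypothetical helper and the paths are invented; it uses the expr STRING : REGEX form, which is the POSIX spelling of what expr match does in GNU coreutils.

```shell
#!/bin/sh
# Hypothetical helper mirroring the two steps used in html2mp3.
stem_of() {
    name=${1##*/}                   # drop everything up to the last "/"
    expr "$name" : '\(.*\)\.htm'    # print the part before ".htm"
}

stem_of /home/user/book/chapter01.html      # prints chapter01
stem_of http://example.org/docs/intro.htm   # prints intro
```

If the name does not contain ".htm", expr prints an empty line, which means the script above would end up writing a file called just ".mp3".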
This html2mp3 script is going to be
very useful for our next step, which is converting from CHM to MP3.
CHM files are a proprietary format developed by Microsoft, but basically they are just compiled HTML files with an index and a table of contents in one file. Their use as an ebook format is certainly not as widespread as HTML or PDF, but as you will see, it is pretty straightforward to convert them to audio files once you have the right tools.
#!/bin/sh -
#requires archmage and html2mp3
if [ "$#" -eq 0 ]; then
    echo "Usage:"
    echo " $0 [-a author] [-t title] <chm file>"
    exit 1
fi
while getopts "a:t:" o
do
    case "$o" in
        a)author="$OPTARG";;
        t)title="$OPTARG";;
    esac
done
shift $((OPTIND-1))
# extract the CHM into a temporary directory, convert every page, clean up
archmage "$1" tmpchm
find tmpchm -name "*.htm*" -exec html2mp3 -a "$author" -t "$title" {} \;
rm -fr tmpchm
archmage is a Python-based script that extracts HTML files from
CHM. You will need to have Python installed to get it to run.
Unlike 'ps2mp3', 'chm2mp3' does not require an arbitrary decision on where to split the book: every page compiled into the CHM file becomes its own audio file. All we need to do is extract these pages with 'archmage' and convert them with 'html2mp3'.
We are using the find command to recursively search for HTML files in the CHM book that we extracted, since sometimes the HTML files are stored in subdirectories inside the CHM. Then, for each HTML file found, we call 'html2mp3'.
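One side effect of ending -exec with \; is that html2mp3 is started once per page, so its track counter begins at 1 every time. Ending -exec with + instead hands many pages to a single invocation. The sketch below only illustrates that difference on a throwaway directory with made-up file names, using a small sh -c command that prints its argument count in place of html2mp3; it is a possible variation, not what the script above does.

```shell
#!/bin/sh
# Build a throwaway tree that imitates an extracted CHM (hypothetical names).
demo=$(mktemp -d)
mkdir -p "$demo/part1" "$demo/part2"
touch "$demo/index.html" "$demo/part1/a.htm" "$demo/part2/b.html" "$demo/part2/notes.txt"

# Terminated with \; the command runs once per matching file,
# so counting the output lines counts the invocations.
per_file=$(find "$demo" -type f -name "*.htm*" -exec sh -c 'echo "$#"' sh {} \; | wc -l | tr -d ' ')

# Terminated with + the matches are passed together,
# so a single invocation reports all of them as arguments.
batched=$(find "$demo" -type f -name "*.htm*" -exec sh -c 'echo "$#"' sh {} +)

echo "separate invocations: $per_file"
echo "arguments in one batched invocation: $batched"
rm -rf "$demo"
```

Note that find returns the files in directory order, so a batched run would still not guarantee that the track numbers follow the book's table of contents.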
Remember that it can take a while to encode several dozen pages of text to speech and then to MP3. But you do not need to encode a full book to start uploading and enjoying it on your portable player.
Another recent article on Festival and TTS synthesis software
JavaOne was huge this year, with 15,000 conference attendees and over 200,000 on-line visitors. The world's biggest Java Developer event got lots of attention, but for more than just its attendance numbers. Besides deep structural changes to simplify the Java programming paradigm, Sun dipped more of its corporate toes into the waters of Open Source Software after its recent release of its flagship Solaris OS under the CDDL (Common Development and Distribution License).
While Java isn't free of Sun licensing encumbrances, more of it is more open to the Java developer community and key Sun software efforts are becoming OSSw projects. Leading the trend is the next version of Sun's Java Application Server, to be called Project Glassfish. This is a contribution of over 1 million lines of code! Sun's current developer group will seed the project under CDDL. Developers can view the latest daily updates, contribute to fixes and features, and join in discussions at http://glassfish.dev.java.net.
Sun is also sharing its Java System Enterprise Service Bus (Java ESB) under the OSI-approved CDDL license [ also being used for Sun's OpenSolaris project ]. While the idea of an ESB isn't new, this is the first major effort that will be OSSw-based. ESBs are based on the Java Business Integration (JBI) specification (JSR 208).
And if that isn't interesting enough.... Sun is donating 135,000 lines of collaboration-focused communication source code from its Sun Java System Instant Messaging and Sun Java Studio Enterprise products for use by the entire open source community on NetBeans.org.
The collaboration software, which was demoed both at a Keynote and at the free NetBeans Developer Day that preceded JavaOne, is designed to increase productivity by enabling Java developers to dynamically work together anywhere around the world. It also offers corporate types a roll-their-own IM app. Both demos worked fine, but your mileage may vary.
That free NetBeans Day may have been a bit of self-promotion, but the sessions were decent and the demos were later repeated at JavaOne. Included was a very nice looking Java-based software CD player that also works nicely and is downloadable at the NetBeans web site.
Sun is also making previews of the next versions of Java available to allow greater community contributions. And Sun is simplifying its hefty naming scheme ["Java 2 Standard Edition version 5.xxx"] to just Java SE 5 [or Java EE 6]; the '2' is gone. And the release versions will not be dotted in their names. That will save some ink and even some paper [ older versions will remain unchanged. ]
Big numbers for Java: Sun claims that some 2 billion devices have Java in some form, including 708 million mobile devices, 700 million desktops and an incredible 825 million Java-enabled smartcards! That's a whole lotta Java.
Japanese telecom NTT/Docomo is spending a lot on Java too. Some 60% of the billion dollars it's investing in new service and software development will be Java-based. This will include their 'Star' project, aimed at building the next generation Java phone runtime. There was also a new mobility kit for NetBeans developers, if you are working on embedded Java.
IBM announced that they will officially support Apache Geronimo as an equal
but lightweight application server alternative for its app server,
WebSphere. IBM has been an active contributor to the Geronimo project, and as
part of that they will donate several Eclipse plugins to speed up J2EE
development. Robert LeBlanc, IBM's WebSphere General Manager, speaking at
a keynote, noted Geronimo's status for IBM and stressed the importance of
SOA [Service Oriented Architecture] for today's integration challenges.
[see more on Geronimo and the heavy use of OSSw by Java users in the BOF section of this report.]
Eclipse 3.1 was officially released after being announced at JavaOne. The new version allows users to streamline testing, create user interfaces for rich client applications, and enhance support for Ant build scripts. Also, NEC became the 100th organization to join the Eclipse Foundation.
Continuing with the developer-friendly, Open Source theme at JavaOne, BEA Systems is offering official support for both Spring and Struts frameworks running on top of WebLogic Server.
Oracle announced that its JDeveloper J2EE development tool will be available for free and that they have partnered with the Apache MyFaces project. That was 'free' as in beer....
IBM and Sun announced a new and improved relationship - which is good for the entire Java community. After sparring a lot in recent years, they signed a new landmark agreement calling for an 11-year collaboration effort. It's actually an additional year on the current agreement plus 10 more years... so that's 11 years, but the point is that their collaboration on Java will improve. Sun's shift to a more Open Source friendly position probably placated IBM's desire for Sun to loosen its control over Java.
Sun reiterated its position: there would be more OSSw in the Java space over the next year. Would that include opening up the JVM??? That may be news for JavaOne 2006.
Jonathan Schwartz, President and COO at Sun, opened the first keynote address, titled "Welcome to the Participation Age", discussing the importance of 'participation' in building communities, creating value and new markets, and driving social change. To drive the point home, Schwartz added, "The Information Age is over." [Really?] He was, to be sure, referring to Sun's new branding that has a curvaceous, subliminal "S" that stands for "Share" and extends that term to mean OSSw, Developer communities and even the Wikipedia. Schwartz re-emphasized that Sun has always shared some of its IP going back to the founding days of Unix. But watch the replay of the keynote to judge for yourself.
Since all the keynotes are posted now - and were available by internet broadcast in real time - please check out the link: http://java.sun.com/javaone/sf/sessions/general/index.jsp
During the second keynote, Graham Hamilton, a Sun fellow and vice president, addressed advances planned for Java SE software over the two upcoming releases, a period of 3 years!
Hamilton offered developers an early taste of Java SE 6 software, which is
expected to ship in summer 2006, and invited them to contribute directly to
the future of Java by reviewing source code, contributing bug fixes and
feature implementations and collaborating with Sun engineers. Developers
can join the community at http://community.java.net/jdk.
The following Java SE 6 features are [or soon will be] available for
testing and evaluation at the JDK software community site on java.net:
On the Enterprise Java Beans [EJB] development side, there is a new spec that does away with the Container Managed Persistence scheme [CMP] and makes EJBs much more like 'Plain old Java objects' [or POJOs]. This came about from a lot of discussion and arm-wrestling with alternative projects and frameworks, most notably Toplink, Hibernate and JDO. So there will be one single persistence model for both Java SE and EE going forward.
The new scheme makes extensive use of the new Annotations feature which allows for in-code specification of resources and dependencies. Although this approach is not without some controversy in the developer community, it should ease reading code and do away with complicated deployment descriptors and, perhaps, make the intent of a developer or team more obvious. See information on Annotations here: http://java.sun.com/j2se/1.5.0/docs/guide/language/annotations.html
Sun is also releasing support for integration with scripting languages over both Java 6 and 7 SE. A technical session - TS-7706 Scripting in the Java™ Platform - described the new scripting engine that is already available in beta format and will be included in Mustang, Java 6. It is based on JSR 223 and will support several scripting languages.
[I'm including session numbers and titles since most of the presentations can be freely downloaded now - see last section below.]
There were several sessions dealing with performance and security, and these were among the most heavily attended.
The Tuesday session on performance, TS-3268, Performance Myths Exposed, was one of the few actually repeated on the last day. It reviewed several strategies and tested these on small and large apps. It also compared 7 JVMs against 6 performance hacks and summarized the results:
- use of "final" does not help performance - in-lining is automatic
- try/catch blocks are [mostly] free
- use of RTTI is a marginal performance win at best - with maintenance costs
Also of note, in spite of an ungainly title, was TS-3397: Web Services and XML Performance and the Java™ Virtual Machine: What Your JVM™ Is Doing During That Long Pause When XML Processing, With Optimization Suggestions which highlighted some practical rules-of-thumb to speed XML-