LeoNerd.org.uk

XTerm and 8-bit Characters

The defaults for XTerm and inputrc on most Linux systems are such that you can't type both Alt-[key] combinations and Unicode characters. This is annoying. So I fiddled with all the settings in various combinations, until I came across the magical set that allows you to do both.

The problem is that UTF-8 encodes every character outside 7-bit ASCII as a sequence of bytes that all have the top bit set. But this is the mechanism normally used by VT-style terminals (i.e. the original VT100, etc., as well as xterm and other emulators thereof) to indicate the ALT key. They simply set the top bit, and send the normal 7-bit value as required. E.g., the value sent for ALT+m would be 0x80 + 'm', which is 0xED.

Clearly, these two systems cannot neatly co-exist.
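To make the clash concrete, here is a small Python sketch. It is only an illustration, not part of the XTerm setup; the helper name alt_8bit is made up for this example.

```python
def alt_8bit(key: str) -> bytes:
    """8-bit scheme: send the key's 7-bit ASCII code with the top bit set."""
    code = ord(key)
    if code > 0x7F:
        raise ValueError("key must be 7-bit ASCII")
    return bytes([code | 0x80])


alt_m = alt_8bit("m")               # 0x6D | 0x80
e_acute = "\u00e9".encode("utf-8")  # U+00E9, e with acute accent
euro = "\u20ac".encode("utf-8")     # U+20AC, euro sign

print(alt_m.hex())    # ed
print(e_acute.hex())  # c3a9
print(euro.hex())     # e282ac

# Every byte of a multi-byte UTF-8 sequence has the top bit set...
assert all(b & 0x80 for b in e_acute + euro)
# ...and 0xED falls in 0xE0-0xEF, the lead bytes of three-byte sequences,
# so the receiver cannot tell ALT+m from the start of a character.
assert 0xE0 <= alt_m[0] <= 0xEF
```

A program reading 0xED from the terminal has no way to know which of the two meanings was intended.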

There is, however, an alternate way to encode ALT sequences. The ECMA-35 standard, which was based on the DEC VT100 and other terminals, allows a second encoding of 8-bit values, for use when the line is not 8-bit clean. Each 8-bit value is escaped; that is, the <ESC> code is sent first, indicating that the following value is a high code. Using this system, the code for ALT+m would be two bytes: 0x1b followed by 'm'.
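As a rough Python sketch of the escaped form (the function name alt_escaped is hypothetical, chosen just for this illustration):

```python
ESC = 0x1B


def alt_escaped(key: str) -> bytes:
    """7-bit scheme: send ESC first, then the plain 7-bit key code."""
    code = ord(key)
    if code > 0x7F:
        raise ValueError("key must be 7-bit ASCII")
    return bytes([ESC, code])


seq = alt_escaped("m")
print(seq)  # b'\x1bm'

# Neither byte has the top bit set, so the sequence survives a 7-bit line.
assert all(b < 0x80 for b in seq)
```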

By observing that multi-byte UTF-8 sequences never use values below 32 (decimal), the C0 range that contains <ESC>, we can combine these systems to allow both ALT keys and Unicode.
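The following Python sketch shows why the combination is unambiguous on the receiving side. It is a simplified model, not how readline actually parses input: decode_keys is a made-up name, and it ignores a lone ESC keypress followed by another key, as well as the longer ESC [ sequences sent by function and arrow keys.

```python
ESC = 0x1B


def utf8_length(lead: int) -> int:
    """Length of a UTF-8 sequence, judged from its first byte."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead < 0xE0:
        return 2
    if 0xE0 <= lead < 0xF0:
        return 3
    if 0xF0 <= lead < 0xF8:
        return 4
    raise ValueError("not a valid UTF-8 lead byte: 0x%02x" % lead)


def decode_keys(data: bytes) -> list:
    """Split input into keys: 'M-x' for ESC-prefixed ALT+x, else one character."""
    keys = []
    i = 0
    while i < len(data):
        if data[i] == ESC and i + 1 < len(data) and data[i + 1] < 0x80:
            # ESC followed by a 7-bit byte: treat as an ALT combination.
            keys.append("M-" + chr(data[i + 1]))
            i += 2
        else:
            # Anything else is one UTF-8 character (ASCII included).
            n = utf8_length(data[i])
            keys.append(data[i:i + n].decode("utf-8"))
            i += n
    return keys


stream = b"a" + b"\x1bm" + "\u00e9".encode("utf-8") + "\u20ac".encode("utf-8")
print(decode_keys(stream))  # ['a', 'M-m', 'é', '€']
```

Because ESC can never appear inside a multi-byte character, the decoder never has to guess which scheme a byte belongs to.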

So how do we do this? Well, first we tell XTerm not to use the 8-bit scheme for control keys, and instead only use 7-bit. For related reasons, a certain amount of output conversion must also be done; this affects the way UTF-8 characters are displayed on the screen. We must tell XTerm to allow 8-bit characters back in again, so we can display UTF-8 correctly. Put the following in your XTerm app-defaults file:

*VT100*utf8:            1
*VT100*eightBitInput:   false
*VT100*eightBitControl: false
*VT100*eightBitOutput:  true

Now XTerm will correctly send the required sequences for all the situations outlined above. But that's not all; we still need to tell readline (which is responsible for bash's command-line editing, among other things) how to interpret the various sequences. For that, put the following in your inputrc file:

set input-meta   on
set output-meta  on
set convert-meta off

This tells readline not to convert the escaped control/ALT sequences, but still to allow 8-bit values, thus allowing input of the UTF-8 characters. Again, 8-bit output needs to be turned on, so that when programs output UTF-8 sequences, they don't get converted down to escaped 7-bit sequences, which would confuse XTerm into thinking they are control sequences.
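To see what goes wrong with conversion enabled, here is a Python sketch of the behaviour the readline documentation describes for convert-meta: strip the eighth bit and prefix an ESC. The function convert_meta is my own model of that description, not readline's code.

```python
ESC = 0x1B


def convert_meta(data: bytes) -> bytes:
    """Model of 'convert-meta on': each high byte becomes ESC + low 7 bits."""
    out = bytearray()
    for b in data:
        if b & 0x80:
            out += bytes([ESC, b & 0x7F])
        else:
            out.append(b)
    return bytes(out)


e_acute = "\u00e9".encode("utf-8")  # b'\xc3\xa9'
print(convert_meta(e_acute))        # b'\x1bC\x1b)'

# Plain ASCII passes through untouched.
assert convert_meta(b"ls -l") == b"ls -l"
```

Under this model the two bytes of a single accented character turn into what looks like ALT+C followed by ALT+), which is why conversion has to be switched off for UTF-8 input.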

For other interesting/useful settings for XTerm, see here.