Character Types

Copper 3 no longer has a built-in character type; characters are implemented entirely in the library. The standard library defines a CodeUnit type for UTF-8 or UTF-16 strings, depending on the platform, and a 32-bit Unicode Char type.

To show how it is implemented, let's start with a simple example by defining an ASCII Char type.

The compiler uses the Smalltalk style for literal characters, e.g. $A for the 'A' character. From the compiler's point of view, there is absolutely no difference between $A and 65; both are the same literal integer. So $A can be a 32-bit signed integer or an unsigned 8-bit integer, depending on the context.

Creating the Type

Just create a sub-type of Unsigned8.

stype Char : Unsigned8
    ....
end

Now we have a character type that is an Unsigned8: it inherits all operations from its parent type.

Defining Special Characters

There is no syntax for special characters. If you want a tab character, just use the integer 9 or define a symbol for it.

stype Char : Unsigned8
    'nul = 0
    'tab = 9
    'lf = 10
    'cr = 13
end

Defining Additional Methods

In addition to inherited operations (addition, increment, ...), you may want to add useful methods to the new type.

stype Char : Unsigned8
    'nul = 0
    'tab = 9
    'lf = 10
    'cr = 13

    method isNul
        return self == 'nul
    end

    method isUpper
        return self >= $A and self <= $Z
    end

    method isLower
        return self >= $a and self <= $z
    end

    method toUpper
        return self isLower cond self + $A - $a else self
    end
end

Now we have a fully operational character type we can use.

var i : Int32
i = $B // valid integer value 66
var c : Char
c = 'tab
c = $y
c = c + 1
if c isNul
    return
else
    c = c toUpper
end
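
For readers who do not know Copper, the same idea can be approximated in Python by subclassing int. This is only a rough analogue, not how Copper implements sub-types: Python integers are unbounded rather than 8 bits wide, the method names (is_nul, is_lower, is_upper, to_upper) and the constants are illustrative, and Python's inherited addition returns a plain int, so the result has to be wrapped back into Char by hand.

```python
class Char(int):
    """Rough analogue of `stype Char : Unsigned8`: an int with extra methods."""

    # Named constants playing the role of the 'nul, 'tab, 'lf and 'cr symbols.
    NUL = 0
    TAB = 9
    LF = 10
    CR = 13

    def is_nul(self):
        return self == Char.NUL

    def is_lower(self):
        return ord("a") <= self <= ord("z")

    def is_upper(self):
        return ord("A") <= self <= ord("Z")

    def to_upper(self):
        # Shift lowercase ASCII letters into the uppercase range; leave the rest alone.
        return Char(self + ord("A") - ord("a")) if self.is_lower() else self


c = Char(ord("y"))
c = Char(c + 1)            # inherited integer addition, re-wrapped as a Char
print(chr(c.to_upper()))   # prints "Z"
```

The point of the comparison is that all arithmetic comes from the parent integer type, while the character-specific behaviour lives in a handful of small methods.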

Unicode Character

To implement a Unicode character type, repeat the same steps but make Char a sub-type of Unsigned32.

stype Char : Unsigned32
    'nul = 0
    ...
end

Strings are unlikely to be stored as arrays of Unicode characters; they are more likely to be encoded in UTF-8 or UTF-16. The String type is therefore implemented as an array of code units.

stype CodeUnit : Unsigned8
    ...
end

// A String is a pointer to an array of code units
stype String : *[]CodeUnit
    ...
    method eachChar
        // Reassemble all code units into characters 
        // before passing them to a block.
    end
    ...
end

The user does not have to worry about the encoding and can iterate through the characters using eachChar:

var str = ... // a string
str eachChar do c
    // c is a 32 bit Unicode character
end
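
To make the "reassemble all code units into characters" step concrete, here is a minimal sketch in Python of what an eachChar-style iterator might do for UTF-8. It is not the Copper library's implementation: the function name each_char is hypothetical, the input is assumed to be well-formed UTF-8, and no validation of continuation units, overlong forms, or surrogates is attempted.

```python
def each_char(code_units):
    """Yield Unicode code points decoded from a sequence of UTF-8 code units."""
    i = 0
    n = len(code_units)
    while i < n:
        lead = code_units[i]
        if lead < 0x80:              # 0xxxxxxx: one-unit (ASCII) sequence
            cp, extra = lead, 0
        elif lead >> 5 == 0b110:     # 110xxxxx: two-unit sequence
            cp, extra = lead & 0x1F, 1
        elif lead >> 4 == 0b1110:    # 1110xxxx: three-unit sequence
            cp, extra = lead & 0x0F, 2
        elif lead >> 3 == 0b11110:   # 11110xxx: four-unit sequence
            cp, extra = lead & 0x07, 3
        else:
            raise ValueError(f"invalid lead code unit at index {i}")
        # Each continuation unit contributes its low 6 bits to the code point.
        for unit in code_units[i + 1 : i + 1 + extra]:
            cp = (cp << 6) | (unit & 0x3F)
        yield cp
        i += 1 + extra


units = "a\u00e9\u20ac\U0001F600".encode("utf-8")
print([hex(cp) for cp in each_char(units)])
# prints ['0x61', '0xe9', '0x20ac', '0x1f600']
```

Seven code units of input produce four characters here, which is why the String type exposes iteration over characters rather than asking callers to index the code-unit array directly.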