UTF8

is a simple UTF-8 package for Carp. This nascent string replacement type lets you use many of the functions you know from Carp strings while working on Unicode runes rather than raw bytes.

Installation

You can obtain this library like so:

(load "git@github.com:carpentry-org/utf8.carp@0.0.5")

Usage

First, let’s define a UTF-8 string to work with!

(let [s (UTF8.from-string "hεllö")]

That’s a cute, short string! Hm, I wonder how long it is!

(length s) ; => 5

That’s surprising! So the length is the actual number of runes, and not the number of bytes, you say? Most curious!
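To see the difference for yourself, here is a minimal sketch that prints both counts side by side. It assumes that the core String.length counts bytes, and it borrows s with & because the signatures listed further down take references.

```carp
(load "git@github.com:carpentry-org/utf8.carp@0.0.5")

(defn main []
  (let [raw "hεllö"                ; a plain Carp string literal
        s (UTF8.from-string raw)]  ; the same text as a UTF8 value
    (do
      ; assumption: String.length reports bytes, so ε and ö count twice each
      (IO.println &(Int.str (String.length raw)))
      ; UTF8.length reports runes
      (IO.println &(Int.str (UTF8.length &s))))))
```

If the byte-counting assumption holds, the first number should be larger than the second, since ε and ö each take two bytes in UTF-8.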

You know what, I want to see this second character there up close! It somehow looks all Greek to me!

(nth s 1) ; => (Just ε)

Hm, so this is what that looks like, huh? Interesting. And what’s its type?

(type UTF8.nth) ; => UTF8.nth : (λ [(Ref UTF8 a), Int] (Maybe Rune))

So, it comes wrapped in a Maybe, and the thing inside is called a Rune, huh? Hm, runes don’t seem to be super interesting, but I seem to be able to compare them and stringify them, and even take their length in bytes! Quite delicious!

I wonder what else I can do with these functions?

=

defn

(λ [(Ref UTF8 a), (Ref UTF8 a)] Bool)

(= a b)

checks whether two UTF-8-encoded strings are equal.

append

defn

(λ [(Ref UTF8 a), (Ref UTF8 b)] UTF8)

(append a b)

appends two UTF-8-encoded strings.
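A small sketch of append in use, assuming the package loads as shown in the installation section. The names a, b and joined are just illustrative.

```carp
(load "git@github.com:carpentry-org/utf8.carp@0.0.5")

(defn main []
  (let [a (UTF8.from-string "hεl")
        b (UTF8.from-string "lö")
        ; append borrows both arguments and returns a new UTF8
        joined (UTF8.append &a &b)]
    ; convert back to a regular string for printing
    (IO.println &(UTF8.str &joined))))
```

Going by the description above, this should print the two pieces joined into one string.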

copy

instantiate

(λ [(Ref UTF8 a)] UTF8)

copies a UTF8.

empty?

defn

(λ [(Ref UTF8 a)] Bool)

(empty? u)

checks whether a UTF-8-encoded string is empty.

ends-with?

defn

(λ [(Ref UTF8 a), (Ref UTF8 b)] Bool)

(ends-with? u sub)

checks if the string u ends with the string sub.

from-string

defn

(λ [(Ref String a)] UTF8)

(from-string s)

creates a UTF-8 string from a regular string.

init

instantiate

(λ [(Array Rune)] UTF8)

creates a UTF8.

length

defn

(λ [(Ref UTF8 a)] Int)

(length u)

returns the length of a UTF-8-encoded string.

nth

defn

(λ [(Ref UTF8 a), Int] (Maybe Rune))

(nth u n)

returns the nth rune from a UTF-8-encoded string.
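Since the signature says nth returns a Maybe, here is a hedged sketch of checking the result before using it. It assumes that an out-of-range index yields Nothing (the signature suggests this, but the description does not spell it out) and uses Maybe.just? and Maybe.nothing? from the Carp core library.

```carp
(load "git@github.com:carpentry-org/utf8.carp@0.0.5")

(defn main []
  (let [s (UTF8.from-string "hεllö")
        inside (UTF8.nth &s 1)     ; index 1 exists in a five-rune string
        outside (UTF8.nth &s 99)]  ; index 99 does not
    (do
      ; is there a rune at index 1?
      (IO.println &(Bool.str (Maybe.just? &inside)))
      ; assumption: an out-of-range index gives Nothing
      (IO.println &(Bool.str (Maybe.nothing? &outside))))))
```

If you are sure the index is valid and want the bare Rune, unsafe-nth skips the Maybe wrapper.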

prefix

defn

(λ [(Ref UTF8 a), Int] UTF8)

(prefix u n)

returns the first n characters of the string u.

prn

instantiate

(λ [(Ref UTF8 a)] String)

converts a UTF8 to a string.

reverse

defn

(λ [(Ref UTF8 a)] UTF8)

(reverse u)

reverses a UTF-8-encoded string.

runes

instantiate

(λ [(Ref UTF8 a)] (Ref (Array Rune) a))

gets the runes property of a UTF8.

set-runes

instantiate

(λ [UTF8, (Array Rune)] UTF8)

sets the runes property of a UTF8.

set-runes!

instantiate

(λ [(Ref UTF8 a), (Array Rune)] ())

sets the runes property of a UTF8 in place.

slice

defn

(λ [(Ref UTF8 a), Int, Int] UTF8)

(slice u a b)

returns a substring of the string from the index a to the index b.

starts-with?

defn

(λ [(Ref UTF8 a), (Ref UTF8 b)] Bool)

(starts-with? u sub)

checks if the string u begins with the string sub.
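The two predicates starts-with? and ends-with? can be tried together. This is a minimal sketch; the names head and tail are illustrative only.

```carp
(load "git@github.com:carpentry-org/utf8.carp@0.0.5")

(defn main []
  (let [s (UTF8.from-string "hεllö")
        head (UTF8.from-string "hε")
        tail (UTF8.from-string "lö")]
    (do
      ; does s begin with "hε"?
      (IO.println &(Bool.str (UTF8.starts-with? &s &head)))
      ; does s end with "lö"?
      (IO.println &(Bool.str (UTF8.ends-with? &s &tail))))))
```

Both arguments are borrowed, so s, head and tail remain usable afterwards.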

str

defn

(λ [(Ref UTF8 a)] String)

(str u)

creates a regular string from a UTF-8 string.

suffix

defn

(λ [(Ref UTF8 a), Int] UTF8)

(suffix u n)

returns the last n characters of the string u.
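A short sketch of prefix and suffix side by side. Both return a new UTF8, which is converted back with str for printing.

```carp
(load "git@github.com:carpentry-org/utf8.carp@0.0.5")

(defn main []
  (let [s (UTF8.from-string "hεllö")]
    (do
      ; the first two runes of s
      (IO.println &(UTF8.str &(UTF8.prefix &s 2)))
      ; the last two runes of s
      (IO.println &(UTF8.str &(UTF8.suffix &s 2))))))
```

Going by the descriptions above, the cuts should fall between runes, so a multi-byte rune such as ε or ö is never split in half.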

unsafe-nth

defn

(λ [(Ref UTF8 a), Int] Rune)

(unsafe-nth u n)

returns the nth rune from a UTF-8-encoded string unsafely.

update-runes

instantiate

(λ [UTF8, (Ref (λ [(Array Rune)] (Array Rune) a) b)] UTF8)

updates the runes property of a UTF8 using a function f.

zero

defn

(λ [] UTF8)

(zero)

returns the empty string.
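As a quick sanity check, zero can be paired with empty?. This is only a sketch, and it assumes that the value zero returns counts as empty for empty?.

```carp
(load "git@github.com:carpentry-org/utf8.carp@0.0.5")

(defn main []
  (let [e (UTF8.zero)
        s (UTF8.from-string "hεllö")]
    (do
      ; assumption: the zero value is reported as empty
      (IO.println &(Bool.str (UTF8.empty? &e)))
      ; a five-rune string should not be
      (IO.println &(Bool.str (UTF8.empty? &s))))))
```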