Komodo's system for defining multi-language syntax lexing and user-defined syntax highlighting is called UDL (User Defined Languages). UDL files are written in a language called Luddite, then compiled into Scintilla lexers and packaged for use in Komodo.
Komodo includes a general-purpose lexer engine that loads a description resource, and walks it while lexing the buffer. It currently allows for up to five sub-languages, each of which is expected to be a member of a pre-defined family ("markup", "style", "client-side scripting language", "server-side scripting language", and "template language"). The format of these resource files is low-level and intended to be fast to parse, so we have provided a programming language that is intended to allow users to build lexers for their own programming languages.
Currently these lexers allow for six-bit colorizing, meaning that both errors and warnings can be highlighted in the buffer. The lexer description language allows for:
Luddite programs typically consist of a set of files, called modules. A well-designed Luddite program will contain one main file that provides a name for the target language, followed by a list of include statements that load the other modules.
Luddite consists of a set of declarations, most of which can appear in any order, except for the initial "family" declaration. This is because some declarations are family-specific, and bind to the prevailing family declaration.
The php-mainlex.udl file defines the PHP lexer. It is a useful example as it uses four of the language families: markup (HTML), css (CSS), client-side (JavaScript), and server-side (PHP). It includes transitions from HTML to JavaScript, CSS, and PHP, followed by transitions back to HTML from each of these languages, and finally the main language modules.
language PHP
include "html2js.udl"
include "html2css.udl"
include "html2php.udl"
include "css2html.udl"
include "js2html.udl"
include "php2html.udl"
include "html.udl"
include "csslex.udl"
include "jslex.udl"
include "phplex.udl"
Working from the bottom of the list towards the top, we have four core files that contain code to describe how to lex each language on its own. Above those are files that explain how to transition from one language to another (e.g. "php2html.udl"). As rules are attempted in the order they're first presented, we normally need to test transition rules before we attempt internal rules, which is why the core modules appear at the bottom.
Luddite programs consist of declarative statements separated either by newlines or semi-colons. Comments start with "#" and continue to the end of line. Because it's a declarative language, the order of different statements doesn't matter, but the order of rules in the same group does.
The Luddite compiler allows lists to be entered with minimal punctuation (i.e., without quotes around strings, or commas between entries). However, the following words should always be quoted when declared as strings, as they are reserved words in Luddite:
Luddite is intended to work only with Scintilla. It is helpful to refer to the Scintilla documentation when writing lexers in Luddite. To reduce redundancy, you can refer to the names of colors by using either the full name, such as "SCE_UDL_M_ATTRNAME", or you can drop the common prefix, and refer to "M_ATTRNAME". A partial prefix won't work, nor will hard-wired numeric values.
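The name lookup described above can be pictured with a short Python sketch. This is only an illustration: the resolve_style helper and the STYLES table are hypothetical names, the numeric values are the client-side entries from scintilla.iface, and the real Luddite compiler may resolve names differently.

```python
# Hypothetical lookup table; the values are the client-side (CSL) entries
# that scintilla.iface assigns to the SCE_UDL_CSL_* styles.
STYLES = {
    "SCE_UDL_CSL_DEFAULT": 30,
    "SCE_UDL_CSL_COMMENT": 31,
    "SCE_UDL_CSL_WORD": 35,
    "SCE_UDL_CSL_IDENTIFIER": 36,
}
COMMON_PREFIX = "SCE_UDL_"


def resolve_style(name):
    """Accept the full style name, or the name without the common prefix."""
    for candidate in (name, COMMON_PREFIX + name):
        if candidate in STYLES:
            return STYLES[candidate]
    # Partial prefixes ("UDL_CSL_WORD") and numeric strings ("35") end up here.
    raise KeyError("unknown style name: %r" % (name,))


print(resolve_style("SCE_UDL_CSL_WORD"))
print(resolve_style("CSL_WORD"))
```

Both calls resolve to the same style, while a partial prefix or a hard-wired number is rejected.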
Most lexer components need to declare which language family they belong to: markup, css, csl (client-side language, usually JavaScript), ssl (server-side language), or tpl (template language). This last family is usually used by the server-side processor that has to determine which code is markup to be output "as is", and which is server-side code to be executed (e.g. PHP's Smarty and Perl's Template Toolkit).
The default family is "markup". You can write a lexer for any language that lives in one family, and it won't look like markup. This is an arbitrary starting point that makes sense for most template languages.
All directives in a file belong to the most recent family directive. When a new file is included, it starts off in the family that is current at the point of inclusion. If the included file specifies a new family, then once the included file has been processed, Luddite reverts to the family that was current when the include began.
There are currently three domains that are family-specific: keywords, named patterns, and look-back tests, which are used to disambiguate.
Example:
family csl
For each family, we currently need to specify the range of styles from scintilla.iface that the family will use, by naming its first and last style.
Example:
start_style CSL_DEFAULT
end_style CSL_REGEX
This isn't surprising, as the code in scintilla.iface reads like so:
# Template: client-side language
# Start at 30
val SCE_UDL_CSL_DEFAULT=30
val SCE_UDL_CSL_COMMENT=31
val SCE_UDL_CSL_COMMENTBLOCK=32
val SCE_UDL_CSL_NUMBER=33
val SCE_UDL_CSL_STRING=34
val SCE_UDL_CSL_WORD=35
val SCE_UDL_CSL_IDENTIFIER=36
val SCE_UDL_CSL_OPERATOR=37
val SCE_UDL_CSL_REGEX=38
A complete list of Scintilla Styles can be found in the Luddite Reference.
Most sub-languages have a set of keywords that we want to color differently from identifiers. You specify them using keywords, supplying a list of names and strings, which may be comma-separated, as in this code for JavaScript:
keywords [as break case catch class const # ...
    get, "include", set
    abstract debugger enum goto implements # ...
    ]
The string "include" is quoted because it is a Luddite keyword. Commas are optional, and comments can be added at the end of a line inside a list.
You must generally specify a different list of keywords for each family; however, it is possible to define some languages without keywords (e.g. HTML).
Tell Luddite when to color a range of text as a keyword with the keyword_style directive. For example:
keyword_style CSL_IDENTIFIER => CSL_WORD
A complete list of Luddite Keywords can be found in the Luddite Reference.
To prevent an identifier from being recolored as a keyword, use the no_keyword command.
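As a rough mental model of keyword_style and no_keyword, the promotion step can be sketched in Python. The apply_keyword_style function and the three-field token tuples are assumptions made for illustration; they are not the lexer engine's actual data structures.

```python
def apply_keyword_style(tokens, keywords, from_style, to_style):
    """Promote identifier tokens found in `keywords` to the keyword style.

    Each token is a (style, text, no_keyword) tuple, where no_keyword is True
    if the rule that recognized the token asked for promotion to be suppressed.
    """
    painted = []
    for style, text, no_keyword in tokens:
        if style == from_style and text in keywords and not no_keyword:
            style = to_style
        painted.append((style, text))
    return painted


tokens = [
    ("CSL_IDENTIFIER", "class", False),  # ordinary position: promoted
    ("CSL_IDENTIFIER", "Point", False),  # not in the keyword list
    ("CSL_IDENTIFIER", "class", True),   # e.g. recognized right after "."
]
painted = apply_keyword_style(
    tokens, {"class", "break"}, "CSL_IDENTIFIER", "CSL_WORD")
for style, text in painted:
    print(style, text)
```

Only the first "class" is recolored; the one flagged with no_keyword keeps the identifier style.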
To specify how Luddite should process text, provide a string for it to match verbatim, or provide a pattern. The pattern syntax is nearly identical to Perl's regex language.
Patterns for a particular language tend to be repetitive. To make it easier to use them, Luddite supports family-based pattern variables, which are interpolated into pattern expressions. These are the four pattern variables used in the JavaScript lexer:
pattern NMSTART = '\w\x80-\xff' # inside cset
pattern CS = '\w\d_\x80-\xff' # inside cset
pattern WS = '\s\t\r\n' # inside cset
pattern OP = '!\#%&\(\)\*\+,-\.\/:;<=>\?@\[\]\^\{\}~|'
These patterns are interpolated into a character set. For example:
/[^$WS]+/ # Match one or more non-white-space characters
The CSS UDL includes a more complex set of statements:
family css
pattern CS = '-\w\d._\x80-\xff' # inside cset
pattern WS = '\s\t\r\n' # inside cset
pattern NONASCII = '[^\x00-\x7f]'
pattern UNICODE = '\\[0-9a-f]{1,6}'
pattern ESCAPE = '$UNICODE|\\[ -~\x80-\xff]'
pattern NMCHAR = '[a-zA-Z0-9-]|$NONASCII|$ESCAPE'
pattern NMSTART = '[a-zA-Z]|$NONASCII|$ESCAPE'
Pattern variables can nest within one another. You need to keep track of which variables define a character set, and which ones are intended to be used inside a character set.
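To see what nested interpolation produces, here is a small Python sketch that expands the CSS pattern variables by plain textual substitution. The expand function is hypothetical, and treating interpolation as simple text replacement is an assumption; Python's re module merely stands in for Luddite's Perl-like pattern engine.

```python
import re

# Pattern variables copied from the CSS example above.
CSS_PATTERNS = {
    "WS": r"\s\t\r\n",                  # meant for use inside [...]
    "NONASCII": r"[^\x00-\x7f]",        # already a complete character set
    "UNICODE": r"\\[0-9a-f]{1,6}",
    "ESCAPE": r"$UNICODE|\\[ -~\x80-\xff]",
    "NMSTART": r"[a-zA-Z]|$NONASCII|$ESCAPE",
}


def expand(pattern, variables, depth=0):
    """Replace each $NAME with its definition, following nested variables."""
    if depth > 20:
        raise ValueError("pattern variables nest too deeply (cycle?)")

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            return match.group(0)  # not a variable; leave the text alone
        return expand(variables[name], variables, depth + 1)

    return re.sub(r"\$([A-Z_][A-Z0-9_]*)", substitute, pattern)


non_space = expand("[^$WS]+", CSS_PATTERNS)
name_start = expand("(?:$NMSTART)", CSS_PATTERNS)
print(non_space)
print(name_start)
```

WS only makes sense inside brackets, while NMSTART expands to a full alternation that should be grouped before use. This is the bookkeeping the paragraph above asks you to keep in mind.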
The heart of every Luddite program is a set of state-transitions. The key concept is that states in Luddite, unlike most Scintilla lexers that are hand-coded in C, are not directly related to the colors that each character will be given. In Luddite you create your own names for each state, describe for each state which pieces of text or patterns to look for, and specify what to do with them.
The HTML lexer starts with this code:
initial IN_M_DEFAULT
The first statement means that we should start lexing at the state we call "IN_M_DEFAULT". Subsequent initial statements are ignored, without warning messages.
To specify a state transition, provide a state block. For example, with HTML:
state IN_M_DEFAULT:
    '<?' : paint(upto, M_DEFAULT), => IN_M_PI_1
    '<![CDATA[' : paint(upto, M_DEFAULT), => IN_M_CDATA
    '<!--' : paint(upto, M_DEFAULT), => IN_M_COMMENT
    # These are more complicated, because if they aren't followed
    # by a character we want to leave them as text.
    '</' : paint(upto, M_DEFAULT), => IN_M_ETAG_1
    '&#' : paint(upto, M_DEFAULT), => IN_M_ENT_CREF_1
    '&' : paint(upto, M_DEFAULT), => IN_M_ENT_REF_1
    '<' : paint(upto, M_DEFAULT), => IN_M_STAG_EXP_TNAME
When we're in the state we call "IN_M_DEFAULT" (Luddite turns this into an arbitrary number to be used by Scintilla), if we match any of the above strings, we first will have Scintilla color (or "paint") everything up to the current starting position the SCE_UDL_M_DEFAULT color (remember, the prefix is implicit), and then change to the state named to the right of the "=>". The comma before "=>" is optional, but advisable in more complex rules.
The match strings may use double quotes instead of single quotes. Simple C-like backslash-escaping is used, such as \' and \", but not hex or octal escapes.
state IN_M_PI_1:
    '?>' : paint(include, M_PI) => IN_M_DEFAULT
    /\z/ : paint(upto, M_PI)
The "include" directive for the paint command paints from the last paint-point to the position we end at. Recall that paint(upto,[color]) stops at the position we were at when we attempted to match the string.
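The difference between the two paint modes can be made concrete with a small Python sketch. The Painter class is a hypothetical model of the paint-point bookkeeping described above, not the engine's real implementation.

```python
class Painter:
    """Tracks the last paint-point and records the painted runs."""

    def __init__(self):
        self.paint_point = 0
        self.runs = []  # (start, end, style) tuples

    def paint(self, mode, style, match_start, match_end):
        # "upto" stops where the match began; "include" covers the match too.
        end = match_start if mode == "upto" else match_end
        if end > self.paint_point:
            self.runs.append((self.paint_point, end, style))
            self.paint_point = end


text = "ab<?pi?>cd"
painter = Painter()

# In IN_M_DEFAULT: '<?' matches, so paint(upto, M_DEFAULT).
start = text.find("<?")
painter.paint("upto", "M_DEFAULT", start, start + 2)

# In IN_M_PI_1: '?>' matches, so paint(include, M_PI).
end = text.find("?>", start + 2)
painter.paint("include", "M_PI", end, end + 2)

for run_start, run_end, style in painter.runs:
    print(style, repr(text[run_start:run_end]))
```

The "ab" before the processing instruction gets the default markup color, and everything from "<?" through "?>" is painted as M_PI in one run.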
One of the advantages of this approach over a standard regular expression based set of rules, specifying the start and end delimiters, is that Luddite is geared to building editor lexers. When people are using lexers they often are typing at the end of the file. For example, if I was typing this code:
---- top ----
...
<?somepi<EOF>
---- bottom ----
I would like the last line to be colored like a processing instruction, even though I haven't completed it. Luddite lets you do things to confound your users, such as choosing a different color for an incomplete directive at the end of file. However, most of the time, you won't want to do this. If all of the paint commands in a state block use the same color, Luddite will automatically supply that color for an end-of-file condition at that state. In other words, the '/\z/' condition is rarely necessary.
There are two main ways to delay state transitions and colorizing in Luddite:
In the HTML lexer, we rewrite the state blocks for PIs as follows:
state IN_M_PI_1:
    /./ : redo, => IN_M_PI_2

state IN_M_PI_2:
    '?>' : paint(include, M_PI) => IN_M_DEFAULT
Wait a minute, you say. That does just the same thing, and less efficiently than when there was just the one state. In fact, all state IN_M_PI_1 does is a so-called epsilon transition to state IN_M_PI_2 ("epsilon" transitions don't consume input).
Go back to that sample at the beginning of this document. Notice that the "html2php.udl" file is included before "html.udl". This file is short, and is reproduced here without comments:
family markup

state IN_M_PI_1:
    /php\b/ : paint(upto, M_OPERATOR), paint(include, M_TAGNAME), \
        => IN_SSL_DEFAULT
Because "html2php.udl" is processed before "html.udl", its pattern will be attempted earlier. If the lexer finds "php" followed by a word boundary, it will then paint the leading "<?" as a markup operator, paint the "php" as a markup tagname, and then switch to the default state in the server-side language family.
If you write a Luddite program that ends up where two states do epsilon transitions to one another, the lexer engine will detect this. More precisely, if it notices that it has carried out 1000 consecutive epsilon transitions, it will move on to the next character. This shows up in Komodo as the rest of the buffer highlighted in a single color (remember the implicit end-of-buffer coloring).
The no_keyword command is used to prevent identifier-to-keyword promotion when an identifier is recognized. This is useful for programming languages that allow any token, even keywords, to be used in certain contexts, such as after the "." in Python, Ruby, and VBScript, or after "::" or "->" in Perl. See sample Luddite code.
The Luddite syntax for patterns is very similar to Perl's. Only forward slashes may be used as regex delimiters; otherwise, all the usual escaping rules apply. For example, JavaScript uses this pattern to handle single-line comments:
state IN_CSL_DEFAULT:
    # ...
    # Swallow to end-of-line
    /\/\/.*/ : paint(upto, CSL_DEFAULT), paint(include, CSL_COMMENT)
If no target state is specified, the lexer will stay in the current state.
In many languages you need to push one state, transition to another, and at some point return to the previous state. There are many examples where this comes up in template-based languages.
In most Smarty files, you transition from HTML to Smarty on "{", and transition back on "}". But if you find an open-brace while processing Smarty code, you should allow for a matching "}".
In RHTML, the delimiters "<%=" and "%>" are used to transition from HTML into a Ruby expression, and back. These delimiters can occur in many different parts of HTML files, including attribute strings and content. The lexer needs to be told which state to return to when it finishes processing the Ruby expression.
In Ruby proper, you can interpolate arbitrary amounts of Ruby code inside double-quoted strings between "#{" and "}". By pushing a state when you find "#{" in a Ruby string, you can allow for multiple nested pairs of braces in the expression, and return to the string when the matching "}" is reached.
The Luddite code for expressing this is simple. Let's look at how it's expressed for double-quoted strings in Ruby:
state IN_SSL_DEFAULT:
    #...
    '"' : paint(upto, SSL_DEFAULT), => IN_SSL_DSTRING
    #... Note the redo here for things that could be operators
    /[$OP]/ : paint(upto, SSL_DEFAULT), redo, => IN_SSL_OP1
    #...

state IN_SSL_DSTRING:
    '#{' : paint(include, SSL_STRING), spush_check(IN_SSL_DSTRING), \
        => IN_SSL_DEFAULT
    # ...

state IN_SSL_OP1:
    '{' : paint(include, SSL_OPERATOR), spush_check(IN_SSL_DEFAULT) \
        => IN_SSL_DEFAULT
    '}' : paint(upto, SSL_DEFAULT), paint(include, SSL_OPERATOR), \
        spop_check, => IN_SSL_DEFAULT
    # ...
When we find "#{" while processing a double-quoted string, we push the state we want to return to (IN_SSL_DSTRING), and transition to the default Ruby state (IN_SSL_DEFAULT), where we lex an expression.
If we find an open-brace while looking for operators, we again push the default state on the stack.
To handle a close-brace, we carry out a "spop_check" test. If there's something on the stack, we pop it and transition to the state it specified. Otherwise, we transition to the state named in the rule. If users never made mistakes, you would never need to specify a target state in a directive containing an "spop_check" command. But because people are capable of typing things like:
cmd { yield }}
...we need to tell Luddite what to do on the extra close-brace.
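The push and pop behavior can be traced with a short Python sketch. It models only the transitions discussed above. The function names mirror the Luddite commands, but the code is a hypothetical simplification: the closing-quote rule is assumed, and escapes inside strings are ignored.

```python
def spush_check(stack, return_state):
    """Remember the state to come back to."""
    stack.append(return_state)


def spop_check(stack, fallback_state):
    """Return the pushed state if there is one, else the rule's own target."""
    return stack.pop() if stack else fallback_state


def trace_states(text):
    """Return the lexer state after each consumed unit of text."""
    state, stack, trace, i = "IN_SSL_DEFAULT", [], [], 0
    while i < len(text):
        step = 1
        if state == "IN_SSL_DSTRING":
            if text.startswith("#{", i):
                spush_check(stack, "IN_SSL_DSTRING")
                state, step = "IN_SSL_DEFAULT", 2
            elif text[i] == '"':
                state = "IN_SSL_DEFAULT"  # closing quote (assumed rule)
        elif text[i] == '"':
            state = "IN_SSL_DSTRING"
        elif text[i] == "{":
            spush_check(stack, "IN_SSL_DEFAULT")
        elif text[i] == "}":
            state = spop_check(stack, "IN_SSL_DEFAULT")
        trace.append(state)
        i += step
    return trace


nested = trace_states('"a#{ {1} }b"')
unbalanced = trace_states("cmd { yield }}")
print(nested)
print(unbalanced[-1])
```

In the first trace the inner "}" returns to the default state and the outer "}" returns to the string. In the second, the extra close-brace finds an empty stack and falls back to the target state named in the rule.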
Some template languages use line-oriented embedded languages. For example, in Mason you can insert a line of Perl code by putting a '%' character at the start of the line.
The simplest way to express this in Luddite is to put an at_eol command in the transition into that state. When the lexer reaches the end of that line, it will automatically transition into the specified state. For example, the Luddite code to express this for the above Mason example is here:
state IN_M_DEFAULT:
    /^%/ : paint(upto, M_DEFAULT), paint(include, TPL_OPERATOR), \
        at_eol(IN_M_DEFAULT), => IN_SSL_DEFAULT
Some languages support arbitrary delimiters for objects like strings and regular expressions. For example, in Perl you can provide a list of words with the 'qw' construct (e.g. qw/abc def 1234/), and in Ruby you can use the '%' character to delimit a string (e.g. %Q(abc(nested parens)def)). You can express these in Luddite using the delimiter keyword.
There are actually four parts to supporting delimiters:
[...] "\".
[...] "s,\\,/,g" idiom, set the target delimiter to ",", match it, and keep it as the target for one more match.
This code shows how the delimiter-oriented keywords work together. We'll walk through support for Perl's matching statement first:
state IN_SSL_DEFAULT:
    # ...
    /m([\{\[\(\<])/ : paint(upto, SSL_DEFAULT), \
        set_opposite_delimiter(1), => IN_SSL_REGEX1_TARGET
    /m([^\w\d])/ : paint(upto, SSL_DEFAULT), \
        set_delimiter(1), => IN_SSL_REGEX1_TARGET
    # ...

state IN_SSL_REGEX1_TARGET:
    delimiter: paint(include, SSL_REGEX), => IN_SSL_REGEX_POST
    /\\./ : #stay
The first transition matches one of the open-bracket characters in a grouped pattern, and sets the target delimiter to the opposite of the contents of the first pattern group. The opposite_delimiter() routine requires its input to be one character long, and returns its input if it isn't one of the opening characters. So the two patterns could be expressed with the one transition:
/m([^\w\d])/ : paint(upto, SSL_DEFAULT), set_opposite_delimiter(1), \
    => IN_SSL_REGEX1_TARGET
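The behavior attributed to opposite_delimiter() above, together with the escape-skipping done by the /\\./ rule, can be sketched in Python. Both functions below are hypothetical stand-ins written for illustration; only the described behavior comes from the text.

```python
OPPOSITES = {"{": "}", "[": "]", "(": ")", "<": ">"}


def opposite_delimiter(ch):
    """Return the closing partner of an opening bracket, else ch itself."""
    if len(ch) != 1:
        raise ValueError("delimiter must be exactly one character")
    return OPPOSITES.get(ch, ch)


def find_target(text, start, target):
    """Find the next unescaped target delimiter at or after start."""
    i = start
    while i < len(text):
        if text[i] == "\\":
            i += 2          # like the /\\./ rule: skip the escaped character
            continue
        if text[i] == target:
            return i
        i += 1
    return -1               # not found: the construct runs to end of buffer


sample = r"m{a\}b} x"
close_at = find_target(sample, 2, opposite_delimiter(sample[1]))
print(opposite_delimiter("{"), opposite_delimiter("#"), close_at)
```

Because a non-bracket character is returned unchanged, one rule using set_opposite_delimiter covers both m{...} and m#...#.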
Handling a construct like Perl's substitution syntax is slightly more complicated because it can use various delimiters (e.g. s/foo/bar/, s'foo'bar', s#foo#bar#, etc.). Furthermore, if the character after the 's' is an opening bracket character, the full pattern can use either two pairs of bracketing delimiters, or non-bracketing delimiters, as in
s[find] {replace}
s<first>/second/
White space is always ignored after the first pair. To encode this in Luddite we need several states:
/s([\{\[\(\<])/ : paint(upto, SSL_DEFAULT), \
    set_opposite_delimiter(1), => IN_SSL_REGEX2_TARGET1_OPPOSITE_1
# ...

state IN_SSL_REGEX2_TARGET1_OPPOSITE_1:
    /\\./ : #stay
    delimiter: paint(include, SSL_REGEX), \
        => IN_SSL_REGEX2_TARGET1_OPPOSITE_2
    /\z/ : paint(upto, SSL_REGEX)

state IN_SSL_REGEX2_TARGET1_OPPOSITE_2:
    /\\./ : #stay
    /[$WS]/ : #stay -- assume we're in {...} [ ... ]x
    /([\{\[\(\<])/ : paint(upto, SSL_DEFAULT), \
        set_opposite_delimiter(1), => IN_SSL_REGEX1_TARGET
    /([^\w\d])/ : paint(upto, SSL_DEFAULT), set_delimiter(1), \
        => IN_SSL_REGEX1_TARGET
    /\z/ : paint(upto, SSL_DEFAULT)
Matching the second half is similar to matching the delimiter after the 'm'.
The final part is handling constructs like the standard 's/.../.../' language. To do that we tell UDL to keep the current delimiter for another round of matching:
/s([^\w\d])/ : paint(upto, SSL_DEFAULT), set_delimiter(1), \
    => IN_SSL_REGEX2_TARGET1_SAME
# ...

state IN_SSL_REGEX2_TARGET1_SAME:
    /\\./ : #stay
    delimiter: keep_delimiter, => IN_SSL_REGEX1_TARGET
    /\z/ : paint(upto, SSL_REGEX)
Often when a construct is bracketed with matching delimiters, the target language is smart enough to ignore inner matched pairs. For example, if in Ruby you were to write
puts %Q(first(middle)second)
Ruby would write out the string "first(middle)second". To encode this in Luddite use UDL's built-in stack:
/%[%qQwWx]([\{\[\(\<])/ : paint(upto, SSL_DEFAULT), set_opposite_delimiter(1), => IN_SSL_QSTRING_NESTED
# ...

state IN_SSL_QSTRING_NESTED:
    delimiter: paint(include, SSL_STRING), => IN_SSL_DEFAULT
    /\\./ : #stay
    /[\[\{\(\<]/ : paint(upto, SSL_STRING), \
        spush_check(IN_SSL_QSTRING_NESTED), => IN_SSL_QSTRING_NESTED2

state IN_SSL_QSTRING_NESTED2:
    /\\./ : #stay
    /[\[\{\(\<]/ : spush_check(IN_SSL_QSTRING_NESTED2), \
        => IN_SSL_QSTRING_NESTED2
    /[\]\}\)\>]/ : spop_check, => IN_SSL_QSTRING_NESTED
    /\z/ : paint(include, SSL_STRING)
Finally, you've probably noticed that we put an end-of-buffer transition in many of these states. Notice that the final IN_SSL_QSTRING_NESTED2 state actually does no painting in its other matches. Normally Luddite will look at the colors a state uses to determine how to color the rest of the text if it reaches the end of the buffer. If more than one color is used, or none is used, the Luddite program should specify a color. Otherwise it's possible that Komodo will repeatedly invoke the colorizer until something is chosen.
"Here documents" are a convenient way of defining
multi-line strings. Typically they start by defining the "terminating identifier", preceded by an operator like << or <<<. The string starts on the following line, and ends when we find a line containing only the terminating identifier.
The following Luddite code outlines how to add here-document processing to a language, using PHP as an example, where we assume a here document always begins with the three less-than characters followed by a name, then the end of line:
# ... => IN_SSL_PRE_HEREDOC_1

state IN_SSL_PRE_HEREDOC_1:
    /([$NMSTART][$NMCHAR]*)/ : set_delimiter(1), paint(include, SSL_IDENTIFIER)
    /\r?$/ : paint(include, SSL_DEFAULT), => IN_SSL_IN_HEREDOC_1

state IN_SSL_IN_HEREDOC_1:
    delimiter : keep_delimiter, paint(upto, SSL_STRING), => IN_SSL_IN_FOUND_HEREDOC_1
    /./ : => IN_SSL_IN_HEREDOC_2 # Not this line

state IN_SSL_IN_HEREDOC_2:
    /.+/ : #stay
    /$/ : => IN_SSL_IN_HEREDOC_1

state IN_SSL_IN_FOUND_HEREDOC_1:
    /[\r\n]+/ : clear_delimiter, paint(upto, SSL_IDENTIFIER), => IN_SSL_DEFAULT # Got it!
    /./ : => IN_SSL_IN_HEREDOC_2 # The delimiter continues, so keep looking
In this example the keywords delimiter, keep_delimiter, and clear_delimiter all work together. After matching the delimiter, we retain it with the keep_delimiter action, and then test to make sure the delimiter is followed immediately by the end of the line. If it is, we clear it. Otherwise we return to states IN_SSL_IN_HEREDOC_2 and IN_SSL_IN_HEREDOC_1, looking for a line that contains the terminating identifier, and nothing else. By default matching a delimiter clears it, so we need to keep it and then explicitly clear it. The following example shows why this is needed:
In both Ruby and JavaScript, sometimes a '/' is just a '/', and sometimes it's the start of a regular expression. Luddite's token_check directive is used to tell the two apart. For example, in JavaScript, to determine if a '/' is the start of a regex, and not a division operator, you could write a test like this:
state IN_CSL_DEFAULT:
    #...
    '/' token_check : paint(upto, CSL_DEFAULT), => IN_CSL_REGEX
Note that the "token_check" directive is part of the test, not the action to perform. This states that if we do match, color everything before the '/' with the default color, and change to the regex state.
What happens during a token_check test depends on the contents of the token_check block specified for the current family. In a token_check block, you look at the tokens to the left of the current position. Each token consists of a two-value tuple, containing its colored style, and its text. On each token, we can decide whether to accept the token (meaning the test passes), reject the token (meaning it fails), or skip the token, meaning we get the previous token in the buffer, working towards the beginning.
Here's the JavaScript token_check block:
token_check:
    CSL_OPERATOR: reject [")", "++", "--", "]", "}", ";"]
    CSL_WORD: reject [class false function null private protected public super this true get "include" set]
    # All other keywords prefer an RE
    CSL_DEFAULT: skip all
    CSL_COMMENT: skip all
    # Default is to reject / as the start of a regex if it follows
    # an unhandled style
    #### CSL_IDENTIFIER: reject all
    #### CSL_NUMBER: reject all
    #### CSL_REGEX: reject all
    #### CSL_STRING: reject all
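The look-back walk can be modeled in a few lines of Python. This sketch makes two assumptions of its own: operators and keywords that are not in a reject list are accepted, and a '/' preceded only by skippable tokens is accepted. The slash_starts_regex name is hypothetical.

```python
# Lists copied from the JavaScript token_check block above.
REJECT_OPERATORS = {")", "++", "--", "]", "}", ";"}
REJECT_WORDS = {"class", "false", "function", "null", "private", "protected",
                "public", "super", "this", "true", "get", "include", "set"}
SKIPPED_STYLES = {"CSL_DEFAULT", "CSL_COMMENT"}


def slash_starts_regex(tokens_before):
    """tokens_before: (style, text) tuples left of the '/', in buffer order."""
    for style, text in reversed(tokens_before):
        if style in SKIPPED_STYLES:
            continue                      # skip all: look one token further left
        if style == "CSL_OPERATOR":
            return text not in REJECT_OPERATORS
        if style == "CSL_WORD":
            return text not in REJECT_WORDS
        return False                      # unhandled style: reject
    return True  # assumption: only skippable tokens precede the '/'


division = [("CSL_IDENTIFIER", "a"), ("CSL_DEFAULT", " ")]
assignment = [("CSL_IDENTIFIER", "x"), ("CSL_DEFAULT", " "),
              ("CSL_OPERATOR", "="), ("CSL_DEFAULT", " ")]
after_return = [("CSL_WORD", "return"), ("CSL_DEFAULT", " ")]
after_paren = [("CSL_OPERATOR", ")"), ("CSL_DEFAULT", " ")]

print(slash_starts_regex(division), slash_starts_regex(assignment))
print(slash_starts_regex(after_return), slash_starts_regex(after_paren))
```

So "a / b" is treated as division, while "x = /re/" and "return /re/" start a regular expression.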
You can provide either a list of strings and/or names, or the "all" keyword. Here are the rules on defaults:
To get Scintilla to calculate fold levels on each line, specify which tokens increase the folding level, and which decrease it:
Here is all the folding the JavaScript lexer currently specifies:
fold "{" CSL_OPERATOR +
fold "}" CSL_OPERATOR -
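To show how two fold declarations translate into per-line fold levels, here is a minimal Python sketch. The fold_levels function and the token layout are hypothetical, and the levels are relative depths rather than the raw values Scintilla stores.

```python
# (text, style) -> change in fold level, mirroring the two fold statements.
FOLDS = {("{", "CSL_OPERATOR"): +1, ("}", "CSL_OPERATOR"): -1}


def fold_levels(lines):
    """lines: one list of (style, text) tokens per line.

    Returns the relative fold level in effect at the start of each line.
    """
    level, levels = 0, []
    for tokens in lines:
        levels.append(level)
        for style, text in tokens:
            level += FOLDS.get((text, style), 0)
    return levels


lines = [
    [("CSL_WORD", "function"), ("CSL_IDENTIFIER", "f"),
     ("CSL_OPERATOR", "("), ("CSL_OPERATOR", ")"), ("CSL_OPERATOR", "{")],
    [("CSL_OPERATOR", "{"), ("CSL_STRING", '"{"'), ("CSL_OPERATOR", "}")],
    [("CSL_OPERATOR", "}")],
    [("CSL_IDENTIFIER", "z")],
]
levels = fold_levels(lines)
print(levels)
```

Because each fold rule names a style as well as a string, the brace inside the string token on the second line does not affect the level.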
By default, Komodo looks at the extension of the file to determine the kind of language in the file, and then it loads the appropriate language-related code. This mechanism can be further extended by opening the "File Associations" section of the Preferences area, and specifying that Komodo should look for XML namespace attributes and doctype declarations to further determine which language to load.
Authors writing UDL-based lexers for XML vocabularies can tap into this mechanism by using any combination of the namespace, public_id, and system_id declarations to specify which language Komodo should associate with a particular file, regardless of the actual extension of the filename it is saved by.
For example, XBL (XML Binding Language) contains a combination of XML and JavaScript, and is widely used for Firefox extensions and Mozilla applications. These files often are saved with names containing ".xml" extensions, but they usually contain the following prologue:
<!DOCTYPE bindings PUBLIC "-//MOZILLA//DTD XBL V1.0//EN" "http://www.mozilla.org/xbl">
<bindings id="koListboxBindingsNew" xmlns="http://www.mozilla.org/xbl" ...>
Any combination of the following three declarations in a Luddite file for XBL will be sufficient to direct Komodo to load the Luddite-based XBL mode instead of the default XML mode:
namespace "http://www.mozilla.org/xbl"
public_id "-//MOZILLA//DTD XBL V1.0//EN"
system_id "http://www.mozilla.org/xbl"
The Luddite compiler and sample .udl files can be found in the "Komodo SDK" directory within the Komodo installation tree:
Windows: INSTALLDIR\lib\sdk\
Mac OS X: INSTALLDIR/Contents/SharedSupport/sdk/
Linux: INSTALLDIR/lib/sdk/
Simply place the SDK bin directory on your PATH and you should be able to run luddite:
# On Windows
C:\> set PATH=INSTALLDIR\lib\sdk\bin;%PATH%
C:\> set PATHEXT=%PATHEXT%;.py
C:\> luddite help

# On Mac OS X
$ export PATH=INSTALLDIR/Contents/SharedSupport/sdk/bin:$PATH
$ luddite help

# On Linux
$ export PATH=INSTALLDIR/lib/sdk/bin:$PATH
$ luddite help
Typically you would write your .udl files, compile them into a lexer resource (a .lexres compiled version of your .udl file) with the "luddite compile ..." command, and build a Komodo extension (an .xpi file) with the "luddite package ..." command. To install that extension, open the built .xpi file in Komodo.
Sample UDL files can be found in the udl subdirectory of the Komodo SDK directory.