How to spell check function/variable in Emacs
This article explains how to spell check camel cased names of functions and variables in Emacs. It uses the --run-together option of GNU Aspell.
But this solution is not perfect. It wrongly flags two character sub-words as typos. For example, "onChange" is regarded as a typo because the sub-word "on" is flagged. Another issue is the namespace prefix of a function name. For example, "MS" in "MSToggleButton" is an abbreviation of "Microsoft". If "MS" is flagged as a typo, every word containing "MS" is regarded as a typo.
In this article,
- I will explain how the Emacs spell checker works
- Then we study the algorithm of aspell
- Finally, a complete solution which works with both aspell and hunspell is provided
Emacs' built-in plugin Fly Spell does the spell checking. It passes the options and plain text to the command line tool aspell. Aspell sends the typos back to Fly Spell. Fly Spell then picks which typos to display. For example, flyspell-prog-mode only displays typos in comments and strings.
Aspell doesn't understand the syntax of any programming language. It simply reports typos in plain text.
Aspell has two options for this:
- --run-together-limit is "Maximum number of words can be strung together"
- --run-together-min is "Minimal length of sub-words"
Aspell's C++ code has to be studied in order to understand the above two options. Let's start from Working::check_word in modules/speller/default/suggest.cpp.
Here is the code,
class Working : public Score {
  unsigned check_word(char * word, char * word_end, CheckInfo * ci, unsigned pos = 1);
};

unsigned Working::check_word(char * word, char * word_end, CheckInfo * ci,
                             /* it WILL modify word */
                             unsigned pos)
{
  // check the whole word before going into run-together mode
  unsigned res = check_word_s(word, ci);
  // if `res` is non-zero, it's a valid word, don't bother with run-together
  if (res) return pos + 1;
  // it's a typo because the number of sub-words would exceed "--run-together-limit"
  if (pos + 1 >= sp->run_together_limit_) return 0;
  // `i` is the `end` of the sub-word, the position AFTER its last character
  for (char * i = word + sp->run_together_min_;
       // the whole word is already checked; besides, any sub-word shorter
       // than "--run-together-min" is regarded as invalid
       i <= word_end - sp->run_together_min_;
       ++i)
  {
    char t = *i;
    // read the sub-word by setting the character at the `end` position to '\0'
    *i = '\0';
    res = check_word_s(word, ci);
    // restore the original character at the `end` position
    *i = t;
    // The current sub-word is invalid, so we append the character at the current
    // `end` position to create a new sub-word.
    // Increment `i` because `i` always points to the `end` of the sub-word
    if (!res) continue;
    // The current sub-word is valid, strip it from the whole word to create a totally
    // new word for `check_word`; `check_word` is a recursive function
    res = check_word(i, word_end, ci + 1, pos + 1);
    if (res) return res;
  }
  memset(ci, 0, sizeof(CheckInfo));
  return 0;
}
Say the first parameter of check_word is "hisHelle" and sp->run_together_min_ is 3 (assuming word_end points just past the last character),
- word points to the string "hisHelle" (in C/C++, a string is a character array whose last element is the character '\0')
- The whole word "hisHelle" is not valid, so check_word starts looking for sub-words
- i initially points to the character "H", the position right after the sub-word "his"
- check_word_s returns true for the sub-word "his"
- So we strip "his" from "hisHelle" and recursively call check_word to check the new word "Helle"
- In the new call, "Helle" as a whole is not valid. It has only five characters, so it cannot be split into two sub-words of at least three characters each ("Hell" would leave "e", which is shorter than --run-together-min). The loop body never runs and the call returns 0
- Back in the outer call, the longer sub-words "hisH" and "hisHe" are not valid either, so the loop ends. It's concluded that "hisHelle" is a typo
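The recursion is easier to follow in a stripped-down form. Below is a minimal sketch, not aspell's own code: run_together_ok and in_dict are hypothetical helpers, the five-word dictionary is made up, and std::string copies replace the in-place '\0' trick. It assumes word_end in the excerpt points just past the last character.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Toy stand-in for the dictionary lookup done by check_word_s.
// The word list is invented for this example.
static const std::set<std::string> kDict = {"his", "hell", "hello", "one", "change"};

static bool in_dict(const std::string& w) {
    std::string lower;
    for (unsigned char c : w) lower += static_cast<char>(std::tolower(c));
    return kDict.count(lower) > 0;
}

// True when `word` is a dictionary word, or a prefix from the dictionary
// followed by a remainder that passes the same test. Every prefix and every
// remainder that gets split again must be at least `min_len` characters long.
bool run_together_ok(const std::string& word, std::size_t min_len,
                     unsigned limit, unsigned pos = 1) {
    if (in_dict(word)) return true;        // whole word first
    if (pos + 1 >= limit) return false;    // same test as in the excerpt
    for (std::size_t i = min_len; i + min_len <= word.size(); ++i) {
        if (!in_dict(word.substr(0, i))) continue;  // grow the prefix
        if (run_together_ok(word.substr(i), min_len, limit, pos + 1)) return true;
    }
    return false;
}
```

With a minimum of 3 and a limit of 8, this sketch accepts "hisHello" ("his" plus "Hello") and rejects "hisHelle", because "Helle" is neither a word nor long enough to split.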
Key points:
- --run-together-limit could be bigger with enough memory. Its default value is 8. I prefer 16.
- --run-together-min can't be 2 because too many typos are combinations of "correct" two character sub-words ("hehe", "isme", …)
- --run-together-min can't be greater than 3, or else too many "correct" three character sub-words are regarded as invalid ("his", "her", "one", "two")
- So --run-together-min should always be 3
If --run-together-min is 3, the word "onChange" is reported as a typo because its first sub-word "on" has only two characters and aspell rejects it. This is obviously wrong.
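The effect of the minimum length is visible in the loop bounds alone. The sketch below lists the prefixes that the for loop in the excerpt would try for a given word, assuming word_end points just past the last character. candidate_prefixes is a hypothetical helper written for this illustration, not part of aspell.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Prefixes tried by a loop of the form
//   for (i = word + min_len; i <= word_end - min_len; ++i)
// Both the prefix and the remainder must be at least `min_len` characters.
std::vector<std::string> candidate_prefixes(const std::string& word,
                                            std::size_t min_len) {
    std::vector<std::string> out;
    for (std::size_t i = min_len; i + min_len <= word.size(); ++i)
        out.push_back(word.substr(0, i));
    return out;
}
```

For "onChange" and a minimum of 3 the candidates are "onC", "onCh" and "onCha", so "on" is never tried as a sub-word. Lowering the minimum to 2 adds "on" and "onChan" to the list.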
The solution is an Emacs Lisp predicate which supports both aspell and hunspell.
A predicate can be attached to a specific major mode. The predicate filters all the typos reported by the CLI program. If the predicate returns t, the candidate passed in is confirmed as a typo.
An example of a predicate for js2-mode,
(defun js-flyspell-verify ()
  (let* ((font-face (get-text-property (- (point) 1) 'face))
         (word (thing-at-point 'word)))
    (message "font-face=%s word=%s" font-face word)
    t))
(put 'js2-mode 'flyspell-mode-predicate 'js-flyspell-verify)
Typo candidates are filtered by js-flyspell-verify. So the predicate is the place to fix typos wrongly reported by the CLI program.
Here is the complete setup you can paste into .emacs (I set it up for js2-mode and rjsx-mode but the code is generic enough).
Please note the function split-camel-case splits a camel case word into a list of sub-words. A sub-word of one or two characters is not treated as a typo.
(defun split-camel-case (word)
  "Split camel case WORD into a list of strings.
Ported from 'https://github.com/fatih/camelcase/blob/master/camelcase.go'."
  (let* ((case-fold-search nil)
         (len (length word))
         ;; a word of LEN characters has at most LEN sub-words
         (runes (make-vector len nil))
         (runes-length 0)
         (i 0)
         ch
         (last-class 0)
         (class 0)
         rlt)
    ;; split into fields based on class of character
    (while (< i len)
      (setq ch (elt word i))
      (cond
       ;; lower case
       ((and (>= ch ?a) (<= ch ?z))
        (setq class 1))
       ;; upper case
       ((and (>= ch ?A) (<= ch ?Z))
        (setq class 2))
       ;; digit
       ((and (>= ch ?0) (<= ch ?9))
        (setq class 3))
       ;; anything else
       (t
        (setq class 4)))
      (cond
       ((= class last-class)
        (aset runes
              (1- runes-length)
              (concat (aref runes (1- runes-length)) (char-to-string ch))))
       (t
        (aset runes runes-length (char-to-string ch))
        (setq runes-length (1+ runes-length))))
      (setq last-class class)
      ;; end of while
      (setq i (1+ i)))
    ;; handle upper case -> lower case sequences, e.g.
    ;; "PDFL", "oader" -> "PDF", "Loader"
    (setq i 0)
    (while (< i (1- runes-length))
      (let* ((ch-first (aref (aref runes i) 0))
             (ch-second (aref (aref runes (1+ i)) 0)))
        (when (and (and (>= ch-first ?A) (<= ch-first ?Z))
                   (and (>= ch-second ?a) (<= ch-second ?z)))
          (aset runes (1+ i) (concat (substring (aref runes i) -1) (aref runes (1+ i))))
          (aset runes i (substring (aref runes i) 0 -1))))
      (setq i (1+ i)))
    ;; construct final result, skipping empty sub-words
    (setq i 0)
    (while (< i runes-length)
      (when (> (length (aref runes i)) 0)
        (push (aref runes i) rlt))
      (setq i (1+ i)))
    (nreverse rlt)))
(defun flyspell-detect-ispell-args (&optional run-together)
  "If RUN-TOGETHER is true, spell check the CamelCase words.
Please note RUN-TOGETHER will make aspell less capable.  So it should only be used in prog-mode-hook."
  ;; force the English dictionary, support Camel Case spelling check (tested with aspell 0.6)
  (let* ((args (list "--sug-mode=ultra" "--lang=en_US")))
    (if run-together
        (setq args (append args '("--run-together" "--run-together-limit=16"))))
    args))
;; {{ for aspell only, hunspell does not need setup `ispell-extra-args'
(setq ispell-program-name "aspell")
(setq-default ispell-extra-args (flyspell-detect-ispell-args t))
;; }}
;; ;; {{ hunspell setup, please note we use dictionary "en_US" here
;; (setq ispell-program-name "hunspell")
;; (setq ispell-local-dictionary "en_US")
;; (setq ispell-local-dictionary-alist
;; '(("en_US" "[[:alpha:]]" "[^[:alpha:]]" "[']" nil ("-d" "en_US") nil utf-8)))
;; ;; }}
(defvar extra-flyspell-predicate (lambda (_word) t)
  "A callback to check WORD.  Return t if WORD is typo.")
(defun my-flyspell-predicate (word)
  "Use aspell to check WORD.  If it's typo return t."
  (let* ((cmd (cond
               ;; aspell: `echo "helle world" | aspell pipe`
               ((string-match-p "aspell$" ispell-program-name)
                (format "echo %s | %s pipe"
                        (shell-quote-argument word)
                        ispell-program-name))
               ;; hunspell: `echo "helle world" | hunspell -a -d en_US`
               (t
                (format "echo %s | %s -a -d en_US"
                        (shell-quote-argument word)
                        ispell-program-name))))
         (cmd-output (shell-command-to-string cmd))
         rlt)
    ;; (message "word=%s cmd=%s" word cmd)
    ;; (message "cmd-output=%s" cmd-output)
    (cond
     ;; an output line starting with "&" (misspelled, with suggestions)
     ;; or "#" (misspelled, no suggestions) marks a typo
     ((string-match-p "^[&#]" cmd-output)
      ;; it's a typo because at least one sub-word is typo
      (setq rlt t))
     (t
      ;; not a typo
      (setq rlt nil)))
    rlt))
(defun js-flyspell-verify ()
  (let* ((case-fold-search nil)
         (font-matched (memq (get-text-property (- (point) 1) 'face)
                             '(js2-function-call
                               js2-function-param
                               js2-object-property
                               js2-object-property-access
                               font-lock-variable-name-face
                               font-lock-string-face
                               font-lock-function-name-face
                               font-lock-builtin-face
                               rjsx-text
                               rjsx-tag
                               rjsx-attr)))
         subwords
         word
         (rlt t))
    (cond
     ((not font-matched)
      (setq rlt nil))
     ;; ignore one or two character word
     ((< (length (setq word (thing-at-point 'word))) 3)
      (setq rlt nil))
     ;; handle camel case word
     ((and (setq subwords (split-camel-case word)) (> (length subwords) 1))
      (let* ((s (mapconcat (lambda (w)
                             (cond
                              ;; sub-word whose length is less than three
                              ((< (length w) 3)
                               "")
                              ;; special characters
                              ((not (string-match-p "^[a-zA-Z]*$" w))
                               "")
                              (t
                               w)))
                           subwords
                           " ")))
        (setq rlt (my-flyspell-predicate s))))
     (t
      (setq rlt (funcall extra-flyspell-predicate word))))
    rlt))
(put 'js2-mode 'flyspell-mode-predicate 'js-flyspell-verify)
(put 'rjsx-mode 'flyspell-mode-predicate 'js-flyspell-verify)
UPDATE: Now you can use wucuo. It's an out-of-the-box solution supporting both aspell and hunspell.