SGML-spell-checker is a tool that you can use to automatically spell-check your SGML documents.
Running it on the index page of this site:
onsgmls index.html | sgml-spell-checker \
> spelling_mistakes.txt
Produces:
0: onload
0: Dreamhost
0: livejournal
0: weblog
0: Weblog
0: StrayToaster
0: offsetHeight
...
Cool. But quite a lot of those words are spelt correctly; it’s just that the default aspell dictionary doesn’t know about them. So:
onsgmls index.html | sgml-spell-checker \
| awk '{ print $2 }' | sort | uniq \
> mywordlist.txt
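To see what the awk, sort and uniq stages are doing, here is a small sketch that feeds them a few made-up lines in the same `0: word` format as the output above. The function names `sample_output` and `unique_words` are just for illustration, and `LC_ALL=C` is an addition of mine so the sort order is predictable; the original pipeline uses whatever your locale gives you.

```shell
# Made-up checker output in the "0: word" format shown above.
sample_output() {
    printf '%s\n' '0: weblog' '0: Dreamhost' '0: weblog' '0: onload'
}

# Keep only the second field (the word), then sort and drop duplicates.
# LC_ALL=C is an assumption added here so the ordering is byte-wise.
unique_words() {
    awk '{ print $2 }' | LC_ALL=C sort | uniq
}

sample_output | unique_words
# Dreamhost
# onload
# weblog
```

The duplicate ‘weblog’ collapses to one line, which is all the word list needs.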
Then edit mywordlist.txt, weeding out all the actual spelling mistakes. Once you’re sure the list contains only correctly spelt words that are missing from the default aspell dictionary, create a custom aspell dictionary:
aspell --language-tag=en create master \
./my_custom_dictionary.aspell \
< mywordlist.txt
Then run the spell checker again, this time with an additional dictionary including words like ‘Kottke’, ‘weblog’ and ‘StrayToaster’:
onsgmls index.html | sgml-spell-checker \
--dictionary=./my_custom_dictionary.aspell \
| awk '{ print $2 }' | sort | uniq \
> spelling_mistakes.txt
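A quick sanity check, sketched here with plain grep rather than anything built into SGML-spell-checker: no word from mywordlist.txt should still turn up in spelling_mistakes.txt once the custom dictionary is in use. The helper name `still_flagged` is hypothetical.

```shell
# Hypothetical helper: print words from the custom word list that are
# still being reported as mistakes. No output means none were found.
still_flagged() {
    wordlist=$1     # e.g. mywordlist.txt
    mistakes=$2     # e.g. spelling_mistakes.txt
    # -F: fixed strings, -x: match whole lines, -f: read patterns from file
    grep -Fxf "$wordlist" "$mistakes"
}

# Usage, once both files exist:
#   still_flagged mywordlist.txt spelling_mistakes.txt
```

If this prints anything, the dictionary probably isn’t being picked up, or the word list changed after the dictionary was built.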
Finally, run the spell-check pipeline above as a cron job and check your results against some test cases.
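One way to set up the scheduled run is to wrap the pipeline in a small shell function and call it from a crontab entry. This is only a sketch: `check_spelling`, the `SGML_PARSER` and `SPELL_CHECKER` variables, the schedule and every path below are placeholders of mine, not part of SGML-spell-checker.

```shell
# Hypothetical wrapper around the pipeline used above.
# SGML_PARSER and SPELL_CHECKER default to the commands from this post;
# they are variables only so the pipeline can be pointed elsewhere.
check_spelling() {
    page=$1     # SGML/XHTML file to check
    dict=$2     # custom aspell dictionary built earlier
    out=$3      # file that receives the sorted list of unknown words
    "${SGML_PARSER:-onsgmls}" "$page" \
        | "${SPELL_CHECKER:-sgml-spell-checker}" --dictionary="$dict" \
        | awk '{ print $2 }' | sort | uniq > "$out"
}

# Example crontab entry (placeholder paths), running daily at 03:00:
# 0 3 * * * . /path/to/check_spelling.sh && check_spelling /path/to/index.html /path/to/my_custom_dictionary.aspell /path/to/spelling_mistakes.txt
```

Anything the parser writes to standard error will end up in cron’s mail, so you may want to redirect it if the page has known validation warnings.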
(If you’re wondering about onsgmls (or nsgmls), it’s an SGML validator; it helps SGML-spell-checker decide what’s English and what’s SGML (XHTML in this case). I’ll tell you more about it when I understand more about it myself.)
«April 22, MMVI»
You've arrived at the homepage of Stephen Stewart. The archive is available here for those who want it. This site is happily hosted by Dreamhost.
I work as a web designer in Belfast, and I live by the sea in a shoe. You can see me here, doing my livejournal pose as idoru called it. If you need to you can email me at carisenda -at- gmail -dot- com.