Green

SGML-Spell-Checker

SGML-Spell-Checker is a tool that automatically spell-checks your SGML documents.

Running it on the index page of this site:

onsgmls index.html | sgml-spell-checker \
  > spelling_mistakes.txt

Produces:


0: onload
0: Dreamhost
0: livejournal
0: weblog
0: Weblog
0: StrayToaster
0: offsetHeight

...

Cool. But quite a lot of those words are spelt correctly; the default aspell dictionary just doesn’t know about them. So, first dump every flagged word into a file:

onsgmls index.html | sgml-spell-checker \
  | awk '{ print $2 }' | sort | uniq \
  > mywordlist.txt
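To see what that pipeline does to the checker's output, here is a small sketch run on made-up lines in the same "0: word" format shown earlier. The sample lines are an assumption for illustration, not real output from the tool.

```shell
# Sample checker output in the "0: word" format (illustrative only).
sample_output='0: weblog
0: Dreamhost
0: weblog
0: Weblog'

# Keep only the second field (the word), then sort and de-duplicate.
# uniq is case-sensitive, so "weblog" and "Weblog" both survive.
words=$(printf '%s\n' "$sample_output" | awk '{ print $2 }' | sort | uniq)

printf '%s\n' "$words"
```

Three distinct words come out of the four input lines. Their exact order depends on your locale's sort rules.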

Then edit mywordlist.txt, weeding out all the actual spelling mistakes. Once you’re sure the list contains only words that are spelt correctly but missing from the default aspell dictionary, create a custom aspell dictionary from it:

aspell --language-tag=en create master \
  ./my_custom_dictionary.aspell \
  < mywordlist.txt
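If the same mistakes keep turning up, the weeding step that precedes the dictionary build could be partly scripted rather than done by hand. This is only a sketch: the file known_mistakes.txt and the sample words are hypothetical, and it runs in a scratch directory so it touches no real files.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)

# Hypothetical candidate list, one word per line.
printf '%s\n' Dreamhost charecter weblog tenderhooks > "$workdir/mywordlist.txt"

# Hypothetical list of words already judged to be real mistakes.
printf '%s\n' charecter tenderhooks > "$workdir/known_mistakes.txt"

# -v inverts the match, -x matches whole lines only, -F treats the
# patterns as fixed strings, -f reads the patterns from a file.
grep -v -x -F -f "$workdir/known_mistakes.txt" "$workdir/mywordlist.txt" \
  > "$workdir/mywordlist.clean.txt"

cat "$workdir/mywordlist.clean.txt"
```

The cleaned file would then be the one fed to aspell. A manual read-through is still worth doing, since this only removes mistakes you have already listed.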

Then run the spell checker again, this time with an additional dictionary including words like ‘Kottke’, ‘weblog’ and ‘StrayToaster’:

onsgmls index.html | sgml-spell-checker \
  --dictionary=./my_custom_dictionary.aspell \
  | awk '{ print $2 }' | sort | uniq \
  > spelling_mistakes.txt

Finally, run the above as a cron job and check your results against some test cases:

  • charecter
  • tenderhooks
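One way to read "test cases" here is as deliberate misspellings that should always appear in the results, so that a silent failure of the checker gets noticed. A minimal sketch under that assumption follows; the function name check_test_cases and the stand-in results file are made up for illustration.

```shell
# check_test_cases FILE WORD...
# Succeeds only if every WORD appears as a whole line in FILE.
check_test_cases() {
  file=$1
  shift
  for word in "$@"; do
    if ! grep -q -x -F -- "$word" "$file"; then
      echo "missing test case: $word" >&2
      return 1
    fi
  done
}

# Stand-in for spelling_mistakes.txt (illustrative content only).
results=$(mktemp)
printf '%s\n' charecter tenderhooks > "$results"

if check_test_cases "$results" charecter tenderhooks; then
  status=ok
else
  status=failed
fi
echo "$status"
```

Saved in a script together with the pipeline, this could be scheduled from a crontab with a line such as `0 6 * * * /path/to/check-spelling.sh` (the path is hypothetical), which would run it daily at 06:00.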

(If you’re wondering about onsgmls (or nsgmls), it’s an SGML validator; it helps SGML-Spell-Checker decide what’s English and what’s SGML (XHTML in this case). I’ll tell you more about it when I understand more about it myself.)

April 22, MMVI



About This

You've arrived at the homepage of Stephen Stewart. This site is happily hosted by Dreamhost.

Powered by Movable Type 3.34. Design © Stephen Stewart. Happily hosted by Dreamhost.