In the first two parts I covered setting up our tools, and creating a code template to build on. The next big challenge is replacing some of the built-in facilities of Javascript, like string handling, regular expressions and DOM manipulation, with C++ substitutes.
For string handling, watch out for the encoding! Most code examples use 8-bit ASCII strings, but Firefox supports Unicode strings, which allow a lot more languages to be represented. If we want a wide audience for our extension, we’ll need to support them too.
C++ inherits C’s built-in strings, as either char (for ASCII) or wchar_t (for Unicode) pointers. These are pretty old-fashioned and clunky to use: common operations like appending two strings involve explicit function calls, and you have to manually manage the memory allocated for them.
We should use the STL’s string class, std::wstring, instead. This is the Unicode version of std::string, and supports all the same operations, including appending just by doing "+". The equivalent for indexOf() is find(), which returns std::wstring::npos rather than -1 if the substring is not found. lastIndexOf() is similarly matched by rfind(). Don't reach for find_last_of() here, despite the name: it looks for the last occurrence of any single character from its argument, not the whole substring. The substring() method is closely matched by the substr() call, but beware, the second argument is the length of the substring you want, not the end index as in JS!
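To make the mapping concrete, here's a minimal sketch of the JS calls rebuilt on top of std::wstring. The helper names (indexOf, lastIndexOf, substring, checkStringHelpers) are made up for illustration and aren't part of PeteSearch, and the substring() helper assumes start <= end <= the string's length, where the JS version would swap or clamp its arguments.

```cpp
#include <cassert>
#include <string>

// Hypothetical helpers (illustrative names, not from PeteSearch) that give
// std::wstring the JavaScript calling conventions.

// Like JS indexOf(): position of the first match, or -1 if there is none.
long indexOf(const std::wstring& text, const std::wstring& needle) {
    const std::wstring::size_type pos = text.find(needle);
    return pos == std::wstring::npos ? -1L : static_cast<long>(pos);
}

// Like JS lastIndexOf(): rfind() matches the whole substring.
// find_last_of() would instead match any single character from `needle`.
long lastIndexOf(const std::wstring& text, const std::wstring& needle) {
    const std::wstring::size_type pos = text.rfind(needle);
    return pos == std::wstring::npos ? -1L : static_cast<long>(pos);
}

// Like JS substring(start, end), assuming start <= end <= text.size().
// substr() takes a length, so the end index is converted to one.
std::wstring substring(const std::wstring& text,
                       std::wstring::size_type start,
                       std::wstring::size_type end) {
    return text.substr(start, end - start);
}

// A few checks that double as usage examples.
void checkStringHelpers() {
    // Appending works with a plain "+".
    const std::wstring query = std::wstring(L"q=cats") + L"&hl=en";
    assert(query == L"q=cats&hl=en");
    assert(indexOf(query, L"=") == 1);
    assert(lastIndexOf(query, L"=") == 9);
    assert(indexOf(query, L"dogs") == -1);
    assert(substring(query, 2, 6) == L"cats");
}
```

Wrapping the npos check in one place like this means the rest of a ported extension can keep its JS-style "-1 means not found" logic unchanged.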
For regular expressions, our best bet is the Boost Regex library. You’ll need to download and install Boost to use it, but luckily the Windows installer is very painless. Once that’s done, we can use the boost::wregex object to do Unicode regular expression work (the plain boost::regex only handles narrow char strings). One pain dealing with REs in C++ is that you have to use double backslashes in the string literals you set them up with, so that to get the expression \?, you need a literal "\\?", since the compiler otherwise treats the backslash as the start of a C character escape. The regular expression functions themselves are a bit different from Javascript’s; regex_match() only returns true if the whole string matches the RE, while regex_search() finds a match anywhere within the string, so it is the one to use for repeated searches.
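Here's a rough sketch of the difference between the two calls, and of the doubled backslash. To keep it compilable without a Boost install I've used the C++11 standard library's std::wregex, which was modelled on Boost Regex; I'm assuming you'd swap the std:: prefix for boost:: and include <boost/regex.hpp> in the real BHO. The helper names, the URLs and the "q" parameter are hypothetical examples, not PeteSearch code.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: pulls the value of a "q" parameter out of a URL.
// regex_search() is used because the match sits in the middle of the string.
std::wstring extractSearchTerms(const std::wstring& url) {
    // The expression is (?:\?|&)q=([^&]*) ; note the doubled backslash
    // needed to get a single one past the C++ compiler.
    static const std::wregex pattern(L"(?:\\?|&)q=([^&]*)");
    std::wsmatch match;
    if (std::regex_search(url, match, pattern)) {
        return match[1].str();  // first capture group: the parameter value
    }
    return std::wstring();      // empty string when there is no q parameter
}

// Hypothetical helper: true only when the ENTIRE string is one key=value
// pair, because regex_match() must consume the whole input.
bool isSingleQueryPair(const std::wstring& text) {
    static const std::wregex pattern(L"[a-z]+=[^&]*");
    return std::regex_match(text, pattern);
}

// A few checks that double as usage examples.
void checkRegexHelpers() {
    assert(extractSearchTerms(L"http://example.com/search?hl=en&q=cats") == L"cats");
    assert(extractSearchTerms(L"http://example.com/about").empty());
    assert(isSingleQueryPair(L"q=cats"));
    // Part of the string matches, but not all of it, so regex_match() fails.
    assert(!isSingleQueryPair(L"q=cats&hl=en"));
}
```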
DOM manipulation is possible through the MSHTML collection of interfaces. IHTMLDocument3 is a good start; it supports a lot of familiar functions such as getElementsByTagName and getElementById. Working with the DOM does involve a lot of COM query-interfacing, so I’d recommend using ATL smart pointers (CComPtr and CComQIPtr) to handle some of the house-keeping with reference counts and casting.
PeteSearch is now detecting search page loads, and extracting the search terms and links from the document. Next we’ll look at XMLHttpRequest-style loading from within a BHO.