Right now I'm just excluding any article with HTML tags (specifically, the ">" character) in the headline or lede. But I can pretty much guarantee there are libraries that handle HTML (BeautifulSoup?) that I could be leveraging to do this smarter: stripping the tags rather than throwing the whole article away, and thus needing fewer Guardian API calls.
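
As a rough sketch of what the smarter version might look like, the snippet below strips tags from a headline or lede instead of rejecting the article. It uses Python's standard-library `html.parser` rather than BeautifulSoup so that it runs with no extra dependencies; the names `strip_html` and `_TextExtractor` are hypothetical and not part of the existing code, and the sample strings are made up for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_html(fragment):
    """Return `fragment` with HTML tags removed and whitespace collapsed.

    Hypothetical helper: assumes the headline/lede is a short HTML fragment.
    """
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.text().split())


if __name__ == "__main__":
    print(strip_html("<strong>Breaking:</strong> markets <em>rally</em>"))
    print(strip_html("Tom &amp; Jerry return"))
    print(strip_html("Poll: 5 > 3 says study"))
```

Note that the last sample contains a bare ">" but no tags, so the current ">" check would discard it even though nothing needs cleaning. If BeautifulSoup is already a dependency, `BeautifulSoup(fragment, "html.parser").get_text()` should give a similar result.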