How simple structured data
trumps clever machine learning
The late nineties dream of search engines was that they would use
grand-scale Artificial Intelligence to find everything,
understand most of it and help us retrieve the best of
it. Not much of that has really come true.
AI to rule them all! AI to find them?
Google has always performed a wide crawl of the entire web. But few
webmasters are so naive as to assume their pages will be found this way.
Even this website, which has fewer than 20 pages, has had problems with
Google finding all of them. Relying solely on the general crawl has proved
unworkable for most.
Google introduced the Sitemap
standard in 2005 to allow webmasters to eliminate the confusion by just
providing a list of all their pages. Most websites now provide sitemap
files instead of relying on the general crawl.
A sitemap file is, in short, a big XML file full of links to your site’s
pages. I think it says something that, even with this seemingly foolproof
data interchange format, Google still have to provide tooling
to help webmasters debug issues. That said, it’s a huge improvement
compared to trying to riddle out why their general crawl did or did not
find certain pages. Or found them multiple times.
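To make the "big XML file full of links" concrete, here is a minimal sketch of reading one with nothing but Python's standard library. The sample sitemap and its example.com URLs are made up for illustration, and `sitemap_urls` is a hypothetical helper name; the XML namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# A hypothetical two-page sitemap; example.com is a placeholder domain.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2020-01-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>
"""


def sitemap_urls(xml_text):
    """Return every page URL listed in a sitemap document."""
    # Parse from bytes so the XML encoding declaration is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    # Each <url> element carries one <loc> holding the page address.
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


if __name__ == "__main__":
    for url in sitemap_urls(SITEMAP_XML):
        print(url)
```

That is more or less the whole job: no crawling, no guessing, just a list the site owner wrote down.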
AI to mine them?
After a search engine finds a page the next step is to read it and
understand it. How well does this work in practice? Again, relatively few
websites expect Google to manage this on their own. Instead they provide
copious metadata to help Google understand what a page is about and how it
sits relative to other pages.
Google gave up at some point trying to work out which of two similar
pages is the original. Instead there is now a piece of
metadata which you add to let Google know which page is the “canonical”
version. This is so they know which one to put in the search results, for
example, and don’t wrongly divvy up one page’s “link juice” into multiple
Google also gave up trying to divine who the author is. While Google+
was a goer, they tried to encourage webmasters to attach metadata referring
to the author’s Google+ profile. Now that Google+ has been abandoned they
instead read metadata from Facebook’s OpenGraph
specification, particularly for things other than the main set of
Google search results (for example in the news stories they show to Android
users). For other data they parse JSON-LD metadata tags,
probably much more.
Google doesn’t just search web documents, they also have a product
search, Google Shopping (originally “Froogle”). How does Google deduce the
product data for an item from the product description page? This is,
after all, a really hard AI problem. The answer is that they simply don’t –
they require sellers to provide that information in a structured
format, ready for them to consume.
Google do, of course, carry out text analysis, as they have always done, but it’s
often forgotten that their original leg up over other search engines was
not better natural language processing but a metadata trick: using backlinks as a proxy for
notability. The process is detailed in the original academic paper and
in the PageRank
Backlink analysis was a huge step forward, but PageRank is not
about understanding what is on the page and indeed early on Google returned
pages in the search results that it had not yet even downloaded. Instead
PageRank judges the merit of a page based on what other pages link to it.
That is, based on metadata.
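To show how little of this depends on page content, here is a toy sketch of the idea using simple power iteration. It is a simplified illustration, not Google's production system: the four-page link graph is invented, the function and parameter names are my own, the 0.85 damping factor is just the commonly cited value, and spreading the rank of link-less pages evenly is one conventional choice among several.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank: score pages purely from who links to whom.

    `links` maps each page name to the list of pages it links to.
    """
    # Collect every page that appears as a source or a target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the small "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly across its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: pages with no outlinks spread their rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three pages link to "c", and "c" links to "a".
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}

if __name__ == "__main__":
    for page, score in sorted(pagerank(LINKS).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))
```

Note that the function never looks at a single word on any page. The only input is the link graph.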
…and in the darkness, combine them?
And how well, after all this, does the Artificial Intelligence do at
coming up with the relevant documents in response to search queries? Not so
well that showing structured data lifted from Wikipedia’s infoboxes on the
right-hand side wasn’t a major improvement. So many searches are now resolved by
the “sidebar” and “zero click results” that traffic to Wikipedia has
The remaining search results themselves
are increasingly troubled.
My own experience is that they now often consist of
superficial commercial “content” from sites that are experts in setting
their page metadata correctly and the other dark arts required to exploit
the latest revision of Google’s algorithm. There’s also a huge number of
Perhaps the best measure of this problem is how often I have to append
the search terms “reddit” or “site:reddit.com” to a query. Increasingly
this is the only way to find the opinions of people who aren’t being paid
to give them. I do wonder why Reddit never seems to rank particularly well
for any keyword that commercial “content sites” cover.
Perhaps the bigger illusion is that when you search with Google you are
somehow searching the sum total of human knowledge. Of course, you aren’t.
The accumulated knowledge of human civilisation is still mostly in books.
Humanity wrote books for thousands of years and has only written web pages
for a few decades. When you search, you are really just searching the sum
total of things that people have put, and managed to keep, on the web since
about 1995. Perhaps this is one reason why commercial “content sites”
appear often in searches: they put a lot of stuff on the web.
Metadata tends to displace Artificial Intelligence
The phenomenon of metadata replacing AI isn’t just limited to web
search. Manually attached metadata trumps machine learning in many fields
once they mature – especially in fields where progress is faster than it is
in internet search engines.
When your elected government snoops on you, they famously
prefer the metadata of who you emailed, phoned or chatted with to the
content of the messages themselves. It seems to be much more tractable to
flag people of interest to the security services based on who their friends
are and what websites they visit than to do clever AI on the messages they
send. Once they’re flagged, a human can always read their email anyway.
There are woolly intimations that self-driving cars will read road signs
to work out what the speed limit is for any stretch of road but the truth
seems to be that they use the current GPS co-ordinates to access manually
entered data on speed limits. You can live in the future right now, if you
use the right mobile app as your satnav.
One of the earliest commercial applications of neural nets was to detect
fraudulent credit card transactions. The neural nets worked very well, but
not well enough to avoid being a nuisance, locking you out of your account
when you went on holiday or bought a coffee in a new place. American Express
now use the combination of a cardholder-provided whitelist of merchants and
text message codes in preference to allowing the AI models to run free.
A general pattern seems to be that Artificial Intelligence is used when
first doing some new thing. Then, once the value of doing that thing is
established, society will find a way to provide the necessary data in a
machine readable format, obviating (and improving on) the AI models.
I’m sure there’s someone out there working tirelessly to perfect all the
disparate technologies – computer vision, control systems, depth
perception, etc – required in order for a Tesla to successfully navigate a
McDonald’s drive-through. Just as they get it sorted and demonstrate its
utility, McDonald’s will probably just calculate and provide those routes as
public information. After all, why bother with the maths and machine vision
when you can just write it down in an XML file?
Of course, this all only works when you can trust that the metadata is
right. That’s not always the case and this is the primary reason why Google
no longer indexes meta description strings. Those dastardly webmasters keep
But you don’t always have to use metadata from the owner of a thing. The
metadata might be provided by some neutral third party, as a matter of
public record or just the accumulated weight of numerous uncorrelated data
points. This is what happens when Google shows Wikipedia data on search
engine result pages. Or business addresses. It’s also how PageRank
The virtues of metadata
Google never publish what they have inferred about a web page with their
clever AI techniques. Even webmasters are only given access to a very small
portion of the data about their own sites to allow them to debug issues.
The whole system is stunningly opaque.
The best argument for metadata is that it’s open and there for anyone to
read. Anyone who wants to can easily write a parser for the OpenGraph tags.
They don’t need gads of AI models or cloud computing or whatever to
understand something simple about a web page.
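As a rough illustration of how little is needed, here is a sketch of such a parser built on Python's standard `html.parser` module. The class and function names and the sample page are my own inventions for the example, and it makes a simplifying assumption: only the first value of each `og:` property is kept, even though the OpenGraph protocol allows some properties to repeat.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> tags from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        content = attr_map.get("content")
        if prop.startswith("og:") and content is not None:
            # Simplification: keep only the first value seen per property.
            self.properties.setdefault(prop, content)


def parse_opengraph(html):
    """Return a dict of OpenGraph properties found in an HTML string."""
    parser = OpenGraphParser()
    parser.feed(html)
    parser.close()
    return parser.properties


# A hypothetical page head; the title and URL are placeholders.
SAMPLE_HTML = """
<html><head>
  <title>Ignored by this parser</title>
  <meta name="description" content="Also ignored">
  <meta property="og:title" content="An example article">
  <meta property="og:type" content="article">
  <meta property="og:url" content="https://example.com/article" />
</head><body></body></html>
"""

if __name__ == "__main__":
    for key, value in parse_opengraph(SAMPLE_HTML).items():
        print(key, "=", value)
```

Thirty-odd lines, no models and no cloud account, and you know the title, type and address of the page.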
It’s important, though, that the metadata sits on or near the thing
itself and that, if it doesn’t, there isn’t a requirement for lots of
interaction or co-operation to get it. Having to plead for access to or pay
for metadata usually ends up empowering monopolies or creating needless
data middlemen (who drone on and on about how “data is the new oil”). At
best it creates little barriers to getting started. Finance in particular
is riddled with this problem.
The vices of the AI myth
Google themselves say loudly and often that webmasters should “forget
metadata and focus on content”. This feeds into the Google mythos
that they have some godlike power to algorithmically understand web pages.
It also misleads the public into thinking that metadata is somehow ancillary
and that search engines will work it all out on their own. This discourages
webmasters from bothering with the basic things that will help people
discover their pages, like OpenGraph tags or Twitter cards. The enormous
number of people with “SEO” as their job title really should put the lie to
the idea that metadata doesn’t matter and that Google is a fair system.
Over-confidence in either an extant (but mysterious) or forthcoming
piece of Artificial Intelligence often discourages people from seeking out
simpler solutions. You feel like an idiot suggesting something as
diminutive as an XML tag when others make wild (and wildly confident)
claims about what the burgeoning Strong AI will achieve. After all, with
all these recaptchas I’m filling out, the machines must be getting really
good at recognising palm trees.
But “machine readable” strictly dominates machine learning. And worse
yet for the data scientists, as soon as they establish the viability of
doing something new with a computer, people will rush to apply metadata to
make the process more reliable and explainable. An ounce of markup saves a
pound of tensorflow.
Peter Thiel writes fairly convincingly in a chapter of his
book about how humans will work together with machine learning for a
long time. It’s just a shame he’s talking about surveillance of the public
by their local council.
Thiel is also the source of the “We wanted flying cars, but instead we
got 140 characters” quote which has long since been memed into oblivion by
its send-up: “We wanted flying cars but all we got were pocket-sized black
squares holding all of human knowledge”. If only it were true.
Cory Doctorow wrote an article titled Metacrap long ago.
I think it’s a great argument against being too ambitious about the
possibilities of metadata (which the Semantic Web people were) but it does
conclude with the thought that on-page metadata is a fundamentally good
Larry Page and Sergey Brin were originally pretty negative about search
engines that sold ads. Appendix A in their original paper says:
we expect that advertising-funded search engines will be inherently
biased towards the advertisers and away from the needs of the
we believe the issue of advertising causes enough mixed incentives
that it is crucial to have a competitive search engine that is
transparent and in the academic realm
Another blog post could be written on the incredible growth of the other
kind of web metadata: that present for security reasons.
X-XSS-Protections are all pretty baffling and probably mostly
mis-set or ignored. How many sites set
correctly, or at all? If you’re interested in this, I highly recommend
The Tangled Web. Even if it is slowly getting dated it remains a great
source of intuition on all the things that can go wrong in network
protocols and “sandboxed” code execution.