Wednesday, May 20, 2009

Thoughts on Wolfram Alpha

Wow, just 20 days into this blogging thing and already I'm writing wanker posts. I guess I understand how this can happen now. It's tempting to write this shit down because there's the hope that some random person will come in with a refreshing idea/response. I just hope it'll be better than "DIE IN A FIRE RTARD".

So anyway, the idea behind Wolfram Alpha is that it "understands" both your questions and the data it's using to answer them. The obviously disappointing thing is that it achieves this goal (or rather emulates it) using a Sisyphean process of manual input, as provided for by the admirable but ultimately very limited efforts of Wolfram Research. They aggregate the data from a variety of sources, but their focus is on quality, not quantity, so there is lots of human intervention. Once I got past the novelty of this shiny new toy, it quickly became quite boring. An impressive technical feat, but not earth shattering by any means. The types of data it knows about are rather arbitrary (anything "computable"), and though the various demos show an impressive amount of detail and structure, the big picture is both sparse and dull. I can't think of many interesting questions whose answer involves historical weather data or telling me what day of the week a certain date was. It sort of reminds me of savant syndrome. Answers to interesting questions require mental leaps, not just the retrieval of dry factual data.

I don't think economy of scale applies here, either. It's hard to imagine Wolfram Alpha being twice as interesting/useful (and thus potentially twice as profitable) by having twice as much data. A thousand times more data is where it's at. The project's downfall has already been foreshadowed; the semantic revolution has failed to happen for quite some time now, largely due to its reliance on manual annotations and relative uselessness on a small scale (billions of triples is still quite small). There is just too much unstructured data for us to go back and fix as a civilization. The web embodies a small subset of our potentially machine readable knowledge, and we've failed at that. The effort required to make Wolfram Alpha truly useful for getting information that cannot be found in traditional sources is colossal. Without being able to actually access all this wealth of information any automated system is still just an almanac with a search box, even if it's a very clever search box.

Comparatively, for a human, the task of extracting meaningful information from data is trivial. The reason we so want this technological breakthrough is that humans don't scale well. Wolfram Alpha's "understanding" is limited to what Wolfram Research's staff has fed into it. I don't believe they can succeed where so many others have failed without trying something radically different. They seem better motivated and goal oriented (having a product, as opposed to having a warm fuzzy feeling about the potential of machine readable data), but I don't think that this is enough for a real departure from the state of the art as of May 14th, 2009.

A slightly more interesting community driven project is Freebase. Freebase also automates input from various sources, but it is more lax, relying on continual improvement by the community. It also employs some clever tactics to make the process of improving the data fun. It doesn't have Wolfram Alpha's free form text queries, but I think it's more interesting because the data is open, editable and extensible. And yet my life still remains to be changed due to Freebase's existence.

I think the real answer will likely come from Google. How cliché, I know. But consider their translation services. By leveraging the sheer volume of data they have, they are able to use stochastic processes to provide better translation, spanning more language pairs than other systems. Supposedly "smart" systems hard coded with NLP constructs generally produce inferior results. So is Google Squared the future?

At least from what I've seen on the 'tubes, Google Squared is not quite the holy grail either. It knows to find data about "things", and dices it up into standalone but related units of data using user fed queries. Google is encouraging the adoption of lightweight semantic formats such as RDFa and microformats, but I think the key thing is that Google Squared doesn't seem to rely on this data, only benefit from it. This difference is vital for a process of incremental improvement of the data we have. If it's already useful and we're just making it better by adopting these formats, we get to reap the benefits of semantic data immediately, even if the semantic aspects aren't perfect.

But the really interesting stuff is still much further off. Once Google Squared can provide richer predicates than "is", "has" or a vague "relates to", the set of predicates itself becomes semantic data. This is not a new idea. In fact, this concept is a core part of RDF's design (a predicate can itself be the subject of other statements). What would be really interesting is to see if this data set, the meta model of the relationships between the "things" that Google Squared currently knows about, could be generated or at least expanded using data mining techniques from the data it describes. Imagine, if you will, automating the manual effort of choosing which data goes into Wolfram Alpha and of providing new ways of combining this data.
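
To make the "predicates as data" bit a little more concrete, here's a minimal sketch in plain Python. It doesn't use any RDF library, and every name in it (located_in, is_a, TransitiveRelation and so on) is made up for illustration. It's not how Google Squared or any real triple store works; it just shows that once facts are stored as (subject, predicate, object) triples, nothing stops a predicate from showing up in the subject position too.

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
# All terms below are hypothetical examples, not a real vocabulary.
triples = {
    ("Tel Aviv", "located_in", "Israel"),
    ("Israel", "located_in", "Asia"),
    # Here the predicate is itself a subject: these are facts
    # about the relationship, not about a "thing".
    ("located_in", "is_a", "TransitiveRelation"),
    ("located_in", "inverse_of", "contains"),
}


def describe(term):
    """Return every (predicate, object) pair stated about `term`."""
    return sorted((p, o) for s, p, o in triples if s == term)


def predicates_described():
    """Return the predicates that also appear as the subject of a triple."""
    used_as_predicate = {p for _, p, _ in triples}
    used_as_subject = {s for s, _, _ in triples}
    return sorted(used_as_predicate & used_as_subject)


if __name__ == "__main__":
    print(predicates_described())
    print(describe("located_in"))
```

The statements about located_in are the "meta model" I mean: in principle they could be mined from the data instead of typed in by hand.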

Another key part of using stochastic processes to make something useful is integrating positive feedback. Google's ubiquity is a clear advantage here. Compared to Wolfram Alpha's offerings, Google has orders of magnitude more data and orders of magnitude more potential for feedback to improve the way this data is processed.

There's also a financial reason for believing Google will make this happen. I hate advertisements because I don't want to buy most of that stuff. I see mostly ads that are wasting both my time and the advertisers' money. I think these sentiments are shared by many people, and yet Google makes a lot of its money by delivering slightly less annoying ads than its competitors. And Google is damn profitable. Within these imaginary technologies lies a lot of potential for ads that consumers would actually want to see. I think this pretty much guarantees an incentive for work in the field.

Anyway, here's hoping that semantic revolution will eventually happen after all. My money says that success will come from extracting new meaning from all that data, not by merely sifting through it for pieces that are already meaningful. Who knows, maybe it'll even happen in the next few decades ;-)

5 comments:

melo said...

As of this moment, blogger.com thinks that your blog has inappropriate content :).

I guess that wanker, shit and other fine words are too strong for this politically correct hoster.

JFC...

Best regards

telemachos said...

I'm more cynical: I just figured that Google doesn't want us talking so much about Wolfram Alpha.

nothingmuch said...

It was a script tag I put in the head that lets me not put scripts in the posts for gists

mndrix said...

You might enjoy Google Research's article The Unreasonable Effectiveness of Data.

nothingmuch said...

Thanks Michael, that's a great find!

It's basically what I was trying to get at, except better in every way ;-)