These are my raw notes from the session with Stephen Wolfram on the pre-launch of the Wolfram Alpha service at the Berkman center. Unfortunately, I was on a really bad Internet connection and only got the sound, and missed the first 20 minutes or so running around trying to find something better.
Notes from Stephen Wolfram on Alpha debut
…discussion of queries:
– nutrition in a slice of cheddar
– height of Mount Everest divided by length of Golden Gate bridge
– what’s the next item in this sequence
– type in a random number, see what it knows about it
– "next total solar eclipse"
What is the technology?
– computes things, it is harder to find answers on the web the more specifically you ask
– instead, we try to compute using all kinds of formulas and models created from science and package it so that we can walk up to a web site and have it provide the answer
– four pieces of technology:
— data curation, trillions of pieces of curated data, free/licensed, feeds, verify and clean this (curate), built industrial data curation line, much of it requires human domain expertise, but you need curated data
— algorithms: methods and models, expressed in Mathematica, there is a finite number of methods and models, but it is a large number…. now 5-6 million lines of math code
— linguistic analysis to understand input, no manual or documentation, have to interpret natural language. This is a little bit different from trad NL processing. working with more limited set of symbols and words. Many new methods, has turned out that ambiguity is not such a bit problem once we have mapped it onto a symbolic representation
— ability to automate presentation of things. What do you show people so they can cognitively grasp what you are, requires computational esthetics, domain knowledge.
Will run on 10k CPUs, using Grid Mathematica.
90% of the shelves in a typical reference library we have a decent start on
provide something authoritative and then give references to something upstream that is
know about ranges of values for things, can deal with that
try to give footnotes as best we can
Q: how do you deal with keeping data current
– many people have data and want to make it available
– mechanism to contribute data and mechanism for us to audit it
first instance is for humans to interact with it
there will be a variance of APIs,
intention to have a personalizable version of Alpha
metadata standards: when we open up our data repository mechanism, wn we use that can make data available
Questions from audience:
Differences of opinion in science?
– we try to give a footnote
– Most people are not exposed to science and engineering, you can do this without being a scientist
How much will you charge for this?
– website will be free
– corporate sponsors will be there as well, in sidebars
– we will know what kind of questions people ask, how can we ingest vendor information and make it available, need a wall of auditing
– professional version, subscription service
Can you combine databases, for instance to compute total mass of people in England?
– probably not automatically…
– can derive it
– "mass of people in England"
– we are working on the splat page, what happens when it doesn’t know, tries to break the query down into manageable parts
300th largest country in Europe? – answers "no known countries"
Data sources? Population of Internet users. how do you choose?
– identifying good sources is a key problem
– we try do it well, use experts, compare
– US government typically does a really good job
– we provide source information
– have personally been on the phone with many experts, is the data knowable?
– "based on available mortality data" or something
Technology focus in the future, aside from data curation?
– all of them need to be pushed forward
– more, better, faster of what we have, deeper into the data
– being able to deal with longer and more complicated linguistics
– being able to take pseudocode
– being able to take raw data or image input
– it takes me 5-10 years to understand what the next step is in a project…
How do you see this in contrast with semantic web?
– if the semantic web had been there, this would be much easier
– most of our data is not from the web, but from databases
– within Wolfram Alpha we have a symbolic ontology, didn’t create this as top down, mostly bottom-up from domains, merged them together when we realized similarities
– would like to do some semantic web things, expose our ontological mechanisms
At what point can we look at the formal specs for these ontologies?
– good news: All in symbolic mathematical code
– injecting new knowledge is complicated – nl is surprisingly messy, such as new terms coming in, for instance putting in people and there is this guy called "50 cent"
– exposure of ontology will happen
– the more words you need to describe the question, the harder it is
– there are holes in the data, hope that people will be motivated to fill them in
Social network? Communities?
– interesting, don’t know yet
How about more popular knowledge?
– who is the tallest of Britney Spears and 50 cent
– popular knowledge is more shallowly computable than scientific information
– linguistic horrors, book names and such, much of it clashes
– will need some popularity index, use Wikipedia a lot, can determine whether a person is important or not
The meaning of life? 42….
Integration with CYC?
– CYC is most advanced common sense reasoning system
– CYC takes what they reason about things and make it computing strengths
– human reasoning not that good when it comes to physics, more like Newton and using math
Will you provide the code?
– in Mathematica, code tends to be succinct enough that you can read it
– state of the art of synthesizing human-readable theorems is not that good yet
– humans are less efficient than automated and quantitative qa methods
– in many cases you can just ask it for the formula
– our pride lies in the integration, not in the models, for they come from the world
– "show formula"
Will this be integrated into Mathematica?
– future version will have a special mode, linguistic analysis, pop it to the server, can use the computation
How much more work on the natural language side?
– we don’t know
– pretty good at removing linguistic fluff, have to be careful
– when you look at people interacting with the system, but pretty soon they get lazy, only type in the things they need to know
– word order irrelevant, queries get pared down, we see deep structure of language
– but we don’t know how much further we need to go
How does this change the landscape of public access to knowledge?
– proprietary databases: challenge is make the right kind of deal
– we have been pretty successful
– we can convince them to make it casually available, but we would have to be careful that the whole thing can’t be lifted out
– we have yet to learn all the issues here
– have been pleasantly surprised by the extent to which people have given access
– there is a lot of genuinely good public data out there
This is a proprietary system – how do you feel about a wiki solution outcompeting you?
– that would be great, but
– making this thing is not easy, many parts, not just shovel in a lot of data
– Wikipedia is fantastic, but it has gone in particular directions. If you are looking for systematic data, properties of chemicals, for instance, over the course of the next two years, they get modified and there is not consistency left
– the most useful thing about Wikipedia is the folk knowledge you get there, what are things called, what is popular
– have thought about how to franchise out, it is not that easy
– by the way, it is free anyway…
– will we be inundated by new data? Encouraged by good automated curation pipelines. I like to believe that an ecosystem will develop, we can scale up.
– if you want this to work well, you can’t have 10K people feeding things in, you need central leadership
Interesting queries?
– "map of the cat" (this is what I call artificial stupidity)
– does not know anatomy yet
– how realtime is stock data? One minute delayed, some limitations
– there will be many novelty queries, but after that dies down, we are left with people who will want to use this every day
How will you feel if Google presents your results as part of their results?
– there are synergies
– we are generating things on the fly, this is not exposable to search engines
– one way to do it could be to prescan the search stream and see if wolfram alpha can have a chance to answer this
Role for academia?
– academia no longer accumulates data, useful for the world, but not for the university
– it is a shame that this has been seen as less academically respectable
– when chemistry was young, people went out and looked at every possible molecule
– this is much to computer complicated for the typical libraries
– historical antecedents may be Leibniz’ mechanical and computational calculators, he had the idea, but 300 years too early
When do we go live?
… a few weeks
– maybe a webcast if we dare…