Category Archives: Search

Write, that I may find thee

A Google Dance – when Google changes its ranking of web sites – used to happen infrequently enough that each “dance” had a name – Boston, Fritz and Brandy, for instance. Ranking changes now happen more than 500 times per year, with names like Panda #25 and Penguin 2.0, to name a few relatively recent ones. (There is even a Google algorithm change “weather report”, as many of the updates are now unnamed and very frequent.) As a consequence, search engine optimization seems to me to be changing – and, funnily enough, it is less and less about optimization and more and more about origination and creation.

It turns out that Google is now more and more about original content – that means, for instance, that you can no longer boost your web site simply by using Google Translate to create a French or Korean version of your content. Nor can you create lots of stuff that nobody reads – and by nobody, I mean not just that nobody reads your article, but that the incoming links are from, well, nobodies. According to my sources, Google’s algorithms have now evolved to the point where there are just two main mechanisms for generating the good Google juice (and they are related):

  1. Write something original and good, not seen anywhere else on the web.
  2. Get some incoming links from web sites with good Google-juice, such as the New York Times, Boing Boing, a well-known university or, well, any of the “Big 10” domains (Wikipedia, Amazon, Youtube, Facebook, eBay (2 versions), Yelp, WebMD, Walmart, and Target.)

The importance of the top domains is increasing, as seen by this chart from mozcast.com:

[chart from mozcast.com]

In other words, search engines are moving towards the same strategy for determining what is important as the rest of the world uses: if it garners the attention of the movers and shakers (and, importantly, is not a copy of something else), it must be important and hence worthy of your attention.

For the serious companies (and publishers) out there, this is good news: write well and interestingly, and you will be rewarded with more readers and more influence. It also means that companies seeking to boost their web presence may be well advised to hire good writers and create good content, rather than resort to all kinds of shady tricks – duplication of content, acquired traffic (including hiring people to search Google and click on your links and ads), and backlinks from serially created WordPress sites.

For writers, this may be good news – perhaps there is a future for good writing and serious journalism after all. The difference is that now you write to be found original by a search engine – and should a more august publication with a human behind it see what you write and publish it, that will just be a nice bonus.

Why is internal search so hard?

Have experience or an opinion? I would love to talk to you!

In collaboration with MIT CISR, I am currently researching enterprise search – i.e., the use of search engines inside corporations, whether it be for letting people outside the corporation search your website, or for letting employees search the internal collection of databases, documents, and audiovisual material. Consumer search – our everyday use of Google and other search engines – in general is very good and very fast, to the point where most people search for stuff rather than categorize it. Enterprise search, on the other hand, is often imprecise, confusing, incomplete and just not as good a source of information as searching the open Internet.

There are many reasons for this, having to do with the content (most enterprise content lacks the hyperlinks that are essential for prioritization, for instance), with the organization (lack of resources for and knowledge of search optimization, security policy issues, lack of an identified application owner), and with the users (who are too few to generate meaningful statistics and who do not, to the extent people do on the open Internet, make their information findable).

Nevertheless, there are examples of companies – often consulting companies, research-oriented firms and others who deal in large amounts of information, such as pharmaceuticals and publishers – that do good work with internal, enterprise search. I have interviewed a few of those, as well as a few search experts.

Now I would very much like to talk to anyone interested in this topic – do you have experience, viewpoints, war stories, examples, ideas about what to do and, especially, what not to do? Then I am very interested in talking to you! Please leave a comment below or send me an email at self@espen.com.

Norwegian Data Inspectorate outlaws Google App use

In a letter (reported at digi.no) to the Narvik Municipality (which has started to use Google Mail and other cloud-based applications, effectively putting much of its infrastructure in the Cloud) the Norwegian Data Inspectorate (http://www.datatilsynet.no/English/), a government watchdog for privacy issues, effectively prohibits use of Google Apps, at least for communication of personal information. A key point in this decision seems to be that Google will not tell where in the world the data is stored, and, under the Patriot Act, the US government can access the data without a court order.

Companies and government organizations in Norway are required to follow the Norwegian privacy laws, which, amongst other things, require that “personal information” (of which much can be communicated between a citizen and municipal tax, health and social service authorities) be secured, and that personal information collected for one purpose not be used for other purposes without the owner’s expressed permission.

This has interesting implications for cloud computing – many European countries have watchdogs similar to Norway’s, and many public and private organizations are interested in using Google’s services for their communication needs. My guess is that Google will need to offer some sort of reassurance that the data is outside US jurisdiction, or effectively forgo this market to competitors such as Microsoft or some of the local consulting companies, which are busy building their own private clouds. It should be an interesting discussion at Google – the Data Inspectorate is a quite popular watchdog, Norway has some of the strongest privacy protection laws in the world (though, for some reason, it publishes people’s income and tax details), and Google’s motto of “Don’t be evil” might be put to the test here – national laws limiting global infrastructures.

What you can learn from your LinkedIn network

LinkedIn Maps is a fascinating service that lets you map out your contact network. Here is my first-level network, with 848 nodes (click for larger image):

[image: my first-level LinkedIn network]

The colors are added automatically by LinkedIn, presumably by profile similarity and links to other networks. You have to add the labels yourself – they are reasonably precise, at least for the top five groups (listed according to size and, I presume, interconnectedness).

As can be seen, I am a gatekeeper between a network of consultants and researchers in the States (the orange group) and am reasonably plugged into the IT industry, primarily the Norwegian one (the dark blue). The others are fairly obvious, with the exception of the last category, which happens to be an eclectic group of people that I interact with quite a lot, but who are hard to categorize, at least from their backgrounds.

Incidentally, the “shared” map, which takes away names, provides more information for analysis. Note the yellow nodes in my green network on the right: These are the people hired by BI to manage or teach in China. They are, not in nationality but in orientation, foreigners in their own organization.

My LinkedIn policy is to accept anyone I know (i.e. have had dealings with and would like in my network), which, naturally, includes a number of students (I will friend any student of my courses as long as I can remember them, though I must admit I am a bit sloppy there.)

What is missing? Two things stand out: I have many contacts in Norwegian media and in the international blogosphere who aren’t here because, well, Norwegian media use Twitter or their own outlets, and bloggers use, well, their blogs. Hence, the commentariat is largely invisible in the LinkedIn world (except for Jill Walker Rettberg, who sicced me onto LinkedIn Maps). Also, a number of personal friends are not here, simply because LinkedIn is a professional network – and as such captures formal relationships, not your daily communications.

Now, what really would make me curious is what this map would look like for my Facebook, Twitter and Gmail accounts – and how they overlap. But the network in itself is interesting – and tells me that increasing the interaction between my USA network and the Norwegian IT industry wouldn’t hurt.

How students search

David Weinberger has posted his notes from a very interesting session at Berkman that I for some reason missed – Alison Head’s presentation of studies of students’ information search behavior from the Project Information Literacy project. The findings confirm a lot of what I would have thought just by observing my own (young adult) children’s search behavior, or, for that matter, my own. Wikipedia is used a lot, and quite intelligently, in the beginning of a search. You talk to librarians and other people to get the vocabulary necessary for a search. And students (and everyone else) want one database, not many.

Two books on search and social network analysis

Social Network Analysis for Startups: Finding Connections on the Social Web by Maksim Tsvetovat
My rating: 3 of 5 stars

Concise and well-written (like most O’Reilly stuff) book on basic social network analysis, complete with (Python, Unix-based) code and examples. You can ignore the code samples if you want to just read the book (I was able to replicate some of them using UCINet, a network analysis tool).

Liked it. Recommended.

Search Analytics for Your Site: Conversations with Your Customers by Louis Rosenfeld
My rating: 4 of 5 stars

Very straightforward and practically oriented – with lots of good examples. Search log analysis – seeing what customers are looking for and whether or not they find it – is as close to having a real, recorded and analyzable conversation with your customers as you can come, yet very few companies do it. Rosenfeld shows how to do it, and also how to find the low-hanging fruit and how to justify spending resources on it.
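To make the low-hanging fruit concrete: here is a minimal sketch of a first pass over a search log, assuming a hypothetical tab-separated file with one query and its result count per line (the file name and format are mine, not Rosenfeld’s):

    # First pass over a search log (hypothetical format: query<TAB>result_count per line).
    from collections import Counter

    query_counts = Counter()
    zero_result_counts = Counter()

    with open("searchlog.tsv", encoding="utf-8") as log:   # hypothetical log file
        for line in log:
            query, _, results = line.rstrip("\n").partition("\t")
            query = query.strip().lower()
            if not query:
                continue
            query_counts[query] += 1
            if results.strip() == "0":                     # the query found nothing
                zero_result_counts[query] += 1

    print("Top queries:", query_counts.most_common(10))
    print("Top zero-result queries:", zero_result_counts.most_common(10))

Even a crude list of the most frequent zero-result queries usually points straight at missing content or missing synonyms.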

This is not rocket science – I was, quite frankly, astonished at how few companies do this. With more and more traffic coming from search engines, more and more users searching rather than navigating hierarchically, and the invisibility of dissatisfied customers (and the lost opportunities they represent), this should be high on any CIO’s agenda.

Highly recommended.

View all my reviews

Does LinkedIn help or disrupt headhunters?

(I am looking for one or two M.Sc. students to research this question for their thesis.)

The first users of LinkedIn were, as far as I can tell, headhunters (at least the first users with 500+ contacts and premium subscriptions.) It makes sense – after all, having a large network of professionals in many companies is a requirement for a headhunter, and LinkedIn certainly makes it easy not only to manage the contacts and keep in touch with them, but also allows access to each individual contact’s network. However, LinkedIn (and, of course, other services such as Facebook, Plaxo, etc.) offers its services to all, making connections visible and to a certain extent enabling anyone with a contact network and some patience to find people that might be candidates for a position.

I suspect that the evolution of the relationship between headhunters and LinkedIn is a bit like that of fixed-line telephone companies to cell phones: In the early days, cell phones were welcomed because they extended the network and were an important source of additional traffic. Eventually, like a cuckoo’s egg, the new technology displaced the old: cell phones have now begun to replace fixed lines. Will LinkedIn and similar professional networks replace headhunters?

If you ask the headhunters, you will hear that finding contacts is only a small part of their value proposition – what you really pay for is the ability to find the right candidate, to make sure that this person is competent, motivated and available – and that this kind of activity cannot be outsourced or automated via some computer network. They will grudgingly acknowledge that LinkedIn can help find candidates for lower-level and middle-management positions, but that for the really important positions, you will need the network, judgment and evaluative processes of a headhunting company.

On the other hand, if you ask HR departments charged with finding people, they will tell you that LinkedIn, and to a certain extent Facebook, is the greatest thing since sliced bread when it comes to finding people quickly, vetting candidates (sometimes discovering youthful indiscretions) and establishing relationships. I have heard people enthuse over not having to use headhunters anymore.

So, the incumbents see it as a low-quality irrelevance, the users see it as a useful and cheap replacement. To me, this sounds suspiciously like a disruption in the making, especially since, in the wake of the financial crisis, companies are looking to save money and HR departments would dearly like to provide more value for less money, since they are often marginalized in the corporation.

I would like to find out if this is the case – and am therefore looking for a student or two who would like to do their Master’s thesis on this topic, under my supervision. The research will be funded through the iAD Center for Research-based innovation. Ideally, I would want students who want to research this with a high degree of rigor (perhaps getting into network analysis tools) but I am also willing to talk to people who want to do it with more traditional research approaches – say, a combination of a questionnaire and interviews/case descriptions of how LinkedIn is used by headhunters, HR departments and candidates looking for new challenges.

So – if you are interested – please contact me via email at self@espen.com. Hope to hear from you!

Stephen Wolfram’s computable universe

I love Wolfram Alpha and think it has deep implications for our relationship with information, indeed our use of language both in a human-computer interaction sense and as a vehicle for passing information to each other.

In this video from TED2010, Stephen Wolfram lays out (and his language and presentation have developed considerably since Alpha was launched a year ago) where Alpha fits as an exploration of a computable universe, enabling the experimental marriage of the precision of mathematics with the messiness of the real world.

This video is both radical and incremental: radical in its bold statement that a thought experiment such as computable universes (see Neal Stephenson’s In the beginning was…the command line, specifically the last chapter, for an entertaining explanation) actually could be generated and investigated – as radical as anything Wolfram has ever proposed. The idea of democratization of programming, on the other hand, is as old as COBOL – and I don’t think Alpha or Mathematica is going to provide it – though it might go some way, particularly if Alpha gains some market share and the idea of computing things in real time rather than accessing stored computations takes hold.

Anyway – see the video, enjoy the spark of ideas you get from it – and try out Wolfram Alpha. My best candidate for the "insert brief insightful summary research" button I have always been looking for on my keyboard.

Towards a theory of technology evolution

The Nature of Technology: What It Is and How It Evolves by W. Brian Arthur

My rating: 4 of 5 stars

Arthur sets out to articulate a theory of technology, and to a certain extent succeeds, at least in articulating the importance of technology and the layered, self-referencing and self-creating nature of its evolution.

The first of the two main concepts I took away was the layered nature of technology, summarized in these three points:

  1. Technology is a combination of components.
  2. Each component is itself a technology.
  3. Each technology exploits an effect or phenomenon (and usually several).

The second was the four different ways technology evolves, which Arthur lays out in four separate chapters and summarizes on page 163 (my italics added):

“There is no single mechanism, instead there are four more or less separate ones. Innovation consists in novel solutions being arrived at in standard engineering – the thousands of small advancements and fixes that cumulate to move practice forward. It consists in radically novel technologies being brought into being by the process of invention. It consists in these novel technologies developing by changing their internal parts or adding to them in the process of structural deepening. And it consists in whole bodies of technology emerging, building out over time, and creatively transforming the industries that encounter them. Each of these types of innovation is important. And each is perfectly tangible. Innovation is not something mysterious. Certainly it is not a matter of vaguely invoking something called “creativity.” Innovation is simply the accomplishing of the tasks of the economy by other means.”

I liked the book for its ambition, view of technology as something that evolves, and clear-headed way of thinking about and expressing a beginning grand theory. The concepts are intuitive and beguiling, but I did miss references to – and attempts to build on, or differentiate itself from – other valuable concepts of technology, such as sustaining vs. disruptive, competence-enhancing vs. competence-destroying, architectural vs. procedural, and so on. There is a lot of research going on in this area – we are about to break up the formerly black and mysterious box called innovation and show that it really comes down to subcategories and the interplay of quite understandable drivers. Arthur’s contribution here is significant – but it is, at least the way I read it, the way of the independent thinker who would have a lot more influence if some of the language and some of the categories were a bit closer to, or at least distinctively positioned in relation to, what others think and say.

View all my reviews >>

GRA6821 Eleventh lecture: Search technology and innovation

(Friday 13th November – 0830-about 1200, room A2-075)

FAST is a Norwegian software company that was acquired by Microsoft about a year and a half ago. In this class (held together with an EMBA class), we will hear presentations from people at FAST, from Accenture, and from BI. The idea is to showcase a research initiative, to learn something about search technology, and to see how a software company accesses the market in cooperation with partners.

To prepare for this meeting, it is a good idea to read up on search technology, from both a technical and a business perspective. Do this by looking for literature on your own – but here are a few pointers to individual articles, blogs, and other resources:

Articles:

  • How search engines work: Start with Wikipedia on web search engines, go from there.
  • Brin, S. and L. Page (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. Seventh International WWW Conference, Brisbane, Australia. (PDF). The paper that started Google.
  • Rangaswamy, A., C. L. Giles, et al. (2009). "A Strategic Perspective on Search Engines: Thought Candies for Practitioners and Researchers." Journal of Interactive Marketing 23: 49-60. (in Blackboard). Excellent overview of some strategic issues around search technology.
  • Ghemawat, S., H. Gobioff, et al. (2003). The Google File System. ACM Symposium on Operating Systems Principles, ACM. (This is medium-to-heavy-duty computer science – I don’t expect you to understand it in detail, but note the difference between this system and a normal database system: the search system is optimized for an enormous number of queries (reads) but relatively few insertions of data (writes), whereas a database is optimized towards handling data insertion fast and well.)
  • These articles on Google and others.

Blogs

Others

Longer stuff, such as books:

  • Barroso, L. A. and U. Hölzle (2009). The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Synthesis Lectures on Computer Architecture. M. D. Hill, Morgan & Claypool. (Excellent piece on how to design a warehouse-scale data center – i.e., how do these Google-monsters really work?)
  • Weinberger, D. (2007). Everything is Miscellaneous: The Power of the New Digital Disorder. New York, Henry Holt and Company. Brilliant on how the availability of search changes our relationship to information.
  • Morville, P. (2005). Ambient Findability, O’Reilly. See this blog post.
  • Battelle, J. (2005). The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture. London, UK, Penguin Portfolio. See this blog post.

Our search-detected personalities

Personas is an interesting project at the Media Lab which takes your (or anyone else’s) name as input, determines your personality based on what it finds about you on the web, and generates a graphical representation. This is my result:

[image: my Personas result]

…which I found rather disturbing: fame, sports and religion seem to take way too much space here. The reason, of course, is that my name is rather common in Norway, and, for example, a formerly well-known skier skews the results, even though I seem to be the most web-known person with that name.

Anyway, if you have a rare name, it might be accurate – and if your name is John Smith, you might be left with an average, possibly tilted a bit towards Pocahontas:

[image]

Anyway – try it out. You might be surprised. And please remember – this is an art project, not an accurate representation of anything…

Update September 20: I somehow forgot to point to Naomi Haque’s blog post about Personas, with a discussion of how social networking changes our perception of self.

Are social networks a help or a threat to headhunters?

In a currently hot Youtube video which breathlessly evangelizes the revolutionary nature of social networks, I found this statement: "80% of companies are using LinkedIn as their primary tool to find employees". In the comments this is corrected to "80 percent of companies use or are planning to use social networking to find and attract candidates this year", which sounds rather more believable. Social media is where the young people (and, eventually, us in the middle ages as well) are, so that is where you should look.

At the same time, many of the most prolific users of LinkedIn (and, at least according to this guy, Twitter), both in terms of number of contacts and other activities, are headhunters. It is these people’s business to know many people and be able to find someone who matches a company’s demands.

[image: social network of nine individuals]

Headhunters are the proverbial networkers – they derive their value from knowing not just many people, but the right people. In particular, headhunters who know people in many places are valuable, because they would then be the only conduit between one group and another. Your network is more valuable the fewer of your contacts are also in contact with each other.

The American sociologist Ronald S. Burt, in his book Structural Holes: The Social Structure of Competition (1992), showed that social capital accrues to those who not only know many people, but have connections across groups. Or, in other words, if everyone had been directly linked, you would have a dense network structure. The fact that we aren’t means that there are structural holes – hence the term. In the picture to the right, we see a social network of 9 individuals. Person A here derives social capital from being the link between two groups that otherwise are only internally connected. A would be an excellent headhunter here. (Much as profits can only be generated if you can locate market imperfections.)
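To make the figure concrete, here is a small sketch (node labels invented to mirror the picture) using the networkx library, showing that the bridging person scores highest on betweenness centrality, a standard measure of sitting on the shortest paths between otherwise disconnected groups:

    # Two internally dense groups of four, bridged only by person A (cf. the 9-person figure).
    from itertools import combinations
    import networkx as nx

    G = nx.Graph()
    group1, group2 = ["B", "C", "D", "E"], ["F", "G", "H", "I"]
    G.add_edges_from(combinations(group1, 2))   # group 1: everyone knows everyone
    G.add_edges_from(combinations(group2, 2))   # group 2: everyone knows everyone
    G.add_edges_from([("A", "B"), ("A", "F")])  # A is the only conduit between the groups

    for node, score in sorted(nx.betweenness_centrality(G).items(), key=lambda kv: -kv[1]):
        print(node, round(score, 2))            # A tops the list: it spans the structural hole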

LinkedIn is a social network, indistinguishable from a regular one (i.e., one that is not digitally facilitated) except that you can search across it, directly up to three levels away, indirectly a bit further. Headhunters like it for this reason, and use it extensively in the early phases of locating a candidate. The trouble is that LinkedIn (not to mention the tendency for more and more people to have their CVs online on regular websites) makes searching for candidates easy for everyone else as well. In other words – while initially helpful, is the long-term result of this searchability that headhunters will no longer be necessary?

Search technology – in social networks as well as in general – lowers the transaction cost of finding something. Lower transaction costs favor coordination by markets rather than by hierarchy (or, in this case, network). Hence, the value of having a central position in that network should diminish. On the other hand, search technology (in networks in particular) allows you to extend your network, and hence increase your social capital. Which effect is stronger remains to be seen.

Anyway, this should make for interesting research. Anyone out there in headhunterland interested in talking to me about their use of these tools?

Plagiarism showcased – and a call for action

I hate plagiarism, partially because it has happened to me, partially because I publish way too little because I overly self-criticize for lack of original thinking, and partially because I have had it happen with quite a few students and am getting more and more tired of having to explain, even to executive students with serious job experience, that clipping somebody else’s text and presenting it as your own is not permissible – this year, I even had a student copy things out of Wikipedia and argue that it wasn’t plagiarism because Wikipedia is not copyrighted.

I suspect plagiarism is a bigger problem than we think. The most recent spat is noted in Boing Boing – read the comments if you want a good laugh and some serious discussion. (My observation, not particularly original: Even if this thing wasn’t plagiarized, isn’t this rather thin for a doctoral dissertation?)

The thing is, plagiarism will come back to bite you, and with the search tools out there, I can see a point in a not too distant future where all academic articles ever published will be fed into a plagiarism checker, with very interesting results. Quite a few careers will end, no doubt after much huffing and puffing. Johannes Gehrke and friends at Cornell have already done work on this for computer science articles – I just can’t wait to see what will come out of tools like these when they really get cranking. I seem to remember Johannes as saying that most people don’t plagiarize, but that a few seem to do it quite a lot.

It is high time we turn the student control protocols loose on published academic work as well. Nothing like many eyeballs to dig out that shallowness…

A wave of Google

This presentation from the Google I/O conference is an 80-minute demonstration of a really interesting collaborative tool that very successfully blends the look and feel of regular tools (email, Twitter) with the embeddedness and immediacy of wikis and shared documents. I am quite excited about this and hope it makes it out into the consumer space and does not just end up inside single organizations – collaborative spaces can create a world of many walled gardens, and I am a person who works as much between organizations as in them.

Google Wave really shows the power of centralized processing and storage. Here are some things I noted and liked:

  • immediate updating (broadcast) to all clients, keystroke by keystroke
  • embedded, fully editable information objects
  • history awareness (playback interactions)
  • central storage and broadcast means you can edit information objects and have the changes reflected back in previous views, which gives a pretty good indication that the architecture of this system is a tape of interactions played forward (see the sketch after this list)
  • concurrent collaborative editing (I want this! No more refreshes!)
  • cool extensions, such as a context-aware spell checker, an immediate link creator, concurrent searcher
  • programs are seen as participants much like humans
  • easy developer model, all you need to do is edit objects and store them back
  • client-side and server-side API
  • interactions with outside systems
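To illustrate the “tape of interactions played forward” idea from the list above, here is a toy sketch (my own guess at the general shape, not Wave’s actual protocol): the document is nothing but an ordered log of operations, and any historical state, including playback, falls out of replaying a prefix of that log.

    # Toy "tape of interactions": the document is an ordered log of (author, position, text) inserts.
    operations = []

    def replay(ops):
        """Rebuild the document by playing the operations forward."""
        doc = ""
        for _author, position, text in ops:
            doc = doc[:position] + text + doc[position:]
        return doc

    operations.append(("espen", 0, "Hello"))
    operations.append(("lars", 5, " world"))
    operations.append(("espen", 11, "!"))

    print(replay(operations))       # current state: "Hello world!"
    print(replay(operations[:2]))   # playback of an earlier state: "Hello world"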

I can see some strategic drivers behind this: Google is very much threatened by walled gardens such as Facebook, and this could be a great way of breaking them open (remember, programs go from applications to platforms to protocols, and this is a platform built on top of OpenSocial, which jams open walled gardens). This could perhaps be just what I need to work more effectively across several organizations. I just can’t wait to try this out when it finally arrives.

From surfing the net to surfing the waves….

Update: Here is the Google Blog entry describing Wave from Lars Rasmussen.

From links to seeds: Edging towards the semantic web

Wolfram Alpha just may take us one step closer to the elusive Semantic Web, by evolving a communication protocol out of its query terms.

(this is very much in ruminating form – comments welcome)

Wolfram Alpha, an exciting new kind of "computational" search engine, officially launched on May 18. Rather than looking up documents where your question has been answered before, it actually computes the answer. The difference, as Stephen Wolfram himself has said, is that if you ask what the distance is to the moon, Google and other search engines will find you documents that tell you the average distance, whereas Wolfram Alpha will calculate what the distance is right now, and tell you that, in addition to many other facts (such as the average). Wolfram Alpha does not store answers, but creates them every time. And it primarily answers numerical, computable questions.

The difference between Google (and other search engines) and Wolfram Alpha is not so clear-cut, of course. If you ask Google "17 mpg in liters per 100km" it will calculate the result for you. And you can send Wolfram Alpha non-computational queries such as "Norway" and it will give an informational answer. The difference lies more in what kind of data the two services work against, and how they determine what to show you: Google crawls the web, tracking links and monitoring user responses, in a sense asking every page and every user of their services what they think about all web pages (mostly, of course, we don’t think anything about most of them, but in principle we do.) Wolfram Alpha works against a database of facts with a set of defined computational algorithms – it stores less and derives more. (That being said, they will both answer the question "what is the answer to life, the universe and everything" the same way….)
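The unit conversion in the "17 mpg" example is, of course, simple arithmetic; here is a quick sketch of the calculation itself (my own numbers, not either engine’s output):

    # Convert US miles-per-gallon to litres per 100 km.
    LITRES_PER_US_GALLON = 3.785411784
    KM_PER_MILE = 1.609344

    def mpg_to_litres_per_100km(mpg):
        return 100 * LITRES_PER_US_GALLON / (mpg * KM_PER_MILE)

    print(round(mpg_to_litres_per_100km(17), 2))   # roughly 13.84 litres per 100 km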

While the technical differences are important and interesting, the real difference between WA and Google lies in what kind of questions they can answer – to use Clayton Christensen’s concept, the different jobs you would hire them to do. You would hire Google to figure out information, introductions, background and concepts – or to find that email you didn’t bother filing away in the correct folder. You would hire Alpha to answer precise questions and get the facts, rather than what the web collectively has decided are the facts.

The meaning of it all

Now – what will the long-term impact of Alpha be? Google has made us replace categorization with search – we no longer bother filing things away and remembering them, for we can find them with a few half-remembered keywords, relying on sophisticated query front-end processing and the fact that most of our not-that-great minds think depressingly alike. Wolfram Alpha, on the other hand, is quite a different animal. Back in the 80s, I once saw someone exhort their not very digital readers to think of the personal computer as a "friendly assistant who is quite stupid in everything but mathematics."  Wolfram Alpha is quite a bit smarter than that, of course, but the fact is that we now have access to a service which, quite simply, will do the math and look up the facts for us. Our own personal Hermione Granger, as it were.

I think the long-term impact of Wolfram Alpha will be to further something that may not have started with Google, but certainly became apparent with them: the use of search terms (or, if you will, seeds) as references. It is already common, rather than writing out a URL, to help people find something by saying "Google this and you will find it". I have a couple of blogs and a web page, but googling my name will get you there faster (and you can misspell my last name and still not miss.) The risk in doing that, of course, is that something can intervene. As I read (in this paper), General Motors, a few years ago, had an ad for a new Pontiac model, at the end of which they exhorted the audience to "Google Pontiac" to find out more. Mazda quickly set up a web page with Pontiac in it, bought some keywords on Google, and quite literally shanghaied GM’s ad.

Wolfram Alpha, on the other hand, will, given the same input, return the same answer every time. If the answer should change, it is because the underlying data has changed (or, extremely rarely, because somebody figured out a new way of calculating it.) It would not be because someone external to the company has figured out a way to game the system. This means that we can use references to Wolfram Alpha as shorthand – enter "budget surplus" in Wolfram Alpha, and the results will stare you in the face. In the sense that math is a language for expressing certain concepts in a very terse and precise language, Wolfram Alpha seeds will, I think, emerge as a notation for referring to factual information.
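Passing such a seed around is trivial; here is a small sketch, assuming Wolfram Alpha keeps a query-string pattern of the input?i= kind (the exact URL form may of course change):

    # Turn a Wolfram Alpha "seed" into a shareable reference (assumes an input?i= URL pattern).
    from urllib.parse import quote_plus

    def alpha_reference(seed):
        return "https://www.wolframalpha.com/input?i=" + quote_plus(seed)

    print(alpha_reference("budget surplus"))
    print(alpha_reference("distance to the moon"))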

A short detour into graffiti

Back in the early-to-mid-90s, Apple launched one of the first pen-based PDAs, the Apple Newton. The Newton was, for its time, an amazing technology, but for once Apple screwed it up, largely because they tried to make the device do too much. One important issue was the handwriting recognition software – it would let you write in your own handwriting, and then try to interpret it. I am a physician’s son, and I certainly took after my father in the handwriting department. Newton could not make sense of my scribbles, even if I tried to behave, and, given that handwriting recognition is hard, it took a long time doing it. I bought one, and then sent it back. Then the Palm Pilot came, and became the device to get.

The Palm Pilot did not recognize handwriting – it demanded that you, the user, write to it in a sign language called Graffiti, which recognized individual characters. Most of the characters resembled the regular ones enough that you could guess what they were; for the others you either had to consult a small plastic card or experiment. The feedback was rapid, so experimenting usually worked well, and pretty soon you had learned – or, rather, your hand had learned – to enter the Graffiti characters rapidly and accurately.

Wolfram Alpha works in the same way as Graffiti did: As Stephen Wolfram says in his talk at the Berkman Center, people start out writing natural language but pretty quickly trim it down to just the key concepts (a process known in search technology circles as "anti-phrasing".) In other words, by dint of patience and experimentation, we (or, at least, some of us) will learn to write queries in a notation that Wolfram Alpha understands, much like our hands learned Graffiti.

From links to seeds to semantics

Semantics is really about symbols and shorthand – a word is created as shorthand for a more complicated concept by a process of internalization. When learning a language, rapid feedback helps (which is why I think it is easier to learn a language with a strict and terse grammar rather than a permissive one), simplicity helps, and so does a structure and culture that allows for creating new words by relying on shared context and intuitive combinations (see this great video with Stephen Fry and Jonathan Ross on language creation for some great examples.)

And this is what we need to do – gather around Wolfram Alpha and figure out the best way of interacting with the system – and then conduct "what if" analyses of what happens if we change the input just a little. To a certain extent, it is happening already, starting with people finding Easter eggs – little jokes developers leave in programs for users to find. Pretty soon we will start figuring out the notation, and you will see web pages use Wolfram Alpha queries first as references, then as modules, then as dynamic elements.

It is sort of quirky when humans start to exchange query seeds (or search terms, if you will).  It gets downright interesting when computers start doing it. It would also be part of an ongoing evolution of gradually increasing meaningfulness of computer messaging.

When computers – or, if you will, programs – needed to exchange information in the early days, they did it in a machine-efficient manner – information was passed using shared memory addresses, hexadecimal codes, assembler instructions and other terse and efficient, but humanly unreadable encoding schemes. Sometime in the early 80s, computers were getting powerful enough that the exchanges gradually could be done in human-readable format – the SMTP protocol, for instance, a standard for exchanging email, could be read and even hand-built by humans (as I remember doing in 1985, to send email outside the company network I was on.) The world wide web, conceived in the early 90s and live to a wider audience in 1994, had at its core an addressing system – the URL – which could be used as a general way of conversing between computers, no matter what their operating system or languages. (To the technology purists out there – yes, WWW relies on a whole slew of other standards as well, but I am trying to make a point here) It was rather inefficient from a machine communication perspective, but very flexible and easy to understand for developers and users alike. Over time, it has been refined from pure exchange of information to the sophisticated exchanges needed to make sure it really is you when you log into your online bank – essentially by increasing the sophistication of the HTML markup language towards standards such as XML, where you can send over not just instructions and data but also definitions and metadata.
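For readers who never had to do this by hand: an SMTP session really is plain, readable text, one command and one numeric reply at a time. Here is a minimal sketch of driving it from code (the server name and addresses are placeholders, and a modern server would insist on authentication and TLS):

    # Hand-drive a bare-bones SMTP dialogue over a socket (placeholder server and addresses).
    import socket

    commands = [
        b"HELO example.org\r\n",
        b"MAIL FROM:<self@example.org>\r\n",
        b"RCPT TO:<friend@example.org>\r\n",
        b"DATA\r\n",
        b"Subject: Hello\r\n\r\nTyped in more or less by hand.\r\n.\r\n",
        b"QUIT\r\n",
    ]

    with socket.create_connection(("mail.example.org", 25)) as conn:   # placeholder host
        print(conn.recv(1024).decode())         # server greeting, e.g. "220 ..."
        for command in commands:
            conn.sendall(command)
            print(conn.recv(1024).decode())     # every reply is readable: code + text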

The much-discussed semantic web is the natural continuation of this evolution – programming further and further away from the metal, if you will. Human requests for information from each other are imprecise but rely on a shared understanding of what is going on, an ability to interpret results in context, and a willingness to use many clues and requests for clarification to arrive at a desired result. Observe two humans interacting over the telephone – they can have deep and rich discussions, but as soon as the conversation involves computers, they default to slow and simple communication protocols: spelling words out (sometimes using the international phonetic alphabet), going back and forth about where to apply mouse clicks and keystrokes, double-checking to avoid mistakes. We just aren’t as good at communicating as computers are – but can the computers eventually get good enough to communicate with us?

I think the solution lies in mutual adaptation, and the exchange of references to data and information in other terms than direct document addresses may just be the key to achieving that. Increases in performance and functionality of computers have always progressed in a punctuated equilibrium fashion, alternating between integrated and modular architectures. The first mainframes were integrated with simple terminal interfaces, which gave way to client-server architectures (exchanging SQL requests), which gave way to highly modular TCP/IP-based architectures (exchanging URLs), which may give way to mainframe-like semi-integrated data centers. I think those data centers will exchange information at a higher semantic level than any of the others – and Wolfram Alpha, with its terse but precise query structure may just be the way to get there.

Interesting Wolfram Alpha statistics

Here is the answer you get from entering "budget surplus" into Wolfram Alpha:

[image: Wolfram Alpha results for "budget surplus"]

Two things I did not know: The fifth largest government surplus in the world is held by Serbia, which surprises me, given that the country has 14% unemployment and a recovering economy, according to Wikipedia. And that Japan’s deficit is very close to the US’, indicating that things are not as bad in the US as you might think. Or perhaps that the numbers are a bit dated, but according to the source information, most of the numbers are from 2009.

Since May 17th is Norway’s national day, I think it behooves me to point out that of the five surplus states listed above, Norway is the nicest place to live, by most measures (weather, culture, politics, human rights, health care, etc. etc.). On the other hand, many of the countries with large deficits are nice places to live, so I wouldn’t read too much into the economics at all…

(Hat tip to Karthik, who retweeted one of my tweets, which I misunderstood and started researching….)

Notes from Stephen Wolfram webcast

These are my raw notes from the session with Stephen Wolfram on the pre-launch of the Wolfram Alpha service at the Berkman center. Unfortunately, I was on a really bad Internet connection and only got the sound, and missed the first 20 minutes or so running around trying to find something better.

Notes from Stephen Wolfram on Alpha debut

…discussion of queries:
– nutrition in a slice of cheddar
– height of Mount Everest divided by length of Golden Gate bridge
– what’s the next item in this sequence
– type in a random number, see what it knows about it
– "next total solar eclipse"

What is the technology?
– computes things, it is harder to find answers on the web the more specifically you ask
– instead, we try to compute using all kinds of formulas and models created from science and package it so that we can walk up to a web site and have it provide the answer

– four pieces of technology:
— data curation, trillions of pieces of curated data, free/licensed, feeds, verify and clean this (curate), built industrial data curation line, much of it requires human domain expertise, but you need curated data
— algorithms: methods and models, expressed in Mathematica, there is a finite number of methods and models, but it is a large number…. now 5-6 million lines of math code
— linguistic analysis to understand input, no manual or documentation, have to interpret natural language. This is a little bit different from trad NL processing. working with more limited set of symbols and words. Many new methods, has turned out that ambiguity is not such a big problem once we have mapped it onto a symbolic representation
— ability to automate presentation of things. What do you show people so they can cognitively grasp what you are, requires computational esthetics, domain knowledge.

Will run on 10k CPUs, using Grid Mathematica.
90% of the shelves in a typical reference library we have a decent start on
provide something authoritative and then give references to something upstream that is
know about ranges of values for things, can deal with that
try to give footnotes as best we can

Q: how do you deal with keeping data current
– many people have data and want to make it available
– mechanism to contribute data and mechanism for us to audit it

first instance is for humans to interact with it
there will be a variance of APIs,
intention to have a personalizable version of Alpha
metadata standards: when we open up our data repository mechanism, wn we use that can make data available

Questions from audience:

Differences of opinion in science?
– we try to give a footnote
– Most people are not exposed to science and engineering, you can do this without being a scientist

How much will you charge for this?
– website will be free
– corporate sponsors will be there as well, in sidebars
– we will know what kind of questions people ask, how can we ingest vendor information and make it available, need a wall of auditing
– professional version, subscription service

Can you combine databases, for instance to compute total mass of people in England?
– probably not automatically…
– can derive it
– "mass of people in England"
– we are working on the splat page, what happens when it doesn’t know, tries to break the query down into manageable parts
300th largest country in Europe? – answers "no known countries"

Data sources? Population of Internet users. how do you choose?
– identifying good sources is a key problem
– we try do it well, use experts, compare
– US government typically does a really good job
– we provide source information
– have personally been on the phone with many experts, is the data knowable?
– "based on available mortality data" or something

Technology focus in the future, aside from data curation?
– all of them need to be pushed forward
– more, better, faster of what we have, deeper into the data
– being able to deal with longer and more complicated linguistics
– being able to take pseudocode
– being able to take raw data or image input
– it takes me 5-10 years to understand what the next step is in a project…

How do you see this in contrast with semantic web?
– if the semantic web had been there, this would be much easier
– most of our data is not from the web, but from databases
– within Wolfram Alpha we have a symbolic ontology, didn’t create this as top down, mostly bottom-up from domains, merged them together when we realized similarities
– would like to do some semantic web things, expose our ontological mechanisms

At what point can we look at the formal specs for these ontologies?
– good news: All in symbolic mathematical code
– injecting new knowledge is complicated – nl is surprisingly messy, such as new terms coming in, for instance putting in people and there is this guy called "50 cent"
– exposure of ontology will happen
– the more words you need to describe the question, the harder it is
– there are holes in the data, hope that people will be motivated to fill them in

Social network? Communities?
– interesting, don’t know yet

How about more popular knowledge?
– who is the tallest of Britney Spears and 50 cent
– popular knowledge is more shallowly computable than scientific information
– linguistic horrors, book names and such, much of it clashes
– will need some popularity index, use Wikipedia a lot, can determine whether a person is important or not

The meaning of life? 42….

Integration with CYC?
– CYC is most advanced common sense reasoning system
– CYC takes what they reason about things and make it computing strengths
– human reasoning not that good when it comes to physics, more like Newton and using math

Will you provide the code?
– in Mathematica, code tends to be succinct enough that you can read it
– state of the art of synthesizing human-readable theorems is not that good yet
– humans are less efficient than automated and quantitative qa methods
– in many cases you can just ask it for the formula
– our pride lies in the integration, not in the models, for they come from the world
– "show formula"

Will this be integrated into Mathematica?
– future version will have a special mode, linguistic analysis, pop it to the server, can use the computation

How much more work on the natural language side?
– we don’t know
– pretty good at removing linguistic fluff, have to be careful
– when you look at people interacting with the system, but pretty soon they get lazy, only type in the things they need to know
– word order irrelevant, queries get pared down, we see deep structure of language
– but we don’t know how much further we need to go

How does this change the landscape of public access to knowledge?
– proprietary databases: challenge is make the right kind of deal
– we have been pretty successful
– we can convince them to make it casually available, but we would have to be careful that the whole thing can’t be lifted out
– we have yet to learn all the issues here

– have been pleasantly surprised by the extent to which people have given access
– there is a lot of genuinely good public data out there

This is a proprietary system – how do you feel about a wiki solution outcompeting you?
– that would be great, but
– making this thing is not easy, many parts, not just shovel in a lot of data
– Wikipedia is fantastic, but it has gone in particular directions. If you are looking for systematic data, properties of chemicals, for instance, over the course of the next two years, they get modified and there is not consistency left
– the most useful thing about Wikipedia is the folk knowledge you get there, what are things called, what is popular
– have thought about how to franchise out, it is not that easy
– by the way, it is free anyway…
– will we be inundated by new data? Encouraged by good automated curation pipelines. I like to believe that an ecosystem will develop, we can scale up.
– if you want this to work well, you can’t have 10K people feeding things in, you need central leadership

Interesting queries?
– "map of the cat" (this is what I call artificial stupidity)
– does not know anatomy yet
– how realtime is stock data? One minute delayed, some limitations
– there will be many novelty queries, but after that dies down, we are left with people who will want to use this every day

How will you feel if Google presents your results as part of their results?
– there are synergies
– we are generating things on the fly, this is not exposable to search engines
– one way to do it could be to prescan the search stream and see if wolfram alpha can have a chance to answer this

Role for academia?
– academia no longer accumulates data, useful for the world, but not for the university
– it is a shame that this has been seen as less academically respectable
– when chemistry was young, people went out and looked at every possible molecule
– this is much too computationally complicated for the typical libraries
– historical antecedents may be Leibniz’ mechanical and computational calculators, he had the idea, but 300 years too early

When do we go live?
… a few weeks
– maybe a webcast if we dare…

What if you could remember everything?

I was delighted when I found this video, where James May (the cerebral third of Top Gear) talks to professor Alan Smeaton of Dublin City University about lifelogging – the recording of everything that happens to a person over a period of time, coupled with the construction of tools for making sense of the data.

In this example, James May wears a Sensecam for three days. The camera records everything he does (well, not everything, I assume – if you want privacy, you can always stick it inside your sweater) by taking a picture every 30 seconds, or when something (temperature, IR rays in front (indicating a person) or GPS location) changes. As is said in the video, some people have been wearing these cameras for years – in fact, one of my pals from the iAD project, Cathal Gurrin, has worn one for at least three years. (He wore it the first time we met, when it snapped a picture of me with my hand outstretched.)

The software demonstrated in the video groups the pictures into events, by comparing the pictures to each other. Of course, many of the pictures can be discarded in the interest of brevity – for instance, for anyone working in an office and driving to work, many of the pictures will be of two hands on a keyboard or a steering wheel, and can be discarded. But the rest remains, and with powerful computers you can spin through your day and see what you did on a certain date.
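A crude sketch of the grouping idea (certainly not the actual Sensecam software; the image signature and the threshold are arbitrary choices of mine): compare each picture to the previous one and start a new event whenever they differ enough.

    # Group a day's photos into "events" by comparing consecutive frames (crude sketch, uses Pillow).
    from PIL import Image

    def signature(path):
        """A tiny grayscale histogram as a cheap image signature."""
        return Image.open(path).convert("L").resize((32, 32)).histogram()

    def group_into_events(paths, threshold=500):    # threshold is arbitrary
        events, previous = [], None
        for path in sorted(paths):
            sig = signature(path)
            if previous is None or sum(abs(a - b) for a, b in zip(sig, previous)) > threshold:
                events.append([])                   # the scene changed enough: new event
            events[-1].append(path)
            previous = sig
        return events

    # import glob
    # for i, event in enumerate(group_into_events(glob.glob("sensecam/*.jpg"))):
    #     print("event", i, "has", len(event), "pictures")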

And here is the thing: This means that you will increasingly have the option of never forgetting anything again. You know how it is – you may have forgotten everything about some event, and then something – a smell, a movement, a particular color – makes you remember, by triggering the part of your brain (or, more precisely, the strands of your intracranial network) where this particular memory is stored. Memory is associative, meaning that if we have a few clues, we can access whatever is in there, even though it had been forgotten.

Now, a set of pictures taken at 30-second intervals, coupled together in an easy-to-use and powerful interface, is a rather powerful aide-mémoire.

Forgetting, however, is done for a purpose – to allow you to concentrate on what you are doing rather than using spare brain cycles on constant upkeep of enormous, but unimportant, memories. For this system to be effective, I assume it would need to be helpful in forgetting as well as remembering – and since everything would be stored, you would not have to expend so much effort remembering things – given a decent interface, you could always look it up again, much as we look things up in a notebook.

Think about that – remembering everything – or, at least being able to recall it at will. Useful – or an unnecessary distraction?

Search and effectiveness in creativity

Effective creativity is often accomplished by copying, by the creation of certain templates that work well, which are then changed according to need and context. Digital technology makes copying trivial, and search technology makes finding usable templates easy. So how do we judge creativity when combinations and associations can be made semi-automatically?

One of my favorite quotes is supposedly by Fyodor Dostoyevsky: "There are only two books written: Someone goes on a journey, or a stranger comes to town." Thinking about it, it is surprisingly easy to divide the books you have read into one or the other. The interesting part, however, lies not in the copying, but in the abstraction: The creation of new categories, archetypes, models and templates from recognizing new dimensions of similarity in previously seemingly unrelated instances of creative work.

Here is a video, fresh from Youtube, demonstrating how Disney reuses character movements, especially in dance scenes:

Of course, anyone who has seen Fantasia recognizes that there are similarities between Disney movies, even schools: the “angular” one represented by 101 Dalmatians, Sleeping Beauty and Mulan, and the more rounded, cutesy one represented by Bambi, The Jungle Book and Robin Hood. (Tom Wolfe referred to this kind of difference (he was talking about car design, but what the heck) as Apollonian versus Dionysian, and apparently borrowed that distinction from Nietzsche. But I digress.)

This video, I suspect, was created by someone recognizing the movements and putting the demonstration together manually. But in the future, search and other information access technologies will allow us to find such dimensions simply by automatically exploring similarities in the digital representations of creative works – computers finding patterns where we do not.

One example (albeit aided by human categorization) of this is the Pandora music service, where the user enters a song or an artist, and Pandora finds music that sounds similar to the song or artist entered. This can produce interesting effects: I found, for instance, that there is a lot of similarity (at least Pandora seems to think so, and I agree, though I didn’t see it myself) between U2 and Pink Floyd. And imagine my surprise when, on my U2 channel (where the seed song was Still haven’t found what I’m looking for), a song by Julio Iglesias popped up. Normally I wouldn’t be caught dead listening to Julio Iglesias, but apparently this one song was sufficiently similar in its musical makeup to make it into the U2 channel. (I don’t remember the name of the song now, but I remember that I liked it.)

In other words, digital technology enables us to discover categorization schemes and visualize them. Categorization is power, because it shapes how we think about and find information. In business terms, new ways to categorize information can mean new business models, or at least disruptions of the old. Pandora has interesting implications for artist brand equity, for instance: If I wanted to find music that sounded like U2 before, my best shot would be to buy a U2 record. Now I can listen to my U2 channel on Pandora and get music from many musicians, most of whom are totally unknown to me, found based on technical comparisons of specific attributes of their music (effectively, a form of factor analysis) rather than the source of the creativity.
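Under the hood, that kind of matching boils down to comparing attribute vectors; here is a toy sketch with made-up attributes and scores (Pandora’s real attribute set is far larger and hand-curated):

    # Toy song similarity over made-up attribute vectors, using cosine similarity.
    from math import sqrt

    songs = {
        "U2 - Still Haven't Found":  {"guitar": 0.8, "tempo": 0.5, "vocals": 0.9, "synth": 0.2},
        "Pink Floyd - Breathe":      {"guitar": 0.7, "tempo": 0.3, "vocals": 0.6, "synth": 0.6},
        "Generic dance track":       {"guitar": 0.1, "tempo": 0.9, "vocals": 0.4, "synth": 0.9},
    }

    def cosine(a, b):
        keys = set(a) | set(b)
        dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
        return dot / (sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values())))

    seed = "U2 - Still Haven't Found"
    for name, attributes in songs.items():
        if name != seed:
            print(name, round(cosine(songs[seed], attributes), 2))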

I am not sure how this will work for artists in general. On one hand, there is the argument that in order to make it in the digital world, you must be more predictable, findable, and (like newspaper headlines) not too ironic. On the other hand, if you create something new – a nugget of creativity, rather than a stream – this single instance will achieve wider distribution than before, especially if it is complex and hard to categorize (or, at least, rich in elements that can be categorized but inconclusive in itself).

Susan Boyle, the instant surprise on the Britain’s Got Talent show, is now past 20 million views on Youtube and is just that – an instant, rich and interesting nugget of information (and considerable enjoyment) which more or less explodes across the world. She’ll do just fine in this world, thank you very much. Search technology or not…